Discussion:
how to do partial word searches?
Joel Nylund
2009-11-25 00:51:31 UTC
Hi, I saw some older postings on this, but didn't see a resolution.

I have a field called title, and I would like to be able to find partial
word matches within the title.

For example:

http://localhost:8983/solr/select?q=textTitle:%22*sulli*%22

I would expect it to find:
<str name="textTitle">the daily dish | by andrew sullivan</str>

but it doesn't. It does find sully (which is fine with me as a
bonus), but it doesn't seem to match any partial words. Oddly
enough, before I lowercased the title, the wildcard matching seemed to
work a bit better; it just didn't handle case-sensitive queries.

At first I had mixed-case titles, and I read that wildcards don't
work with mixed case, so I created another field called "textTitle" that
is a lowercased version of the title. It is of type text.

Is it possible with Solr to achieve what I am trying to do, and if so, how?
If not, is there anything closer than what I have?

thanks
Joel
Erick Erickson
2009-11-25 02:12:59 UTC
Copying from Erik Hatcher:

See http://issues.apache.org/jira/browse/SOLR-218 - Solr currently
does not have leading wildcard support enabled.

There's a pretty extensive recent exchange on this, see the
thread on the user's list titled
"leading and trailing wildcard query".

Best
Erick
Joel Nylund
2009-11-25 13:18:20 UTC
Hi Erick,

Thanks for the links. I read both of them and I still have no idea
what to do; there was lots of back and forth, but I didn't see any solution.

One person talked about indexing the field in reverse and doing an OR
on it; this might work, I guess.

thanks
Joel
Erick Erickson
2009-11-25 14:03:02 UTC
Confession: I haven't had occasion to use the ngram thingy, but here's the
theory... And note that Solr has n-gram tokenizers available.

Using a 2-gram example for sullivan, the n-gram tokenizer would index these
tokens: su, ul, ll, li, iv, va, an. Then at query time, in your example,
sulli would be broken up into su, ul, ll and li, which, when searched as a
phrase, would in turn match your field.
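The 2-gram theory above can be tried out in a few lines of Python. This is only a toy model of the idea, not Solr's tokenizer: the helper names `ngrams` and `phrase_match` are made up for illustration, and "searched as a phrase" is modelled as "the query grams appear as a contiguous run in the indexed grams".

```python
def ngrams(text, n=2):
    """Return the overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def phrase_match(indexed_grams, query_grams):
    """True if query_grams occurs as a contiguous run inside indexed_grams."""
    m = len(query_grams)
    if m == 0:
        return False
    return any(indexed_grams[i:i + m] == query_grams
               for i in range(len(indexed_grams) - m + 1))


indexed = ngrams("sullivan")   # grams stored at index time
query = ngrams("sulli")        # grams produced from the query term

print(indexed)
print(query)
print(phrase_match(indexed, query))
```

Note that in this model "sully" would not match "sullivan", because its last gram (ly) never appears in the indexed word.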

The expense, of course, is that your index is larger (but surprisingly not as
much as you'd think). But your queries are much faster.

That's the theory anyway, the practice is "left as an exercise for the
reader"<G>

But "the folks" generously provided quite an explication of what wildcards
are all about on the *lucene* user's list; look for a thread titled
"I just don't get wildcards at all" from around 2006. It's nice background
for what the underlying problem is, and I think some of the Solr tokenizers
implement some of this. The state of the art has progressed considerably
since then, but the underlying issues are still there.

Sorry I can't be more help here..
Erick
Robert Muir
2009-11-25 14:21:58 UTC
Hi, if you are using Solr 1.4, I think you might want to try the type text_rev
(look in the example schema.xml).

Unless I am mistaken:

- this will enable leading wildcard support for that field.
- this doesn't do any stemming, which I think might be making your wildcards
  behave weirdly.
- it also enables reverse wildcard support, so some of your substring matches
  will be faster.
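To see why indexing reversed tokens helps with leading wildcards, here is a toy sketch in Python. It is not how Solr implements text_rev; it only assumes the general idea that a query like *van can be answered as a prefix scan for "nav" over reversed terms. The function names and the sorted list standing in for a term dictionary are hypothetical.

```python
import bisect


def build_reversed_index(tokens):
    """Sorted list of reversed tokens; a stand-in for a term dictionary."""
    return sorted(tok[::-1] for tok in set(tokens))


def leading_wildcard(reversed_index, suffix):
    """Answer the query *suffix as a prefix scan over the reversed terms."""
    prefix = suffix[::-1]
    start = bisect.bisect_left(reversed_index, prefix)
    hits = []
    for term in reversed_index[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        hits.append(term[::-1])
    return sorted(hits)


# Whitespace tokenization is an assumption made for this example only.
title_tokens = "the daily dish | by andrew sullivan".split()
index = build_reversed_index(title_tokens)

print(leading_wildcard(index, "van"))
```

A query with wildcards on both ends, such as *sulli*, still needs a scan after the reversal trick, which is one reason the n-gram approach is discussed in the same thread.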
--
Robert Muir
Joel Nylund
2009-12-03 18:21:46 UTC
Just for an update on this, I tried text_rev and it seems to work great.

So in summary, if you want partial word matches within a URL or small
sentence (title), here is what I did, and it seems to work pretty well:

- create an extra field that is all lowercase (I used MySQL LCASE in
the query for DIH)
- make that field use the text_rev type in schema.xml
- make the query be "sulli OR *sulli*" (the *sulli* doesn't seem to
match sulli if it's at the end of the field)

thanks
Joel
Rob Ganly
2010-03-10 10:42:03 UTC
hi all,

i was having the same problem: i needed to be able to search for a substring
anywhere within a word for a specific field. i used the
NGramTokenizerFactory in my index analyzer and it seems to work well
(http://lucene.apache.org/solr/api/org/apache/solr/analysis/NGramTokenizerFactory.html).

i created a new field type based on this definition:
http://coderrr.wordpress.com/category/solr/#ngram_schema_xml

apparently it will increase the size of your index and perhaps indexing
time, but it is working fine at the moment (although i'm currently only using a
testbed of 20,000 records). i will report back if i discover any painful
issues with scaling up!
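For a rough feel of the index growth mentioned above, the number of grams produced for one field value can be counted with a short Python sketch. The gram sizes 2 and 4 are hypothetical values chosen for illustration (NGramTokenizerFactory takes its own minGramSize and maxGramSize settings), and the function name `gram_count` is made up.

```python
def gram_count(length, min_gram, max_gram):
    """Number of character n-grams for an input of the given length,
    summed over every gram size from min_gram to max_gram inclusive."""
    return sum(max(length - n + 1, 0) for n in range(min_gram, max_gram + 1))


title = "the daily dish | by andrew sullivan"

# One whole-value term versus the grams emitted for the same value.
print(len(title), "characters")
print(gram_count(len(title), 2, 4), "grams with sizes 2..4")
print(gram_count(len("sullivan"), 2, 2), "bigrams for 'sullivan'")
```

The count grows roughly linearly with both the text length and the number of gram sizes, so narrowing the size range is the obvious knob if the index gets too large.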

rob ganly