Discussion:
Customizing Solr to handle Leading Wildcard queries
Jana, Kumar Raja
2009-01-15 13:23:08 UTC
Hi,

Not being able to perform Leading Wildcard queries is a major handicap.
I want to be able to perform searches like *.pdf to fetch all pdf
documents from Solr.

I have found quite a few threads on this topic, and one of the suggested
solutions was to enable this feature by adding:

parser.setAllowLeadingWildcard(true); at Line 92 in QueryParsing.java

Unfortunately, this did not work. Maybe I was using a different
parser; I don't know how to configure the parsers to make this work.

Can someone please tell me the steps to customize Solr to enable this
feature?

Thanks,

Kumar
Erik Hatcher
2009-01-15 14:28:51 UTC
Post by Jana, Kumar Raja
Not being able to perform Leading Wildcard queries is a major
handicap.
I want to be able to perform searches like *.pdf to fetch all pdf
documents from Solr.
For this particular case, I recommend indexing the document type as a
separate field. Something like type:pdf (or use a MIME type string).
Then you can do a very direct and fast query to search or facet by
document types.

Erik
Jana, Kumar Raja
2009-01-15 14:49:24 UTC
Hi Erik,

Thanks for the quick reply.
I want to enable leading wildcard query searches in general. The case
mentioned in the earlier mail is just one of the many instances where I
use this feature.

-Kumar
Glen Newton
2009-01-15 14:58:53 UTC
If we are talking about short single-term fields (like a file field that
has a single term like "foo.pdf"), then do what the DBMS b-tree indexes
did a long time ago: for every field where you want leading wildcard
support, also insert the value in reverse order. So file:"foo.pdf" is
also stored and indexed as reverseField:"fdp.oof". Now when someone
searches for *oo.pdf, you reverse the query and run it against
reverseField as: fdp.oo*

I believe some of the DBMSs kept a separate reverse b-tree to handle
leading wildcard queries.

Obviously this technique is harder to put in place for arbitrary
sections of text that have to be parsed, but a special parser could be
written to handle this as well.
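
For what it's worth, here is a minimal sketch of the reversed-field idea in plain Java, with no Solr or Lucene involved. The class name ReversedTermIndex and its methods are made up for illustration, and a TreeSet stands in for the sorted term dictionary (or the reverse b-tree). It only shows why a leading wildcard becomes a cheap prefix range scan once the terms are stored reversed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Sketch only: answers "*suffix" queries with a prefix scan over reversed terms. */
final class ReversedTermIndex {
    // Sorted set of reversed terms; a stand-in for reverseField or a reverse b-tree.
    private final TreeSet<String> reversedTerms = new TreeSet<>();

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** Index a single-term value such as "foo.pdf" in reversed form ("fdp.oof"). */
    void add(String term) {
        reversedTerms.add(reverse(term));
    }

    /** Find terms ending with the given suffix, i.e. the query "*" + suffix. */
    List<String> endingWith(String suffix) {
        String prefix = reverse(suffix); // "oo.pdf" becomes the prefix "fdp.oo"
        List<String> hits = new ArrayList<>();
        // Range scan: start at the prefix, stop at the first term that no longer matches.
        for (String candidate : reversedTerms.tailSet(prefix, true)) {
            if (!candidate.startsWith(prefix)) {
                break;
            }
            hits.add(reverse(candidate)); // un-reverse before returning
        }
        return hits;
    }

    public static void main(String[] args) {
        ReversedTermIndex index = new ReversedTermIndex();
        index.add("foo.pdf");
        index.add("zoo.pdf");
        index.add("foo.doc");
        System.out.println(index.endingWith("oo.pdf"));
    }
}
```

With foo.pdf, zoo.pdf and foo.doc indexed, endingWith("oo.pdf") returns foo.pdf and zoo.pdf, and the scan never touches foo.doc because its reversed form sorts before the prefix.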

-glen
http://zzzoot.blogspot.com/
Otis Gospodnetic
2009-01-15 16:48:10 UTC
Hi ramuK,

I believe you can turn that "on" via the Lucene QueryParser, but of course such searches will be slo(oo)w. You can also index reversed tokens (e.g. *kumar --> ramuk*), or you could index n-grams with begin/end delimiter characters (e.g. kumar -> ^ k u m a r $, and the query *kumar -> the phrase "k u m a r $").
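
A rough sketch of the delimiter idea, in plain Java rather than an actual Solr analyzer. The class DelimitedCharGrams and its method names are hypothetical, and Collections.indexOfSubList stands in for a phrase query over single-character tokens. It assumes the wildcard only appears at the start and/or end of the pattern, and that ^ and $ never occur in the indexed terms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch only: single-character tokens with begin/end markers. */
final class DelimitedCharGrams {
    /** Index-time tokens: "kumar" becomes [^, k, u, m, a, r, $]. */
    static List<String> indexTokens(String term) {
        List<String> tokens = new ArrayList<>();
        tokens.add("^");
        for (char c : term.toCharArray()) {
            tokens.add(String.valueOf(c));
        }
        tokens.add("$");
        return tokens;
    }

    /** Query-time phrase: "*kumar" becomes [k, u, m, a, r, $]; "*uma*" becomes [u, m, a]. */
    static List<String> queryPhrase(String pattern) {
        boolean leading = pattern.startsWith("*");
        boolean trailing = pattern.length() > 1 && pattern.endsWith("*");
        String core = pattern.substring(leading ? 1 : 0,
                                        pattern.length() - (trailing ? 1 : 0));
        List<String> phrase = indexTokens(core);
        if (leading) {
            phrase.remove(0);                 // drop ^ so the match may start anywhere
        }
        if (trailing) {
            phrase.remove(phrase.size() - 1); // drop $ so the match may end anywhere
        }
        return phrase;
    }

    /** Phrase match: the query tokens must appear contiguously in the indexed tokens. */
    static boolean matches(String term, String pattern) {
        return Collections.indexOfSubList(indexTokens(term), queryPhrase(pattern)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(matches("kumar", "*kumar"));
        System.out.println(matches("kumar", "*uma*"));
        System.out.println(matches("kumara", "*kumar"));
    }
}
```

The end marker is what keeps *kumar from matching "kumara", and the same tokens also cover the *uma* case without any leading-wildcard term scan.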


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Jana, Kumar Raja
2009-01-28 07:19:35 UTC
Hi,

Thanks Otis, Newton and everyone else for the help on this issue.

Most of the data I index are documents like PDFs, Word docs, OpenOffice
documents, etc. I store the content of the document in a field called
content and the remaining metadata of the document (name, id,
created by, modified by, created on, etc.) in a copy field called
metadata. I am not particularly interested in enabling leading wildcard
characters in the content (although such a possibility would be a
bonus). For the metadata, I've tried implementing the suggestion to
store reversed strings as well as the original strings.
All leading wildcard queries like "*abc" are searched as "cba*" against
the reversed metadata field. So far so good. Thank you :)

But now I have run into the scenario where the query string is *abc* :(
and the whole thing came crashing down again. I cannot ignore such
queries. I would rather take the risk of Solr OOMing by enabling the
leading wildcard query searches.

Can someone please tell me the steps to turn on this feature in the
Lucene QueryParser? I am sure it would be helpful to many to document
such a procedure on the Wiki or somewhere else. (I am definitely going
to do that once I fix this. It seems to be too much trouble.)
Also, which query parser does Solr use by default?

Thanks,
Kumar
Neal Richter
2009-01-28 08:04:56 UTC
Leading wildcard search is called grep ;-)

Ditto on the suggestion to index reversed words.

Can you create a second field in Solr that contains /only/ the words
from the fields you care to reverse? Once you do that, you could
pre-process the query, look for leading wildcards, and run those terms
(after reversing them) only against your special
reverse-meta-data field.
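
Something like the following could do that pre-processing step. It is only a sketch for a single query term: the class WildcardRouter is made up, "metadata" is the field name from Kumar's mail, and "metadata_rev" is an assumed name for the reversed field. It does no escaping and does not parse full query syntax.

```java
/** Sketch only: send leading-wildcard terms to a reversed field. */
final class WildcardRouter {
    static final String FORWARD_FIELD = "metadata";      // field name from the thread
    static final String REVERSED_FIELD = "metadata_rev"; // hypothetical reversed field

    /** Rewrite one term into a fielded query string. */
    static String route(String term) {
        boolean leading = term.startsWith("*");
        boolean trailing = term.endsWith("*");
        if (leading && !trailing && term.length() > 1) {
            // "*abc": reverse the literal part and make it a prefix query.
            String reversed = new StringBuilder(term.substring(1)).reverse().toString();
            return REVERSED_FIELD + ":" + reversed + "*";
        }
        // No leading wildcard, or wildcards on both ends: left on the forward field.
        // The "*abc*" case still needs another strategy (n-grams, suffixes, ...).
        return FORWARD_FIELD + ":" + term;
    }

    public static void main(String[] args) {
        System.out.println(route("*abc"));
        System.out.println(route("abc*"));
        System.out.println(route("*abc*"));
    }
}
```

So *abc is rewritten to metadata_rev:cba*, abc* is left alone as metadata:abc*, and *abc* falls through unchanged.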

The *foo* case really is grep! Nearly by definition, you have to
linearly scan the index unless some magic is added.

One option is to extend Otis' n-gram suggestion and turn a word like
"baffoonery" into the following suffixes (stored in the "meta field"):
baffoonery
affoonery
ffoonery
foonery
oonery
onery
nery
ery
ry

Now you can take a query like "*foo*" and drop the leading wildcard
and it will hit on 'foonery'.

Make sense? You are trading index size for not doing a linear scan
like grep. It's not advisable to do this for every word in your
document set ;-)
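
Roughly, in plain Java (the class SuffixExpansion and its methods are just illustrative names, and the minimum suffix length of 2 is an assumption taken from the list above stopping at "ry"):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: index every suffix so "*foo*" becomes a prefix lookup. */
final class SuffixExpansion {
    /** All suffixes with at least minLength characters, longest first. */
    static List<String> suffixes(String word, int minLength) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start + minLength <= word.length(); start++) {
            out.add(word.substring(start));
        }
        return out;
    }

    /** "*infix*" matches if some indexed suffix starts with the infix. */
    static boolean containsMatch(String word, String infix) {
        for (String suffix : suffixes(word, 2)) {
            if (suffix.startsWith(infix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(suffixes("baffoonery", 2));
        System.out.println(containsMatch("baffoonery", "foo"));
    }
}
```

A word of n characters produces n - 1 suffixes here, which is where the index growth comes from.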

- Neal Richter
Neal Richter
2009-01-28 08:10:29 UTC
Oh wait... it looks like Otis' suggestion of "index n-grams with
begin/end delim characters", relying on phrase searching to link the
chains of characters, is logically a better version of my previous email.

- Neal
Otis Gospodnetic
2009-01-28 21:08:14 UTC
Yeah, I think the begin/end chars are very helpful here. But I like the suggestion of figuring out which words really need to support leading wildcards... although that's usually impossible to predict, since people are free to enter whatever queries they feel like.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch