Discussion:
Solr Wildcard Search for large amount of text
octopus
2015-06-27 10:27:33 UTC
Hi, I'm looking at Solr's features for wildcard search used for a large
amount of text. I read on the net that solr.EdgeNGramFilterFactory is used
to generate tokens for wildcard searching.

For Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigerian"
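
For reference, an edge-ngram setup like the one described would typically look something like the following sketch (the field type name and gram sizes here are illustrative assumptions, not from this thread). The filter is applied at index time only, so queries match the stored grams directly:

```xml
<!-- Sketch: edge ngrams generated at index time; plain analysis at query time.
     "text_edge_ngram" and the gram sizes are illustrative values. -->
<fieldType name="text_edge_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```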

However, I have a large amount of text that requires wildcard search, and
it's not viable to use EdgeNGramFilterFactory because the amount of
processing would be too great. Do you have any suggestions/advice, please?

Thank you so much for your time!



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Wildcard-Search-for-large-amount-of-text-tp4214392.html
Sent from the Solr - User mailing list archive at Nabble.com.
Upayavira
2015-06-27 13:39:36 UTC
That is one way to implement wildcards, but it isn't the most efficient.

Just index normally, tokenized, and search with an asterisk suffix, e.g.
foo*

Lucene will compile the wildcard term into a finite state automaton,
which makes wildcard handling efficient.
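
Concretely, with an ordinary tokenized text field, a trailing-wildcard query is just a suffixed term. The core and field names below are assumed for illustration, not taken from the thread:

```
# Hypothetical request against a local Solr core; "body_text" is an assumed field name
http://localhost:8983/solr/mycore/select?q=body_text:niger*
```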

Upayavira
Shawn Heisey
2015-06-27 14:06:51 UTC
Both edgengrams and wildcards are ways to do this. There are advantages
and disadvantages to both ways.

To do a wildcard search, Solr (Lucene really) must look up all the
matching terms in the index and substitute them into the query so that
it becomes a large number of simple string matches. If you have a large
number of terms in your index, that can be slow. The expensive work
(expanding the terms) is done for every single query.

The edgengram filter does similar work, but it does it at *index* time,
rather than query time. At query time, you are doing a simple string
match with one term, although the index contains many more terms,
because the very expensive work was done at index time.

It's difficult to know which approach will be more efficient on *your*
index without experimentation, but there is a general rule when it comes
to Solr performance: As much as possible, do the expensive work at index
time.

Thanks,
Shawn
Erick Erickson
2015-06-27 15:41:08 UTC
Try it and see ;).

My experience is that wildcards work fine (although
what "fine" means is up to you to decide) _if_ you restrict
them to requiring at least two leading "real" characters,
and I actually prefer three, i.e.
ab* or abc*. Note that if you need leading
wildcards, use the reverse wildcard filter.
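
A field type using the reverse wildcard filter might be sketched as below. The names and parameter values are illustrative assumptions; the key point is that ReversedWildcardFilterFactory goes in the index-time analyzer only, so leading-wildcard queries (e.g. *tion) can be rewritten against reversed terms:

```xml
<!-- Sketch: reversed terms stored alongside originals at index time,
     so leading wildcards become cheap trailing wildcards on reversed terms.
     "text_rev" and the parameter values are illustrative. -->
<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="2" maxPosQuestion="1"
            minTrailing="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```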

I will vociferously argue that single-letter wildcards are
not useful anyway. I mean every single document in your
corpus will probably match every single-letter wildcard
(a*, b*, whatever), providing no benefit to the user.

And, the need for wildcards can often be reduced or
eliminated if you can use autosuggest or autocomplete.
Of course, if you're trying to satisfy more complex use
cases where the user is composing their own complex
clauses, that may not apply.

FWIW,
Erick
Jack Krupansky
2015-06-27 15:54:38 UTC
What do you want actual user queries to look like? I mean, having to
explicitly write asterisks after every term is a real pain.

Indexing ngrams has the advantage that phrase queries and edismax phrase
boosting work automatically. Phrases don't work with explicit wildcard
queries.

The only real downside to ngrams is that they explode the size of the
index. But memory is supposed to be cheap these days. I mean, compare the
cost of the extra RAM (to keep the full index in memory) to the cost of
users' lost productivity constructing queries, and of having expensive staff
help them figure out why various queries don't work as expected.

How big is your corpus - number of documents and average document size?

-- Jack Krupansky