Discussion:
solr wildcard queries and analyzers
Kári Hreinsson
2011-01-11 11:58:09 UTC
Permalink
Hi,

I am having a problem with the fact that no text analysis is performed on wildcard queries. I have the following field type (a bit simplified):
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" />
</analyzer>
</fieldType>

My problem has to do with Icelandic characters. When I index a document with a text field including the word "sjálfsögðu", it gets indexed as "sjalfsogdu" (because of the ASCIIFoldingFilterFactory, which replaces the Icelandic characters with their ASCII equivalents). Then, when I search (without a wildcard) for "sjálfsögðu" or "sjalfsogdu", I get that document as a result. This is convenient since it enables people to search without using accented characters and still get the results they want (e.g. if they are working on computers with English keyboards).

However, this all falls apart with wildcard searches. The search string isn't passed through the filters, so if I search for "sjálf*" I don't get any results, because the index doesn't contain the original words (I do get results if I search for "sjalf*"). I know people have been having a similar problem with the case sensitivity of wildcard queries, and most often the solution seems to be to lowercase the string before passing it on to Solr, which is not exactly an optimal solution (yet a simple one in that case). The Icelandic characters complicate things a bit, and applying the same solution (doing the lowercasing and character mapping) in my application seems like unnecessary duplication of code that is already part of Solr, not to mention the added complexity in my application and possible maintenance down the road.

Is there any way around this? How are people solving this? Is there a way to apply the filters to wildcard queries? I guess removing the ASCIIFoldingFilterFactory is the simplest "solution" but this "normalization" (of the text done by the filter) is often very useful.

I hope I'm not overlooking some obvious explanation. :/

Thanks in advance,
Kári Hreinsson
Matti Oinas
2011-01-11 12:19:35 UTC
Permalink
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers

On wildcard and fuzzy searches, no text analysis is performed on the
search word.
Matti Oinas
2011-01-11 12:25:44 UTC
Permalink
Sorry, the message was not meant to be sent here. We are struggling
with the same problem here.
Matti Oinas
2011-01-11 12:47:52 UTC
Permalink
This might be the solution.

http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html
Kári Hreinsson
2011-01-12 12:46:15 UTC
Permalink
Have you made any progress? Since the AnalyzingQueryParser doesn't inherit from QParserPlugin, Solr won't use it, but I guess we could implement a similar parser that does inherit from QParserPlugin?

Switching parsers seems to be what is needed. Has really no one solved this before?
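
For context, a custom parser would normally be plugged into Solr through a `queryParser` entry in solrconfig.xml. The fragment below is only a sketch of that registration: the parser name and the class are hypothetical placeholders for a QParserPlugin subclass that would still have to be written (for example, one wrapping logic similar to Lucene's AnalyzingQueryParser).

```xml
<!-- solrconfig.xml fragment (sketch).
     "analyzingwildcard" and the class name are hypothetical; the class stands
     for a QParserPlugin subclass that does not exist yet and would need to
     analyze wildcard terms before building the query. -->
<queryParser name="analyzingwildcard"
             class="com.example.solr.AnalyzingWildcardQParserPlugin" />
```

Once registered, such a parser could be selected per request with `defType=analyzingwildcard` or with local params, e.g. `q={!analyzingwildcard}sjálf*`.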

- Kári

Jayendra Patil
2011-01-12 22:54:54 UTC
Permalink
Had the same issues with international characters and wildcard searches.

One workaround we implemented was to index the field both with and without the
ASCIIFoldingFilterFactory.
You would have an original field and one with the English equivalents, both to be used
during searching.

Wildcard searches with either the English equivalents or the international terms would then match
one of those fields.
Also, lowercase the search terms if you are using the LowerCaseFilterFactory during
indexing.
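
A minimal schema.xml sketch of this two-field workaround might look as follows. The type names (`text_exact`, `text_folded`) and field names (`body`, `body_folded`) are assumptions made up for illustration; only the tokenizer and filter classes come from the field type posted earlier in the thread.

```xml
<!-- schema.xml fragment (sketch); all type and field names are hypothetical -->

<!-- Lowercased, but Icelandic characters are kept as-is -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

<!-- Lowercased and folded to ASCII, as in the original field type -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.TrimFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory" />
  </analyzer>
</fieldType>

<field name="body" type="text_exact" indexed="true" stored="true" />
<field name="body_folded" type="text_folded" indexed="true" stored="false" />

<!-- Feed the same source text into the folded copy at index time -->
<copyField source="body" dest="body_folded" />
```

A wildcard search would then query both fields with the same lowercased term, e.g. `body:sjálf* OR body_folded:sjálf*`: the accented form can match `body` and the plain form `sjalf*` can match `body_folded`. A prefix that mixes accented and unaccented characters would presumably match neither field.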

Regards,
Jayendra
Matti Oinas
2011-01-13 07:11:39 UTC
Permalink
I'm a little busy right now, but I'm going to try to find a suitable
parser; if none is found, I think the only solution is to write
a new one.