Discussion:
Making a String field case-insensitive
Zheng Lin Edwin Yeo
2017-11-01 08:50:16 UTC
Hi,

I would like to find out the best way to lower-case a String field
in Solr to make it case-insensitive while preserving the structure of the
string (i.e. it should not be broken into separate tokens at spaces, and no
characters or symbols should be removed).

I found that solr.StrField does not apply a lowercase filter. But if I change
it to solr.TextField and use the StandardTokenizer, the field values get broken up.

Eg:

For this configuration,

<fieldType name="string_lower" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

The string "*SYStem 500 **" gets broken down into this

*system | 500*

The system and 500 are separated into 2 tokens, which is not what we want.
Also, the * is being removed.


We would like to have something like this, which preserves the value as a
single string and only lowercases it:

*system 500 **
Emir Arnautović
2017-11-01 10:08:14 UTC
Hi,
You can use KeywordTokenizerFactory together with LowerCaseFilterFactory.
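
A minimal sketch of such a field type, assuming the name string_lower from the original post is kept. The keyword tokenizer emits the whole field value as a single token, and the lowercase filter is registered as solr.LowerCaseFilterFactory, the same class used in the original configuration. The attribute values here are illustrative, not a tested setup:

```xml
<!-- Sketch only: the whole value stays one token, then gets lowercased. -->
<fieldType name="string_lower" class="solr.TextField" positionIncrementGap="100">
  <!-- An analyzer without a type attribute applies at both index and query time. -->
  <analyzer>
    <!-- Does not split on spaces and does not strip symbols such as *. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Existing documents would need to be reindexed after the field type changes, since the stored index terms were produced by the old analyzer.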

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
Zheng Lin Edwin Yeo
2017-11-02 02:08:26 UTC
Hi Emir,

Thanks for your advice. This works.

Regards,
Edwin