Discussion:
Preventing empty strings in index
Marian Steinbach
2011-12-05 09:01:40 UTC
Hi!

I am surprised to find an empty string as the most frequent index term in
one of my fields. Until now I didn't even know that empty strings would be
indexed.

Here is the schema.xml excerpt for that field:

<fieldType name="text_terms" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern="^[0-9]+$"
            replacement="" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_terms.txt"
            ignoreCase="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_terms.txt" />
  </analyzer>
</fieldType>

<field name="terms" type="text_terms" indexed="true" stored="false"
       multiValued="true"/>


I have the suspicion that PatternReplaceFilterFactory
with pattern="^[0-9]+$" is causing the empty strings. I introduced that
filter to prevent numbers-only strings from being added to the index.

Any hint on how I can get rid of numbers AND empty strings?

Thanks!

Marian
Tomás Fernández Löbbe
2011-12-05 12:28:27 UTC
You could try adding a LengthFilterFactory after the
PatternReplaceFilterFactory to drop the empty tokens:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
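
A minimal sketch of how that could look in the field type from the original post. The placement directly after the PatternReplaceFilterFactory and the values min="1" and max="255" are assumptions for illustration, not settings taken from this thread; the upper bound in particular is arbitrary and should be adjusted to your data.

```xml
<fieldType name="text_terms" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" />
    <!-- Turns digits-only tokens into empty strings -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="^[0-9]+$"
            replacement="" />
    <!-- Keeps only tokens whose length is between min and max,
         so min="1" discards the empty tokens produced above.
         max="255" is an assumed, arbitrary upper bound. -->
    <filter class="solr.LengthFilterFactory" min="1" max="255" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_terms.txt"
            ignoreCase="true" />
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_terms.txt" />
  </analyzer>
</fieldType>
```

Since the analyzer runs at index time, documents that are already indexed would need to be reindexed before the empty term disappears from the field.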

Regards,

Tomás
Marian Steinbach
2011-12-05 16:11:47 UTC
That seems pretty straightforward. Thanks!