Discussion:
shingles + stop words
David Hastings
2018-12-07 14:18:12 UTC
Hey there, I have a field type defined as such:
<fieldType name="skw2" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ManagedStopFilterFactory" managed="english"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            outputUnigrams="false" fillerToken="" maxShingleSize="2"/>
  </analyzer>
</fieldType>

But what's happening is that the shingles being returned are often
" nonstopword", with the space being defined as the filler token. I was
hoping that the ManagedStopFilterFactory would have removed the stop words
completely before the tokens reached the shingle factory, so that an
indexed value of "nonstopword1 stopword1 stopword2 nonstopword2" would
have returned "nonstopword1 nonstopword2", but obviously that isn't the
case. Is there a way to force it to behave that way?

Thanks, David
Emir Arnautović
2018-12-10 15:10:02 UTC
Hi David,
As you already observed, the shingle filter concatenates tokens based on their positions, so a removed stopword leaves a position gap that is filled with the filler token: an empty string in your configuration (you can set it to something else with the fillerToken option).
You can do one of the following:
1. If you do not have too many stopwords, you could use PatternReplaceCharFilter to remove them before the text reaches the tokenizer. That way stopwords will not increase positions, and you will get the expected shingles. The downside is that you will lose the managed part of the stopwords and will have to reload cores in order to change them.
2. Customise the stopword filter so that it does not increment positions when it finds a stopword.
3. Customise the shingle filter to add a flag for the desired behaviour.
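
Option 1 could look roughly like the sketch below. This is only an illustration under assumptions: the field type name skw2_charfilter and the five words in the pattern are placeholders, not taken from your schema or from the managed "english" stopword set, so you would have to substitute your own list.

```xml
<fieldType name="skw2_charfilter" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumption: a short, fixed stopword list written as a regex alternation.
         The words are removed from the raw text before tokenization, so they
         never occupy a token position and leave no gap for the shingle filter. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?i)\b(a|an|and|of|the)\b"
                replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            outputUnigrams="false" fillerToken="" maxShingleSize="2"/>
  </analyzer>
</fieldType>
```

With this analyzer there is no stop filter in the chain at all, which is why the list can no longer be managed through the REST API and a core reload is needed after every change to the pattern.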

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/