Discussion:
Copy field a source of copy field
tstusr
2017-07-17 22:26:57 UTC
Hi

We want to use a copy field as a source for another copy field or some kind
of post processing of a field.

Here is the problem. We have a field that captures its content from the
extracted text through a copy field, like this:

<copyField source="attr_content*" dest="species"/>

which, at the end of the processing, holds just the captured words.

<field name="species" type="species_type" stored="true" indexed="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

<fieldType name="species_type" class="solr.TextField"
           positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping/mapping-ISOLatin1Accent.txt"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[0-9]+|(\-)(\s*)" replacement=""/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="true"/>
    <filter class="solr.KeepWordFilterFactory" words="species.txt"
            ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

So, what we want to do now is facet on a post-processed version of this
field, by using it as the source for another field.

<copyField source="species" dest="genus"/>

<fieldType name="genus_type" class="solr.TextField"
           positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="genus.txt"
            ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


As far as I understand, we don't get a value in genus because the chain
ends there. However, we also haven't been able to do the processing in two
steps: first capture the words in species, and then do a second capture for
the genus.

As an example, imagine we have in species:

abies durangensis
abies flinckii

After post-processing, we expect to have only
abies

which is a word in the genus file.

I tried to be as clear as possible about the problem, but there may still
be some gaps in the explanation.

Hope you can help me.





--
View this message in context: http://lucene.472066.n3.nabble.com/Copy-field-a-source-of-copy-field-tp4346425.html
Sent from the Solr - User mailing list archive at Nabble.com.
Erick Erickson
2017-07-18 00:26:13 UTC
In a word, "no". Copyfields are not chained together. I'm not at all
sure what you're trying to accomplish with those filter chains anyway.
By shingling _then_ applying the keepwords, you'll have some input like
abies durangensis

become

abies
abies_durangensis
durangensis

Then put that through your keepwords filter which presumably only has
species in it so it would throw out abies and abies_durangensis unless
those are in your keepwords file.... Seems a waste.
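
A note for reading the token lists in this thread: as far as I know, ShingleFilterFactory joins tokens with a single space by default, and "_" is only the default filler token that stands in for a removed position. The fragment below is a sketch that spells both attributes out; the two values are assumed to equal the defaults, so it should behave like the shingle filter line in the original chain.

```xml
<!-- Sketch: the shingle filter from the thread with separator and filler
     written out. tokenSeparator=" " and fillerToken="_" are assumed to
     match the defaults. -->
<filter class="solr.ShingleFilterFactory"
        maxShingleSize="3"
        outputUnigrams="true"
        tokenSeparator=" "
        fillerToken="_"/>
```

Under that assumption, "abies durangensis" yields the tokens "abies", "abies durangensis" and "durangensis", so a line reading "abies durangensis" in species.txt can match the two-word shingle.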

That aside, you can construct one long analysis chain that combines
the genus and species chains and just copy from attr_content* into
both. You wouldn't get the different tokenization, but presumably you
don't particularly need it on the second part of the chain.

Best,
Erick
tstusr
2017-07-18 15:49:49 UTC
Ok, I know shingling will join with "_".

But that is the behaviour we want. Imagine we have these entries (contained
in the species file):

abarema idiopoda
abutilon bakerianum

Those become:
abarema
idiopoda
abutilon
bakerianum
abarema_idiopoda
abutilon_bakerianum

But my genus file may contain only the word abarema, so we end up with
a field with only that word.

So, the requirements here, are to be able to find all species in species
files (step one) and then make a facet with species in file genus, step two.

It seemed reasonable to just chain the fields; I forgot that Solr doesn't
change the field, as Shawn pointed out (thanks for that).

So what we came up with is two fields, the first one for species:

<fieldType name="species_type" class="solr.TextField"
           positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping/mapping-ISOLatin1Accent.txt"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[0-9]+|(\-)(\s*)" replacement=""/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="true"/>
    <filter class="solr.KeepWordFilterFactory" words="species.txt"
            ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

And the second one (genus), which contains genus that has to be for facet
purposes, like this:

<fieldType name="genus_type" class="solr.TextField"
           positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping/mapping-ISOLatin1Accent.txt"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[0-9]+|(\-)(\s*)" replacement=""/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="true"/>
    <filter class="solr.KeepWordFilterFactory" words="species.txt"
            ignoreCase="true"/>
    <filter class="solr.KeepWordFilterFactory" words="genus.txt"
            ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

However, the second keep word filter does not seem to be applied as I
expected. Am I missing something?






Erick Erickson
2017-07-18 15:57:27 UTC
The code is very simple, it looks at a quick glance like it just reads
the words in then the "accept" method just returns true or false based
on whether the text file contains the token.

Are you sure you reloaded your core/collection and pushed the changed
schema to the right place? The admin/analysis page is very helpful
here, your indexing side should have two keep word filters and you
should be able to see each transformation (uncheck the "verbose"
checkbox for more readability).

Best,
Erick
tstusr
2017-07-18 16:45:17 UTC
It seems that it is only using the last keep words file.


Now, for control purposes, this is what I have in the genus file:

<http://lucene.472066.n3.nabble.com/file/n4346601/Screen_Shot_2017-07-18_at_11.png>

And it only picks up the compound term, abutilon aurantiacum.

By testing with
abutilon aurantiacum
abutilon bakerianum

<http://lucene.472066.n3.nabble.com/file/n4346601/Screen_Shot_2017-07-18_at_11.png>

It is not possible to put 2 tokenizers in a field, am I right? I ask because
I think there is a missing split in between the 2 KWFs.



tstusr
2017-07-18 16:49:24 UTC
Well, I have no idea why those images displayed the way they did.

The correct order is:

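Field chain analyzer.
<http://lucene.472066.n3.nabble.com/file/n4346602/1.png>

KWF-genus file
<http://lucene.472066.n3.nabble.com/file/n4346602/3.png>

Test output.
<http://lucene.472066.n3.nabble.com/file/n4346602/2.png>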

Sorry for the mistake



Erick Erickson
2017-07-18 18:43:44 UTC
Multiple keyword files work just fine for me.

One issue you're having is that multi-word keepwords aren't going to
do what you expect. The analysis chains work on _tokens_, and only see
one at a time. Plus (apparently) the input is broken up on whitespace
(the docs aren't entirely clear on this, but can be inferred by "one
per line").

Even if there were multi-word keepwords, it wouldn't work as you
apparently expect. The problem is that the analysis chain first breaks
the input into tokens. So even if a "single" keepword were "a b", and
your input was "a b", by the time it gets to the keepword filter the
context would be lost. So the filter would see just "a" and say "nope
it doesn't match 'a b', throw it out". Ditto with "b".

Since keepwords are apparently split on whitespace though, in the
example above both would be kept. The keepword list is "a" and "b" so
in the above example both match and are kept.

Best,
Erick
tstusr
2017-07-18 21:23:25 UTC
Well, to me it's kind of strange, because it only works with words that
contain blank spaces. Maybe I'm not explaining myself well.

My field is defined as follows:

<fieldType name="genus_type" class="solr.TextField"
           positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping/mapping-ISOLatin1Accent.txt"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[0-9]+|(\-)(\s*)" replacement=""/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="true"/>
    <filter class="solr.KeepWordFilterFactory" words="species.txt"
            ignoreCase="true"/>
    <filter class="solr.KeepWordFilterFactory" words="genus.txt"
            ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

We have 2 KWF files, "species" and then "genus". It seems that only the
genus one is applied.

Since I'm not able to chain copy fields, what choices do I have?



Erick Erickson
2017-07-18 22:30:55 UTC
OK, I take it back. Keepwords handle multiple words just fine. So I
have to rewind.

I'm having no trouble at all applying multiple, successive keepwords
filters, even when there are multiple words on a single line in the
keepwords file. Your use of shingles in here is probably going to
confuse things, so I'd probably recommend taking that out until you
work out what's happening with multiple keepwords filters, then add it
back in.

The images you pasted almost look like you're showing the contents of
elevate.xml, but I suspect that's bogus.

But I think this is an XY problem, you're asking about how to chain
copyFields and we got off into talking about chaining keepwords and
the like. You state:

"So, the requirements here, are to be able to find all species in
species files (step one) and then make a facet with species in file
genus, step two."

Then you say:

"And the second one (genus), which contains genus that has to be for
facet purposes, like this"

How are those reconciled? Do you want facets on the genus+species? Or
just on the genus? Or both? So let's just start over.

What's also missing is why you think you need keepwords in the first
place. Is this a free-text field you're trying to extract
genus/species from? Or do you have the genus/species extracted
already?

Give us two docs, a sample search and what you want as outcome.
Because if you just want to facet on genus then do a copyField simply
to a "genus" field that strips out everything but the genus (however
you implement that, tricky given sub-species perhaps).

Ditto if you want to facet on species. Just a species_facet field that
you put whatever you want into. Or just use KeywordTokenizer for
species if you're guaranteed that you want the whole field.

You can then use copyField to copy as you wish.

Best,
Erick
tstusr
2017-07-19 18:07:07 UTC
Well, our documents consist of PDF files (between 20 and 200 pages).

We capture words from the whole file using the extract handler, which is
why we have these copy fields:

<copyField source="attr_conten*" dest="genus"/>
<copyField source="attr_conten*" dest="species"/>

We capture species across all the PDF content (in the attr_content field).

The captured species are used for ranking purposes, so we need the whole
name; that's why we use shingles. As an example, we capture from the
PDF:

abelmoschus achanioides
abies colimensis
abies concolor

Because that information is important, we provide a facet of those species,
grouped by genus (just the first word of the species). So the facet has to
show:

abelmoschus (1)
abies (2)

However, we need a sort of subquery: first we need the complete species,
and then we facet those results by genus. For example:

the abies something else (this phrase should not be captured)
the abies concolor something else (this phrase should be captured) ->
ends up as just "abies concolor", which is then captured by genus

I realized that every genus is contained in species.

So, is there a way to build a facet from just the first word of a field?
For example, given these values in the field:

abelmoschus achanioides
abies colimensis
abies concolor

Can we just use the first word of each?



Erick Erickson
2017-07-19 23:19:54 UTC
OK, you'll need two fields pretty much for certain. The trick is
getting _only_ genus names in the genus field.

The simplest thing to do would be a straight copyField with a single
keep word filter that contains a list of all the genera. That
presupposes that the genera are disjoint sets from all other words.
You search on your species field and facet on the genus field.

But assuming your genera are not disjoint from all other words, hmmmm.
Do you have a way of unambiguously identifying genus/species pairs in
the text you're processing? If you do we can work with that, but
without that you're talking entity recognition of some sort.

BTW, there's no real need to shingle the species field, just search
for "genus species" as a phrase. Unless those two appear next to each
other in order you won't get a hit.
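
The "straight copyField with a single keep word filter" idea could look roughly like the fragment below. It is a sketch rather than a tested schema: the field attributes (stored, multiValued) are assumptions, while genus, genus_type, genus.txt and attr_content* are the names already used in this thread.

```xml
<!-- Sketch only: genus is filled from the raw extracted text, not from species. -->
<field name="genus" type="genus_type" indexed="true" stored="false"
       multiValued="true"/> <!-- attribute values are assumptions -->

<copyField source="attr_content*" dest="genus"/>

<fieldType name="genus_type" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- keep only tokens listed in genus.txt, one genus per line -->
    <filter class="solr.KeepWordFilterFactory" words="genus.txt"
            ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A query against the species field could then add facet=true&facet.field=genus. The caveat from the message still applies: any word in the text that happens to match a genus name is kept, whether or not a species name follows it.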

Best,
Erick
tstusr
2017-07-20 16:55:09 UTC
Well, correct me if I'm wrong.

Your suggestion is to use the species field as the source of the genus
field. We tried this:

<copyField source="attr_conten*" dest="species"/>
<copyField source="species" dest="genus"/>

where species works as described and genus just uses a KWF, like this:

<fieldType name="genus_type" class="solr.TextField"
           positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="genus.txt"
            ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

But now the problem is different.

When we test the behavior in the Analysis section of the Solr admin UI, it
works as expected.

However, when we use it at indexing time (when we post PDF files to the
extractor), the field doesn't even appear. We think it's because the data
comes from another copyField.

Did I misunderstand your suggestion?



Erick Erickson
2017-07-20 19:24:29 UTC
Yep, we're not communicating ;)

Use the original source field for the genus, as:

<copyField source="attr_conten*" dest="species"/>
<copyField source="attr_conten*" dest="genus"/>

The difficulty here is that there might be false hits if the genera
names happen to match words in the input that are not part of a
genus/species pair.
tstusr
2017-07-25 16:15:30 UTC
Heh, I think so too!

There are some serious gaps in how I understood what you explained.

First, you pointed out that there's no real need to use ShingleFilter. I
tried every tokenizer and the result is the same: the species are not
caught. In the simplest scenario I have this:

<fieldType name="genus_type" class="solr.TextField"
           positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class=""/> <!-- put your favorite tokenizer here -->
    <filter class="solr.KeepWordFilterFactory" words="species.txt"
            ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

Testing this on the Analysis tab, it doesn't catch any term that contains a
blank space, like "acacia acicularis". Am I missing something?

Then, by using ShingleFilter, terms with a blank space are caught correctly.

But you said you're having no trouble applying multiple successive keepword
filters. So I used 2 KWF files, as shown here:

<fieldType name="genus_type" class="solr.TextField"
           positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="true"/>
    <filter class="solr.KeepWordFilterFactory" words="species.txt"
            ignoreCase="true"/>
    <filter class="solr.KeepWordFilterFactory" words="genus.txt"
            ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The species file has only one line, which is "hey you".
The genus file also has only one line, which is "hey".

The second KWF catches nothing at all.



Well, I have to say I'm quite confused by this behaviour. Have I forgotten
something?
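
One possible reading of this result, assuming the shingle filter joins tokens with a space: for the input "hey you", the first keep word filter passes only the shingle "hey you" (the unigrams "hey" and "you" are not lines in the species file), and the second filter then drops "hey you" because the genus file lists only "hey". A way around this might be to cut each surviving species token down to its first word before the genus filter runs. The sketch below uses solr.PatternReplaceFilterFactory for that step; the regular expression and the added lowercase filter are my own assumptions, and the chain is untested against this schema.

```xml
<fieldType name="genus_type" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
            outputUnigrams="true"/>
    <!-- step 1: keep only full species names, e.g. "abies concolor" -->
    <filter class="solr.KeepWordFilterFactory" words="species.txt"
            ignoreCase="true"/>
    <!-- step 2 (assumption): reduce "abies concolor" to "abies" -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(\S+)\s.*$" replacement="$1" replace="all"/>
    <!-- step 3: keep the result only if it is a known genus -->
    <filter class="solr.KeepWordFilterFactory" words="genus.txt"
            ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With "hey you" in the species file and "hey" in the genus file, the input "hey you" should leave the single token "hey" in the field, while "hey something" should leave nothing. The field would still be filled by a copyField from the original extracted text, not from species.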





Shawn Heisey
2017-07-18 01:12:14 UTC
Post by tstusr
We want to use a copy field as a source for another copy field or some kind
of post processing of a field.
<snip>
Post by tstusr
As an example imagine we have on species
abies durangensis
abies flinckii
so, after post processing, we expect to have only
abies
which is a word in genus files
Let's say that you have this in your schema, and you index "Test Words"
(note the capital letters) in field a:

<copyField source="a" dest="b"/>

Let's say that the index analysis on field a has the whitespace
tokenizer, a lowercase filter, and a stopword filter with "test" in the
list. This means that the search terms for field a on that document
will only have "words" included.

You might be expecting field b to only receive "words" when it gets
copied from field a ... but this is NOT what happens. Field b receives
the original text sent to field a, which is "Test Words", including both
words and the uppercase letters.

I think that transitive copies *do* work, so that you can copy field a
to b, then field b to c, though I am not 100 percent sure about that.
If that does work, the end field in the chain is still going to receive
"Test Words" like you sent to field a.

Chaining analysis through copyField does not work.
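
The scenario above can be written out as a schema fragment. This is only an illustration: the type names lower_nostop and text_general and the file stopwords_a.txt are made up here, and only the field names a and b and the copyField line come from the message.

```xml
<!-- Hypothetical type for field a: whitespace tokenizer, lowercase, stop filter. -->
<fieldType name="lower_nostop" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords_a.txt (hypothetical) contains the single line: test -->
    <filter class="solr.StopFilterFactory" words="stopwords_a.txt"
            ignoreCase="true"/>
  </analyzer>
</fieldType>

<field name="a" type="lower_nostop" indexed="true" stored="true"/>
<field name="b" type="text_general" indexed="true" stored="true"/>

<!-- b receives the raw input "Test Words", not the analyzed term "words" -->
<copyField source="a" dest="b"/>
```

Whatever analysis field b needs has to be declared on b's own type, because it always starts from the original text.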

Thanks,
Shawn