Discussion:
Can I use per field analyzers and dynamic fields?
Paolo Castagna
2010-05-05 13:14:52 UTC
Hi,
I have an existing Lucene application which I want to port to Solr.

A scenario I need to support requires me to use dynamic fields
with Solr, since users can add new fields at runtime.

At the same time, the existing Lucene application is using a
PerFieldAnalyzerWrapper in order to use different analyzers
for different fields.

One possible solution (server side) requires a custom QParser
which would use a PerFieldAnalyzerWrapper, but perhaps
there is a better (client side only) way to do that.

Do you have any suggestion on how I could use per field
analyzers with dynamic fields?

Regards,
Paolo
Erik Hatcher
2010-05-05 13:19:08 UTC
Paolo,

Solr takes care of associating fields with the proper analysis defined
in schema.xml already. This, of course, depends on which query parser
you're using, but both the standard Solr query parser and dismax do
the right thing analysis-wise automatically.

But I think you need to elaborate on what you're doing in your Lucene
application so I can be more specific. A dynamic field specification
in Solr is associated with only a single field type, so you'll want to
use different dynamic field patterns for different types of fields.
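
For illustration, a minimal schema.xml sketch along those lines (the field
type names, analyzer chains and *_suffix patterns below are only examples,
not something prescribed by this thread):

<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
<fieldType name="text_simple" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.LetterTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<dynamicField name="*_text_en"     type="text_en"     indexed="true" stored="true"/>
<dynamicField name="*_text_simple" type="text_simple" indexed="true" stored="true"/>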

Erik
Paolo Castagna
2010-05-05 13:53:27 UTC
Hi Erik,
first of all, thanks for your reply.

The "source" of my problems is the fact that I do not know in advance the
field names. Users are allowed to decide they own field names, they can,
at runtime, add new fields and different Lucene documents might have
different field names.

So, in addition to some custom and known field names, I have in my
schema.xml file a dynamicField:

<dynamicField name="*" type="text" indexed="true" stored="true" multiValued="true" />

The corresponding fieldType is:

<fieldType name="text" class="solr.TextField">
  <analyzer type="index">
    ...
  </analyzer>
  <analyzer type="query">
    ...
  </analyzer>
</fieldType>

This allows me to specify a fixed (i.e. it cannot change at runtime) and
"common" (i.e. it is the same for every field matched by the dynamicField
with name="*") set of analyzers.

At the same time, in my Lucene application, users are allowed to configure
different analyzers per field at runtime. With Lucene I achieve this using
a PerFieldAnalyzerWrapper at indexing time (i.e. IndexWriter and
IndexModifier allow me to pass an Analyzer in their constructors) and at
query time (i.e. QueryParser allows me to pass an Analyzer in its
constructor).

Dynamic field patterns allow me to create "groups" of different types of
fields, but they expose users to the field name patterns themselves and
remove their freedom to choose field names as they wish.

Perhaps, another way to express my problem is: could I use a
PerFieldAnalyzerWrapper in the above <fieldType> section?
If I do that, how can I configure it at runtime?

Thanks again,
Paolo
Chris Hostetter
2010-05-07 20:42:55 UTC
:
: The "source" of my problems is the fact that I do not know in advance the
: field names. Users are allowed to decide they own field names, they can,
: at runtime, add new fields and different Lucene documents might have
: different field names.

I would suggest you abstract away the field names your users pick and the
underlying field names you use when dealing with Solr -- so create the list
of fieldTypes you want to support (with all of the individual analyzer
configurations that are valid) and then create a dynamicField
corresponding to each one.

Then, if your user tells you they want an "author" field associated with
the type "text_en", you can map that in your application to
"author_text_en" at both indexing and query time.

This will also let you map the same "logical field names" (from your
user's perspective) to different "internal field names" (from Solr's
perspective) based on usage -- searching the "author" field might be
against "author_text_en" but sorting on "author" might use
"author_string".

(Some notes were drafted up a while back on making this kind of field name
aliasing a feature of Solr, but nothing ever came of it...
http://wiki.apache.org/solr/FieldAliasesAndGlobsInParams
)

-Hoss
Paolo Castagna
2010-05-09 07:20:17 UTC
Hi,
thank you for your reply.

What you suggested is a good idea and I am probably going to follow it.

However, I'd like to hear a comment on the approach of doing the parsing
using Lucene and then constructing a SolrQuery from a Lucene Query:

QueryParser parser = new QueryParser("", analyzer);
Query lucene_query = parser.parse(
    "title:dog title:The author:me author:the the cat is on the table");
...
SolrQuery solr_query = new SolrQuery();
solr_query.setQuery(lucene_query.toString());

What are the drawbacks of this approach?

Similarly, at indexing time:

// pre-analyze the value client side and join the resulting tokens with spaces
StringBuffer solr_value = new StringBuffer();
TokenStream ts = analyzer.tokenStream("title", new StringReader(value));
Token token;
while ((token = ts.next()) != null) {
    solr_value.append(token.termText()).append(" ");
}
SolrInputDocument solr_document = new SolrInputDocument();
solr_document.addField("title", solr_value.toString());
...

What are the drawbacks of this approach?

Paolo
Chris Hostetter
2010-05-12 20:00:02 UTC
: However, I'd like to hear a comment on the approach of doing the parsing
: using Lucene and then constructing a SolrQuery from a Lucene Query:

I believe you are asking about doing this in the client code: using the
Lucene QueryParser to parse a string with an analyzer, then toString()'ing
the result and sending it across the wire to Solr?

I would strongly advise against it.

Query.toString() is intended purely as a debugging tool, not as a
serialization mechanism. It's very possible for the toString() value of
a query to not be useful in attempting to recreate the query --
particularly if the analyzer being used by Solr for the "re-parse" doesn't
know to expect terms that have already been stemmed, or modified in the
various ways the client may have done so (and if you have to go to all
that work to make Solr know about what you've pre-analyzed, why not just
let Solr do it for you?)
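
A concrete illustration of the round-trip problem (the field name and token
value are invented; Term, TermQuery and Query are the usual org.apache.lucene
classes): a single term that came out of analysis containing a space does not
survive toString() followed by re-parsing.

Term t = new Term("city", "rome italy");   // one token produced by some analyzer
Query q = new TermQuery(t);
System.out.println(q.toString());          // prints: city:rome italy
// Re-parsing "city:rome italy" with a QueryParser yields two clauses
// (city:rome and <defaultField>:italy) -- a different query.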

: Similarly, at indexing time:
...
: What are the drawbacks of this approach?

Hmmm... well, besides the drawback of doing all the hard work Solr will do
for you, I suppose that as long as you are extremely careful to manage
both the indexing side and the query side externally from Solr, then there
is nothing wrong with this approach -- you would essentially just have a
single field type in your schema.xml that would use a whitespace tokenizer
-- but again, this would make you lose out on a lot of Solr's features
(notably: the stored values in your index would be the post-analysis
tokens, you would be forced to trust the clients 100% to send you clean
data at index and query time instead of being able to configure it
centrally, etc...)
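
For reference, a minimal sketch of what that single "whitespace only" field
type could look like in schema.xml (the type name is just an example):

<fieldType name="text_ws_only" class="solr.TextField">
  <analyzer>
    <!-- values arrive pre-analyzed from the client; only split on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<dynamicField name="*" type="text_ws_only" indexed="true" stored="true" multiValued="true" />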

In short: I don't see any advantages, but I see a lot of room for error.


-Hoss
Paolo Castagna
2010-05-13 06:49:13 UTC
Post by Chris Hostetter
: However, I'd like to hear a comment on the approach of doing the parsing
I believe you are asking about doing this in the client code: using the
Lucene QueryParser to parse a string with an analyzer, then toString()'ing
the result and sending it across the wire to Solr?
Yes.
Post by Chris Hostetter
I would strongly advise against it.
Thank you.
Post by Chris Hostetter
Query.toString() is intended purely as a debugging tool, not as a
serialization mechanism. It's very possible for the toString() value of
a query to not be useful in attempting to recreate the query --
particularly if the analyzer being used by Solr for the "re-parse" doesn't
know to expect terms that have already been stemmed, or modified in the
various ways the client may have done so (and if you have to go to all
that work to make Solr know about what you've pre-analyzed, why not just
let Solr do it for you?)
Is there a (better) way to construct a Solr SolrQuery object from a
Lucene Query object?
Post by Chris Hostetter
...
: What are the drawbacks of this approach?
Hmmm... well, besides the drawback of doing all the hard work Solr will do
for you, I suppose that as long as you are extremely careful to manage
both the indexing side and the query side externally from Solr, then there
is nothing wrong with this approach -- you would essentially just have a
single field type in your schema.xml that would use a whitespace tokenizer
-- but again, this would make you lose out on a lot of Solr's features
(notably: the stored values in your index would be the post-analysis
tokens, you would be forced to trust the clients 100% to send you clean
data at index and query time instead of being able to configure it
centrally, etc...)
The rationale for wanting to do all the analysis (both at query time and
at indexing time) client side is that I have an existing application which
uses Lucene, already does exactly that, and has some "unusual" requirements
(i.e. almost all fields are dynamicFields with custom/configurable
analyzers per field).

I completely agree with everything you said and with the "dangers" of
doing the analysis client side and then letting Solr re-analyze it
server side. However, as you suggested, a simple whitespace tokenizer on
the Solr side should be relatively safe.

Definitely, your previous suggestion of using a dynamicField for each
of the possible analyzer configurations and transparently mapping field
names with prefixes/suffixes to select the right dynamicField "type"
is a better option.
Post by Chris Hostetter
In short: I don't see any advantages, but I see a lot of room for error.
Yep. Got it.

Paolo
Paolo Castagna
2010-05-05 14:58:45 UTC
Post by Erik Hatcher
But I think you need to elaborate on what you're doing in your Lucene
application so I can be more specific.
Hi Erik,
perhaps this is another way to explain, and maybe solve, my issue...

At query time (everything here is just an illustrative example):

PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer());
analyzer.addAnalyzer("title", new SimpleAnalyzer());
analyzer.addAnalyzer("author", new StandardAnalyzer());
...

// Lucene is doing the analysis client side...
QueryParser parser = new QueryParser("", analyzer);
Query lucene_query = parser.parse(
    "title:dog title:The author:me author:the the cat is on the table");
...
// Solr query is built from the query string analyzed by Lucene
SolrQuery solr_query = new SolrQuery();
solr_query.setQuery(lucene_query.toString());

This way, I don't need to do the per-field analysis for dynamic
fields in Solr (on the server side).

Similarly, but a little more involved, at indexing time:

String value = "The CAT is on the table";

Instead of (i.e. the legacy/existing Lucene application):

IndexWriter writer = new IndexWriter(directory, analyzer);
Document lucene_document = new Document();
Field field = new Field("title", value, Field.Store.YES,
                        Field.Index.TOKENIZED);
lucene_document.add(field);
writer.addDocument(lucene_document);

I will do something like:

StringBuffer solr_value = new StringBuffer();
TokenStream ts = analyzer.tokenStream("title", new StringReader(value));
Token token;
while ((token = ts.next()) != null) {
    solr_value.append(token.termText()).append(" ");
}
SolrInputDocument solr_document = new SolrInputDocument();
solr_document.addField("title", solr_value.toString());
...


What do you think?

Thanks again,
Paolo