Discussion:
Is semicolon a character that needs escaping?
Michael Lackhoff
2010-09-02 19:35:08 UTC
According to http://lucene.apache.org/java/2_9_1/queryparsersyntax.html
only these characters need escaping:
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
but with this simple query:
TI:stroke; AND TI:journal
I got the error message:
HTTP ERROR: 400
Unknown sort order: TI:journal

My first guess was that it was a URL encoding issue but everything looks
fine:
http://localhost:8983/solr/select/?q=TI%3Astroke%3B+AND+TI%3Ajournal&version=2.2&start=0&rows=10&indent=on
as you can see, the semicolon is encoded as %3B
There is no problem when the query ends with the semicolon:
TI:stroke;
gives no error.
The first query also works if I escape the semicolon:
TI:stroke\; AND TI:journal
From this I conclude that there is a bug either in the docs or in the
query parser or I missed something. What is wrong here?
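
The URL-encoding point can be checked independently of Solr. The following is a minimal sketch, not part of the original post, that decodes the request URL quoted above with Python's standard library and shows what the server receives as the q parameter. The variable names are illustrative only.

```python
from urllib.parse import urlsplit, parse_qs

# The request URL exactly as quoted in the post above.
url = ("http://localhost:8983/solr/select/"
       "?q=TI%3Astroke%3B+AND+TI%3Ajournal&version=2.2&start=0&rows=10&indent=on")

# Split off the query string and decode it into a dict of lists.
params = parse_qs(urlsplit(url).query)

# The decoded q value still contains the literal semicolon.
print(params["q"][0])
```

If the decoded value contains the semicolon intact, the encoding is not at fault and the special treatment must happen later, inside Solr.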

-Michael
Ken Krugler
2010-09-02 22:57:24 UTC
Post by Michael Lackhoff
According to http://lucene.apache.org/java/2_9_1/queryparsersyntax.html
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
TI:stroke; AND TI:journal
HTTP ERROR: 400
Unknown sort order: TI:journal
My first guess was that it was a URL encoding issue but everything looks
http://localhost:8983/solr/select/?q=TI%3Astroke%3B+AND+TI%3Ajournal&version=2.2&start=0&rows=10&indent=on
as you can see, the semicolon is encoded as %3B
TI:stroke;
gives no error.
TI:stroke\; AND TI:journal
From this I conclude that there is a bug either in the docs or in the
query parser or I missed something. What is wrong here?
The docs need to be updated, I believe. From some code I wrote back in
2006...

// Also note that we escape ';', as Solr uses this to support embedding
// commands into the query string (yikes), and the code base we're using
// has a bug where if the ';' doesn't have two tokens after it (whitespace
// separated) then you get an array index out of bounds error.

I also had this note, no idea if it's still an issue:

// Before we do regular escaping, work around a bug in the Lucene query
// parser. If the last character is a '\', we can escape it as '\\', but
// if we build an expression that looks like xxx AND (<querytext\>) then
// the Lucene query parser will treat the final '\' before the ')' as
// a signal to escape the ')' character. That's just wrong, but for now
// we'll just strip off any trailing '\' characters in the clause.

But in general escaping characters in a query gets tricky - if you can
directly build queries versus pre-processing text sent to the query
parser, you'll save yourself some pain and suffering.
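
For readers who do have to pre-process text, here is a minimal sketch of the kind of escaping the two comments above describe. It is not Ken's original code: the names SPECIAL_CHARS and escape_query_text are hypothetical, the character list is the one from the Lucene 2.9 syntax page plus ';', and stripping trailing backslashes is the lossy workaround mentioned in the second comment, not a general recommendation.

```python
# Metacharacters from the Lucene 2.9 query syntax page, plus ';' which
# older Solr versions treat as the start of a sort instruction.
# '&' and '|' are escaped individually, which also covers '&&' and '||'.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\;')


def escape_query_text(text: str) -> str:
    """Backslash-escape query parser metacharacters in user-supplied text."""
    # Workaround from the note above: drop trailing backslashes so that a
    # later ')' appended by the caller cannot end up escaped by accident.
    text = text.rstrip("\\")
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


print("TI:" + escape_query_text("stroke;") + " AND TI:journal")
```

Note that only the user-supplied term is escaped, never the whole query string, otherwise the field separator ':' would be escaped as well.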

Also, since I did the above code the DisMaxRequestHandler has been
added to Solr, and it (IIRC) tries to be smart about handling this
type of escaping for you.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
Michael Lackhoff
2010-09-03 02:15:04 UTC
Post by Ken Krugler
The docs need to be updated, I believe. From some code I wrote back in
2006...
[...]
Thanks this explains it very well.
Post by Ken Krugler
But in general escaping characters in a query gets tricky - if you can
directly build queries versus pre-processing text sent to the query
parser, you'll save yourself some pain and suffering.
What do you mean by these two alternatives? That is, what exactly could
I do better?
Post by Ken Krugler
Also, since I did the above code the DisMaxRequestHandler has been
added to Solr, and it (IIRC) tries to be smart about handling this
type of escaping for you.
Dismax is not (yet) an option because we need the full lucene syntax
within the query. Perhaps this will change with the new enhanced dismax
request handler but I didn't play with it enough (will do with the next
release).

-Michael
Ken Krugler
2010-09-03 03:42:20 UTC
Hi Michael,
Post by Michael Lackhoff
Post by Ken Krugler
But in general escaping characters in a query gets tricky - if you can
directly build queries versus pre-processing text sent to the query
parser, you'll save yourself some pain and suffering.
What do you mean by these two alternatives? That is, what exactly could
I do better?
By "can build...", I meant if you can come up with a GUI whereby the
user doesn't have to use special characters (other than say quoting)
then you can take a collection of clauses and programmatically build
your query, without using the query parser.

The code I wound up having to write for what seemed like simple
escaping quickly got complex and convoluted - e.g. if you want to
allow "AND" as a term, and don't want it to get processed specially by
the query parser.
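
As an illustration of assembling a query from a collection of clauses, here is a minimal sketch. It is an approximation only: it still produces a string for the query parser, whereas the approach described above builds query objects directly and skips the parser. All names (build_query, clause, OPERATOR_WORDS) are hypothetical, and whether a quoted "AND" actually matches anything depends on the field's analyzer (lowercasing, stop words), which this sketch does not know about.

```python
# Metacharacters from the Lucene 2.9 syntax page, plus ';' for older Solr.
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\;')
# Words the query parser would otherwise read as boolean operators.
OPERATOR_WORDS = {"AND", "OR", "NOT"}


def escape_term(term: str) -> str:
    """Backslash-escape every metacharacter in a single user-supplied term."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)


def clause(field: str, term: str) -> str:
    """Render one field:term clause, quoting operator words and phrases."""
    escaped = escape_term(term)
    if term in OPERATOR_WORDS or any(ch.isspace() for ch in term):
        escaped = '"' + escaped + '"'
    return field + ":" + escaped


def build_query(clauses, operator: str = "AND") -> str:
    """Join (field, term) pairs with a single boolean operator."""
    return (" " + operator + " ").join(clause(f, t) for f, t in clauses)


print(build_query([("TI", "stroke;"), ("TI", "AND")]))
```

Because the user only supplies (field, term) pairs, operators and grouping never come from user text, which is what keeps the escaping rules manageable.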
Post by Michael Lackhoff
Post by Ken Krugler
Also, since I did the above code the DisMaxRequestHandler has been
added to Solr, and it (IIRC) tries to be smart about handling this
type of escaping for you.
Dismax is not (yet) an option because we need the full lucene syntax
within the query.
OK - in that case sounds like you're stuck with escaping.


-- Ken

Michael Lackhoff
2010-09-03 05:49:41 UTC
Hi Ken,
Post by Ken Krugler
Post by Michael Lackhoff
Post by Ken Krugler
But in general escaping characters in a query gets tricky - if you can
directly build queries versus pre-processing text sent to the query
parser, you'll save yourself some pain and suffering.
What do you mean by these two alternatives? That is, what exactly could
I do better?
By "can build...", I meant if you can come up with a GUI whereby the
user doesn't have to use special characters (other than say quoting)
then you can take a collection of clauses and programmatically build
your query, without using the query parser.
I think I have that (escaping of characters that have a special meaning
in Solr). I just didn't know that the semicolon is one of them. So it
would be nice if the docs could be updated to account for this.

Thanks again
-Michael
Chris Hostetter
2010-09-07 22:05:15 UTC
: Subject: Is semicolon a character that needs escaping?
...
: From this I conclude that there is a bug either in the docs or in the
: query parser or I missed something. What is wrong here?

Back in Solr 1.1, the standard query parser treated ";" as a special
character and looked for sort instructions after it.

Starting in Solr 1.2 (released in 2007) a "sort" param was added, and
semicolon was only considered a special character if you did not
explicitly mention a "sort" param (for back compatibility).

Starting with Solr 1.4, the default was changed so that semicolon wasn't
considered a meta-character even if you didn't have a sort param -- you
have to explicitly select the "lucenePlusSort" QParser to get this
behavior.

I can only assume that if you are seeing this behavior, you are either
using a very old version of Solr, or you have explicitly selected the
lucenePlusSort parser somewhere in your params/config.

This was heavily documented in CHANGES.txt for Solr 1.4 (you can find
mention of it when searching for either ";" or "semicolon")



-Hoss

--
http://lucenerevolution.org/ ... October 7-8, Boston
http://bit.ly/stump-hoss ... Stump The Chump!
Michael Lackhoff
2010-09-08 06:16:48 UTC
Post by Chris Hostetter
: Subject: Is semicolon a character that needs escaping?
...
: From this I conclude that there is a bug either in the docs or in the
: query parser or I missed something. What is wrong here?
Back in Solr 1.1, the standard query parser treated ";" as a special
character and looked for sort instructions after it.
Starting in Solr 1.2 (released in 2007) a "sort" param was added, and
semicolon was only considered a special character if you did not
explicitly mention a "sort" param (for back compatibility).
Starting with Solr 1.4, the default was changed so that semicolon wasn't
considered a meta-character even if you didn't have a sort param -- you
have to explicitly select the "lucenePlusSort" QParser to get this
behavior.
I can only assume that if you are seeing this behavior, you are either
using a very old version of Solr, or you have explicitly selected the
lucenePlusSort parser somewhere in your params/config.
This was heavily documented in CHANGES.txt for Solr 1.4 (you can find
mention of it when searching for either ";" or "semicolon")
I am using 1.3 without a sort param which explains it, I think. It would
be nice to update to 1.4 but we try to avoid such actions on a
production server as long as everything runs fine (the semicolon thing
was only reported recently).

Many thanks for your detailed explanation!
-Michael
Chris Hostetter
2010-09-08 19:17:23 UTC
: I am using 1.3 without a sort param which explains it, I think. It would
: be nice to update to 1.4 but we try to avoid such actions on a
: production server as long as everything runs fine (the semicolon thing
: was only reported recently).

If you don't currently use "sort" at all, then by adding a default sort
param of "score desc" to your Solr config for that handler, you shouldn't
ever have to worry about semicolons again.

(I'm fairly certain Solr 1.3 supported "defaults" - I may be wrong ... you
might have to add that hardcoded sort param in your client.)
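
The client-side fallback could look like the following minimal sketch. The function name select_url is hypothetical, and the base URL and query are the ones from the first post. It relies on the behavior described earlier in the thread: from Solr 1.2 on, an explicit "sort" param stops the semicolon from being treated as special.

```python
from urllib.parse import urlencode


def select_url(base_url: str, query: str, **params) -> str:
    """Build a select URL that always carries an explicit sort param."""
    # Only add the default when the caller has not chosen a sort order.
    params.setdefault("sort", "score desc")
    return base_url + "?" + urlencode({"q": query, **params})


url = select_url("http://localhost:8983/solr/select/",
                 "TI:stroke; AND TI:journal", rows=10)
print(url)
```

A caller that needs a different order can still pass sort="..." explicitly, since setdefault leaves an existing value untouched.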


-Hoss
