Discussion:
Synonyms and hyphens
Alireza Salimi
2012-07-03 20:05:28 UTC
Hi,

I'm not sure if anybody has experienced this behavior before or not.
I noticed that 'hyphen' plays a very important role here.
I used Solr's default example directory.
http://localhost:8983/solr/select/?q=name:(gb-mb)&version=2.2&start=0&rows=10&indent=on&debugQuery=on&indent=on&wt=json&q.op=AND
results in "parsedquery":"+name:gb +name:gib +name:gigabyte
+name:gigabytes +name:mb +name:mib +name:megabyte +name:megabytes",

While searching
http://localhost:8984/solr/select/?q=name:(gbmb)&version=2.2&start=0&rows=10&indent=on&debugQuery=on&indent=on&wt=json&q.op=AND
results in "parsedquery":"+(name:gb name:gib name:gigabyte name:gigabytes)
+(name:mb name:mib name:megabyte name:megabytes)",

If you look at the first query (the one with the hyphen), you can see that
the parsed result is totally different. I know that hyphens are special
characters in Solr, but there's no way the first query can return any
entries, because it's asking for ALL synonyms.

Am I missing something here?

Thanks
--
Alireza Salimi
Java EE Developer
Alireza Salimi
2012-07-04 11:50:15 UTC
Hi,

Does anybody know why the hyphen '-' and q.op=AND cause such a big difference
between the two queries? I thought hyphens were removed by StandardTokenizer,
which means that in theory the two queries should be the same!

Thanks
Jack Krupansky
2012-07-04 16:26:59 UTC
Terms with embedded special characters are treated as phrases with spaces in
place of the special characters. So, "gb-mb" is treated as if you had
enclosed the term in quotes.

-- Jack Krupansky
Alireza Salimi
2012-07-04 17:05:26 UTC
Wow, I didn't know that. Is there a way to disable this feature? I mean, is
it something coming from the Analyzer?
Jack Krupansky
2012-07-04 17:37:12 UTC
There is one other detail that should clarify the situation. At query time,
the query parser itself breaks your query into space-delimited terms and
calls the analyzer separately for each of those terms, each of which is
treated as if it were a quoted phrase. So it doesn't matter whether it is the
standard analyzer, the word delimiter filter, or some other filter that
breaks up the compound term.

And the default "query operator" only applies to the "terms" as the query
parser parsed them, not for the sub-terms of a compound term like CD-ROM or
gb-mb.

-- Jack Krupansky

Alireza Salimi
2012-07-04 17:56:23 UTC
OK, so how can I prevent this behavior?
As you can see, the parsed query is very different in these two cases.
Jack Krupansky
2012-07-04 19:18:31 UTC
You could pre-process your queries to convert hyphen and other special
characters to spaces.
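
A minimal sketch of that kind of client-side pre-processing, assuming the
query string is cleaned up before it is sent to Solr. The helper name
split_compound_terms is hypothetical and not part of Solr. It only replaces
hyphens that sit between two word characters, so a leading hyphen (the
prohibit operator, as in "-mb") is left untouched. Other special characters
would need their own handling.

```python
import re

# Matches a hyphen only when it has a word character on both sides,
# e.g. the one in "gb-mb" but not the leading one in "-mb".
_EMBEDDED_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def split_compound_terms(user_query: str) -> str:
    """Hypothetical helper: turn embedded hyphens into spaces.

    The goal is for the query parser to see "gb mb" as two separate
    terms instead of one compound term.
    """
    return _EMBEDDED_HYPHEN.sub(" ", user_query)


if __name__ == "__main__":
    print(split_compound_terms("gb-mb"))
    print(split_compound_terms("cd-rom -mb"))
```

The result would then be placed in the q parameter, for example as
name:(gb mb), so that q.op applies to "gb" and "mb" individually.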

-- Jack Krupansky

Chris Hostetter
2012-07-10 21:46:58 UTC
Which version of Solr are you using?

: Terms with embedded special characters are treated as phrases with spaces in
: place of the special characters. So, "gb-mb" is treated as if you had enclosed
: the term in quotes.

Take a look at the "autoGeneratePhraseQueries" option on your field type ...
depending on the "version" attribute of your <schema /> it may be
defaulting to true.

Setting it to false should cause it to treat "gb" and "mb" as distinct
terms.
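
One way to check the effect of such a change is to request the same query
with debugQuery=on before and after editing the schema and compare the
"parsedquery" values. The sketch below only builds the request URL from the
parameters used earlier in this thread. The function name build_debug_url is
hypothetical, and the base URL assumes the default example setup.

```python
from urllib.parse import urlencode


def build_debug_url(base_url: str, query: str) -> str:
    """Hypothetical helper: build a select URL that returns debug output."""
    params = {
        "q": query,
        "wt": "json",
        "indent": "on",
        "debugQuery": "on",  # adds "parsedquery" to the response
        "q.op": "AND",
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Assumed base URL of the default example instance.
    print(build_debug_url("http://localhost:8983/solr/select/", "name:(gb-mb)"))
```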



-Hoss
