normalizing the score

Discussion:

Paul Libbrecht

2011-04-05 11:14:35 UTC

Hello list,

I did not find a wiki page about normalization.
All I found was: http://search.lucidimagination.com/search/document/9d06882d97db5c59/a_question_about_solr_score

where Hoss suggests normalizing depending on the maxScore.

I am not comfortable with that: at the very least, when someone searches for "the wombats" in a directory of mathematical concepts, I want to show that all scores are pretty bad, and not display 1.0 for matches that are only on the word "the".

It seems that the strategy would be to normalize by maxScore if the maxScore is bigger than 1.0.
Can you confirm that?
Aren't there going to be edge cases similar to the one above?

I remember a time when Lucene result scores were always normalized. That does not seem to be the case in Solr, or is it?

thanks in advance

paul

Chris Hostetter

2011-04-25 20:53:57 UTC


: All I found was: http://search.lucidimagination.com/search/document/9d06882d97db5c59/a_question_about_solr_score
:
: where Hoss suggests normalizing depending on the maxScore.

to be clear, i do not suggest (nor have i ever suggested) that someone
normalize based on maxScore.

my point there was that when people *insist* on providing some sort of
normalization, the maxScore is always available if they want to use it
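
For illustration, a minimal Python sketch of what using maxScore that way could look like. The dictionary below is a made-up sample in the shape of a Solr JSON response when the score is requested in fl; the ids and score values are invented, and relative_scores is a hypothetical helper, not part of Solr.

```python
# Hypothetical sample shaped like a Solr JSON response with fl=id,score.
# The ids and score values are invented for illustration.
sample_response = {
    "response": {
        "numFound": 3,
        "start": 0,
        "maxScore": 0.08,
        "docs": [
            {"id": "doc-a", "score": 0.08},
            {"id": "doc-b", "score": 0.06},
            {"id": "doc-c", "score": 0.02},
        ],
    }
}


def relative_scores(solr_response):
    """Divide each document's score by the maxScore of this result set."""
    body = solr_response["response"]
    max_score = body.get("maxScore") or 0.0
    if max_score <= 0.0:
        # No usable maxScore (for example an empty result set): report zeros.
        return {doc["id"]: 0.0 for doc in body["docs"]}
    return {doc["id"]: doc["score"] / max_score for doc in body["docs"]}


scaled = relative_scores(sample_response)
```

In this sketch the top document always comes out at 1.0, however weak the match, so the values only say how documents compare to each other within one query, not how good the match is.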

: I am not comfortable with that: at the very least, when someone searches
: for "the wombats" in a directory of mathematical concepts, I want to show
: that all scores are pretty bad, and not display 1.0 for matches that are
: only on the word "the".

the crux of the problem is in deciding what you want to normalize relative
to -- the "ideal" solution is to normalize relative to the maximum *possible*
score for *any* query against your corpus, but that's not something that's
generally feasible to do (and based on experiments i tried once, it didn't
seem like it would be very useful anyway)

: It seems that the strategy would be to normalize by maxScore if the maxScore is bigger than 1.0.
: Can you confirm that?
: Aren't there going to be edge cases similar to the one above?
:
: I remember a time when Lucene result scores were always normalized.
: That does not seem to be the case in Solr, or is it?

once upon a time, lucene's most "beginner friendly" api did provide
normalized scores, using the approach you described (divide by max score
if max score greater than 1.0) and it had all of the problems you might
expect -- but some people liked it because they had an irrational dislike
for scores greater than 1.
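
A rough Python sketch of that rule, assuming nothing beyond the description above: the function name hits_style_normalize and the raw scores are invented for illustration, and this is not the actual Lucene code.

```python
def hits_style_normalize(scores):
    """Divide every score by the max score, but only if the max exceeds 1.0."""
    if not scores:
        return []
    top = max(scores)
    if top > 1.0:
        return [s / top for s in scores]
    # Max score is at most 1.0: scores are passed through unchanged.
    return list(scores)


# Invented raw scores for two hypothetical queries.
weak_query = [0.08, 0.06, 0.02]   # best raw score below 1.0
strong_query = [4.0, 2.0, 0.5]    # best raw score above 1.0

weak_out = hits_style_normalize(weak_query)
strong_out = hits_style_normalize(strong_query)
```

With these invented numbers the first list is returned untouched while the second is rescaled so its best hit is 1.0. Any two queries whose best raw scores exceed 1.0 both end up with a top score of exactly 1.0, whatever the raw values were, so the output is still not comparable from one query to the next.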

Solr has never supported those pseudo-normalized scores, and lucene's java
API eventually got rid of them.

-Hoss

Paul Libbrecht

2011-04-25 21:03:04 UTC


Thanks for the clarification, Hoss,

that is a helpful explanation.
I am still unsure how it is ever possible to display score-bars, for which you need some normalization... but that's for another day.

I feel that indicating match quality is still a science that has not blossomed yet.
Sorting by score is, however, in very good shape.

paul
