Single call for distributed IDF?

Discussion:

Walter Underwood

2017-01-24 18:09:00 UTC

I tried running with the LRUStatsCache for global IDF, but the performance penalty was pretty big. The 95th percentile response time went from 3.4 seconds to 13 seconds. Oops.

We should not need a separate call to get the tf and df stats. Those are already calculated when doing the first request. I worked on a search engine that did it that way twenty years ago.

In the past, there would have been an IP obstacle, but I think that is resolved.

wunder
Walter Underwood
***@wunderwood.org
http://observer.wunderwood.org/ (my blog)

Joel Bernstein

2017-01-24 18:34:39 UTC

This may help out:
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/ScoreNodesStream.java#L208

This points to some code that calculates global idf for a list of terms.
Not sure if this matches your use case. It seems to be very fast.

Joel Bernstein
http://joelsolr.blogspot.com/

Walter Underwood

2017-01-24 18:39:29 UTC

I know how to do it. You return df for each term and num_docs then recalculate idf. I wrote up how we did it in Ultraseek XPA about ten years ago, though with MonkeyRank instead of global IDF.

https://observer.wunderwood.org/2007/04/04/progressive-reranking/

I was wondering why Solr makes a separate request to each shard for that information instead of piggybacking it on the original request.
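
For illustration only, the df/num_docs merge described above could look something like this minimal Python sketch. The response shape (`num_docs`, `df`), the function names, and the BM25-style idf formula are all assumptions made for the example; none of this is Solr's actual API or wire format. The idea is that each shard's first-phase response carries its local stats alongside its hits, and the coordinator sums them and recomputes idf without a second round trip.

```python
import math
from collections import Counter


def merge_shard_stats(shard_responses):
    """Sum per-shard num_docs and per-term df into global totals."""
    total_docs = 0
    global_df = Counter()
    for resp in shard_responses:
        total_docs += resp["num_docs"]
        global_df.update(resp["df"])  # adds counts term by term
    return total_docs, dict(global_df)


def idf(df, num_docs):
    """BM25-style idf, used here only as one common example formula."""
    return math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))


def global_idf(shard_responses):
    """Recompute idf for every term from the merged, collection-wide stats."""
    total_docs, global_df = merge_shard_stats(shard_responses)
    return {term: idf(df, total_docs) for term, df in global_df.items()}


# Hypothetical stats piggybacked on two shards' first-phase responses.
shard_responses = [
    {"num_docs": 1000, "df": {"solr": 100, "idf": 5}},
    {"num_docs": 3000, "df": {"solr": 20, "idf": 45}},
]

total_docs, global_df = merge_shard_stats(shard_responses)
idf_by_term = global_idf(shard_responses)
```

With these made-up numbers the two shards disagree sharply on how rare each term is locally, which is the situation global idf is meant to correct.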

wunder
Walter Underwood
***@wunderwood.org
http://observer.wunderwood.org/ (my blog)

Joel Bernstein

2017-01-24 18:43:19 UTC

Ah, I thought you were just interested in a fast way to get at IDF. This
approach does take a callback but it's really fast.

Joel Bernstein
http://joelsolr.blogspot.com/
