Discussion:
Single call for distributed IDF?
Walter Underwood
2017-01-24 18:09:00 UTC
I tried running with the LRUStatsCache for global IDF, but the performance penalty was pretty big. The 95th percentile response time went from 3.4 seconds to 13 seconds. Oops.

We should not need a separate call to get the tf and df stats. Those are already calculated when doing the first request. I worked on a search engine that did it that way twenty years ago.

In the past, there would have been an IP obstacle, but I think that is resolved.

wunder
Walter Underwood
***@wunderwood.org
http://observer.wunderwood.org/ (my blog)
Joel Bernstein
2017-01-24 18:34:39 UTC
This may help out:
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/ScoreNodesStream.java#L208

This points to some code that calculates global idf for a list of terms.
Not sure if this matches your use case. It seems to be very fast.

Joel Bernstein
http://joelsolr.blogspot.com/
Walter Underwood
2017-01-24 18:39:29 UTC
I know how to do it. You return df for each term and num_docs, then recalculate idf. I wrote up how we did it in Ultraseek XPA about ten years ago, though with MonkeyRank instead of global IDF.

https://observer.wunderwood.org/2007/04/04/progressive-reranking/

I was wondering why Solr makes a separate request to each shard for that information instead of piggybacking it on the original request.
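
For readers following along, here is a minimal sketch of the recalculation step described above: sum num_docs and per-term df across the shard responses, then compute idf from the totals. The response shape, the function names, and the sample numbers are all hypothetical, and the BM25-style idf formula is an assumption chosen for illustration, not necessarily what Solr or Ultraseek uses.

```python
import math
from collections import Counter


def merge_shard_stats(shard_stats):
    """Sum num_docs and per-term df across all shard responses."""
    total_docs = 0
    total_df = Counter()
    for stats in shard_stats:
        total_docs += stats["num_docs"]
        total_df.update(stats["df"])  # adds counts term by term
    return total_docs, total_df


def global_idf(total_docs, total_df):
    """BM25-style idf per term, computed from collection-wide counts (assumed formula)."""
    return {
        term: math.log(1 + (total_docs - df + 0.5) / (df + 0.5))
        for term, df in total_df.items()
    }


# Hypothetical statistics that each shard might attach to its normal response.
shard_stats = [
    {"num_docs": 1000, "df": {"solr": 100, "idf": 5}},
    {"num_docs": 3000, "df": {"solr": 20, "idf": 45}},
]

total_docs, total_df = merge_shard_stats(shard_stats)
idf = global_idf(total_docs, total_df)
```

In a single-call design these counts would ride along with the hits each shard already returns, so the merge costs no extra round trip.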

wunder
Walter Underwood
***@wunderwood.org
http://observer.wunderwood.org/ (my blog)
Joel Bernstein
2017-01-24 18:43:19 UTC
Ah, I thought you were just interested in a fast way to get at IDF. This
approach does take a callback but it's really fast.

Joel Bernstein
http://joelsolr.blogspot.com/
Walter Underwood
2017-01-24 19:01:27 UTC
Specifically, I’m talking about this:

<statsCache class="org.apache.solr.search.stats.LRUStatsCache"/>

Adding that line increased our 95th percentile response time by 10 seconds.

wunder
Walter Underwood
***@wunderwood.org
http://observer.wunderwood.org/ (my blog)
Joel Bernstein
2017-01-24 20:28:29 UTC
OK, my mistake. I was thinking you were writing your own component and
needed a fast way to get global IDF. It sounds like you're looking for fast
global IDF during scoring. That seems like a reasonable thing to want.

In the piggybacking approach you mention, does the aggregator node parse
the query and fetch the IDF, then pass it along to the shards?

Joel Bernstein
http://joelsolr.blogspot.com/
Joel Bernstein
2017-01-24 20:30:02 UTC
Reading your blogs now.

Joel Bernstein
http://joelsolr.blogspot.com/
Walter Underwood
2017-01-31 16:59:14 UTC
The usual reason to do a second call to get the stats for global IDF is to get around an Infoseek patent on the single call version. But that patent finally expired a couple of years ago, so now there is no reason to do a second call.

wunder
Walter Underwood
***@wunderwood.org
http://observer.wunderwood.org/ (my blog)
Joel Bernstein
2017-01-31 17:47:21 UTC
I think I understand the process you describe in your blog, but I'm not sure
I would choose that approach. For some Streaming Expressions work I was
doing, I fetched the global IDF for the specific terms up front at the
aggregator node. This took around 5-10 milliseconds in my tests, and I
implemented no caching at all, so the calls to the shards were made each
time. With a simple cache it could be made more efficient. Once you have the
IDF at the aggregator node, you could push the global IDF to the shards
pretty easily. Granted, this does involve another call to the shards, but
the overhead was so low that it seemed acceptable.
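
A rough sketch of the idea above, under stated assumptions: the aggregator looks up df and num_docs on every shard for each query term, and a simple cache avoids repeating those shard calls for terms it has already seen. The in-memory SHARDS list, fetch_shard_df, and the BM25-style formula are hypothetical stand-ins for illustration; this is not the Streaming Expressions code.

```python
import math
from functools import lru_cache

# Hypothetical in-memory stand-ins for shards; a real system would make network calls.
SHARDS = [
    {"num_docs": 1000, "df": {"solr": 100, "idf": 5}},
    {"num_docs": 3000, "df": {"solr": 20, "idf": 45}},
]
shard_calls = 0  # counts simulated shard requests


def fetch_shard_df(shard, term):
    """Simulate one request to a shard for its num_docs and the term's df."""
    global shard_calls
    shard_calls += 1
    return shard["num_docs"], shard["df"].get(term, 0)


@lru_cache(maxsize=1024)
def cached_global_idf(term):
    """Aggregate df across shards once per term; later lookups hit the cache."""
    total_docs = 0
    total_df = 0
    for shard in SHARDS:
        num_docs, df = fetch_shard_df(shard, term)
        total_docs += num_docs
        total_df += df
    # BM25-style idf, assumed here purely for illustration.
    return math.log(1 + (total_docs - total_df + 0.5) / (total_df + 0.5))


def idf_for_query(terms):
    """Global idf for each query term, ready to be pushed to the shards."""
    return {term: cached_global_idf(term) for term in terms}


first = idf_for_query(["solr", "idf"])
calls_after_first = shard_calls
second = idf_for_query(["solr", "idf"])  # served from the cache
```

One caveat with any cache like this: the cached values go stale as the index changes, so a real implementation would need some expiry or invalidation policy.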

This is quite different than what you describe, and also quite different
than the stats caching approach that is currently in Solr.

Maybe I'm just biased toward my own approach, but it seems simple and fast.

Joel Bernstein
http://joelsolr.blogspot.com/