Discussion:
Getting facet counts for 10,000 most relevant hits
Burton-West, Tom
2011-09-24 00:59:08 UTC
Permalink
If relevance ranking is working well, in theory it doesn't matter how many hits you get as long as the best results show up in the first page of results. However, the default in choosing which facet values to show is to show the facets with the highest count in the entire result set. Is there a way to issue some kind of a filter query or facet query that would show only the facet counts for the 10,000 most relevant search results?

As an example, if you search in our full-text collection for "jaguar" you get 170,000 hits. If I am looking for the car rather than the OS or the animal, I might expect to be able to click on a facet and limit my results to the car. However, facets containing the word car or automobile are not in the top 5 facets that we show. If you click on "more" you will see "automobile periodicals" but not the rest of the facets containing the word automobile. This occurs because the facet counts are for all 170,000 hits: the counts include at least 160,000 irrelevant hits (assuming only the top 10,000 hits are relevant).
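For concreteness, this is roughly the kind of request we issue today, sketched with SolrJ (the URL and the "topic" facet field are placeholders, not our real schema):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DefaultFacets {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/books").build()) {
                SolrQuery q = new SolrQuery("jaguar");
                q.setFacet(true);
                q.addFacetField("topic");
                q.setFacetLimit(5);  // the 5 values we have room to show
                QueryResponse rsp = solr.query(q);
                // These counts are computed over all ~170,000 matches,
                // not just the most relevant ones.
                rsp.getFacetField("topic").getValues().forEach(c ->
                    System.out.println(c.getName() + ": " + c.getCount()));
            }
        }
    }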

What we would like to do is get the facet counts for the N most relevant documents and select the 5 or 30 facet values with the highest counts for those relevant documents.

Is this possible or would it require writing some Lucene or Solr code?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
Lan
2011-09-29 23:39:32 UTC
Permalink
I implemented a similar feature for a categorization suggestion service. I did the faceting in the client code, which is not exactly the best-performing approach, but it worked very well.

It would be nice to have the Solr server do the faceting for performance.
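For what it's worth, a minimal sketch of that kind of client-side faceting looks something like this, assuming the facet field is stored and returned with each hit (the URL and the "topic" field are made up):

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class ClientSideFacets {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/books").build()) {
                SolrQuery q = new SolrQuery("jaguar");
                q.setRows(10000);           // only the N most relevant hits
                q.setFields("id", "topic"); // facet field must be stored
                Map<String, Integer> counts = new HashMap<>();
                for (SolrDocument doc : solr.query(q).getResults()) {
                    Collection<Object> vals = doc.getFieldValues("topic");
                    if (vals == null) continue;
                    for (Object v : vals) {
                        counts.merge(v.toString(), 1, Integer::sum);
                    }
                }
                // Show the top 5 facet values among just those hits.
                counts.entrySet().stream()
                      .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                      .limit(5)
                      .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
            }
        }
    }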
Burton-West, Tom
2011-09-30 22:40:54 UTC
Permalink
Hi Lan,

I figured out how to do this in a kludgey way on the client side, but it seems this could be implemented much more efficiently at the Solr/Lucene level. I described my kludge and posted a question about this to the dev list, but so far have not received any replies (http://lucene.472066.n3.nabble.com/Solr-should-provide-an-option-to-show-only-most-relevant-facet-values-tc3374285.html). I also found SOLR-385, but I don't understand how grouping solves the problem; it looks like a much different issue to me.
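For illustration, one plausible shape for such a client-side kludge is a two-pass request, sketched here with SolrJ (the URL and field names are made up, and a 10,000-clause id filter would require raising maxBooleanClauses well past the default):

    import java.util.stream.Collectors;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TopNFacetKludge {
        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/books").build()) {
                // Pass 1: fetch only the ids of the N most relevant hits.
                SolrQuery pass1 = new SolrQuery("jaguar");
                pass1.setRows(10000);
                pass1.setFields("id");
                String idFilter = solr.query(pass1).getResults().stream()
                    .map(d -> "\"" + d.getFieldValue("id") + "\"")
                    .collect(Collectors.joining(" OR ", "id:(", ")"));

                // Pass 2: facet only over those ids; Solr computes the counts.
                SolrQuery pass2 = new SolrQuery("*:*");
                pass2.addFilterQuery(idFilter);  // huge boolean filter -- the kludgey part
                pass2.setFacet(true);
                pass2.addFacetField("topic");
                pass2.setFacetLimit(30);
                pass2.setRows(0);
                QueryResponse rsp = solr.query(pass2);
                rsp.getFacetField("topic").getValues().forEach(c ->
                    System.out.println(c.getName() + ": " + c.getCount()));
            }
        }
    }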

The problem I am trying to solve is that I only have room in the interface to show at most 30 facet values. Whether those values are ordered by facet counts against the entire result set or by the highest relevance score of a member of each facet-value group, we want to base the facet counts/ranking on only the top N hits rather than the entire result set; in my use case, the top 10,000 hits versus all 170,000.

Tom

Chris Hostetter
2011-10-01 01:19:50 UTC
Permalink
: I figured out how to do this in a kludgey way on the client side but it
: seems this could be implemented much more efficiently at the Solr/Lucene
: level. I described my kludge and posted a question about this to the

It can, and I have -- but only for the case of a single node...

In general the faceting code in Solr just needs a DocSet. The default implementation uses the DocSet computed as a side effect of executing the main search, but a custom SearchComponent could pick any DocSet it wants.

A few years back I wrote a custom faceting plugin that computed a "score"
for each constraint based on:
* Editorially assigned weights from a config file
* the number of matching documents (ie: normal constraint count)
* the number of matching documents from the first N results

...where the last number was determined by internally executing the search with "rows" of N to generate a DocList object, and then converting that DocList into a DocSet and using that as the input to SimpleFacetCounts.
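Outside of Solr's plugin machinery, the core idea (count facet values over only the top N hits) boils down to something like this plain-Lucene sketch; the index path and field names are assumptions:

    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class TopNFacetSketch {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // The "main search" -- just a term query for the example.
                TopDocs top = searcher.search(new TermQuery(new Term("text", "jaguar")), 10000);

                // The DocList -> DocSet idea: count only over exactly these hits.
                Map<String, Integer> counts = new HashMap<>();
                for (ScoreDoc sd : top.scoreDocs) {
                    Document doc = searcher.doc(sd.doc);
                    for (String v : doc.getValues("topic")) {  // stored facet field
                        counts.merge(v, 1, Integer::sum);
                    }
                }
                counts.entrySet().stream()
                      .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                      .limit(30)
                      .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
            }
        }
    }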

Ignoring the "Editorial weights" part of the above, the logic for
"scoring" constraints based on the other two factors is general enough
thta it could be implemented in solr, we just need a way to configure "N"
and what kind of function should be applied to the two counts.
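For example, the combining function could be as simple as this (purely hypothetical, not anything Solr ships; the weights are arbitrary):

    public class ConstraintScore {
        // Favor constraints common among the top N hits, with a small
        // nudge for overall volume across the whole result set.
        public static double score(long countInTopN, long totalCount) {
            return countInTopN + 0.1 * Math.log1p(totalCount);
        }
    }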

...But...

This approach really breaks down in a distributed model. You can't do the same quick and easy DocList->DocSet transformation on each node; you have to do more complicated federating logic like the existing FacetComponent code does, and even there we don't have anything that would help with the "only the first N" type logic. My best idea would be to do the same thing you describe in your "kludge" approach to solving this in the client...

: (http://lucene.472066.n3.nabble.com/Solr-should-provide-an-option-to-show-only-most-relevant-facet-values-tc3374285.html).

...the coordinator would have to query all of the shards for their top N, and then tell each one exactly which of those docs to include in the "weighted facets constraints" count ... which would make for some really big requests if N is large.

The only sane way to do this type of thing efficiently in a distributed setup would probably be to treat the "top N" part of the goal as a "guideline" for a sampling problem: tell each shard to consider only *their* top N results when computing the top facets in shardReq #1, and then do the same "give me an exact count" type logic in shardReq #2 that we already do. So the constraints picked may not actually be the top constraints for the first N docs across the whole collection (just like right now they aren't guaranteed to be the top constraints for all docs in the collection in a long-tail situation), but they would be representative of the "first-ish" docs across the whole collection.

-Hoss
Burton-West, Tom
2011-10-03 18:05:23 UTC
Permalink
Thanks so much for your reply, Hoss.

I didn't realize how much more complicated this gets with distributed search. Do you think it's worth opening a JIRA issue for this?
Is there already some ongoing work on the faceting code that this might fit in with?

In the meantime, I think I'll go ahead and do some performance tests on my kludge. That might work for us as an interim measure until I have time to dive into the Solr/Lucene distributed faceting code.

Tom

Chris Hostetter
2011-10-15 22:02:28 UTC
Permalink
: I didn't realize how much more complicated this gets with distributed
: search. Do you think it's worth opening a JIRA issue for this?

Features are always worth opening JIRAs for if you have ideas related to those features to add as comments (or a patch).

By all means open a JIRA and put in whatever relevant notes you think make sense (crib from my email as much as you want).

As I (think I) mentioned: the only feasible way I can think of to approach this type of problem in a generalized way at scale is to think about the API as a "sampling" API, where instead of specifying absolutes (ie: give me the top 100 constraints from the top 10,000 matches) the API works in terms of "goals" (ie: suggest the top 100 constraints based on the top 10% of matches), and then Solr has some wiggle room -- it can ask each shard for 100*N constraints from the top (10*M)% of its matches, then weight all those constraints based on how many matches come from each shard to pick the final 100 constraints, then ask each shard for the final counts for those constraints (like it already does).
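To make the weighting step concrete, a hypothetical coordinator-side merge might look like this (ShardResult and all the numbers involved are invented for illustration, not an existing Solr API):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class SampledFacetMerge {
        // What each shard would report from shardReq #1: its total match count
        // and candidate constraint counts from its own top slice of matches.
        record ShardResult(long matches, Map<String, Long> candidateCounts) {}

        // Weight each shard's candidate counts by its share of the total
        // matches, then keep the best 100 constraints for shardReq #2.
        static List<String> pickConstraints(List<ShardResult> shards) {
            long totalMatches = shards.stream().mapToLong(ShardResult::matches).sum();
            Map<String, Double> merged = new HashMap<>();
            for (ShardResult s : shards) {
                double share = (double) s.matches() / totalMatches;
                s.candidateCounts().forEach((constraint, count) ->
                    merged.merge(constraint, count * share, Double::sum));
            }
            return merged.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(100)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
        }
    }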

: Is there already some ongoing work on the faceting code that this might fit in with?

Not that I know of.


-Hoss
