Discussion:
Is there a way to retrieve a term's position/offset in Solr
forest_soup
2017-03-27 07:09:25 UTC
We are going to implement the following feature:
When a user opens a document whose body field is already indexed in Solr, and
a keyword search was issued before opening it, highlight the keyword in the
opened document.

That requires the position/offset info of the keyword in the document's index,
which I think can be indexed or stored in Solr in some way. We are looking for
a way to retrieve it through any Solr API.

Thanks!



--
View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-retrieve-the-a-term-s-position-offset-in-Solr-tp4326931.html
Sent from the Solr - User mailing list archive at Nabble.com.
Emir Arnautovic
2017-03-27 11:02:10 UTC
It seems to me that you are looking for Solr's highlighting functionality:

https://cwiki.apache.org/confluence/display/solr/Highlighting

HTH,
Emir
forest_soup
2017-03-28 08:59:40 UTC
Thanks Emir.

Actually, Solr's highlighting does not meet my requirement. I do not want to
show the highlighted words in snippets, but in the whole opened document. So I
would like to get the term's position/offset info from Solr. I went through
the highlighting feature, but found that the exact info (position/offset) is
not returned. If you know how to get that info from the highlighting feature,
could you please point it out to me?

The most promising way seems to be the /tvrh handler with the tv.offsets and
tv.positions parameters, but I haven't tried it. Any comments on that one?

Thanks!



Bjarke Buur Mortensen
2017-03-28 10:46:54 UTC
Well, you can get Solr to highlight the entire field if that's what you are
after by setting:
hl.fragsize=0

From
https://cwiki.apache.org/confluence/display/solr/Highlighting#Highlighting-Usage:
Specifies the approximate size, in characters, of fragments to consider for
highlighting. *0* indicates that no fragmenting should be considered and
the whole field value should be used.
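
For reference, a minimal sketch of what such a request could look like. The core name (mycore), the document id, and the field names (id, body) are assumptions for illustration only; hl, hl.fl and hl.fragsize are the standard highlighting parameters. The sketch only builds the URL and does not send it.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, doc_id, keyword, field="body"):
    """Build a /select URL that highlights `keyword` over the whole `field`.

    `base_url`, `doc_id` and `field` are placeholders for this sketch.
    The keyword is not escaped for Solr query syntax here.
    """
    params = {
        "q": f"{field}:{keyword}",   # the keyword search
        "fq": f"id:{doc_id}",        # restrict to the document being opened
        "hl": "true",                # enable highlighting
        "hl.fl": field,              # field to highlight
        "hl.fragsize": "0",          # 0 = no fragmenting, use the whole field value
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


url = build_highlight_url("http://localhost:8983/solr/mycore", "doc42", "lucene")
print(url)
```

For very large fields it is probably also worth checking hl.maxAnalyzedChars, since the highlighter only analyzes a limited number of characters by default.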
simon
2017-03-28 13:02:48 UTC
You might want to take a look at the patch in
https://issues.apache.org/jira/browse/SOLR-4722 - 'Highlighter which
generates a list of query term position(s) for each item in a list of
documents, or returns null if highlighting is disabled.' I've used it for
retrieving the term positions with no need for actual highlighting. The
patch is pretty old (I think I applied it to Solr 4.10), so it will probably
need some work for later releases.

HTH

-Simon
forest_soup
2017-03-29 02:44:15 UTC
Thanks all!

Actually, we are going to show the highlighted words in a rich text format
rather than in the plain text that was indexed, so hl.fragsize=0 does not seem
to work for me.

As for the patch (SOLR-4722), I haven't tried it yet. I hope it can return the
position/offset info.

Thanks!



Bjarke Buur Mortensen
2017-03-30 06:25:36 UTC
OK, so the next thing to do would be to index and store the rich text ...
is it HTML? Because then you can use HTMLStripCharFilterFactory in your
analyzer, and still get the correct highlight back with hl.fragsize=0.

I would think that you will have a hard time using the term positions, if
what you are indexing is somehow transformed before indexing and you want
to map the positions back to the untransformed text.
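
To illustrate why the mapping is the hard part, here is a rough client-side sketch of stripping markup while remembering where each plain-text character came from. The bracket markup ([b] ... [/b]) and all function names are made up for this example; this is not how Solr's char filters are implemented, only the same general idea of carrying an offset correction along with the stripped text.

```python
import re

# Hypothetical markup for this sketch: anything in square brackets, e.g. [b] or [/b].
MARKUP = re.compile(r"\[[^\]]*\]")


def strip_with_offsets(rich):
    """Return (plain, offset_map).

    offset_map[i] is the index in `rich` of the character stored at plain[i].
    """
    plain_chars = []
    offset_map = []
    pos = 0
    for match in MARKUP.finditer(rich):
        # Copy the text before this markup run, recording original indices.
        for i in range(pos, match.start()):
            plain_chars.append(rich[i])
            offset_map.append(i)
        pos = match.end()  # skip the markup itself
    for i in range(pos, len(rich)):
        plain_chars.append(rich[i])
        offset_map.append(i)
    return "".join(plain_chars), offset_map


def to_rich_span(offset_map, start, end):
    """Map a [start, end) span in the plain text to a span in the rich text."""
    return offset_map[start], offset_map[end - 1] + 1


rich = "Solr [b]term[/b] offsets"
plain, offset_map = strip_with_offsets(rich)
start = plain.index("term")
rich_start, rich_end = to_rich_span(offset_map, start, start + len("term"))
print(plain, (rich_start, rich_end), rich[rich_start:rich_end])
```

Note that if markup occurs in the middle of a word, the mapped span will include that markup, so the viewer would have to cope with it.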
forest_soup
2017-03-30 08:39:39 UTC
Unfortunately the rich text is not HTML, XML, DOC, PDF or any other popular
rich text format, and we would like to show the highlighted text in the
document's own dedicated viewer. That is why I am so eager to get the offsets.

The /tvrh handler (term vector component) with tv.offsets and tv.positions can
give us that info, but it returns data for all terms rather than only the ones
being searched for. So we are still looking for a way to filter the results.

Any ideas?

Thanks!
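
One option is to filter on the client side. The sketch below assumes the /tvrh response has already been parsed into a plain dict mapping each indexed term to a list of (start, end) offsets; the raw response layout depends on the response writer settings and is not reproduced here. Lower-casing stands in for the field's real analysis chain, which is an assumption: indexed terms are post-analysis, so the query keywords have to be normalized the same way before comparing.

```python
def offsets_for_query_terms(term_offsets, query_terms):
    """Keep only the offsets of terms that were searched for.

    term_offsets: dict of indexed term -> list of (start, end) tuples
                  (assumed shape, already parsed from the term vector response).
    query_terms:  keywords from the user's search.
    """
    # Stand-in for the real analyzer: only lower-casing is applied here.
    wanted = {term.lower() for term in query_terms}
    hits = []
    for term, spans in term_offsets.items():
        if term in wanted:
            hits.extend((start, end, term) for start, end in spans)
    return sorted(hits)  # ordered by position in the document


# Made-up term vector data for one document.
term_offsets = {
    "solr": [(0, 4), (30, 34)],
    "term": [(5, 9)],
    "offset": [(10, 17)],
}
hits = offsets_for_query_terms(term_offsets, ["Solr", "offset"])
print(hits)
```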



Rick Leir
2017-03-30 10:25:23 UTC
Hi forest
Do you have an HTML-to-rich-text converter? You could run it on the highlighter's output. Otherwise you could count characters in the HTML. That might only be useful if your rich text font is fixed width.
Cheers -- Rick
forest_soup
2017-04-07 08:25:38 UTC
Thanks Rick. Unfortunately we have no such converter, so we have to count
characters in the rich text.




Bjarke Buur Mortensen
2017-03-30 15:00:27 UTC
OK, that complicates things a bit.

I would still try to go for a solution where you store the rich text in
Solr, but make sure you tokenize it correctly.

If the format is relatively simple, you could use either a regexp pattern
tokenizer
https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-SimplifiedRegularExpressionPatternTokenizer

or perhaps, before tokenization, use a pattern replace char filter to strip
out the parts of the rich text that should not be indexed
https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.PatternReplaceCharFilterFactory

I assume that you have some process for converting the rich text to plain
text before indexing, so if you can replicate that process using Solr's
charfilters, tokenizers and filters then that would allow you to use the
highlighter to get the rich text back.

HTH,
Bjarke
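
As a rough illustration of the pattern-based idea, here is a client-side sketch that tokenizes the rich text directly with regular expressions, so that every token offset refers to the original rich text. The bracket markup and the function names are invented for the example, and blanking markup with same-length whitespace is just one simple way to keep offsets stable; it is not what Solr's PatternReplaceCharFilterFactory does internally.

```python
import re

# Hypothetical markup for this sketch: anything in square brackets, e.g. [i] ... [/i].
MARKUP = re.compile(r"\[[^\]]*\]")
WORD = re.compile(r"\w+")


def tokenize_rich(rich):
    """Return (token, start, end) tuples with offsets into the rich text."""
    # Replace each markup run with spaces of the same length, so the
    # positions of the remaining characters do not shift.
    blanked = MARKUP.sub(lambda m: " " * len(m.group()), rich)
    return [(m.group().lower(), m.start(), m.end()) for m in WORD.finditer(blanked)]


def find_keyword(rich, keyword):
    """Return the (start, end) spans in the rich text where `keyword` occurs."""
    wanted = keyword.lower()
    return [(start, end) for token, start, end in tokenize_rich(rich) if token == wanted]


rich = "[i]Solr[/i] stores term offsets for Solr fields"
spans = find_keyword(rich, "solr")
print(spans, [rich[s:e] for s, e in spans])
```

The spans can be handed straight to a viewer that addresses the rich text by character offset. The limitation is that the tokenization here is much simpler than a real analysis chain (no stemming, no synonyms), so matches may differ from what the Solr query actually hit.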