[Xapian-discuss] add_posting(): term position significance - line or offset?

Richard Boulton richard at lemurconsulting.com
Tue Nov 18 16:38:17 GMT 2008


Henry wrote:
> Greets,
> 
> WRT add_posting() and the term's position:  presumably it's best to  
> use the actual offset in the source as the position, rather than the  
> line number containing the term, right?

The usual approach is to store the "word number" at which a word appears, 
and this is probably what you want.  However, you could store the line 
number if you wanted: phrase searches (with a window equal to the number 
of terms in the phrase) would then match when the words were fairly 
spread out (i.e., up to one per line).

I recommend using word number, anyway, unless you have a very odd 
situation I've not thought of.
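
To make the difference concrete, here is a toy model in plain Python.  It 
is not Xapian code and not Xapian's matcher: the helpers word_positions(), 
line_positions() and phrase_matches() are made up for this sketch, and the 
matching rule (terms in order, at strictly increasing positions, all 
inside the window) is a simplifying assumption.  In real indexing code 
the position computed here is what you would pass as the second argument 
to add_posting().

```python
from itertools import product


def word_positions(text):
    """Map each term to the 1-based word numbers at which it occurs."""
    postings = {}
    for pos, word in enumerate(text.lower().split(), start=1):
        postings.setdefault(word, []).append(pos)
    return postings


def line_positions(text):
    """Map each term to the 1-based line numbers on which it occurs."""
    postings = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        for word in line.lower().split():
            postings.setdefault(word, []).append(lineno)
    return postings


def phrase_matches(postings, terms, window=None):
    """Simplified phrase check (an assumption, not Xapian's matcher):
    the terms must occur in order at strictly increasing positions,
    all within `window` positions.  The window defaults to the number
    of terms in the phrase."""
    if not terms:
        return False
    if window is None:
        window = len(terms)
    candidates = [postings.get(term, []) for term in terms]
    for combo in product(*candidates):
        increasing = all(a < b for a, b in zip(combo, combo[1:]))
        if increasing and combo[-1] - combo[0] < window:
            return True
    return False


TEXT = "the quick dog\nran past a brown cat\nand a lazy fox"
PHRASE = ["quick", "brown", "fox"]

by_word = word_positions(TEXT)
by_line = line_positions(TEXT)

print(phrase_matches(by_word, PHRASE))  # False: the words are far apart
print(phrase_matches(by_line, PHRASE))  # True: one word per consecutive line
```

With word numbers the phrase "quick brown fox" does not match this text, 
because the three words are several positions apart.  With line numbers 
the same phrase does match, since each word sits on the next line.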

> I take it this may result in more accurate phrase searching, and  
> better general search results since term items' proximity would  
> increase their score.

Note that Xapian currently doesn't modify the weight of a phrase based 
on how close together the terms are.  A phrase search either matches the 
phrase, in which case the weight is the sum of the weights of the 
constituent terms, or it doesn't, in which case the phrase contributes 
no weight and the document won't be returned unless other parts of the 
query match it.  This is something that could be improved, but we 
haven't had the time (or motivation) to fix it yet...
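
As a rough illustration of that all-or-nothing behaviour, here is a small 
plain-Python sketch.  The function phrase_weight() and the per-term 
weights are hypothetical, chosen only for the example; they are not values 
or calls from Xapian, and the window test is the same simplified rule 
assumed above (terms in order, all inside the window).

```python
def phrase_weight(term_weights, positions, window):
    """Return the phrase's contribution to a document's weight.

    `positions` holds one position per phrase term, in phrase order.
    If the terms are in order and fit inside the window, the phrase
    contributes the sum of the term weights; otherwise it contributes
    nothing.  How tightly the terms are packed makes no difference.
    """
    in_order = all(a < b for a, b in zip(positions, positions[1:]))
    matched = in_order and positions[-1] - positions[0] < window
    return float(sum(term_weights)) if matched else 0.0


weights = [1.5, 2.0, 0.5]  # hypothetical per-term weights

tight = phrase_weight(weights, [4, 5, 6], window=5)   # adjacent words
loose = phrase_weight(weights, [4, 6, 8], window=5)   # spread out, still in window
miss = phrase_weight(weights, [4, 9, 20], window=5)   # outside the window

print(tight, loose, miss)  # 4.0 4.0 0.0
```

The adjacent and the spread-out occurrences get exactly the same weight; 
only falling outside the window changes the result, and then the phrase 
contributes nothing at all.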

-- 
Richard


