[Xapian-discuss] add_posting(): term position significance - line or offset?

Richard Boulton richard at lemurconsulting.com
Tue Nov 18 18:41:10 GMT 2008


Henry wrote:
>> Not currently...
>>
>> Cheers,
>>    Olly
> 
> 
> Pity - is that an issue which needs to be addressed in search-code  
> only, or indexing and search?
> 
> Hmm, based on my admittedly superficial understanding of Xapian so  
> far:  if the positional info is available for all term postings, then  
> could the search code not be extended to score higher for terms closer  
> together?  This to my mind would be a rather important aspect of  
> scoring, and one which I'd like to explore with a view to possible  
> sponsored development (small personal purse, so don't get too excited:).

It could be implemented simply by using the existing database structures 
(i.e., the stored term postings).  However, the performance of such 
searches might not be terribly good - currently the performance of 
phrase searches is significantly worse than that of non-phrase searches, 
because the positional information is stored separately, so requires 
extra disk accesses.
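
To make the idea concrete, here is a minimal sketch of how a proximity bonus could be derived from two terms' position lists. It is plain Python and does not use the Xapian API; the names min_span and proximity_bonus, the window parameter, and the 1/(1+gap) formula are all assumptions chosen for illustration, not anything Xapian implements.

```python
def min_span(positions_a, positions_b):
    """Return the smallest gap between any position in a and any in b.

    Both lists are assumed to be sorted in ascending order, as a
    position list would be. Returns None if either list is empty.
    """
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance whichever list is currently behind.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_bonus(positions_a, positions_b, window=10):
    """Hypothetical bonus: larger when the two terms occur closer together."""
    gap = min_span(positions_a, positions_b)
    if gap is None or gap > window:
        return 0.0
    return 1.0 / (1 + gap)
```

The merge over the two lists is linear in their combined length, so the cost here is mainly in fetching the position lists for every matching document, which is the extra disk access described above.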

If we moved to a situation where we were accessing the positional 
information for pretty much every search term, we would probably be 
better off moving the positional information into the main posting 
lists.  However, this would slow down non-positional searches.

There are various things we could try to mitigate or avoid such a 
slowdown - one idea would be to store extra terms indicating the nearby 
occurrence of "significant" words (identified by some algorithm such as 
picking words with a high term weight), and use them instead of 
positional information to search for the nearby occurrence of words.
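
As a rough illustration of the "extra terms" idea, the sketch below generates one synthetic term per pair of significant words that occur within a small window of each other at index time. It is plain Python rather than Xapian code; the function name pair_terms, the XNEAR prefix, the window size, and the way the significant set is supplied are all hypothetical. Choosing the significant words (for example by term weight) is left out.

```python
def pair_terms(tokens, significant, window=5, prefix="XNEAR"):
    """Return synthetic terms for significant words occurring near each other.

    tokens      -- the document's words in order
    significant -- set of words judged significant (selection not shown)
    window      -- how many following words count as "nearby" (assumed value)
    prefix      -- hypothetical term prefix marking these synthetic terms
    """
    terms = set()
    for i, tok in enumerate(tokens):
        if tok not in significant:
            continue
        # Look only forwards; sorting the pair makes the term order-independent.
        for other in tokens[i + 1 : i + 1 + window]:
            if other in significant and other != tok:
                a, b = sorted((tok, other))
                terms.add(f"{prefix}{a}_{b}")
    return terms
```

A search for two significant words could then look up the single synthetic term in the ordinary posting lists instead of reading position lists, at the cost of a larger index.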

Summary: it could be implemented just by changing the search part of the 
code, but for good performance we might need to fiddle with the index 
structures, too.

-- 
Richard


