[Xapian-discuss] Ranking and term proximity

Sun Sep 4 19:11:12 BST 2011

goran kent writes:
 > Hi,
 > 
 > I was reading an article recently about how google ranks results
 > (among many other things of course) based on the proximity of the
 > search terms in the source documents.  In addition, the position of
 > the search terms in the search query string itself is also taken into
 > consideration when determining how important each term is.
 > 
 > Does Xapian do something similar - at least for the first part?
 > 
 > For example, if I search for 'Olly Betts' - without double quotes in
 > two documents the first of which the terms 'Olly' and 'Betts' are
 > widely separated, and the second contains the terms 'Olly Betts' right
 > next to each other, will the latter document score higher?  Please
 > tell me it is.

Hopefully one of the Xapian developers will correct me, but I think that
Xapian does no such thing, leaving this kind of ranking to the application
software.

Recoll has an option to automatically add a phrase search to simple
queries, in order to obtain the effect you describe, but it's off by
default because phrase/proximity searches can be very slow, especially if
the terms are common.
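
To make the intended effect concrete, here is a minimal sketch in plain
Python. It is a toy, not how Recoll or Xapian implement anything: the
function names, the whitespace tokenizer and the phrase_boost value are
all assumptions made for illustration. A document that contains the query
terms as an exact phrase gets its score multiplied, which is roughly what
OR-ing a phrase query onto the simple query achieves.

```python
def contains_phrase(tokens, terms):
    """Return True if `terms` occurs as a contiguous run inside `tokens`."""
    n = len(terms)
    return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))


def score(doc_text, query, phrase_boost=2.0):
    """Toy score: count query term occurrences, boosted on a phrase match.

    Tokenization is a plain lowercase whitespace split (an assumption; a
    real indexer would also handle punctuation and stemming).
    """
    tokens = doc_text.lower().split()
    terms = query.lower().split()
    base = float(sum(tokens.count(t) for t in terms))
    # The phrase check needs term positions, which is the expensive part
    # in a real engine when the terms are common.
    if len(terms) > 1 and contains_phrase(tokens, terms):
        base *= phrase_boost
    return base


far = "olly wrote a long message and betts was mentioned later"
near = "olly betts wrote a long message"
print(score(far, "Olly Betts"), score(near, "Olly Betts"))
```

Both documents match the two terms once each, so only the phrase boost
separates them.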

By the way, Google's handling of common-word phrases looks nothing short of
magic to my insufficiently advanced mind, and I'd be quite interested in an
explanation of how they do it.

I've been playing with indexing adjacent common terms as an n-gram, but the
index size grows so fast that I'm losing a lot of the performance
improvements. It would appear that some of the Google PhDs are actually
hired for good reason :)
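
A minimal sketch of the n-gram idea, limited to bigrams, in plain Python.
The COMMON set, the "_" joiner and the index_terms name are hypothetical
choices for illustration, not what my experiments or any Xapian indexer
actually use. It also shows why the index grows: text made mostly of
common words nearly doubles its term count.

```python
# Hypothetical list of common terms; a real one would be derived from
# term frequencies in the index.
COMMON = {"to", "be", "or", "not", "the", "of", "a"}


def index_terms(text, common=COMMON):
    """Return the terms to index: every token, plus one combined term for
    each pair of adjacent tokens that are both common."""
    tokens = text.lower().split()
    terms = list(tokens)
    for left, right in zip(tokens, tokens[1:]):
        if left in common and right in common:
            terms.append(left + "_" + right)
    return terms


print(index_terms("to be or not to be"))
print(index_terms("the quick fox"))
```

A phrase made of common words can then be looked up through the combined
terms instead of by intersecting long position lists, at the price of the
extra postings.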

Another possible approach to an automatic proximity boost would be to prune
the common terms from the generated phrase, but this looks a bit like
admitting defeat, and we're still left with the "to be or not to be" issue,
where the query consists of nothing but common terms.
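
The pruning approach and its failure case can be sketched in a few lines
of plain Python. The COMMON set and the prune_phrase name are hypothetical,
chosen only for illustration.

```python
# Hypothetical list of common terms to drop from the generated phrase.
COMMON = {"to", "be", "or", "not", "the", "of", "a"}


def prune_phrase(query, common=COMMON):
    """Return the terms to use for the automatic proximity clause.

    Common terms are dropped. If fewer than two terms remain there is
    nothing left to build a phrase from, so an empty list is returned.
    """
    kept = [t for t in query.lower().split() if t not in common]
    return kept if len(kept) >= 2 else []


print(prune_phrase("the return of the king"))
print(prune_phrase("to be or not to be"))
```

Since the surviving terms are no longer adjacent in the original text, the
generated clause would presumably have to be a proximity (NEAR-style)
match with some slack rather than a strict phrase.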

If someone has shareable ideas in this area, I'd be quite willing to
experiment.

jf