[Xapian-discuss] Ranking and term proximity

goran kent gorankent at gmail.com
Tue Sep 6 07:35:32 BST 2011

On Sun, Sep 4, 2011 at 8:10 PM, Jean-Francois Dockes <jf at dockes.org> wrote:
>  > For example, suppose I search for 'Olly Betts' - without double
>  > quotes - and there are two matching documents: in the first the
>  > terms 'Olly' and 'Betts' are widely separated, and in the second
>  > they appear right next to each other.  Will the latter document
>  > score higher?  Please tell me it will.
> Hopefully one of the Xapian developers will refute me, but I think that
> Xapian does no such thing, leaving such things to the application
> software.

This is sad indeed - one would think term proximity is rather
fundamental in determining how relevant a document is.
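
To make "leaving such things to the application" concrete, here is a
rough sketch of proximity re-ranking done on top of whatever scores the
search engine returns.  It is plain Python and does not use the Xapian
API at all; the names min_span, term_positions and rerank, the
whitespace tokenizer, and the boost/span bonus are all made up for
illustration, not anything Xapian provides.

```python
from itertools import product


def term_positions(text, terms):
    """Token positions of each query term (naive whitespace tokenizer)."""
    tokens = text.lower().split()
    return [[i for i, tok in enumerate(tokens) if tok == t.lower()]
            for t in terms]


def min_span(positions_by_term):
    """Smallest distance covering one occurrence of every term.

    Returns None if any term does not occur at all.
    """
    if any(not positions for positions in positions_by_term):
        return None
    return min(max(combo) - min(combo)
               for combo in product(*positions_by_term))


def rerank(results, terms, boost=1.0):
    """Re-sort (base_score, text) pairs, adding a bonus for close terms.

    base_score is assumed to come from the search engine; the bonus
    formula boost / span is an arbitrary choice for this sketch.
    """
    rescored = []
    for base_score, text in results:
        span = min_span(term_positions(text, terms))
        bonus = boost / span if span is not None and span > 0 else 0.0
        rescored.append((base_score + bonus, text))
    return sorted(rescored, key=lambda r: r[0], reverse=True)


results = [
    (1.0, "olly wrote a long message and later betts replied"),
    (0.9, "patch by olly betts applied"),
]
ranked = rerank(results, ["Olly", "Betts"])
```

With these two sample documents the second one, where the terms are
adjacent, ends up ranked first despite its lower base score.  Trying
every combination of positions is fine for short texts but would need
something smarter for long documents.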

It reminds me of search on gmane.com - almost utterly useless because
of this issue (and also no ranking based on links - but that is down
to the implementation, not Xapian per se).  You'll get search results
with a bazaar of highlighted terms, but no consideration of term
proximity.  Gmane.com should be a showcase for Xapian.

For example:

The second result has both terms in close proximity (the title *and*
body), yet is not ranked 1st.

I wish I had the money to sponsor development of this and other
important issues - rather than support for more languages like Lua,
or tweaking Omega.  Search performance and ranking should reign
supreme for a project like Xapian.  That reminds me of
http://trac.xapian.org/ticket/326 - chert (without patches, but even
with, it's still bad) is 7x SLOWER than the older flint format.
That's embarrassing.  Yes, one can argue that chert *may* perform
better with larger indexes, but hell, that's still a bad start...  Can
you imagine trying to justify/explain that kind of degradation in a
commercial product?  You'd be laughed right out the conference room.

Anyway, we can but hope.


