[Xapian-discuss] Ranking and term proximity
goran kent
gorankent at gmail.com
Tue Sep 6 07:35:32 BST 2011
On Sun, Sep 4, 2011 at 8:10 PM, Jean-Francois Dockes <jf at dockes.org> wrote:
> > For example, if I search for 'Olly Betts' - without double quotes in
> > two documents the first of which the terms 'Olly' and 'Betts' are
> > widely separated, and the second contains the terms 'Olly Betts' right
> > next to each other, will the latter document score higher? Please
> > tell me it is.
>
> Hopefully one of the Xapian developers will refute me, but I think that
> Xapian does no such thing, leaving such things to the application
> software.
This is rather sad indeed - one would think this is rather fundamental
in determining how important a document is.
It reminds me of search on gmane.org - almost utterly useless because
of this issue (and also no ranking based on links - but that is down to
the implementation, not Xapian per se). You'll get search results with a
bazaar of highlighted terms, but no consideration for term proximity.
Gmane.org should be a showcase for Xapian.
For example:
http://search.gmane.org/?query=search+the+list&author=&group=gmane.discuss&sort=relevance&DEFAULTOP=and&xP=Zsearch%09Zlist&xFILTERS=Gdiscuss---A
The second result has both terms in close proximity (the title *and*
body), yet is not ranked 1st.
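To make concrete what I mean by taking proximity into account, here is a rough sketch of the kind of re-ranking an application could bolt on after the engine returns its results. It is plain Python and uses no Xapian calls; the names smallest_window and rerank, the input layout, and the boost formula are all my own assumptions for illustration, and the word positions would have to come from whatever positional data the index stores.

```python
import heapq


def smallest_window(positions_by_term):
    """Width of the smallest window holding one occurrence of every term.

    positions_by_term: one sorted list of word positions per query term.
    Returns None if any term has no occurrences in the document.
    """
    if not positions_by_term or any(not p for p in positions_by_term):
        return None
    # Heap entries: (position, term index, index into that term's list).
    heap = [(p[0], t, 0) for t, p in enumerate(positions_by_term)]
    heapq.heapify(heap)
    current_max = max(p[0] for p in positions_by_term)
    best = current_max - heap[0][0]
    while True:
        pos, t, i = heapq.heappop(heap)
        best = min(best, current_max - pos)
        if i + 1 == len(positions_by_term[t]):
            # The term with the smallest position has no later occurrence,
            # so no narrower window can be found.
            return best
        nxt = positions_by_term[t][i + 1]
        current_max = max(current_max, nxt)
        heapq.heappush(heap, (nxt, t, i + 1))


def rerank(results):
    """Re-order (doc_id, base_score, positions_by_term) tuples.

    Hypothetical boost: the base score is multiplied by 1 + 1/(1 + span),
    so tightly clustered terms gain the most. The formula is an assumption,
    not anything taken from Xapian.
    """
    def boosted(item):
        _doc_id, score, positions = item
        span = smallest_window(positions)
        if span is None:
            return score
        return score * (1.0 + 1.0 / (1 + span))

    return [item[0] for item in sorted(results, key=boosted, reverse=True)]
```

With two documents that have the same base score, the one where the terms sit next to each other comes out first; a document where they are 50 words apart gets almost no boost.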
I wish I had the money to sponsor development of this and other
important issues - rather than support for more languages like Lua, et
al, or tweaking Omega. Search performance and ranking should reign
supreme for a project like Xapian. Reminds me of
http://trac.xapian.org/ticket/326 - chert (without patches, but even
with, it's still bad) is 7x SLOWER than the older flint format.
That's embarrassing. Yes, one can argue that chert *may* perform
better with larger indexes, but hell, that's still a bad start... Can
you imagine trying to justify/explain that kind of degradation in a
commercial product? You'd be laughed right out of the conference room.
Anyway, we can but hope.
:)