[Xapian-discuss] Xapian's scoring/sorting compared to Google's

Olly Betts olly at survex.com
Tue Dec 16 04:07:20 GMT 2008


On Mon, Dec 15, 2008 at 01:12:05PM +0200, Henry wrote:
> For the sake of argument and general discussion, let's assume you have
> a value similar to Google's PageRank which you use for secondary
> sorting (ie, relevance first, then pagerank).
>
> Is this the best approach to use for sorting (to approach the general
> results of Google in a simplistic fashion)?

My suggestion for using a "page reputation" score such as PageRank would
be to apply an extra weight contribution to each match using
Xapian::PostingSource, though that's not been in a release yet so you'll
have to use SVN trunk at present.
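
To illustrate how an extra weight contribution differs from a strict
secondary sort, here's a minimal sketch in plain Python.  It doesn't use
Xapian's API at all; the documents, the scores, and the `scale` factor
are made up for illustration, and how best to scale the reputation
against the relevance weight is something you'd need to tune.

```python
# Hypothetical (doc, relevance, reputation) triples; the numbers are made up.
MATCHES = [
    ("a", 2.0, 0.1),
    ("b", 1.9, 3.0),
    ("c", 0.5, 2.0),
]


def secondary_sort(matches):
    """Relevance first; reputation only breaks exact relevance ties."""
    ranked = sorted(matches, key=lambda m: (-m[1], -m[2]))
    return [doc for doc, _rel, _rep in ranked]


def combined_weight(matches, scale=1.0):
    """Sort by relevance plus a scaled reputation contribution."""
    ranked = sorted(matches, key=lambda m: -(m[1] + scale * m[2]))
    return [doc for doc, _rel, _rep in ranked]


if __name__ == "__main__":
    print(secondary_sort(MATCHES))
    print(combined_weight(MATCHES))
```

With a secondary sort, "b" can never overtake "a" however good its
reputation is, because relevance scores are almost never exactly equal.
With a combined weight, a strong reputation can make up for slightly
lower relevance.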

> My gut impression from Google results is that this is /roughly/ what
> they're doing, or am I wrong?  Is Google sorting by PageRank first,
> *then* result relevance?

Actually, I personally doubt PageRank as such features much if at all in
Google's document ranking these days - people have worked out how to
game it too well, and it seems unlikely that more than ten years of
development work by Google's thousands of employees hasn't found
something better.  Microsoft Research certainly claim to have done so:

http://portal.acm.org/citation.cfm?id=1135881

Google undoubtedly do still perform analysis of the network of links
between pages (there is certainly useful information in there), but I
suspect it bears at most a passing resemblance to PageRank.

I heard a talk by one of the Google "search quality" team last year -
of course he didn't go into much detail, but interestingly PageRank
was only mentioned when talking about the history of Google...

Anyway, the trick to using a query-independent weight for web-scale
search is that you order the documents in your database by decreasing
query-independent weight.

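If you want your results ordered *only* by the query-independent weight,
then you can simply stop when you've found enough matches, since every
later document has a weight no higher than those already seen.  If you
also want to include a relevance weighting such as BM25, the
ever-decreasing possible contribution from the query-independent weight
will still let you stop much sooner: once the best score any remaining
document could achieve can't beat the results you already have, there's
no point looking further.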

You can implement this technique using Xapian::PostingSource fairly
easily.
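
Independently of Xapian's API, the early-termination idea can be
sketched in plain Python as below.  This is only an illustration under
some assumptions: the function name `top_k`, the `relevance` callback,
and the `max_relevance` upper bound are all hypothetical, and the sketch
assumes you know an upper bound on the relevance weight any single
document can get for the query.

```python
import heapq


def top_k(docs, relevance, max_relevance, k):
    """Return the k best (doc_id, score) pairs and how many docs were examined.

    docs: (doc_id, static_weight) pairs, sorted by decreasing static_weight.
    relevance: callable giving a doc's relevance weight, or None if the
        doc doesn't match the query (hypothetical interface).
    max_relevance: assumed upper bound on any value relevance() returns.
    """
    heap = []  # min-heap of (score, doc_id); heap[0] is the weakest kept result
    examined = 0
    for doc_id, static in docs:
        # No later doc has a larger static weight, so none can score above
        # max_relevance + static.  If that can't beat the weakest result we
        # already hold, we can stop.
        if len(heap) == k and max_relevance + static <= heap[0][0]:
            break
        examined += 1
        rel = relevance(doc_id)
        if rel is None:
            continue
        score = rel + static
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    best = sorted(heap, reverse=True)
    return [(doc_id, score) for score, doc_id in best], examined


if __name__ == "__main__":
    # Made-up data: six docs in decreasing static-weight order.
    docs = [(1, 2.0), (2, 1.5), (3, 1.0), (4, 0.5), (5, 0.25), (6, 0.125)]
    rel = {1: 0.5, 3: 2.5, 4: 2.75, 6: 3.0}
    print(top_k(docs, rel.get, max_relevance=3.0, k=2))
```

The tighter the bound on the relevance weight, the sooner the loop can
stop.  Setting `max_relevance` to 0 (and having `relevance` return 0 for
matches) gives the "static weight only" case, where the loop stops as
soon as it has found k matches.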

Cheers,
    Olly


