[Xapian-devel] GSoc Project Idea Weighting Schemes (Ranking)

Olly Betts olly at survex.com
Wed Nov 26 02:22:27 GMT 2014


On Mon, Nov 24, 2014 at 10:39:52AM +0530, Abhishek Singh Kushwah wrote:
> Well certainly BM25 offers stability and comparatively speed too which is
> why it is more preferred than others.

We don't actually know if BM25 is faster than the other schemes that
were implemented in Xapian more recently.  We know that the upper
bounds we have for some aren't very tight, but others have bounds
comparable to those we have for BM25, and the connection between
the tightness of the bound and the speed of the match is fairly
complex.
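
To make the role of the bounds concrete, here is a toy sketch in Python.
It is not Xapian's matcher: the function name top_k, the data layout and
the numbers are all made up for illustration. It assumes each document's
score is a sum of per-term weights, and that we have an upper bound on
the weight each term can contribute. A document is abandoned as soon as
its partial score plus the remaining bounds can't beat the current k-th
best score.

```python
import heapq


def top_k(docs, bounds, k):
    """Return (top-k scores in descending order, per-term evaluations done).

    docs:   list of per-term weight lists, one inner list per document
    bounds: per-term upper bounds, with bounds[t] >= docs[d][t] for all d
    (hypothetical layout, for illustration only)
    """
    heap = []        # min-heap holding the best k scores seen so far
    evaluations = 0  # how many per-term weights we actually computed
    for weights in docs:
        # Until we have k results, nothing can be pruned.
        threshold = heap[0] if len(heap) == k else float("-inf")
        remaining = sum(bounds)  # most this document could still gain
        score = 0
        pruned = False
        for t, w in enumerate(weights):
            # Even the best case can't beat the k-th best score: give up.
            if score + remaining <= threshold:
                pruned = True
                break
            score += w
            remaining -= bounds[t]
            evaluations += 1
        if pruned:
            continue
        if len(heap) < k:
            heapq.heappush(heap, score)
        elif score > heap[0]:
            heapq.heapreplace(heap, score)
    return sorted(heap, reverse=True), evaluations


# Integer weights keep the arithmetic exact in this toy example.
docs = [[5, 4, 3], [1, 1, 1], [2, 0, 1], [0, 1, 0], [4, 4, 4], [1, 0, 0]]
tight = [5, 4, 4]     # equal to the true per-term maxima
loose = [50, 40, 40]  # valid bounds, but far from tight

tight_result, tight_evals = top_k(docs, tight, k=2)
loose_result, loose_evals = top_k(docs, loose, k=2)
```

Both runs return the same top scores, and on this data the tight bounds
need fewer weight evaluations than the loose ones. How much is saved
depends on the order documents arrive in and on how the scores are
distributed, which is part of why the relationship isn't simple.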

BM25 is currently the default simply because it always has been.  It
would be good to compare the schemes for speed so we can make an
informed decision about what the default should be.

> What I have tried to understand from your point is that no new schemes need
> to be implemented for now, at least in this GSoC.

No, that's not what I'm trying to say at all.  If there are interesting
schemes which can be implemented, I'm all for that.

> So probably the default scheme needs to be improved and the previous
> implemented schemes in restricted domain needs to be brought forward.
> 
> Probably you are thinking for improvements in Unigram Language Modelling
> and Bi-gram Language Modelling implemented in GSoC 2012. If that's the case
> then your explanation towards more appropriate goals would be appreciated.

I'm not thinking of any particular improvements.  You said "There seems
a considerable hope in editing the algorithms to increase efficiency
and speed and implementing new ones in use", so I asked what you had
in mind.

If you're hoping to work on this for GSoC, then your proposal will need
to have concrete details of what you plan to do, so I'm trying to nudge
you in that direction.

> One such another feature you mentioned to add support for getting the
> number of unique terms is a great idea and can be implemented possibly for
> the purpose of getting more statistics in this GSoC.

That's already implemented - I just gave it as an example of adding
extra statistics to allow a weighting scheme to be supported.

Cheers,
    Olly


