[Xapian-discuss] Future of Xapian (long)

Olly Betts olly@survex.com
Tue, 22 Jun 2004 17:11:12 +0100


On Tue, Jun 22, 2004 at 12:32:05AM +0100, James Aylett wrote:
> On Sun, Jun 20, 2004 at 04:32:13PM +0100, Francis Irving wrote:
> 
> > I would add this one here, which is very important:
> > 
> > j. Complete bindings.  Fix all the various naming problems and error
> > reporting problems in all the bindings (Perl and PHP in particular),
> 
> We're not going to be able to address this properly until we get stuck
> into 0.9.x and look at return codes not exceptions.

We can't address all the error reporting issues yet, but there are some
methods which simply aren't wrapped for no very good reason - e.g. the
Perl bindings essentially currently implement the union of what Alex,
Francis, and I have needed, plus the odd method Alex or I noticed along
the way.  That could be fixed by simply sitting down with the C++
headers and the bindings code and making sure the two match up.  At
least for Perl - I'm less familiar with how the SWIG bindings work.

> > and make them feature complete so you can do anything in Xapian with
> > them.  People are quite likely to use the bindings, rather than C++,
> > for a search engine, and the incompleteness of the bindings would have
> > put me off if I'd had a choice.  
> 
> PHP isn't going to get directors unless SWIG can manage it - I
> wouldn't want to try to write the wrappers natively, because PHP's
> internals are horrible.

It would be good to note what isn't wrapped and why.  It's unhelpful to
users to only discover this after embarking on a project.  Worse if it
is something we can't easily wrap - a missing method which we can fix
within an hour or so of being made aware of it is less of a problem.

> > k.  Make it easier to do date sorting, and meta-variable weighting.
> > Things like adding a date weight which decays as it goes back in time.
> > And some of the other experimental stuff which is in the code.  By "make
> > it easier", I mean finish off what there is to do this, document it
> > well in a manual (you probably need a proper manual), and make sure
> > it works from all the language bindings.
> 
> This can be done as a WeightDecider (I think), it just needs
> documenting and tidying up.

More "finish implementing" than "tidying up".  Currently the function is
hard coded.  It was a prototype of the idea I did for Ananova, but they
decided they wanted sort-bands instead.

Incidentally, I feel sort-bands is really a useless misfeature.  It's
an implementation of a feature from a Muscat product (FX or empower or
something), which Ananova were replacing with Xapian - it wasn't a good
feature there, and time hasn't improved it.  I think perhaps we should
consider removing it, especially if we can provide something better to
achieve similar ends.

The motivation behind sort-bands is to allow a mixture of sorting by
something (date say) and ordering by relevance.

So what it does is to take the results and convert the relevance to a
percentage.  If you follow the derivation of the probabilistic model
from Bayes' theorem, you'll realise that these percentages aren't
very meaningful.  A document with a higher percentage is expected to
be better than one with a lower one, and 100% means a document matched
all the terms, but beyond that the percentage is arbitrary - a 50%
document isn't really half as good as a 100% one in any way at all.
The only justification for producing a percentage is that it's a
good way to convey weight information to users.

But sort-bands split mset entries into a number (say 5) bands using
these percentages.  Within each band, documents are sorted by date
(or whatever).

So I get the most recent document which scored 80-100%, down to the least
recent which scored in that range.  Then the most recent which scored
60-80%, down to the oldest in that range.  And so on.  So recent documents
are scattered throughout the hitlist, and high scoring documents are fairly
randomly moved around.
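
As a rough illustration of the behaviour described above, here is a small
Python sketch.  It is not Xapian code: the function name sort_bands, the
(percent, date, docid) tuples and the exact band arithmetic are all
assumptions made up for the example.

```python
from datetime import date

def sort_bands(hits, bands=5):
    """Order hits by percentage band, then by date within each band.

    hits is a list of (percent, date, docid) tuples.  Illustrative only;
    this is an assumed reading of the scheme, not Xapian's implementation.
    """
    width = 100 / bands

    def band(percent):
        # 100% is folded into the top band rather than getting its own.
        return min(int(percent // width), bands - 1)

    # Highest band first; within a band, most recent document first.
    return sorted(hits, key=lambda h: (-band(h[0]), -h[1].toordinal()))

hits = [
    (95, date(2004, 1, 5), "A"),
    (81, date(2004, 6, 20), "B"),
    (79, date(2004, 6, 21), "C"),
    (62, date(2004, 6, 22), "D"),
]
order = [docid for _, _, docid in sort_bands(hits)]
print(order)
```

With these made-up hits, the 81% document is listed ahead of the 95% one,
the 62% document ahead of the 79% one, and the most recent document of all
only comes third.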

It's so ass-backwards.  I could see an argument for sorting into day
(or week/month/year or perhaps even hour depending on application)
*then* by percentage score within each day.  That would make a lot of
sense for a news site.  And a scheme for giving additional weight to
recent documents - often a recent document is inherently more likely to
be relevant.  And even more often, given two otherwise equally relevant
documents, you'd prefer the more recent one.
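
The two alternatives above could look something like the following Python
sketch.  Again this is only an illustration under stated assumptions: the
function names, the tuple layout, and the half-life form of the recency
boost (with its half_life_days and boost parameters) are invented for the
example and are not anything Xapian provides.

```python
from datetime import datetime

def sort_by_day_then_score(hits):
    """Newest day first, then highest percentage within each day.

    hits is a list of (percent, timestamp, docid) tuples (assumed layout).
    """
    return sorted(hits, key=lambda h: (h[1].date(), h[0]), reverse=True)

def recency_boosted(weight, age_days, half_life_days=30.0, boost=0.5):
    """Add extra weight to recent documents, decaying with age.

    Assumed formula: a brand-new document gets its weight multiplied by
    (1 + boost), and the extra part halves every half_life_days.
    """
    return weight * (1.0 + boost * 0.5 ** (age_days / half_life_days))

hits = [
    (60, datetime(2004, 6, 22, 9, 0), "A"),
    (90, datetime(2004, 6, 22, 17, 0), "B"),
    (99, datetime(2004, 6, 21, 12, 0), "C"),
]
day_order = [docid for _, _, docid in sort_by_day_then_score(hits)]
print(day_order)
print(recency_boosted(10.0, 0), recency_boosted(10.0, 30))
```

In the first function the date decides the coarse order and relevance only
breaks ties within a day.  In the second, relevance stays in charge and
age just nudges it, so two otherwise equally relevant documents end up
with the more recent one first.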

Is anyone actually currently using sort-bands?  Does it actually do what
you want, or would you prefer some other scheme?

Cheers,
    Olly