[Xapian-discuss] Making SORTAFTER useful in omega?

Tue Sep 30 05:19:03 BST 2008

On Sat, Sep 27, 2008 at 02:44:32PM +0200, Arjen van der Meijden wrote:
> On 27-9-2008 14:22 Jim wrote:
> > If I understand what you have done, results with relevancy in the
> > 90-100 bucket  could be grouped and then sorted on something like
> > date.  Same for 80-89, etc.  Did I understand that correctly?
>
> That was what I wanted to do first. But the percentage is based on the 
> maximum weight of all documents in the result set, and at the time of 
> this sorter's work, the maximum isn't known yet. And to get that 
> working, the patch would've  been quite a bit more complex.

What Jim describes is how the old "sort bands" feature worked.

The history of this feature is that a Muscat product (I forget the
names) which Ananova wanted to replace with Xapian had the feature, so
we implemented it for Xapian when I was working there.

Internally, it is indeed complex to implement efficiently because as you
say, you don't know the maximum weight until the match is completed.  So
you have to calculate the "worst" case each time the current maximum
weight increases to find the minimum weight which could end up in the
MSet, and then keep a load more items in the "proto-MSet" just in case
they are actually needed (because the tail end of the MSet will get
reordered, but you don't know where the band boundaries are).

The matcher is already a complex piece of code - adding complexity
affects its maintainability as features interact and it's hard to
provide good test coverage.  So features which complicate it need to
justify their presence, both in absolute terms, and also compared with
other features we want to add.

Externally, there's certainly a demand for a way of combining ordering
from relevance and from a sort, but the problem with sort bands as the
solution to this is that it's so very arbitrary.  If we sort into 10
bands, then a newer document which score 90% is ranked ahead of an older
document scoring 99%.  But a newer document scoring 89% won't beat an
older document scoring 90% even though in the first case the ranking
score difference is 9 times larger.

In fact, the percentage scores are rather arbitrary to start with.  If
you follow through the derivation of the Probabilistic weighting
formula, it only ensures that the mapping from probability to weight
is monotonically increasing - so a higher probability means a higher
weight - not that there's any proportionality.  So a 90% match is likely
to be better than a 45% match, but probably not twice as good in any
meaningful way.  The reason for providing percentages is that people
are better able to grasp the weight scaled to some standard scale and
colloquially people use percentages in this sort of way - for example,
if I say "I'm 90% sure I did <something>", it's unlikely I've calculated
probabilities, but it's better than me being "50% sure".

So splitting the match set into bands by percentage is rather arbitrary
to start with.

(Interestingly, explicitly reporting a score for each match seems to
have fallen out of favour, perhaps due to the popularity of Google
which doesn't provide them.)

Anecdotally, I was asked about sort bands by a couple of people before
we removed it, and both times they found it a bit of an odd feature
when I explained how it worked.

Unfortunately, we don't really have a replacement for combining
relevance and sort key rankings in 1.0.x.  The nearest is probably to
set up the weighting scheme parameters to produce less variation it
weights (for BM25Weight: k2 = 0 and b = 0; for TradWeight: k = 0).

In trunk you can use PostingSource to apply an extra weight to each
document depending on the sort key.  This is added to the relevance
weight, which avoids the odd effects with sort bands across vs within
the bands.  The matcher has a bound for the extra weight, so this is
also pretty efficient.  For some more information, see:

http://trac.xapian.org/browser/trunk/xapian-core/docs/postingsource.rst

> So what I did is just round off the weight, so the change of collapses 
> increases. Rather than getting 14,3576 and 14,3623 you can get 14,36 (or 
> 14,4 or 14). So the secondary sorting will decide which of those two 
> documents gets shown first.
> It isn't perfect, sometimes the maximum weight is somewhere below 1, 
> sometimes its far over 20 orso, so the change of collapses still depends 
> on the weight-distribution, but for documents that are close in weight, 
> its much more common to collapse with my patch, than without it. You can 
> see in your result set whether there is a high probability of this patch 
> being helpful or not; if the top 100 results are all within 90-100 its 
> going to give you a few collapses, if the top is much more distributed, 
> it won't do much.
> 
> I'd hoped someone on the list would have some ideas on how to make it a 
> bit more predictable.

I guess you could choose the scaling based on the maximum possible
weight (which is known before the matcher starts) but this approach
seems to suffer from the same sort of arbitrariness that sort bands did.

Cheers,
    Olly