[Xapian-discuss] Making SORTAFTER useful in omega?
Olly Betts
olly at survex.com
Fri Oct 3 14:15:15 BST 2008
On Thu, Oct 02, 2008 at 12:24:24PM +0200, Arjen van der Meijden wrote:
> On 30-9-2008 6:19 Olly Betts wrote:
> >
> > So splitting the match set into bands by percentage is rather arbitrary
> > to start with.
>
> Agreed, but then again the notion of 'equally relevant matches' is also
> very hard to describe.
I don't think that was actually the thinking behind the sort_bands
feature - it always seemed to be used with a small number of bands.
But I guess if you make the bands narrower, it's approximating that
idea.
But I'm not sure that this is a useful way to think about it. As I see
it, the issue isn't whether these documents are equally relevant or not,
but that there are factors other than the words in the documents and the
user's query which determine how relevant a document is to that user
in response to that query.
Some examples of such factors are document age, information from link
analysis, and geographical distance.
> > Unfortunately, we don't really have a replacement for combining
> > relevance and sort key rankings in 1.0.x. The nearest is probably to
> > set up the weighting scheme parameters to produce less variation in
> > weights (for BM25Weight: k2 = 0 and b = 0; for TradWeight: k = 0).
>
> I tried that, but with 100 results, as far as I could tell only a few
> documents were sorted differently when switching from descending to
> ascending ordering using TradWeight with k = 0.
>
> So that's why I came up with the idea to round those scores, to increase
> the chance of a document being sorted before a similarly scoring but
> older document.
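(For reference, setting that up looks something like this - a sketch,
assuming db and query are the database and parsed query you already
have:

    Xapian::Enquire enquire(db);
    enquire.set_query(query);

    // Flatten the weight variation with BM25 by setting k2 = 0 and
    // b = 0 (the other parameters keep their default values):
    enquire.set_weighting_scheme(Xapian::BM25Weight(1, 0, 1, 0, 0.5));

    // ... or alternatively with the traditional scheme, k = 0:
    enquire.set_weighting_scheme(Xapian::TradWeight(0));

You'd want only one of the two set_weighting_scheme() calls, of
course.)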
The patch is problematic though. The choice not to pass the weight to
the Sorter object was deliberate, because we didn't see a use for the
weight there (you can sort by relevance before or after the sorter
already) and passing it means we always need to calculate it. If
we change that we should quantify the overhead of doing so, and if it's
measurable, consider a way for the Sorter to say that it wants the
weight.
Also, the min_weight check isn't appropriate in the sorter case, since
it's wrong unless the sorter uses the weight unmodified as a primary key
(and why would it? That can be achieved more simply and efficiently
using the existing API). That's easy to fix though.
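For example (a sketch - sorter being whatever Xapian::Sorter subclass
you're using, and the bool the usual ascending flag):

    // Relevance as the primary key, with the sorter breaking ties:
    enquire.set_sort_by_relevance_then_key(&sorter, true);

    // ... or the sorter's key first, with relevance breaking ties:
    enquire.set_sort_by_key_then_relevance(&sorter, true);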
But more fundamentally, using a sorter to essentially perform a "sort by
relevance" is also rather inefficient, as the matcher can't make use of
its weight-based optimisations.
I think if you want to take this approach it would be better to round
the weights coming out of the Weight object. If you round the weight
(and max_weight) for each term to the nearest 0.1 (say) then you'll get
very similar answers to rounding the summed weight for each document to
the nearest 0.1, but the matcher can still work its magic.
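Something like this (round_to() here is just a hypothetical helper, not
part of Xapian - you'd apply it to each term's weight contribution and
to the corresponding maximum):

    #include <cmath>

    // Round w to the nearest multiple of step (e.g. step = 0.1).
    static double round_to(double w, double step) {
        return std::floor(w / step + 0.5) * step;
    }

In other words, where the weighting scheme currently returns wt from
get_sumpart(), it would return round_to(wt, 0.1), and the value from
get_maxpart() would be rounded the same way - rounding is monotonic, so
the matcher's bounds stay valid.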
> > In trunk you can use PostingSource to apply an extra weight to each
> > document depending on the sort key. This is added to the relevance
> > weight, which avoids the odd effects with sort bands across vs within
> > the bands. The matcher has a bound for the extra weight, so this is
> > also pretty efficient. For some more information, see:
> >
> > http://trac.xapian.org/browser/trunk/xapian-core/docs/postingsource.rst
>
> I'm not really sure if I understand how this works. But increasing the
> weight of a newer result just because it's newer may not be what a user
> expects either. So you should at least use it with relatively small
> increments, so the document is only bumped slightly above more or less
> similar results.
Certainly the range of weights should be chosen according to how
important "recentness" is likely to be.
> > I guess you could choose the scaling based on the maximum possible
> > weight (which is known before the matcher starts) but this approach
> > seems to suffer from the same sort of arbitrariness that sort bands did.
>
> I think your postingsource does too? Although both approaches are not
> as bad as the sort bands.
It is true that the magnitude of the extra weight needs to be chosen,
much as the "precision" to round off to need to be chosen.
But I don't think it suffers in the way I was thinking of. If you round
the weight to the nearest 0.1, then 1.04 and 0.95 are equivalent and so
ordered by their dates, but 1.05 is better than either even if it is
older than both, despite being much closer to 1.04 than 0.95 is. That's
much like the issue I highlighted with sort bands, except that the bands
are (probably) much narrower here.
But if you add a date-based weight to the unrounded weights, there's no
arbitrary partitioning which depends only on how close weights happen to
be to the interval boundaries.
There's also research behind the idea, for example:
http://research.yahoo.com/node/1745
I think this paper (or maybe it's another) covers a way to choose the
size of the weights, but it does require using a training set of queries
and relevance judgements, so it's not great for casual use.
> Do you have plans to tackle the underlying issue anytime soon?
Hmm, well PostingSource was the plan to tackle the underlying issue...
Perhaps we have different ideas of what the underlying issue is - it seems
to me to be how to combine query-based weights and sort-by-date. What
are you seeing it as?
Cheers,
Olly