[Xapian-discuss] Similarity Measures

Olly Betts olly at survex.com
Tue Jun 27 18:06:32 BST 2006


On Tue, Jun 27, 2006 at 09:21:47AM -0700, Gavin Mendel-Gleason wrote:
> I'm currently trying to implement a similarity measure
> for xapian.  Ideally I'd like to be able to calculate
> the following: 
> 
> for document i, and document j

I don't quite see how you would fit this into Xapian's ranking
framework, unless you fix document i and compute the similarity
for all the other documents as the ranking.  Effectively document
i would be the query, and you'd compute the similarity for each
document j for fixed i.

> s_ij =  a_ij / ( L_i + L_j + a_ij) 
> 
> Where L_i is the number of terms in the document i
> and a_ij is:
> 
> a_ij = Sum[ t_ik t_jk ] 
> 
> Where t_ij is 1 if term "j" occurs in document i.
> 
> From looking at the source code for weights it appears
> that the sum should be cut up into peices that can be
> calculated incrementally.

Yes, each query term which occurs in a document can contribute some
weight, and there's an optional extra weight only related to the
document (such as a document length adjustment factor).  These are all 
summed to get the total weight for a particular document.

This was originally chosen as it allowed us to implemented the various
probabilistic weighting formulae while allowing various optimisations
to reduce the work required.  It turns out to be flexible enough for 
a wide selection of potential weighting formulae, but there are
obviously plenty of conceivable formulae you can't rearrange to fit this
pattern.

> Is it possible to calculate this value within the current weight
> framework.

I don't see a way to do it.  You could calculate a_ij because that's a
sum over the terms in each document, but the a_ij on the bottom is hard
to handle.  The only transformation I can see is:

s_ij = 1 - ( L_i + L_j ) / ( L_i + L_j + a_ij )

But I don't see that leading anywhere.

You can safely transform your ranking formula through any strictly
increasing monotonic function (such as log or e^x) without altering the
rank order, but I don't see how to use that here.

Cheers,
    Olly



More information about the Xapian-discuss mailing list