[Xapian-discuss] Similarity model for matching

Thu May 11 20:19:38 BST 2006

On Thu, May 11, 2006 at 08:23:21PM +0200, Emmanuel Eckard wrote:
> Is there any documentation about what similarity model (models ?) can be used 
> with Xapian ? (Things like SMART, LSI, PLSI, etc.)

This is implemented by subclasses of Xapian::Weight.

Currently we include BoolWeight (all matching documents have the same
weight), TradWeight (the "traditional" probabilistic weighting model),
and BM25Weight (the BM25 weighting formula as used in Okapi).  You can
also effectively use BM11 and BM15 by setting suitable parameters on
BM25Weight.
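
As a rough illustration of why one formula covers all three, here's a
standalone sketch of the textbook BM25 term-frequency component.  This
is NOT Xapian's actual code; the function name and its parameters are
made up for the example.  The parameter b controls document length
normalisation: b = 1 corresponds to BM11 and b = 0 to BM15.

```cpp
#include <cassert>

// Simplified textbook BM25 term-frequency component (an illustration only,
// not Xapian's implementation):
//
//   wdf * (k1 + 1) / (wdf + k1 * ((1 - b) + b * len / avg_len))
//
// b = 1 gives BM11 (full document length normalisation).
// b = 0 gives BM15 (no document length normalisation).
double bm25_tf_part(double wdf, double len, double avg_len,
                    double k1, double b) {
    assert(avg_len > 0);
    double norm = (1.0 - b) + b * (len / avg_len);
    return wdf * (k1 + 1.0) / (wdf + k1 * norm);
}
```

With b = 0 the result doesn't depend on len at all, while with b = 1 a
longer document scores lower for the same wdf.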

It's possible to support many other weighting schemes - you can
implement them without modifying the library by just subclassing
(though we'd be happy to accept any useful ones for inclusion in
the library).

I'm not really familiar with the weighting schemes you mention (if you
mean the Gerard Salton SMART, then that's the vector space model I
believe so that should be implementable.)

The requirements are that you can express the score a document gets as a
summation of weight(term, document) plus an optional weight(document),
and that for a given term and collection you can give upper bounds on
these weights (the requirement for upper bounds could potentially be
lifted if it's too restrictive).  Good upper bounds help the matcher
optimise, which is important for large real-world deployments, but if
your interest is academic evaluations they probably matter less.
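
To make the shape of that requirement concrete, here's a minimal
standalone sketch.  The struct and function names are hypothetical and
this isn't how the matcher is actually written; it just shows a score
built as a sum of per-term parts plus a per-document extra, and how
upper bounds let a candidate be discarded without scoring it.

```cpp
#include <vector>

// Hypothetical stand-in for one query term's contribution.
struct TermPart {
    double weight;      // weight(term, document) for the current document
    double max_weight;  // upper bound on weight(term, any document)
};

// score = sum of weight(term, document) over matching terms
//         + optional weight(document)
double score(const std::vector<TermPart>& parts, double doc_extra) {
    double total = doc_extra;
    for (const TermPart& p : parts) total += p.weight;
    return total;
}

// The best score any document could possibly achieve for these terms.
double max_possible(const std::vector<TermPart>& parts, double max_extra) {
    double total = max_extra;
    for (const TermPart& p : parts) total += p.max_weight;
    return total;
}

// If even the best possible score can't beat the current threshold,
// there's no need to compute the real score.
bool can_skip(const std::vector<TermPart>& parts, double max_extra,
              double threshold) {
    return max_possible(parts, max_extra) <= threshold;
}
```

The tighter the max_weight values, the more often can_skip() returns
true, which is where the benefit for large collections comes from.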

This can of course also handle a weighting scheme where factors from
each term are multiplied together (just take the log of the formula!)
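
A small standalone sketch of that trick (the function names are made
up, and it assumes every factor is strictly positive so the log is
defined): summing the logs gives the same ranking as multiplying the
factors, because log is monotonically increasing.

```cpp
#include <cmath>
#include <vector>

// Multiplicative scheme: the score is the product of per-term factors.
double product_score(const std::vector<double>& factors) {
    double p = 1.0;
    for (double f : factors) p *= f;
    return p;
}

// Equivalent additive scheme: the score is the sum of log(factor).
// Assumes every factor is > 0.
double log_sum_score(const std::vector<double>& factors) {
    double s = 0.0;
    for (double f : factors) s += std::log(f);
    return s;
}
```

If document A beats document B under product_score(), it also beats it
under log_sum_score(), so the ranking is unchanged.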

We can't currently handle cases where there's a weight contribution
from terms which DON'T index a particular document.  I've not looked at
whether it would be feasible to add support for that.

I'm planning to add support for some of the Divergence from Randomness
weighting schemes, but this requires us to track a few more statistics
if we want good upper bounds so I need to do that first.

Here's a sample implementation of "Coordinate matching" (score 1 for
each matching term):

class CoordWeight : public Xapian::Weight {
    public:
        CoordWeight * clone() const {
            return new CoordWeight;
        }
        CoordWeight() { }
        ~CoordWeight() { }
        std::string name() const { return "Coord"; }
        std::string serialise() const { return ""; }
        CoordWeight * unserialise(const std::string & /*s*/) const {
            return new CoordWeight;
        }
        Xapian::weight get_sumpart(Xapian::termcount /*wdf*/, Xapian::doclength /*len*/) const { return 1; }
        Xapian::weight get_maxpart() const { return 1; }

        Xapian::weight get_sumextra(Xapian::doclength /*len*/) const { return 0; }
        Xapian::weight get_maxextra() const { return 0; }

        bool get_sumpart_needs_doclength() const { return false; }
};

> Also, are there any modules to natively read TERC collections and output 
> result files as edible for TrecEval ? Everytime I ran on an occurrence of 
> "TREC" was about a discussion about results found by TREC, but never of 
> people using Xapian for TREC-related research...

I'm not aware of anything publicly available.  It would be useful to
have though to save people reinventing the wheel.

The indexing side should be very easy - it really just needs a script to
convert the input into something which can be piped into scriptindex.
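
Something along these lines might do as a starting point.  It's only a
sketch under several assumptions: that the input is TREC-style SGML
with <DOC>, <DOCNO> and <TEXT> elements, that scriptindex reads
"field=value" records separated by blank lines with continuation lines
starting with "=", and that the field names "docno" and "text" (which
I've made up) match whatever index script you write.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Return the text between `open` and `close` in s, or "" if either is missing.
std::string between(const std::string& s, const std::string& open,
                    const std::string& close) {
    std::string::size_type b = s.find(open);
    if (b == std::string::npos) return "";
    b += open.size();
    std::string::size_type e = s.find(close, b);
    if (e == std::string::npos) return "";
    return s.substr(b, e - b);
}

// Strip leading and trailing whitespace.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Write one field; each continuation line of a multi-line value is
// prefixed with '=' (assumed scriptindex input convention).
void write_field(std::ostream& out, const std::string& name,
                 const std::string& value) {
    out << name << '=';
    for (char c : value) {
        out << c;
        if (c == '\n') out << '=';
    }
    out << '\n';
}

// Convert every <DOC>...</DOC> block into one blank-line-terminated record.
void trec_to_scriptindex(std::istream& in, std::ostream& out) {
    std::ostringstream buf;
    buf << in.rdbuf();
    const std::string data = buf.str();
    const std::string close_tag = "</DOC>";
    std::string::size_type pos = 0;
    while ((pos = data.find("<DOC>", pos)) != std::string::npos) {
        std::string::size_type end = data.find(close_tag, pos);
        if (end == std::string::npos) break;  // unterminated document
        const std::string doc = data.substr(pos, end - pos);
        write_field(out, "docno", trim(between(doc, "<DOCNO>", "</DOCNO>")));
        write_field(out, "text", trim(between(doc, "<TEXT>", "</TEXT>")));
        out << '\n';  // blank line ends the record
        pos = end + close_tag.size();
    }
}
```

Calling trec_to_scriptindex(std::cin, std::cout) from main() would let
the output be piped straight into scriptindex.  It reads the whole
input into memory, so a real collection would want processing in
chunks, and it ignores any other elements inside each <DOC>.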

Cheers,
    Olly