[Xapian-discuss] Xapian and research in IR: a few suggestions from experience

Tue Sep 4 20:55:11 BST 2007

On Mon, Sep 03, 2007 at 11:27:27AM +0200, Emmanuel Eckard wrote:
> In its present state, Xapian offers Databases and Documents, from where 
> TermIterators and PostingIterators allow accessing individual terms and 
> documents, as well as information such as Term Frequency (called "Word 
> Document Frequency", wdf)

Close, it's "Within Document Frequency"!

> and idf (Invert Document Frequency) terms.

To be precise, we store the number of documents which each term occurs
in, which can be used to calculate the idf.

> 1) Set generic "slots" on all objects susceptible of being associated with 
> latent data in a latent retrieval model; this would include the major 
> components of Xapian: Databases, Documents and Termiterators 

Do you have some pointers to the models you have in mind so I can get
an idea what sort of data we might be talking about?  (I can see this
could be useful for storing a "reputation" score for each document,
derived from link analysis, user clicks, etc.)

I've actually been looking at implementing support for Divergence from
Randomness models:

http://ir.dcs.gla.ac.uk/wiki/DivergenceFromRandomness

Studies show that these can give retrieval results as good or better
than BM25, but some of them are parameter free, whereas BM25 has 4
parameters, which ideally need tuning for best performance - the best
values vary depending on the documents in the collection.  In an
academic study people are willing to do that, but in the real world it's
rarely done.  So being parameter free is a great feature for us.

To implement these efficiently, we need to know some more per-database
statistics (for example, an upper bound on the document length), so I'm
looking at storing these.  Note that at least in this case they can't
just be opaque data since the database backend needs to update their
values, and we need to know how to derive the combined statistic when
searching several databases.

They could be generic data with suitable callbacks (as you suggest), but
in this case I think it makes more sense to just code them in directly,
since the space required for per-database statistics is a small fixed
overhead, and some of the statistics could also be used to help optimise
BM25 weighted searches.  Also callbacks aren't available with some of
the bindings (e.g. PHP, Tcl).

I don't think the DfR schemes will need extra per-document or
per-posting information.  An extra issue for that is that it's hard to
know how best to compactly store generic information, which is
especially important for per-posting statistics.  Perhaps that can be
addressed by just leaving packing and unpacking up to the user.

> 2) Offer generic read/write methods. Coherence of the data should be 
> maintained, for instance by offering virtual 
> mehods "documentAdded()", "documentRemoved()", ... to allow the user to apply 
> necessary operations on his data is needed.

I don't think we can usefully support subclassing of the PIMPL classes
(i.e. most of the Xapian API classes), but that's a detail really -
there are other ways to allow this such as passing in an object upon
which these hooks will be called.

> For a less important and fundamental suggestion, I'd like to mention that in 
> research, it is often important to have unique and determined identifiers 
> (strings) for documents. I have seen this done by using prefixed terms (which 
> is not very clean) or by using the "data" field of documents (which lacks an 
> iterator: one cannot jump to one particular document easily this way). It 
> might be interesting to do something on this level (maybe simply by wrapping 
> the "prefixed term" way into something cleaner).

What do you have in mind?  You can already add/replace or delete a
document by term.  An overloaded version of get_document() which could
retrieve the first document matching a particular term would be fairly
easy to add and might save some internal work over creating a
PostingIterator.

Actually using arbitrary strings as docids would be more tricky.
Non-sparse numeric ids are easy to compress, and useful externally too
(e.g. omindex uses a vector<bool> to track which docids it has updated
during a run).  Although non-numeric ids would avoid the mapping of
docids we current have to do when searching multiple databases.

Cheers,
    Olly