[Xapian-devel] Explanation of how Eset works

Thu Jan 10 10:12:29 GMT 2013

On Thu, Jan 10, 2013 at 12:33:55PM +0530, aarsh shah wrote:
> Hey guys hi.I am trying to understand how Xapian works .I read the
> Theoretical Background to Xapian doc
> and the report by Salton and Jones.I still cant seem to understand how Eset
> works

When you do a search with the results sorted by relevance (the default),
you give Xapian some terms (the query) and it gives you the document
ids of the best documents according to a weighting formula which uses
the terms you give it and how frequently they occur in each document and
how many documents they occur in in total.  The returned document ids
are called the "match set" or "MSet" (though "set" is a bit misleading,
as it is ordered).

The process for forming the "ESet" is very similar, except that the
roles of documents and terms are exchanged (and the mathematical
formulae for the ESet are closely related to those for "TradWeight").
So you pass in some document ids the user says are relevant (the "RSet",
or "relevant set" - this one is unordered, so is actually a set in the
mathematical sense) and using those documents and how frequently each
term occurs in them compared to how frequently it occurs in the whole
collection, Xapian computes the ESet.

The ESet will tend to contain terms which occur more in the documents
in the RSet than you would expect if terms were evenly spread through
the collection.

If you want to play around with it, you can see it on xapian.org -
e.g. a search for "relevance" gives you a list of suggested expand
terms in a green box:

http://xapian.org/search?P=relevance

As you can see, it makes some good suggestions.

These are produced by assuming the top 5 (IIRC) documents are relevant.
You can select your own RSet by using the green checkboxes alongside
each result - e.g. if I mark the first and third (which are API and
internal docs for the RSet class) and click "Search" I get (URL
simplified by hand to keep it short):

http://xapian.org/search?P=relevance&R=1059&R=55

You can see that the suggested terms are now more focussed on those
related to the RSet API in particular.

> How exactly does Xapian add terms to expand a query ?

It's up to the developer using Xapian what they do with the terms in the
ESet, but combining them with a query from the user is quite - in C++
you can write:

    // Parse the query string from the user.
    Xapian::Query q = qp.parse_query(query_string);

    // OR together all the terms in the ESet.
    Xapian::Query eset_query(Xapian::Query::OP_OR, eset.begin(), eset.end());

    // And OR together the two.
    q = Xapian::Query(Xapian::Query::OP_OR, q, eset_query);

> Assuming we have a list of the k most important terms, how do we
> decide which term to add to the query and will be in context with the
> query ?

I'm not sure quite what you're asking.  The ESet is formed given an
RSet, which is a set of document ids.  That might come from searching
using a query (as in the first xapian.org search above), but it doesn't
have to.  There's no direct connection between the query and the ESet
anyway.

> And to decide r in W(t)=rw(t), is r obtained from the user every time a
> MSET is returned and is r for the term updated and stored for a term every
> time a query is performed ?

No, r is "how many of the documents in the given RSet contain term t".

Cheers,
    Olly