[Xapian-discuss] Newbie question: ESets, finding similar documents

Thu Dec 11 23:48:21 GMT 2008

On Wed, Dec 10, 2008 at 09:17:47AM +0000, Ben Campbell wrote:
> Maybe it's just my understanding of ESets which is wrong... is it a case 
> that Enquire::get_eset() returns terms that are common across the Rset? 

It returns terms which are more common in the RSet than in the document
collection as a whole.
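
As a rough illustration of "more common in the RSet than in the
collection", here is a toy sketch in plain Python. It is not Xapian's
actual formula and doesn't use the Xapian API: the function name, the
simple document-frequency ratio and the sample data are all assumptions
made up for this example.

```python
from collections import Counter

def overrepresented_terms(collection, rset_ids):
    """Rank terms that occur in a larger share of the RSet documents
    than of the collection as a whole (toy ratio, not Xapian's weight)."""
    n_docs = len(collection)
    n_rel = len(rset_ids)
    df_all = Counter()   # number of documents containing each term
    df_rel = Counter()   # same, restricted to documents in the RSet
    for doc_id, terms in collection.items():
        unique = set(terms)
        df_all.update(unique)
        if doc_id in rset_ids:
            df_rel.update(unique)
    scored = []
    for term, r in df_rel.items():
        ratio = (r / n_rel) / (df_all[term] / n_docs)
        if ratio > 1.0:  # keep only terms over-represented in the RSet
            scored.append((term, ratio))
    # Highest ratio first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored

# Hypothetical four-document collection; documents 1 and 2 form the RSet.
collection = {
    1: ["xapian", "index", "the"],
    2: ["xapian", "query", "the"],
    3: ["cooking", "recipe", "the"],
    4: ["travel", "guide", "the"],
}
result = overrepresented_terms(collection, {1, 2})
suggested = [term for term, _ in result]
```

Here "the" appears in every RSet document, but it also appears in every
document in the collection, so it is not suggested. With a small RSet, a
stopword-like term only needs to be slightly over-represented to get
through.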

> In some cases I'm using up to 30 documents in my RSet, so it doesn't 
> seem so silly that all those annoying little terms _are_ the common 
> ones... (although I'd have thought there'd be some weighting to take 
> into account how common terms are across the database as a whole).

There is, but by chance sometimes common words which don't convey useful
meaning will happen to be relatively more common in the RSet.

In your case you're using an RSet based on articles written by a
particular author, and a particular author's writing style may lead to
this happening more often.

> I've checked out the source and will have a bit of a poke around to try 
> and increase my understanding what get_eset() actually does.

The source doesn't really make the algorithm clear, and the overview
documentation doesn't go into details:

http://xapian.org/docs/overview

This really should be documented.  I've started a new wiki page to track
such things:

http://trac.xapian.org/wiki/MissingDocumentation

Currently the algorithm uses an adjusted weighting function to ensure it
never goes negative.  I think in that case it would be better to simply
reject terms which would get a negative weight.  But that would only
mean you'd get fewer terms back if you asked for a lot - it shouldn't
affect the order of those returned.
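
To show why rejecting negative weights would only shorten the list, here
is a small sketch. It assumes a Robertson/Sparck Jones style relevance
weight with 0.5 smoothing as a stand-in for a weight that can go
negative; it is not taken from the Xapian source, and the term
statistics are invented for illustration.

```python
import math

def rsj_weight(r, n, R, N):
    """Robertson/Sparck Jones relevance weight with 0.5 smoothing.
    r: RSet docs containing the term, n: all docs containing the term,
    R: size of the RSet, N: size of the collection."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

N, R = 1000, 10  # hypothetical collection size and RSet size
stats = {        # term -> (r, n), made-up numbers
    "xapian": (8, 20),
    "index": (5, 100),
    "the": (9, 990),
    "of": (10, 1000),
}

weights = {t: rsj_weight(r, n, R, N) for t, (r, n) in stats.items()}

# Full ranking, highest weight first.
ranked_all = sorted(weights, key=lambda t: -weights[t])

# Alternative: reject any term whose weight is not positive.
kept = [t for t in ranked_all if weights[t] > 0]
```

Since the negative-weight terms all sit at the bottom of the ranking,
kept is just a prefix of ranked_all: the order of the surviving terms is
unchanged, and you only get fewer of them.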

Cheers,
    Olly