[Xapian-discuss] Newbie question: ESets, finding similar documents

Mon Dec 15 10:09:54 GMT 2008

Olly Betts wrote:
> On Wed, Dec 10, 2008 at 09:17:47AM +0000, Ben Campbell wrote:
>> Maybe it's just my understanding of ESets which is wrong... is it a case 
>> that Enquire::get_eset() returns terms that are common across the RSet? 
> 
> It returns terms which are more common in the RSet than in the document
> collection as a whole.

Ahhh - cool, that makes sense.
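
To make that description concrete, here is a toy sketch in plain Python. It is not Xapian's actual weighting formula, and it does not use the Xapian API at all. It only scores each term by how much more often it occurs in the RSet documents than in the collection overall. The function name, the sample documents, and the simple ratio are all assumptions made for illustration, and the RSet is assumed to be a subset of the collection.

```python
from collections import Counter

def expansion_candidates(rset_docs, all_docs, top_n=5):
    """Rank terms by how over-represented they are in rset_docs.

    Illustrative only: a plain document-frequency ratio, not the
    weighting scheme Xapian uses. Assumes every document in rset_docs
    also appears in all_docs, so no collection count is ever zero.
    """
    # Document frequency: how many documents contain each term.
    rset_df = Counter(t for doc in rset_docs for t in set(doc.split()))
    coll_df = Counter(t for doc in all_docs for t in set(doc.split()))
    R, N = len(rset_docs), len(all_docs)
    # (r / R) / (n / N), rearranged to keep the arithmetic simple.
    scored = [(term, (r * N) / (R * coll_df[term]))
              for term, r in rset_df.items()]
    # Highest ratio first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

all_docs = [
    "i think the budget is wrong",
    "we think you know the answer",
    "the council approved the budget",
    "the match ended in a draw",
    "the storm closed the bridge",
    "the election results are in",
]
rset_docs = all_docs[:2]  # pretend these two were marked relevant

for term, ratio in expansion_candidates(rset_docs, all_docs):
    print(term, ratio)
```

A term such as "the" appears in every document, so its ratio is 1.0 and it ranks last. Words like "i", "we" and "you" occur only in the RSet here, so they score as highly as any topical term, which is the effect discussed further down.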

>> In some cases I'm using up to 30 documents in my RSet, so it doesn't 
>> seem so silly that all those annoying little terms _are_ the common 
>> ones... (although I'd have thought there'd be some weighting to take 
>> into account how common terms are across the database as a whole).
> 
> There is, but by chance sometimes common words which don't convey useful
> meaning will happen to be relatively more common in the RSet.
> 
> In your case you're using an RSet based on articles written by a
> particular author, and a particular author's writing style may lead to
> this happening more often.

Ahh!
I _think_ it happens most often for columnists, journos who write 
opinion pieces on a variety of topics, and also use a lot of personal 
pronouns... "I", "we", "you" etc...
So those journos would use those terms more often than the database as 
a whole does, hence their appearance in the ESet.

And I do get useful terms from get_eset(), once I filter out some of 
these cruft words. The results I'm getting now seem really good!
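
A minimal sketch of that filtering step, assuming the suggested terms have already been pulled out of the ESet as plain strings. The stop list and the function name are hypothetical; the post does not say which words were filtered or how. The Xapian calls are left out so that the sketch runs without Xapian installed.

```python
# Hypothetical stop list: the actual words filtered are not given above.
CRUFT = frozenset({"i", "we", "you", "the", "a", "and", "of", "to"})

def drop_cruft(terms, cruft=CRUFT):
    """Return terms in their original order, minus anything in cruft."""
    return [t for t in terms if t.lower() not in cruft]

# Stand-in for terms taken from an ESet, best-weighted first.
suggested = ["i", "budget", "we", "council", "you", "election"]
print(drop_cruft(suggested))
```

Filtering after the fact keeps the stop list in application code, where it is easy to tune per author or per corpus.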

> This really should be documented.  I've started a new wiki page to track
> such things:
> 
> http://trac.xapian.org/wiki/MissingDocumentation

Are you interested in patches to the source code to add a couple of 
small notes to the API docs?
e.g. "returns terms which are more common in the RSet than in the document 
collection as a whole" would make all the difference to the get_eset() 
notes...

Thanks very much for all the help!

Ben.