[Xapian-discuss] Newbie question: ESets, finding similar documents

Tue Dec 16 03:01:20 GMT 2008

On Mon, Dec 15, 2008 at 10:09:54AM +0000, Ben Campbell wrote:
> Olly Betts wrote:
> > In your case you're using an RSet based on articles written by a
> > particular author, and a particular author's writing style may lead to
> > this happening more often.
> 
> Ahh!
> I _think_ it happens most often for columnists, journos who write 
> opinion pieces on a variety of topics, and also use a lot of personal 
> pronouns... "I", "we", "you" etc...
> So those journos would have a higher frequency for those terms than over 
> the database as a whole, hence their appearance in the ESet.

Yes, that sounds very plausible.

Essentially, in such cases these words really are good discriminators
between the two sets of documents, but they don't convey any real
meaning, so they aren't good terms to present to users.
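
To make that concrete, here is a toy sketch in plain Python.  It does
not use Xapian at all, and the scoring is a simple ratio of document
frequencies rather than the weighting get_eset() actually applies.  The
function names, the sample documents and the STOPWORDS set are all made
up for illustration.  It only aims to show why a pronoun like "i" can
outrank a topical term when the relevance set is one columnist's
articles, and how filtering such words afterwards cleans up the list.

```python
from collections import Counter

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"i", "we", "you"}


def score_terms(rset_docs, all_docs):
    """Score each term in the relevance set by how much more common it is
    there than in the whole collection (a toy measure, not Xapian's).

    rset_docs is assumed to be a subset of all_docs.
    """
    n_r = len(rset_docs)
    n = len(all_docs)
    # Document frequencies: count each term at most once per document.
    r_df = Counter(t for d in rset_docs for t in set(d.lower().split()))
    df = Counter(t for d in all_docs for t in set(d.lower().split()))
    return {t: (r_df[t] / n_r) / (df[t] / n) for t in r_df}


def suggest_terms(rset_docs, all_docs, stopwords=STOPWORDS, k=5):
    """Return up to k of the best-scoring terms, skipping stopwords."""
    scores = score_terms(rset_docs, all_docs)
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return [t for t in ranked if t not in stopwords][:k]


# Made-up data: two opinion pieces by one columnist, plus other articles.
columnist = [
    "i think we need tax reform",
    "i hear you want football news",
]
others = [
    "tax reform passed",
    "football results announced",
    "weather report issued",
    "election results counted",
]
all_docs = columnist + others

scores = score_terms(columnist, all_docs)
suggestions = suggest_terms(columnist, all_docs)
```

Here "i" appears in every one of the columnist's articles and nowhere
else, so it scores higher than "tax", which also turns up in other
people's articles.  The filtered suggestions simply drop anything in
the stopword set.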

> And I do get useful terms from get_eset(), once I filter out some of 
> these cruft words. The results I'm getting now seem really good!
> 
> > This really should be documented.  I've started a new wiki page to track
> > such things:
> > 
> > http://trac.xapian.org/wiki/MissingDocumentation
> 
> Are you interested in patches to the source code to add a couple of 
> small notes to the API docs?

Definitely.  Patches to improve documentation are most welcome - please
attach them to a ticket in trac to make sure they don't get lost.  Even
if you can't figure out what it should say, just knowing where
clarification or other improvement is needed is useful.  For that sort
of thing, a note on the wiki is fine instead of a ticket:

http://trac.xapian.org/wiki/MissingDocumentation

Cheers,
    Olly