[Xapian-discuss] Newbie question: ESets, finding similar documents
Ben Campbell
ben at scumways.com
Mon Dec 15 10:09:54 GMT 2008
Olly Betts wrote:
> On Wed, Dec 10, 2008 at 09:17:47AM +0000, Ben Campbell wrote:
>> Maybe it's just my understanding of ESets which is wrong... is it a case
>> that Enquire::get_eset() returns terms that are common across the RSet?
>
> It returns terms which are more common in the RSet than in the document
> collection as a whole.
Ahhh - cool, that makes sense.
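To make that description concrete, here is a toy sketch in plain Python. It does not use the Xapian bindings and it is not Xapian's actual weighting formula; it just scores each term by the ratio of its document frequency in the RSet to its document frequency in the whole collection. The function name relative_frequency_scores and the sample documents are made up for illustration, and the sketch assumes every RSet document is also in the collection.

```python
from collections import Counter


def relative_frequency_scores(rset_docs, all_docs):
    """Score terms by (share of RSet docs containing the term) divided by
    (share of all docs containing the term). Toy ratio, not Xapian's formula.

    Assumes rset_docs is a subset of all_docs, so no division by zero.
    """
    # Document frequency: in how many documents does each term appear?
    rset_df = Counter(t for doc in rset_docs for t in set(doc.split()))
    all_df = Counter(t for doc in all_docs for t in set(doc.split()))

    scores = {}
    for term, count in rset_df.items():
        rset_rate = count / len(rset_docs)
        collection_rate = all_df[term] / len(all_docs)
        scores[term] = rset_rate / collection_rate
    return scores


# Hypothetical mini-collection; the first two documents play the RSet.
all_docs = [
    "i think the budget is wrong",
    "i think we need football",
    "the budget passed today",
    "the football match ended",
    "the weather is mild today",
    "the market fell today",
]
rset_docs = all_docs[:2]

scores = relative_frequency_scores(rset_docs, all_docs)
# Highest-scoring terms first, ties broken alphabetically.
ranked = sorted(scores, key=lambda t: (-scores[t], t))
```

A score above 1 means the term is over-represented in the RSet, and below 1 means under-represented. In this toy data "the" scores below 1 because it is everywhere, but "i" and "think" score as high as any topical word, which is the same effect being discussed in this thread.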
>> In some cases I'm using up to 30 documents in my RSet, so it doesn't
>> seem so silly that all those annoying little terms _are_ the common
>> ones... (although I'd have thought there'd be some weighting to take
>> into account how common terms are across the database as a whole).
>
> There is, but by chance sometimes common words which don't convey useful
> meaning will happen to be relatively more common in the RSet.
>
> In your case you're using an RSet based on articles written by a
> particular author, and a particular author's writing style may lead to
> this happening more often.
Ahh!
I _think_ it happens most often for columnists, journos who write
opinion pieces on a variety of topics, and also use a lot of personal
pronouns... "I", "we", "you" etc...
So those journos would have a higher frequency for those terms than over
the database as a whole, hence their appearance in the ESet.
And I do get useful terms from get_eset(), once I filter out some of
these cruft words. The results I'm getting now seem really good!
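For what it's worth, the filtering step can be sketched roughly like this. It is plain Python working on a list of term strings rather than a real ESet, and both the CRUFT_WORDS list and the filter_eset_terms name are hypothetical, made up for the example. (Xapian also has an ExpandDecider hook that can be passed to get_eset() to reject terms, but this sketch just filters after the fact.)

```python
# Hypothetical cruft list; a real one would be tuned to the collection.
CRUFT_WORDS = {"i", "we", "you", "me", "my", "our", "your"}


def filter_eset_terms(terms, cruft=CRUFT_WORDS):
    """Drop cruft words from suggested terms, keeping the original order."""
    return [t for t in terms if t.lower() not in cruft]


# Made-up example of terms as they might come back for a columnist.
suggested = ["i", "budget", "we", "treasury", "You", "inflation"]
useful = filter_eset_terms(suggested)
```

Keeping the original order matters here because the suggested terms normally arrive ranked by weight, so the filtered list stays ranked too.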
> This really should be documented. I've started a new wiki page to track
> such things:
>
> http://trac.xapian.org/wiki/MissingDocumentation
Are you interested in patches to the source code to add a couple of
small notes to the API docs?
e.g. "returns terms which are more common in the RSet than in the document
collection as a whole" would make all the difference to the get_eset()
notes...
Thanks very much for all the help!
Ben.