[Xapian-discuss] Newbie question: ESets, finding similar documents

Ben Campbell ben at scumways.com
Thu Dec 11 08:13:52 GMT 2008


Olly Betts wrote:
> On Tue, Dec 09, 2008 at 05:54:17PM +0000, Ben Campbell wrote:
[snip]
>> But get_eset often returns me useless terms, eg:
>> ['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it', 'Zyear', 
>> 'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe', 'Zthere', 'on', 'Zon', 
>> 'for', 'Zfor']
[snip]
> You can provide an ExpandDecider which rejects terms like these.  You
> probably don't want both stemmed and unstemmed forms - again an
> ExpandDecider can take care of that.

I'm now rejecting a whole bunch of these useless words (and also 
rejecting unstemmed terms) in my ExpandDecider, and I seem to be 
getting pretty respectable results, which is cool. So I'm pretty happy 
I've got something which works well!
(And it's quicker than I'd originally thought. I'd expected the runs I 
want to do to take days, but it looks like it'll be hours instead. Hooray!)
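
For anyone following along, here is a minimal sketch of the kind of 
filtering I mean. It is not my exact code: the stopword list is a made-up 
sample, and keep_term is a hypothetical helper name. It assumes stemmed 
terms carry a 'Z' prefix, as in the list above. In the Python bindings I 
believe this logic would go in the __call__ method of a 
xapian.ExpandDecider subclass, but the sketch leaves Xapian out so it 
runs on its own.

```python
# Hypothetical sample of stems to reject; a real list would be much longer.
STOP_STEMS = frozenset([
    "are", "but", "be", "it", "that", "is", "there", "on", "for",
])


def keep_term(term):
    """Return True if an expand term looks worth keeping.

    Assumes stemmed terms are prefixed with 'Z' (as in the ESet output
    quoted above); anything without that prefix is rejected.
    """
    if isinstance(term, bytes):
        # Some bindings hand terms back as bytes rather than str.
        term = term.decode("utf-8")
    if not term.startswith("Z"):
        return False  # drop unstemmed forms
    return term[1:] not in STOP_STEMS  # drop stopword stems


eset_terms = ['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it',
              'Zyear', 'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe',
              'Zthere', 'on', 'Zon', 'for', 'Zfor']
kept = [t for t in eset_terms if keep_term(t)]
```

With this sample list only the stemmed, non-stopword terms survive the 
filter.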

But I'm still a little uneasy that those words were in the expanded set 
in the first place... pretty much every document in my database would 
have those words, so surely that would disqualify them from being a 
useful part of an ESet?
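
One way to check my assumption that these words really are in nearly 
every document would be to compare each term's document frequency with 
the total document count. A rough sketch follows; the function names and 
the 0.5 cut-off are my own inventions for illustration. The two numbers 
could come from Database.get_termfreq() and Database.get_doccount(), but 
the sketch just takes them as plain integers.

```python
def doc_fraction(termfreq, doccount):
    """Fraction of documents that contain the term (0.0 for an empty db)."""
    if doccount == 0:
        return 0.0
    return float(termfreq) / doccount


def is_too_common(termfreq, doccount, max_fraction=0.5):
    """True if the term appears in more than max_fraction of documents.

    max_fraction is an arbitrary illustrative threshold, not a Xapian default.
    """
    return doc_fraction(termfreq, doccount) > max_fraction
```

For example, a term found in 950 of 1000 documents would be flagged, 
while one found in 10 of 1000 would not.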

Thanks,
Ben.



