[Xapian-discuss] Newbie question: ESets, finding similar documents
Ben Campbell
ben at scumways.com
Thu Dec 11 08:13:52 GMT 2008
Olly Betts wrote:
> On Tue, Dec 09, 2008 at 05:54:17PM +0000, Ben Campbell wrote:
[snip]
>> But get_eset often returns me useless terms, eg:
>> ['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it', 'Zyear',
>> 'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe', 'Zthere', 'on', 'Zon',
>> 'for', 'Zfor']
[snip]
> You can provide an ExpandDecider which rejects terms like these. You
> probably don't want both stemmed and unstemmed forms - again an
> ExpandDecider can take care of that.
I'm now rejecting a whole bunch of useless words (and also rejecting
unstemmed terms) in my ExpandDecider, and I seem to be getting
respectable results, which is cool. So I'm happy that I've got
something which works well!
(And it's quicker than I'd originally thought: I'd expected the runs I
want to do to take days, but it looks like it'll be hours instead. Hooray!)
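For anyone following along, here is a rough sketch of the kind of decider described above, not the exact code in use. It assumes the Xapian Python bindings, where a decider subclasses xapian.ExpandDecider and implements __call__, and it assumes stemmed terms carry the 'Z' prefix as in the list quoted earlier. The class name and the stop list are made up for illustration. The try/except fallback is only there so the filtering logic can be run without Xapian installed.

```python
try:
    import xapian  # real Xapian Python bindings, if installed
    _DeciderBase = xapian.ExpandDecider
except ImportError:
    _DeciderBase = object  # fallback so the filter logic runs standalone

# Hypothetical stop list of *stemmed* forms; tune it for your own corpus.
STOP_STEMS = frozenset(
    ["say", "are", "but", "be", "it", "that", "is", "there", "on", "for"]
)


class StemmedNonStopDecider(_DeciderBase):
    """Accept only 'Z'-prefixed (stemmed) terms whose stem is not a stop word."""

    def __init__(self, stop_stems=STOP_STEMS):
        _DeciderBase.__init__(self)
        self.stop_stems = stop_stems

    def __call__(self, term):
        # Some versions of the bindings hand terms over as bytes.
        if isinstance(term, bytes):
            term = term.decode("utf-8")
        # Reject unstemmed terms: keep only those with the 'Z' prefix.
        if not term.startswith("Z"):
            return False
        # Reject stemmed forms of very common words.
        return term[1:] not in self.stop_stems


# The candidate terms from the earlier message, filtered by hand.
candidates = ['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it',
              'Zyear', 'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe', 'Zthere',
              'on', 'Zon', 'for', 'Zfor']
decider = StemmedNonStopDecider()
kept = [t for t in candidates if decider(t)]
print(kept)
```

With Xapian available, the decider would be passed as the third argument to get_eset, along the lines of enquire.get_eset(20, rset, decider), where enquire and rset are whatever Enquire and RSet objects you already have.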
But I'm still a little uneasy that those words were in the expanded set
in the first place... pretty much every document in my database would
have those words, so surely that would disqualify them from being a
useful part of an ESet?
Thanks,
Ben.