[Xapian-discuss] Newbie question: ESets, finding similar documents
Ben Campbell
ben at scumways.com
Tue Dec 9 17:54:17 GMT 2008
I'm using ESets to look for similar documents using the following method:
1) build an RSet of example documents (my dataset consists of newspaper
articles, and my RSet is a bunch of articles written by a particular
journalist)
2) use Enquire::get_eset(20, reldocs) to get an ESet
3) build a query using the terms in the ESet (term OP_OR term OP_OR term
etc...)
But get_eset often returns useless terms, e.g.:
['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it', 'Zyear',
'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe', 'Zthere', 'on', 'Zon',
'for', 'Zfor']
(the particular journalist in this example covers environmental issues,
so I'm interested in other articles which are about the environment -
I'd want terms like "environment", "oil", "climate" etc...)
Now... most of these terms would be considered stopwords - should I be
using a stopper to avoid indexing them in the first place? I was under
the impression that it was best to leave such words in for positional
reasons...
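
To make the stopword idea concrete without changing what gets indexed, the
terms could instead be filtered after get_eset returns them. This is only a
minimal sketch in plain Python: it assumes the ESet terms have already been
copied into a list of strings, and the function name and the short stopword
set are made up for illustration. Terms starting with 'Z' are the stemmed
forms, so the prefix is stripped before checking against the stopword set.

```python
# Hypothetical stopword set for illustration only; a real list would be longer.
STOPWORDS = {"are", "be", "but", "for", "is", "it", "on",
             "say", "says", "that", "there"}


def filter_expand_terms(terms, stopwords=STOPWORDS):
    """Return the terms that are not stopwords, preserving their order.

    `terms` is assumed to be a list of strings taken from an ESet.
    """
    kept = []
    for term in terms:
        # A leading 'Z' marks a stemmed term; compare the bare stem instead.
        bare = term[1:] if term.startswith("Z") else term
        if bare.lower() in stopwords:
            continue
        kept.append(term)
    return kept


if __name__ == "__main__":
    candidates = ["Zsay", "are", "Zare", "says", "but", "Zbut", "be", "it",
                  "Zyear", "Zthat", "that", "is", "Zis", "Zit", "Zbe",
                  "Zthere", "on", "Zon", "for", "Zfor"]
    print(filter_expand_terms(candidates))
```

Whatever survives the filter would then be OR'd together in step 3. Since
nearly all of the top 20 terms get dropped here, this approach would also
need a larger ESet to be requested in the first place.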
Does anyone have any good ideas on how I could improve my results?
I'd have thought that the terms I'm getting back were so frequent that
they'd be useless for an ESet... but maybe I don't really understand how
ESets are intended to be used... is there any particular documentation I
might have missed?
Any suggestions welcome!
Thanks,
Ben Campbell