[Xapian-discuss] Newbie question: ESets, finding similar documents

Ben Campbell ben at scumways.com
Tue Dec 9 17:54:17 GMT 2008


I'm using ESets to look for similar documents using the following method:

1) build an RSet of example documents (my dataset consists of newspaper 
articles, and my RSet is a bunch of articles written by a particular 
journalist)
2) use Enquire::get_eset(20, reldocs) to get an ESet
3) build a query using the terms in the ESet (term OP_OR term OP_OR term 
etc...)
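
For concreteness, the three steps can be sketched roughly as follows with the Xapian Python bindings. This is a minimal sketch rather than the exact code in use: the function name build_similarity_query and its parameters are made up for illustration, and it assumes `enquire` is an already configured xapian.Enquire object and `docids` holds the document ids of the example articles.

```python
def build_similarity_query(enquire, docids, max_terms=20):
    """RSet of example documents -> ESet -> OR query (hypothetical helper)."""
    # Imported inside the function so the sketch can be loaded even where
    # the Xapian bindings are not installed.
    import xapian

    # 1) mark the example documents as relevant
    rset = xapian.RSet()
    for docid in docids:
        rset.add_document(docid)

    # 2) ask Xapian for up to max_terms expansion terms
    eset = enquire.get_eset(max_terms, rset)

    # 3) OR the suggested terms together into a single query
    #    (each ESet item exposes the suggested term as .term)
    terms = [item.term for item in eset]
    return xapian.Query(xapian.Query.OP_OR, terms)
```

The resulting query would then be passed to enquire.set_query() before fetching an MSet in the usual way.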

But get_eset often returns useless terms, e.g.:
['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it', 'Zyear', 
'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe', 'Zthere', 'on', 'Zon', 
'for', 'Zfor']
(the particular journalist in this example covers environmental issues, 
so I'm interested in other articles which are about the environment - 
I'd want terms like "environment", "oil", "climate" etc...)
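
As a rough illustration of how little survives once the obvious noise is removed, the list above can be post-filtered in plain Python. This is only a sketch: useful_terms is a hypothetical helper, the stopword set is a small hand-picked one rather than anything supplied by Xapian, and it relies on the 'Z' prefix marking stemmed forms.

```python
# Small hand-picked stopword set, for illustration only.
STOPWORDS = frozenset(
    ["are", "but", "be", "it", "that", "is", "on", "for", "there"]
)

# The terms returned by get_eset in the example above.
eset_terms = ['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it',
              'Zyear', 'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe',
              'Zthere', 'on', 'Zon', 'for', 'Zfor']


def useful_terms(terms, stopwords=STOPWORDS):
    """Drop 'Z'-prefixed (stemmed) terms and anything in the stopword set."""
    return [t for t in terms
            if not t.startswith('Z') and t.lower() not in stopwords]


print(useful_terms(eset_terms))
```

With this stopword set only 'says' is left, so filtering after the fact does not produce the topical terms; the expansion itself would have to rank them higher.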

Now... most of these terms would be considered stopwords - should I be 
using a stopper to avoid indexing them in the first place? I was under 
the impression that it was best to leave such words in for positional 
reasons...

Does anyone have any good ideas on how I could improve my results?
I'd have thought that the terms I'm getting back were so frequent that 
they'd be useless for an ESet... but maybe I don't really understand how 
ESets are intended to be used... is there any particular documentation I 
might have missed?

Any suggestions welcome!
Thanks,
Ben Campbell



