[Xapian-discuss] Newbie question: ESets, finding similar documents

Wed Dec 10 01:08:51 GMT 2008

On Tue, Dec 09, 2008 at 05:54:17PM +0000, Ben Campbell wrote:
> I'm using ESets to look for similar documents using the following method:
> 
> 1) build an RSet of example documents (my dataset consists of newspaper 
> articles, and my RSet is a bunch of articles written by a particular 
> journalist)
> 2) use Enquire::get_eset(20, reldocs) to get an ESet
> 3) build a query using the terms in the ESet (term OP_OR term OP_OR term 
> etc...)
> 
> But get_eset often returns me useless terms, eg:
> ['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it', 'Zyear', 
> 'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe', 'Zthere', 'on', 'Zon', 
> 'for', 'Zfor']

I'm surprised that the list is so bad.

> (the particular journalist in this example covers environmental issues, 
> so I'm interested in other articles which are about the environment - 
> I'd want terms like "environment", "oil", "climate" etc...)
> 
> Now... most of these terms would be considered stopwords - should I be 
> using a stopper to avoid indexing them in the first place? I was under 
> the impression that it was best to leave such words in for positional 
> reasons...

Yes, leaving them in the index is generally best.

> Does anyone have any good ideas on how I could improve my results?

You can provide an ExpandDecider which rejects terms like these.  You
probably don't want both stemmed and unstemmed forms - again an
ExpandDecider can take care of that.
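
As a rough sketch of the sort of filtering I mean (not Omega's actual
code): the stopword list and the names keep_term, FilterExpandDecider,
STOPWORDS and SAMPLE below are made up for illustration, and I'm assuming
the Python bindings, where a decider is a subclass of xapian.ExpandDecider
which overrides __call__.  The decision logic is kept in a plain function
so you can try it without a database:

```python
# Illustrative stopword list (an assumption; use one suited to your data).
STOPWORDS = frozenset(
    ["are", "but", "be", "it", "that", "is", "on", "for", "there"]
)


def keep_term(term, stopwords=STOPWORDS):
    """Return True if `term` looks useful as an expansion term."""
    # Newer bindings may hand terms over as bytes; normalise to str.
    if isinstance(term, bytes):
        term = term.decode("utf-8", "replace")
    if not term:
        return False
    # Reject anything starting with an ASCII capital.  That covers the
    # 'Z'-prefixed stemmed forms (e.g. 'Zsay') and other prefixed terms,
    # so only unstemmed, unprefixed terms survive.
    if "A" <= term[0] <= "Z":
        return False
    # Reject stopwords.
    if term in stopwords:
        return False
    return True


try:
    import xapian
except ImportError:
    xapian = None  # keep_term() above still works without the bindings

if xapian is not None:

    class FilterExpandDecider(xapian.ExpandDecider):
        """Hypothetical decider which delegates to keep_term()."""

        def __call__(self, term):
            return keep_term(term)


# The terms from the original report, run through the filter.
SAMPLE = ['Zsay', 'are', 'Zare', 'says', 'but', 'Zbut', 'be', 'it',
          'Zyear', 'Zthat', 'that', 'is', 'Zis', 'Zit', 'Zbe', 'Zthere',
          'on', 'Zon', 'for', 'Zfor']

if __name__ == "__main__":
    print([t for t in SAMPLE if keep_term(t)])  # prints ['says']
```

With the bindings installed, something like
enquire.get_eset(20, reldocs, FilterExpandDecider()) should then apply the
filter during expansion, so the 20 slots aren't used up by rejected terms.
Filtering the returned list afterwards would leave you with just 'says'
here.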

You might find OmegaExpandDecider::operator() in query.cc useful to
look at:

http://trac.xapian.org/browser/trunk/xapian-applications/omega/query.cc#L2265

Cheers,
    Olly