[Xapian-discuss] Newbie question: ESets, finding similar documents

Ben Campbell ben at scumways.com
Wed Dec 10 09:17:47 GMT 2008


Olly Betts wrote:
> On Wed, Dec 10, 2008 at 02:39:14AM +0000, Olly Betts wrote:
>> I took a look in case the code was wrong.  I think it is when handling
>> multiple databases, though it's unclear to me what effect the bug would
>> have.  But if you're expanding over multiple databases, this may not be
>> helping...
> 
> Actually, no, it is correct after all.

Moot point anyway - I'm using a single database (it's got about a 
million documents in it and weighs in at 12GB, if that has any bearing).

I'm already using an ExpandDecider to filter out various prefixed terms. 
I'll alter it to remove the unstemmed "Z" terms too, and try using 
stopwords (although somehow stopwords here feel like a bit of a kludge).
I'm using the python bindings, but I can't imagine that making any 
difference.
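
For illustration, a minimal sketch of the kind of filter described above. The names keep_term, STOPWORDS and FilterDecider are hypothetical, as is the tiny stopword list. The prefix check assumes the usual Xapian convention that term prefixes are upper-case ASCII letters (which also catches "Z" terms) and that unprefixed terms were lower-cased at index time. The xapian import is optional so the predicate can be tried on its own.

```python
try:
    import xapian  # optional: only needed for the decider wrapper
except ImportError:
    xapian = None

# Hypothetical stopword list; a real one would be much longer.
STOPWORDS = frozenset(["the", "a", "an", "and", "of", "to", "in", "is", "it"])


def keep_term(term):
    """Return True if `term` should be allowed into the ESet."""
    # Newer bindings may hand terms over as bytes; normalise to str.
    if isinstance(term, bytes):
        term = term.decode("utf-8", "replace")
    if not term:
        return False
    # Assumed convention: a leading upper-case ASCII letter marks a
    # prefixed term (including "Z" terms), so reject it.
    if "A" <= term[0] <= "Z":
        return False
    # Reject plain stopwords.
    return term not in STOPWORDS


if xapian is not None:
    class FilterDecider(xapian.ExpandDecider):
        """Adapts keep_term() to the ExpandDecider interface."""

        def __call__(self, term):
            return keep_term(term)
```

With the bindings installed, passing an instance as the decider argument, along the lines of enquire.get_eset(20, rset, FilterDecider()), should apply the filter; the exact argument order is worth checking against the installed version.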

Maybe it's just my understanding of ESets which is wrong... is it the case 
that Enquire::get_eset() returns terms that are common across the RSet? 
In some cases I'm using up to 30 documents in my RSet, so it doesn't 
seem so silly that all those annoying little terms _are_ the common 
ones... (although I'd have thought there'd be some weighting to take 
into account how common terms are across the database as a whole).
I've checked out the source and will have a bit of a poke around to try 
and improve my understanding of what get_eset() actually does.
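
To make the weighting question concrete, here is a sketch of the classic Robertson/Sparck Jones relevance weight, which rewards terms that are frequent in the RSet but rare in the database as a whole. This is only an illustration of that general idea, not a claim about what get_eset() computes internally; the function name and the example counts are made up.

```python
import math


def rsj_weight(r, n, R, N):
    """Robertson/Sparck Jones relevance weight for one term.

    r: number of RSet documents containing the term
    n: number of documents in the whole database containing the term
    R: number of documents in the RSet
    N: number of documents in the database
    """
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (n - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)


# Hypothetical counts for a database of a million documents and a
# 30-document RSet.
N, R = 1000000, 30
everywhere = rsj_weight(r=30, n=900000, R=R, N=N)  # in nearly every document
distinctive = rsj_weight(r=20, n=100, R=R, N=N)    # rare outside the RSet

# The distinctive term scores higher even though it occurs in fewer
# RSet documents than the ubiquitous one.
print(distinctive > everywhere)
```

Under a scheme like this, a term appearing in all 30 RSet documents can still rank low if it also appears in most of the database.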

Thanks!
Ben.



