[Xapian-discuss] Newbie question: ESets, finding similar documents
Ben Campbell
ben at scumways.com
Wed Dec 10 09:17:47 GMT 2008
Olly Betts wrote:
> On Wed, Dec 10, 2008 at 02:39:14AM +0000, Olly Betts wrote:
>> I took a look in case the code was wrong. I think it is when handling
>> multiple databases, though it's unclear to me what effect the bug would
>> have. But if you're expanding over multiple databases, this may not be
>> helping...
>
> Actually, no, it is correct after all.
Moot point anyway - I'm using a single database (it's got about a
million documents in it and weighs in at 12GB, if that has any bearing).
I'm already using an ExpandDecider to filter out various prefixed terms.
I'll alter it to remove the unstemmed "Z" terms too, and try using
stopwords (although somehow stopwords here feel like a bit of a kludge).
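
For what it's worth, the filtering rule described above can be sketched as a plain Python predicate. This is only an illustration: the name keep_term and the stopword list are hypothetical, and it assumes that prefixed terms (including the "Z" ones) begin with an ASCII capital letter, which is the usual Xapian convention but may not match every index.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = frozenset({"the", "and", "of", "to", "in"})


def keep_term(term):
    """Return True if a candidate expansion term should be kept.

    Assumption: prefixed terms (e.g. "Zhous", "XDfoo") start with an ASCII
    capital letter, while ordinary terms are lowercased at index time.
    """
    # Some versions of the Python bindings hand terms over as bytes.
    if isinstance(term, bytes):
        term = term.decode("utf-8", "replace")
    # Reject anything carrying a prefix, which includes the "Z" terms.
    if "A" <= term[:1] <= "Z":
        return False
    # Reject stopwords.
    return term not in STOPWORDS


# With the bindings installed, the predicate could be wrapped roughly as:
#
#   class Decider(xapian.ExpandDecider):
#       def __call__(self, term):
#           return keep_term(term)
#
#   eset = enquire.get_eset(20, rset, Decider())
```

Keeping the rule in a plain function means it can be tried out on sample terms without opening the 12GB database.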
I'm using the python bindings, but I can't imagine that making any
difference.
Maybe it's just my understanding of ESets that is wrong... is it the case
that Enquire::get_eset() simply returns the terms that are common across the RSet?
In some cases I'm using up to 30 documents in my RSet, so it doesn't
seem so silly that all those annoying little terms _are_ the common
ones... (although I'd have thought there'd be some weighting to take
into account how common terms are across the database as a whole).
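
To make the kind of weighting I have in mind concrete, here is a small sketch using the classic Robertson/Sparck Jones relevance weight multiplied by the number of RSet documents containing the term. It is not taken from the Xapian source, so treat it as an assumption about what such a weighting could look like rather than a description of what get_eset() does. The function name and all the numbers are made up, apart from the rough database and RSet sizes mentioned above.

```python
import math


def offer_weight(r, n, R, N):
    """Score a candidate expansion term (illustrative, not Xapian's code).

    r: RSet documents containing the term
    n: documents in the whole database containing the term
    R: size of the RSet
    N: size of the database
    """
    # Robertson/Sparck Jones relevance weight with the usual 0.5 smoothing.
    w = math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                 ((n - r + 0.5) * (R - r + 0.5)))
    # Multiply by r so terms seen in more RSet documents are favoured.
    return r * w


N, R = 1_000_000, 30  # roughly the sizes mentioned in this thread

# A term in every RSet document, but also in 95% of the database.
common = offer_weight(r=30, n=950_000, R=R, N=N)

# A term in only 10 RSet documents, but in just 500 documents overall.
rare = offer_weight(r=10, n=500, R=R, N=N)

print(common, rare)
```

Under this scheme the rarer term scores higher even though it appears in fewer RSet documents, which is the behaviour I would have expected.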
I've checked out the source and will have a bit of a poke around to try
and improve my understanding of what get_eset() actually does.
Thanks!
Ben.