[Xapian-discuss] UTF8 support plans (without stemming)

Olly Betts olly at survex.com
Thu Apr 28 21:11:05 BST 2005


On Thu, Apr 28, 2005 at 08:44:26PM +0100, James Aylett wrote:
> That precision is more important than recall with a corpus containing
> lots of good matches is precisely the reason Google is so successful -
> it's (effectively) optimised for this behaviour. It's also why it's so
> rubbish when you're looking for a very specific document (although
> there may be no good single way of dealing with this with a corpus as
> large as Google's).

EuroFerret came at this whole problem in a rather different way.  First
it stemmed everything, then it would pick out the 60 "best" terms for
each document and index it on only those.  We also indexed the best 12
word-pairs so that phrases could be searched for, at least after a
fashion.
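
In case it helps to make that concrete, here's a rough sketch of that
sort of "best terms only" indexing against the current Xapian C++ API.
The scoring (plain wdf) and the "P" prefix for word-pairs are just
placeholders I've made up for illustration - the real measure of "best"
was more involved, and the Muscat36 details were different.

#include <xapian.h>
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Index a document on only its "best" terms and word-pairs.
void index_best_terms(const std::vector<std::string>& words,
                      Xapian::Document& doc,
                      size_t max_terms = 60,
                      size_t max_pairs = 12)
{
    Xapian::Stem stemmer("english");

    // Count the within-document frequency (wdf) of each stemmed term,
    // and of each adjacent stemmed word-pair.
    std::map<std::string, unsigned> term_wdf, pair_wdf;
    std::string prev;
    for (const std::string& word : words) {
        std::string stem = stemmer(word);
        ++term_wdf[stem];
        if (!prev.empty()) ++pair_wdf[prev + " " + stem];
        prev = stem;
    }

    // Keep only the `limit` entries with the highest counts.
    auto pick_top = [](const std::map<std::string, unsigned>& counts,
                       size_t limit) {
        std::vector<std::pair<std::string, unsigned>> v(counts.begin(),
                                                        counts.end());
        std::sort(v.begin(), v.end(),
                  [](const auto& a, const auto& b) {
                      return a.second > b.second;
                  });
        if (v.size() > limit) v.resize(limit);
        return v;
    };

    for (const auto& t : pick_top(term_wdf, max_terms))
        doc.add_term(t.first, t.second);
    // Arbitrary "P" prefix keeps word-pairs distinct from single terms.
    for (const auto& p : pick_top(pair_wdf, max_pairs))
        doc.add_term("P" + p.first, p.second);
}

(In practice you'd want something closer to a tf-idf style score for
"best" rather than raw wdf, but it shows the shape of the thing.)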

This generally worked very well (at least on the 40 million documents
we indexed at the time).  I think there was scope for improving how
"best" was measured too.

But the best part was that this meant our index was a little over 1KB
per document, including document samples, etc.  That was with Muscat36
DA databases, and Quartz is more compact than they were.

> Google's method on a small corpus didn't seem terribly effective to me
> (using GMail with a few months' emails from a few mailing lists).

I've not tried GMail, but I've noticed searching Google Groups seems to
give worse rankings than the web search - I've wondered if that's
because PageRank can't really be used there.

> Incidentally, I notice that intro_ir, in the "Probabilistic term
> weights" section, contains, in its second to last sentence, "... to
> distinguish it from wdf and wdq introduced below". I'm pretty sure
> that should be 'wqf' not 'wdq'.

It should.

Actually, that whole paragraph appears to be wrong - n is the term
frequency, not the collection frequency, at least in our nomenclature.
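
(For anyone following along: the weight that section builds up to is,
from memory, the usual one without relevance feedback,

    w(t) = \log \frac{N - n + 0.5}{n + 0.5}

where N is the number of documents in the collection and n is the
number of documents containing t - which is what we call the term
frequency.  The collection frequency would instead be the total number
of occurrences of t across all documents.)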

Cheers,
    Olly


