[Xapian-discuss] UTF8 support plans (without stemming)

Thu Apr 28 20:44:26 BST 2005

On Thu, Apr 28, 2005 at 08:06:39PM +0100, Olly Betts wrote:

> > When you are looking for enough hits in a near infinite document set the 
> > drop in recall can be hidden, because the user never knows what they 
> > miss out on - as long as there are enough results - because they never 
> > were going to look at all good results anyway.
> 
> Indeed - if there are a lot of possible answers, precision matters much
> more than recall.  Google appears to only automatically turn on stemming
> (or synonyms) for "hard" queries, which makes some sense from this point
> of view.

That precision is more important than recall with a corpus containing
lots of good matches is precisely the reason Google is so successful -
it's (effectively) optimised for this behaviour. It's also why it's so
rubbish when you're looking for a very specific document (although
there may be no good single way of dealing with this with a corpus as
large as Google's).

Turning on some form of term conflation, be it stemming or synonym
expansion, when you find fewer matches than you were looking for,
should raise recall while lowering precision. If you've got a very
small result set without conflation, this isn't a bad way of doing
things, because the loss of precision does little harm: trebling the
number of results while doubling the number of useless ones will
still leave you few enough results that the user can sort through
them by eye, refining precision themselves.
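
To make the "conflate only when matches are scarce" idea concrete,
here is a minimal sketch. It is not Google's method and it doesn't use
the Xapian API; crude_stem(), search() and the min_hits threshold are
all names made up for illustration, and the suffix stripper is only a
stand-in for a real stemmer.

```python
def crude_stem(word):
    """Toy suffix stripper (hypothetical); a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Only strip if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def search(docs, query, min_hits=3):
    """Match exact terms first; fall back to stemmed terms if hits are scarce.

    Returns (matching_docs, conflated), where conflated says whether the
    stemmed fallback was used. min_hits is an assumed tuning knob.
    """
    terms = query.lower().split()

    # First pass: every query term must appear verbatim (high precision).
    exact = [d for d in docs if all(t in d.lower().split() for t in terms)]
    if len(exact) >= min_hits:
        return exact, False

    # Too few hits: conflate by stem to trade precision for recall.
    stems = {crude_stem(t) for t in terms}
    widened = [
        d for d in docs
        if stems <= {crude_stem(w) for w in d.lower().split()}
    ]
    return widened, True


docs = [
    "mailing lists archive",
    "mailing list for xapian",
    "stemming notes",
]
hits, conflated = search(docs, "list", min_hits=2)
```

With these three documents the exact pass for "list" finds only one,
which is below min_hits=2, so the stemmed pass runs and also picks up
"mailing lists archive". With min_hits=1 the exact result is returned
unchanged.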

Google's method on a small corpus didn't seem terribly effective to me
(using GMail with a few months' emails from a few mailing lists). It
didn't even have plural to singular mapping last time I tried. (I had
to /remember/ what I'd called one of my mailing lists rather than
guess!) Using Omega, with a custom indexing engine, to index about six
years' worth of my email did a somewhat better job. (Although all of
this is somewhat subjective, I'm sure I'm not the only person who has
noticed this.)

Incidentally, I notice that intro_ir, in the "Probabilistic term
weights" section, contains, in its second to last sentence, "... to
distinguish it from wdf and wdq introduced below". I'm pretty sure
that should be 'wqf' not 'wdq'.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org