[Xapian-discuss] Stemming, stopping, and multiple languages

James Aylett james-xapian at tartarus.org
Mon Jul 20 12:09:44 BST 2009


On Mon, Jul 20, 2009 at 12:26:54AM +0100, John Leach wrote:

> I think you could do this with a custom stemmer class, that stems a
> query for more than one language at a time (I'm assuming it's possible
> to return more than one stem to the Term Generator - if not, I guess
> you'd need a custom TermGenerator instead, that could be given multiple
> stemmers).

You can't return more than one term from a stemmer at once, but if you
know the language of the document as you're indexing this doesn't
matter: index each document with the stemmer for its language. If you
don't know the
language, you *could* index twice -- I'd recommend into either
different databases, or different documents with some kind of
discriminating tag (Omega uses a tag prefix of 'L' with the ISO
language code; so you'll get Len-GB and Lfr-fr and so forth). You
could also try language detection, which may be good enough to be
useful.

Then at search time, you may be searching in a given context (the
user has said they only want English documents, for instance), in
which case you only need to stem for that one language. This is
unusual, though, so an alternative would be to parse the query twice,
once for each of the two languages, and combine the two resultant
queries (generated from the QueryParser) into a super query with two
legs joined by OP_OR, each leg taking the QueryParser result and
combining it with the language tag (Len-GB or whatever, as above)
using OP_FILTER.

You cannot search across multiple languages with stemming without
careful planning, and part of that planning is how you're going to
distinguish between the different languages.

(There are undoubtedly other ways of doing it, and other people on the
list will probably jump in and suggest them. In particular, the use of
OP_OR concerns me a little as it may do strange things to document
ranking. As a separate note, in cases where you know that
Accept-Language is giving you useful information, you may be able to
eliminate the second language, or perhaps apply OP_SCALE_WEIGHT
according to the q-values.)
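
For the Accept-Language idea, the q-values have to be pulled out of
the header first. The following is a deliberately simplified sketch
in plain Python; parse_accept_language is a hypothetical helper, not
part of Xapian, and it does not try to handle every corner of the
HTTP grammar.

```python
def parse_accept_language(header):
    """Return (language, q) pairs sorted by descending q-value."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        lang, _, params = part.partition(";")
        q = 1.0  # a language with no explicit q-value defaults to 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparseable q-value as "not wanted"
        prefs.append((lang.strip().lower(), q))
    # sorted() is stable, so equal q-values keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


weights = parse_accept_language("en-GB,en;q=0.8,fr;q=0.5")
print(weights)
```

Each q-value could then be used as the factor when wrapping the
corresponding language leg, along the lines of
xapian.Query(xapian.Query.OP_SCALE_WEIGHT, leg, q), and a language
with a very low q-value could simply be dropped.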

J

-- 
  James Aylett

  talktorex.co.uk - xapian.org - uncertaintydivision.org
