[Xapian-discuss] Stemming, stopping, and multiple languages

Mon Jul 20 00:26:54 BST 2009

On Sun, 2009-07-19 at 17:22 +0200, Sean McCleary wrote:
> Hello all, I'm just getting started with Xapian and have a question
> about stemming, stop words, and multiple languages.

I've been thinking about how to do this myself recently, so I'll try and
help.  I'm not that familiar with the internals of Xapian yet, so some
details might not be totally accurate.

> So I have two Xapian databases, one containing documents in English,
> and another containing documents in German.  When I index them, I
> don't use a stemmer or stop words, as I've read that it's considered
> best practice to apply a stemmer and stop words at the time of
> searching, not indexing.

Stemming usually only works if you do it both when indexing and
searching.  If you stem only when searching, you'll be searching for
terms that do not exist in the database.  The database itself knows
nothing (well, very little) about stemming; all the magic is done at
the tokenizing stage by the term generator.

For example, with no stemming at index time, the term "fishing" is stored
in the database as it is.  When you conduct a stemmed search, a query of
"fishing" will be stemmed to "fish", which will not match the document.

Though actually, with Xapian, the term generator returns both the
stemmed and unstemmed terms, so you might not have noticed the broken
stemming in your case, unless you were testing carefully.
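
To make the basic mismatch concrete, here is a toy sketch in plain Python.
It is not the Xapian API: toy_stem, index and search are made-up helpers,
the suffix stripper is far cruder than a real stemmer, and it ignores the
fact that unstemmed terms may be stored alongside stemmed ones.

```python
# Toy model only: a crude suffix stripper stands in for a real stemmer,
# and a dict of term sets stands in for a database.

def toy_stem(word):
    """Strip a few English suffixes (hypothetical, much cruder than Snowball)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index(docs, stem=False):
    """Map each document id to the set of terms stored for it."""
    out = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        out[doc_id] = {toy_stem(w) if stem else w for w in words}
    return out

def search(db, query, stem=False):
    """Return ids of documents that contain every (optionally stemmed) query term."""
    terms = [toy_stem(w) if stem else w for w in query.lower().split()]
    return sorted(d for d, stored in db.items() if all(t in stored for t in terms))

docs = {1: "gone fishing today", 2: "fish and chips"}
unstemmed_db = index(docs, stem=False)
stemmed_db = index(docs, stem=True)

# Stemmed query against an unstemmed index: document 1 holds "fishing",
# but the query term has become "fish", so document 1 is missed.
print(search(unstemmed_db, "fishing", stem=True))

# Stemming on both sides: both documents are found.
print(search(stemmed_db, "fishing", stem=True))
```

In this toy the first search finds only the document that happens to
contain the literal word "fish"; the one that actually says "fishing" is
lost until the index is stemmed too.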

> So when I'm searching one database at a time, it's easy.  Load a
> stemmer for the appropriate language, load the stop words.
> 
> When I want to search through both at once, I can easily load both
> databases.  But it seems that the stemmer and stop words are applied
> to the query, not the databases.

Yes, you give the QueryParser the stemmer and stopper, not the
databases.  The stemming and stopping are done on the query, and the
resulting query is then executed on the databases.

Another clue is that you set up multiple databases for search with the
add_database function on a Database object.  That function doesn't
provide a way to give a stemmer/stopper at the same time.
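
The shape of that arrangement can be sketched as a toy model in plain
Python.  None of this is the Xapian API: parse_query, search_all and
english_stem are hypothetical names, and dropping stop words outright is a
simplification of the toy.  The point is only that the query is processed
once, and the databases never see a stemmer or stopper.

```python
# Toy model, not the Xapian API: the "parser" owns the stemmer and stop
# list; the "databases" are just term sets and accept no such options.

def parse_query(text, stem, stop_words):
    """Turn query text into terms once, before any database is consulted."""
    return [stem(w) for w in text.lower().split() if w not in stop_words]

def search_all(databases, terms):
    """Run the already-processed terms against every database."""
    hits = []
    for name, docs in databases.items():
        for doc_id, stored in docs.items():
            if all(t in stored for t in terms):
                hits.append((name, doc_id))
    return sorted(hits)

def english_stem(word):
    """Hypothetical stand-in for a real English stemmer."""
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word

databases = {
    "english": {1: {"gone", "fish"}},
    "german": {1: {"der", "fisch"}},
}

# One stemmer and one stop list are applied to the query; the result is
# then used unchanged for both databases.
terms = parse_query("the fishing", english_stem, stop_words={"the"})
print(terms)
print(search_all(databases, terms))
```

The German database is searched with English-stemmed terms here, which is
exactly the problem you describe.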

> So if I had, for example, the
> word "die" (which means "the") in my list of German stop words, it
> would also exclude the word "die" (as in, "cease to be alive") from
> any English documents as well, right?  The same problem applies to the
> stemmer -- I can only load one for one of the languages.

As I just learnt yesterday, stop words are actually still tokenized;
they're just not stemmed!  So in this particular case, a search on "die"
would be OK.  But that is not really your point :)

> Is there any way around this?  Or does this mean I need to apply
> stemmers and stop words at the time of indexing to get this to work?

I think you could do this with a custom stemmer class, that stems a
query for more than one language at a time (I'm assuming it's possible
to return more than one stem to the TermGenerator - if not, I guess
you'd need a custom TermGenerator instead, that could be given multiple
stemmers).
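
As a rough illustration of the idea (plain Python, not Xapian classes),
each query word can be expanded into the stems produced by every language's
stemmer, and a document matches a word if any of those stems is stored.
The two stemmers and the names multi_stem and matches are made up for this
sketch; whether Xapian lets a stemmer return several stems is still the
open assumption mentioned above.

```python
# Toy model: two crude stemmers stand in for real English and German ones.

def toy_english_stem(word):
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word

def toy_german_stem(word):
    return word[:-2] if word.endswith("en") and len(word) > 4 else word

def multi_stem(word, stemmers):
    """Return every distinct stem the given stemmers produce for one word."""
    return {stem(word) for stem in stemmers}

def matches(stored_terms, query, stemmers):
    """A document matches if, for each query word, any candidate stem is stored."""
    return all(multi_stem(w, stemmers) & stored_terms for w in query.lower().split())

stemmers = [toy_english_stem, toy_german_stem]
english_doc = {"gone", "fish"}   # as if indexed with the English stemmer
german_doc = {"wir", "fisch"}    # as if indexed with the German stemmer

print(multi_stem("fishing", stemmers))
print(matches(english_doc, "fishing", stemmers))
print(matches(german_doc, "fischen", stemmers))
```

The cost is a wider query: every word turns into an OR over its candidate
stems, so a stem from the wrong language can occasionally match by accident.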

As for stoppers, with the current behaviour it's fine as it only affects
stemming (so your custom Stemmer/TermGenerator would be given
appropriate stoppers too).

If you wanted to fully remove stop words from the query though, that
would be more complicated - I think you'd have to know which stop words
in one language are ordinary words in another, and not stop them when
searching databases in both those languages, erk!
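
One conservative way to approximate that, sketched in plain Python with
made-up stop lists and helper names (nothing here is Xapian API), is to
drop a word only if it is a stop word in every language being searched.
That is a coarser rule than really knowing which stop words are ordinary
words elsewhere, but it needs no extra vocabulary data.

```python
# Toy stop lists; real lists would be far longer.
STOP_WORDS = {
    "english": {"the", "a", "and"},
    "german": {"die", "der", "und"},
}

def safe_stop_words(languages):
    """Words that are stop words in every language being searched."""
    lists = [STOP_WORDS[lang] for lang in languages]
    return set.intersection(*lists) if lists else set()

def strip_stop_words(query, languages):
    """Remove only the words that are safe to stop for these languages."""
    stop = safe_stop_words(languages)
    return [w for w in query.lower().split() if w not in stop]

# German only: "die" is an article and is removed.
print(strip_stop_words("die hard", ["german"]))

# English and German together: "die" may be the English verb, so it stays.
print(strip_stop_words("die hard", ["english", "german"]))
```

The downside is that with both languages selected almost nothing gets
stopped, "the" included, so this only really helps when the stop lists
overlap.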

John.
http://johnleach.co.uk