[Xapian-discuss] Stopword addition and stemming

Olly Betts olly at survex.com
Mon Nov 15 10:48:48 GMT 2010


On Mon, Nov 15, 2010 at 10:35:59AM +0200, goran kent wrote:
> Stemming:  I've turned on stemming, etc, but how can I confirm that
> it's being used in searches?  What should I look/search for?

Look for Z-prefixed terms in the output of query.get_description().
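
As a rough sketch of that check, the helper below pulls the Z-prefixed
terms out of a description string.  The function name and the sample
string are made up for illustration; the exact description format
differs between Xapian versions, so treat the pattern as an assumption
rather than a stable interface.

```python
import re

def stemmed_terms(description):
    """Return the Z-prefixed (stemmed) terms found in a query description."""
    # A stemmed term starts with 'Z' at a word boundary; the match stops at
    # the first non-word character (':' or '@' in the formats I've seen).
    return re.findall(r"\bZ\w+", description)

# Illustrative string only, not captured from a real run.
sample = "Xapian::Query((Zsearch:(pos=1) OR Zbob:(pos=2)))"
print(stemmed_terms(sample))
```

An empty list would suggest the stemmer or stemming strategy isn't
actually being applied to the query.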

> Stopwords:  I'm trying out xapian on a regional dataset (searching
> data from a *.co.us TLD, eg) .  I've noticed that searching for [bob
> co.us] results in *very* slow search times (tens of seconds), since it
> seems to be searching for two extremely common (almost every document
> will have something.co.us in it) terms "co" and "us", and the
> not-so-common "bob".  Searching only for "bob" is quick.
> 
> Would it make sense to add "co" and "us" to the stopword list to
> prevent that kind of catastrophic slowdown in search time?  Since the
> dataset is obviously about ".co.us" I feel it's kind of redundant to
> be searching for something you know is there...

It often does make sense to choose stopwords based on the vocabulary of
the text collection you are working with.  And "us" would probably be a
stopword in English anyway.
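
One simple way to find candidates is to count how many documents each
word appears in and flag the words present in most of them.  This is
only a sketch in plain Python: the function name, the whitespace
splitting and the 0.8 cut-off are all assumptions, and it doesn't use
Xapian at all.

```python
from collections import Counter

def stopword_candidates(documents, min_fraction=0.8):
    """Return words that occur in at least min_fraction of the documents."""
    doc_freq = Counter()
    for text in documents:
        # Use a set so each document counts a word once (document frequency).
        doc_freq.update(set(text.lower().split()))
    threshold = min_fraction * len(documents)
    return sorted(w for w, n in doc_freq.items() if n >= threshold)

docs = ["bob co us", "alice co us", "carol co us shop", "dave org"]
print(stopword_candidates(docs, min_fraction=0.75))
```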

But here bob.co.us is interpreted as a phrase, and stopwords are included
in phrases by the QueryParser.

In this case, I'm not sure you would want to ignore the ".co.us" part
anyway - "bob.co.us" probably has a meaning sufficiently distinct from
that of "bob" that you wouldn't want to conflate them.

If you aren't already using Xapian 1.2, it's worth upgrading: phrase
searching should be faster with the new default chert backend.

The patch in this ticket can also make a huge difference to slow phrase
cases:

http://trac.xapian.org/ticket/394

It really needs cleaning up and folding into trunk, but I've not had
time to do so yet.  If you try it, feedback would be much appreciated.

Another option would be to treat '.' as a word character when between
two letters, and so tokenise bob.co.us as a single term, but that's not
supported by TermGenerator and QueryParser currently, so you'd have to
patch Xapian or tokenise documents and queries yourself.
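
If you go the do-it-yourself route, the tokenising rule might look
something like the sketch below.  It is plain Python with a
hypothetical tokenise() helper, not anything Xapian provides, and it
only handles ASCII letters; you'd need to apply the same function to
both the document text and the query text so the terms match.

```python
import re

# A token is a run of word characters; a '.' is kept inside the token only
# when it has a letter immediately before and after it.
TOKEN_RE = re.compile(r"\w+(?:(?<=[A-Za-z])\.(?=[A-Za-z])\w+)*")

def tokenise(text):
    """Split text into lowercase terms, keeping names like bob.co.us whole."""
    return TOKEN_RE.findall(text.lower())

print(tokenise("Mail bob.co.us today. Version 1.2 is out."))
```

With this rule a sentence-ending '.' and the '.' in "1.2" still split
terms, since neither sits between two letters.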

Cheers,
    Olly


