[Xapian-discuss] termgenerator + stopper behaviour

John Leach john at johnleach.co.uk
Sat Jul 18 19:24:03 BST 2009


On Sat, 2009-07-18 at 18:00 +0100, Olly Betts wrote:
> On Sat, Jul 18, 2009 at 12:33:29PM +0100, John Leach wrote:
> > I'd expect the TermGenerator to not return terms that the stopper
> > returned true for, but it does.  This is the expected behaviour right?
> 
> Actually, no.  The stopper is used to avoid indexing stemmed forms of
> stopwords, but we still index the unstemmed forms so that searches for
> phrases containing stopwords can be supported.

Ah yes, I see that if I add the stemmer, the stopped words are not
stemmed:

["Zbrown", "Zfox", "Zquick", "and", "brown", "fox", "quick", "the"]
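
To illustrate the rule Olly describes, here is a toy model in plain Python. It is not Xapian code: `toy_stem`, `generate_terms`, the input text and the two-word stop list are all made up for illustration. It only mimics the behaviour shown above, where every word is indexed unstemmed and only non-stopwords also get a "Z"-prefixed stemmed term.

```python
def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a trailing "s"
    # from words longer than three characters.
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def generate_terms(text, stopwords):
    terms = set()
    for word in text.lower().split():
        # The unstemmed form is always indexed, stopword or not.
        terms.add(word)
        # The stemmed ("Z"-prefixed) form is skipped for stopwords.
        if word not in stopwords:
            terms.add("Z" + toy_stem(word))
    return sorted(terms)


# Assumed input text and stop list, chosen to reproduce the term list above.
terms = generate_terms("the quick and brown fox", {"the", "and"})
print(terms)
```

The sorted result has the same shape as the list above: "the" and "and" appear only in unstemmed form, while the other words appear both ways.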


> This isn't mentioned in the collated API documentation, but is here:
> 
>     http://xapian.org/docs/termgenerator.html
> 
> I'll add a note to the API documentation.

Thanks, that'll probably save a bit of headscratching for those partial
to only API docs :)

> There probably should be an option to not index stopwords at all, but
> there isn't at the moment.

I'd agree - in fact, I'd gather those comparisons of indexers I've seen
lately probably don't take that into account (I'm guessing most other
tokenizers don't index stop words, which means Xapian is doing more work
than them!)

I just did a *quick* test with the termgenerator.html text.  Adding that
text 1000 times to an in-memory Xapian database took an average of 3.6
seconds on my hardware.

If I then removed the stop words[1] from the text and reran the tests it
was averaging 2.3 seconds.  

Doing this with a disk based db, the database size dropped from 7.6M to
5.6M.

So not indexing stopwords in this case increased indexing throughput by
over 50% (about 36% less time) and reduced disk-database size by about 26%.
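
A quick arithmetic check on the figures quoted above, as a sketch (the variable names are mine; the numbers are the ones from this quick test):

```python
# Timings (seconds) and on-disk sizes (MB) from the quick test above.
time_with_stopwords = 3.6
time_without_stopwords = 2.3
size_with_stopwords = 7.6
size_without_stopwords = 5.6

# Throughput gain: how much more indexing gets done per unit time.
throughput_gain_pct = (time_with_stopwords / time_without_stopwords - 1) * 100

# Time saved, relative to the original run.
time_reduction_pct = (
    (time_with_stopwords - time_without_stopwords) / time_with_stopwords * 100
)

# Disk space saved, relative to the original database.
size_reduction_pct = (
    (size_with_stopwords - size_without_stopwords) / size_with_stopwords * 100
)

print(f"throughput gain: {throughput_gain_pct:.1f}%")
print(f"time reduction:  {time_reduction_pct:.1f}%")
print(f"size reduction:  {size_reduction_pct:.1f}%")
```

Note that "faster by X%" depends on whether you measure throughput or elapsed time, so the two figures differ for the same pair of timings.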

This obviously isn't a reliable prediction of how the TermGenerator would
perform if it ignored stop words, as it would have to do the stopping on
every document (this test removed the stopwords once, before the test
began :) but it looks interesting.
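
For reference, the kind of one-off pre-filtering used in this test can be sketched as below. This is a guess at the approach rather than the script actually used: `strip_stopwords` and the tiny `STOPWORDS` set are hypothetical, and a real run would load the full list from [1].

```python
import re

# Hypothetical, tiny stop list for illustration only.
STOPWORDS = {"the", "and", "of", "a"}


def strip_stopwords(text, stopwords=STOPWORDS):
    # Split into word tokens; punctuation is dropped in this simple sketch.
    words = re.findall(r"\w+", text)
    # Keep every word whose lowercased form is not in the stop list.
    return " ".join(w for w in words if w.lower() not in stopwords)


cleaned = strip_stopwords("The quick brown fox and the lazy dog")
print(cleaned)
```

Text filtered this way loses the stopwords entirely, so phrase searches containing them would no longer match, which is the trade-off Olly mentions.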

Thanks for your help Olly.

John.
http://johnleach.co.uk

[1]http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words




