[Xapian-discuss] ideas on picking stopwords

Mon Mar 30 14:02:36 BST 2009

Olly Betts wrote:
> Snowball has lists for a number of the supported languages, e.g.
> English is here:
> http://snowball.tartarus.org/algorithms/english/stop.txt

Ahh - that's what I was looking for! (Although I guess it also depends
on the nature of your corpus: there may be other terms which are
useless due to their high frequency.)
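
A rough sketch of how corpus-specific candidates might be found, assuming
the documents are available as plain strings. It counts document frequency
and reports terms that appear in most documents. The function name, the
tokenising regex and the 0.8 cut-off are all my own assumptions, not
anything Xapian or Snowball provides:

```python
import re
from collections import Counter


def candidate_stopwords(docs, min_doc_fraction=0.8):
    """Return terms occurring in at least min_doc_fraction of the documents.

    Hypothetical helper: the tokeniser and threshold are illustrative only.
    """
    df = Counter()
    for text in docs:
        # Use a set so each term is counted once per document
        # (document frequency, not raw term frequency).
        df.update(set(re.findall(r"[a-z0-9']+", text.lower())))
    threshold = min_doc_fraction * len(docs)
    return sorted(term for term, n in df.items() if n >= threshold)


docs = [
    "the xapian index stores the terms",
    "xapian uses a stopper for the query",
    "the database holds xapian postings",
]
print(candidate_stopwords(docs))  # ['the', 'xapian']
```

In this toy corpus "xapian" turns up alongside "the", which is the sort of
term a generic list like the Snowball one would not include. Any candidates
would still want checking by hand before being passed to a stopper.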

> Academic studies often seem to use large stopword lists, often
> including a few words that seem like useful search terms, but web search
> engines generally don't seem to use stopword lists at all.  It used to
> be impossible to search for "the" in Google if my memory serves, and
> Google definitely used to ignore a small number of common words unless
> prefixed by "+", but neither is the case now.

Just to clarify - when using the Xapian TermGenerator with a stopper, it 
still adds the unstemmed version of stopped terms, omitting only the 
stemmed (Z-prefixed) version, right?

> Managing Gigabytes suggests indexing all words, pointing out that the
> posting lists and positional information for very common terms
> compress very well, so they don't take up as much space as you might
> expect.

Hmm, interesting - I've not done any really comprehensive
comparisons on my data yet, but I will do.
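
To convince myself of the compression point, here is a minimal sketch using
gap encoding plus variable-byte integers, which is one of the textbook
schemes. This is not Xapian's actual on-disk format, and the helper names
and document ids are made up for illustration:

```python
def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte, high bit = 'more follows'."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        out.append(n)
    return bytes(out)


def encode_postings(doc_ids):
    """Encode a sorted list of document ids as gaps between neighbours."""
    gaps = [b - a for a, b in zip([0] + doc_ids, doc_ids)]
    return vbyte_encode(gaps)


# Hypothetical data: a term in every one of 1000 documents,
# and a term in only 10 widely spaced documents.
common = list(range(1, 1001))
rare = [1000 * i for i in range(1, 11)]

print(len(encode_postings(common)) / len(common))  # 1.0 byte per posting
print(len(encode_postings(rare)) / len(rare))      # 2.0 bytes per posting
```

The common term's list is still larger in total, but each posting costs
less because the gaps are small, which fits the argument for indexing
every word.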

Thanks!
Ben.