[Xapian-discuss] ideas on picking stopwords
Ben Campbell
ben at scumways.com
Mon Mar 30 14:02:36 BST 2009
Olly Betts wrote:
> Snowball has lists for a number of the supported languages, e.g.
> English is here:
> http://snowball.tartarus.org/algorithms/english/stop.txt
Ahh - that's what I was looking for! (although, I guess it also depends
on the nature of your corpus - there may be other terms which are
useless due to their high frequency).
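One rough way to spot such corpus-specific terms is to count document frequency and flag anything that occurs in most documents. This is only a sketch: the `candidate_stopwords` helper, the 0.8 cut-off, the crude tokenising regex and the sample documents are all made up for illustration and are not part of Xapian or Snowball.

```python
from collections import Counter
import re

def candidate_stopwords(docs, min_doc_fraction=0.8):
    """Return terms that occur in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for text in docs:
        # Count each term once per document (document frequency, not raw count).
        doc_freq.update(set(re.findall(r"[a-z']+", text.lower())))
    threshold = min_doc_fraction * len(docs)
    return sorted(term for term, df in doc_freq.items() if df >= threshold)

# Hypothetical mini-corpus; a real check would run over the whole collection.
docs = [
    "the minister said the budget would rise",
    "the council said the road would close",
    "a minister said taxes would fall",
]
print(candidate_stopwords(docs))  # ['said', 'would']
```

Here "said" and "would" turn up in every document, so they are candidates even where a generic list does not include them. The output still needs checking by hand, since a very frequent term can be a perfectly good search term.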
> Academic studies often seem to use large stopword lists, often
> including a few words that seem like useful search terms, but web search
> engines generally don't seem to use stopword lists at all. It used to
> be impossible to search for "the" in Google if my memory serves, and
> Google definitely used to ignore a small number of common words unless
> prefixed by "+", but neither is the case now.
Just to clarify - when using the Xapian TermGenerator with a stopper, it
still adds the unstemmed version of stopped terms, omitting only the
stemmed (Z-prefixed) version, right?
> Managing Gigabytes suggests indexing all words, pointing out that the
> posting lists and positional information for very common terms
> compress very well, so they don't take up as much space as you might
> expect.
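To get a feel for why common terms compress well, here is a small sketch that stores a sorted posting list as gaps between document ids and encodes each gap with a variable-byte code. The variable-byte scheme is an assumption picked for simplicity. It is not necessarily the coding Managing Gigabytes recommends, nor what Xapian uses internally, and the helper names and document counts are invented.

```python
def vbyte_encode(numbers):
    """Variable-byte encode non-negative ints: 7 data bits per byte, low bits first.

    The high bit is set on the final byte of each number.
    """
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)

def encoded_size(doc_ids):
    """Bytes needed to store a sorted posting list as v-byte encoded gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return len(vbyte_encode(gaps))

common = list(range(1, 100001))         # term present in all 100,000 documents
rare = list(range(1000, 100001, 1000))  # term present in every 1000th document

print(encoded_size(common) / len(common))  # 1.0 byte per posting
print(encoded_size(rare) / len(rare))      # 2.0 bytes per posting
```

The very common term has a gap of 1 everywhere, so each posting costs a single byte, against 4 bytes for a raw 32-bit document id. Bit-level codes would shrink the small gaps further still.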
Hmm, interesting - I've not done any really comprehensive
comparisons on my data yet, but I will do.
Thanks!
Ben.