[Xapian-discuss] ideas on picking stopwords

Olly Betts olly at survex.com
Sat Mar 28 05:45:40 GMT 2009


On Fri, Mar 27, 2009 at 04:58:05PM -0400, Deron Meranda wrote:
> On Thu, Mar 26, 2009 at 12:14 PM, Ben Campbell <ben at scumways.com> wrote:
> > I'm looking at adding some stopwords to my indexing procedure, and was
> > wondering if anyone had any good rules of thumb on how to pick which
> > words to blacklist. It all seems a little... well... vague. Although I
> > guess it kind of depends on the sort of documents you're wanting to index.

Snowball has lists for a number of the supported languages, e.g.
English is here:

http://snowball.tartarus.org/algorithms/english/stop.txt
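
If you do want to try such a list, loading it is simple. Here is a minimal sketch, assuming the file format is "stopwords at the start of a line, with anything after a vertical bar being a comment". The function names and the tiny sample text are made up for illustration and are not taken from the real list.

```python
def parse_stop_list(text):
    """Return the set of stopwords found in a Snowball-style stop list."""
    words = set()
    for line in text.splitlines():
        # Assumption: everything from the first "|" onwards is a comment.
        line = line.split("|", 1)[0]
        # Whatever whitespace-separated tokens remain are taken as stopwords.
        words.update(line.split())
    return words


def filter_terms(terms, stopwords):
    """Drop terms whose lowercased form is in the stopword set."""
    return [t for t in terms if t.lower() not in stopwords]


# Hypothetical sample in the assumed format (not the real list).
SAMPLE = """\
 | a made-up sample stop list
i        | pronoun
me
the
and
"""

stopwords = parse_stop_list(SAMPLE)
print(filter_terms("The cat and the hat".split(), stopwords))
# prints ['cat', 'hat']
```

Keeping the filter as a separate step makes it easy to index the same documents with and without the list and compare the results.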

> You may want to consider not defining any stop words at all.  You
> probably don't need them, and using them has down sides and
> very little positive benefit in most cases.
> 
> See http://xapian.org/docs/stemming.html (near the end), which describes
> them.

Academic studies often seem to use large stopword lists, frequently
including a few words that seem like useful search terms, but web search
engines generally don't seem to use stopword lists at all.  It used to
be impossible to search for "the" in Google if my memory serves, and
Google definitely used to ignore a small number of common words unless
prefixed by "+", but neither is the case now.

Managing Gigabytes suggests indexing all words, pointing out that the
posting lists and positional information for very common terms
compress very well, so they don't take up as much space as you might
expect.
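
To illustrate why, here is a rough sketch of one common approach: store the gaps between successive document ids and write each gap with a variable-length byte code. This is only an illustration of the general idea, not a description of the encoding Xapian or the book actually uses, and the function names and document counts are invented for the example.

```python
def vbyte_encode(n):
    """Encode a non-negative integer in 7-bit groups, low-order group first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)                      # final byte has the high bit clear
    return bytes(out)


def encode_postings(doc_ids):
    """Encode an ascending sequence of document ids as variable-byte gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += vbyte_encode(doc_id - prev)  # store the gap, not the id
        prev = doc_id
    return bytes(out)


N = 100_000                          # hypothetical collection size
common = range(1, N + 1)             # a term like "the": in every document
rare = range(1000, N + 1, 1000)      # a term in 1 document out of 1000

common_bytes = len(encode_postings(common))
rare_bytes = len(encode_postings(rare))

print(common_bytes / len(common))    # prints 1.0 (bytes per posting)
print(rare_bytes / len(rare))        # prints 2.0 (bytes per posting)
```

The very common term has tiny gaps, so each posting costs half as much as one for the rare term, and a quarter of what a fixed 4-byte document id would cost. A byte-aligned code can't go below one byte per posting, so a bit-level code would do better still on the common term.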

Cheers,
    Olly
