[Xapian-discuss] ideas on picking stopwords

Ben Campbell ben at scumways.com
Thu Mar 26 16:14:00 GMT 2009


I'm looking at adding some stopwords to my indexing procedure, and was 
wondering if anyone had any good rules of thumb on how to pick which 
words to blacklist. It all seems a little... well... vague. Although I 
guess it kind of depends on the sort of documents you're wanting to index.

My current idea is to write a little script to output the terms with the 
highest frequency in my existing database (just over 1 million 
documents), manually eyeball that list to make sure it's sensible, and 
then use them as my stopwords.
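
A rough sketch of such a script, assuming the Xapian Python bindings are installed and the index was built in the usual way, so that prefixed terms (including "Z"-prefixed stemmed forms) start with an uppercase ASCII letter. The function names top_terms, xapian_term_freqs and print_top_terms are made up for this illustration; only xapian.Database and its allterms() iterator come from the bindings. It ranks terms by document frequency, i.e. the number of documents each term appears in:

```python
import heapq


def top_terms(term_freqs, n=100, skip_prefixed=True):
    """Return the n (term, doc_freq) pairs with the highest document frequency.

    term_freqs is any iterable of (term, doc_freq) pairs.
    """
    def candidates():
        for term, freq in term_freqs:
            # Assumption: prefixed terms start with an uppercase ASCII
            # letter (Xapian's usual convention), so they are skipped here.
            if skip_prefixed and "A" <= term[:1] <= "Z":
                continue
            yield term, freq

    # nlargest avoids sorting the whole term list of a large database.
    return heapq.nlargest(n, candidates(), key=lambda pair: pair[1])


def xapian_term_freqs(db_path):
    """Yield (term, doc_freq) for every term in the database at db_path."""
    import xapian  # imported lazily so top_terms() works without the bindings

    db = xapian.Database(db_path)
    for item in db.allterms():
        term = item.term
        if isinstance(term, bytes):  # Python 3 bindings return bytes
            term = term.decode("utf-8", "replace")
        yield term, item.termfreq


def print_top_terms(db_path, n=200):
    """Print the n most frequent terms, one per line, for manual review."""
    for term, freq in top_terms(xapian_term_freqs(db_path), n=n):
        print(f"{freq}\t{term}")
```

Calling print_top_terms("/path/to/db") (the path is a placeholder) gives a candidate list to eyeball; nothing here decides which of those terms are actually safe to drop.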

Are there any more "correct" approaches that people could suggest?

(I only need to worry about English for now, which helps a 
little :-)

Thanks,
Ben.



