[Xapian-discuss] ideas on picking stopwords
Ben Campbell
ben at scumways.com
Thu Mar 26 16:14:00 GMT 2009
I'm looking at adding some stopwords to my indexing procedure, and was
wondering if anyone had any good rules of thumb on how to pick which
words to blacklist. It all seems a little... well... vague. Although I
guess it kind of depends on the sort of documents you're wanting to index.
My current idea is to write a little script to output the terms with the
highest frequency in my existing database (just over 1 million
documents), manually eyeball that list to make sure it's sensible, and
then use them as my stopwords.
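A rough sketch of what such a script might look like, assuming the Xapian Python bindings are installed. The helper names `top_terms` and `xapian_term_freqs` are made up for illustration. The sketch also assumes that terms starting with a capital ASCII letter are prefixed terms (the usual Xapian convention) and should be skipped. Note that `termfreq` here is the number of documents a term occurs in, not its total occurrence count.

```python
import heapq


def top_terms(term_freqs, n=100):
    """Return the n (term, freq) pairs with the highest frequency, highest first."""
    return heapq.nlargest(n, term_freqs, key=lambda pair: pair[1])


def xapian_term_freqs(db_path):
    """Yield (term, document frequency) for unprefixed terms in a Xapian database.

    Hypothetical helper: assumes the Python bindings are importable as `xapian`.
    """
    import xapian  # imported lazily so top_terms() works without the bindings

    db = xapian.Database(db_path)
    for item in db.allterms():
        term = item.term
        if isinstance(term, bytes):
            term = term.decode("utf-8", "replace")
        # Assumption: a leading capital ASCII letter marks a prefixed term.
        if term and not ("A" <= term[0] <= "Z"):
            yield term, item.termfreq


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        # Print candidates for manual review, most frequent first.
        for term, freq in top_terms(xapian_term_freqs(sys.argv[1]), n=200):
            print(f"{freq}\t{term}")
```

The output is only a list of candidates; it still needs the manual pass described above before any of it is used as stopwords.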
Are there any more "correct" approaches that people could suggest?
(I only need to worry about English for now, which helps a
little :-)
Thanks,
Ben.