[Xapian-discuss] ideas on picking stopwords

Ben Campbell ben at scumways.com
Mon Mar 30 13:55:26 BST 2009


Deron Meranda wrote:
> On Thu, Mar 26, 2009 at 12:14 PM, Ben Campbell <ben at scumways.com> wrote:
>> I'm looking at adding some stopwords to my indexing procedure, and was
>> wondering if anyone had any good rules of thumb on how to pick which
>> words to blacklist. It all seems a little... well... vague. Although I
>> guess it kind of depends on the sort of documents you're wanting to index.
> 
> You may want to consider not defining any stop words at all.  You
> probably don't need them, and using them has down sides and
> very little positive benefit in most cases.

My main aim is to reduce the size of my index, so I'm prepared to suffer 
the odd loss of capability here and there :-)
It seems that the xapian termgenerator indexes unstemmed versions of 
stopwords anyway - it's only the stemmed versions which are omitted... I 
guess this is so exact phrase searches can still be done.
Any other downsides I should look out for?

> See http://xapian.org/docs/stemming.html (near the end), which describes
> them.
> If you do want them though, you could in fact use Xapian itself to
> give you the list.  Just index everything completely first, to get
> a "corpus".  Then Xapian can tell you the most frequent terms.
> Those would supposedly become your stop words; and you can
> go back and re-index everything again with the stop words in
> place.  If you want to.

Yep, that's what I've been doing - and I'm glad to see some hints in 
docs/stemming.html that my approach is more or less the right one!
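
For anyone wanting to try the same thing, here is a rough sketch of the "index first, then pull out the most frequent terms" idea in Python. The function names, the 50% document-fraction cutoff and the limit of 50 words are all just made up for illustration. The Xapian part assumes the Python bindings are installed and that terms starting with a capital letter (such as the "Z"-prefixed stemmed forms) are prefixed terms to be skipped.

```python
def pick_stopwords(term_freqs, doc_count, min_doc_fraction=0.5, limit=50):
    """Pick stopword candidates from (term, document frequency) pairs.

    A term qualifies if it occurs in at least min_doc_fraction of the
    documents. Both the cutoff and the limit are arbitrary illustrative
    defaults, not recommended values.
    """
    threshold = min_doc_fraction * doc_count
    common = [(term, freq) for term, freq in term_freqs if freq >= threshold]
    # Most frequent first; ties broken alphabetically for a stable result.
    common.sort(key=lambda tf: (-tf[1], tf[0]))
    return [term for term, _ in common[:limit]]


def xapian_term_freqs(db_path):
    """Read (term, termfreq) pairs from an existing Xapian database.

    Assumes the xapian Python bindings are available; imported lazily so
    the rest of this sketch runs without them.
    """
    import xapian

    db = xapian.Database(db_path)
    freqs = []
    for item in db.allterms():
        term = item.term
        if isinstance(term, bytes):
            term = term.decode("utf-8", "replace")
        # Assumption: capitalised terms are prefixed (e.g. "Z" stems); skip them.
        if term[:1].isupper():
            continue
        freqs.append((term, item.termfreq))
    return freqs, db.get_doccount()


if __name__ == "__main__":
    # Toy frequencies standing in for a real index.
    sample = [("the", 95), ("and", 90), ("xapian", 12), ("of", 90)]
    print(pick_stopwords(sample, doc_count=100))
```

With a real index you would call something like `pick_stopwords(*xapian_term_freqs("/path/to/db"))` (the path is a placeholder) and then eyeball the result before blacklisting anything.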

I suppose I should use the same stopword list (and strategy) when 
constructing queries too - am I right in thinking this?
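
If the same list does turn out to be wanted on both sides, one way to keep indexing and querying in step might be to build a single stopper and hand it to both the term generator and the query parser. This is only a sketch: the helper names, the "english" stemmer and the STEM_SOME strategy are assumptions for illustration, and it needs the Xapian Python bindings to actually build the objects.

```python
def normalise_stopwords(words):
    """Lowercase, strip and de-duplicate a stopword list, keeping order."""
    seen = set()
    result = []
    for word in words:
        word = word.strip().lower()
        if word and word not in seen:
            seen.add(word)
            result.append(word)
    return result


def build_indexer_and_parser(stopwords, language="english"):
    """Create a TermGenerator and QueryParser sharing one stopper.

    Assumes the xapian Python bindings are installed (imported lazily).
    The stopper is returned too, so the caller keeps a reference to it.
    """
    import xapian

    stopper = xapian.SimpleStopper()
    for word in normalise_stopwords(stopwords):
        stopper.add(word)

    stemmer = xapian.Stem(language)

    indexer = xapian.TermGenerator()
    indexer.set_stemmer(stemmer)
    indexer.set_stopper(stopper)

    parser = xapian.QueryParser()
    parser.set_stemmer(stemmer)
    parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    parser.set_stopper(stopper)

    return stopper, indexer, parser


if __name__ == "__main__":
    print(normalise_stopwords(["The", " and ", "the", "", "Of"]))
```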

Thanks for the reply!
Ben.



