[Xapian-discuss] Get a list of all terms in an indexed corpus

VanL van.lindberg at gmail.com
Fri Oct 8 15:38:27 BST 2010


Hello,

I have a corpus that I have indexed with Xapian/Xappy and I would now
like to generate a corpus-specific list of stopwords. (This is a
technical corpus, so a typical stopword list wouldn't be helpful.)

My first thought was to ask the Xapian database for a list of all terms,
each with its frequency. My intuition is that I could assemble a list of
stopwords by examining the head and tail of that list when it is sorted
by frequency. This would allow me to exclude both terms that are too
common and terms that are unique but non-informative.
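
To make the idea concrete, here is a rough sketch of the head-and-tail
selection I have in mind. It assumes I can somehow obtain (term,
document-frequency) pairs and the total document count from the
database, which is the part I am asking about. The function name and the
two thresholds (max_df_ratio, min_df) are placeholders of mine, not
anything from Xapian or Xappy.

```python
def candidate_stopwords(term_freqs, doc_count, max_df_ratio=0.5, min_df=1):
    """Split terms into 'too common' and 'too rare' stopword candidates.

    term_freqs   -- iterable of (term, doc_freq) pairs (assumed input)
    doc_count    -- total number of documents in the corpus
    max_df_ratio -- hypothetical cutoff: share of documents above which
                    a term counts as too common
    min_df       -- hypothetical cutoff: document frequency at or below
                    which a term counts as too rare
    """
    if doc_count <= 0:
        raise ValueError("doc_count must be positive")

    # Sort by descending frequency, then alphabetically for stable output.
    ranked = sorted(term_freqs, key=lambda tf: (-tf[1], tf[0]))

    # Head of the list: terms that appear in a large share of documents.
    too_common = [t for t, f in ranked if f / doc_count >= max_df_ratio]

    # Tail of the list: terms that appear in almost no documents
    # (and were not already classified as too common).
    too_rare = [t for t, f in ranked
                if f <= min_df and f / doc_count < max_df_ratio]

    return too_common, too_rare
```

The cutoffs would certainly need tuning by eye against the real list;
the point is only that everything needed is one pass over the terms and
their frequencies.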

Is there a good way to get this information?

Thanks,

Van



