[Xapian-discuss] Get a list of all terms in an indexed corpus

Richard Boulton richard at tartarus.org
Fri Oct 8 16:21:23 BST 2010


On 8 October 2010 15:38, VanL <van.lindberg at gmail.com> wrote:
> Hello,
>
> I have a corpus that I have indexed with xapian/xappy and I would now
> like to generate a corpus-specific list of stopwords. (This is a
> technical corpus, so a typical stopword list wouldn't be helpful.)

Xapian doesn't store a list of terms sorted by frequency, so you'll
need to do that sorting yourself outside Xapian.

Using xappy, you can call
SearchConnection.iter_terms_for_field(fieldname) to get an iterator
over the terms generated from a given field.  However, this doesn't
return the frequencies of the terms, and returns them in lexicographic
order.

Using xapian, you can call xapian.Database.allterms() to get an
iterator over all the terms.  This iterator returns
xapian.TermListItem objects, which have a .termfreq property
containing the number of documents the term occurs in (and a .term
property containing the term string itself).  You'll still need to
sort the frequencies, but this should give you what you need.
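
As a rough sketch of the xapian approach above (not tested against your
setup): the helper names terms_by_frequency and stopword_candidates, the
db_path argument and the limit parameter are made up for illustration.
The only xapian calls used are xapian.Database(path) and
Database.allterms(), plus the .term and .termfreq properties described
above.

```python
def terms_by_frequency(term_items, limit=None):
    """Return (term, termfreq) pairs, most frequent first.

    term_items is any iterable of objects with .term and .termfreq
    attributes, such as the items yielded by xapian.Database.allterms().
    Ties are broken by the term itself so the order is deterministic.
    """
    pairs = [(item.term, item.termfreq) for item in term_items]
    pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    return pairs if limit is None else pairs[:limit]


def stopword_candidates(db_path, limit=100):
    """Hypothetical wrapper: list the most common terms in a database."""
    # Imported here so the sorting helper can be used without xapian.
    import xapian

    db = xapian.Database(db_path)
    # Depending on the bindings, .term may be a byte string, not text.
    return terms_by_frequency(db.allterms(), limit)
```

The terms at the top of that list are only candidates for stopwords.
allterms() returns every term in the database, including any prefixed
terms the indexer generated, so you would probably want to filter and
review the list by hand.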

Hope this helps,

-- 
Richard
