[Xapian-discuss] Statistics across the database

Richard Boulton richard at lemurconsulting.com
Mon Apr 16 12:03:33 BST 2007


Andreas Marienborg wrote:
> I was wondering if there is any easy way to retrieve statistics for the 
> database, like the top 1000 terms for instance?

As you've probably guessed from the silence, there isn't really much 
support for this.  A few statistics, like the number of documents in 
the database, are easy to get: see the documentation for the 
Xapian::Database class - in particular, get_doccount(), get_lastdocid() 
and get_avlength() may be of interest.
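
As a rough sketch of reading those three values, assuming the Python 
bindings expose the methods under the same names as the C++ API: the 
helper name database_summary is hypothetical, and it accepts any object 
providing these three methods.

```python
def database_summary(db):
    """Collect the basic statistics mentioned above from a database object.

    `db` is assumed to expose get_doccount(), get_lastdocid() and
    get_avlength(), as a Xapian database object does.
    """
    return {
        "doccount": db.get_doccount(),    # number of documents
        "lastdocid": db.get_lastdocid(),  # highest document id used so far
        "avlength": db.get_avlength(),    # average document length
    }
```

With the bindings installed, this would presumably be called with an 
opened database, e.g. database_summary(xapian.Database(path)), where 
path is a placeholder for your own database location.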

If you tell us which statistics you actually want, though, we might be 
able to suggest ways to get them.

Regarding the "top 1000 terms" you mention - the first question is "top" 
by what measure?  You probably want to rank by something like the term 
weight used by the standard ranking formula, but you could also rank by 
absolute frequency, or by some heuristic based on what the terms look like.

Xapian itself doesn't implement a calculation of this statistic, but you 
could implement it outside Xapian, using an "allterms" iterator to get 
the list of all terms, calculating a weight for each, and then keeping 
the top 1000.  The weight calculation would probably depend on the term 
frequency, which is easily available from the term iterator.
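
A minimal sketch of the "keep the top 1000" step, written in Python and 
kept independent of Xapian so that the weighting can be swapped.  The 
function name top_terms and its weight parameter are hypothetical; the 
default weight is simply the raw term frequency, which is only one of 
the possible measures discussed above.

```python
import heapq


def top_terms(term_freqs, n=1000, weight=None):
    """Return the n highest-weighted (term, weight) pairs, best first.

    term_freqs: iterable of (term, termfreq) pairs.
    weight: hypothetical hook mapping (term, termfreq) to a score;
            defaults to the raw term frequency.
    """
    if weight is None:
        weight = lambda term, termfreq: termfreq
    # Score lazily so the full term list never has to be held in memory.
    scored = ((term, weight(term, termfreq)) for term, termfreq in term_freqs)
    # heapq.nlargest only needs to track n candidates while scanning.
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])
```

With the Python bindings, the pairs could probably be supplied from the 
allterms iterator with something like 
((item.term, item.termfreq) for item in db.allterms()); treat the exact 
attribute names as an assumption and check them against the bindings' 
documentation.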

It might be interesting to add this to Xapian, but I'm not sure what it 
would be used for.  What would you use the information for?

Other statistics, like "which terms are most frequently used in 
searches", can't be calculated from Xapian since it doesn't keep the 
necessary logs, but it might be interesting for an application built on 
top of Xapian to keep track of them.

-- 
Richard


