[Xapian-discuss] Statistics across the database
Richard Boulton
richard at lemurconsulting.com
Mon Apr 16 12:03:33 BST 2007
Andreas Marienborg wrote:
> I was wondering if there is any easy way to retrieve statistics for the
> database, like the top 1000 terms for instance?
As you've probably guessed from the silence, there isn't really much
built-in support for database-wide statistics. A few are easy to get,
such as the number of documents in the database: see the documentation
for the database object - in particular, get_doccount(), get_lastdocid()
and get_avlength() may be of interest.
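As a rough sketch of reading those three values through the Python
bindings: the summarise() helper and the FakeDatabase stand-in below are
made up for illustration, and with the bindings installed you would pass
a real xapian.Database object instead.

```python
def summarise(db):
    """Collect the basic statistics exposed by a Xapian database object.

    `db` is assumed to provide get_doccount(), get_lastdocid() and
    get_avlength(), as xapian.Database does.
    """
    return {
        "doccount": db.get_doccount(),    # number of documents
        "lastdocid": db.get_lastdocid(),  # highest document id used so far
        "avlength": db.get_avlength(),    # average document length
    }


class FakeDatabase:
    """Hypothetical stand-in so the sketch runs without Xapian installed."""

    def get_doccount(self):
        return 3

    def get_lastdocid(self):
        return 5

    def get_avlength(self):
        return 42.0


if __name__ == "__main__":
    # Real use (assumes the xapian module and an existing database path):
    #     import xapian
    #     print(summarise(xapian.Database("/path/to/db")))
    print(summarise(FakeDatabase()))
```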
If you say what statistics you actually want though, we might be able to
suggest ways to get them.
Regarding the "top 1000 terms" you mention - the first question is "top"
by what measure? You probably want to rank by something like the term
weight used by the standard ranking formula, but you could also rank by
absolute frequency, or by some heuristic based on what the terms look like.
Xapian itself doesn't implement a calculation of this statistic, but you
could implement it outside Xapian, using an "allterms" iterator to get
the list of all terms, calculating a weight for each, and then keeping
the top 1000. The weight calculation would probably depend on the term
frequency, which is easily available from the term iterator.
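The approach above could be sketched roughly as follows in Python. The
function names are invented for this example, and idf_weight() is only
one possible weight (a simplified inverse-document-frequency measure,
not necessarily what the standard ranking formula computes). The code
works on plain (term, termfreq) pairs so it runs without Xapian; with
the Python bindings, the pairs would come from the allterms iterator as
shown in the comment.

```python
import heapq
import math


def idf_weight(termfreq, doccount):
    """One possible weight: a simplified inverse document frequency.

    This is an assumption for illustration; rarer terms score higher.
    """
    return math.log((doccount - termfreq + 0.5) / (termfreq + 0.5))


def top_terms(term_stats, n, weight=None):
    """Return the n highest-weighted terms as (weight, term) pairs.

    term_stats: iterable of (term, termfreq) pairs.
    weight: function mapping termfreq to a score; defaults to ranking
            by absolute frequency.
    """
    if weight is None:
        weight = lambda termfreq: termfreq
    # nlargest keeps only n candidates in memory while scanning all terms.
    return heapq.nlargest(n, ((weight(tf), term) for term, tf in term_stats))


if __name__ == "__main__":
    # With the Python bindings (assumed installed), something like:
    #     import xapian
    #     db = xapian.Database("/path/to/db")
    #     stats = ((t.term, t.termfreq) for t in db.allterms())
    #     top = top_terms(stats, 1000)
    stats = [("apple", 5), ("banana", 10), ("cherry", 1)]
    print(top_terms(stats, 2))
    print(top_terms(stats, 2, weight=lambda tf: idf_weight(tf, 100)))
```

Ranking by absolute frequency tends to surface very common words, while
an IDF-style weight favours rare ones, so the right choice depends on
what the list is for.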
It might be interesting to add this to Xapian, but I'm not sure what it
would be used for. What would you use the information for?
Other statistics, like "which terms are most frequently used in
searches" can't be calculated from Xapian since it doesn't keep the
necessary logs, but it might be interesting for an application built on
top of Xapian to keep track of them.
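A minimal sketch of how an application might keep such a record itself,
in Python. The QueryTermLog class is hypothetical and not part of
Xapian; it assumes the application already has the list of terms for
each query, and it keeps the counts in memory only.

```python
from collections import Counter


class QueryTermLog:
    """Hypothetical in-memory tally of terms used in searches."""

    def __init__(self):
        self._counts = Counter()

    def record(self, query_terms):
        # Count each term once per query, even if it is repeated.
        self._counts.update(set(query_terms))

    def most_frequent(self, n):
        # Returns (term, number_of_queries) pairs, most frequent first.
        return self._counts.most_common(n)


if __name__ == "__main__":
    log = QueryTermLog()
    log.record(["xapian", "index"])
    log.record(["xapian", "search"])
    log.record(["xapian", "search", "search"])
    print(log.most_frequent(2))
```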
--
Richard