[Xapian-discuss] Statistics across the database
Andreas Marienborg
andreas at startsiden.no
Mon Apr 16 12:13:26 BST 2007
On Apr 16, 2007, at 1:03 PM, Richard Boulton wrote:
> Andreas Marienborg wrote:
>> I was wondering if there is any easy way to retrieve statistics
>> for the database, like the top 1000 terms for instance?
>
> If you say what statistics you actually want though, we might be
> able to suggest ways to get them.
>
The most important at the moment is the one I mentioned.
> Regarding the "top 1000 terms" you mention - the first question is
> "top" by what measure? You probably want to rank by something
> like the term weight used by the standard ranking formula, but you
> could also rank by absolute frequency, or by some heuristic based
> on what the terms look like.
>
Number of occurrences, and/or number of documents with at least one
occurrence of the term.
> Xapian itself doesn't implement a calculation of this statistic,
> but you could implement it outside Xapian, using an "allterms"
> iterator to get the list of all terms, calculating a weight for
> each, and then keeping the top 1000. The weight calculation would
> probably depend on the term frequency, which is easily available
> from the term iterator.
>
I tried that, but on an index with 1.6 million documents it quickly
became very slow to calculate, which is why I was hoping that Xapian
had some sort of built-in data that I could use :)
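
For reference, a minimal sketch of the approach Richard describes: walk all terms once and keep only the top N in a bounded heap, rather than sorting the whole term list. It assumes the Python bindings, where `Database.allterms()` yields items with `term` and `termfreq` attributes. The function name `top_terms` and the sample data are hypothetical, and the function takes plain (term, termfreq) pairs so that it runs without an index:

```python
import heapq


def top_terms(term_freqs, n=1000):
    """Return the n (term, termfreq) pairs with the highest termfreq.

    term_freqs is any iterable of (term, termfreq) pairs. With the Xapian
    Python bindings it could be built like this (assumption, not run here):
        ((item.term, item.termfreq) for item in db.allterms())
    """
    # heapq.nlargest makes a single pass and holds at most n items,
    # so memory stays bounded however many terms the index has.
    return heapq.nlargest(n, term_freqs, key=lambda pair: pair[1])


# Hypothetical sample data standing in for a real term list.
sample = [("the", 950), ("xapian", 40), ("and", 900), ("index", 75)]
top2 = top_terms(sample, n=2)
print(top2)
```

Note that Xapian's "term frequency" is the number of documents containing the term. The total number of occurrences is the collection frequency, which would have to be looked up per term (for example with `Database.get_collection_freq()`), and that is likely to be slower still.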
> It might be interesting to add this to Xapian, but I'm not sure
> what it would be used for. What would you use the information for?
>
I was going to use it to find words that are used in too many
documents to have any significant meaning in determining groups of
documents. That is, terms that I would like to ignore in some other
algorithms because they are too common.
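
A rough sketch of that filtering step, assuming a list of (term, document frequency) pairs and the total document count are already available (with Xapian these could come from the allterms iterator and `Database.get_doccount()`). The name `too_common_terms`, the 50% cut-off and the sample numbers are all made up for illustration:

```python
def too_common_terms(term_freqs, doc_count, max_fraction=0.5):
    """Return the set of terms that appear in more than max_fraction of documents.

    term_freqs: iterable of (term, number of documents containing the term).
    doc_count:  total number of documents in the index.
    max_fraction is an arbitrary cut-off to be tuned per collection.
    """
    limit = max_fraction * doc_count
    return {term for term, doc_freq in term_freqs if doc_freq > limit}


# Hypothetical sample data: 1000 documents in total.
sample = [("the", 950), ("xapian", 40), ("and", 900), ("index", 75)]
ignored = too_common_terms(sample, doc_count=1000)
print(sorted(ignored))
```

This needs only one pass over the terms and no ranking at all, so it avoids computing a full "top 1000" when a plain threshold is enough.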
> Other statistics, like "which terms are most frequently used in
> searches" can't be calculated from Xapian since it doesn't keep the
> necessary logs, but it might be interesting for an application
> built on top of Xapian to keep track of them.
>
Yes, I understand that is not Xapian's domain :)
Thanks a lot for your response, Richard :)
- andreas