[Xapian-discuss] Statistics across the database

Andreas Marienborg andreas at startsiden.no
Mon Apr 16 12:13:26 BST 2007


On Apr 16, 2007, at 1:03 PM, Richard Boulton wrote:

> Andreas Marienborg wrote:
>> I was wondering if there is any easy way to retrieve statistics  
>> for the database, like the top 1000 terms for instance?
>

> If you say what statistics you actually want though, we might be  
> able to suggest ways to get them.
>

The most important at the moment is the one I mentioned.

> Regarding the "top 1000 terms" you mention - the first question is  
> "top"  by what measure?  You probably want to rank by something  
> like the term weight used by the standard ranking formula, but you  
> could also rank by absolute frequency, or by some heuristic based  
> on what the terms look like.
>

Number of occurrences, and/or number of documents with at least one
occurrence of the term.

> Xapian itself doesn't implement a calculation of this statistic,  
> but you could implement it outside Xapian, using an "allterms"  
> iterator to get the list of all terms, calculating a weight for  
> each, and then keeping the top 1000.  The weight calculation would  
> probably depend on the term frequency, which is easily available  
> from the term iterator.
>

I tried that, but on an index with 1.6 million documents it quickly
got very slow to calculate, which is why I was hoping that Xapian
had some sort of built-in data that I could use :)
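
For reference, a minimal sketch of the "allterms" approach described
above, in Python. The helper name top_terms and the sample data are
made up for illustration. The commented-out wiring assumes the Xapian
Python bindings, where iterating Database.allterms() yields items with
.term and .termfreq attributes. In Xapian's terminology the term
frequency is the number of documents containing the term; the total
number of occurrences is the collection frequency, which needs a
separate Database.get_collection_freq(term) lookup. The sketch keeps
only the current best k candidates in memory, but it still has to walk
the whole term list once, so it may not cure the slowness on a large
index.

```python
import heapq
from typing import Iterable, List, Tuple


def top_terms(term_freqs: Iterable[Tuple[str, int]],
              k: int = 1000) -> List[Tuple[str, int]]:
    """Return the k (term, frequency) pairs with the highest frequency.

    heapq.nlargest makes a single pass over the input and keeps at most
    k candidates at a time, so memory stays small even for a long term list.
    The result is sorted from most to least frequent.
    """
    return heapq.nlargest(k, term_freqs, key=lambda pair: pair[1])


# Hypothetical wiring to a real index (assumes the Xapian Python bindings
# are installed; the path is a placeholder):
#
#   import xapian
#   db = xapian.Database("/path/to/index")
#   pairs = ((item.term, item.termfreq) for item in db.allterms())
#   top = top_terms(pairs, 1000)

if __name__ == "__main__":
    # Made-up sample data standing in for the term list of an index.
    sample = [("the", 950), ("xapian", 40), ("and", 900), ("index", 75)]
    print(top_terms(sample, k=2))
```
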

> It might be interesting to add this to Xapian, but I'm not sure  
> what it would be used for.  What would you use the information for?
>

I was going to use it to find words that are used in too many
documents to have any significant meaning in determining groups of
documents. Basically, terms that I would like to ignore in some other
algorithms because they are too common.
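
A rough sketch of that filtering step, in Python. The function name
too_common_terms and the 50% cutoff are arbitrary choices made for
illustration, not anything Xapian provides. It assumes the per-term
document counts and the total document count are already available,
for example from Database.allterms() and Database.get_doccount() in
the Xapian Python bindings.

```python
from typing import Iterable, Set, Tuple


def too_common_terms(term_docfreqs: Iterable[Tuple[str, int]],
                     doc_count: int,
                     max_fraction: float = 0.5) -> Set[str]:
    """Return the terms that occur in more than max_fraction of all documents.

    term_docfreqs: (term, number of documents containing the term) pairs.
    doc_count:     total number of documents in the index.
    max_fraction:  illustrative threshold; 0.5 means "more than half".
    """
    if doc_count <= 0:
        # An empty index has no common terms.
        return set()
    cutoff = max_fraction * doc_count
    return {term for term, docfreq in term_docfreqs if docfreq > cutoff}


if __name__ == "__main__":
    # Made-up counts for a 100-document index.
    counts = [("the", 90), ("xapian", 5), ("and", 50)]
    print(too_common_terms(counts, doc_count=100))
```

Unlike a fixed "top 1000" list, a fraction-based cutoff scales with the
size of the index, though a suitable threshold would have to be found
by experiment.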

> Other statistics, like "which terms are most frequently used in  
> searches" can't be calculated from Xapian since it doesn't keep the  
> necessary logs, but it might be interesting for an application  
> built on top of Xapian to keep track of them.
>

Yes, I understand that is not Xapian's domain :)



Thanks a lot for your response, Richard :)


- andreas



