[Xapian-discuss] Counting and statistics

Andreas Marienborg andreas at startsiden.no
Wed Mar 28 12:00:27 BST 2007


On Mar 28, 2007, at 10:30 AM, Richard Boulton wrote:

> Andreas Marienborg wrote:
>> I was wondering if it is possible to figure out "popular terms" in  
>> a given set of documents (not the entire database, but let's say  
>> the last 1000 articles).
>
> You want to read the documentation for the Enquire::get_eset() method.
> This takes a list of "relevant documents" (as an RSet object), and  
> returns a list of terms.  The terms returned will be ordered by a  
> weighting function, which rewards terms which are high frequency in  
> the documents in the RSet compared to the corpus as a whole.
>
> In a sense, this method is the dual of the get_mset() method - it  
> returns a list of terms given a list of documents.
>
> If you want to dig into the code of omega, you'll find that the  
> implementation of the topterms functionality there uses this method.


Thanks for your speedy reply!

I've been trying the methods you outlined, and also tried to read the  
omega source code.

I have managed to get some sort of result, by adding every doc in the  
RSet, then using that to build an ESet.

Is there any way to "skip" some terms when building the ESet? I tried  
with:

	my $eset = $enquire->get_eset(10, $rset,
		sub { my $term = shift; warn "in decider!"; return 1; });

but that just gives me the following error upon execution:

	Usage: Search::Xapian::Enquire::get_eset(THIS, maxitems, rset) at ./script/nyheter_search_word_count.pl line 74.

Is the solution then to skip some terms when reading the ESet? I am  
wondering because we have a lot of terms that I do not want to include  
in my calculations (like Hhost.tld, Ccategoryid, M200703, Y2007, etc.).
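
For reference, a minimal sketch of what I mean by skipping terms while reading the ESet. The helper names is_prefixed_term and filter_terms are made up for illustration, and the sketch assumes that ordinary text terms are indexed in lowercase, so any term starting with a capital letter is one of our prefixed terms. It works on a plain list of term names rather than on a real ESet:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the term looks like a prefixed term.
# Assumption: ordinary text terms are lowercase, so a leading capital
# letter (H, C, M, Y, ...) marks a prefix.
sub is_prefixed_term {
    my ($term) = @_;
    return $term =~ /^[A-Z]/ ? 1 : 0;
}

# Hypothetical helper: keep at most $max unprefixed terms, in the
# order they were given (i.e. the order the ESet returned them).
sub filter_terms {
    my ($max, @terms) = @_;
    my @kept = grep { !is_prefixed_term($_) } @terms;
    splice(@kept, $max) if @kept > $max;
    return @kept;
}

# Stand-in for the term names read from an ESet.
my @candidates = ('Hhost.tld', 'election', 'M200703', 'budget', 'Y2007', 'oslo');
my @top = filter_terms(2, @candidates);
print join(', ', @top), "\n";    # prints "election, budget"
```

In the real script the candidate list would presumably come from walking the ESet and collecting each term name, and I would have to ask get_eset() for more than 10 items, since some of them get thrown away afterwards.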

Also, on an ESetIterator, is it not possible to get the number of  
occurrences, or the number of documents containing the term, just the  
weight? Where can I read about how this weight is calculated?

thanks again!

andreas


