[Xapian-discuss] Finding Max Possible Weight of a Document

Kenneth Loafman kenneth at loafman.com
Tue Feb 6 00:47:28 GMT 2007


Olly Betts wrote:
> On Fri, Jan 26, 2007 at 06:57:37AM -0600, Kenneth Loafman wrote:
>> Is there a way, without running a match, to find the max possible weight 
>> of a document?  This could be with or without consideration of the 
>> length of the document.  I have looked at all of the docs available on 
>> the web and installed on the system and may just be overlooking it.
> 
> Are you trying to find the max possible weight of a particular document,
> or of any document in the database?

Max weight of each document relative to the corpus.

> If it's any document in the database, you can call Enquire::get_mset()
> with maxitems = 0 and get_max_possible() on the resulting MSet will give
> you an upper bound (in this case, no actual matching happens).

I did not know that would be valid without a previous match.  Thanks!
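
For the archive, a minimal sketch of that call. It assumes the Xapian Python bindings, where the methods carry the same names as above (get_mset, get_max_possible), and an Enquire object that already has a query set. The helper name max_possible_weight is made up for illustration.

```python
def max_possible_weight(enquire):
    """Return an upper bound on the weight any document could receive.

    `enquire` is assumed to be a xapian.Enquire with a query already set.
    Requesting zero items means no actual matching takes place.
    """
    mset = enquire.get_mset(0, 0)  # first=0, maxitems=0
    return mset.get_max_possible()
```

The value is a bound for the query as a whole, not for any one document.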

>> The most direct way would be to sum the term weights times term freq of 
>> each document, but it would be nice if there was a call to do just that.
> 
> The document weight isn't necessarily calculated by such a sum.

I was being extremely simplistic here; it was only meant as a rough estimate.

> There's a weight from each term, which typically is a function of the
> wdf (i.e. the frequency of the term in a particular document) but not
> necessarily in the form of a product.  There's also an optional extra
> term in the sum (dependent on document length).
> 
> Perhaps you could tell us what you're trying to achieve here?

If you take the weight of each document relative to the entire corpus 
(the max weight of the document), you can find outliers in the set, e.g. 
legal documents mixed into a corpus of medical documents.  My current 
project is to find outliers and other exceptions in a large corpus.
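
To make the idea concrete, here is a rough sketch in plain Python of the simplistic estimate (term weight times wdf, summed per document). It is not Xapian's weighting: the term weight log(N/df), the length normalisation, the z-score test and its threshold are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def rough_scores(docs):
    """docs is a list of token lists; returns one score per document.

    Each score is sum(wdf * log(N / df)) over the document's terms,
    divided by the document length. The log(N / df) term weight is an
    assumed stand-in, not what Xapian computes.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        wdf = Counter(doc)  # within-document frequency of each term
        total = sum(count * math.log(n / df[term])
                    for term, count in wdf.items())
        scores.append(total / len(doc) if doc else 0.0)
    return scores


def outliers(scores, threshold=1.5):
    """Return indices whose score lies more than `threshold` population
    standard deviations from the mean (threshold is an assumed default)."""
    if not scores:
        return []
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return []
    return [i for i, s in enumerate(scores) if abs(s - mu) / sigma > threshold]


corpus = [
    ["patient", "dose", "trial"],
    ["patient", "dose", "symptom"],
    ["patient", "trial", "symptom"],
    ["patient", "dose", "trial"],
    ["plaintiff", "court", "statute"],  # the odd one out
]
flagged = outliers(rough_scores(corpus))
```

Documents made mostly of terms that are rare in the corpus score high, which is what should make a stray legal document stand out among medical ones.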

...Thanks,
...Ken


