[Xapian-devel] Handling Negative value due to logarithm of probabilities.

Olly Betts olly at survex.com
Fri Apr 27 09:03:30 BST 2012


On Fri, Apr 27, 2012 at 07:29:32AM +0100, Olly Betts wrote:
> On Fri, Apr 27, 2012 at 10:28:30AM +0530, Gaurav Arora wrote:
> > Moreover selecting a large enough K would be a tricky task as no K would
> > be large enough since log(x) -> -inf as x -> 0
> 
> Well, I'm not saying we should try to pick K such that we never clamp,
> just large enough that the clamping is fairly rare.
> 
> How is Pi actually calculated?  I'm not sure I've seen that detail
> anywhere.

Looking at the Ponte and Croft paper, they use:

 Pi = wdf/doclength

If we use 1 for the wdf when a term isn't present in a document, then
wdf >= 1 and doclength <= doc_length_upper_bound for every document, so
for a given collection it is always true that:

 Pi >= 1/doc_length_upper_bound

So if we set:

 K = doc_length_upper_bound

we can ensure that K.Pi >= 1 and not have to worry about clamping the
log to be non-negative.
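
As a quick sanity check of that bound, here is a minimal Python sketch.  It
is not Xapian code: the function name log_k_pi and the sample document
lengths are made up for illustration, and it assumes Pi = wdf/doclength with
wdf treated as 1 for absent terms, as described above.

```python
import math


def log_k_pi(wdf, doc_length, k):
    """Return log(K * Pi), where Pi = wdf/doclength.

    A wdf of 0 (term absent from the document) is treated as 1, which is
    the assumption made in the text above.
    """
    effective_wdf = max(wdf, 1)
    # Multiply before dividing so that K == doc_length gives exactly 1.0.
    return math.log(k * effective_wdf / doc_length)


# Hypothetical collection: only the document lengths matter here.
doc_lengths = [3, 10, 250, 1000]
doc_length_upper_bound = max(doc_lengths)

# Worst case for each document is an absent term (wdf treated as 1).
k = doc_length_upper_bound
worst_cases = [log_k_pi(0, n, k) for n in doc_lengths]

# A smaller K can still go negative for the longest document.
smaller_k_value = log_k_pi(0, 1000, 100)
```

With K set to the upper bound, every entry in worst_cases is non-negative
and the longest document gives exactly 0, whereas the smaller K in the last
line yields a negative log which would need clamping.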

So it looks like we can actually pick a non-insane K which will ensure
we never clamp.  Maybe that would be inefficient though, and actually a
smaller K would work equally well for retrieval, yet be faster.

Cheers,
    Olly


