[Xapian-devel] Handling Negative value due to logarithm of probabilities.
olly at survex.com
Fri Apr 27 09:03:30 BST 2012
On Fri, Apr 27, 2012 at 07:29:32AM +0100, Olly Betts wrote:
> On Fri, Apr 27, 2012 at 10:28:30AM +0530, Gaurav Arora wrote:
> > Moreover, selecting a large enough K would be a tricky task, as no K would
> > be large enough, since log(x) -> -inf as x -> 0
> Well, I'm not saying we should try to pick K such that we never clamp,
> just large enough that the clamping is fairly rare.
> How is Pi actually calculated? I'm not sure I've seen that detail.
Looking at the Ponte and Croft paper, they use:
Pi = wdf/doclength
If we are using 1 for the wdf when a term isn't present in a document,
then for a given collection, it is always true that:
Pi >= 1/doc_length_upper_bound
So if we set:
K = doc_length_upper_bound
we can ensure that K.Pi >= 1 and not have to worry about clamping the
log to be non-negative.
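As a rough illustration of the bound above, here is a minimal Python sketch. The function name and the sample (wdf, doc_length) pairs are made up for the example and are not Xapian API; the only things taken from this thread are Pi = wdf/doclength, the use of 1 for the wdf of an absent term, and K = doc_length_upper_bound.

```python
import math

def scaled_log_weight(wdf, doc_length, k):
    """Return log(K * Pi), where Pi = wdf / doc_length."""
    # Assumption from the thread: an absent term is treated as wdf = 1.
    effective_wdf = max(wdf, 1)
    # Multiply before dividing so that K * Pi is not rounded below 1
    # when K * wdf == doc_length.
    return math.log(k * effective_wdf / doc_length)

# Hypothetical collection: (wdf, doc_length) pairs for a single term.
docs = [(0, 120), (3, 120), (1, 4000), (0, 4000), (7, 35)]

# K is the upper bound on document length in this collection.
doc_length_upper_bound = max(length for _, length in docs)

weights = [scaled_log_weight(wdf, length, doc_length_upper_bound)
           for wdf, length in docs]

# No weight is negative, so no clamping is needed with this K.
no_clamping_needed = all(w >= 0 for w in weights)
```

The smallest weight comes from the longest document with wdf treated as 1, where K.Pi is exactly 1 and the log is 0.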
So it looks like we can actually pick a non-insane K which will ensure
we never clamp. Maybe that would be inefficient though, and actually a
smaller K would work equally well for retrieval, yet be faster.
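To get a feel for the trade-off with a smaller K, one could measure how often the log would need clamping for a given K. The sketch below is only an illustration: the helper names and the sample data are hypothetical, and it says nothing about retrieval quality or speed, which would need real experiments.

```python
import math

def clamped_log_weight(wdf, doc_length, k):
    """Return max(0, log(K * Pi)), with Pi = wdf / doc_length."""
    # Assumption from the thread: an absent term is treated as wdf = 1.
    effective_wdf = max(wdf, 1)
    return max(0.0, math.log(k * effective_wdf / doc_length))

def clamp_fraction(docs, k):
    """Fraction of (wdf, doc_length) pairs where log(K * Pi) is negative."""
    # log(K * Pi) < 0 exactly when K * wdf < doc_length.
    clamped = sum(1 for wdf, length in docs if k * max(wdf, 1) < length)
    return clamped / len(docs)

# Hypothetical collection: (wdf, doc_length) pairs for a single term.
docs = [(0, 120), (3, 120), (1, 4000), (0, 4000), (7, 35)]

full_k = max(length for _, length in docs)   # doc_length_upper_bound
small_k = 200                                # an arbitrary smaller K

never_clamped = clamp_fraction(docs, full_k)
sometimes_clamped = clamp_fraction(docs, small_k)
```

With this sample data, K = doc_length_upper_bound never clamps, while K = 200 clamps only the two entries for the 4000-term document.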