[Xapian-devel] Handling Negative value due to logarithm of probabilities.

Olly Betts olly at survex.com
Fri Apr 27 07:29:32 BST 2012


On Fri, Apr 27, 2012 at 10:28:30AM +0530, Gaurav Arora wrote:
> In continuation of the discussion in the Melange comments, about the
> negative values returned in the matcher due to taking logarithms of
> probabilities.
> 
> *If we make K suitably large, we could clamp each log(K.Pi) to be >= 0,
> and this change will only affect really low probability terms (those with
> Pi < 1/K, so you can adjust K to suit):*
> 
> *W' = sum(i=1,...,n, max(log(K.Pi), 0))*
> 
> Did you mean that for low probabilities the value returned by log(K.Pi)
> would be negative, so we replace the ones which still give a negative
> value with 0?

Yes, that was my thought.

> Assigning 0 will be equivalent to rejecting the term from the query
> completely, which hurts retrieval performance in the Language Model, as
> terms missing from the document are smoothed with the collection frequency.

Isn't having a negative value here rejecting it at least as much?

By clamping those negative values of log(K.Pi), it seems to me that
we're increasing the amount of importance we attribute to such terms,
not rejecting them.

If you consider the original product, what we're doing is:

product(i=1,...,n, max(Pi, 1/K))

I understand that in the ngram LM models, it's common to substitute a
non-zero value for the frequency of an ngram which isn't present, which
suggests the above isn't too unreasonable a thing to do.
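
To make that concrete, here's a quick standalone check (just illustrative
C++, not Xapian code, with made-up probabilities) that the clamped sum W'
differs from the log of that floored product only by the constant n.log(K),
so the ordering it produces is the same:

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        const double K = 1e6;
        // Hypothetical per-term probabilities; one is below 1/K.
        std::vector<double> p = { 0.3, 0.01, 2e-7 };

        double clamped_sum = 0.0;     // W' = sum(max(log(K.Pi), 0))
        double floored_product = 1.0; // product(max(Pi, 1/K))
        for (double pi : p) {
            clamped_sum += std::max(std::log(K * pi), 0.0);
            floored_product *= std::max(pi, 1.0 / K);
        }

        // The two differ only by n*log(K), which doesn't affect ranking.
        std::printf("W' = %g vs %g\n",
                    clamped_sum,
                    std::log(floored_product) + p.size() * std::log(K));
        return 0;
    }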

> I think we must try the smoothing from collection statistics if the
> document term probability doesn't work(is generating negative value).
> 
> sum(i=1,...,n,
>     if (max(log(K.Pi), 0) == 0)
>         max(log(K.Pcollec.i), 0)
>     else
>         log(K.Pi)
> )
> 
> In case neither works, returning 0 would be the only option.

Yes, perhaps that's better.
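
As a sketch of how that per-term fallback might look in code (term_weight,
p_doc and p_coll are just names for this example, not anything existing in
the tree):

    #include <algorithm>
    #include <cmath>

    // Per-term weight with the collection-statistics fallback proposed above.
    double term_weight(double p_doc, double p_coll, double K) {
        double w = std::log(K * p_doc);
        if (w > 0.0) return w;
        // Document probability too low: fall back to the collection
        // probability, still clamped at 0 so the weight can't go negative.
        return std::max(std::log(K * p_coll), 0.0);
    }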

> Moreover, selecting a large enough K would be tricky, as no K would be
> large enough, since log(x) -> -inf as x -> 0.

Well, I'm not saying we should try to pick K such that we never clamp,
just large enough that the clamping is fairly rare.

How is Pi actually calculated?  I'm not sure I've seen that detail
anywhere.

> Should we approach selecting the value of K statistically?  By that I mean
> running the unigram weighting scheme on a large collection, observing the
> lowest probability which occurs, and hence approximating the value of K.
> Or should we use some other method?

My thought was we'd try different values and see how it affected
retrieval performance and runtime, and either choose an appropriate
value, or make it user-specifiable with a sensible default.
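
For example, something along these lines could report how often clamping
would kick in for a few candidate values of K (the probabilities here are
placeholders; they'd come from running the unigram scheme over a real
collection):

    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        // Placeholder per-term probabilities gathered from a real run.
        std::vector<double> observed_p = { 0.02, 3e-5, 8e-8 };
        const double candidates[] = { 1e3, 1e6, 1e9 };
        for (double K : candidates) {
            size_t clamped = 0;
            for (double p : observed_p) {
                // log(K * p) < 0 exactly when p < 1/K.
                if (K * p < 1.0) ++clamped;
            }
            std::printf("K = %g: %zu of %zu terms would be clamped\n",
                        K, clamped, observed_p.size());
        }
        return 0;
    }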

> I asked the same question about this on Stack Overflow:
> 
> http://goo.gl/ykwN4
> 
> They suggested:
> 
> *"Could you simply take the negative of the logarithm? Since you are
> dealing with probabilities (i.e. values <= 1), the logarithm is always
> negative, so negating it will always make it positive."*
> 
> But this approach won't be a good idea, as large values will indicate low
> probabilities and small values will indicate high probabilities.  Hence
> the matcher will tend to skip some good documents from the ranked list
> due to their lower weight.

Yes, negating won't help, unless we also change the matcher to look for
low values not high ones, which would be quite tricky I think.

It also means we'd end up with a value between 0 and infinity, whereas
we want to have an upper bound if at all possible.
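
(For the clamped form, at least, the bound is easy: since Pi <= 1, each
term max(log(K.Pi), 0) is at most log(K), so W' <= n.log(K) for a query
of n terms.)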

Cheers,
    Olly


