[Xapian-devel] Handling Negative value due to logarithm of probabilities.

Gaurav Arora gauravarora.daiict at gmail.com
Fri Apr 27 05:58:30 BST 2012


Hi,

Continuing the discussion from the Melange comments about the negative values
returned to the matcher when taking the logarithm of probabilities.

*If we make K suitably large, we could clamp each log(K.Pi) to be >= 0,
and this change will only affect really low probability terms (those with
Pi < 1/K, so you can adjust K to suit):*

*W' = sum(i=1,...,n, max(log(K.Pi), 0))*
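To make sure I read the clamped sum above correctly, here is a minimal Python sketch. The function name `clamped_weight` and the sample probabilities are only illustrative assumptions, not anything from the Xapian code, and it assumes every Pi is strictly positive.

```python
import math

def clamped_weight(probs, k):
    """W' = sum over query terms of max(log(K * P_i), 0).

    Assumes every probability in probs is > 0.
    """
    # Terms with P_i < 1/K give log(K * P_i) < 0 and are clamped to 0.
    return sum(max(math.log(k * p), 0.0) for p in probs)

# Hypothetical per-term probabilities; with K = 1000 any term with
# P_i < 0.001 contributes nothing to the sum.
probs = [0.05, 0.002, 0.0001]
weight = clamped_weight(probs, 1000)
```

Here the third term falls below 1/K, so only the first two terms contribute to the weight.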

Did you mean that for low-probability terms the value returned by log(K.Pi)
would be negative, so any term whose value is still negative after scaling
by K is replaced by 0?

Assigning 0 is equivalent to dropping the term from the query completely,
which hurts retrieval performance in the Language Model, since terms missing
from the document are supposed to be smoothed using their collection
frequency.

I think we should fall back to smoothing from collection statistics when the
document term probability doesn't work (i.e. it generates a negative value):

sum(i=1,...,n,
    if (max(log(K.Pi), 0) == 0)
        max(log(K.Pcollec.i), 0)
    else
        log(K.Pi)
)

If neither works, returning 0 would be the only option.
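The fallback described above could be sketched roughly as follows. This is only an illustration under my own assumptions: the names `term_weight` and `fallback_weight` are hypothetical, each term is given as a (document probability, collection probability) pair, and a probability of 0 is treated as "doesn't work" so that log(0) is never evaluated.

```python
import math

def term_weight(p_doc, p_coll, k):
    """Weight of one query term with a collection-probability fallback."""
    # First try the document probability, clamped at 0.
    w = max(math.log(k * p_doc), 0.0) if p_doc > 0 else 0.0
    if w > 0.0:
        return w
    # Document probability was too small: fall back to the collection
    # probability, and clamp to 0 if that is too small as well.
    return max(math.log(k * p_coll), 0.0) if p_coll > 0 else 0.0

def fallback_weight(terms, k):
    """Sum the per-term weights over (p_doc, p_coll) pairs."""
    return sum(term_weight(p_doc, p_coll, k) for p_doc, p_coll in terms)

# Hypothetical values: the second term is rescued by its collection
# probability, the third one still ends up as 0.
terms = [(0.05, 0.01), (0.0001, 0.01), (0.0001, 0.0005)]
total = fallback_weight(terms, 1000)
```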


Moreover, selecting a large enough K would be tricky, as no K is large
enough in general, since log(x) -> -inf as x -> 0.

Should we select the value of K statistically? By that I mean running the
unigram weighting scheme on a large collection, observing the lowest
probability that occurs, and approximating K from it. Or is there some other
method?
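As a rough sketch of the statistical idea above: if p_min is the lowest non-zero probability observed on the sample collection, then K = 1/p_min is the smallest K that keeps log(K.Pi) >= 0 for every observed term. The function name `estimate_k` and the sample values are hypothetical.

```python
import math

def estimate_k(observed_probs):
    """Smallest K such that log(K * p) >= 0 for every observed p > 0."""
    p_min = min(p for p in observed_probs if p > 0)
    return 1.0 / p_min

# Hypothetical probabilities observed while running the weighting scheme
# over a sample collection.
observed = [0.05, 0.002, 0.0001]
k = estimate_k(observed)
```

Any term rarer than the ones seen in the sample would still produce a negative value, so this only approximates a suitable K.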


I asked the same question on Stack Overflow:

http://goo.gl/ykwN4


They suggested:

*"Could you simple take the negative of the logarithm? Since you are
dealing with probabilities (i.e. values <= 1), the logarithm is always
negative, so negating it will always make it positive."*

But this approach isn't a good idea, as large values would then indicate low
probabilities and small values high probabilities. Hence the matcher would
tend to drop some good documents from the ranked list because of their lower
weight.
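A small Python check of why negation is a problem, using two made-up probabilities for the same query term in two documents (the variable names and values are hypothetical):

```python
import math

# Hypothetical probabilities of one query term in two documents.
p_good = 0.05    # term is frequent in this document
p_poor = 0.0001  # term is rare in this document

# Plain log keeps the order: the better document gets the higher value.
log_good = math.log(p_good)
log_poor = math.log(p_poor)

# Negating makes both values positive but reverses the order, so the
# better document now gets the lower weight.
neg_good = -log_good
neg_poor = -log_poor
```

Both negated values are positive, but the document with the higher probability now gets the smaller weight.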


Thanks,
--
with regards
Gaurav A.