[Xapian-devel] Implementation of the PL2 weighting scheme of the DFR Framework

Mon Mar 11 20:25:51 GMT 2013

Hello guys. I am working on implementing the PL2 weighting scheme of the DFR
framework by Gianni Amati.
It uses the Poisson approximation of the Binomial as the probabilistic
model (P), the Laplace law of succession to calculate the after effect of
sampling, or the risk gain (L), and the within-document frequency
normalization H2(2) (as proposed by Amati in his PhD thesis).

The formula for w(t,d) in this scheme is given by:

w(t,d) = wqf * L * P

where
    wqf    = within-query frequency
    L      = Laplace law of after-effect sampling = 1 / (wdfn + 1)
    P      = wdfn * log(wdfn / lambda) + (lambda - wdfn) * log(e)
             + 0.5 * log(2 * pi * wdfn)
    wdfn   = wdf * (1 + c * log(average length of document in database
             / length of document d))   (H2 normalization)
    lambda = mean of the Poisson distribution
           = collection frequency of the term / size of the database
    c      = a constant parameter

The base of all logarithms is 2.
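
For illustration, here is a minimal Python sketch of the formula exactly as
it is written above. It is not the Xapian implementation. The function name,
the parameter names, and the choice to return 0 when wdfn is not positive
(where the logarithms are undefined) are all assumptions made for this
example; lam stands for the Poisson mean.

```python
import math


def pl2_weight(wqf, wdf, doc_len, avg_len, coll_freq, db_size, c=1.0):
    """PL2 weight of one term in one document, per the formula in this post.

    All names here are hypothetical; all logarithms are base 2.
    """
    # H2 normalization of the within-document frequency, as stated above.
    wdfn = wdf * (1.0 + c * math.log2(avg_len / doc_len))
    if wdfn <= 0.0:
        # Assumption: log(wdfn) is undefined here, so contribute nothing.
        return 0.0

    # Mean of the Poisson distribution.
    lam = coll_freq / db_size

    # Laplace law of succession (after effect of sampling).
    L = 1.0 / (wdfn + 1.0)

    # Poisson approximation of the Binomial.
    P = (wdfn * math.log2(wdfn / lam)
         + (lam - wdfn) * math.log2(math.e)
         + 0.5 * math.log2(2.0 * math.pi * wdfn))

    return wqf * L * P


if __name__ == "__main__":
    # A document of average length containing the term once.
    print(pl2_weight(wqf=1, wdf=1, doc_len=100, avg_len=100,
                     coll_freq=10, db_size=1000))
```

With doc_len equal to avg_len the normalization leaves wdf unchanged, which
makes the call above a convenient sanity check.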

The code is almost complete, but I am stuck at a few places, which are as
follows:

1.) Calculating the upper bound of the weight for the get_maxpart()
function

    This one calculation has been giving me sleepless nights for a couple
    of days now. The problem is that L is a decreasing function of wdfn,
    while P, as per my calculations, is an increasing function. I arrived
    at this conclusion because the derivative of L is always negative and
    the derivative of P is always positive. (In the derivative of P,
    log(lambda) will always be negative, since Amati states in his thesis
    that for the PL2 model the collection frequency of the term <<
    collection size, and so lambda will always be less than one.)

    So, in order to find the upper bound, I simply substituted wdf = 1 for
    L and wdf = wdf_upper_bound for P, using the upper document length
    bound for the wdfn of L and the lower document length bound for the
    wdfn of P, and multiplied the two. However, this does not give a very
    tight bound. Nothing is said about upper bounds on DFR weights in
    Amati's thesis or in his papers on DFR. I even tried differentiating
    the product of L and P and equating that to zero, as discussed on IRC,
    but that landed me with a complicated equation with no answer in
    sight. Please tell me what you all think.

2.) Documentation

    This scheme belongs to a family of weighting schemes. Please do let me
    know if I need to write any additional documentation to introduce this
    new framework and the new weighting schemes.
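
The loose bound described in point 1 can be sketched in Python as follows,
next to a brute-force maximum of L * P sampled over the same wdfn range so
that the gap between the two can be inspected. This is only an illustration
under assumptions: the function and parameter names are hypothetical
stand-ins for the wdf and document length bounds mentioned above, the
sampled maximum is a numerical check rather than a proof, and the example
values are chosen so that wdfn stays positive.

```python
import math


def laplace(wdfn):
    # L: decreasing in wdfn.
    return 1.0 / (wdfn + 1.0)


def poisson(wdfn, lam):
    # P: base-2 logarithms, as in the formula for w(t,d).
    return (wdfn * math.log2(wdfn / lam)
            + (lam - wdfn) * math.log2(math.e)
            + 0.5 * math.log2(2.0 * math.pi * wdfn))


def wdfn_of(wdf, doc_len, avg_len, c):
    # H2 normalization as written in this post.
    return wdf * (1.0 + c * math.log2(avg_len / doc_len))


def loose_bound(wdf_upper, len_lower, len_upper, avg_len, lam, c=1.0):
    # L at the smallest wdfn (wdf = 1, longest document) times
    # P at the largest wdfn (wdf upper bound, shortest document).
    wdfn_min = wdfn_of(1, len_upper, avg_len, c)
    wdfn_max = wdfn_of(wdf_upper, len_lower, avg_len, c)
    return laplace(wdfn_min) * poisson(wdfn_max, lam)


def sampled_max(wdf_upper, len_lower, len_upper, avg_len, lam,
                c=1.0, steps=10000):
    # Largest value of L * P found on an evenly spaced grid over
    # [wdfn_min, wdfn_max].
    lo = wdfn_of(1, len_upper, avg_len, c)
    hi = wdfn_of(wdf_upper, len_lower, avg_len, c)
    best = -math.inf
    for i in range(steps + 1):
        w = lo + (hi - lo) * i / steps
        best = max(best, laplace(w) * poisson(w, lam))
    return best


if __name__ == "__main__":
    args = dict(wdf_upper=20, len_lower=50, len_upper=150,
                avg_len=100, lam=0.01)
    print("loose bound :", loose_bound(**args))
    print("sampled max :", sampled_max(**args))
```

Because L and P are evaluated at different values of wdfn, the product can
sit far above anything L * P reaches at a single wdfn, which is the
looseness described in point 1.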

Please do let me know if you need additional information to help me out.
I want to finish this scheme and move on to the DPH scheme, which is yet
another interesting weighting scheme of the DFR framework.

-Regards
-Aarsh