[Xapian-devel] Implementation of the PL2 weighting scheme of the DFR Framework

aarsh shah aarshkshah1992 at gmail.com
Mon Mar 18 04:07:31 GMT 2013


Hello guys. I have sent a pull request for the PL2 scheme. Please do have a
look. I will begin work on the DPH scheme after working on the feedback for
this scheme. Also, I'll add the feature tests for this scheme after I've
worked on the feedback. Thanks.

https://github.com/xapian/xapian/pull/8


On Wed, Mar 13, 2013 at 12:14 AM, aarsh shah <aarshkshah1992 at gmail.com> wrote:

> Sorry, the derivative of P evaluates to
>
> Q + (Q * log (Q * wdf)) - (Q * log lambda) - Q + (0.5 / wdf)
>
> = Q * (log (Q * wdf) - log lambda + (0.5 / (Q * wdf)))
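A quick numerical sketch can confirm the simplified derivative above against a finite difference. The values of Q and lambda below are arbitrary illustrative choices, not from any real collection:

```python
import math

# P written with natural logs, as in the derivation down-thread:
# P(wdf) = Q*wdf*ln(Q*wdf/lam) + (lam - Q*wdf) + 0.5*ln(2*pi*Q*wdf)
# Closed-form derivative: Q*(ln(Q*wdf) - ln(lam) + 0.5/(Q*wdf))
Q, lam = 0.7, 0.05

def P(wdf):
    return (Q * wdf * math.log(Q * wdf / lam)
            + (lam - Q * wdf)
            + 0.5 * math.log(2 * math.pi * Q * wdf))

def dP(wdf):
    return Q * (math.log(Q * wdf) - math.log(lam) + 0.5 / (Q * wdf))

# Compare against a central finite difference at a few points.
h = 1e-6
for wdf in (1.0, 2.0, 10.0):
    numeric = (P(wdf + h) - P(wdf - h)) / (2 * h)
    assert abs(numeric - dP(wdf)) < 1e-5
```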
>
>
>
> On Tue, Mar 12, 2013 at 11:44 PM, aarsh shah <aarshkshah1992 at gmail.com> wrote:
>
>> >I think the formula you have there for wdfn is wrong -
>> >http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels gives LaTeX source for
>> >what it calls tfn, which is your wdfn:
>> >tfn=tf\cdot\log_2(1+c\cdot\frac{avg\_l}{l})
>> >So the '1+c*' bit is *inside* the log, i.e.:
>> >wdfn = wdf * log(1 + c * average_doc_length/length_of_d)
>>
>> Yeah, sorry, that was a typo.
>>
>>
>> >That agrees with Amati's slides from ESSIR 2007, except that he says
>> >"ln" not "log" in the tfn formula, which suggests log to the base e.
>> >The base of the log would only make a factor difference to wdfn, but
>> >that seems like it would affect the overall weight in a more complex
>> >way.
>> >Amati's phd thesis also has ln (equation 6.10 on page 118), and that
>> >ln is necessary for the H1 approximating H2 for large l observation
>> >to work (end of page 118).
>>
>> I know, I've gone through the relevant part of the thesis. But the thing
>> is, Amati also did a paper on DFR schemes along with Van Rijsbergen titled
>> *Probabilistic models of information retrieval based on measuring
>> divergence from randomness*.
>> I've read the paper through and through, and in this paper he has taken
>> the base of the log to be 2 everywhere, including the H2
>> normalization. Moreover, the Terrier code has also taken the base to be 2.
>> So, I decided to go along with it. Please do let me know if you think a
>> base of e is better.
>>
>>
>> >I'm not sure I totally follow you here.
>> >Differentiating P with respect to wdf I get:
>> >Q*{log(wdfn) + wdfn/(2*ln(wdfn)) - log lambda - log e + 1/(4*wdfn)}
>> >Where Q = log(1+c*avg_l/l)
>>
>> Umm... Olly, I get a different derivative.
>> P = (Q * wdf * log((Q * wdf) / lambda)) + ((lambda - Q * wdf) * log e) +
>> (0.5 * log(2 * pi * Q * wdf))
>>
>> Let the base of all logarithms be e here. Then,
>>
>> P = (Q * wdf * log (Q * wdf)) - (Q * wdf * log (lambda)) + (lambda - Q *
>> wdf) + (0.5 * log (2 * pi * Q * wdf))
>>
>> because log e = 1 and log (a / b) = log a - log b.
>>
>> Therefore, the derivative of P is
>>
>> P' = (Q * wdf * (1 / (Q * wdf)) * Q) + (Q * log (Q * wdf)) - (Q * log
>> lambda) + (0 - Q) + (0.5 * (1 / (2 * pi * Q * wdf)) * 2 * pi * Q)
>>
>> because of the derivatives:
>> any constant = 0
>> log x = 1 / x
>> log(kx) = (1 / (k * x)) * k where k is any constant
>> f(x) * g(x) = (f(x) * g'(x)) + (f'(x) * g(x))
>>
>> Hence, the derivative of P evaluates to:
>>
>> Q + (Q * log Q) + (Q * log (Q * wdf)) - (Q * log lambda) - Q + (0.5 /
>> wdf)
>>
>> because log (a * b) = log a + log b
>>
>> Thus, the derivative of P is
>>
>> Q * (log Q + log (Q * wdf) - log lambda + (0.5 / (Q * wdf)))
>>
>>
>> >can't wdfn be < 1, in which case log(wdfn) and
>> >ln(wdfn) will be negative.
>>
>> I just realized that this is true: if (1 + c * ave_d / len_d) is less
>> than the base of the log for H2, wdfn can be less than 1 and log wdfn
>> will be negative. I'll work on it and get back to you. And yes, in his
>> thesis, Amati has considered only non-negative values of c.
>>
>> I'll keep working on refining the bound to make it as tight as I can.
>> I'll send in a pull request in a couple of days. Thank you so much for
>> your detailed help. Implementing the other DFR schemes will be
>> comparatively simpler once I understand this scheme properly.
>>
>> -Regards
>> -Aarsh
>>
>> On Tue, Mar 12, 2013 at 7:38 PM, Olly Betts <olly at survex.com> wrote:
>>
>>> On Tue, Mar 12, 2013 at 01:55:51AM +0530, aarsh shah wrote:
>>> > Hello guys. I am working on implementing the PL2 weighting scheme of
>>> > the DFR framework by Gianni Amati.
>>> > It uses the Poisson approximation of the binomial as the probabilistic
>>> > model (P), the Laplace law of succession to calculate the after-effect
>>> > of sampling or the risk gain (L), and within-document frequency
>>> > normalization H2(2) (as proposed by Amati in his PhD thesis).
>>> >
>>> > The formula for w(t,d) in this scheme is given by:
>>> >
>>> > w(t,d) = wqf * L * P
>>> > where
>>> >          wqf = within-query frequency
>>> >          L = Laplace law of after-effect sampling = 1 / (wdfn + 1)
>>> >          P = wdfn * log (wdfn / lambda) + (lambda - wdfn) * log(e) +
>>> >              0.5 * log (2 * pi * wdfn)
>>> >          wdfn = wdf * (1 + c * log(average length of document in
>>> >              database / length of document d)) (H2 normalization)
>>> >          lambda = mean of the Poisson distribution = collection
>>> >              frequency of the term / size of the database
>>> >          and the base of all logarithms is 2.
>>> >          c is a constant parameter
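The quoted formula can be sanity-checked with a quick sketch. This is not Xapian code: `pl2_weight` and all numbers are hypothetical illustrations, and wdfn is taken as an already-normalized input (its exact normalization is what the rest of the thread discusses):

```python
import math

# Sketch of w(t,d) = wqf * L * P with base-2 logs, as in the formula
# above. lambda_ plays the role of collection frequency / collection
# size. All values are illustrative placeholders.
def pl2_weight(wqf, wdfn, lambda_):
    L = 1.0 / (wdfn + 1.0)                      # Laplace after-effect
    P = (wdfn * math.log2(wdfn / lambda_)
         + (lambda_ - wdfn) * math.log2(math.e)
         + 0.5 * math.log2(2 * math.pi * wdfn))
    return wqf * L * P

# A term occurring more often than expected by chance gets a
# positive weight.
w = pl2_weight(wqf=1, wdfn=5.0, lambda_=0.01)
assert w > 0
```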
>>>
>>> I think the formula you have there for wdfn is wrong -
>>> http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels gives LaTeX source for
>>> what it calls tfn, which is your wdfn:
>>>
>>> tfn=tf\cdot\log_2(1+c\cdot\frac{avg\_l}{l})
>>>
>>> So the '1+c*' bit is *inside* the log, i.e.:
>>>
>>>  wdfn = wdf * log(1 + c * average_doc_length/length_of_d)
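The corrected H2 normalization above can be sketched as follows; the function name and numbers are illustrative, and the log base is left as a parameter since the thread discusses both 2 and e:

```python
import math

def wdfn_h2(wdf, c, avg_len, doc_len, base=2.0):
    # H2 normalization with the '1 + c*' bit *inside* the log,
    # per the corrected formula above.
    return wdf * math.log(1 + c * avg_len / doc_len, base)

# For a document of exactly average length with c = 1,
# log2(1 + 1) = 1, so wdfn equals wdf.
assert abs(wdfn_h2(wdf=3, c=1.0, avg_len=100.0, doc_len=100.0) - 3.0) < 1e-12
```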
>>>
>>> That agrees with Amati's slides from ESSIR 2007, except that he says
>>> "ln" not "log" in the tfn formula, which suggests log to the base e.
>>> The base of the log would only make a factor difference to wdfn, but
>>> that seems like it would affect the overall weight in a more complex
>>> way.
>>>
>>> Amati's PhD thesis also has ln (equation 6.10 on page 118), and that
>>> ln is necessary for the H1 approximating H2 for large l observation
>>> to work (end of page 118).
>>>
>>> > The code is almost complete but I am stuck at a few places, which
>>> > are as follows:
>>> >
>>> > 1.) Calculating the upper bound of the weight for the get_maxpart()
>>> > function
>>> >     This one calculation has been giving me sleepless nights for a
>>> > couple of days now. The problem is that L is a decreasing function
>>> > of wdfn and P, as per my calculations, is an increasing function.
>>> > I arrived at this conclusion because the derivative of L is always
>>> > negative
>>>
>>> OK so far.
>>>
>>> > and the derivative of P is always positive. (In the derivative of P,
>>> > log (lambda) will always be negative as, in his thesis, Amati states
>>> > that for the PL2 model, collection frequency of term << collection
>>> > size, and so lambda will always be less than one.)
>>>
>>> I'm not sure I totally follow you here.
>>>
>>> Differentiating P with respect to wdf I get:
>>>
>>> Q*{log(wdfn) + wdfn/(2*ln(wdfn)) - log lambda - log e + 1/(4*wdfn)}
>>>
>>> Where Q = log(1+c*avg_l/l) which is > 0 since avg_l, l, and (I assume) c
>>> are all > 0, so Q doesn't affect when this does/doesn't become 0.
>>>
>>> If lambda < 1, then as you say (- log lambda) is positive, but there's
>>> the -log(e) and also can't wdfn be < 1, in which case log(wdfn) and
>>> ln(wdfn) will be negative.
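The point about wdfn dropping below 1 is easy to demonstrate numerically; the values below are made up for illustration (a long document relative to the average, small c):

```python
import math

# With the corrected H2 formula, wdfn can be < 1 when the document
# is long relative to the average and c is small, making log(wdfn)
# negative. Illustrative values only.
c, avg_len, doc_len, wdf = 0.5, 100.0, 900.0, 1
wdfn = wdf * math.log2(1 + c * avg_len / doc_len)
assert wdfn < 1 and math.log2(wdfn) < 0
```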
>>>
>>> > . So, in order to find the upper bound, I simply substituted wdf = 1
>>> > for L and used wdf = wdf_upper_bound for P, and multiplied them,
>>> > using the upper doc length bound and the lower doc length bound for
>>> > the wdfn of L and P respectively. However, this does not give that
>>> > tight a bound.
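The bound construction described above (bound L and P separately at their respective extremes, then multiply) can be sketched like this. All statistics (`wdf_ub`, the length bounds, `c`, `lam`) are made-up illustrative values, not Xapian API calls, and P is assumed increasing over the range as discussed in the thread:

```python
import math

def wdfn(wdf, c, avg_len, doc_len):
    return wdf * math.log2(1 + c * avg_len / doc_len)

def L(w):               # Laplace after-effect, decreasing in wdfn
    return 1.0 / (w + 1.0)

def P(w, lam):          # Poisson part, base-2 logs
    return (w * math.log2(w / lam)
            + (lam - w) * math.log2(math.e)
            + 0.5 * math.log2(2 * math.pi * w))

c, avg_len, lam = 1.0, 100.0, 0.05
wdf_ub, len_lb, len_ub = 50, 40.0, 400.0

w_min = wdfn(1, c, avg_len, len_ub)       # smallest wdfn -> largest L
w_max = wdfn(wdf_ub, c, avg_len, len_lb)  # largest wdfn -> largest P
upper_bound = L(w_min) * P(w_max, lam)

# Sanity check against one interior point: the bound dominates,
# though (as noted above) it is loose.
w_test = wdfn(10, c, avg_len, 100.0)
assert upper_bound >= L(w_test) * P(w_test, lam)
```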
>>>
>>> Having an upper bound is a good start.
>>>
>>> > Not a word has been spoken about upper bounds on DFR weights in
>>> > Amati's thesis or in his papers on DFR.
>>>
>>> The upper bounds are only needed because of some optimisations we do, so
>>> I'm not surprised if they aren't discussed elsewhere.
>>>
>>> > I even tried differentiating the product of L and P and equating
>>> > that to zero, as discussed on IRC, but that landed me with a
>>> > complicated equation with no answer in sight. Please tell me what
>>> > you all think.
>>>
>>> I got that down to this, but indeed it doesn't look easy to solve:
>>>
>>> log(w) + w*(w+1)*log(e)/log(w) = log(2*pi)+2*log(e*lambda)+lambda*log(e)
>>>
>>> I'm not all that confident that's even right either.
>>>
>>> > 2.) Documentation
>>> >
>>> >     This scheme belongs to a family of weighting schemes. Please do
>>> > let me know if I need to write any additional documentation to
>>> > introduce this new framework and the new weighting schemes.
>>>
>>> I don't think we need a lengthy document about them, but it would be
>>> useful to refer people to where the schemes are defined, and also
>>> briefly mention when they might want to use a particular scheme - for
>>> example, PL2 is apparently good if you want early precision.
>>>
>>> Cheers,
>>>     Olly
>>>
>>
>>
>

