[Xapian-devel] Implementation of the PL2 weighting scheme of the DFR Framework

aarsh shah aarshkshah1992 at gmail.com
Tue Mar 12 18:44:15 GMT 2013


Sorry, the derivative of P evaluates to

Q + (Q * log(Q * wdf)) - (Q * log lambda) - Q + (0.5 / wdf)

= Q * (log(Q * wdf) - log(lambda) + (0.5 / (Q * wdf)))
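
A quick sanity check in Python (throwaway code, not the Xapian implementation;
the values of Q, lambda and wdf below are arbitrary examples) compares a
finite-difference derivative of P against that closed form, using natural logs
as in the derivation quoted below:

    # Throwaway check of dP/dwdf against Q * (log(Q*wdf) - log(lambda) + 0.5/(Q*wdf)).
    # Q, lam and wdf are arbitrary example values, not taken from any collection.
    from math import log, pi

    def P(wdf, Q, lam):
        wdfn = Q * wdf
        return wdfn * log(wdfn / lam) + (lam - wdfn) + 0.5 * log(2 * pi * wdfn)

    def dP(wdf, Q, lam):
        wdfn = Q * wdf
        return Q * (log(wdfn) - log(lam) + 0.5 / wdfn)

    Q, lam, wdf, h = 0.7, 0.01, 5.0, 1e-6
    numeric = (P(wdf + h, Q, lam) - P(wdf - h, Q, lam)) / (2 * h)
    print(numeric, dP(wdf, Q, lam))  # the two numbers should agree closely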


On Tue, Mar 12, 2013 at 11:44 PM, aarsh shah <aarshkshah1992 at gmail.com> wrote:

> >I think the formula you have there for wdfn is wrong -
> >http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels gives LaTeX source for
> >what it calls tfn, which is your wdfn:
> >tfn=tf\cdot\log_2(1+c\cdot\frac{avg\_l}{l})
> >So the '1+c*' bit is *inside* the log, i.e.:
> >wdfn = wdf * log(1 + c * average_doc_length/length_of_d)
>
> Yeah, sorry, that was a typo.
>
>
> >That agrees with Amati's slides from ESSIR 2007, except that he says
> >"ln" not "log" in the tfn formula, which suggests log to the base e.
> >The base of the log would only make a factor difference to wdfn, but
> >that seems like it would affect the overall weight in a more complex
> >way.
> >Amati's PhD thesis also has ln (equation 6.10 on page 118), and that
> >ln is necessary for the H1 approximating H2 for large l observation
> >to work (end of page 118).
>
> I know, I've gone through the relevant part of the thesis. But the thing
> is, Amati also did a paper on DFR schemes along with van Rijsbergen
> titled *Probabilistic models of information retrieval based on measuring
> divergence from randomness*. I've read the paper through and through,
> and in this paper he has taken the base of the log to be 2 everywhere,
> including in the H2 normalization. Moreover, the Terrier code has also
> taken the base to be 2, so I decided to go along with it. Please do let
> me know if you think a base of e is better.
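
For reference, switching the base of the H2 log only rescales wdfn by the
constant factor 1/ln(2) (about 1.4427), although, as Olly notes below, that
constant then feeds into L and P non-linearly. A tiny illustration with
made-up numbers:

    # log2(x) == ln(x) / ln(2), so base-2 wdfn is the base-e wdfn scaled by 1/ln(2).
    # The values of wdf, c, avg_len and doc_len are made-up examples.
    from math import log, log2

    wdf, c, avg_len, doc_len = 4, 1.0, 300.0, 250.0
    wdfn_base2 = wdf * log2(1 + c * avg_len / doc_len)
    wdfn_basee = wdf * log(1 + c * avg_len / doc_len)
    print(wdfn_base2, wdfn_basee / log(2))  # identical values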
>
>
> >I'm not sure I totally follow you here.
> >Differentiating P with respect to wdf I get:
> >Q*{log(wdfn) + wdfn/(2*ln(wdfn)) - log lambda - log e + 1/(4*wdfn)}
> >Where Q = log(1+c*avg_l/l)
>
> Umm... Olly, I get a different derivative.
> P = (Q * wdf * log((Q * wdf) / lambda)) + ((lambda - Q * wdf) * log e)
> + (0.5 * log(2 * pi * Q * wdf))
>
> Let the base of all logarithms be e here. Then,
>
> P = (Q * wdf * log(Q * wdf)) - (Q * wdf * log(lambda)) + (lambda - Q * wdf)
> + (0.5 * log(2 * pi * Q * wdf))
>
> because log e = 1 and log(a / b) = log a - log b.
>
> Therefore, the derivative of P is
>
> P' = (Q * wdf * (1 / (Q * wdf)) * Q) + (Q * log(Q * wdf)) - (Q * log
> lambda) + (0 - Q) + (0.5 * (1 / (2 * pi * Q * wdf)) * 2 * pi * Q)
>
> using these derivative rules:
> d/dx (any constant) = 0
> d/dx (log x) = 1 / x
> d/dx (log(kx)) = (1 / (kx)) * k, where k is any constant
> d/dx (f(x) * g(x)) = (f(x) * g'(x)) + (f'(x) * g(x))
>
> Hence, the derivative of P evaluates to:
>
> Q + (Q * log Q) + (Q * log(Q * wdf)) - (Q * log lambda) - Q + (0.5 / wdf)
>
> because log(a * b) = log a + log b.
>
> Thus, the derivative of P is
>
> Q * (log Q + log(Q * wdf) - log lambda + (0.5 / (Q * wdf)))
>
>
> >can't wdfn be < 1, in which case log(wdfn) and
> >ln(wdfn) will be negative.
>
> I just realized that this is true: if (1 + c * ave_d / len_d) is less
> than the base of the log for H2, wdfn can be less than 1 and log wdfn
> will be negative. I'll work on it and get back to you. And yes, in his
> thesis, Amati has considered only non-negative values of c.
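
As a made-up numerical example of that case:

    # Illustrative numbers only: with wdf = 1 and a smallish c, the base-2 H2
    # normalisation gives wdfn < 1, so log2(wdfn) is negative.
    from math import log2

    wdf, c, avg_len, doc_len = 1, 0.5, 250.0, 300.0
    wdfn = wdf * log2(1 + c * avg_len / doc_len)
    print(wdfn)        # about 0.50, i.e. < 1
    print(log2(wdfn))  # negative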
>
> I'll keep working on refining the bound to make it as tight as I can. I'll
> send in a pull request in a couple of days. Thank you so much for your
> detailed help. Implementing the other DFR schemes will be comparatively
> simpler once I understand this scheme properly.
>
> -Regards
> -Aarsh
>
> On Tue, Mar 12, 2013 at 7:38 PM, Olly Betts <olly at survex.com> wrote:
>
>> On Tue, Mar 12, 2013 at 01:55:51AM +0530, aarsh shah wrote:
>> > Hello guys. I am working on implementing the PL2 weighting scheme of
>> > the DFR framework by Gianni Amati.
>> > It uses the Poisson approximation of the Binomial as the probabilistic
>> > model (P), the Laplace law of succession to calculate the after effect
>> > of sampling or the risk gain (L), and within-document frequency
>> > normalization H2(2) (as proposed by Amati in his PhD thesis).
>> >
>> > The formula for w(t,d) in this scheme is given by:
>> >
>> > w(t,d) = wqf * L * P
>> > where
>> >          wqf = within query frequency
>> >          L = Laplace law of after effect sampling = 1 / (wdfn + 1)
>> >          P = wdfn * log(wdfn / lambda) + (lambda - wdfn) * log(e)
>> >              + 0.5 * log(2 * pi * wdfn)
>> >          wdfn = wdf * (1 + c * log(average length of document in
>> >              database / length of document d)) (H2 normalization)
>> >          lambda = mean of the Poisson distribution
>> >              = collection frequency of the term / size of the database
>> >          and the base of all logarithms is 2
>> >          c is a constant parameter
>>
>> I think the formula you have there for wdfn is wrong -
>> http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels gives LaTeX source for
>> what it calls tfn, which is your wdfn:
>>
>> tfn=tf\cdot\log_2(1+c\cdot\frac{avg\_l}{l})
>>
>> So the '1+c*' bit is *inside* the log, i.e.:
>>
>>  wdfn = wdf * log(1 + c * average_doc_length/length_of_d)
>>
>> That agrees with Amati's slides from ESSIR 2007, except that he says
>> "ln" not "log" in the tfn formula, which suggests log to the base e.
>> The base of the log would only make a factor difference to wdfn, but
>> that seems like it would affect the overall weight in a more complex
>> way.
>>
>> Amati's PhD thesis also has ln (equation 6.10 on page 118), and that
>> ln is necessary for the H1 approximating H2 for large l observation
>> to work (end of page 118).
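
Putting the pieces together, here is a minimal sketch (plain Python, not the
Xapian C++ code, and with made-up example values) of the per-term PL2 weight
as listed above, using the corrected wdfn with '1 + c*' inside the log and
base-2 logs as in the Amati & van Rijsbergen paper (the base-2 versus base-e
question discussed above is left aside here):

    # Sketch of w(t,d) = wqf * L * P for PL2; illustration only.
    from math import log2, pi, e

    def pl2_weight(wqf, wdf, doc_len, avg_len, coll_freq, coll_size, c):
        lam = coll_freq / coll_size                   # mean of the Poisson model
        wdfn = wdf * log2(1 + c * avg_len / doc_len)  # H2 normalisation
        L = 1.0 / (wdfn + 1.0)                        # Laplace after-effect
        P = (wdfn * log2(wdfn / lam)
             + (lam - wdfn) * log2(e)
             + 0.5 * log2(2 * pi * wdfn))
        return wqf * L * P

    # Made-up example values:
    print(pl2_weight(wqf=1, wdf=4, doc_len=250.0, avg_len=300.0,
                     coll_freq=100, coll_size=100000, c=1.0))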
>>
>> > The code is almost complete but I am stuck at a few places, which are
>> > as follows:
>> >
>> > 1.) Calculating the upper bound of the weight for the get_maxpart()
>> > function
>> >               This one calculation has been giving me sleepless nights
>> > for a couple of days now. The problem is that L is a decreasing
>> > function of wdfn while P, as per my calculations, is an increasing
>> > function. I arrived at this conclusion because the derivative of L is
>> > always negative
>>
>> OK so far.
>>
>> > and the derivative of P is always positive (in the derivative of P,
>> > log(lambda) will always be negative since, in his thesis, Amati states
>> > that for the PL2 model the collection frequency of a term << the
>> > collection size, and so lambda will always be less than one).
>>
>> I'm not sure I totally follow you here.
>>
>> Differentiating P with respect to wdf I get:
>>
>> Q*{log(wdfn) + wdfn/(2*ln(wdfn)) - log lambda - log e + 1/(4*wdfn)}
>>
>> Where Q = log(1+c*avg_l/l) which is > 0 since avg_l, l, and (I assume) c
>> are all > 0, so Q doesn't affect when this does/doesn't become 0.
>>
>> If lambda < 1, then as you say (- log lambda) is positive, but there's
>> the -log(e) and also can't wdfn be < 1, in which case log(wdfn) and
>> ln(wdfn) will be negative.
>>
>> > So, in order to find the upper bound, I simply substituted wdf = 1
>> > for L and used wdf = wdf_upper_bound for P, and multiplied them,
>> > using the upper doc length bound and the lower doc length bound for
>> > the wdfn of L and P respectively. However, this does not give that
>> > tight a bound.
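
A sketch of that loose bound (plain Python; names like wdf_upper and
doclen_lower are placeholders rather than real Xapian API calls): L decreases
with wdfn, so it is largest for wdf = 1 and the largest document length, while
P increases with wdfn, so it is largest for the largest wdf and the smallest
document length.

    # Loose upper bound on wqf * L * P obtained by maximising L and P separately.
    from math import log2, pi, e

    def wdfn(wdf, doc_len, avg_len, c):
        return wdf * log2(1 + c * avg_len / doc_len)

    def P(wdfn_, lam):
        return (wdfn_ * log2(wdfn_ / lam) + (lam - wdfn_) * log2(e)
                + 0.5 * log2(2 * pi * wdfn_))

    def pl2_upper_bound(wqf, wdf_upper, doclen_lower, doclen_upper, avg_len, lam, c):
        L_max = 1.0 / (wdfn(1, doclen_upper, avg_len, c) + 1.0)
        P_max = P(wdfn(wdf_upper, doclen_lower, avg_len, c), lam)
        return wqf * L_max * P_max

    # Made-up example bounds:
    print(pl2_upper_bound(wqf=1, wdf_upper=20, doclen_lower=50.0,
                          doclen_upper=2000.0, avg_len=300.0,
                          lam=0.001, c=1.0))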
>>
>> Having an upper bound is a good start.
>>
>> > Not a word has been spoken about upper bounds on DFR weights in
>> > Amati's thesis or in his papers on DFR.
>>
>> The upper bounds are only needed because of some optimisations we do, so
>> I'm not surprised if they aren't discussed elsewhere.
>>
>> > I even tried differentiating the product of L and P and equating that
>> > to zero, as discussed on IRC, but that landed me with a complicated
>> > equation with no answer in sight. Please tell me what you all think.
>>
>> I got that down to this, but indeed it doesn't look easy to solve:
>>
>> log(w) + w*(w+1)*log(e)/log(w) = log(2*pi)+2*log(e*lambda)+lambda*log(e)
>>
>> I'm not all that confident that's even right either.
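
One possible way around solving that equation exactly (a hedged suggestion,
not Xapian code): both L and P depend on wdf and document length only through
wdfn, so the product L*P could instead be maximised numerically over the
feasible wdfn range. A crude grid scan only approximates the maximum, so some
slack would be needed to turn it into a guaranteed bound:

    # Approximate max of L*P over wdfn in [wdfn_min, wdfn_max] by grid scan.
    from math import log2, pi, e

    def L_times_P(wdfn, lam):
        L = 1.0 / (wdfn + 1.0)
        P = (wdfn * log2(wdfn / lam) + (lam - wdfn) * log2(e)
             + 0.5 * log2(2 * pi * wdfn))
        return L * P

    def approx_max(wdfn_min, wdfn_max, lam, steps=1000):
        step = (wdfn_max - wdfn_min) / steps
        return max(L_times_P(wdfn_min + i * step, lam) for i in range(steps + 1))

    # Made-up example range for wdfn and lambda:
    print(approx_max(0.2, 56.0, 0.001))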
>>
>> > 2.) Documentation
>> >
>> >               This scheme belongs to a family of weighting schemes.
>> > Please do let me know if I need to write any additional documentation
>> > to introduce this new framework and the new weighting schemes.
>>
>> I don't think we need a lengthy document about them, but it would be
>> useful to refer people to where the schemes are defined, and also
>> briefly mention when they might want to use a particular scheme - for
>> example, PL2 is apparently good if you want early precision.
>>
>> Cheers,
>>     Olly
>>
>
>