[Xapian-devel] Implementation of the PL2 weighting scheme of the DFR Framework

aarsh shah aarshkshah1992 at gmail.com
Tue Mar 12 18:14:33 GMT 2013


>I think the formula you have there for wdfn is wrong -
>http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels<http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels>gives LaTeX source for
>what it calls tfn, which is your wdfn:
>tfn=tf\cdot\log_2(1+c\cdot\frac{avg\_l}{l})
>So the '1+c*' bit is *inside* the log, i.e.:
>wdfn = wdf * log(1 + c * average_doc_length/length_of_d)

Yeah sorry,that was a typo.

>That agrees with Amati's slides from ESSIR 2007, except that he says
>"ln" not "log" in the tfn formula, which suggests log to the base e.
>The base of the log would only make a factor difference to wdfn, but
>that seems like it would affect the overall weight in a more complex
>way.
>Amati's phd thesis also has ln (equation 6.10 on page 118), and that
>ln is necessary for the H1 approximating H2 for large l observation
>to work (end of page 118).

I know,I've gone through the relevant part of the thesis.But the thing
is,Amati also did a paper on DFR schemes along with Van Rijsberge titled
*Probabilistic models of information retrieval based on measuring
divergence from randomness.
*I've read the paper through and through and in this paper,he has taken the
base of the log to be 2 everywhere including the H2
normalization.Moreover,the terrier code has also taken the base to be 2 .
So,I decided to go along with it. Please do let me know if you think a base
of e is better .

>I'm not sure I totally follow you here.
>Differentiating P with respect to wdf I get:
>Q*{log(wdfn) + wdfn/(2*ln(wdfn)) - log lambda - log e + 1/(4*wdfn)}
>Where Q = log(1+c*avg_l/l)

Umm...Olly , I get a different derivative .
P = (Q * wdf * log( (Q * wdf ) / lamda) + ((lamda - Q * wdf) * log e) +(
0.5 * log(2 * pi * Q *wdf)))

Let the base of all logarithms be e here . Then ,

P = ( Q * wdf * log (Q * wdf)) - (Q * wdf * log (lamda)) + (lamda - Q* wdf)
+ (0.5 * log (2 * pi * Q * wdf))

because log e =1 and log (a / b) = log a - log b.

Therefore ,the derivative of P is

P = (Q * wdf * (1 / (Q * wdf))  * Q) + (Q * log (Q * wdf)) - (Q * log
lamda) +  (0 -Q) + (0.5 * (1 / (2 * pi * Q * wdf)) * 2 * pi * Q)

because of the derivative of::-
any constant = 0
log x = 1 / x
log(kx) = (1 / k * x ) * k where k is any constant
f(x) * g(x) = ((f(x) * g ' (x)) + (f ' (x) * g(x)) )

Hence,derivative of P evaluates to :

Q + (Q * log Q) + (Q * log (Q * wdf) ) -( Q * log lamda) - Q + (0.5 / wdf )

because log (a *b) = log a + log b

Thus,derivative of P is

Q * (log Q + log (Q * wdf ) - log lamda + (0.5 / (Q *wdf)))

>can't wdfn be < 1, in which case log(wdfn) and
>ln(wdfn) will be negative.

I just realized that this is true, if (1 + c ave_d  / len_d) is less than
the base of the log for H2,wdfn can be less than 1  and log wdfn will be
negative.I'll work on it and get back to you.And yes,in his thesis,Amati
has considered only non negative values of c.

Ill keep working on refining the bound to make it as tight as I can.Will
send in a pull request in a couple of days.Thank you so much for your
detailed help . Implementing the other DFR schemes will be comparatively
simpler once I understand this scheme properly.

-Regards
-Aarsh









>
>
>

On Tue, Mar 12, 2013 at 7:38 PM, Olly Betts <olly at survex.com> wrote:

> On Tue, Mar 12, 2013 at 01:55:51AM +0530, aarsh shah wrote:
> > Hello guys.I am working on implementing the PL2 weighting scheme of the
> DFR
> > framework by Gianni Amati.
> > It uses the Poisson approximation of the Binomial as the probabilistic
> > model (P), the Laplace law of succession to calculate the after effect of
> > sampling or the risk gain (L) and within document frequency
>  normalization
> > H2(2) (as proposed by Amati in his PHD thesis).
> >
> > The formula for w(t,d) in this scheme is given by::-
> >
> > w(t,d) = wqf * L * P
> > where
> >          wqf = within query frequency
> >          L = Laplace law of after effect sampling =1 / (wdfn + 1)
> >          P = wdfn * log (wdfn / lamda) + (lamda - wdfn) log(e) + 0.5 *
> log
> > (2 * pi * wdfn)
> >          wdfn = wdf * (1+c * log(average length of document in database /
> > length of document d )) (H2 Normalization )
> >          lamda = mean of the Poisson distrubution = Collection frequency
> of
> > the term   / Size of the database
> >          and the base of all logarithms is 2.
> >          c is a constant  parameter
>
> I think the formula you have there for wdfn is wrong -
> http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels gives LaTeX source for
> what it calls tfn, which is your wdfn:
>
> tfn=tf\cdot\log_2(1+c\cdot\frac{avg\_l}{l})
>
> So the '1+c*' bit is *inside* the log, i.e.:
>
>  wdfn = wdf * log(1 + c * average_doc_length/length_of_d)
>
> That agrees with Amati's slides from ESSIR 2007, except that he says
> "ln" not "log" in the tfn formula, which suggests log to the base e.
> The base of the log would only make a factor difference to wdfn, but
> that seems like it would affect the overall weight in a more complex
> way.
>
> Amati's phd thesis also has ln (equation 6.10 on page 118), and that
> ln is necessary for the H1 approximating H2 for large l observation
> to work (end of page 118).
>
> > The code is almost complete but I am stuck at a few places which are as
> > follows:-
> >
> > 1.) Calculating the upper bound of the weight for the get_maxpart( )
> > function
> >               This one calculation has been giving me sleepless nights
> for
> > a couple of days now.The problem is that L is
> >               a decreasing function for wdfn and P as per my calculations
> > is a increasing function . I arrived at this conclusion
> >               because the derivative of L is always negative
>
> OK so far.
>
> > and the
> > derivative of P is always positive (In the derivative of P, log (lamda)
> >               will always be negative as in his thesis,Amati states that
> > for the PL2 model, collection frequency of term << Collection
> >               Size and so lamda will always be less than one .)
>
> I'm not sure I totally follow you here.
>
> Differentiating P with respect to wdf I get:
>
> Q*{log(wdfn) + wdfn/(2*ln(wdfn)) - log lambda - log e + 1/(4*wdfn)}
>
> Where Q = log(1+c*avg_l/l) which is > 0 since avg_l, l, and (I assume) c
> are all > 0, so Q doesn't affect when this does/doesn't become 0.
>
> If lambda < 1, then as you say (- log lamba) is positive, but there's
> the -log(e) and also can't wdfn be < 1, in which case log(wdfn) and
> ln(wdfn) will be negative.
>
> > .So, in
> > order to find the upper bound,I simply substituted wdf=1 for L
> >               and used wdf = wdf_upper_bound for P and multiplied them by
> > using upper doc length bound and lower doc length
> >               for wdfn of L and P respectively.However,this does not give
> > that tight a bound.
>
> Having an upper bound is a good start.
>
> > Not a word has been spoken about
> >               upper bounds on DFR weights in Amati's thesis or on his
> > papers on DFR .
>
> The upper bounds are only needed because of some optimisations we do, so
> I'm not surprised if they aren't discussed elsewhere.
>
> > I even tried differentiating the product of
> >               L and P and equated that to zero as discussed on IRC but
> that
> > landed me with a complicated equation with no answer
> >               in sight.Please tell me what you'll think.
>
> I got that down to this, but indeed it doesn't look easy to solve:
>
> log(w) + w*(w+1)*log(e)/log(w) = log(2*pi)+2*log(e*lamba)+lamba*log(e)
>
> I'm not all that confident that's even right either.
>
> > 2.) Documentation
> >
> >               This scheme belongs to a family of weighting schemes.Please
> > do let me if I need to write any additional documentation
> >                to introduce this new framework and the new weighting
> > schemes.
>
> I don't think we need a lengthy document about them, but it would be
> useful to refer people to where the schemes are defined, and also
> briefly mention when they might want to use a particular scheme - for
> example, PL2 is apparently good if you want early precision.
>
> Cheers,
>     Olly
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20130312/c117d99c/attachment.htm>


More information about the Xapian-devel mailing list