[Xapian-devel] Implementation of the PL2 weighting scheme of the DFR Framework

Tue Mar 12 14:08:39 GMT 2013

On Tue, Mar 12, 2013 at 01:55:51AM +0530, aarsh shah wrote:
> Hello guys.I am working on implementing the PL2 weighting scheme of the DFR
> framework by Gianni Amati.
> It uses the Poisson approximation of the Binomial as the probabilistic
> model (P), the Laplace law of succession to calculate the after effect of
> sampling or the risk gain (L) and within document frequency  normalization
> H2(2) (as proposed by Amati in his PHD thesis).
> 
> The formula for w(t,d) in this scheme is given by::-
> 
> w(t,d) = wqf * L * P
> where
>          wqf = within query frequency
>          L = Laplace law of after effect sampling =1 / (wdfn + 1)
>          P = wdfn * log (wdfn / lamda) + (lamda - wdfn) log(e) + 0.5 * log
> (2 * pi * wdfn)
>          wdfn = wdf * (1+c * log(average length of document in database /
> length of document d )) (H2 Normalization )
>          lamda = mean of the Poisson distrubution = Collection frequency of
> the term   / Size of the database
>          and the base of all logarithms is 2.
>          c is a constant  parameter

I think the formula you have there for wdfn is wrong -
http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels gives LaTeX source for
what it calls tfn, which is your wdfn:

tfn=tf\cdot\log_2(1+c\cdot\frac{avg\_l}{l})

So the '1+c*' bit is *inside* the log, i.e.:

 wdfn = wdf * log(1 + c * average_doc_length/length_of_d)

That agrees with Amati's slides from ESSIR 2007, except that he says
"ln" not "log" in the tfn formula, which suggests log to the base e.
The base of the log would only make a factor difference to wdfn, but
that seems like it would affect the overall weight in a more complex
way.

Amati's phd thesis also has ln (equation 6.10 on page 118), and that
ln is necessary for the H1 approximating H2 for large l observation
to work (end of page 118).

> The code is almost complete but I am stuck at a few places which are as
> follows:-
> 
> 1.) Calculating the upper bound of the weight for the get_maxpart( )
> function
>               This one calculation has been giving me sleepless nights for
> a couple of days now.The problem is that L is
>               a decreasing function for wdfn and P as per my calculations
> is a increasing function . I arrived at this conclusion
>               because the derivative of L is always negative

OK so far.

> and the
> derivative of P is always positive (In the derivative of P, log (lamda)
>               will always be negative as in his thesis,Amati states that
> for the PL2 model, collection frequency of term << Collection
>               Size and so lamda will always be less than one .)

I'm not sure I totally follow you here.

Differentiating P with respect to wdf I get:

Q*{log(wdfn) + wdfn/(2*ln(wdfn)) - log lambda - log e + 1/(4*wdfn)}

Where Q = log(1+c*avg_l/l) which is > 0 since avg_l, l, and (I assume) c
are all > 0, so Q doesn't affect when this does/doesn't become 0.

If lambda < 1, then as you say (- log lamba) is positive, but there's
the -log(e) and also can't wdfn be < 1, in which case log(wdfn) and
ln(wdfn) will be negative.

> .So, in
> order to find the upper bound,I simply substituted wdf=1 for L
>               and used wdf = wdf_upper_bound for P and multiplied them by
> using upper doc length bound and lower doc length
>               for wdfn of L and P respectively.However,this does not give
> that tight a bound.

Having an upper bound is a good start.

> Not a word has been spoken about
>               upper bounds on DFR weights in Amati's thesis or on his
> papers on DFR .

The upper bounds are only needed because of some optimisations we do, so
I'm not surprised if they aren't discussed elsewhere.

> I even tried differentiating the product of
>               L and P and equated that to zero as discussed on IRC but that
> landed me with a complicated equation with no answer
>               in sight.Please tell me what you'll think.

I got that down to this, but indeed it doesn't look easy to solve:

log(w) + w*(w+1)*log(e)/log(w) = log(2*pi)+2*log(e*lamba)+lamba*log(e)

I'm not all that confident that's even right either.

> 2.) Documentation
> 
>               This scheme belongs to a family of weighting schemes.Please
> do let me if I need to write any additional documentation
>                to introduce this new framework and the new weighting
> schemes.

I don't think we need a lengthy document about them, but it would be
useful to refer people to where the schemes are defined, and also
briefly mention when they might want to use a particular scheme - for
example, PL2 is apparently good if you want early precision.

Cheers,
    Olly