Hello guys.I have sent a pull request for the PL2 scheme.Please do have a look.I will begin work on the DPH scheme after working on the feedback of this scheme.Also,I'll add the feature tests for this scheme after I've worked on the feedback.Thanks. <br>
<br><a href="https://github.com/xapian/xapian/pull/8">https://github.com/xapian/xapian/pull/8</a><br><br><br><div class="gmail_quote">On Wed, Mar 13, 2013 at 12:14 AM, aarsh shah <span dir="ltr"><<a href="mailto:aarshkshah1992@gmail.com" target="_blank">aarshkshah1992@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Sorry, the derivative of P evaluates to<br><br>Q + (Q * log (Q * wdf) ) -( Q * log lamda) - Q + (0.5 / wdf )<br><br>= Q * (log (Q * wdf) - log (lamda) +( 0.5 / (Q * wdf)))<div class="HOEnZb">
<div class="h5"><br><br><br><div class="gmail_quote">On Tue, Mar 12, 2013 at 11:44 PM, aarsh shah <span dir="ltr"><<a href="mailto:aarshkshah1992@gmail.com" target="_blank">aarshkshah1992@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>>I think the formula you have there for wdfn is wrong -<br>
<a href="http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels" target="_blank">>http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels</a> gives LaTeX source for<br>
>what it calls tfn, which is your wdfn: <br>
>tfn=tf\cdot\log_2(1+c\cdot\frac{avg\_l}{l})<br>>So the '1+c*' bit is *inside* the log, i.e.:<br>>wdfn = wdf * log(1 + c * average_doc_length/length_of_d)<br><br></div>Yeah sorry,that was a typo.<div>
<br>
<br>>That agrees with Amati's slides from ESSIR 2007, except that he says<br>
>"ln" not "log" in the tfn formula, which suggests log to the base e.<br>
>The base of the log would only make a factor difference to wdfn, but<br>>that seems like it would affect the overall weight in a more complex<br>>way.<br>
>Amati's phd thesis also has ln (equation 6.10 on page 118), and that<br>
>ln is necessary for the H1 approximating H2 for large l observation<br>
>to work (end of page 118).<br><br></div>I know,I've gone through the relevant part of the thesis.But the thing is,Amati also did a paper on DFR schemes along with Van Rijsberge titled <br><i>Probabilistic models of information retrieval
based on measuring divergence from randomness.<br></i>I've read the paper through and through and in this paper,he has taken the base of the log to be 2 everywhere including the H2 normalization.Moreover,the terrier code has also taken the base to be 2 . So,I decided to go along with it. Please do let me know if you think a base of e is better .<div>
<br>
<br>>I'm not sure I totally follow you here.
<br>
>Differentiating P with respect to wdf I get:<br>
>Q*{log(wdfn) + wdfn/(2*ln(wdfn)) - log lambda - log e + 1/(4*wdfn)}<br>>Where Q = log(1+c*avg_l/l)<br>
<br></div>Umm...Olly , I get a different derivative . <br>P = (Q * wdf * log( (Q * wdf ) / lamda) + ((lamda - Q * wdf) * log e) +( 0.5 * log(2 * pi * Q *wdf)))<br> <br>Let the base of all logarithms be e here . Then ,<br>
<br>P = ( Q * wdf * log (Q * wdf)) - (Q * wdf * log (lamda)) + (lamda - Q* wdf) + (0.5 * log (2 * pi * Q * wdf))<br>
<br>because log e =1 and log (a / b) = log a - log b.<br> <br>Therefore ,the derivative of P is<br><br>P = (Q * wdf * (1 / (Q * wdf)) * Q) + (Q * log (Q * wdf)) - (Q * log lamda) + (0 -Q) + (0.5 * (1 / (2 * pi * Q * wdf)) * 2 * pi * Q)<br>
<br>because of the derivative of::-<br>any constant = 0<br>log x = 1 / x<br>log(kx) = (1 / k * x ) * k where k is any constant<br>f(x) * g(x) = ((f(x) * g ' (x)) + (f ' (x) * g(x)) )<br><br>Hence,derivative of P evaluates to :<br>
<br>Q + (Q * log Q) + (Q * log (Q * wdf) ) -( Q * log lamda) - Q + (0.5 / wdf )<br><br>because log (a *b) = log a + log b <br><br>Thus,derivative of P is<br><br>Q * (log Q + log (Q * wdf ) - log lamda + (0.5 / (Q *wdf)))<div>
<br>
<br>>can't wdfn be < 1, in which case log(wdfn) and<br>
>ln(wdfn) will be negative.<br><br></div>I just realized that this is true, if (1 + c ave_d / len_d) is less than the base of the log for H2,wdfn can be less than 1 and log wdfn will be negative.I'll work on it and get back to you.And yes,in his thesis,Amati has considered only non negative values of c.<br>
<br>Ill keep working on refining the bound to make it as tight as I can.Will send in a pull request in a couple of days.Thank you so much for your detailed help . Implementing the other DFR schemes will be comparatively simpler once I understand this scheme properly.<br>
<br>-Regards<span><font color="#888888"><br>-Aarsh</font></span><div><div><br><br><br><br><br><br><br><br><br><blockquote class="gmail_quote"></blockquote>
<blockquote class="gmail_quote"></blockquote><blockquote class="gmail_quote"></blockquote><blockquote class="gmail_quote">
<br>
<br>
<br>
</blockquote><br><br><div class="gmail_quote">On Tue, Mar 12, 2013 at 7:38 PM, Olly Betts <span dir="ltr"><<a href="mailto:olly@survex.com" target="_blank">olly@survex.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>On Tue, Mar 12, 2013 at 01:55:51AM +0530, aarsh shah wrote:<br>
> Hello guys.I am working on implementing the PL2 weighting scheme of the DFR<br>
> framework by Gianni Amati.<br>
> It uses the Poisson approximation of the Binomial as the probabilistic<br>
> model (P), the Laplace law of succession to calculate the after effect of<br>
> sampling or the risk gain (L) and within document frequency normalization<br>
> H2(2) (as proposed by Amati in his PHD thesis).<br>
><br>
> The formula for w(t,d) in this scheme is given by::-<br>
><br>
> w(t,d) = wqf * L * P<br>
> where<br>
> wqf = within query frequency<br>
> L = Laplace law of after effect sampling =1 / (wdfn + 1)<br>
> P = wdfn * log (wdfn / lamda) + (lamda - wdfn) log(e) + 0.5 * log<br>
> (2 * pi * wdfn)<br>
> wdfn = wdf * (1+c * log(average length of document in database /<br>
> length of document d )) (H2 Normalization )<br>
> lamda = mean of the Poisson distrubution = Collection frequency of<br>
> the term / Size of the database<br>
> and the base of all logarithms is 2.<br>
> c is a constant parameter<br>
<br>
</div>I think the formula you have there for wdfn is wrong -<br>
<a href="http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels" target="_blank">http://ir.dcs.gla.ac.uk/wiki/FormulasOfDFRModels</a> gives LaTeX source for<br>
what it calls tfn, which is your wdfn:<br>
<br>
tfn=tf\cdot\log_2(1+c\cdot\frac{avg\_l}{l})<br>
<br>
So the '1+c*' bit is *inside* the log, i.e.:<br>
<br>
wdfn = wdf * log(1 + c * average_doc_length/length_of_d)<br>
<br>
That agrees with Amati's slides from ESSIR 2007, except that he says<br>
"ln" not "log" in the tfn formula, which suggests log to the base e.<br>
The base of the log would only make a factor difference to wdfn, but<br>
that seems like it would affect the overall weight in a more complex<br>
way.<br>
<br>
Amati's phd thesis also has ln (equation 6.10 on page 118), and that<br>
ln is necessary for the H1 approximating H2 for large l observation<br>
to work (end of page 118).<br>
<div><br>
> The code is almost complete but I am stuck at a few places which are as<br>
> follows:-<br>
><br>
> 1.) Calculating the upper bound of the weight for the get_maxpart( )<br>
> function<br>
> This one calculation has been giving me sleepless nights for<br>
> a couple of days now.The problem is that L is<br>
> a decreasing function for wdfn and P as per my calculations<br>
> is a increasing function . I arrived at this conclusion<br>
> because the derivative of L is always negative<br>
<br>
</div>OK so far.<br>
<div><br>
> and the<br>
> derivative of P is always positive (In the derivative of P, log (lamda)<br>
> will always be negative as in his thesis,Amati states that<br>
> for the PL2 model, collection frequency of term << Collection<br>
> Size and so lamda will always be less than one .)<br>
<br>
</div>I'm not sure I totally follow you here.<br>
<br>
Differentiating P with respect to wdf I get:<br>
<br>
Q*{log(wdfn) + wdfn/(2*ln(wdfn)) - log lambda - log e + 1/(4*wdfn)}<br>
<br>
Where Q = log(1+c*avg_l/l) which is > 0 since avg_l, l, and (I assume) c<br>
are all > 0, so Q doesn't affect when this does/doesn't become 0.<br>
<br>
If lambda < 1, then as you say (- log lamba) is positive, but there's<br>
the -log(e) and also can't wdfn be < 1, in which case log(wdfn) and<br>
ln(wdfn) will be negative.<br>
<div><br>
> .So, in<br>
> order to find the upper bound,I simply substituted wdf=1 for L<br>
> and used wdf = wdf_upper_bound for P and multiplied them by<br>
> using upper doc length bound and lower doc length<br>
> for wdfn of L and P respectively.However,this does not give<br>
> that tight a bound.<br>
<br>
</div>Having an upper bound is a good start.<br>
<div><br>
> Not a word has been spoken about<br>
> upper bounds on DFR weights in Amati's thesis or on his<br>
> papers on DFR .<br>
<br>
</div>The upper bounds are only needed because of some optimisations we do, so<br>
I'm not surprised if they aren't discussed elsewhere.<br>
<div><br>
> I even tried differentiating the product of<br>
> L and P and equated that to zero as discussed on IRC but that<br>
> landed me with a complicated equation with no answer<br>
> in sight.Please tell me what you'll think.<br>
<br>
</div>I got that down to this, but indeed it doesn't look easy to solve:<br>
<br>
log(w) + w*(w+1)*log(e)/log(w) = log(2*pi)+2*log(e*lamba)+lamba*log(e)<br>
<br>
I'm not all that confident that's even right either.<br>
<div><br>
> 2.) Documentation<br>
><br>
> This scheme belongs to a family of weighting schemes.Please<br>
> do let me if I need to write any additional documentation<br>
> to introduce this new framework and the new weighting<br>
> schemes.<br>
<br>
</div>I don't think we need a lengthy document about them, but it would be<br>
useful to refer people to where the schemes are defined, and also<br>
briefly mention when they might want to use a particular scheme - for<br>
example, PL2 is apparently good if you want early precision.<br>
<br>
Cheers,<br>
Olly<br>
</blockquote></div><br>
</div></div></blockquote></div><br>
</div></div></blockquote></div><br>