<div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>So if we set:</div>
<br>
K = doc_length_upper_bound<br>
<br>
we can ensure that K.Pi >= 1 and not have to worry about clamping the<br>
log to be non-negative. </blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
So it looks like we can actually pick a non-insane K which will ensure<br>
we never clamp. Maybe that would be inefficient though, and actually a<br>
smaller K would work equally well for retrieval, yet be faster.<br>
<br></blockquote><div>Yes, I think this will serve as a very good starting value for <b>K</b>. Later, since I am also planning to write accuracy tests that check the weighting scheme against values from the literature, I think we can run those tests with various values of <b>K</b> to see how it affects retrieval performance and run-time. That way we can find a value of K which is more efficient than <b>K = doc_length_upper_bound</b> without compromising retrieval performance.</div>
<div><br></div><div>I think that even after this it would be a good idea to allow the user to specify the value of <b>K</b>, keeping the value we find as the default.</div><div><br></div><div>I was thinking about the scheme and had the following thought:</div>
<div><br></div><div>Consider two documents, with *document 1* matching (containing) 2 query terms and *document 2* matching (containing) 3 query terms. Then</div><div><br></div><div>the virtual weight function would be equivalent to:</div>
<div><br>
</div><div>Wdocument1' = Wdocument1(orig) + 2log(K) = log(K.P1) + log(K.P2)</div><div><br></div><div>Wdocument2' = Wdocument2(orig) + 3log(K) = log(K.P1) + log(K.P2) + log(K.P3)</div></div><div><br></div><div>I believe this is what would happen, since for *document 1* the matcher would call the Weight class 2 times and for *document 2* it would call the Weight class 3 times.</div>
<div><br></div><div>Since *document 2* has more matching terms, it should probably be ranked higher. But if the wdf values for the terms present in *document 1* were much higher, so that *document 1* could actually overcome *document 2*, then this ranking won't be appropriate, as it will still rank *document 2* higher due to the large <b>K</b> value being added.</div>
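<div><br></div><div>To make the concern concrete, here is a rough sketch of the arithmetic only. It is not Xapian code: the function name shifted_score, the per-term probabilities and the values of K are all made up for illustration, and it assumes each matching term simply contributes log(K.Pi) to the document weight.</div>

```python
import math

def shifted_score(term_probs, k):
    """Sum log(K * P_i) over the query terms that match a document.

    Hypothetical helper: equivalent to sum(log(P_i)) + n * log(K),
    where n is the number of matching terms.
    """
    return sum(math.log(k * p) for p in term_probs)

# Invented per-term probabilities, chosen only to illustrate the effect.
doc1_probs = [0.05, 0.04]            # 2 matching terms, relatively high wdf
doc2_probs = [0.002, 0.002, 0.002]   # 3 matching terms, low wdf

def ranking(k):
    """Return the name of the document that scores higher for a given K."""
    s1 = shifted_score(doc1_probs, k)
    s2 = shifted_score(doc2_probs, k)
    return "document 1" if s1 > s2 else "document 2"

# Try a few assumed values of K and see which document comes out on top.
for k in (1e3, 1e5, 1e7):
    print(k, ranking(k))
```

<div>With these invented numbers *document 1* stays ahead for the two smaller values of K, and *document 2* overtakes it for the largest one, purely because of the extra log(K) it picks up for its third matching term. So the relative ranking can depend on how large K is, which is another reason to test several values.</div>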
<div><br></div><div><br></div><div>Thanks,</div><div><br></div><div>-- </div>with regards<br>Gaurav A.<br>
</div>