Hi everyone, <div><div>I'm Akshay, an Information Science undergrad from Bangalore. I'm interested in Information Retrieval, and I'd like to contribute to Xapian as part of GSoC and afterwards as well, to pursue my interests. </div>
<div><br></div><div>I liked the idea of adding more weighting schemes (Project #2). I did a project last semester on Document Retrieval on Hadoop using TF-IDF and Cosine Similarity (the query had to be a document).</div>
<div>I've read about BM25 in the listed resources, but I don't yet have a good understanding of DFR. I'm reading [1] and [2] for more information on DFR, in addition to the resources mentioned on the project page.</div><div><br></div><div>
In the project description, I couldn't understand this part: <i>"Additionally, for faster searching, an upper bound on each component is needed (each database stores a number of summary statistics to help with this - if additional statistics would be useful, you could add them as part of the project)."</i></div>
<div>I'm guessing that "components" refers to the per-document term weights and the DF/IDF weights. Is that right? Any elaboration would be really helpful.</div><div><br></div><div>Can someone please point me to patches/bugs related to this project so I can understand the existing code better, especially anything related to the Xapian::Weight class, or anything else that would help me get started with the Xapian codebase? </div>
<div><br></div><div>References - </div><div>[1] <a href="http://dl.acm.org/citation.cfm?id=582416&dl=ACM&coll=DL&CFID=88428217&CFTOKEN=33171494" target="_blank">http://dl.acm.org/citation.cfm?id=582416&dl=ACM&coll=DL&CFID=88428217&CFTOKEN=33171494</a></div>
<div>[2]
<a href="http://terrier.org/docs/v3.5/dfr_description.html" target="_blank">http://terrier.org/docs/v3.5/dfr_description.html</a> <br clear="all">
<div><br></div>-- <br>Regards,<br>Akshay<br>
</div></div>