<div dir="ltr"><div>Hello, </div><div><br></div><div><br></div><div>I'm Shivang Bansal, a 3rd-year Computer Science Engineering undergraduate at the Institute of Engineering & Technology in Lucknow, India. I am writing to express my interest in this year's Google Summer of Code program. I apologize for getting in touch so late. I would have contacted you earlier, but the sudden demise of my grandfather prevented me from doing so. </div><div><br></div><div>I am interested in working with your organisation for two reasons. <b>First</b>, I believe that Xapian gives any developer an incredibly useful way to take advantage of a search engine. Had I known of its existence earlier, I would have used it extensively in a web-scraping project in which I scraped product information from various e-commerce websites into a database, so that users could search for a product and compare its price across different sites (much like any well-known price comparison website). <b>Second</b>, I am fascinated by Information Retrieval and eager to explore its computational side. Specifically, I am interested in updating and complementing the current library to further improve its functionality.</div><div><br></div><div><br></div><div>The project I would like to work on this summer is <b>Weighting Schemes</b>. 
</div><div>So far, I have successfully built Xapian on my PC and read the complete guide at the following link:</div><div><a href="http://getting-started-with-xapian.readthedocs.io/en/latest/index.html" target="_blank">http://getting-started-with-xa<wbr>pian.readthedocs.io/en/latest/<wbr>index.html</a>. I also worked through <i>A practical example</i> given in the guide. Although I faced some problems regarding the <i>shared library</i> <b>xapian-delve-1.5</b>, I eventually solved them after some googling and going through some of the related messages in the IRC archives.</div><div><br></div><div>Moreover, I have gone through the code base, particularly <b>xapian-core</b>, studied the <b>Xapian::Weight</b> class thoroughly, and looked at the implementations of some of the existing weighting schemes. </div><div><br></div><div>Currently, I am looking at the ticket <a href="https://trac.xapian.org/ticket/744" target="_blank">https://trac.xapian.<wbr>org/ticket/744</a> and trying to devise a way so that the <i>get_sumpart()</i> method in every weight subclass does not need to be updated after the merge. I am also going through <b>xapian-api</b> to get a better feel for the code base.</div><div><br></div><div><br></div><div>The project ideas I would like to propose and get some feedback on are:</div><div><br></div><div><b>1)</b> Currently, Xapian supports weighting schemes that rely on a bag-of-words representation of documents, which assumes that the terms in a document are independent of one another. That assumption is debatable, since in some cases word order and word dependence do matter. For example, <i>Mary is quicker than John</i> and <i>John is quicker than Mary</i> have different meanings. </div><div>I want to implement a <b>Graph-of-word</b> representation in Xapian, which addresses such cases by capturing the order relationships between the terms in a document using an unweighted directed graph of terms. 
This representation can then be used to define a new weighting scheme, <b>TW-IDF</b> (TW = Term Weight, IDF = Inverse Document Frequency), which <i>significantly outperforms</i> <b>TF-IDF</b> and <b>BM25</b>, and in some cases its extension <b>BM25+</b>, on various standard TREC datasets. This gain in effectiveness does not come at the cost of efficiency, as confirmed by the experiments reported in [2]. </div><div><br></div><div><div>The papers I have referred to for the above are:</div><div>[1] <a href="https://www.researchgate.net/publication/220479875_Graph-based_term_weighting_for_information_retrieval" target="_blank">https://www.researchgate.n<wbr>et/publication/220479875_Graph<wbr>-based_term_weighting_for_info<wbr>rmation_retrieval</a></div><div>[2] <a href="https://pdfs.semanticscholar.org/8eac/d0f01ab0f53706561dda0ce8d1f96544a348.pdf" target="_blank">https://pdfs.semanticschol<wbr>ar.org/8eac/d0f01ab0f53706561d<wbr>da0ce8d1f96544a348.pdf</a> </div></div><div><br></div><div><br></div><div><b>2)</b> Another weighting scheme I would like to implement in Xapian is <b>TF-ATO</b> (Term Frequency - Average Term Occurrences), described in <a href="http://eprints.nottingham.ac.uk/31329/1/dls_ukci2014.pdf" target="_blank">http://eprints.nottingham.ac.u<wbr>k/31329/1/dls_ukci2014.pdf</a>, together with a discriminative approach that uses the <b>document centroid</b> as a <i>threshold</i> for normalising the documents in a collection. The document centroid is used to remove less significant weights from the documents, which helps to achieve higher retrieval effectiveness. 
The reported average improvement in <b>precision</b> (non-interpolated average precision) of TF-ATO over TF-IDF is <b>40-50%</b>. With some extra work, this scheme also performs effectively on dynamic data collections.</div><div><br></div><div><br></div><div>It would be great to have your opinion on these project ideas, as it would help me come up with a proposal on how to implement them. I realise that it's quite late, but I will start working on my proposal as soon as possible so that I can improve it with your feedback. <br></div><div><br></div><div>I'm sorry if this mail is too long. Thank you so much for your time.</div><div><br></div><div><br></div><div>Shivang Bansal</div><div><br></div><div><br></div></div>