GSoC-2017 Introduction and Project Discussion

Sun Mar 19 12:15:22 GMT 2017

On 16 Mar 2017, at 13:31, Shivang Bansal <shivangbansal1995 at gmail.com> wrote:

> I'm Shivang Bansal, a 3rd year Computer Science Engineering undergraduate at Institute of Engineering & Technology in Lucknow, India.

Hi Shivang! Welcome to Xapian :)

> The project I would like to work on over the summer is Weighting Schemes.
> So far, I have successfully built Xapian on my PC and read the complete guide at
> http://getting-started-with-xapian.readthedocs.io/en/latest/index.html. I also worked through "A practical example" from the guide. I faced some problems with the shared library xapian-delve-1.5, but eventually solved them after some googling and reading related messages in the IRC archives.

It's worth pointing out that xapian-delve is not a shared library. Perhaps you mean you had difficulties with shared libraries while building or running it? The fact that you're using xapian-delve-1.5 tells me that you're using the development work from git, which is not the recommended approach in the getting started guide, but is absolutely the right way when working on Xapian's codebase. (You probably need to look at the Xapian developer guide to ensure you're set up properly with the development codebase: https://xapian-developer-guide.readthedocs.io/en/latest/)

> The project ideas which I would like to propose and get some feedback on are:
> 
> 1) Currently, Xapian supports weighting schemes that rely on a bag-of-words representation of documents, which assumes that the terms in a document are independent of one another. That assumption is debatable, because in some cases word order and word dependence do matter: for example, "Mary is quicker than John" and "John is quicker than Mary" mean different things.
> I want to implement a graph-of-word representation in Xapian, which addresses such cases by capturing the order relationships between the terms in a document using an unweighted directed graph of terms. This representation can then be used to define a new weighting scheme, TW-IDF (TW = Term Weight, IDF = Inverse Document Frequency), which significantly outperforms TF-IDF and BM25, and in some cases its extension BM25+, on various standard TREC datasets. This effectiveness is not achieved at the cost of efficiency, as confirmed by various experiments shown in [2].

You'll need to propose quite a lot of detail on how you're going to implement this using the Xapian backend database and weighting system. I suspect you'll have to extend it a fair amount to support TW-IDF, because we have no graph support at present. I haven't read the paper though, so it's possible you can do this using some preprocessing in some way.
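
As a rough illustration of the preprocessing route, here is a minimal Python sketch. It assumes (from your description, not from the paper) that the graph has one node per unique term, a directed edge from each term to the terms that follow it within a small sliding window, and that a term's weight is its in-degree. The window size of 4 and the function names are hypothetical choices for this example only.

```python
from collections import defaultdict


def graph_of_word(terms, window=4):
    """Build an unweighted directed graph: term -> set of terms that follow it
    within a sliding window spanning `window` terms (assumed parameter)."""
    successors = defaultdict(set)
    for i, term in enumerate(terms):
        successors[term]  # make sure every term is a node, even with no out-edges
        for following in terms[i + 1:i + window]:
            if following != term:  # skip self-loops
                successors[term].add(following)
    return successors


def term_weights(terms, window=4):
    """Assumed TW: the in-degree of each term in the graph-of-word."""
    graph = graph_of_word(terms, window)
    indegree = {term: 0 for term in graph}
    for targets in graph.values():
        for target in targets:
            indegree[target] += 1
    return indegree


# The two example sentences contain the same terms but yield different weights.
a = term_weights("mary is quicker than john".split())
b = term_weights("john is quicker than mary".split())
print(a["mary"], a["john"])  # 0 3
print(b["mary"], b["john"])  # 3 0
```

If the scheme really does reduce to one number per term per document like this, it might be possible to compute it at index time and store it alongside (or in place of) the within-document frequency, which would avoid needing graph support in the backend. That is speculation on my part and would need checking against the paper.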

It's also worth noting that we've sometimes seen quite different evaluation results to the academic research in the past. There's a module, designed to run with TREC data, that implements some evaluation metrics (https://github.com/samuelharden/xapian-evaluation) and can be used to gauge how a new weighting scheme compares to the others we have. (We also have organisational access to the FIRE datasets for this purpose.)

(Ideally we'd merge the evaluation work into the main repository rather than keeping it separate, but we haven't had time to do that yet. That could be a useful part of a summer of code project looking at weighting schemes.)
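
To make "evaluation metrics" concrete, here is a minimal Python sketch of non-interpolated average precision for a single query, and its mean over several queries. This is not code from the xapian-evaluation module; the function names and the input format (a ranked list of document ids plus a collection of relevant ids) are assumptions for illustration.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Non-interpolated average precision for one query.

    Assumes ranked_doc_ids contains no duplicates."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_doc_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Relevant documents 1 and 5 are found at ranks 2 and 4.
ap = average_precision([3, 1, 7, 5], [1, 5])
print(ap)  # 0.5
```

Running the same queries through two weighting schemes and comparing the mean of this value is the kind of comparison the evaluation module is meant to support.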

> 2) There is another weighting scheme I would like to implement in Xapian: TF-ATO (Term Frequency - Average Term Occurrences), described in http://eprints.nottingham.ac.uk/31329/1/dls_ukci2014.pdf. It takes a discriminative approach that uses the document centroid as a threshold for normalising documents in the collection. The centroid is used to remove less significant weights from the documents, which helps to achieve higher retrieval effectiveness. The average improvement in precision (non-interpolated average precision) of TF-ATO over TF-IDF is 40-50%. With some extra work, this scheme also performs effectively for dynamic data collections.

Again, you'll need to propose how to fit TF-ATO into Xapian's database and weighting framework. Can the document centroid approach be done as a processing step during indexing? (Again, I haven't had time to read the paper yet.)
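
To show what a preprocessing version could look like, here is a minimal Python sketch. It rests on assumptions drawn from your description rather than from the paper: the weight of a term is its frequency divided by the average number of occurrences per unique term in the document, the centroid is the mean weight of each term over all documents, and only weights strictly above the centroid are kept. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def tf_ato(terms):
    """Assumed TF-ATO weight: term frequency divided by the average number of
    occurrences per unique term in the document."""
    tf = Counter(terms)
    if not tf:
        return {}
    ato = sum(tf.values()) / len(tf)
    return {term: count / ato for term, count in tf.items()}


def prune_by_centroid(docs):
    """Pass 1: weight every document and accumulate the centroid.
    Pass 2: keep only weights strictly above the centroid (assumed rule)."""
    weighted = [tf_ato(terms) for terms in docs]
    centroid = defaultdict(float)
    for weights in weighted:
        for term, w in weights.items():
            centroid[term] += w / len(docs)
    return [
        {term: w for term, w in weights.items() if w > centroid[term]}
        for weights in weighted
    ]


docs = ["a a a b".split(), "a b b c".split()]
pruned = prune_by_centroid(docs)
print(sorted(pruned[0]))  # ['a']
print(sorted(pruned[1]))  # ['b', 'c']
```

In this form the centroid depends on the whole collection, so it needs a pass over every document before any of them can be pruned. How that would work when documents are added to an existing database later is something the proposal would need to address.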

Best,
James

-- 
 James Aylett
 devfort.com — spacelog.org — tartarus.org/james/