<div dir="ltr">Hello,<div><span style="color:rgb(0,0,0);white-space:pre-wrap"><br></span></div><div><span style="color:rgb(0,0,0);white-space:pre-wrap">I am Uppinder Chugh (irc nick: icebyte), a senior year undergraduate student majoring in Computer Science and Engineering at Indian Institute of Technology, Guwahati. I'm interested to work on the idea of adding the functionality of search result diversification to Xapian. After having brief conversations with mentors on IRC,  I would like to compile the discussions and further discuss the approach I have in mind for implementing the same. First I'd like to give a brief justification of choosing the proposed method:</span><br></div><div><br></div><div><font color="#000000"><span style="white-space:pre-wrap">    In the literature [1], diversification of search results can be broadly classified into two methods: Implicit (uses document features) and Explicit (uses query based features such as query logs/click through logs).  The current state-of-the-art in terms of effectiveness and efficiency are explicit methods, but with major drawbacks:</span></font></div><div><br></div><div>    a) Requires external source of data (other than documents) such as query logs/click logs.</div><div>    b) Doesn't work well on rare queries (mainly due to lack of such data in the external source).</div><div><br></div><div>    To make diversification in search results more widely usable and have it work out of the box, it is desirable to implement an implicit method. After going through some papers I found [2] to be suitable for our need. It mentions an efficient implicit method that runs under 100ms and has effectiveness close to that of the best explicit methods. I'm briefly presenting the details of the methods that I propose to implement:</div><div><div style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration-style:initial;text-decoration-color:initial"><span style="color:rgb(0,0,0);white-space:pre-wrap"><br class="gmail-Apple-interchange-newline">1)  Implementing GLS-MPT:</span></div><div style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration-style:initial;text-decoration-color:initial"><span style="color:rgb(0,0,0);white-space:pre-wrap">    First I'll be implementing the GLS algorithm (details on page 6 of [2]) with objective function based on MPT (details on page 13 of [2]). According to the paper, avg. latency for the GLS-MPT is ~800ms (table V of [2]), which involves preprocessing (storing pairwise distance of top N documents, N = 100) and the actual diversification which finally returns top-k (k = 20) diversified results. Note: k = 20 and N = 100 are suitable as usually users don't page too far and even if they do, it's suitable to let the results tail off in relevance.</span></div><div style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration-style:initial;text-decoration-color:initial"><span style="color:rgb(0,0,0);white-space:pre-wrap">    The above procedure can be included in the pipeline of producing the final results by wrapping Xapian::Mset Matcher::get_set(..).  The algorithm requires top-N document vectors and their relevant scores, which are readily available. This gives a complete diversification implementation, although it being quite slow.</span></div><div style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration-style:initial;text-decoration-color:initial"><span style="color:rgb(0,0,0);white-space:pre-wrap"><br></span></div><div style="text-align:start;text-indent:0px;text-decoration-style:initial;text-decoration-color:initial"><font color="#000000"><span style="white-space:pre-wrap">2) Optimising GLS-MPT: </span></font></div></div><div style="text-align:start;text-indent:0px;text-decoration-style:initial;text-decoration-color:initial"><font color="#000000"><span style="white-space:pre-wrap">    One of the main contributions of the paper is optimising GLS-MPT. It involves clustering the top N documents and storing the distance between each document in N and each cluster (there are k clusters). They term this method C2-GLS </span></font><span style="color:rgb(0,0,0);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-transform:none;white-space:pre-wrap;word-spacing:0px;text-align:start;text-indent:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">(details on page 7 and 8 of [2]). </span><font color="#000000"><span style="white-space:pre-wrap">The clustering method used is LC [3]. The avg. latency reduces to ~20-40 ms (table V of [2]), while improving the effectiveness.</span></font></div><div style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;text-decoration-style:initial;text-decoration-color:initial"><span style="color:rgb(0,0,0);white-space:pre-wrap">    This involves modifying the code of 1) and implementing the LC clustering algorithm. LC can be added as an algorithm in xapian-core/cluster.</span></div><div><br></div><div>3) Evaluation:</div><div>    The evaluation metric widely used in literature are alpha-NDCG [4] and ERR-IA [5]. These are also among the metrics used by the main paper [2], thus allowing for direct comparison with the author's results. The data set corpus used in the literature the most is ClueWeb09 with queries from TREC 2009/2010 diversification task. The problem is that both these are not freely available. For implementation, I'd like to implement alpha-NDCG and if time persists then ERR-IA. </div><div><br></div><div>Please let me know what you think, any kind of feedback is appreciated.     </div><div><span style="color:rgb(0,0,0);white-space:pre-wrap"><br></span></div><div><span style="color:rgb(0,0,0);white-space:pre-wrap"><br></span></div><div><span style="color:rgb(0,0,0);white-space:pre-wrap">[1] Search Result Diversification Santos et al. 2015 (Survey Paper)</span></div><div><span style="color:rgb(0,0,0);white-space:pre-wrap">[2] Scalable and Efficient Web Search Result Diversification Naini et al. 2016<br></span></div><div><span style="color:rgb(0,0,0);white-space:pre-wrap">[3] Modelling efficient novelty-based search result diversification in metric spaces Gil-Costa et al. 2013<br></span></div><div><span style="color:rgb(0,0,0);white-space:pre-wrap">[4] Novelty and Diversity in Information Retrieval Evaluation Clarke et al. 2008 </span></div><div><span style="color:rgb(0,0,0);white-space:pre-wrap">[5] Intent-based diversification of web search results: Metrics and algorithms. Chapelle et al. 2011b</span></div><div><span style="color:rgb(0,0,0);white-space:pre-wrap"><br></span></div><div><br></div></div>