GSoC 2018: Diversification of Search Results

Uppinder Chugh uppinderchugh at
Thu Apr 26 19:53:15 BST 2018

Thanks for selecting my proposal for GSoC; I'm looking forward to
contributing further to Xapian. I posted this on IRC but didn't
receive a reply, so I presume it was missed and am posting it here
as well. As proposed, I plan to use the ClueWeb09 Category B
dataset for evaluating diversification. A hosted copy is available;
it may be accessed but requires a license. The license is free and is
granted to an organisation on applying online. If a maintainer could
have a look at this, that would be great. The website mentions that it
takes around 2 weeks to obtain the license and, as discussed in the
interview, I might evaluate the GLS-MPT implementation before moving
on to optimizations (C2-GLS).

On Sat, Mar 10, 2018 at 12:08 AM, Uppinder Chugh
<uppinderchugh at> wrote:
> Hi, I'd like to share my proposal for GSoC and get feedback on it.
> Thanks,
> Uppinder Chugh
> On Mon, Feb 26, 2018 at 2:14 AM, Uppinder Chugh <uppinderchugh at> wrote:
>> In particular, I have the following questions:
>> a) Is wrapping Xapian::Mset matcher::get_set(..) suitable in this scenario and with the API? Also, how can I let the user manually enable diversification while configuring their result set via the Matcher API?
>> b) Should I include the LC clustering algorithm in xapian-core/cluster (as there's the base class Cluster, which can be inherited from) or make it part of the diversification implementation?
>> c) Apart from the proposed methods, I'd be writing automated tests and examples, and documenting the new feature. Some tips here would be appreciated, as I've never written tests for code. Also, for documentation, I believe only getting-started-with-xapian needs to be updated with examples of using the new feature.
>> Apart from the above, if I'm missing something or didn't go into enough detail, please let me know. :)
