GSOC-2016 Project : Clustering of search results

Sat Mar 5 17:28:43 GMT 2016

Hello devs,

I am Richhiey Thomas, a third-year undergraduate student in Computer
Science at Mumbai University. I went through the project list for this
year, and the project idea based on clustering caught my attention. I spoke
to Assem Chelli on IRC, who pointed me to the code and got me started.

I started going through the code and have successfully built Xapian on my
machine. If I am not mistaken, the existing clustering branch uses
hierarchical clustering, which has quadratic or higher complexity and is
therefore very inefficient for large data sets.

A better approach would be to use K-means clustering or a variant (such as
bisecting K-means), which can provide equal or better performance than the
implemented hierarchical technique at lower cost (linear in the number of
documents per iteration), making it better suited to large data sets.
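
As a rough illustration of the idea, here is a minimal K-means sketch in
Python. It is not Xapian code: it assumes the documents have already been
turned into dense numeric vectors, and the function name kmeans, the seeding
with the first k points, and the toy vectors are all assumptions made for
this example.

```python
from math import dist


def kmeans(points, k, iterations=10):
    """Plain Lloyd-style K-means; returns (centroids, labels)."""
    # Assumption: seed with the first k points so the run is deterministic.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        # This costs O(n * k) distance computations per iteration.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, labels


# Toy "document vectors" (hypothetical data, two obvious groups).
docs = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]
centroids, labels = kmeans(docs, k=2)
```

Each iteration touches every point once per centroid, which is where the
linear growth in the number of documents comes from; hierarchical
clustering, by contrast, typically compares pairs of documents.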

It would be great to have your opinions on how to go about this, or on any
other approaches that may be better suited to document clustering in the
context of Xapian.

Thanks! :)