GSOC-2016 Project : Clustering of search results

Sun Mar 6 01:47:08 GMT 2016

On Sat, Mar 05, 2016 at 10:58:43PM +0530, Richhiey Thomas wrote:

> I am Richhiey Thomas, in my third year of undergraduate studies in
> Computer Science at Mumbai University. I went through the project
> list for this year and the project idea based on clustering caught my
> attention. I spoke to Assem Chelli on IRC, who guided me to the code and
> got me started.

Hi Richhiey! That's great, and it's good that IRC is working for
you. (People at some universities occasionally have trouble getting
onto IRC, unfortunately.)

> I started going through the code and have successfully built Xapian on my
> machine. If I am not mistaken, the existing clustering branch uses
> hierarchical clustering, which has quadratic or higher complexity and is
> therefore very inefficient for large data sets.
> 
> A better approach would be to use K-means clustering or a variant (like
> bisecting K-means), which provides equal or better performance than the
> implemented hierarchical technique at lower (linear) complexity, and so
> is better suited to large data sets.
> 
> It would be great to have your opinions on how to go about this or any
> other approaches that may be more favorable for document clustering in the
> context of Xapian.

K-Means or something related certainly seems like a viable approach,
so what you'll need to do is to come up with a proposal of how you'd
implement this in Xapian (either with reference to the previous work,
or separately), and also how you'd go about evaluating the performance
of your implementation (both in terms of usefulness of the clustering,
and in terms of speed!).
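
As a rough illustration of the two things a proposal would need to cover
(a K-means style clusterer, and a way to measure both quality and speed),
here is a minimal Python sketch. It is not Xapian code and does not use
the Xapian API: the names kmeans, purity and cosine_distance, the toy
term-weight vectors, the hand-assigned topic labels, and the fixed
evenly-spaced seeding are all assumptions made for the example.

```python
import math
import time
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(vectors, k, max_iterations=20):
    """Cluster vectors into k groups and return one label per vector.

    Assumption for this sketch: centroids are seeded from k evenly
    spaced input vectors so the result is deterministic. Real
    implementations usually seed randomly or with k-means++.
    """
    n = len(vectors)
    centroids = [list(vectors[i * n // k]) for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # no vector changed cluster, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(labels, truth):
    """Fraction of items that share their cluster's majority true class."""
    by_cluster = {}
    for lab, t in zip(labels, truth):
        by_cluster.setdefault(lab, Counter())[t] += 1
    majority_total = sum(max(c.values()) for c in by_cluster.values())
    return majority_total / len(labels)


# Hypothetical term-weight vectors for four documents over three terms.
docs = [[3, 0, 0], [2, 1, 0], [0, 0, 3], [0, 1, 2]]
truth = ["a", "a", "b", "b"]  # hand-assigned topics, for evaluation only

start = time.perf_counter()
labels = kmeans(docs, k=2)
elapsed = time.perf_counter() - start

print(labels)                 # [0, 0, 1, 1]
print(purity(labels, truth))  # 1.0
print(f"clustered {len(docs)} documents in {elapsed:.6f}s")
```

Purity needs labelled data, so a real evaluation would also have to say
where the labels come from, and timings only mean something on a
realistically sized set of documents.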

Hopefully that's enough to get you started, but please do ask any
questions you have while working through this.

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org