GSOC-2016 Project : Clustering of search results

Sun Mar 6 20:06:43 GMT 2016

On Sun, Mar 6, 2016 at 7:17 AM, James Aylett <james-xapian at tartarus.org
<javascript:_e(%7B%7D,'cvml','james-xapian at tartarus.org');>> wrote:

> On Sat, Mar 05, 2016 at 10:58:43PM +0530, Richhiey Thomas wrote:
>
> K-Means or something related certainly seems like a viable approach,
> so what you'll need to do is to come up with a proposal of how you'd
> implement this in Xapian (either with reference to the previous work,
> or separately), and also how you'd go about evaluating the performance
> of your implementation (both in terms of usefulness of the clustering,
> and in terms of speed!).
>
>
>  Sorry about sending this msg twice. I made a mistake in the last mail and
needed to make sure this one will be read.

Thanks for the reply James!
I went through the code in a little more detail and there are a few things
I noticed and a few questions I have.

First off, the distance metric used in the current implementation is the
cosine measure. Though useful, K-means implicitly uses Euclidian distance
as a measure of document similarity between two document term vectors.
Hence, simply creating one more class for a distance metric by just
inheriting the DocSim base class will be good. Using the tf-idf weights, we
can find term weights and instead of using these vectors for cosine
similarity, euclid distance can be found out.

With a similarity measure in place, we can initialize the k centroids using
k-means++, an algorithm used for choosing the initial centroids in k-means,
to avoid poor clustering results. The distance between document vectors and
centroids can be found out and documents are added to clusters accordingly,
identified by their doc-id's. The new centroid is again found and this
process will continue till convergence.

https://en.wikipedia.org/wiki/K-means%2B%2B

I am slightly unaware about performance evaluation but the cluster quality
can be evaluated through F-measure and I guess we can check the running
times of both the implementations to check for usefulness in terms of
speed.

My questions are:
1) Can you direct me on how to convert this raw idea into a proposal in
context to Xapian with more detail? What areas do I focus on?
2) It would be great if you could elaborate a little on the performance
evaluation part that I haven't been able to follow too well.

Thanks! :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20160307/a5ce6299/attachment.html>