GSOC-2016 Project : Clustering of search results

Richhiey Thomas richhiey.thomas at
Sun Mar 6 19:53:41 GMT 2016

> On Sun, Mar 6, 2016 at 7:17 AM, James Aylett <james-xapian at>
> wrote:
>> On Sat, Mar 05, 2016 at 10:58:43PM +0530, Richhiey Thomas wrote:
>> K-Means or something related certainly seems like a viable approach,
>> so what you'll need to do is to come up with a proposal of how you'd
>> implement this in Xapian (either with reference to the previous work,
>> or separately), and also how you'd go about evaluating the performance
>> of your implementation (both in terms of usefulness of the clustering,
>> and in terms of speed!).
> Thanks for the reply James!
> I went through the code in a little more detail and there are a few things
> I noticed and a few questions I have.
> First off, the distance metric used in the current implementation is the
> cosine measure. Though useful, K-means implicitly uses Euclidean distance
> as the measure of distance between two document term vectors.
> Hence, adding one more class for a distance metric by inheriting from
> the DocSim base class should be enough. Using tf-idf, we can compute
> term weights and, instead of using these vectors for cosine similarity,
> compute the Euclidean distance between them.
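
To make the difference between the two measures concrete, here is a minimal sketch in plain Python. It does not use Xapian or the DocSim class; document vectors are assumed to be sparse dicts mapping a term to its tf-idf weight, and the names `cosine_similarity`, `euclidean_distance` and `normalize` are hypothetical helpers chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Euclidean distance over the union of terms in both vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def normalize(v):
    """Scale a vector to unit length (returns a copy if it is all zeros)."""
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()} if norm else dict(v)


# Made-up tf-idf weights, purely for illustration.
doc_a = {"cluster": 3.0, "search": 4.0}
doc_b = {"search": 4.0, "index": 3.0}

cos = cosine_similarity(doc_a, doc_b)
dist = euclidean_distance(doc_a, doc_b)
unit_dist = euclidean_distance(normalize(doc_a), normalize(doc_b))
```

One point worth noting: for unit-length vectors the squared Euclidean distance equals 2 * (1 - cosine similarity), so on normalized vectors the two measures rank document pairs identically. They only disagree when vector lengths differ.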
> With a similarity measure in place, we can initialize the k centroids
> using k-means++, an algorithm for choosing the initial centroids in
> k-means that helps avoid poor clustering results. The distance between
> each document vector and each centroid is then computed, and each
> document (identified by its doc-id) is assigned to the cluster with the
> nearest centroid. The centroids are then recomputed, and this process
> repeats until convergence.
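
The steps described above can be sketched roughly as follows. This is a standalone illustration, not the Xapian implementation: points are assumed to be small dense lists of floats rather than sparse term vectors, convergence is taken to mean "assignments stopped changing", and the names `sq_dist`, `kmeanspp_init` and `k_means` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeanspp_init(points, k, rng):
    """k-means++ seeding: first centroid uniformly at random, each further
    one with probability proportional to the squared distance to the
    nearest centroid chosen so far."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(d2) == 0.0:
            # Every point coincides with a chosen centroid; pick any point.
            centroids.append(list(rng.choice(points)))
        else:
            centroids.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centroids


def k_means(points, k, max_iter=100, seed=0):
    """Assign/recompute iterations until assignments stop changing."""
    rng = random.Random(seed)
    centroids = kmeanspp_init(points, k, rng)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Two well-separated groups of made-up 2-D points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = k_means(points, k=2)
```

In this sketch the position in `points` plays the role of the doc-id: `labels[i]` is the cluster of the i-th document. Which cluster gets label 0 depends on the random seeding, so only the grouping is meaningful, not the label values.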
> I am not very familiar with performance evaluation, but cluster quality
> can be evaluated using the F-measure, and I guess we can compare the
> running times of both implementations to assess usefulness in terms of
> speed.
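
As a rough illustration of both kinds of evaluation, the sketch below computes one common form of the clustering F-measure and times a function call. It assumes a test collection where each document has a known reference class, which the message above does not specify. For each class it takes the best F1 score over all clusters and averages these weighted by class size. The names `clustering_f_measure` and `time_call` are hypothetical.

```python
import time
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-size-weighted average of the best per-class F1 over clusters."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j  # share of the cluster belonging to cls
            recall = n_ij / n_i     # share of cls captured by the cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total


def time_call(fn, *args, repeats=5):
    """Best wall-clock time in seconds over several runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Made-up reference classes and cluster assignments for six documents.
truth = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
score = clustering_f_measure(truth, clusters)
elapsed = time_call(clustering_f_measure, truth, clusters)
```

A score of 1.0 means every class is reproduced exactly by some cluster. Running `time_call` on both clustering implementations over the same document set would give a like-for-like speed comparison.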
> My questions are:
> 1) Can you advise me on how to turn this raw idea into a more detailed
> proposal in the context of Xapian? What areas should I focus on?
> 2) It would be great if you could elaborate a little on the performance
> evaluation part that I haven't been able to follow too well.
> Thanks! :)

More information about the Xapian-devel mailing list