<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><div class="gmail_quote"><span class="">On Sun, Mar 6, 2016 at 7:17 AM, James Aylett <span dir="ltr">&lt;<a href="mailto:james-xapian@tartarus.org" target="_blank">james-xapian@tartarus.org</a>&gt;</span> wrote:<br></span><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class=""><span>On Sat, Mar 05, 2016 at 10:58:43PM +0530, Richhiey Thomas wrote:<br>
</span><span><br>
</span></span><span class="">K-Means or something related certainly seems like a viable approach,<br>
so what you'll need to do is to come up with a proposal of how you'd<br>
implement this in Xapian (either with reference to the previous work,<br>
or separately), and also how you'd go about evaluating the performance<br>
of your implementation (both in terms of usefulness of the clustering,<br>
and in terms of speed!).<span><font color="#888888"><br>
<br></font></span></span></blockquote><div>Thanks for the reply, James!<br></div><div>I went through the code in a little more detail, and there are a few things I noticed and a few questions I have.<br><br></div><div>First off, the distance metric used in the current implementation is the cosine measure. Though useful, K-means implicitly uses Euclidean distance as its measure of how far apart two document term vectors are. So adding one more distance-metric class that inherits from the DocSim base class should be enough. Using tf-idf term weights, we can build the document vectors and compute the Euclidean distance between them instead of the cosine similarity.<br><br></div><div>With a distance measure in place, we can initialize the k centroids using k-means++, an algorithm for choosing the initial centroids in k-means that helps avoid poor clustering results. We then compute the distance between each document vector and each centroid, and add each document (identified by its doc ID) to the cluster of its nearest centroid. The centroids are then recomputed, and the process repeats until convergence.<br></div><div><br><a href="https://en.wikipedia.org/wiki/K-means%2B%2B" target="_blank">https://en.wikipedia.org/wiki/K-means%2B%2B</a><br><br></div><div>I'm not very familiar with performance evaluation, but cluster quality can be evaluated using the F-measure, and I guess we can compare the running times of the two implementations to judge usefulness in terms of speed.<br><br></div><div>My questions are:<br></div><div>1) Can you point me to how I should turn this raw idea into a more detailed proposal in the context of Xapian? What areas should I focus on?<br></div><div>2) It would be great if you could elaborate a little on the performance evaluation part, which I haven't been able to follow too well.<br><br></div><div>Thanks! :)<br></div><div><br><br></div></div></div></div>
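<div>To make the idea above a bit more concrete, here is a rough Python sketch of k-means++ seeding followed by the usual assign/update loop with Euclidean distance. This is not Xapian code: the function names (euclidean, kmeanspp_init, kmeans) and the toy vectors are made up for illustration, plain lists stand in for tf-idf term vectors, and it assumes there are at least k distinct documents.<br><br></div>
<pre>
```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeanspp_init(vectors, k, rng):
    """Pick k initial centroids with k-means++ seeding."""
    # First centroid: chosen uniformly at random.
    centroids = [list(rng.choice(vectors))]
    for _ in range(k - 1):
        # Squared distance from each vector to its nearest chosen centroid.
        d2 = [min(euclidean(v, c) ** 2 for c in centroids) for v in vectors]
        # Next centroid: sampled with probability proportional to d2.
        # Assumption: at least k distinct vectors, so d2 is not all zeros.
        centroids.append(list(rng.choices(vectors, weights=d2, k=1)[0]))
    return centroids


def kmeans(doc_vectors, k, max_iter=100, seed=0):
    """Cluster a {doc_id: vector} mapping; returns {doc_id: cluster index}."""
    rng = random.Random(seed)
    doc_ids = list(doc_vectors)
    vectors = [doc_vectors[d] for d in doc_ids]
    centroids = kmeanspp_init(vectors, k, rng)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every document.
        new_assignment = [
            min(range(k), key=lambda j: euclidean(v, centroids[j]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return dict(zip(doc_ids, assignment))


# Toy stand-ins for tf-idf vectors, keyed by hypothetical doc IDs.
docs = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.0, 1.0],
    4: [0.1, 0.9],
}
clusters = kmeans(docs, k=2)
```
</pre>
<div>On these four toy vectors, documents 1 and 2 should end up in one cluster and documents 3 and 4 in the other. A real implementation would work on sparse term vectors and stop on a tolerance rather than on an unchanged assignment.<br><br></div>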
</blockquote></div><br></div></div>