<div dir="ltr"><div style="font-size:12.8000001907349px">I was not sharing it on maling list because i thought that someone can use all ideas i proposed in their GSOC proposal.</div><div style="font-size:12.8000001907349px"><br></div><div style="font-size:12.8000001907349px">Surely i will contribute to xapian project.</div><div style="font-size:12.8000001907349px">sorry if that was against the rules</div><div style="font-size:12.8000001907349px">The algorithm is not developed by me but after having much research on various clustering techniques.</div><div style="font-size:12.8000001907349px">I found that there is a new algorithm called CLUBS(Clustering Using Binary Splitting) which gives better results than kmeans++ and hierarchical agglomerative clustering.</div><div style="font-size:12.8000001907349px">It is faster and produces good results based on various metrics of cluster quality.</div><div style="font-size:12.8000001907349px"><br></div><div style="font-size:12.8000001907349px"><br></div><div style=""><div style="font-size:12.8000001907349px">the algorithm works in following way</div><div style="font-size:12.8000001907349px">The first phase of the algorithm is</div><div style="font-size:12.8000001907349px">divisive, as the original data set(in this case, set of search documents to cluster) is split recursively into miniclusters through successive</div><div style="font-size:12.8000001907349px">binary splits: the algorithm’s second phase is agglomerative since these miniclusters</div><div style="font-size:12.8000001907349px">are recombined into the final result. Further, the approach induces during execution a dynamic</div><div style="font-size:12.8000001907349px">hierarchical grid that will better fit the dataset w.r.t. classical grid approaches that exploit</div><div style="font-size:12.8000001907349px">a fixed grid instead. Finally, the algorithm exploits the analytical properties of the</div><div style="font-size:12.8000001907349px">Quadratic Sums of Squares (SSQ in the following) function to minimize the cost of</div><div style="font-size:12.8000001907349px">merge and split operations.</div><div style="font-size:12.8000001907349px"><br></div><div style="font-size:12.8000001907349px">you can refer the paper for more details about time complexity and performance</div><div style=""><span style="font-size:12.8000001907349px"><a href="http://web.cs.ucla.edu/~zaniolo/papers/chp%253A10.1007%252F978-3-642-37456-2_10.pdf">http://web.cs.ucla.edu/~zaniolo/papers/chp%253A10.1007%252F978-3-642-37456-2_10.pdf</a></span><br></div><div style=""><span style="font-size:12.8000001907349px"><br></span></div><div style=""><span style="font-size:12.8000001907349px">for implementing it,we can use Documentsource class in our previous clustering approach and create a binary tree</span></div><div style=""><span style="font-size:12.8000001907349px">and perform and topdownsplitting and then bottomup merging.</span></div><div style=""><span style="font-size:12.8000001907349px">First we have to implement feature extraction from text document(TFIDF would be a good choice) which is implemented in xapian weighting schemes.</span></div><div style=""><span style="font-size:12.8000001907349px">Then we will implement function to compute distances between documents based on normalized TF-IDF Matrix.</span></div><div style=""><span style="font-size:12.8000001907349px">Based on distances we will initially assign cluster and improve on it using </span><span style="font-size:12.8000001907349px">topdownsplitting and then bottomup merging.</span></div><div style=""><span style="font-size:12.8000001907349px"><br></span></div><div style=""><span style="font-size:12.8000001907349px">We will compute SSQ of a cluster from all points in a cluster. and then Improve our clustering based on SSQ measures.</span></div><div style=""><span style="font-size:12.8000001907349px"><br></span></div><div style=""><span style="font-size:12.8000001907349px">The functions for topdownsplitting  will include </span>computeAverageDeltaSSQ to average the SSQ of all points in cluster.</div><div style="">The functions for bottomupmerging will include computeweightedSSQ.</div><div style="">.</div><div style=""><br></div><div style="">Please Give your valuable suggestions.</div></div><div style="font-size:12.8000001907349px">Have a Nice Day.</div><div style="font-size:12.8000001907349px"><br></div><div style="font-size:12.8000001907349px"><br></div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Regards,<div>Nirmal Singhania</div><div>B.tech III Yr</div></div></div></div></div></div></div></div>

<br><div class="gmail_quote">On Wed, Mar 9, 2016 at 10:33 PM, James Aylett <span dir="ltr"><<a href="mailto:james-xapian@tartarus.org" target="_blank">james-xapian@tartarus.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Wed, Mar 09, 2016 at 10:27:38AM +0530, nirmal singhania wrote:<br>

<br>

> based on my some time of Research,I have in mind a clustering algorithm<br>

> that can overcome Quality issues of K-means(and its variants) and Speed<br>

> Issues of Hierarchical Agglomerative Clustering.<br>

> Theoretically it can work O(n) and Can produce results better than HAC<br>

> based on various metrics.<br>

> I can't discuss it on mailing-list but you say we can discuss more about it<br>

> and its implementation in xapian in PM.<br>

<br>

</span>Hi, Nirmal -- welcome to Xapian! What do you mean when you say that<br>

you can't discuss it on the mailing list? If you can't discuss it in<br>

our public forums, then you aren't going to be able to contribute that<br>

work to Xapian, are you? (Since we'll need it documented clearly,<br>

which will likely include useful details of that discussion.)<br>

<span class="HOEnZb"><font color="#888888"><br>

J<br>

<br>

--<br>

  James Aylett, occasional trouble-maker<br>

  <a href="http://xapian.org" rel="noreferrer" target="_blank">xapian.org</a><br>

</font></span></blockquote></div><br></div>