<div dir="ltr"><div>Hello all, I am Nirmal Singhania from NIIT University, India.</div><div>I am interested in the "Clustering of search results" topic.</div><div><br></div><div>I have been working in practical machine learning and information retrieval for quite some time.</div><div>I have taken various courses/MOOCs on information retrieval and text mining and have been working on real-life datasets (KDD99, AWID, MovieLens), because the problems you face in real-life ML/IR scenarios are different from what is taught in theory.</div><div>I am also doing R&amp;D on "Hybrid Techniques for Intrusion Detection using Data Mining and Clustering on Newer Datasets".</div><div><br></div><div>I have taken an initial look at the docsim folder in xapian-core.</div><div>These are my insights:</div><div>The clustering used is single-link agglomerative hierarchical clustering.</div><div>Its time complexity is O(n^2), where n is the number of documents.</div><div>At first, choosing k-means seems to be a viable solution, as k-means runs in O(n) time for a fixed number of clusters and iterations.</div><div>But it has various shortcomings:</div><div><font color="#555544" face="arial"><span style="font-size:16px;line-height:18.2000007629395px">1) The learning algorithm requires a priori specification of the number of cluster centers.</span></font><br><font color="#555544" face="arial"><span style="font-size:16px;line-height:18.2000007629395px">2) Different initial partitions can result in different final clusters.</span></font><br></div><div><font color="#555544" face="arial"><span style="font-size:16px;line-height:18.2000007629395px">3) It does not 
work well with clusters of different sizes and different densities.</span></font></div><div>After that, we can think of k-means++.</div><div>The <i>k</i>-means++ algorithm addresses the second of these shortcomings by specifying a procedure to initialize the cluster centers before proceeding with the standard <i>k</i>-means optimization iterations.</div><div>But it is a little slower because of this initialization step.</div><div>Then we can think of bisecting k-means, which is better than k-means. However, the bisecting k-means algorithm is a divisive hierarchical clustering
algorithm.</div><div>It is a little faster than the original k-means, but its clustering results are poorer than those of hierarchical agglomerative clustering (HAC), based on various cluster-quality metrics such as entropy, F-measure, overall similarity, relative margin, and variance ratio.</div><div><br></div><div>Based on my research over some time, I have in mind a clustering algorithm that can overcome the quality issues of k-means (and its variants) and the speed issues of hierarchical agglomerative clustering.</div><div>Theoretically it can run in O(n) time and can produce better results than HAC on various metrics.</div><div>I can't discuss it on the mailing list, but if you agree, we can discuss it and its implementation in Xapian further by private message.</div><div><br></div><div>Thank you for your time.</div><div><br></div><br clear="all"><div><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Regards,<div>Nirmal Singhania</div><div>B.tech III Yr</div></div></div></div></div></div></div></div>
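<div><br></div><div>P.S. To make the k-means++ initialization mentioned above concrete, here is a minimal sketch of the seeding step in plain Python. It is only an illustration, not Xapian code: the function names <code>squared_distance</code> and <code>kmeanspp_seeds</code> and the toy 2-D points are hypothetical, and documents are assumed to be already represented as equal-length numeric vectors with squared Euclidean distance as the dissimilarity.</div>

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers from points with k-means++ seeding.

    Hypothetical helper for illustration only; not part of Xapian.
    """
    rng = rng or random.Random()
    if k not in range(1, len(points) + 1):
        raise ValueError("k must be between 1 and len(points)")

    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) != k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center: fall back to uniform.
            nxt = rng.choice(points)
        else:
            # Next center: sampled with probability proportional to d2,
            # so points far from all current centers are preferred.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        # Update nearest-center distances with the new center.
        d2 = [min(old, squared_distance(p, nxt)) for old, p in zip(d2, points)]
    return centers


if __name__ == "__main__":
    # Toy data standing in for document vectors (assumption).
    docs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
    print(kmeanspp_seeds(docs, 3, random.Random(42)))
```

<div>Each new seed needs one pass over all n points, so seeding costs roughly O(n*k) distance computations on top of the usual k-means iterations. That is the extra initialization overhead mentioned above, and the number of clusters k still has to be given in advance.</div>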
</div>