<div dir="ltr">And yes, the similarity measure used for document similarity is cosine similarity.<div>For the algorithm I proposed in the trailing mail, I have to implement a Euclidean distance similarity measure and tweak it to work well with the algorithm.</div><div><br></div><div>Waiting for your suggestions.</div></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Regards,<div>Nirmal Singhania</div><div>B.tech III Yr</div></div></div></div></div></div></div></div>
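For concreteness, here is a minimal sketch (plain Python, with function names of my own choosing, not xapian-core code) of cosine similarity and Euclidean distance over sparse term-weight vectors. One way to "tweak" a Euclidean measure to behave like cosine similarity is to L2-normalize the vectors first, since for unit-length vectors the two are related by d&sup2; = 2(1 &minus; cos):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Euclidean distance taken over the union of the two term sets."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

def normalize(v):
    """Scale a vector to unit length; then squared Euclidean distance
    equals 2 * (1 - cosine similarity)."""
    n = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / n for t, w in v.items()} if n else dict(v)
```

This identity means a clustering algorithm written in terms of Euclidean distance can reproduce cosine-based rankings exactly, provided the document vectors are normalized up front.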
<br><div class="gmail_quote">On Wed, Mar 9, 2016 at 10:27 AM, nirmal singhania <span dir="ltr"><<a href="mailto:nirmal.singhania@st.niituniversity.in" target="_blank">nirmal.singhania@st.niituniversity.in</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Hello all, I am Nirmal Singhania from NIIT University, India.</div><div>I am interested in the "Clustering of search results" topic.</div><div><br></div>
<div>I have been in the field of practical machine learning and information retrieval for quite some time.</div><div>I have taken various courses/MOOCs on information retrieval and text mining and have been working on real-life datasets (KDD99, AWID, MovieLens), because the problems you face in real-life ML/IR scenarios are different from what is taught in theory. I am also doing R&D on "Hybrid Techniques for Intrusion Detection using Data Mining and Clustering on Newer Datasets".</div><div><br></div>
<div>After taking an initial look at the docsim folder in xapian-core, these are my insights.</div><div>The clustering used is single-link agglomerative hierarchical clustering, whose time complexity is O(n^2) for n documents.</div><div>At first, k-means seems a viable alternative, since it has O(n) time complexity, but it has various shortcomings:</div><div>1) The algorithm requires a priori specification of the number of cluster centers.</div><div>2) Different initial partitions can result in different final clusters.</div><div>3) It does not work well with clusters of different sizes and densities.</div><div><br></div>
<div>After that, we can consider k-means++. The k-means++ algorithm addresses the first of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations, but it is somewhat slower due to that initialization step.</div><div>Then we can consider bisecting k-means, which improves on plain k-means; note, however, that bisecting k-means is a divisive hierarchical clustering algorithm.</div><div>It is a little faster than the original k-means, but its clustering results are poorer than those of hierarchical agglomerative clustering on various metrics of cluster quality such as entropy, F-measure, overall similarity, relative margin, and variance ratio.</div><div><br></div>
<div>Based on my research, I have in mind a clustering algorithm that can overcome the quality issues of k-means (and its variants) and the speed issues of hierarchical agglomerative clustering.</div><div>Theoretically it can run in O(n) time and produce better results than HAC on the metrics above.</div><div>I can't discuss it on the mailing list, but if you agree, we can discuss it and its implementation in Xapian further in a PM.</div><div><br></div><div>Thank you for your time.</div><div><br></div><br clear="all"><div><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr">Regards,<div>Nirmal Singhania</div><div>B.tech III Yr</div></div></div></div></div></div></div></div>
</div>
</blockquote></div><br></div>
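To make the k-means++ discussion in the quoted mail concrete, here is a minimal sketch of its seeding procedure (plain Python; the function name and signature are my own, not from xapian-core). The first center is drawn uniformly at random, and each subsequent center is drawn with probability proportional to its squared distance from the nearest center already chosen — this extra pass over the data per center is the initialization cost mentioned above:

```python
import random

def kmeans_pp_init(points, k, dist_sq, rng=random):
    """k-means++ seeding: return k initial centers chosen from `points`.

    `dist_sq(p, q)` must return the squared distance between two points.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(dist_sq(p, c) for c in centers) for p in points]
        total = sum(d2)
        if total == 0:  # every point coincides with some center
            centers.append(rng.choice(points))
            continue
        # Sample a point with probability proportional to d2.
        r = rng.random() * total
        acc = 0.0
        for p, d in zip(points, d2):
            acc += d
            if acc >= r:
                centers.append(p)
                break
        else:  # guard against floating-point underflow
            centers.append(points[-1])
    return centers
```

The seeding costs O(n·k) distance evaluations, which is the slowdown relative to uniform random initialization; in exchange, the expected clustering cost is provably bounded relative to the optimum, addressing shortcoming 2) above.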