<div dir="ltr"><div><div><div><div><div><div><div><div><div><div><div><div><div>Hello,<br><br></div>I've recently finished with an implementation of KMeans with two initialization techniques, random initialization and KMeans++. I would like to share my findings after evaluating the same.<br><br></div>I have tested this implementation of KMeans with a BBC news article dataset. I am currently working on evaluating the same with FIRE datasets. Currently, clustering more than 500 documents efficiently is taking time (a little more than 5 seconds) and I'm trying to optimize for time by finding methods for early convergence or simple code profiling.<br><br></div>The BBC news article dataset contains articles related to politics, sports, tech, entertainment and business and contains 2225 documents in all.<br><br></div>Random KMeans results:<br></div>When dealing with an MSet that contains around 200 - 300 documents, clustering is efficient and usually gives sub-optimal clustering solutions (average distance between document and centroid = 0.65 - 0.75 in cosine similarity). For MSet's containing more documents, the time seems to increase steadily.<br><br></div>KMeans++ results:<br></div>KMeans++ provides slightly superior results due to spreading the cluster centroids out. But this becomes a problem when the document vectors are not spread uniformly and concentrate in a few places, because the cluster sizes end up being out of proportion.<br></div>The average distance between document and centroid is lesser than that of KMeans mostly (around 0.55-0.7). Thus, in many cases, the clusters end up being more compact.<br><br></div>I'm continuing to optimize this solution in time before I can go ahead with PSO.<br><br></div><div>With respect to optimization, I'm currently using unordered_map where ever I'm requiring to map values. It certainly works faster than the std::map but there are various hashmap implementations that can work faster. Google dense hash map is one of those and they work way faster than unordered_map (to my knowledge). I thought of using them to speed value look ups, but would it affect the portability of the code? Or will it be possible for me to use it?<br><br></div>I would like to know what documentation I would need to write for the current clustering API so that I can start with that too,<br><br></div>Also, since the final week for GSoC is starting now, should I list the API that was merged into a different branch before as merged or as yet to merge? And incase I am not able to complete PSO by 23rd august since I am behind the timeline, would it be possible for me to continue it later on?<br><br></div><div>Thanks.<br><br></div>Regards,<br></div>Richhiey<br></div>