<p dir="ltr">Hey Parth,</p>
<p dir="ltr">Thanks for the reply.<br>
I am considering implementing a cosine distance metric alongside Euclidean distance, because of the dimensionality issue that comes up when K-Means is used with a Euclidean distance metric.<br>
That does help when we deal with sparse vectors for documents. The particular problem I'm having is representing centroids in an efficient way.<br>
For example, when we compute the mean vector of a cluster, the resulting centroid need not be the document vector of any document in that cluster. The centroid will be dense, so representing it as a C++ map is inefficient because of the number of terms associated with it, and calculating distances against it doesn't scale well.<br>
On top of that, my distance calculation works on two documents. Will I need to modify it to accommodate arbitrary vectors that do not correspond to document vectors?<br>
It would be great if everyone could add their input on this.</p>
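<p dir="ltr">To make the question concrete, here is a rough sketch of the kind of thing I have in mind. Everything in it is hypothetical and not the existing API: the names <code>SparseDoc</code>, <code>DenseCentroid</code>, <code>mean_centroid</code>, <code>l2_norm</code> and <code>cosine_distance</code> are made up, and it assumes terms have already been mapped to integer ids in the range 0 to vocab_size - 1. Documents stay sparse, the centroid is a plain dense array, and the dot product only visits the document's non-zero terms.</p>

```cpp
#include <cmath>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical types, for illustration only.
// Assumption: every term has an integer id in [0, vocab_size).
using SparseDoc = std::unordered_map<std::size_t, double>;  // term id -> weight
using DenseCentroid = std::vector<double>;                  // index = term id

// Mean vector of a cluster, stored densely. Term ids outside the
// vocabulary are ignored.
DenseCentroid mean_centroid(const std::vector<SparseDoc>& docs,
                            std::size_t vocab_size) {
    DenseCentroid centroid(vocab_size, 0.0);
    if (docs.empty()) return centroid;
    for (const SparseDoc& doc : docs) {
        for (const auto& entry : doc) {
            if (entry.first < vocab_size) centroid[entry.first] += entry.second;
        }
    }
    for (double& value : centroid) value /= static_cast<double>(docs.size());
    return centroid;
}

// Euclidean length of the centroid; it can be computed once per
// iteration and reused for every document.
double l2_norm(const DenseCentroid& centroid) {
    double sum_sq = 0.0;
    for (double value : centroid) sum_sq += value * value;
    return std::sqrt(sum_sq);
}

// Cosine distance (1 - cosine similarity) between a sparse document and
// a dense centroid. Only the document's non-zero terms are visited.
// Returns 1.0 when either vector has zero length.
double cosine_distance(const SparseDoc& doc, const DenseCentroid& centroid,
                       double centroid_norm) {
    double dot = 0.0;
    double doc_sq = 0.0;
    for (const auto& entry : doc) {
        if (entry.first < centroid.size()) dot += entry.second * centroid[entry.first];
        doc_sq += entry.second * entry.second;
    }
    const double denom = std::sqrt(doc_sq) * centroid_norm;
    return denom == 0.0 ? 1.0 : 1.0 - dot / denom;
}
```

<p dir="ltr">With something like this, the per-document cost depends on the number of terms in the document rather than on the vocabulary size, but the distance function has to accept a vector that is not a document, which is the change I was asking about.</p>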