<p dir="ltr">Hey Parth,</p>

<p dir="ltr">Thanks for the reply.<br>

I am considering implementing a cosine distance metric alongside Euclidean distance, because of the dimensionality issue that arises when K-Means is used with the Euclidean metric.<br>

Cosine distance does help when we deal with sparse document vectors. The particular problem I'm having is representing centroids efficiently.<br>

For example, when we find the mean vector of a cluster, the resulting centroid need not be the vector of any document belonging to that cluster. The centroid will be dense, so representing it as a C++ map is inefficient because of the number of terms associated with it, and calculating distances against it doesn't work or scale too well.<br>

On top of that, my distance calculation currently works on two documents. Will I need to modify it to accommodate arbitrary vectors that might not represent documents?<br>

It would be great if everyone could add their input on this.</p>
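
<p dir="ltr">To make the question concrete, here is a minimal sketch of one possible direction, not a worked-out design. It assumes terms can be mapped to integer ids (not something the current code necessarily does), keeps documents sparse, stores the centroid as a dense array with a cached squared norm, and computes distances by iterating only over the document's non-zero terms. All type and function names (<code>SparseDoc</code>, <code>Centroid</code>, <code>mean_centroid</code>, and so on) are hypothetical.</p>

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical types for illustration only.
// A document is sparse: (term id, weight) pairs for its non-zero terms.
typedef std::vector<std::pair<std::size_t, double> > SparseDoc;

// A centroid is dense: one weight per term id, plus a cached squared norm.
struct Centroid {
    std::vector<double> weights;
    double norm_sq;
    Centroid() : norm_sq(0.0) {}
};

// Dot product that touches only the document's non-zero terms.
double dot(const SparseDoc& doc, const Centroid& c) {
    double sum = 0.0;
    for (std::size_t i = 0; i < doc.size(); ++i) {
        if (doc[i].first < c.weights.size())
            sum += doc[i].second * c.weights[doc[i].first];
    }
    return sum;
}

double doc_norm_sq(const SparseDoc& doc) {
    double sum = 0.0;
    for (std::size_t i = 0; i < doc.size(); ++i)
        sum += doc[i].second * doc[i].second;
    return sum;
}

// Uses ||d - c||^2 = ||d||^2 - 2 (d . c) + ||c||^2, so the cost depends on
// the number of non-zero terms in the document, not on the vocabulary size.
double squared_euclidean(const SparseDoc& doc, const Centroid& c) {
    return doc_norm_sq(doc) - 2.0 * dot(doc, c) + c.norm_sq;
}

// Cosine distance = 1 - cosine similarity; defined as 1.0 for a zero vector.
double cosine_distance(const SparseDoc& doc, const Centroid& c) {
    double denom = std::sqrt(doc_norm_sq(doc)) * std::sqrt(c.norm_sq);
    if (denom == 0.0) return 1.0;
    return 1.0 - dot(doc, c) / denom;
}

// Mean vector of a cluster, accumulated into a dense array.
// vocab_size is assumed to be known in advance.
Centroid mean_centroid(const std::vector<SparseDoc>& docs,
                       std::size_t vocab_size) {
    Centroid c;
    c.weights.assign(vocab_size, 0.0);
    if (docs.empty()) return c;
    for (std::size_t d = 0; d < docs.size(); ++d)
        for (std::size_t i = 0; i < docs[d].size(); ++i)
            if (docs[d][i].first < vocab_size)
                c.weights[docs[d][i].first] += docs[d][i].second;
    for (std::size_t t = 0; t < vocab_size; ++t) {
        c.weights[t] /= static_cast<double>(docs.size());
        c.norm_sq += c.weights[t] * c.weights[t];
    }
    return c;
}
```

<p dir="ltr">With something like this, the distance function would take a document and a centroid rather than two documents, so the existing document-to-document code could stay as it is. Whether a dense array per centroid is affordable depends on the vocabulary size and the number of clusters, which I haven't measured.</p>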