<p dir="ltr">Hey Parth,</p>

<p dir="ltr">Thanks for the reply.<br>

I am considering implementing a cosine distance metric because of the dimensionality problems that arise when K-Means is used with the Euclidean distance metric.</p>

<p dir="ltr">Currently, I compute the distance between two documents by iterating over their terms and looking up each term's frequency in a map, so I don't store an explicit vector for each document. In K-Means, however, the mean of a cluster is generally not itself a document vector, and it will be much denser than any single document, so representing the centroids is becoming a problem. Should I use a map for the centroids too, storing every term in the cluster along with its average value?<br>

Or would it be better to store an explicit document vector for every document?</p>
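<p dir="ltr">To make the map option concrete, here is a rough sketch of what I have in mind, written in Python purely for illustration. It assumes each document is already available as a <code>{term: weight}</code> map; the names <code>cosine_distance</code> and <code>mean_centroid</code> are hypothetical and not part of any existing code.</p>

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse vectors stored as {term: weight}."""
    # Iterate over the smaller map so the dot product costs O(min(|a|, |b|)).
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an empty vector is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_centroid(docs):
    """Average a non-empty list of sparse vectors into one sparse centroid.

    The result only holds terms that occur in at least one member document,
    so it is denser than a single document but never spans the whole vocabulary.
    """
    if not docs:
        raise ValueError("cannot average an empty cluster")
    sums = defaultdict(float)
    for doc in docs:
        for term, w in doc.items():
            sums[term] += w
    n = len(docs)
    return {term: total / n for term, total in sums.items()}


# Toy cluster with made-up term weights.
cluster = [
    {"apple": 2.0, "pie": 1.0},
    {"apple": 1.0, "tart": 3.0},
]
centroid = mean_centroid(cluster)
distances = [cosine_distance(doc, centroid) for doc in cluster]
```

<p dir="ltr">With this layout the centroid is just another map, so the same distance function works for document-to-document and document-to-centroid comparisons.</p>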

<p dir="ltr">Thanks.<br>

</p>