<div dir="ltr"><div class="gmail_default" style="color:rgb(11,83,148)">Hi Richhiey<br><br></div><div class="gmail_default" style="color:rgb(11,83,148)">Storing the centroids as double arrays is the better choice: they are dense anyway, and a flat array is simple to operate on in parallel, e.g. if you want to pass it to a BLAS subroutine. In my previous email, I tried to calculate the space requirements for this.<br><br></div><div class="gmail_default" style="color:rgb(11,83,148)">A document can be stored in a sparse format (as you do with the map) and made dense just before it is passed to a cosine similarity subroutine (create a new double array and set the entries at the document's non-zero indexes). Alternatively, if you decide to operate in the sparse space, you can efficiently access only the centroid entries at indexes for which the document has a non-zero entry.<br><br></div><div class="gmail_default" style="color:rgb(11,83,148)">Basically, you have to make a design choice here. In the former case, you can use BLAS-like subroutines, while in the latter, you save computation. The former is also valid for the Euclidean distance metric while the latter is not. Olly/James might have an opinion (maybe from the previous cluster branch).<br><br></div><div class="gmail_default" style="color:rgb(11,83,148)">Cheers<br></div><div class="gmail_default" style="color:rgb(11,83,148)">Parth<br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jul 27, 2016 at 6:17 PM, Richhiey Thomas <span dir="ltr">&lt;<a href="mailto:richhiey.thomas@gmail.com" target="_blank">richhiey.thomas@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><p dir="ltr">Hey Parth,</p>
<p dir="ltr">Thanks for the reply.<br>
I am considering implementing a cosine distance metric too, along with Euclidean distance, because of the dimensionality issue that comes with K-Means and the Euclidean distance metric.<br>
That does help when we deal with sparse vectors for documents. The particular problem I'm having is representing centroids in an efficient way.<br>
For example, when we find the mean vector of a cluster, the resulting centroid need not be the document vector of any document belonging to that cluster. Hence representing that centroid, which will be dense, as a C++ map is inefficient because of the number of terms associated with it, and calculating distances with it doesn't work or scale too well.<br>
On top of that, my distance calculation works on two documents. Will I need to modify it to accommodate arbitrary vectors that might not represent documents?<br>
It would be great if everyone could add their input on this.</p>
</blockquote></div><br></div>
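<p dir="ltr">The two options discussed in the reply above (make the sparse document dense before the similarity call, or walk only the document's non-zero entries against a dense centroid) can be sketched roughly as follows. This is a minimal illustration, not code from the clustering branch: the type aliases <code>SparseDoc</code> and <code>DenseVec</code> and all function names are hypothetical, and it assumes every term index in a document is smaller than the centroid's length.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical types for illustration only.
typedef std::map<std::size_t, double> SparseDoc;  // term index -> weight
typedef std::vector<double> DenseVec;             // one slot per term

// Option 1: expand the sparse document into a dense array.
// Assumes every index in doc is < dims.
DenseVec densify(const SparseDoc& doc, std::size_t dims) {
    DenseVec out(dims, 0.0);
    for (SparseDoc::const_iterator it = doc.begin(); it != doc.end(); ++it)
        out[it->first] = it->second;
    return out;
}

// Plain dense dot product; a BLAS ddot call could replace this loop.
double dot_dense(const DenseVec& a, const DenseVec& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}

// Cosine similarity of two dense vectors of equal length.
double cosine_dense(const DenseVec& a, const DenseVec& b) {
    double na = std::sqrt(dot_dense(a, a));
    double nb = std::sqrt(dot_dense(b, b));
    if (na == 0.0 || nb == 0.0) return 0.0;  // convention for zero vectors
    return dot_dense(a, b) / (na * nb);
}

// Option 2: stay sparse. Only the document's non-zero entries contribute
// to the dot product, so the loop touches doc.size() centroid slots.
// centroid_norm is assumed to be precomputed once per centroid.
double cosine_sparse(const SparseDoc& doc, const DenseVec& centroid,
                     double centroid_norm) {
    double dot = 0.0, doc_sq = 0.0;
    for (SparseDoc::const_iterator it = doc.begin(); it != doc.end(); ++it) {
        dot += it->second * centroid[it->first];
        doc_sq += it->second * it->second;
    }
    if (doc_sq == 0.0 || centroid_norm == 0.0) return 0.0;
    return dot / (std::sqrt(doc_sq) * centroid_norm);
}
```

<p dir="ltr">Both functions return the same similarity for the same inputs. The dense path costs time proportional to the vocabulary size per comparison but maps directly onto BLAS-style routines, while the sparse path costs time proportional to the number of non-zero terms in the document, provided the centroid norm is cached.</p>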