K MEANS clustering
James Aylett
james-xapian at tartarus.org
Thu Jul 28 10:57:54 BST 2016
On Thu, Jul 28, 2016 at 12:15:51PM +0530, Parth Gupta wrote:
> Storing the centroids as double arrays is a better choice because of their
> dense nature and simplicity for operating over in parallel e.g. if you want
> to pass it to BLAS subroutine. In my previous email, I tried to calculate
> space requirements for it.
Centroids are certainly more likely to be dense than document vectors
in this case. I'd initially just get something working, probably
dense, and worry about performance once you have a test
suite. (Everything is easier once you have a test suite.)
> A document can be stored in the sparse format (like you do with map) but
> before passing it to a cosine similarity subroutine you can make it dense
> (create a new double array and set its particular non-zero indexes).
> Alternatively, if you decide to operate in the sparse space, you can
> efficiently access centroid entries at indexes for which document has
> non-zero entry.
Computing cosine is pretty easil between two sparsely-stored document
vectors providing you know the size of the equivalent array. Just
iterate over the indices, and keep track of the three terms making up
the similarity measure; on a zero component the vector-specific term
won't increase, and the numerator term won't increase if the component
in either vector is zero.
Don't forget that once you've initialised the system, you shouldn't need
the map from term to id.
> The former is also valid for the euclidean distance metric while the
> latter is not. Olly/James might have an opinion (may be from the
> previous cluster branch).
I don't have thoughts from previous clustering projects, but I will
note that no decision made at this point is final, and it should be
reasonably easy to change things once there's a running version that
we can profile to see how it behaves.
J
--
James Aylett, occasional trouble-maker
xapian.org
More information about the Xapian-devel
mailing list