KMeans - Evaluation Results

Richhiey Thomas richhiey.thomas at gmail.com
Wed Aug 17 13:40:24 BST 2016


> How long does 200–300 documents take to cluster? How does it grow as more
> documents are included in the MSet? We'd expect an MSet of 1000 documents
> to take longer to cluster than one with 100, but the important thing is
> _how_ the time increases as the number of documents grows.
>
Currently, the number of seconds taken to cluster document sets of
varying sizes is:

100 documents - 0.50 s
200 documents - 1.5 s
300 documents - 4.5 s
400 documents - 6.02 s
500 documents - 10.3 s
600 documents - 17.02 s
700 documents - 23.56 s
800 documents - 29.12 s
900 documents - 36.87 s
1000 documents - 42.46 s
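
As a rough way to read _how_ the time grows, the sketch below fits a
least-squares line to log(seconds) against log(documents) for the timings
above. The slope is an estimate of the exponent p in time ~ n^p; for these
figures it comes out at roughly 2, which would suggest growth closer to
quadratic than linear. The names growth_exponent, kDocs and kSecs are made
up for this illustration and are not part of Xapian.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Least-squares slope of log(seconds) against log(documents).
// A slope near 1 suggests linear growth, near 2 quadratic growth.
// (Hypothetical helper, for illustration only.)
double growth_exponent(const std::vector<double>& docs,
                       const std::vector<double>& secs)
{
    assert(docs.size() == secs.size() && docs.size() >= 2);
    const double n = static_cast<double>(docs.size());
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < docs.size(); ++i) {
        const double x = std::log(docs[i]);
        const double y = std::log(secs[i]);
        sx += x;
        sy += y;
        sxx += x * x;
        sxy += x * y;
    }
    // Standard closed form for the slope of a simple linear regression.
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}

// The timings reported in this mail.
const std::vector<double> kDocs = {100, 200, 300, 400, 500,
                                   600, 700, 800, 900, 1000};
const std::vector<double> kSecs = {0.50, 1.5, 4.5, 6.02, 10.3,
                                   17.02, 23.56, 29.12, 36.87, 42.46};
```

Calling growth_exponent(kDocs, kSecs) gives the fitted exponent. Ten
single-run timings are a small sample, so this is only an indication.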

> Surely that's the right behaviour for that kind of data? (Although AIUI
> KMeans is supposed to be that good in that situation: is that what you
> mean?)
>
Yes, that's the right kind of behaviour for KMeans++. KMeans++ seeding, too,
takes almost the same amount of time to converge to a solution.

I'll address the other things you mentioned in your mail soon. Thanks for
the information on the documentation that will be required.

As you mentioned, pruning the API to hide implementation details that are
not part of the public API is an important thing to do. I was looking at
how PIMPL has been adopted in Xapian and, if I'm not wrong, this has been
done with the Internal class. However, I hadn't written the API in a way
that agrees with that design. Are there any tips or guidelines I could
follow to make the current API conform with PIMPL as implemented in Xapian?
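
To make the question concrete, here is a minimal sketch of the general
PIMPL shape I have in mind: a public class that only holds a pointer to a
nested Internal class, which is declared in the header and defined in the
source file. This is an assumption about the pattern, not Xapian's actual
code. The class name KMeansSketch and its members are hypothetical, and
std::shared_ptr stands in for whatever reference-counted pointer Xapian
uses internally.

```cpp
#include <cassert>
#include <memory>

// ---- What would live in the public header ----
class KMeansSketch {  // hypothetical name, not a Xapian class
  public:
    explicit KMeansSketch(unsigned k, unsigned max_iters = 100);

    unsigned get_k() const;
    unsigned get_max_iters() const;

    // Declared here, defined only in the source file, so users of the
    // header never see the data members.
    class Internal;

  private:
    // Copies of KMeansSketch share one Internal (handle semantics).
    std::shared_ptr<Internal> internal;
};

// ---- What would live in the .cc file ----
class KMeansSketch::Internal {
  public:
    unsigned k;
    unsigned max_iters;

    Internal(unsigned k_, unsigned max_iters_)
        : k(k_), max_iters(max_iters_) {}
};

KMeansSketch::KMeansSketch(unsigned k, unsigned max_iters)
    : internal(std::make_shared<Internal>(k, max_iters))
{
    assert(k > 0);  // a clusterer needs at least one cluster
}

// Public methods simply forward to the hidden implementation.
unsigned KMeansSketch::get_k() const { return internal->k; }

unsigned KMeansSketch::get_max_iters() const { return internal->max_iters; }
```

With this layout, changing the members of Internal would not change the
public header, which is the property I'd like the clustering API to have.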

Thanks.

Regards,
Richhiey