KMeans - Going forward

Richhiey Thomas richhiey.thomas at gmail.com
Sun Jul 23 20:50:51 BST 2017


Hello,

Work on stopword removal and stemming is almost finished, and the run time
for KMeans has come down (around 0.15 s for 100 documents, rising to around
1.2 s for 500 documents and 2.5 s for 1000 documents). I tried this out on
the BBC datasets with k=5, since there are 5 categories in the dataset.

Going forward, the next step in optimizing KMeans is to use Charles Elkan's
faster variant, which uses the triangle inequality to reduce the number of
distance computations. For this, I will give the user an option in the
constructor to choose between the standard algorithm and Elkan's algorithm,
and write a method within KMeans to implement the triangle inequality
optimization. I will also be moving RoundRobin to the testsuite.
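
To make the triangle inequality idea concrete, here is a rough sketch of the
assignment step in Python (for brevity, not the Xapian C++ code). The names
`euclidean` and `assign`, and the `prune` flag standing in for the
constructor option, are made up for this example. It only shows the basic
test: if d(c_best, c_j) >= 2 * d(p, c_best), then c_j cannot be closer to p
than c_best, so d(p, c_j) need not be computed. Elkan's full algorithm also
keeps per-point upper and lower bounds across iterations, which this sketch
leaves out.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(points, centroids, prune=True):
    """Assign each point to its nearest centroid.

    With prune=True, centroids ruled out by the triangle inequality are
    skipped. Returns (labels, number of point-centroid distances computed).
    """
    k = len(centroids)
    # Centroid-to-centroid distances, computed once per assignment pass.
    cc = [[euclidean(centroids[i], centroids[j]) for j in range(k)]
          for i in range(k)]
    labels = []
    computed = 0
    for p in points:
        best = 0
        best_dist = euclidean(p, centroids[0])
        computed += 1
        for j in range(1, k):
            # d(p, c_j) >= d(c_best, c_j) - d(p, c_best) >= d(p, c_best)
            # whenever d(c_best, c_j) >= 2 * d(p, c_best), so skip c_j.
            if prune and cc[best][j] >= 2 * best_dist:
                continue
            d = euclidean(p, centroids[j])
            computed += 1
            if d < best_dist:
                best, best_dist = j, d
        labels.append(best)
    return labels, computed


if __name__ == "__main__":
    # Hypothetical toy data: three well-separated groups.
    points = [(0, 0), (0.1, 0), (10, 10), (10.1, 10), (20, 0)]
    centroids = [(0, 0), (10, 10), (20, 0)]
    print(assign(points, centroids, prune=False))
    print(assign(points, centroids, prune=True))
```

Both calls give the same labels; the pruned one just computes fewer
point-to-centroid distances, and the saving should grow as the clusters
become better separated.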

Thanks.