KMeans Clusterer - Going forward

Richhiey Thomas richhiey.thomas at gmail.com
Thu Jun 15 00:25:39 BST 2017


Hello,

I have finished moving the API to PIMPL classes and will fix issues within
the current code over the next week, based on reviews from mentors.

The next step is to form document vectors that are reduced and more
useful. This saves a lot of run time, since the cost of each distance
calculation depends on the number of terms. Keeping only the useful terms
in a document vector can also improve clustering accuracy, because there
are fewer noise terms. Two important things to be done in this direction
are:

1) Stemming
This is easier because Xapian already provides stemmed terms.

2) Stopword removal
Use either Xapian::SimpleStopper or a subclass of Xapian::Stopper to decide
whether a given term is a stopword. For the stopword list itself, should we
use the lists in xapian/languages/stopwords, or will we have to create one
within the cluster directory?
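
To make the stopword step concrete, here is a rough sketch using only the
standard library. It is an illustration, not the code in the PR: SetStopper,
TermVector and remove_stopwords are made-up names, and SetStopper only
imitates the shape of Xapian::Stopper (an operator() that takes a term and
returns true for a stopword).

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in that mirrors the shape of Xapian::Stopper:
// operator() returns true when the term is a stopword.
class SetStopper {
    std::set<std::string> stopwords;

  public:
    explicit SetStopper(std::set<std::string> words)
        : stopwords(std::move(words)) {}

    bool operator()(const std::string& term) const {
        return stopwords.count(term) != 0;
    }
};

// Term -> weight map standing in for a document vector (an assumption,
// not the representation used in the clusterer).
using TermVector = std::map<std::string, double>;

// Return a copy of `doc` without the terms that the stopper flags.
TermVector remove_stopwords(const TermVector& doc, const SetStopper& stop) {
    TermVector reduced;
    for (const auto& entry : doc) {
        if (!stop(entry.first)) reduced.insert(entry);
    }
    return reduced;
}
```

With real Xapian classes, the stopper would presumably be a
Xapian::SimpleStopper filled from whichever stopword list we settle on.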

Over the second half of the month, the plan is to get feature extraction
and Elkan's k-means (with the triangle inequality) working well.
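
For reference, the basic triangle inequality test can be sketched as below.
This is a simplified illustration and not the planned implementation: it
only uses the fact that if d(c, c') >= 2 * d(x, c) then d(x, c') >= d(x, c),
and leaves out the per-point upper and lower bounds that the full algorithm
maintains. Point, euclidean, centre_distances and nearest_centre are
hypothetical names, and dense vectors are assumed for brevity.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Euclidean distance between two points of equal dimension.
double euclidean(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Pairwise distances between centres, computed once per iteration.
std::vector<std::vector<double>> centre_distances(
        const std::vector<Point>& centres) {
    std::size_t k = centres.size();
    std::vector<std::vector<double>> dist(k, std::vector<double>(k, 0.0));
    for (std::size_t i = 0; i < k; ++i) {
        for (std::size_t j = i + 1; j < k; ++j) {
            dist[i][j] = dist[j][i] = euclidean(centres[i], centres[j]);
        }
    }
    return dist;
}

// Index of the centre nearest to x. `skipped` counts the point-to-centre
// distance computations avoided by the triangle inequality.
std::size_t nearest_centre(const Point& x,
                           const std::vector<Point>& centres,
                           const std::vector<std::vector<double>>& centre_dist,
                           std::size_t& skipped) {
    assert(!centres.empty());
    std::size_t best = 0;
    double best_dist = euclidean(x, centres[0]);
    for (std::size_t j = 1; j < centres.size(); ++j) {
        // d(best, j) >= 2 * d(x, best) implies d(x, j) >= d(x, best),
        // so centre j cannot be strictly closer and is skipped.
        if (centre_dist[best][j] >= 2.0 * best_dist) {
            ++skipped;
            continue;
        }
        double d = euclidean(x, centres[j]);
        if (d < best_dist) {
            best_dist = d;
            best = j;
        }
    }
    return best;
}
```

The saving comes from replacing distance computations over all terms with a
lookup in the small centre-to-centre table, which is why this pairs well with
reduced document vectors.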

As Olly mentioned in one of his comments on the PR, it wouldn't be ideal
to use hard-coded criteria for feature selection, so using something like
an ExpandDecider would be a better fit. I will look into it and clarify my
approach as I go.

Thanks,
Richhiey