KMeans Clusterer - Going forward

James Aylett james-xapian at tartarus.org
Sun Jun 18 19:00:49 BST 2017


On 15 Jun 2017, at 00:25, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:

> The next step is to form reduced, more useful document vectors. This helps a lot with run time (since the time for distance calculation depends on the number of terms). Keeping only the useful terms of a document in its document vector can also improve accuracy, because there are fewer noise terms. Two important things to do in this direction are:
> 
> 1) Stemming
> This is easier because Xapian already provides stemmed terms.

Are you planning on dropping all the stemmed terms, or all the unstemmed terms?
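
To make the choice concrete, here is a minimal sketch. It assumes the database was indexed so that stemmed forms carry the conventional 'Z' term prefix alongside the unstemmed forms; is_stemmed_term() and select_terms() are hypothetical names for illustration, not part of Xapian.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Assumption: stemmed terms are marked with a leading 'Z'. Any other term,
// including ones with other prefixes, is treated as unstemmed here.
bool is_stemmed_term(const std::string& term) {
    return !term.empty() && term[0] == 'Z';
}

// Hypothetical helper: keep either only the stemmed terms or only the
// unstemmed ones, so the document vector holds one form of each word.
std::vector<std::string> select_terms(const std::vector<std::string>& terms,
                                      bool keep_stemmed) {
    std::vector<std::string> kept;
    for (const auto& term : terms) {
        if (is_stemmed_term(term) == keep_stemmed) {
            kept.push_back(term);
        }
    }
    return kept;
}
```

Whichever side you keep, the vector roughly halves in size. Keeping the stemmed side also merges word variants into a single dimension.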

> 2) Stopword removal
> Use either Xapian::SimpleStopper or a subclass of Xapian::Stopper to determine whether a term that is fed to it is a stopword or not. But for determining which terms are stopwords, I was wondering whether we'd use the stopword list within xapian/languages/stopwords or have to create one within the cluster directory?

I'd suggest that you allow users to pass in a Stopper subclass, which gives them maximum control. You don't need to create a new stopword list, or manage it at all. For documentation and examples, I'd either use a builtin list or provide an explicit list of terms.
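
Roughly the shape I have in mind, as a sketch only. To keep it self-contained, the Stopper below is a local stand-in that mirrors the functor style of Xapian::Stopper rather than the real class; ListStopper and filter_terms() are hypothetical names.

```cpp
#include <cassert>
#include <initializer_list>
#include <set>
#include <string>
#include <vector>

// Local stand-in for a stopper interface: a functor that returns true when
// the term should be treated as a stopword. Not the real Xapian class.
struct Stopper {
    virtual ~Stopper() = default;
    virtual bool operator()(const std::string& term) const = 0;
};

// Hypothetical stopper backed by an explicit list of terms, the kind of
// thing that is handy for documentation and examples.
class ListStopper : public Stopper {
    std::set<std::string> stopwords;

  public:
    explicit ListStopper(std::initializer_list<std::string> words)
        : stopwords(words) {}

    bool operator()(const std::string& term) const override {
        return stopwords.count(term) != 0;
    }
};

// Hypothetical helper: the clusterer only asks the caller's stopper about
// each term and never owns or manages a stopword list itself.
// A null stopper means "no stopping".
std::vector<std::string> filter_terms(const std::vector<std::string>& terms,
                                      const Stopper* stopper) {
    std::vector<std::string> kept;
    for (const auto& term : terms) {
        if (stopper && (*stopper)(term)) {
            continue;  // the stopper rejected this term
        }
        kept.push_back(term);
    }
    return kept;
}
```

Because the clustering code only sees the interface, a user can plug in a list-backed stopper, a language-specific one, or anything else without changes on your side.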

> Over the next half of the month, the plan is to get feature extraction and Elkan's k-means (with the triangle inequality) working well.

In that order, I assume, so focussing on the two straightforward dimensionality reduction approaches (stemming and stopping) until they're working and merged, and then looking at things like the triangle inequality optimisation.
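
For reference, the core of the triangle inequality trick fits in a few lines. This is a sketch of just one of the bounds used in Elkan's algorithm, not the whole thing: if d(c, c') >= 2 * d(x, c), then d(x, c') >= d(x, c), so the distance from x to c' never needs computing. The names euclidean() and nearest_centroid() are hypothetical, and points are plain dense vectors rather than Xapian document vectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

// Plain Euclidean distance between two points of equal dimension.
double euclidean(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::sqrt(sum);
}

// Returns the index of the centroid nearest to x (centroids must be
// non-empty) and adds the number of pruned distance computations to skipped.
std::size_t nearest_centroid(const Point& x,
                             const std::vector<Point>& centroids,
                             std::size_t& skipped) {
    const std::size_t k = centroids.size();

    // Centre-to-centre distances. Computed here for brevity; a real
    // implementation would compute them once per iteration, not per point.
    std::vector<std::vector<double>> cc(k, std::vector<double>(k, 0.0));
    for (std::size_t i = 0; i < k; ++i) {
        for (std::size_t j = i + 1; j < k; ++j) {
            cc[i][j] = cc[j][i] = euclidean(centroids[i], centroids[j]);
        }
    }

    std::size_t best = 0;
    double best_dist = euclidean(x, centroids[0]);
    for (std::size_t j = 1; j < k; ++j) {
        // Triangle inequality: centroid j cannot be closer than the current
        // best, so skip the distance calculation entirely.
        if (cc[best][j] >= 2.0 * best_dist) {
            ++skipped;
            continue;
        }
        const double d = euclidean(x, centroids[j]);
        if (d < best_dist) {
            best = j;
            best_dist = d;
        }
    }
    return best;
}
```

The saving only matters once distance calculations dominate the run time, which is another reason to get the dimensionality reduction merged first.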

> As Olly has mentioned in one of his comments on the PR, it wouldn't be ideal to use hard coded criteria for feature selection. Thus using something like an ExpandDecider would certainly be great. I will look into it and make my approach clear as I go ahead.

This is definitely nice to have, but I suspect getting a solid and performant system is a better focus. A good thing to do is to keep track of ideas like this as they come up, and reconsider them the next time you look afresh at your timeline and where you are against it. (It's good to do this at the evaluation points, for instance.)

J

-- 
 James Aylett, occasional troublemaker & project governance
 xapian.org
