KMeans Clusterer - Going forward

Sun Jun 18 23:19:36 BST 2017

[Please keep emails on the mailing list.]

> On 18 Jun 2017, at 22:43, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:
> 
>> Are you planning on dropping all the stemmed terms, or all the unstemmed terms?
> 
> I plan on dropping all the unstemmed terms, since that reduces the size of the termlist more and can also take care of noise in the text data, such as spelling mistakes.

Hmm, I wonder how much of a negative impact false positive conflation errors from the stemmer will have here.

It's fairly easy either way; I suspect in future we'll come up with something more sophisticated and under the user's control, but that shouldn't hold us up for now.
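
As a rough illustration of the "drop the unstemmed terms" option, here is a minimal sketch in plain Python (not the Xapian API). It assumes the documents were indexed so that stemmed forms carry the "Z" term prefix, which is what Xapian's TermGenerator uses for stemmed terms; the function name keep_stemmed_terms and the sample termlist are made up for the example.

```python
def keep_stemmed_terms(termlist, stem_prefix="Z"):
    """Return only the terms that carry the stemmed-form prefix.

    Assumption: stemmed forms are marked with `stem_prefix` and
    unstemmed forms have no prefix at all.
    """
    return [term for term in termlist if term.startswith(stem_prefix)]


# Hypothetical termlist for one document: each word appears both
# unstemmed and as a Z-prefixed stem.
terms = ["Zcluster", "cluster", "clustering", "clusters",
         "Zdocument", "documents"]
stemmed_only = keep_stemmed_terms(terms)
```

Several surface forms collapse onto one stem, which is where both the size reduction and the conflation risk come from: a commonly cited case is an English Porter-style stemmer mapping both "universe" and "university" to "univers".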

>> I'd suggest that you allow users to pass in a Stopper subclass, which gives them maximum control. You don't need to create a new stopword list, or manage it at all. For documentation and examples, I'd either use a builtin list or provide an explicit list of terms.
> 
> This sounds good. If the user does not provide a Stopper, I guess it would be ideal to initialise the Stopper subclass with the common stopword list that we already have. This can be passed to KMeans in its constructor, and any initialisation can be done there.

I'd start with the default being no stopping if there's no explicit stopper. There may be situations where that is the right approach (particularly if you have a complex multi-language situation, or you aren't using word-like terms at all), so it'd be a shame to make it impossible.
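
A minimal sketch of that default, again in plain Python rather than the real API: the stopper is an optional constructor argument, and leaving it out means no terms are dropped. SimpleStopper and KMeansSketch are hypothetical stand-ins; the only thing borrowed from Xapian is the convention that a stopper is called with a term and returns true if the term is a stopword.

```python
class SimpleStopper:
    """Hypothetical stand-in for a Stopper subclass.

    Calling it with a term returns True if the term should be dropped.
    """

    def __init__(self, stopwords):
        self.stopwords = set(stopwords)

    def __call__(self, term):
        return term in self.stopwords


class KMeansSketch:
    """Only the stopper handling is sketched; the clustering is omitted."""

    def __init__(self, k, stopper=None):
        self.k = k
        # None means "no stopping", which is the default.
        self.stopper = stopper

    def filter_terms(self, terms):
        if self.stopper is None:
            return list(terms)
        return [term for term in terms if not self.stopper(term)]


no_stopping = KMeansSketch(3).filter_terms(["the", "cat", "sat"])
with_stopping = KMeansSketch(3, SimpleStopper(["the"])).filter_terms(
    ["the", "cat", "sat"])
```

With this shape, a user in a multi-language setup, or one whose terms aren't word-like, simply passes nothing, and anyone who wants the builtin list can wrap it in a stopper explicitly.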

J

-- 
 James Aylett, occasional troublemaker & project governance
 xapian.org