<div dir="ltr"><div><div><div><div><div><div><div><div><div>Hello,<br><br></div>I have finished moving the API to PIMPL classes and will fix issues in the current code over the next week, based on reviews from the mentors.<br><br></div>The next step is to form reduced, more useful document vectors. This significantly reduces run time, since the cost of each distance calculation depends on the number of terms. Keeping only the useful terms in a document's vector can also improve accuracy, because there are fewer noise terms. Two important things to be done in this direction are:<br><br></div>1) Stemming<br></div><div>This is easier because Xapian already provides stemmed terms.<br></div><div><br></div>2) Stopword removal<br></div>Use either Xapian::SimpleStopper or a subclass of Xapian::Stopper to determine whether a term fed to it is a stopword. To decide which terms are stopwords, should we use the stopword lists in xapian/languages/stopwords, or will we have to create one within the cluster directory?<br><br></div>Over the next half of the month, the plan is to get feature extraction and Elkan's k-means (with the triangle inequality) working well.<br><br></div>As Olly mentioned in one of his comments on the PR, it wouldn't be ideal to use hard-coded criteria for feature selection, so using something like an ExpandDecider would be a good fit. I will look into it and make my approach clear as I go ahead.<br><br></div>Thanks,<br></div>Richhiey<br></div>