<div dir="ltr"><div><div><div>Hello devs,<br><br></div>I am Richhiey Thomas, a third-year undergraduate student in Computer Science at Mumbai University. I went through this year's project list, and the project idea based on clustering caught my attention. I spoke to Assem Chelli on IRC, who pointed me to the code and got me started.<br><br></div>I have been going through the code and have successfully built Xapian on my machine. If I am not mistaken, the current clustering branch uses hierarchical clustering, which has quadratic or higher complexity in the number of documents and is therefore inefficient for large data sets.<br><br></div><div>A better approach would be to use K-means clustering or a variant (such as bisecting K-means), which can provide equal or better performance than the implemented hierarchical technique while having lower complexity (linear in the number of documents per iteration), making it better suited to large data sets.<br><br></div><div>It would be great to have your opinions on how to go about this, or on any other approaches that may be better suited to document clustering in the context of Xapian.<br><br></div><div>Thanks! :)<br></div><div><br><br></div><br><br></div>