GSoC 2017 Project Proposal

James Aylett james at
Sun Mar 19 19:00:48 GMT 2017

On 9 Mar 2017, at 05:18, Richhiey Thomas <richhiey.thomas at> wrote:

> Hello devs.

Hi Richhiey :-)

> 1) Automated Performance Analysis

This is definitely interesting. However, it isn't really worth having unless/until we have more than one clustering system to evaluate, is it? (Beyond the uniform/random one, although I guess if any clusterer performs _worse_ than that, it's a bad sign!)

> 2) Dimensionality Reduction

Do you think we'll need to implement several of these, for different uses? If not, is there a reason you think LSA will work best? You talk about eliminating words that occur rarely in documents — could we have a quick-and-dirty approach that looks at the within-corpus frequency of terms?
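To make the quick-and-dirty idea concrete, here is a rough sketch in Python. It assumes "within-corpus frequency" means document frequency (the number of documents a term appears in), and it treats each document as a plain list of terms. The function name `prune_rare_terms` and the `min_doc_freq` threshold are made up for illustration; nothing here is existing Xapian API.

```python
from collections import Counter


def prune_rare_terms(docs, min_doc_freq=2):
    """Drop terms that occur in fewer than min_doc_freq documents.

    docs is a list of term lists (an assumption for this sketch).
    """
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))

    keep = {term for term, df in doc_freq.items() if df >= min_doc_freq}
    # Preserve the original term order within each document.
    return [[term for term in terms if term in keep] for terms in docs]


docs = [
    ["xapian", "search", "index"],
    ["xapian", "cluster", "search"],
    ["cluster", "kmeans", "xapian"],
]
pruned = prune_rare_terms(docs, min_doc_freq=2)
```

Here "index" and "kmeans" each appear in only one document, so they are dropped. Whether a fixed threshold like this is good enough compared to LSA is exactly the open question.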

(If multiple different approaches make sense, we'll end up having each encapsulated as a class or similar, in the same way we're doing with Clusterer. It's worth having an idea of this up front, in case it changes the approach.)
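Purely as an illustration of that shape, here is a minimal Python sketch. The `Reducer` base class and both subclasses are hypothetical names invented for this example, not anything in the current code, and documents are again assumed to be plain lists of terms.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Reducer(ABC):
    """Hypothetical interface: turns term lists into reduced term lists."""

    @abstractmethod
    def reduce(self, docs):
        """Return a new list of term lists, one per input document."""


class IdentityReducer(Reducer):
    """No-op reducer, handy as a baseline to compare against."""

    def reduce(self, docs):
        return [list(terms) for terms in docs]


class MaxTermsReducer(Reducer):
    """Keep only the `limit` terms that appear in the most documents."""

    def __init__(self, limit):
        self.limit = limit

    def reduce(self, docs):
        doc_freq = Counter()
        for terms in docs:
            doc_freq.update(set(terms))
        # Highest document frequency first; ties broken alphabetically
        # so the result is deterministic.
        ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
        keep = set(ranked[: self.limit])
        return [[t for t in terms if t in keep] for terms in docs]


def vocabulary_size(docs, reducer):
    """Count distinct terms left after applying any Reducer."""
    return len({t for terms in reducer.reduce(docs) for t in terms})


sample = [["a", "b"], ["a", "c"], ["a", "b", "d"]]
full = vocabulary_size(sample, IdentityReducer())
small = vocabulary_size(sample, MaxTermsReducer(limit=2))
```

The point is only that calling code such as `vocabulary_size` doesn't need to know which reduction is in use, so adding LSA later would mean adding another subclass.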

> 3) Hierarchical Clustering
> Since the clustering API is already in place, I would like to implement a hierarchical clusterer to cluster the search results.

Do you have a particular approach you think is a good one? Are you thinking agglomerative or divisive?
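For reference, the agglomerative (bottom-up) direction can be sketched in a few lines of Python: start with every item in its own cluster and repeatedly merge the two closest clusters. This sketch assumes single linkage (cluster distance is the smallest pairwise distance) and stops at a fixed number of clusters; both are choices made for the example, not a recommendation. A divisive approach would go the other way, starting from one cluster and splitting.

```python
def agglomerative(points, k, distance):
    """Single-linkage agglomerative clustering down to k clusters.

    points: list of items; distance: function taking two items.
    Naive implementation for illustration only; it recomputes all
    pairwise cluster distances on every merge.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    # Start with each point in its own cluster.
    clusters = [[p] for p in points]

    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i; j > i, so index i stays valid.
        clusters[i].extend(clusters[j])
        del clusters[j]

    return clusters


values = [1.0, 1.2, 5.0, 5.1, 9.0]
result = agglomerative(values, k=3, distance=lambda a, b: abs(a - b))
```

With these toy values, 5.0 and 5.1 merge first, then 1.0 and 1.2, leaving 9.0 on its own. For documents, `distance` would be whatever similarity measure the existing clustering code uses.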


James Aylett

More information about the Xapian-devel mailing list