GSoC 2017 Project Proposal
James Aylett
james at tartarus.org
Sun Mar 19 19:00:48 GMT 2017
On 9 Mar 2017, at 05:18, Richhiey Thomas <richhiey.thomas at gmail.com> wrote:
> Hello devs.
Hi Richhiey :-)
> 1) Automated Performance Analysis
This is definitely interesting. However, it isn't really worth having unless/until we have more than one clustering system to evaluate, is it? (Beyond the uniform/random one, although I guess if any clusterer performs _worse_ than that, it's a bad sign!)
> 2) Dimensionality Reduction
Do you think we'll need to implement several of these, for different uses? If not, is there a reason you think LSA will work best? You talk about eliminating words that occur rarely in documents — could we have a quick-and-dirty approach that looks at the within-corpus frequency of terms?
(If multiple different approaches make sense, we'll end up having each encapsulated as a class or similar, in the same way we're doing with Clusterer. It's worth having an idea of this up front, in case it changes the approach.)
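To make the quick-and-dirty idea concrete, here is a rough sketch in Python (for brevity, not a proposal for the real C++ API). It assumes documents arrive as lists of terms; the names TermReducer, DocFreqPruner and min_doc_freq are hypothetical and don't exist in Xapian. It shows one reduction approach behind a small common interface, so that something like LSA could later be added as another subclass.

```python
from abc import ABC, abstractmethod
from collections import Counter


class TermReducer(ABC):
    """Hypothetical interface: one subclass per reduction approach."""

    @abstractmethod
    def reduce(self, docs):
        """Take a list of term lists and return reduced term lists."""


class DocFreqPruner(TermReducer):
    """Drops terms that occur in fewer than min_doc_freq documents."""

    def __init__(self, min_doc_freq=2):
        self.min_doc_freq = min_doc_freq

    def reduce(self, docs):
        # Document frequency: count each term at most once per document.
        doc_freq = Counter()
        for terms in docs:
            doc_freq.update(set(terms))
        keep = {t for t, df in doc_freq.items() if df >= self.min_doc_freq}
        # Preserve term order within each document.
        return [[t for t in terms if t in keep] for terms in docs]


docs = [
    ["xapian", "search", "index"],
    ["xapian", "cluster", "search"],
    ["cluster", "kmeans"],
]
pruned = DocFreqPruner(min_doc_freq=2).reduce(docs)
```

With this toy corpus, "index" and "kmeans" each occur in only one document, so they are the terms that get dropped. The threshold is an arbitrary choice for illustration.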
> 3) Hierarchical Clustering
> Since the clustering API is already in place, I would like to implement a hierarchical clusterer to cluster the search results.
Do you have a particular approach you think is a good one? Are you thinking agglomerative or divisive?
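For reference, by agglomerative I mean the bottom-up style: start with every document in its own cluster and repeatedly merge the closest pair. A minimal sketch in Python, assuming documents are already reduced to numeric points and using single linkage with Euclidean distance (both are assumptions for illustration, not a recommendation); the function name is hypothetical:

```python
import math


def agglomerative(points, k):
    """Merge the two closest clusters (single linkage) until k remain."""
    # Start with each point in its own cluster.
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(x, y) for x in a for y in b)

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i; j > i, so index i stays valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
result = agglomerative(points, 3)
```

A divisive approach would go the other way, starting from one cluster containing everything and splitting it. The naive pairwise search above is slow on large result sets, which is one reason the choice of approach matters.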
J
--
James Aylett
devfort.com — spacelog.org — tartarus.org/james/