GSOC-2016 Project : Clustering of search results

Richhiey Thomas richhiey.thomas at gmail.com
Tue Mar 15 06:32:01 GMT 2016


On Mon, Mar 14, 2016 at 5:12 PM, James Aylett <james-xapian at tartarus.org>
wrote:

> On Mon, Mar 14, 2016 at 02:09:13AM +0530, Richhiey Thomas wrote:
>
> If there are good default values that will work for most people, that
> would be fine (that's what we do for the parameters to BM25, for
> instance; in more extreme situations you have to change from the
> defaults, but most of the time you can just accept them and get
> reasonable behaviour). If the documentation can give an idea of when
> you'd want to change from the defaults, that would be even better.
>

Yes! We can use default values that work well with K-means and PSO, and then
provide an option to override these defaults in case of a specialized use
case. That should work well in most cases.
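
To make the idea concrete, here is a minimal sketch of overridable defaults.
Everything in it is an assumption for illustration: ClusterParams and
is_valid() are hypothetical names, not part of Xapian, and the numbers are
commonly cited PSO constants rather than values tuned for this project.

```cpp
#include <cassert>

// Hypothetical parameter bundle; none of these names exist in Xapian.
struct ClusterParams {
    unsigned num_clusters = 10;     // k for K-means (illustrative default)
    unsigned num_particles = 20;    // PSO swarm size (illustrative default)
    unsigned max_iterations = 100;  // stopping criterion (illustrative default)
    double inertia = 0.729;         // commonly cited PSO inertia weight
    double cognitive = 1.49445;     // pull towards a particle's own best
    double social = 1.49445;        // pull towards the swarm's best
};

// Reject settings that cannot produce a meaningful clustering.
inline bool is_valid(const ClusterParams& p) {
    return p.num_clusters > 0 && p.num_particles > 0 &&
           p.max_iterations > 0 && p.inertia >= 0.0 &&
           p.cognitive >= 0.0 && p.social >= 0.0;
}
```

A caller who is happy with the defaults just constructs ClusterParams; a
specialized use case changes only the fields it cares about.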

>
> > To break things down further, we can look at things like this
> > June 7-June 10
> > Create a class to represent particles and write code to initialize each
> > particle in the dataset with cluster-centroids
> > June 11 - June 16:
> > Implement code to find the fitness value for each particle and find the
> > new location of the particle.
> > June 17-June 20:
> > Test the code against any dataset which contains documents so that we can
> > see how the code will be finding the initial centroids and so that we can
> > tweak the initial parameters to get good performance.
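
As a rough sketch of the particle class and the location update described in
the timeline above: the Particle struct and update_particle() are hypothetical
names, and the formula is the textbook PSO velocity/position update, assumed
here rather than taken from the proposal. The random factors r1 and r2 are
passed in so the step stays deterministic and easy to test.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle: its position is one candidate set of cluster
// centroids, flattened into a single vector of coordinates.
struct Particle {
    std::vector<double> position;
    std::vector<double> velocity;
    std::vector<double> best_position;  // best position this particle has seen
};

// Textbook PSO step (an assumption, not the proposal's exact formula):
//   v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x);  x = x + v
void update_particle(Particle& p, const std::vector<double>& global_best,
                     double w, double c1, double c2, double r1, double r2) {
    assert(p.position.size() == global_best.size());
    assert(p.position.size() == p.velocity.size());
    assert(p.position.size() == p.best_position.size());
    for (std::size_t i = 0; i < p.position.size(); ++i) {
        p.velocity[i] = w * p.velocity[i]
                      + c1 * r1 * (p.best_position[i] - p.position[i])
                      + c2 * r2 * (global_best[i] - p.position[i]);
        p.position[i] += p.velocity[i];
    }
}
```

In a real run r1 and r2 would be drawn uniformly from [0, 1] for each update.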


> It'd be good to break the tests into two groups: one very fine
> grained, testing individual pieces of functionality (such as the
> fitness calculation, and calculating the new particle location); one
> higher level, testing the overall algorithm. Depending on how your
> classes work, the fine grained tests may not need to know about
> Xapian::Document, which might make them easier to write.
>
> either before or at the same time as the code they test.
>

Yes, I completely agree with your suggestion to include fine-grained tests
for individual pieces of functionality and a higher-level test for the
overall algorithm.
The timeline could be improved by setting aside 1-2 days for testing and
documentation after implementing each small part, since that will add more
value than doing it all after implementing the whole algorithm.
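
For example, a fitness calculation written against plain vectors can be
tested without touching Xapian::Document at all. The sketch below is only an
assumption about what the fitness could be (mean Euclidean distance from each
document vector to its nearest centroid); Point, euclidean_distance() and
fitness() are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;  // hypothetical: one document as a term-weight vector

// Euclidean distance between two points of the same dimension.
double euclidean_distance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Assumed fitness: mean distance from each document to its nearest
// centroid. Lower values mean tighter clusters.
double fitness(const std::vector<Point>& centroids,
               const std::vector<Point>& docs) {
    assert(!centroids.empty() && !docs.empty());
    double total = 0.0;
    for (const Point& doc : docs) {
        double nearest = std::numeric_limits<double>::max();
        for (const Point& c : centroids)
            nearest = std::min(nearest, euclidean_distance(doc, c));
        total += nearest;
    }
    return total / docs.size();
}
```

A fine-grained test can then feed in a handful of hand-picked points and
check the result against a value worked out by hand.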

>
> On things like code indentation, you may want to look at
> clang-format. It may take a little while to get a format definition
> file that matches Xapian's code layout, but you can then run it
> regularly so you aren't spending ages fixing things up later. (The
> sooner you fix any indentation issues, the faster you'll get familiar
> with indenting things in the style of Xapian.)
>
> (We don't have a clang-format file at the moment, but if you were to
> make one as part of this project, we could include it for people in
> future.)
>
>
I wouldn't mind making a clang-format file as a pre-GSoC exercise too! It
would also help me get used to the indentation style and the way the code
is laid out in Xapian.
Also, is there any other way I can start contributing to the project, or
anything else I can get started on during this period?


> > I've tried to breakdown the timeline out here but I'm sure that this is
> > still not perfect enough. So I'd love it if you could tell me where I am
> > going wrong and what more detail I could provide so I can think about it
> > and get back to you.
>
> What you have now is a lot better than your previous timeline --
> hopefully you feel that it's more structured as well. I think at this
> point the best move is to update your draft, and (in a few hours) open
> an application on the GSoC website, at which point we can discuss
> things further.
>
Thanks James!
I will be sending my first proposal in a few hours from now! Hoping for
constructive criticism on it!