GSoC-2017 Introduction and Project Discussion

Tue Mar 21 04:00:59 GMT 2017

On Sun, Mar 19, 2017 at 11:13:37PM +0530, Shivang Bansal wrote:
> I will surely try to propose a completely detailed plan on how
> graph-of-words model could be implemented in Xapian. The fact that Xapian
> does not support graph at present could make things a bit difficult.

I think we need to be realistic about the amount of work this is likely
to be - in order to make this a working system, you're going to need to
add support for this to at least TermGenerator (which splits up text to
generate terms), to the Document class, to the database backends, to the
QueryParser, to the Query class, and to the matcher.  Depending on the
details of the algorithms involved (which I've not looked at yet) there
may be even more places.  That's a lot of code to be working with,
especially if you're starting from a position of being unfamiliar with any
of it.

We're looking for projects that can be successfully completed, which
includes getting the code complete with suitable test coverage and
documentation and in line with the coding standards, and having it ready
to merge (if not actually merged) by the end of the 3-month coding
period.  While
many students are keen to try to finish off their project, progress is
inevitably much slower once university (or a job) starts and they aren't
working full-time on the project any longer.

Experience has shown us that projects which aren't ready to merge at
that point more often than not never get merged.  That's demoralising for
both the student and the mentors.

Where possible, we encourage students to structure projects as a series
of sub-projects to complete and merge separately, so that if there are
problems which delay progress at least some useful work gets merged.
But "adding TW-IDF" doesn't seem amenable to this - it's not useful as a
feature until it's supported all the way through the list I gave above,
which is especially problematic as the scope seems ambitious to start
with.

> But, I strongly believe that it would be worthful as this model will
> judge the terms according to their relationship order in the documents
> which would enhance the effectiveness of search results (please let me
> know if you think otherwise).

It probably would if implemented.  If it sits unfinished on a branch it
isn't going to enhance the effectiveness of anything.  It also needs to
work fast enough to be usable on non-trivial document collections,
and to not take up huge amounts of extra disk space.

I'm not necessarily saying this is a non-starter, but I'm going to need
convincing that the scope is realistic given your skills and experience,
and that you are capable of quickly getting to grips with all the
different areas of the code you'll need to work on.  I'd suggest "show
don't tell" - simply asserting that you can do the job is not nearly as
convincing as showing us prototype patches, plans for how you're
going to store the extra data needed compactly but such that it can
be efficiently accessed, etc.

Cheers,
    Olly