[Xapian-discuss] Participation in GSOC

Michael Thomas m.thomas at networksec.de
Tue Mar 29 23:07:26 BST 2011


I'm Michael. I would like to participate in this year's Google Summer of
Code, and I picked Xapian as the project to code for.

Before writing a full proposal, I want to get in contact with the
community, introduce myself, and discuss my ideas for contributing to
Xapian.

First of all, I'd like to talk about my motivation.
I'm currently working on a webapp for document classification, and I came
across Xapian while researching open source search engine alternatives
to Lucene. It seemed like a fairly good, lightweight search engine
library to me, so I decided to use it rather than implement one myself,
and to push its development further by adding features I could use for
my own project.

I checked the idea wiki and the source code, and noticed that only BM25
and a probabilistic approach are currently implemented as weighting
schemes, so I'm interested in working on improving the ranking by
implementing more weighting / ranking schemes:

* other statistical schemes: DFR (divergence from randomness) and
tf-idf based term weighting schemes.
* word-distance weighting: documents which contain the query terms
close to each other get higher scores.
* location based weighting: terms that appear near the top of the
document are generally more important.
* size based weighting: longer documents tend to be more important than
shorter ones, as they contain more words.
* neural network (MLP) for learning: if a user chooses (clicks on) a
certain document, the network learns to associate the query string with
that document. This could be interesting for improving ranking quality.
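
As a rough illustration of the tf-idf and word-distance ideas in the list
above, here is a minimal sketch in Python. It is not Xapian code and does
not use the Xapian API; the function names (tf_idf_score, min_span,
proximity_boost) and the particular formulas are assumptions chosen only
to make the ideas concrete.

```python
import math
from collections import Counter


def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Sum tf * log(N / df) over the query terms (one simple tf-idf variant)."""
    counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = counts[term]            # occurrences of the term in this document
        df = doc_freq.get(term, 0)   # number of documents containing the term
        if tf == 0 or df == 0:
            continue
        score += tf * math.log(num_docs / df)
    return score


def min_span(query_terms, doc_terms):
    """Width in tokens of the shortest window holding every query term.

    Returns None if some query term is missing from the document.
    """
    needed = set(query_terms)
    if not needed:
        return None
    counts = Counter()
    covered = 0   # how many distinct query terms are inside the window
    best = None
    left = 0
    for right, term in enumerate(doc_terms):
        if term in needed:
            counts[term] += 1
            if counts[term] == 1:
                covered += 1
        # Shrink the window from the left while it still covers all terms.
        while covered == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_term = doc_terms[left]
            if left_term in needed:
                counts[left_term] -= 1
                if counts[left_term] == 0:
                    covered -= 1
            left += 1
    return best


def proximity_boost(query_terms, doc_terms):
    """Hypothetical boost: 1.0 when the terms are adjacent, lower when spread out."""
    span = min_span(query_terms, doc_terms)
    if span is None:
        return 0.0
    return len(set(query_terms)) / span
```

A final score could, for example, multiply the tf-idf score by
(1 + proximity_boost); how to combine the two is an open design choice
and would need evaluation on real data.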

Another interesting project is the query parser. I could imagine
improving it by adding some natural language processing to avoid the
need for Boolean keywords, as well as using semantic databases like
WordNet to improve matching by normalizing the semantics of a query term.
Of course, this is not an easy task, and multilanguage support is quite
hard, but I think it's worth working on, starting with some special
cases. A natural language date parser, like Chronic for Ruby, could be a
useful feature when searching for date ranges in a chronological
document archive, such as a blog. This is just an example.
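
To give an idea of what such a date parser might do, here is a minimal
sketch in Python that maps a few English phrases to a (start, end) date
range. It is only an illustration: the function name parse_date_range
and the small set of supported phrases are assumptions, it is far
simpler than Chronic, and how the resulting range would be passed on to
the Xapian query parser is left open.

```python
import re
from datetime import date, timedelta


def parse_date_range(phrase, today):
    """Map a simple English phrase to an inclusive (start, end) date range.

    Returns None for phrases this sketch does not understand.
    """
    text = phrase.strip().lower()
    if text == "today":
        return today, today
    if text == "yesterday":
        day = today - timedelta(days=1)
        return day, day
    # "last 7 days" / "past 30 days": from N days ago up to today.
    match = re.fullmatch(r"(?:last|past) (\d+) days", text)
    if match:
        days = int(match.group(1))
        return today - timedelta(days=days), today
    if text == "last week":
        # Previous calendar week, Monday to Sunday.
        this_monday = today - timedelta(days=today.weekday())
        start = this_monday - timedelta(days=7)
        return start, this_monday - timedelta(days=1)
    return None


# Usage: the reference date is passed in explicitly, which keeps the
# function easy to test.
start, end = parse_date_range("last week", date(2011, 3, 29))
```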

A few words about myself:
I'm a computer science student from Berlin. I have worked for four years
at a research center, developing software for 2D/3D image processing,
mainly in C/C++. I participated in two minor projects for the open
source software OCRopus (http://code.google.com/p/ocropus/), implementing
some classification algorithms for character recognition.
I have fairly good knowledge of C/C++, Python, Linux, algorithms, data
structures, mathematics, artificial intelligence concepts ...

I hope to get some feedback from you on my ideas soon. I will then
start writing a full proposal so that I can apply formally by 08.04.11.

Best Regards.

