[Xapian-devel] Participation in GSOC
Dan Colish
dcolish at gmail.com
Wed Mar 30 05:59:49 BST 2011
On Tuesday, March 29, 2011 at 3:07 PM, Michael Thomas wrote:
> Hi,
>
> I'm Michael. I would like to participate in this year's Google Summer of
> Code, and I have picked Xapian as the project to code for.
>
> Before writing a full proposal, I want to get in contact with the
> community, introduce myself, and discuss my ideas for
> contributing to Xapian.
>
Awesome, it's great to hear from you!
> First of all I'd like to talk about my motivation.
> I'm currently working on a webapp for document classification, and I came
> across Xapian during my research into open source search engine alternatives
> to Lucene. It seemed like a fairly good, lightweight search engine
> library to me, so I decided to use it rather than implement one myself,
> and to push its development further by adding features I could use for my
> own project.
>
> I checked the ideas wiki and the source code, and noticed that
> currently only BM25 and a probabilistic approach are implemented as
> weighting schemes, so I'm interested in working on improving the
> ranking by implementing more weighting / ranking schemes:
>
> * implementing other statistical schemes like DfR, and tf-idf based term
> weighting schemes.
> * word-distance weighting: documents which contain the query terms
> close to each other get higher scores
> * location based weighting: terms that appear near the top of the
> document are generally more important
> * size based weighting: longer documents tend to be more important than
> shorter ones, as they contain more words
> * neural network (MLP) for learning: if the user picks (clicks on) a
> certain document, the network learns to connect the query string with
> this doc. This could be interesting for improving ranking quality.
>
There's been a lot of discussion on our mailing list about these topics. I'd really recommend reading through those comments. You can find our searchable archives here: http://dir.gmane.org/gmane.comp.search.xapian.devel
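To make the first two ideas in the list concrete, here is a minimal sketch in plain Python of tf-idf scoring with a word-distance bonus. It does not use Xapian at all: the function names (tf_idf_scores, min_span, proximity_scores) and the 1 + 1/span bonus are assumptions made up for illustration, not anything Xapian implements or the proposal specifies.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each tokenised document against the query with plain tf-idf."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores


def min_span(query_terms, doc):
    """Width in tokens of the smallest window holding every query term."""
    wanted = set(query_terms)
    if not wanted:
        return None
    counts = Counter()
    covered = 0  # number of distinct query terms inside the window
    best = None
    left = 0
    for right, token in enumerate(doc):
        if token in wanted:
            counts[token] += 1
            if counts[token] == 1:
                covered += 1
        # Shrink the window from the left while it still covers the query.
        while covered == len(wanted):
            width = right - left + 1
            if best is None or width < best:
                best = width
            dropped = doc[left]
            if dropped in wanted:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    covered -= 1
            left += 1
    return best


def proximity_scores(query_terms, docs):
    """tf-idf multiplied by a hypothetical bonus of 1 + 1/span."""
    result = []
    for doc, base in zip(docs, tf_idf_scores(query_terms, docs)):
        span = min_span(query_terms, doc)
        result.append(base * (1 + 1 / span) if span else base)
    return result
```

With two documents that contain the same query terms equally often, the one where the terms sit next to each other ends up ranked higher, which is the behaviour the word-distance bullet describes.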
> Another interesting project is the query parser. I could imagine
> improving it by adding some natural language processing to avoid the
> need for binary keywords, as well as using semantic databases like
> WordNet to improve matching by normalizing the semantics of a query term.
> Of course, this is not an easy task, and multi-language support is quite
> hard, but I think it's worth working on, starting with some
> special cases. A natural language date parser, like Chronic for Ruby,
> could be a useful feature when searching for date ranges in a
> chronological document archive, such as a blog. This is just an example.
>
This has also been discussed a bit on that list. Essentially, we still want to have a boolean query language similar to the specification here: http://xapian.org/docs/queryparser.html. WordNet sounds very similar to our synonym support. There is a project to support additional languages, but that is outside the scope of the queryparser. We already provide range query support, so there is a possibility of extending those range classes to support other data range types.
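As a rough illustration of the date parser idea, the following plain Python sketch maps a handful of English phrases to an inclusive date range. The phrase list, the function names (parse_date_range, as_value_range) and the YYYYMMDD string format are all assumptions chosen for the example; this is not Xapian's range API, only the kind of front end that could feed a range query.

```python
from datetime import date, timedelta


def parse_date_range(phrase, today):
    """Map a few English phrases to an inclusive (start, end) pair of dates.

    Returns None for anything it does not recognise.
    """
    phrase = phrase.strip().lower()
    if phrase == "today":
        return today, today
    if phrase == "yesterday":
        day = today - timedelta(days=1)
        return day, day
    if phrase == "last week":
        # Monday to Sunday of the previous calendar week.
        this_monday = today - timedelta(days=today.weekday())
        start = this_monday - timedelta(days=7)
        return start, start + timedelta(days=6)
    if phrase == "last month":
        # Step back from the first of this month to find last month's end.
        end = today.replace(day=1) - timedelta(days=1)
        return end.replace(day=1), end
    return None


def as_value_range(start, end):
    """Format both bounds as sortable YYYYMMDD strings (assumed format)."""
    return start.strftime("%Y%m%d"), end.strftime("%Y%m%d")
```

Passing `today` explicitly, rather than calling `date.today()` inside the parser, keeps the function deterministic and easy to test.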
One other note, we're trying to keep all GSoC discussion on the devel list so I've moved this thread there. Good luck with your project proposal!
--Dan