[Xapian-devel] GSOC 2015

Richhiey Thomas richhiey.thomas at gmail.com
Wed Feb 25 18:36:02 GMT 2015


Hello Xapian devs,

For GSoC 2015, I would like to work on Hiemstra's language modelling and
LDA-based relevance language modelling for the project idea 'Weighting
schemes for Xapian'.

Hiemstra's LM:

Hiemstra proposed a parsimonious language model, which models the language
use that distinguishes a relevant document from the other documents. For
example, adding words that are common in English to a document's language
model only makes the model larger and less effective. A parsimonious LM
helps by reducing the number of parameters needed to model the data.
This approach can be used both for indexing and for ranking documents, and
it is implemented with a mixture model. The mixture model can have two or
more language model components. In the paper linked below it combines a
background (collection) language model with a document model, and the
document model is estimated with the expectation-maximization (EM)
algorithm. At retrieval time a relevance (request) model is also built, and
documents are ranked by the Kullback-Leibler divergence between this
relevance model and each document model.

Original paper:
http://research.microsoft.com/pubs/66933/hiemstra_sigir04.pdf
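
To make the estimation and ranking steps concrete, here is a minimal Python
sketch of how I understand them: EM estimation of a parsimonious document
model against a background model, and KL-divergence scoring of a query
(relevance) model against the smoothed document model. The function names,
the parameters lam and mu, and the toy data are my own illustration, not
code from the paper or from Xapian.

import math
from collections import Counter


def parsimonious_model(doc_tf, background, lam=0.1, iters=50):
    """EM estimate of a parsimonious document model.

    doc_tf     : {term: raw term frequency in the document}
    background : {term: P(term | collection)}
    lam        : weight of the document-specific mixture component
    """
    total = sum(doc_tf.values())
    # Start from the maximum-likelihood document model.
    p_doc = {t: tf / total for t, tf in doc_tf.items()}
    for _ in range(iters):
        # E-step: expected count of each term under the document component.
        e = {}
        for t, tf in doc_tf.items():
            num = lam * p_doc.get(t, 0.0)
            denom = num + (1.0 - lam) * background.get(t, 1e-12)
            e[t] = tf * num / denom
        # M-step: renormalise the expected counts.
        norm = sum(e.values()) or 1.0
        p_doc = {t: v / norm for t, v in e.items() if v > 0.0}
    return p_doc


def kl_score(query_model, doc_model, background, mu=0.5):
    """Rank-equivalent KL-divergence score: cross entropy of the query
    (relevance) model against the smoothed document model, dropping the
    query-entropy constant."""
    score = 0.0
    for t, p_q in query_model.items():
        p_d = mu * doc_model.get(t, 0.0) + (1.0 - mu) * background.get(t, 1e-12)
        score += p_q * math.log(p_d)
    return score


if __name__ == "__main__":
    doc = "the quick brown fox jumps over the lazy dog the fox".split()
    coll = "the a of and fox dog cat quick slow over lazy".split()
    doc_tf = Counter(doc)
    bg_counts = Counter(coll)
    bg_total = sum(bg_counts.values())
    background = {t: c / bg_total for t, c in bg_counts.items()}
    p_doc = parsimonious_model(doc_tf, background)
    print(sorted(p_doc.items(), key=lambda kv: -kv[1])[:5])
    print(kl_score({"fox": 0.5, "dog": 0.5}, p_doc, background))

In Xapian this would eventually become a C++ weighting scheme (a
Xapian::Weight subclass); the sketch above is only to check my
understanding of the maths.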

LDA-based relevance language modelling:

This is an approach that integrates the advantages of relevance language
models with Latent Dirichlet Allocation (LDA) topic modelling. It is a
generative model and can retrieve relevant documents for a given query.
Each document is modelled as a mixture of three components: a language
model describing the query term, a language model describing the
background topic, and a language model describing the other,
document-specific content.
The good part about this approach is that, unlike relevance language
models, which assume that all tokens are generated by the model for term
t, it also accounts for the background topic and for words that are
specific to the given document. This lets us pick out document-specific
features and identify the non-relevant parts of a document, which we
wouldn't be able to do otherwise.
The paper uses Gibbs sampling for inference, since we will be dealing with
Dirichlet distributions.

Original paper: http://dollar.biz.uiowa.edu/~street/airs09.pdf
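
Since the inference relies on Gibbs sampling, here is a rough Python sketch
of a collapsed Gibbs sampler for plain LDA, which is the core piece of
machinery this kind of model builds on (the paper's model adds the
background and document-specific components on top of it). The function
name, the hyperparameters alpha and beta, and the toy corpus are my own
choices for illustration, not taken from the paper.

import random


def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for plain LDA.

    docs : list of documents, each a list of integer word ids
    Returns the doc-topic and topic-word count tables, from which the
    theta (doc-topic) and phi (topic-word) distributions can be estimated.
    """
    rng = random.Random(seed)
    vocab = 1 + max(w for d in docs for w in d)

    ndk = [[0] * n_topics for _ in docs]            # doc-topic counts
    nkw = [[0] * vocab for _ in range(n_topics)]    # topic-word counts
    nk = [0] * n_topics                             # tokens per topic
    z = []                                          # topic of each token

    # Random initialisation of topic assignments.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1

                # Full conditional:
                # p(k) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + vocab * beta)
                           for t in range(n_topics)]
                r = rng.random() * sum(weights)
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break

                # Record the new assignment.
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1

    return ndk, nkw


if __name__ == "__main__":
    # Toy corpus: word ids 0-3, roughly split into two "topics".
    docs = [[0, 0, 1, 1, 0], [2, 3, 2, 3, 3], [0, 1, 2, 0], [3, 2, 3, 1]]
    ndk, nkw = lda_gibbs(docs, n_topics=2, iters=100)
    print("doc-topic counts:", ndk)
    print("topic-word counts:", nkw)

The theta and phi estimates then come from the returned count tables plus
the Dirichlet priors alpha and beta.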

This is just a rough idea of what I would like to do. I'd like to discuss
it further, and any constructive advice is welcome. Thanks.