<div dir="ltr"><div>Hello Xapian devs,</div><div><br></div><div>For GSoC 2015, I would like to work on Hiemstra's language modelling and LDA-based relevance language modelling for the project idea 'Weighting schemes for Xapian'.</div><div><br></div><div>Hiemstra's LM:</div><div><br></div><div>Hiemstra suggested a parsimonious language model, which models the language use that distinguishes a relevant document from other documents. For example, adding words that are common in English to the language model would only make it larger and less effective. A parsimonious LM reduces the number of parameters required to model the data.</div><div>This approach can be used for indexing and ranking documents, and is implemented as a mixture model with two or more language model components. In the paper linked below, the mixture consists of a background language model and a document model, estimated with the expectation-maximization (EM) algorithm. At retrieval time, it also uses a request (relevance) model, and documents are ranked by the Kullback-Leibler divergence between the request model and each document model.</div><div><br></div><div>Original paper : <a href="http://research.microsoft.com/pubs/66933/hiemstra_sigir04.pdf" target="_blank">http://research.microsoft.com/pubs/66933/hiemstra_sigir04.pdf</a></div><div><br></div><div>LDA-based relevance language modelling:</div><div><br></div><div>This approach combines the advantages of relevance language models with those of Latent Dirichlet Allocation (LDA) topic modelling. 
It is a generative model and can retrieve relevant documents for a given query.</div><div>The model combines three language models: one describing a term in the query, one describing the background topic, and one describing the other ideas in the document.</div><div>The advantage of this approach is that, unlike relevance language models, which assume that all tokens are generated by term t, it also accounts for the background topic and for words specific to the given document.</div><div>This lets us extract document-specific features and identify the non-relevant parts of a document, which we wouldn't be able to do otherwise.</div><div><div>The paper uses Gibbs sampling for inference, since the model involves Dirichlet distributions.</div></div><div><br></div><div>Original paper : <a href="http://dollar.biz.uiowa.edu/~street/airs09.pdf" target="_blank">http://dollar.biz.uiowa.edu/~street/airs09.pdf</a></div><div><br></div><div>This is just a rough idea of what I would like to do. I would like to discuss it, and any constructive advice is welcome. Thanks.</div></div>