[Xapian-devel] Interested in IR, Getting started with Xapian

Akshay M S akshayms91 at gmail.com
Mon Mar 5 18:09:31 GMT 2012


Hi everyone,
I'm Akshay, an Information Science undergrad from Bangalore. I'm interested
in Information Retrieval and I'd like to contribute to Xapian as part of
GSoC, and to keep contributing afterwards to pursue that interest.

I liked the idea of adding more weighting schemes (Project #2). Last
semester I did a project on Document Retrieval on Hadoop using TF-IDF and
Cosine Similarity (the query had to be a document).
I've read about BM25 in the listed resources, but I don't yet have a good
grasp of DFR (Divergence from Randomness). I'm reading [1] and [2] for more
information on DFR, in addition to the resources mentioned on the page.

There is one part of the project description I couldn't understand: *"Additionally,
for faster searching, an upper bound on each component is needed (each
database stores a number of summary statistics to help with this - if
additional statistics would be useful, you could add them as part of the
project)."*
My guess is that "components" refers to the per-document term weights and
the DF/IDF weights. Is that right? Any elaboration would be really helpful.
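
To make the question concrete, here is a minimal Python sketch of one possible
reading of the quoted sentence. It is not Xapian code. It assumes a simplified
BM25-style term weight, and it assumes the "summary statistics" include
something like a maximum within-document term frequency and a minimum document
length. The function names and the sample numbers are hypothetical. The idea is
that the term weight grows with term frequency and shrinks with document
length, so evaluating it at (max tf, min length) gives a value that no real
document can exceed, which a matcher could presumably use to skip documents
that cannot reach the current top results.

```python
import math


def bm25_term_weight(tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75):
    """Simplified BM25 weight of one term in one document (illustrative only)."""
    # Smoothed IDF; the +1.0 inside the log keeps it strictly positive.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Length normalisation: longer documents get a larger denominator.
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


def bm25_term_upper_bound(max_tf, min_doc_len, avg_len, df, n_docs,
                          k1=1.2, b=0.75):
    """Upper bound on the term weight over all documents.

    Assumes the database can report the largest term frequency (max_tf)
    and the smallest document length (min_doc_len). The weight increases
    with tf and decreases with doc_len, so this corner is the maximum.
    """
    return bm25_term_weight(max_tf, min_doc_len, avg_len, df, n_docs, k1, b)


# Hypothetical postings for one term: (term frequency, document length).
docs = [(3, 120), (7, 300), (1, 80)]
n_docs = 1000                      # assumed collection size
df = len(docs)                     # documents containing the term
avg_len = sum(dl for _, dl in docs) / len(docs)

weights = [bm25_term_weight(tf, dl, avg_len, df, n_docs) for tf, dl in docs]
bound = bm25_term_upper_bound(
    max(tf for tf, _ in docs),     # max within-document frequency
    min(dl for _, dl in docs),     # min document length
    avg_len, df, n_docs,
)

for (tf, dl), w in zip(docs, weights):
    print(f"tf={tf} len={dl} weight={w:.4f} (bound {bound:.4f})")
```

The bound is usually loose, because the document with the highest term
frequency is rarely also the shortest one. That may be why the description
suggests storing additional statistics.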

Can someone please point me to patches/bugs that are related to this
project so I can understand the existing code better, especially anything
related to the Xapian::Weight class, or anything else that can get me
started with the Xapian codebase?

References:
[1] http://dl.acm.org/citation.cfm?id=582416&dl=ACM&coll=DL&CFID=88428217&CFTOKEN=33171494
[2] http://terrier.org/docs/v3.5/dfr_description.html

-- 
Regards,
Akshay