GSoC aspirant - guruprasad hegde

Mon Mar 26 11:51:01 BST 2018

>
> I plan to pick Tangent because it performs better. Also, it has a good
> literature(thesis report and few papers available) and reference code
> available.
>

    Great! Having reference code would definitely help. Seems a decent
choice.

> Tangent:
>
> Indexing stage:
> Each document contains math formula and text. Text indexing is done in a
> usual way.
>
> ======preprocessing=================       ===indexing====
> Math Formula(PresentationMathML) => Symbol Layout Tree => Generate Symbol
> pair tuples => Store in Inverted Index
> Searching stage:
> Query(PresentationMathML) => symbol layout tree => Generate symbol pair
> tuples => Form a query with logical OR operator=> Candidate documents
> selection using dice coefficient metric => ReRanking the documents using
> MSS metric.
>
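To make the quoted pipeline concrete, here is a minimal pure-Python sketch
of the inverted index and the OR-style candidate selection. It is only an
illustration under my own assumptions: the tuple shape (symbol, symbol,
path), the names build_index and select_candidates, and the toy formulas
are all hypothetical and are not taken from Tangent or Xapian.

```python
from collections import defaultdict


def build_index(formulas):
    """Map each symbol pair tuple to the set of formula ids containing it."""
    index = defaultdict(set)
    for doc_id, pairs in formulas.items():
        for pair in pairs:
            index[pair].add(doc_id)
    return index


def select_candidates(index, query_pairs):
    """OR semantics: a formula is a candidate if it shares at least one tuple.

    Returns {doc_id: number of query tuples matched}.
    """
    matched = {}
    for pair in query_pairs:
        for doc_id in index.get(pair, ()):
            matched[doc_id] = matched.get(doc_id, 0) + 1
    return matched


# Toy data (hypothetical): tuples are (symbol, symbol, path).
formulas = {
    "f1": {("x", "2", "a"), ("x", "+", "n"), ("+", "y", "n")},
    "f2": {("x", "2", "a")},
    "f3": {("a", "b", "n")},
}
index = build_index(formulas)
query_pairs = {("x", "2", "a"), ("x", "+", "n")}
matches = select_candidates(index, query_pairs)
```

The match counts are what a Dice-style score would then be computed from,
so the candidate set stays small before any reranking happens.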
I think entering PresentationMathML as a query would be difficult for
users, so as a first step we should aim to parse LaTeX during query
processing. The input should be LaTeX or another representation that is
easy to type into a search box. For example, I can search Google by typing
X^{2} + Y^{2}: Google Search
<https://www.google.co.in/search?q=x%5E%7B2%7D%2BY%5E%7B2%7D&gws_rd=cr&dcr=0&ei=Zcu4WtW2DMXPvgTOoL2gCA>

Check out this link <http://saskatoon.cs.rit.edu/min/>, where you can draw
an equation and search on Tangent, Google, and other search engines.

I think reranking should be kept as the last module, since it brings some
complexity: we would need to store the symbol layout tree efficiently with
the document to be able to actually calculate the MSS metric.

Have you estimated how much time the following tasks would take to
implement, and how complex each of them would be?
1. Implementation to convert PresentationMathML -> Symbol Layout Tree ->
Generate symbol pair tuples
2. Implementation to rank formulas using the Dice coefficient metric.
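
As a rough way to size these two tasks, here is a minimal sketch in Python.
It is a simplification and not Tangent's actual tuple format: the Node
class, the edge labels ('n' for next, 'a' for above), and the function
names are my own assumptions, and the MathML parsing step is skipped by
building the tree by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str
    # Maps an edge label (assumed: 'n' = next, 'a' = above) to a child node.
    children: dict = field(default_factory=dict)


def symbol_pairs(root):
    """Collect (ancestor, descendant, path) tuples for every node pair."""
    pairs = set()

    def descend(ancestor, node, path):
        for label, child in node.children.items():
            new_path = path + label
            pairs.add((ancestor.symbol, child.symbol, new_path))
            descend(ancestor, child, new_path)

    def visit(node):
        descend(node, node, "")
        for child in node.children.values():
            visit(child)

    visit(root)
    return pairs


def dice(query_pairs, candidate_pairs):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|)."""
    total = len(query_pairs) + len(candidate_pairs)
    if total == 0:
        return 0.0
    return 2.0 * len(query_pairs & candidate_pairs) / total


# Hand-built tree for x^2 + y^2.
indexed = Node("x", {
    "a": Node("2"),
    "n": Node("+", {"n": Node("y", {"a": Node("2")})}),
})
# Hand-built tree for the query x^2.
query = Node("x", {"a": Node("2")})

indexed_pairs = symbol_pairs(indexed)
query_pairs = symbol_pairs(query)
score = dice(query_pairs, indexed_pairs)
```

Here the indexed formula yields 7 tuples and the query yields 1, so the
score is 2 * 1 / (1 + 7) = 0.25. The tuple generation and the Dice score
are both short; most of the real effort would likely go into parsing
PresentationMathML into the tree.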

> I also put some thoughts on implementation here.
> I believe the major work is in preprocessing and searching stage(new
> weight metric implementation). Existing indexing technique can be used for
> math part as well.
> My plan is to implement only formulae retrieval first(document has only
> math) and add keyword support(document = text + math) later.
>
One can use the set of Wikipedia documents
<http://www.cs.rit.edu/~rlaz/Wiki_formulas_v0.1.tar.bz2> provided by MathIR
at NTCIR for initial indexing. They contain only math formulas. In
general, adding keyword support won't be very difficult. It would be OK to
initially focus only on documents that contain only math.

> Later also add support for the query in latex format.
>
I think the query would need to support LaTeX from the start, since we
can't expect users to enter PresentationMathML. It should be something a
user can easily type as a search query; writing a markup language in a
search text box would be difficult. Tangent seems to support both MathML
and TeX. We can support both or start with TeX.

>
> Link for Tangent paper:
> https://www.cs.rit.edu/~rlaz/files/ntcir2016_tangent.pdf
>
You could also refer to their initial paper at
https://www.cs.rit.edu/~rlaz/files/sigir-tangent.pdf
Reference implementation: https://github.com/DPRL/tangent

- Gaurav Arora