<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>I plan to pick Tangent because it performs better. Also, it has good literature (a thesis report and a few papers) and reference code available.<br></div></div></blockquote><div> </div><div> Great! Having reference code will definitely help. It seems a decent choice.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Tangent:<br></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Indexing stage:</div><div>Each document contains math formulas and text. Text indexing is done in the usual way.</div><div> ======preprocessing===========<wbr>====== ===indexing====</div><div>Math Formula(PresentationMathML) => Symbol Layout Tree => Generate Symbol pair tuples => Store in Inverted Index</div><div>Searching stage:</div><div>Query(PresentationMathML) => symbol layout tree => Generate symbol pair tuples => Form a query with logical OR operator=> Candidate documents selection using dice coefficient metric => ReRanking the documents using MSS metric.</div><div><br></div></div></blockquote><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Since entering PresentationMathML as a query would be difficult for users, I think we should aim, as a first step, to parse LaTeX during query processing. 
The query should be in LaTeX or any other representation that is easy for a user to type into a search box. For example, I can search on Google by typing X^{2} + Y^{2}: <a href="https://www.google.co.in/search?q=x%5E%7B2%7D%2BY%5E%7B2%7D&gws_rd=cr&dcr=0&ei=Zcu4WtW2DMXPvgTOoL2gCA">Google Search</a></span><br></div><div><br></div><div>Check out this <a href="http://saskatoon.cs.rit.edu/min/">link</a>, where you can draw an equation and search on Tangent, Google and other search engines.</div><div><br></div><div>I expect reranking will have to be the last module, as it brings some complexity: for example, we would need to store the Symbol Layout Tree efficiently with the document to be able to calculate the MSS metric.</div><div><br></div><div>Have you estimated how much time the following tasks would take to implement, and how complex they would be:</div><div>1. Implementation to convert PresentationMathML -> Symbol Layout tree -> <span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Generate Symbol pair tuples</span></div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">2. 
Implementation to rank formulas using the Dice coefficient metric.</span></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>I also put some thoughts on implementation here.</div><div>I believe the major work is in the preprocessing and searching stages (new weight metric implementation). The existing indexing technique can be used for the math part as well.</div><div>My plan is to implement only formula retrieval first (document has only math) and add keyword support (document = text + math) later.</div></div></blockquote><div>One can use the set of <a href="http://www.cs.rit.edu/~rlaz/Wiki_formulas_v0.1.tar.bz2">Wikipedia documents </a>provided by MathIR at NTCIR for initial indexing. They contain only math formulas. In general, adding keyword support won't be very difficult. It would be OK to focus initially on documents that contain only math.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Later also add support for the query in latex format. </div><div><br></div></div></blockquote><div>I think the query would need to support LaTeX from the start, since we cannot expect users to enter PresentationMathML. It should be something a user can enter easily as a search query; it would be difficult to write a markup language in a search text box. Tangent seems to support both MathML and TeX. 
We can support both or start with TeX.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><br></div><div>Link for Tangent paper:<a href="https://www.cs.rit.edu/~rlaz/files/ntcir2016_tangent.pdf" target="_blank">https://www.cs.rit.edu/~<wbr>rlaz/files/ntcir2016_tangent.<wbr>pdf</a></div></div></blockquote><div>You could also refer to their initial paper at <a href="https://www.cs.rit.edu/~rlaz/files/sigir-tangent.pdf">https://www.cs.rit.edu/~rlaz/files/sigir-tangent.pdf</a></div><div>Reference implementation: <a href="https://github.com/DPRL/tangent">https://github.com/DPRL/tangent</a> </div></div><div><br></div><div class="gmail_signature"><div dir="ltr">- Gaurav Arora</div></div>
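<div><br></div><div>To make task 2 above (ranking formulas with the Dice coefficient) a bit more concrete, here is a minimal sketch. It assumes each formula has already been reduced to a set of symbol pair tuples. The tuple layout, the function names and the in-memory dictionary standing in for the inverted index are all illustrative assumptions, not the Tangent reference code, and the set-based Dice score is a simplification that ignores repeated pairs.</div>

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical tuple layout (an assumption for illustration only):
# (first symbol, second symbol, horizontal distance, vertical offset)
SymbolPair = Tuple[str, str, int, int]


def dice_coefficient(query: Iterable[SymbolPair],
                     candidate: Iterable[SymbolPair]) -> float:
    """Set-based Dice coefficient: 2 * |Q & C| / (|Q| + |C|)."""
    q, c = set(query), set(candidate)
    if not q or not c:
        # An empty side shares nothing with the other side.
        return 0.0
    return 2.0 * len(q & c) / (len(q) + len(c))


def rank_by_dice(query: Iterable[SymbolPair],
                 index: Dict[str, Set[SymbolPair]]) -> List[Tuple[str, float]]:
    """Score every indexed formula and return matches, best first.

    `index` maps a formula id to its set of symbol pair tuples; it is a
    stand-in for a real inverted index. Formulas that share no tuple
    with the query are dropped, which mirrors an OR query over tuples.
    """
    q = set(query)
    scored = [(fid, dice_coefficient(q, pairs)) for fid, pairs in index.items()]
    matches = [(fid, score) for fid, score in scored if score > 0.0]
    # Sort by descending score, then by id so ties are deterministic.
    return sorted(matches, key=lambda item: (-item[1], item[0]))


# Toy data (hypothetical tuples, roughly "x^2 + y" as the query).
example_query: Set[SymbolPair] = {
    ("x", "2", 1, 1), ("x", "+", 1, 0), ("+", "y", 1, 0),
}
example_index: Dict[str, Set[SymbolPair]] = {
    "f1": example_query | {("y", "2", 1, 1)},
    "f2": {("a", "b", 1, 0)},
    "f3": {("x", "2", 1, 1)},
}
example_ranking = rank_by_dice(example_query, example_index)
```

<div>The resulting candidate list is what a later MSS reranking step would take as input; that step is left out here because it also needs the stored Symbol Layout Tree.</div>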
</div></div>