[Xapian-tickets] [Xapian] #761: Implement symbol layout tree from presentation mathml expression
Xapian
nobody at xapian.org
Thu May 10 19:36:34 BST 2018
#761: Implement symbol layout tree from presentation mathml expression
---------------------------+--------------------
Reporter: gp1308 | Owner: gp1308
Type: task | Status: new
Priority: normal | Milestone:
Component: Other | Version:
Severity: normal | Keywords:
Blocked By: | Blocking:
Operating System: All |
---------------------------+--------------------
This ticket is for discussing the implementation of the symbol layout tree.
Some background information:
** Presentation MathML **
Presentation MathML is one of the formats used to represent math
expressions in documents. Presentation elements are broadly classified
into two types:
* Token elements
- mi, mo, mn: these elements correspond to a visible symbol (such as a
number, identifier text, or an operator like +, /, %).
* Layout schemata
- mrow, mfrac, msqrt, mroot, mfenced: these elements are used to
represent fractions, radicals, or grouped subexpressions.
- msub, msup, msubsup, munder, mover, mmultiscripts: these elements
are used to represent scripts attached to a base.
- mtable, mtr, mtd: these elements correspond to tables, matrices, and
vectors.
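
As a rough illustration of the classification above, the element names
could be mapped to the two categories as follows. This is only a sketch:
`ElementKind` and `classify_element` are hypothetical names, not part of
Xapian, and the element lists are just the ones mentioned in this ticket.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical categories for Presentation MathML elements.
enum class ElementKind { Token, Layout, Unknown };

// Classify an element name using only the elements listed in the ticket.
ElementKind classify_element(const std::string& name) {
    static const std::unordered_set<std::string> tokens = {"mi", "mo", "mn"};
    static const std::unordered_set<std::string> layouts = {
        "mrow", "mfrac", "msqrt", "mroot", "mfenced",
        "msub", "msup", "msubsup", "munder", "mover", "mmultiscripts",
        "mtable", "mtr", "mtd"};
    if (tokens.count(name)) return ElementKind::Token;
    if (layouts.count(name)) return ElementKind::Layout;
    // Anything else is outside the scope of this sketch.
    return ElementKind::Unknown;
}
```

A parser could use a lookup like this to decide whether an element
produces a symbol node directly or describes how its children are laid out.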
** Symbol layout tree **
Generally, a math expression is a group of symbols (integers, operators,
summation signs, integral signs, etc.) written on a horizontal line,
together with special structures such as subscripts, superscripts, and
limits on integrals or summations written above or below that line.
The tree is built by traversing the expression from left to right,
starting with the first symbol. The result is a deep tree whose branches
represent scripts or radical indices.
Each node in the tree represents either a symbol or a grouping construct
such as a table, vector, matrix, or parenthesized expression.
Every node is assigned a label with two parts: node_type and value. The
node type can be integer, operator, variable, matrix, etc., and describes
the kind of value stored in the node. For example, to represent the
integer 2 in the symbol tree, a node is created with the label `N!2`.
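
A minimal sketch of how such labels might be built and taken apart is
given below. It assumes the `type!value` form can be generalised from the
single `N!2` example; `make_label`, `split_label` and any type code other
than `N` are hypothetical.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: join a node type code and a value with '!'.
// Only the "N!2" form appears in the ticket; the rest is assumed.
std::string make_label(const std::string& type_code, const std::string& value) {
    return type_code + "!" + value;
}

// Split a label at the FIRST '!' so that a value may itself contain '!'.
// Returns {type_code, value}; the type code is empty if there is no '!'.
std::pair<std::string, std::string> split_label(const std::string& label) {
    const std::string::size_type sep = label.find('!');
    if (sep == std::string::npos) return {std::string(), label};
    return {label.substr(0, sep), label.substr(sep + 1)};
}
```

Splitting at the first separator keeps a value such as a factorial sign
intact, which matters if operators use the same label scheme.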
Every edge represents a spatial relationship between two adjacent symbols.
For example, the edge type `next` means the two symbols are adjacent on a
horizontal line, and `above` means the parent node is the base and the
child node is its superscript.
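
To make the node and edge description concrete, here is one possible
in-memory shape for the tree, built by hand for the expression x^2 + 1.
This is a sketch under assumptions: `Node`, `EdgeType`, `to_string` and
`build_x_squared_plus_one` are hypothetical names, and the `V!` and `O!`
type codes are assumed by analogy with `N!`.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// The two edge types described in the ticket.
enum class EdgeType { Next, Above };

// Hypothetical node: a label plus typed edges to child nodes.
struct Node {
    std::string label;
    std::vector<std::pair<EdgeType, std::unique_ptr<Node>>> children;

    explicit Node(std::string l) : label(std::move(l)) {}

    // Attach a child with the given edge type and return a pointer to it.
    Node* add_child(EdgeType type, std::string child_label) {
        children.emplace_back(type,
                              std::make_unique<Node>(std::move(child_label)));
        return children.back().second.get();
    }
};

// Debug serialisation: label followed by "(edge child)" for every edge.
std::string to_string(const Node& node) {
    std::string out = node.label;
    for (const auto& edge : node.children) {
        out += edge.first == EdgeType::Next ? "(next " : "(above ";
        out += to_string(*edge.second);
        out += ")";
    }
    return out;
}

// Build the tree for x^2 + 1, reading symbols from left to right.
Node build_x_squared_plus_one() {
    Node root("V!x");                                  // assumed code "V"
    root.add_child(EdgeType::Above, "N!2");            // superscript of x
    Node* plus = root.add_child(EdgeType::Next, "O!+"); // assumed code "O"
    plus->add_child(EdgeType::Next, "N!1");
    return root;
}
```

With this layout, `to_string(build_x_squared_plus_one())` gives
`V!x(above N!2)(next O!+(next N!1))`: the superscript hangs off the base
as a branch, while the `next` edges follow the horizontal line.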
Complete details on the symbol layout tree can be found in the wiki:
https://github.com/guruhegde/xapian-gsoc-diary/blob/master/docs/slt.rst/
(link to be updated at a later point)
Implementation:
After considering various options for parsing MathML, I feel it is
better to implement our own parser rather than use an existing XML parser.
Having studied the //rapidxml// (XML parser) code and //MyHtmlParser//
(from Omega), I think this can be done in the allocated time slot.
Question:
* Interface for indexing math expressions: do we provide a new method in
the TermGenerator class (e.g. index_math) or build a new API class such as
MathTermGenerator? Please suggest any other way to do it.
Another option I have in mind: if the `TermGenerator.index_text` interface
is used for indexing and a `<math>` term is detected, then the text up to
the `</math>` term is treated as a math expression and passed to the math
index module. (I guess we use `Utf8Iterator`, so the iterator would be
passed to the math index module.)
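
The second option could be sketched roughly as below, working on a plain
std::string rather than the iterator for simplicity. `Segment` and
`split_math` are hypothetical names, not an existing Xapian API. The
sketch matches the prefix `<math` so that attributes on the tag are
allowed; a real implementation would also need to check the character
that follows it.

```cpp
#include <string>
#include <vector>

// Hypothetical result type: a run of input that is either math or text.
struct Segment {
    bool is_math;
    std::string text;
};

// Split input into plain-text runs and <math>...</math> runs.
// An unterminated <math is left as part of the plain text.
std::vector<Segment> split_math(const std::string& input) {
    static const std::string open = "<math";
    static const std::string close = "</math>";
    std::vector<Segment> out;
    std::string::size_type pos = 0;
    while (pos < input.size()) {
        const std::string::size_type start = input.find(open, pos);
        std::string::size_type end = std::string::npos;
        if (start != std::string::npos) end = input.find(close, start);
        if (end == std::string::npos) {
            // No complete math element left: the rest is plain text.
            out.push_back({false, input.substr(pos)});
            break;
        }
        if (start > pos) out.push_back({false, input.substr(pos, start - pos)});
        end += close.size();
        out.push_back({true, input.substr(start, end - start)});
        pos = end;
    }
    return out;
}
```

The text segments would then go through the normal text indexing path and
the math segments to the math index module, whichever interface is chosen.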
--
Ticket URL: <https://trac.xapian.org/ticket/761>
Xapian <https://xapian.org/>