[Xapian-devel] [GSOC 2013] Question about indexing INEX dataset

Olly Betts olly at survex.com
Tue Mar 11 23:50:12 GMT 2014


On Tue, Mar 11, 2014 at 11:40:42AM +0100, Parth Gupta wrote:
> Yes, treating them as HTML is fine. We did not face any problems with it.

It's not a bad way to get things working. The XML format uses <title>,
so the HTML parser will pick that up. The document's body text is in a
<bdy> tag, which the parser doesn't recognise, so it just gathers up
all the text inside it.

But before the <bdy> tag there is other metadata inside tags which the
HTML parser doesn't know either, so it treats that as more body text,
when we really want to handle this metadata specially.

It wouldn't be hard to add a dedicated parser for this format which
handles this better. opendocparse.cc and opendocparse.h are probably
good similar examples to look at.
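
As a rough illustration of the idea (not the actual opendocparse code, and
not using any Xapian API), here is a minimal standalone C++ sketch of a
scanner that keeps the title, the pre-<bdy> metadata and the body text
apart. The names ParsedDoc, append_text and parse_inex_like are
hypothetical. Only <title> and <bdy> come from the format as described
above. Any other element is simply treated as "metadata", and there is no
entity decoding, comment or CDATA handling.

```cpp
#include <string>

// Hypothetical result type: the fields a dedicated parser might keep apart.
struct ParsedDoc {
    std::string title;     // text inside <title>
    std::string metadata;  // text outside both <title> and <bdy>
    std::string body;      // text inside <bdy>
};

// Append whitespace-trimmed text to dest, separating chunks with one space.
static void append_text(std::string& dest, const std::string& text) {
    const char* ws = " \t\r\n";
    std::string::size_type b = text.find_first_not_of(ws);
    if (b == std::string::npos) return;  // nothing but whitespace
    std::string::size_type e = text.find_last_not_of(ws);
    if (!dest.empty()) dest += ' ';
    dest.append(text, b, e - b + 1);
}

// Minimal tag scanner (a sketch, not a real XML parser).
ParsedDoc parse_inex_like(const std::string& xml) {
    ParsedDoc doc;
    bool in_title = false, in_body = false;
    std::string::size_type pos = 0;
    while (pos < xml.size()) {
        // Text up to the next tag goes to whichever field is active.
        std::string::size_type lt = xml.find('<', pos);
        std::string text = xml.substr(
            pos, lt == std::string::npos ? std::string::npos : lt - pos);
        if (in_body) append_text(doc.body, text);
        else if (in_title) append_text(doc.title, text);
        else append_text(doc.metadata, text);
        if (lt == std::string::npos) break;

        std::string::size_type gt = xml.find('>', lt);
        if (gt == std::string::npos) break;  // unterminated tag: stop
        std::string tag = xml.substr(lt + 1, gt - lt - 1);
        pos = gt + 1;

        // Self-closing tags such as <title/> contain no text.
        if (!tag.empty() && tag[tag.size() - 1] == '/') continue;

        // Drop attributes, keeping a leading '/' on closing tags.
        std::string name = tag.substr(0, tag.find_first_of(" \t\r\n/", 1));
        if (name == "title") in_title = true;
        else if (name == "/title") in_title = false;
        else if (name == "bdy") in_body = true;
        else if (name == "/bdy") in_body = false;
    }
    return doc;
}
```

Calling parse_inex_like() on a document's XML returns the three fields
separately, so an indexer could then treat the metadata differently from
the body text, rather than indexing it all as body text the way the HTML
parser route does.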

Cheers,
    Olly


