[Xapian-tickets] [Xapian] #583: Spin off Omega's filetype conversion code as a library

Thu Aug 17 02:39:29 BST 2023

#583: Spin off Omega's filetype conversion code as a library
-------------------------+-------------------------------
 Reporter:  Olly Betts   |             Owner:  Olly Betts
     Type:  enhancement  |            Status:  new
 Priority:  low          |         Milestone:
Component:  Omega        |           Version:
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+-------------------------------
Comment (by Olly Betts):

 Oddly I remember responding to this, but perhaps that was an email thread
 on the same topic.  Anyway, summarising the current situation here:

 > Considering that the ticket is still open, I assume this is a path you
 > are still wanting to follow, isn't it?

 It's something I'm generally supportive of doing if we can do it well.

 On git master, we have made significant steps towards being able to use
 the extraction code outside of omindex.  Extractors for formats for which
 a library API is available are now effectively plugins (separate binaries
 which omindex runs in subprocesses, communicating via pipes), and at least
 conceptually we could make a public API for the small amount of code that
 gets linked into omindex to communicate with the plugins.  In practice
 some things still really need sorting out first though.
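
 As a rough illustration of that arrangement (this is not omindex's actual
 protocol: the line-based request, the length-prefixed reply,
 `WORKER_SOURCE` and `ExtractorWorker` are all made up for this sketch), a
 parent can drive a long-lived worker subprocess over a pair of pipes
 roughly like this:

```python
import subprocess
import sys

# Hypothetical worker: reads one filename per line on stdin and replies with
# "<byte count>\n" followed by that many bytes of "extracted" text.
WORKER_SOURCE = r"""
import sys
while True:
    line = sys.stdin.buffer.readline()
    if not line:
        break                      # parent closed the pipe: exit cleanly
    path = line.rstrip(b"\n").decode("utf-8")
    try:
        with open(path, "rb") as f:
            text = f.read()        # stand-in for a real extraction library
    except OSError:
        text = b""
    sys.stdout.buffer.write(b"%d\n" % len(text))
    sys.stdout.buffer.write(text)
    sys.stdout.buffer.flush()
"""


class ExtractorWorker:
    """Small parent-side wrapper around one worker subprocess."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )

    def extract(self, path):
        # Send the request; filenames containing a newline are not
        # supported by this toy framing.
        self.proc.stdin.write(path.encode("utf-8") + b"\n")
        self.proc.stdin.flush()
        header = self.proc.stdout.readline()
        if not header:
            # The worker died; the parent survives and can start another.
            raise RuntimeError("worker exited unexpectedly")
        size = int(header.decode("ascii").strip())
        return self.proc.stdout.read(size)

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()
```

 The point of the wrapper is that only this small parent-side class would
 need to be linked into a program using the plugins; a crash in the worker
 shows up as an exception in `extract()` rather than taking the caller
 down.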

 Formats which are extracted via an external program (e.g. `catdvi` for DVI
 files) currently get run from the main omindex process.  I think a "run an
 external command" plugin would make sense even within omindex since (at
 least on Linux) `fork()` can get unreasonably slow for a process with a
 large memory footprint due to the cost of copying the page mappings -
 moving the `fork()` to a plugin process should avoid this issue.  It'd
 also be desirable to support this in a public API around these plugins,
 though it could reasonably be added later so long as we're confident it
 can be implemented without incompatible changes to that API.
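
 A minimal sketch of the "run an external command" helper idea follows.
 Everything in it is an assumption for illustration (the JSON-lines
 framing, `HELPER_SOURCE`, `CommandHelper`, the use of status 127 for a
 command that can't be started).  The helper is started while the parent
 is still small, and afterwards it is the helper, not the large parent,
 which creates the child processes:

```python
import json
import subprocess
import sys

# Hypothetical helper: reads a JSON argv list per line, runs it, and replies
# with one JSON object per line holding the exit status and captured stdout.
HELPER_SOURCE = r"""
import json, subprocess, sys
for line in iter(sys.stdin.readline, ""):
    argv = json.loads(line)
    try:
        done = subprocess.run(argv, stdin=subprocess.DEVNULL,
                              stdout=subprocess.PIPE,
                              stderr=subprocess.DEVNULL)
        reply = {"status": done.returncode,
                 "stdout": done.stdout.decode("utf-8", "replace")}
    except OSError:
        reply = {"status": 127, "stdout": ""}   # command could not be run
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


class CommandHelper:
    def __init__(self):
        # Start this early, before the parent's memory footprint grows.
        self.proc = subprocess.Popen(
            [sys.executable, "-c", HELPER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def run(self, argv):
        self.proc.stdin.write(json.dumps(argv) + "\n")
        self.proc.stdin.flush()
        line = self.proc.stdout.readline()
        if not line:
            raise RuntimeError("helper exited unexpectedly")
        return json.loads(line)

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()
```

 With this in place a call such as `helper.run(["catdvi", "paper.dvi"])`
 (the filename is just an example) would return the command's output
 without the main process ever calling `fork()` itself.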

 Formats which are extracted entirely by code in the xapian repo are
 currently handled entirely in-process.  This includes HTML, SVG, CSV and
 Atom feeds.  There doesn't seem to be a compelling reason for moving these to
 plugins as far as omindex is concerned, but they could be provided as a
 plugin or plugins for external use.  Another option would be to provide a
 direct public API for the HTML parser so it could be used in other
 programs much like it is in omindex.

 There's also PostScript for which we have hardcoded handling in-process
 (we convert via PDF by running `ps2pdf` then `pdftotext` as the direct
 convertors don't handle Unicode - I just checked and `man pstotext` still
 says ''"pstotext  always translates to the ISO 8859-1 (Latin-1) character
 code"'').  There's a poppler plugin so probably we should move this
 support to that (it could probably also use libgs instead of running
 `ps2pdf`).
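
 For reference, the two-step conversion described above can be sketched as
 below.  This is not the omindex code, just the same pipeline expressed
 with the two command-line tools; the function names and the temporary
 file name are invented for the example, and it assumes `ps2pdf` and
 `pdftotext` are on `PATH`:

```python
import os
import subprocess
import tempfile


def postscript_commands(ps_path, pdf_path):
    """Return the two argv lists: PostScript -> PDF, then PDF -> UTF-8 text."""
    return [
        ["ps2pdf", ps_path, pdf_path],
        # "-" as the output file makes pdftotext write to stdout.
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
    ]


def postscript_to_text(ps_path):
    """Convert via an intermediate PDF so that Unicode text survives."""
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = os.path.join(tmp, "converted.pdf")
        to_pdf, to_text = postscript_commands(ps_path, pdf_path)
        subprocess.run(to_pdf, check=True, stdin=subprocess.DEVNULL)
        done = subprocess.run(to_text, check=True, stdout=subprocess.PIPE)
        return done.stdout.decode("utf-8")
```

 Moving this into the poppler plugin would replace the second command with
 library calls, leaving only the PostScript to PDF step to decide on.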

 Currently input is provided by passing a filename to the plugin.  That's
 mostly OK for omindex, though the in-process handling supports extracting
 from a file descriptor.  You can pass an fd across a socket on Unix, so
 that could be supported.  For your use it sounds like you'd like to be
 able to pass input in a buffer, which isn't currently supported but we
 could probably support that efficiently via a shared `mmap()` buffer or
 similar.  This would also be useful for being able to chain plugins (e.g.
 extracting text from a file inside a Zip archive).
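
 To illustrate the fd-passing option, here is a small sketch using
 Python's `socket.send_fds()` and `socket.recv_fds()` (Unix only, Python
 3.9 or later).  The helper names and the `b"input"` label are
 assumptions made for the example; nothing here reflects an existing
 plugin interface:

```python
import os
import socket


def send_input_fd(sock, fd, label=b"input"):
    """Send an open file descriptor over a Unix-domain socket.

    At least one byte of ordinary data must accompany the descriptor,
    so a short label is sent with it.
    """
    socket.send_fds(sock, [label], [fd])


def receive_input_fd(sock):
    """Receive one descriptor; returns (label, fd)."""
    label, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    if not fds:
        raise RuntimeError("no file descriptor received")
    return label, fds[0]
```

 The receiving side gets a new descriptor referring to the same open file,
 so it can read the input without ever needing a filename.  In real use
 the two ends of the socket would be in the main process and the plugin
 process respectively.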

 Ideally we'd have a testsuite for the new API, but we do at least have
 testing of omindex on git master which provides indirect testing of the
 plugins, and could probably be morphed into a testsuite for the API.

 The other thing that's missing that may matter for some use cases is
 sandboxing of the plugins.  If you're indexing data that may be actively
 hostile, that brings concerns beyond those of the problem space omindex is
 aimed at.  Having the extraction in a subprocess means it can't
 crash the main omindex process, but that's more aimed at avoiding problems
 from bugs in the extraction libraries being inadvertently triggered.  E.g.
 if you were using this extraction code to handle attachments in a mail
 reader you'd really need robust sandboxing.  Sadly modern sandboxing
 features tend to be platform-specific so probably we'll need to let people
 contribute sandboxing implementations for platforms they care about.

 > If you or any other senior dev of Xapian would have to do it (and
 > assuming you could/would), how many days of work would you estimate to be
 > able to release a first version?

 It would depend a lot on which parts are hard requirements, but it's
 probably somewhere from a few days to a few weeks.

 I did have a client funding work on this but they had a major technical
 restructuring a few months ago and I don't know if they're likely to
 continue.  If you or someone else reading has a budget I'm happy to
 discuss.

 > Assuming this first version would have been released, could I assume
 > this new library will be maintained by the core Xapian team?

 Yeah, I wouldn't want to release something we weren't intending to
 maintain.
-- 
Ticket URL: <https://trac.xapian.org/ticket/583#comment:6>
Xapian <https://xapian.org/>