[Xapian-tickets] [Xapian] #583: Spin off Omega's filetype conversion code as a library
Xapian
nobody at xapian.org
Thu Aug 17 02:39:29 BST 2023
#583: Spin off Omega's filetype conversion code as a library
-------------------------+-------------------------------
Reporter: Olly Betts | Owner: Olly Betts
Type: enhancement | Status: new
Priority: low | Milestone:
Component: Omega | Version:
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
-------------------------+-------------------------------
Comment (by Olly Betts):
Oddly I remember responding to this, but perhaps that was an email thread
on the same topic. Anyway, summarising the current situation here:
> Considering that the ticket is still open, I assume this is a path you
> are still wanting to follow, isn't it?
It's something I'm generally supportive of doing if we can do it well.
On git master, we have made significant steps towards being able to use
the extraction code outside of omindex. Extractors for formats which are
available as a library API are now effectively plugins (separate binaries
which omindex runs in subprocesses, communicating via pipes), and at least
conceptually we could make a public API for the small amount of code that
gets linked into omindex to communicate with the plugins. In practice
some things still really need sorting out first though.
Formats which are extracted via an external program (e.g. `catdvi` for DVI
files) currently get run from the main omindex process. I think a "run an
external command" plugin would make sense even within omindex since (at
least on Linux) `fork()` can get unreasonably slow for a process with a
large memory footprint due to the cost of copying the page mappings -
moving the `fork()` to a plugin process should avoid this issue. It'd
also be desirable to support this in a public API around these plugins,
though it could reasonably be added later so long as we're confident it
can be implemented without incompatible changes to that API.
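A minimal sketch of the "run an external command" helper idea, not code from the xapian repo: a helper process is forked once, early, while the parent is still small, and afterwards it performs every `popen()` itself, so the large parent never has to `fork()` again. The names `CommandHelper`, `start_helper`, `run_via_helper` and `stop_helper` are hypothetical, as is the one-command-per-line request format and the length-prefixed reply; commands must not contain newlines, and error handling is kept deliberately thin.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical handle for the helper process (illustrative names only).
struct CommandHelper {
    pid_t pid = -1;
    FILE* requests = nullptr;  // parent writes one shell command per line
    FILE* replies = nullptr;   // helper answers "<byte count>\n<bytes>"
};

// Body of the helper: serve requests until the parent closes the pipe.
static void helper_loop(FILE* in, FILE* out) {
    if (!in || !out) return;
    char* line = nullptr;
    size_t cap = 0;
    ssize_t len;
    while ((len = getline(&line, &cap, in)) > 0) {
        if (line[len - 1] == '\n') line[len - 1] = '\0';
        std::string output;
        // popen() forks here, in the small helper, not in the large parent.
        if (FILE* cmd = popen(line, "r")) {
            char buf[4096];
            size_t n;
            while ((n = fread(buf, 1, sizeof buf, cmd)) > 0) output.append(buf, n);
            pclose(cmd);
        }
        fprintf(out, "%zu\n", output.size());
        fwrite(output.data(), 1, output.size(), out);
        fflush(out);
    }
    free(line);
}

// Fork the helper; call this before the parent's memory footprint grows.
CommandHelper start_helper() {
    CommandHelper h;
    int req[2], rep[2];
    if (pipe(req) != 0) return h;
    if (pipe(rep) != 0) { close(req[0]); close(req[1]); return h; }
    h.pid = fork();
    if (h.pid == 0) {
        close(req[1]);
        close(rep[0]);
        helper_loop(fdopen(req[0], "r"), fdopen(rep[1], "w"));
        _exit(0);
    }
    close(req[0]);
    close(rep[1]);
    if (h.pid < 0) { close(req[1]); close(rep[0]); return h; }
    h.requests = fdopen(req[1], "w");
    h.replies = fdopen(rep[0], "r");
    return h;
}

// Ask the helper to run `command` and collect its standard output.
bool run_via_helper(CommandHelper& h, const std::string& command, std::string& output) {
    if (!h.requests || !h.replies) return false;
    fprintf(h.requests, "%s\n", command.c_str());
    fflush(h.requests);
    size_t size = 0;
    if (fscanf(h.replies, "%zu", &size) != 1) return false;
    if (fgetc(h.replies) != '\n') return false;
    output.resize(size);
    return size == 0 || fread(&output[0], 1, size, h.replies) == size;
}

// Closing the request pipe gives the helper EOF, so it exits and can be reaped.
void stop_helper(CommandHelper& h) {
    if (h.requests) fclose(h.requests);
    if (h.replies) fclose(h.replies);
    if (h.pid > 0) waitpid(h.pid, nullptr, 0);
    h = CommandHelper();
}
```

Typical use would be to call `start_helper()` at startup, call `run_via_helper(h, "catdvi file.dvi", text)` for each file, and call `stop_helper(h)` at exit. Buffering the whole output in the helper keeps the protocol simple at the cost of memory for very large outputs.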
Formats which are extracted entirely by code in the xapian repo are
currently handled in process. This includes HTML, SVG, CSV and Atom
feeds. There doesn't seem to be a compelling reason for moving these to
plugins as far as omindex is concerned, but they could be provided as a
plugin or plugins for external use. Another option would be to provide a
direct public API for the HTML parser so it could be used in other
programs much like it is in omindex.
There's also PostScript, for which we have hardcoded handling in-process
(we convert via PDF by running `ps2pdf` then `pdftotext` as the direct
convertors don't handle Unicode - I just checked and `man pstotext` still
says "pstotext always translates to the ISO 8859-1 (Latin-1) character
code"). There's a poppler plugin, so we should probably move this
support to that (it could probably also use libgs instead of running
`ps2pdf`).
Currently input is provided by passing a filename to the plugin. That's
mostly OK for omindex, though the in-process handling supports extracting
from a file descriptor. You can pass an fd across a socket on Unix, so
that could be supported. For your use it sounds like you'd like to be
able to pass input in a buffer, which isn't currently supported but we
could probably support that efficiently via a shared `mmap()` buffer or
similar. This would also be useful for being able to chain plugins (e.g.
extracting text from a file inside a Zip archive).
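For reference, passing an fd across a Unix-domain socket uses `sendmsg()` and `recvmsg()` with an `SCM_RIGHTS` control message. The sketch below shows only that mechanism; the function names `send_fd` and `recv_fd` are hypothetical and this is not how the omindex plugin protocol is actually framed.

```cpp
#include <cstring>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Send the open descriptor `fd` over the Unix-domain socket `sock`.
bool send_fd(int sock, int fd) {
    char byte = 'F';  // one byte of ordinary data accompanies the descriptor
    iovec iov = { &byte, 1 };
    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg = {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof control;
    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1;
}

// Receive a descriptor sent by send_fd(); returns -1 on failure.
// The returned fd is a new descriptor referring to the same open file.
int recv_fd(int sock) {
    char byte;
    iovec iov = { &byte, 1 };
    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg = {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof control;
    if (recvmsg(sock, &msg, 0) != 1) return -1;
    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd = -1;
    std::memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The sender keeps its own copy of the descriptor and can close it once the message has been sent. Because the receiver shares the same open file description, the file offset is shared too.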
Ideally we'd have a testsuite for the new API, but we do at least have
testing of omindex on git master which provides indirect testing of the
plugins, and could probably be morphed into a testsuite for the API.
The other thing that's missing that may matter for some use cases is
sandboxing of the plugins. If you're indexing data that may be actively
hostile, that brings concerns beyond those of the problem space
omindex is aimed at. Having the extraction in a subprocess means it can't
crash the main omindex process, but that's more aimed at avoiding problems
from bugs in the extraction libraries being inadvertently triggered. E.g.
if you were using this extraction code to handle attachments in a mail
reader you'd really need robust sandboxing. Sadly modern sandboxing
features tend to be platform-specific so probably we'll need to let people
contribute sandboxing implementations for platforms they care about.
> If you or any other senior dev of Xapian would have to do it (and
> assuming you could/would), how many days of work would you estimate to be
> able to release a first version?
It would depend a lot on which parts are hard requirements, but it's
probably somewhere from a few days to a few weeks.
I did have a client funding work on this but they had a major technical
restructuring a few months ago and I don't know if they're likely to
continue. If you or someone else reading has a budget I'm happy to
discuss.
> Assuming this first version would have been released, could I assume
> this new library will be maintained by the core Xapian team?
Yeah, I wouldn't want to release something we weren't intending to
maintain.
--
Ticket URL: <https://trac.xapian.org/ticket/583#comment:6>
Xapian <https://xapian.org/>