[GSoC] Questions about project Text-Extraction Libraries

Sat Mar 23 15:09:58 GMT 2019

On Thu, Mar 21, 2019 at 09:31:26AM -0300, Bruno Baruffaldi wrote:
> Firstly, I think that trying to isolate library bugs in subprocesses could
> get to work, but I am not sure about how to handle deadlocks or infinite
> loops. I feel that using a timer is the only way to deal with it but I
> would like to know what you think about it.

There's already code to set a CPU time limit for filter subprocesses
(using setrlimit()) and to implement an inactivity timeout (by using
select() to wait for the connection file descriptor to become readable
or a timeout to be reached) - see runfilter.cc.  I think both mechanisms
should be usable for this project, though the CPU time limit would need
to allow for the CPU time the child process has already used while
processing previous files.
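
To make that concrete, here's a rough sketch of the two mechanisms.  This
is not the code from runfilter.cc - the function names are made up for
illustration, and raising the soft limit relative to getrusage() is just
one assumed way to allow for CPU time already spent on earlier files:

```cpp
#include <cassert>
#include <sys/resource.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

// Wait until fd is readable or timeout_secs pass without activity.
// Returns true if readable, false on timeout or error (including EINTR).
bool wait_readable(int fd, int timeout_secs) {
    assert(fd >= 0 && fd < FD_SETSIZE && timeout_secs >= 0);
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    struct timeval tv;
    tv.tv_sec = timeout_secs;
    tv.tv_usec = 0;
    return select(fd + 1, &readfds, nullptr, nullptr, &tv) > 0;
}

// Read up to len bytes, giving up if fd stays silent for timeout_secs.
// Returns the number of bytes read, 0 on EOF, or -1 on timeout/error.
ssize_t read_with_timeout(int fd, char* buf, size_t len, int timeout_secs) {
    if (!wait_readable(fd, timeout_secs)) return -1;
    return read(fd, buf, len);
}

// Hypothetical helper, called in a long-lived child before each file:
// set the soft CPU limit to "CPU time used so far + per_file_secs".
bool allow_cpu_for_next_file(rlim_t per_file_secs) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return false;
    // "+ 2" rounds up the sub-second parts of user and system time.
    rlim_t used = static_cast<rlim_t>(ru.ru_utime.tv_sec) +
                  static_cast<rlim_t>(ru.ru_stime.tv_sec) + 2;
    struct rlimit lim;
    if (getrlimit(RLIMIT_CPU, &lim) != 0) return false;
    rlim_t wanted = used + per_file_secs;
    // The soft limit can't exceed the hard limit, so clamp to it.
    if (lim.rlim_max != RLIM_INFINITY && wanted > lim.rlim_max)
        wanted = lim.rlim_max;
    lim.rlim_cur = wanted;
    return setrlimit(RLIMIT_CPU, &lim) == 0;
}
```

The parent would use something like read_with_timeout() on the
connection to the child, and kill and restart the child if it returns
-1 because of a timeout.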

> Secondly, I have been reading the source code of omindex, but I cannot
> figure out if it is possible to group all file formats under the same
> interface. When indexing files, are all file formats treated in a similar
> way, or are there special formats that require a different work (beyond the
> use of external filters)?

A few need special handling - e.g. for PDF files we currently need to
run both pdfinfo and pdftotext on the file, PostScript files are first
converted to a temporary PDF (because there doesn't seem to be a
Unicode-aware filter which converts PostScript to text), etc.

It may still be possible to come up with a common interface, though.
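
Purely as an illustration of what such an interface might look like
(every name here is hypothetical and nothing like this exists in omindex
currently), a format needing a conversion step first could wrap another
extractor:

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical result of extracting one file.
struct ExtractedText {
    std::string title;
    std::string author;
    std::string body;
};

// Hypothetical common interface: one call per file, whatever the format.
class Extractor {
  public:
    virtual ~Extractor() = default;
    // Returns false if extraction failed.
    virtual bool extract(const std::string& path, ExtractedText& out) = 0;
};

// Adapts a plain function (e.g. a wrapper around a library) to Extractor.
class LambdaExtractor : public Extractor {
    std::function<bool(const std::string&, ExtractedText&)> fn;
  public:
    explicit LambdaExtractor(
        std::function<bool(const std::string&, ExtractedText&)> f)
        : fn(std::move(f)) { assert(fn); }
    bool extract(const std::string& path, ExtractedText& out) override {
        return fn(path, out);
    }
};

// Handles formats which must be converted first (as PostScript is
// converted to a temporary PDF): convert, then delegate to `inner`.
class ConvertingExtractor : public Extractor {
    // Returns the path of the converted file, or "" on failure.
    std::function<std::string(const std::string&)> convert;
    std::shared_ptr<Extractor> inner;
  public:
    ConvertingExtractor(std::function<std::string(const std::string&)> c,
                        std::shared_ptr<Extractor> i)
        : convert(std::move(c)), inner(std::move(i)) {
        assert(convert && inner);
    }
    bool extract(const std::string& path, ExtractedText& out) override {
        std::string converted = convert(path);
        return !converted.empty() && inner->extract(converted, out);
    }
};

// Map from MIME type to the extractor which handles it.
using Registry = std::map<std::string, std::shared_ptr<Extractor>>;

bool extract_by_type(const Registry& reg, const std::string& mime,
                     const std::string& path, ExtractedText& out) {
    auto it = reg.find(mime);
    return it != reg.end() && it->second->extract(path, out);
}
```

The point is just that the indexer would only see extract(), with
multi-step formats hidden behind a wrapper.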

> To sum up, I want to know if omindex uses multithreading for indexing files
> or if you consider that it could be implemented to speed it up.

Currently there isn't really any parallelism in omindex.  Adding some
would help when indexing formats which are CPU intensive to extract text
from (an extreme case is if you're running OCR to index image files).

When dealing with external filters, the extra isolation that
subprocesses give us makes that a better approach than launching
threads - if a library used by a thread crashes the process then the
indexer dies, while if that happens in a subprocess the parent indexer
process can recover easily.
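
A minimal sketch of that recovery, assuming a plain fork() per piece of
work (run_isolated() and ChildResult are illustrative names, not
existing omindex code):

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <functional>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class ChildResult { ok, failed, crashed };

// Run `work` in a child process so that a crash inside it (e.g. in a
// buggy library) can't take down the caller.
ChildResult run_isolated(const std::function<bool()>& work) {
    assert(work);
    pid_t pid = fork();
    if (pid < 0) return ChildResult::failed;
    if (pid == 0) {
        // Child: report success or failure via the exit status.  _exit()
        // avoids running the parent's atexit handlers a second time.
        _exit(work() ? 0 : 1);
    }
    // Parent: wait for the child, retrying if interrupted by a signal.
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return ChildResult::failed;
    }
    // Killed by a signal (SIGSEGV, SIGABRT, ...) means it crashed.
    if (WIFSIGNALED(status)) return ChildResult::crashed;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        return ChildResult::ok;
    return ChildResult::failed;
}
```

If the work calls std::abort() or segfaults, the parent just sees
ChildResult::crashed and can move on to the next file.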

Potentially we could have concurrent child processes working on
different documents.  I'd suggest that it's better to focus on getting
the subprocesses to work individually first before trying to get them to
run in parallel, but to keep in mind while implementing them that we're
likely to want to run multiple instances concurrently.
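
For what it's worth, running several children at once needn't be
complicated.  Here's a rough sketch which forks one child per document
with a cap on how many run at a time (process_concurrently() is a
made-up name, and a real version would probably keep long-lived children
rather than forking per document):

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Process each document in its own child, with at most max_children
// running at once.  Returns how many children exited with status 0.
// Note: waitpid(-1, ...) reaps any child of this process, so this sketch
// assumes the caller has no other children running.
size_t process_concurrently(
    const std::vector<std::string>& docs, size_t max_children,
    const std::function<bool(const std::string&)>& work) {
    assert(max_children > 0 && work);
    size_t next = 0, running = 0, succeeded = 0;
    while (next < docs.size() || running > 0) {
        // Start children while there's work left and a free slot.
        while (next < docs.size() && running < max_children) {
            pid_t pid = fork();
            if (pid < 0) break;  // Can't fork right now.
            if (pid == 0) _exit(work(docs[next]) ? 0 : 1);
            ++next;
            ++running;
        }
        // fork() failed with nothing running: give up on the rest.
        if (running == 0) break;
        // Wait for any one child to finish, freeing a slot.
        int status;
        pid_t done = waitpid(-1, &status, 0);
        if (done < 0) {
            if (errno == EINTR) continue;
            break;
        }
        --running;
        // A child killed by a signal (i.e. a crash) isn't counted.
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) ++succeeded;
    }
    return succeeded;
}
```

A crash while handling one document only loses that document - the
others are unaffected.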