GSoC 2016: Text-Extraction Libraries in Omega

Thu Mar 10 11:23:24 GMT 2016

On Wed, Mar 09, 2016 at 01:46:06PM -0800, Philip Chung wrote:

> > I'm not sure how you propose generalising use of a library for
> > extraction; how would a user configure omindex to know how to call the
> > relevant library functions?
> 
> Sorry, I think I didn't make myself clear. From what I can gather,
> Olly's patch introduces a new executable "omindex_wv" that is
> responsible for the processing. The justification was that the
> conversion happens in a subprocess to shield Omega from any crashes.

Ah, sorry. I'd forgotten the detail of how that was tackled in Olly's
patch.

> I was thinking of generalizing this addition to other types of
> "worker" processes. The question was: Should we introduce more
> executables like "omindex_wv", like say, "omindex_poppler",
> "omindex_wps", etc., for each type of conversion?
> 
> Now that I think about it, I'm not sure if this has any advantage over
> the current system. Or am I just misunderstanding?

As I understand it, the patch introduces a long-lived 'worker' process
which reads commands over a pipe, runs an extractor and then returns
the data over the pipe. So you save the fork overhead most of the time
(it'll only restart on crashes, although there's a FIXME comment
suggesting it could restart every N files processed), and also save
the VM churn.

One thing the project will have to do is generalise the worker
module (in module.cc) so that it checks the filetype and uses one of
a number of different libraries to extract text from different
filetypes. Other tasks include writing tests and documentation, and
deciding what to do about the FIXME comments :-)
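
One way the filetype check could be structured is a table mapping
MIME types to extractor callbacks, so a single worker binary can
serve several libraries. This is only a sketch under assumptions:
ExtractorRegistry and its methods are hypothetical names, not
anything in module.cc, and real callbacks would wrap the library
calls rather than return strings directly.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical extractor signature: takes a filename, returns the text.
using Extractor = std::function<std::string(const std::string&)>;

// Hypothetical registry mapping a MIME type to the extractor for it.
class ExtractorRegistry {
    std::map<std::string, Extractor> by_type;

  public:
    // Register (or replace) the extractor used for a MIME type.
    void add(const std::string& mimetype, Extractor e) {
        by_type[mimetype] = std::move(e);
    }

    // Run the extractor registered for mimetype on filename.
    // Returns false and leaves text untouched if none is registered.
    bool extract(const std::string& mimetype, const std::string& filename,
                 std::string& text) const {
        auto it = by_type.find(mimetype);
        if (it == by_type.end()) return false;
        text = it->second(filename);
        return true;
    }
};
```

With something like this, supporting another library means one more
add() call rather than another "omindex_*" executable. Unknown types
report failure, so the caller can fall back to whatever it does
today.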

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org