[GSoC] Questions about project Text-Extraction Libraries

Bruno Baruffaldi baruffaldibruno at gmail.com
Sat Mar 23 18:42:36 GMT 2019


Thanks!
That was really useful!

I wanted to share my approach to this project with the hope that you can
give me some feedback.

I think that a design which anticipates the incorporation of new
file formats is the most suitable way to approach the problem.

In the attached sketch we can see:
* Bug_Box: It is responsible for encapsulating and handling errors.
* File_extractor: It presents a common interface for the different formats.
* File_X: It encapsulates a particular library for the X file format.
* File_Handle: It is responsible for directing the extraction. More
specifically, it determines the file format and which extractor to use.
* Omindex: It represents the rest of the project.

Organizing the code in this way focuses on two fundamental goals:
* The ability to swap a particular library for another that fulfills the
same purpose without affecting the rest of the project.
* The ability to extend Xapian's support for new file formats.

One of the major advantages is that a programmer who wishes to add
support for a new file format, or improve an existing one, only needs to
modify the objects shown in red. With proper documentation, this kind of
task should not be complex.

I know it is an ambitious approach, but I think that with good
documentation it would give the project great flexibility, and
programmers would have the option of adapting Xapian to their needs.

** The image only presents a simple scheme to explain the idea; I do not
consider it a design for the project. I believe we should discuss
different design patterns to choose the most suitable one.

Cheers,
   Bruno Baruffaldi


On Sat, Mar 23, 2019 at 12:10, Olly Betts (olly at survex.com)
wrote:

> On Thu, Mar 21, 2019 at 09:31:26AM -0300, Bruno Baruffaldi wrote:
> > Firstly, I think that trying to isolate library bugs in subprocesses
> > could work, but I am not sure about how to handle deadlocks or
> > infinite loops. I feel that using a timer is the only way to deal
> > with it, but I would like to know what you think about it.
>
> There's already code to set a CPU time limit for filter subprocesses
> (using setrlimit()) and to implement an inactivity timeout (by using
> select() to wait for the connection file descriptor to become readable
> or a timeout to be reached) - see runfilter.cc.  I think both mechanisms
> should be usable for this project (the CPU time limit would need to
> allow for CPU time used by the child process processing previous
> files).
>
> > Secondly, I have been reading the source code of omindex, but I
> > cannot figure out if it is possible to group all file formats under
> > the same interface. When indexing files, are all file formats treated
> > in a similar way, or are there special formats that require different
> > handling (beyond the use of external filters)?
>
> A few do - e.g. for PDF files we currently need to run pdfinfo and
> pdftotext on the file, PostScript files are first converted to a
> temporary PDF (because there doesn't seem to be a Unicode-aware
> filter which converts PostScript to text), etc.
>
> It may be possible to come up with a common interface still though.
>
> > To sum up, I want to know if omindex uses multithreading for indexing
> > files, or if you consider that it could be implemented to speed it up.
>
> Currently there isn't really any parallelism in omindex.  It would help
> when indexing formats which are CPU intensive to extract text from
> (an extreme case is if you're running OCR to index image files).
>
> When dealing with external filters, the extra isolation that
> subprocesses gives us makes that a better approach than launching
> threads - if a library used by a thread crashes the process then the
> indexer dies, while if that happens in a subprocess the parent indexer
> process can recover easily.
>
> Potentially we could have concurrent child processes working on
> different documents.  I'd suggest that it's better to focus on getting
> the subprocesses to work individually first before trying to get them to
> run in parallel, but to keep in mind that we're likely to want to
> instantiate multiple concurrent instances while implementing them.
>
>

-- 
Regards, Bruno Baruffaldi
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sketch.png
Type: image/png
Size: 11183 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20190323/1e5ed24a/attachment-0001.png>
