[GSoC] Questions about project Text-Extraction Libraries

Olly Betts olly at survex.com
Tue Mar 26 22:38:57 GMT 2019


On Sat, Mar 23, 2019 at 03:42:36PM -0300, Bruno Baruffaldi wrote:
> Thanks!
> That was really useful!
> 
> I wanted to share my approach to this project with the hope that you can
> give me some feedback.
> 
> I think that applying a design that foresees the incorporation of new
> file formats is the most suitable way to solve the problem.
> 
> In the attached sketch we can see:
> * Bug_Box: It is responsible for encapsulating and handling errors.
> * File_extrator: It presents an interface for the different formats.
> * File_X: Encapsulates a particular library for the X file format.
> * File_Hadle: It is responsible for directing the extraction. More
> specifically, it determines the file format and which extractor to use.
> * Omindex: It represents the rest of the project.

I'm not entirely sure what these boxes are meant to actually be
(classes? programs? something else?), but in general I'd tend to steer
GSoC projects towards an evolutionary approach rather than trying to
rewrite everything in sight, or even refactor everything into some
entirely new structure.

With an evolutionary approach you can get to something that basically
works much sooner, and then fill in the missing pieces, fix bugs, etc.
It lends itself much better to incremental cycles of implement, test,
document, review, merge, which is easier to work through for both
mentors and students, and if the work doesn't get fully completed, at
least there's something to show for it.

With a revolutionary approach, there's nothing you can show working
for much longer, and you'll need to do a lot of extra testing for
all the existing functionality to make sure your reimplementation
works (unfortunately there's currently no testsuite for omindex you can
lean on here).

Review is painful because it involves wading through thousands of
lines of code, so you're likely to need to wait longer for a review
because it's harder for mentors to find enough time in one go for
that.

And if the work doesn't get fully completed, there's a big pile of
non-functioning code, which it's unlikely anyone is going to have the
time or enthusiasm to do anything further with.

More specifically to this case, we already have code which encapsulates
extraction in a subprocess for an external filter program, and code
which determines the file format and which extractor to use.  If you
are proposing to replace those, you're going to need to convince us
of what is deficient about the existing code, how you can do better,
and why that's a good use of the limited GSoC coding time (if you spend
time doing X, then you can't do Y).
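
To make the shape of that concrete, here is a rough sketch in Python
(not the actual omindex code, which is C++; the FILTERS table, the
"text/x-demo" type and extract_text() are all made up for illustration).
It maps a MIME type to an external filter command and runs that command
in a subprocess:

```python
import subprocess
import sys

# Hypothetical table: MIME type -> external filter command.
# The single entry is a stand-in filter (it upper-cases the file) so the
# sketch runs anywhere; it is not a real omindex filter.
FILTERS = {
    "text/x-demo": [
        sys.executable, "-c",
        "import sys; print(open(sys.argv[1]).read().upper(), end='')",
    ],
}


def extract_text(path, mimetype, timeout=30):
    """Run the filter registered for mimetype on path and return its stdout."""
    command = FILTERS.get(mimetype)
    if command is None:
        raise LookupError(f"no filter registered for {mimetype}")
    # The filter runs in its own process, so a crash or hang there
    # surfaces as an exception here rather than killing the indexer.
    result = subprocess.run(
        command + [path],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

In a structure like this, supporting another format is one more table
entry, which is why a wholesale redesign needs a stronger justification
than extensibility alone.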

> The idea of organizing the code in this way focuses on two fundamental
> items:
> * The possibility of changing a particular library for another that
> fulfills the same purpose without affecting the project.

That's achievable without a major restructure (e.g. wrap each library
in a helper program).
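
As a minimal sketch of what such a helper could look like (in Python
for brevity; extract_with_library() is a hypothetical placeholder for
the call into whichever library is being wrapped):

```python
import sys


def extract_with_library(path):
    # Placeholder for the real library call. Here it just reads the file
    # as UTF-8 so the sketch runs as-is; swapping libraries means
    # changing only this function.
    with open(path, encoding="utf-8") as f:
        return f.read()


def main(argv, out=sys.stdout, err=sys.stderr):
    """Write the text extracted from argv[1] to out; return an exit status."""
    if len(argv) != 2:
        print(f"usage: {argv[0]} FILE", file=err)
        return 2
    try:
        text = extract_with_library(argv[1])
    except Exception as e:
        # Report the failure on stderr and signal it via the exit status.
        print(f"{argv[0]}: {e}", file=err)
        return 1
    out.write(text)
    return 0

# A real helper script would finish with: sys.exit(main(sys.argv))
```

The caller only sees a command line, stdout and an exit status, so
replacing the library behind the helper doesn't affect anything else.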

> * The possibility of extending Xapian's support in terms of file format.

People have been adding new file formats for years within the current
structure.

> One of the major advantages is that if a particular programmer wishes to
> add support for a new file format or improve an existing one, they should
> only modify the objects that are in red. In this way, with proper
> documentation this kind of task should not be complex.

We've had prospective GSoC students who were new to the codebase add
support for new formats successfully, which suggests it isn't all that
complex currently.

To show what's typically involved, here's the patch to add support for
iWork documents (which is the most recent format added):

https://git.xapian.org/?p=xapian;a=commitdiff;h=10e2cf5e64c8acd0a135e54007e1ba8eff2c53ee

(The gen-mimemap tweak is only there because adding this format happened
to mean the generated mimetype lookup table now needs 2-byte offsets.)

Cheers,
    Olly


