[GSoC] Questions about project Text-Extraction Libraries

Bruno Baruffaldi baruffaldibruno at gmail.com
Wed Mar 27 15:52:15 GMT 2019


I think you are right, and I will try a different approach.

One last question: I was wondering whether it would be worth falling back
to an external filter (when one is available) if a particular library
fails at run time.

Have you considered it?
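
To make the question concrete, here is a rough sketch of the fallback I have
in mind. It is written in Python only for brevity, and every name in it
(extract_text, library_extractor, filter_command) is hypothetical rather than
anything taken from omindex. It also assumes the library reports failure by
raising an exception; a crash inside the library could not be caught this way
and would need the library to run in a separate process.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence


def extract_text(
    path: str,
    library_extractor: Callable[[str], str],
    filter_command: Optional[Sequence[str]] = None,
    timeout: float = 30.0,
) -> str:
    """Try the library first; fall back to an external filter if it fails.

    All names here are hypothetical and only illustrate the idea.
    """
    try:
        return library_extractor(path)
    except Exception as library_error:
        if filter_command is None:
            # No external filter is available, so report the library failure.
            raise
        # Run the external filter as a subprocess, with the file path
        # appended as its last argument, and read the text from stdout.
        result = subprocess.run(
            [*filter_command, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if result.returncode != 0:
            raise RuntimeError(
                f"library failed ({library_error}); "
                f"filter exited with status {result.returncode}"
            ) from library_error
        return result.stdout


if __name__ == "__main__":
    def broken_library(path: str) -> str:
        # Stand-in for a library that fails at run time.
        raise ValueError("cannot parse " + path)

    # Stand-in for an external filter: a tiny Python one-liner.
    demo_filter = [sys.executable, "-c",
                   "import sys; print('text from ' + sys.argv[1])"]
    print(extract_text("doc.pdf", broken_library, demo_filter), end="")
```

The library stays the preferred path, and the filter is only started when the
library has already failed, so the common case pays no extra process cost.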

On Tue, Mar 26, 2019 at 19:39, Olly Betts (olly at survex.com)
wrote:

> On Sat, Mar 23, 2019 at 03:42:36PM -0300, Bruno Baruffaldi wrote:
> > Thanks!
> > That was really useful!
> >
> > I wanted to share my approach to this project with the hope that you can
> > give me some feedback.
> >
> > I think that a design that anticipates the incorporation of new
> > file formats is the most suitable way to solve the problem.
> >
> > In the attached sketch we can see:
> > * Bug_Box: It is responsible for encapsulating and handling errors.
> > * File_extrator: It presents an interface for the different formats.
> > * File_X: Encapsulates a particular library for the X file format.
> > * File_Hadle: It is responsible for directing the extraction. More
> > specifically, it determines the file format and which extractor to use.
> > * Ominex: It represents the rest of the project.
>
> I'm not entirely sure what these boxes are meant to actually be
> (classes? programs? something else?), but in general I'd tend to steer
> GSoC projects towards an evolutionary approach rather than trying to
> rewrite everything in sight, or even refactor everything into some
> entirely new structure.
>
> With an evolutionary approach you can get to something that basically
> works much sooner, and then fill in the missing pieces, fix bugs, etc.
> It lends itself much better to incremental cycles of implement, test,
> document, review, merge, which is easier to work through for both
> mentors and students, and if the work doesn't get fully completed, at
> least there's something to show for it.
>
> With a revolutionary approach, there's nothing you can show working
> for much longer, and you'll need to do a lot of extra testing for
> all the existing functionality to make sure your reimplementation
> works (unfortunately there's currently no testsuite for omindex you can
> lean on here).
>
> Review is painful because it involves wading through thousands of
> lines of code, so you're likely to need to wait longer for a review
> because it's harder for mentors to find enough time in one go for
> that.
>
> And if the work doesn't get fully completed, there's a big pile of
> non-functioning code, which it's unlikely anyone is going to have the
> time or enthusiasm to do anything further with.
>
> More specifically to this case we already have code which encapsulates
> extraction in a subprocess for an external filter program, and code
> which determines the file format and which extractor to use.  If you
> are proposing to replace those, you're going to need to convince us
> what you think is deficient about the existing code, how you can
> do better, and why that's a good use of the limited GSoC coding time
> (if you spend time doing X, then you can't do Y).
>
> > The idea of organizing the code in this way focuses on two fundamental
> > items:
> > * The possibility of changing a particular library for another that
> > fulfills the same purpose without affecting the project.
>
> That's achievable without a major restructure (e.g. wrap each library
> in a helper program).
>
> > * The possibility of extending the set of file formats Xapian supports.
>
> People have been adding new file formats for years within the current
> structure.
>
> > One of the major advantages is that a programmer who wishes to add
> > support for a new file format, or improve an existing one, only needs
> > to modify the objects shown in red. With proper documentation, this
> > kind of task should not be complex.
>
> We've had prospective GSoC students who were new to the codebase add
> support for new formats successfully, which suggests it isn't all that
> complex currently.
>
> To show what's typically involved, here's the patch to add support for
> iWork documents (which is the most recent format added):
>
>
> https://git.xapian.org/?p=xapian;a=commitdiff;h=10e2cf5e64c8acd0a135e54007e1ba8eff2c53ee
>
> (The gen-mimemap tweak is only there because this change happened to
> mean the generated mimetype lookup table now needs 2-byte offsets).
>
> Cheers,
>     Olly
>


-- 
Regards, Bruno Baruffaldi