<div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:tahoma,sans-serif">I think you are right, and I will try another approach.</div><div class="gmail_default" style="font-family:tahoma,sans-serif"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif">One last question: would it be worth falling back to an external filter (when one is available) if a particular library fails at run time?</div><div class="gmail_default" style="font-family:tahoma,sans-serif"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif">Have you considered it?<br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Mar 26, 2019 at 19:39, Olly Betts (<a href="mailto:olly@survex.com">olly@survex.com</a>) wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Sat, Mar 23, 2019 at 03:42:36PM -0300, Bruno Baruffaldi wrote:<br>

> Thanks!<br>

> That was really useful!<br>

> <br>

> I wanted to share my approach to this project with the hope that you can<br>

> give me some feedback.<br>

> <br>

> I think that applying a design that foresees the incorporation of new<br>

> file formats is the most suitable way to solve the problem.<br>

> <br>

> In the attached sketch we can see:<br>

> * Bug_Box: It is responsible for encapsulating and handling errors.<br>

> * File_extrator: It presents an interface for the different formats.<br>

> * File_X: Encapsulates a particular library for the X file format.<br>

> * File_Hadle: It is responsible for directing the extraction. More<br>

> specifically, it determines the file format and which extractor to use.<br>

> * Ominex: It represents the rest of the project.<br>

<br>

I'm not entirely sure what these boxes are meant to actually be<br>

(classes? programs? something else?), but in general I'd tend to steer<br>

GSoC projects towards an evolutionary approach rather than trying to<br>

rewrite everything in sight, or even refactor everything into some<br>

entirely new structure.<br>

<br>

With an evolutionary approach you can get to something that basically<br>

works much sooner, and then fill in the missing pieces, fix bugs, etc.<br>

It lends itself much better to incremental cycles of implement, test,<br>

document, review, merge, which is easier to work through for both<br>

mentors and students, and if the work doesn't get fully completed, at<br>

least there's something to show for it.<br>

<br>

With a revolutionary approach, there's nothing you can show working<br>

for much longer, and you'll need to do a lot of extra testing for<br>

all the existing functionality to make sure your reimplementation<br>

works (unfortunately there's currently no testsuite for omindex you can<br>

lean on here).<br>

<br>

Review is painful because it involves wading through thousands of<br>

lines of code, so you're likely to need to wait longer for a review<br>

because it's harder for mentors to find enough time in one go for<br>

that.<br>

<br>

And if the work doesn't get fully completed, there's a big pile of<br>

non-functioning code, which it's unlikely anyone is going to have the<br>

time or enthusiasm to do anything further with.<br>

<br>

More specifically to this case, we already have code which encapsulates<br>

extraction in a subprocess for an external filter program, and code<br>

which determines the file format and which extractor to use.  If you<br>

are proposing to replace those, you're going to need to convince us<br>

what you think is deficient about the existing code, how you can<br>

do better, and why that's a good use of the limited GSoC coding time<br>

(if you spend time doing X, then you can't do Y).<br>

<br>

> The idea of organizing the code in this way focuses on two fundamental<br>

> items:<br>

> * The possibility of changing a particular library for another that<br>

> fulfills the same purpose without affecting the project.<br>

<br>

That's achievable without a major restructure (e.g. wrap each library<br>

in a helper program).<br>

<br>

> * The possibility of extending Xapian's support in terms of file format.<br>

<br>

People have been adding new file formats for years within the current<br>

structure.<br>

<br>

> One of the major advantages is that if a particular programmer wishes to<br>

> add support for a new file format or improve an existing one, they should<br>

> only modify the objects that are in red. That way, with proper<br>

> documentation, this kind of task should not be complex.<br>

<br>

We've had prospective GSoC students who were new to the codebase add<br>

support for new formats successfully, which suggests it isn't all that<br>

complex currently.<br>

<br>

To show what's typically involved, here's the patch to add support for<br>

iWork documents (which is the most recent format added):<br>

<br>

<a href="https://git.xapian.org/?p=xapian;a=commitdiff;h=10e2cf5e64c8acd0a135e54007e1ba8eff2c53ee" rel="noreferrer" target="_blank">https://git.xapian.org/?p=xapian;a=commitdiff;h=10e2cf5e64c8acd0a135e54007e1ba8eff2c53ee</a><br>

<br>

(The gen-mimemap tweak is only because this happened to mean the<br>

generated mimetype lookup table now needs 2-byte offsets).<br>

<br>

Cheers,<br>

    Olly<br>

</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">Regards, Bruno Baruffaldi<br></div></div>