<div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:tahoma,sans-serif">I think you are right, and I will try a different approach.</div><div class="gmail_default" style="font-family:tahoma,sans-serif"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif">One last question: would it be worth falling back to an external filter (when one is available) if a particular library fails at run time?</div><div class="gmail_default" style="font-family:tahoma,sans-serif"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif">Have you considered this?<br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Mar 26, 2019 at 19:39, Olly Betts (<a href="mailto:olly@survex.com">olly@survex.com</a>) wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Sat, Mar 23, 2019 at 03:42:36PM -0300, Bruno Baruffaldi wrote:<br>
> Thanks!<br>
> That was really useful!<br>
> <br>
> I wanted to share my approach to this project with the hope that you can<br>
> give me some feedback.<br>
> <br>
> I think that a design that anticipates the incorporation of new<br>
> file formats is the most suitable way to solve the problem.<br>
> <br>
> In the attached sketch we can see:<br>
> * Bug_Box: It is responsible for encapsulating and handling errors.<br>
> * File_extrator: It presents an interface for the different formats.<br>
> * File_X: Encapsulates a particular library for the X file format.<br>
> * File_Hadle: It is responsible for directing the extraction. More<br>
> specifically, it determines the file format and which extractor to use.<br>
> * Ominex: It represents the rest of the project.<br>
<br>
I'm not entirely sure what these boxes are meant to actually be<br>
(classes? programs? something else?), but in general I'd tend to steer<br>
GSoC projects towards an evolutionary approach rather than trying to<br>
rewrite everything in sight, or even refactor everything into some<br>
entirely new structure.<br>
<br>
With an evolutionary approach you can get to something that basically<br>
works much sooner, and then fill in the missing pieces, fix bugs, etc.<br>
It lends itself much better to incremental cycles of implement, test,<br>
document, review, merge, which is easier to work through for both<br>
mentors and students, and if the work doesn't get fully completed, at<br>
least there's something to show for it.<br>
<br>
With a revolutionary approach, there's nothing you can show working<br>
for much longer, and you'll need to do a lot of extra testing for<br>
all the existing functionality to make sure your reimplementation<br>
works (unfortunately there's currently no testsuite for omindex you can<br>
lean on here).<br>
<br>
Review is painful because it involves wading through thousands of<br>
lines of code, so you're likely to need to wait longer for a review<br>
because it's harder for mentors to find enough time in one go for<br>
that.<br>
<br>
And if the work doesn't get fully completed, there's a big pile of<br>
non-functioning code, which it's unlikely anyone is going to have the<br>
time or enthusiasm to do anything further with.<br>
<br>
More specifically to this case, we already have code which encapsulates<br>
extraction in a subprocess for an external filter program, and code<br>
which determines the file format and which extractor to use. If you<br>
are proposing to replace those, you're going to need to convince us<br>
what you think is deficient about the existing code, how you can<br>
do better, and why that's a good use of the limited GSoC coding time<br>
(if you spend time doing X, then you can't do Y).<br>
<br>
> The idea of organizing the code in this way focuses on two fundamental<br>
> items:<br>
> * The possibility of changing a particular library for another that<br>
> fulfills the same purpose without affecting the project.<br>
<br>
That's achievable without a major restructure (e.g. wrap each library<br>
in a helper program).<br>
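<br>
(A minimal sketch of the helper-program idea in Python, under stated<br>
assumptions: this is not omindex's actual code, and extract_text,<br>
helper_argv and the two stand-in helpers are hypothetical names used<br>
only for illustration. Because the library runs in a separate process,<br>
a crash or hang there is reported to the caller as a failure instead of<br>
taking the indexer down with it.)<br>
<pre>
```python
import subprocess
import sys


def extract_text(helper_argv, path, timeout=30):
    """Run a helper program on `path`; return its stdout, or None on failure.

    `helper_argv` is a hypothetical command line for a helper program that
    wraps one extraction library and prints the extracted text to stdout.
    """
    try:
        result = subprocess.run(
            helper_argv + [path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Helper could not be started, or it hung past the timeout.
        return None
    if result.returncode != 0:
        # Helper crashed or reported an error; the caller keeps running.
        return None
    return result.stdout


# Stand-in helpers for demonstration only: one succeeds, one fails.
ECHO_HELPER = [sys.executable, "-c",
               "import sys; print('text from ' + sys.argv[1])"]
FAILING_HELPER = [sys.executable, "-c", "import sys; sys.exit(3)"]
```
</pre>
With a wrapper like this, swapping one library for another only means<br>
pointing the caller at a different helper command line.<br>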
<br>
> * The possibility of extending Xapian's support in terms of file format.<br>
<br>
People have been adding new file formats for years within the current<br>
structure.<br>
<br>
> One of the major advantages is that if a particular programmer wishes to<br>
> add support for a new file format or improve an existing one, they should<br>
> only modify the objects that are in red. This way, with proper<br>
> documentation, this kind of task should not be complex.<br>
<br>
We've had prospective GSoC students who were new to the codebase add<br>
support for new formats successfully, which suggests it isn't all that<br>
complex currently.<br>
<br>
To show what's typically involved, here's the patch to add support for<br>
iWork documents (which is the most recent format added):<br>
<br>
<a href="https://git.xapian.org/?p=xapian;a=commitdiff;h=10e2cf5e64c8acd0a135e54007e1ba8eff2c53ee" rel="noreferrer" target="_blank">https://git.xapian.org/?p=xapian;a=commitdiff;h=10e2cf5e64c8acd0a135e54007e1ba8eff2c53ee</a><br>
<br>
(The gen-mimemap tweak is only because this happened to mean the<br>
generated mimetype lookup table now needs 2 byte offsets).<br>
<br>
Cheers,<br>
Olly<br>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">Regards, Bruno Baruffaldi<br></div></div>