<div dir="ltr"><div class="gmail_default" style="font-family:tahoma,sans-serif">Hello,</div><div class="gmail_default" style="font-family:tahoma,sans-serif"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif">I have been looking at libarchive and it seems like a great candidate!</div><div class="gmail_default" style="font-family:tahoma,sans-serif"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif">I think we can also add <a href="https://github.com/fosnola/libstaroffice" target="_blank">libstaroffice</a> and possibly libmarkdown2-dev. I'm not sure about libmarkdown2-dev because I couldn't find much information about it.<br></div><div class="gmail_default" style="font-family:tahoma,sans-serif"><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jun 15, 2019 at 00:49, Olly Betts (<a href="mailto:olly@survex.com">olly@survex.com</a>) wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, Jun 14, 2019 at 08:52:51AM -0300, Bruno Baruffaldi wrote:<br>

> This is a list with some libraries that I have been looking at.<br>

> <br>

> The idea is to discuss the advantages and disadvantages of adding some of<br>

> these libraries to Xapian.<br>

<br>

I think we should prioritise formats which are widely used (among<br>

current and potential users of Omega particularly), and also formats<br>

which we don't already support (or which we could support better by<br>

using a library).<br>

<br>

> <br>

> If anyone knows another library that could be added to the list it would be<br>

> great!<br>

> <br>

> Libfreexl:<br>

> * For Excel (.xls)<br>

> * Last release: 2018-02<br>

> * Info: <a href="http://gaia-gis.it/fossil/freexl/index" rel="noreferrer" target="_blank">gaia-gis.it/fossil/freexl/index</a><br>

> * License: MPL tri-license<br>

<br>

I've not come across this before.  It looks like it is currently only<br>

used in GIS software which is probably more interested in numbers than<br>

text, so before we commit a lot of effort to supporting it I'd suggest<br>

we try it out and compare how it does with the command line tool we<br>

currently use (xls2csv).<br>

<br>

> Libzip:<br>

> * For zip archives (C library)<br>

> * Last release 2018-04<br>

> * Info: <a href="http://libzip.org" rel="noreferrer" target="_blank">libzip.org</a><br>

> * License: 3-clause BSD<br>

> <br>

> Libzipios++:<br>

> * For zip archives<br>

> * Last release 2019-04<br>

> * Info: <a href="http://zipios.sourceforge.net" rel="noreferrer" target="_blank">zipios.sourceforge.net</a><br>

> * License: GNU Lesser General Public License (LGPL)<br>

> <br>

> I have been thinking about unzip. It is widely used in omindex and it might<br>

> be an option to replace unzip with one of these libraries. I know that it is<br>

> not the best solution, but it could be something to consider for some<br>

> formats.<br>

<br>

I'd suggest libarchive for zip files - it's widely used, and supports<br>

reading other archive formats rather than just zip files (I actually<br>

wrote a prototype patch for omindex a while back to support indexing<br>

files in archive files which used libarchive, though it hasn't been<br>

merged yet).<br>

<br>

I think this is probably one to prioritise since we use unzip for a<br>

number of common formats.<br>
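<br>

As a rough sketch of what reading an archive member in-process (rather than<br>
running unzip) looks like: this uses Python's standard zipfile module purely<br>
for illustration, not libarchive, and the helper names read_zip_member and<br>
make_sample_archive are made up for the example.<br>

```python
import io
import zipfile


def read_zip_member(archive, member):
    """Return the bytes of one member of a zip archive, or None if absent.

    `archive` may be a filename or a seekable file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        try:
            return zf.read(member)
        except KeyError:
            # ZipFile.read raises KeyError when the member does not exist.
            return None


def make_sample_archive():
    """Build a tiny in-memory zip with a content.xml member (illustrative only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", "<text>hello</text>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    sample = make_sample_archive()
    print(read_zip_member(sample, "content.xml"))
```

A library-based approach along these lines avoids spawning a separate process<br>
for every file, which is the main attraction of replacing unzip.<br>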

<br>

> Djvulibre:<br>

> * For DjVu files<br>

> * Last release: 2015-02<br>

> * Info: <a href="http://djvu.sourceforge.net" rel="noreferrer" target="_blank">djvu.sourceforge.net</a><br>

> * License: GNU General Public License version 2<br>

<br>

While DjVu is an interesting format, it doesn't seem to be widely used<br>

and we can already index these files using the command line djvutxt<br>

tool.<br>

<br>

> Libe-book:<br>

> * For ebook formats<br>

> * Last release 2018-01<br>

> * It shows little activity<br>

> * Status: Beta<br>

> * Info: <a href="http://sourceforge.net/projects/libebook/" rel="noreferrer" target="_blank">sourceforge.net/projects/libebook/</a><br>

> * License: GNU Lesser GPL 2.1+ and MPL 2.0+<br>

> <br>

> I have been reading the code of this library, but it seems a bit complex.<br>

> It could be a good option, but it will take a while to figure out how it<br>

> works.<br>

<br>

This is used by libreoffice.<br>

<br>

There's a command line tool in the libe-book source to extract text<br>

(though for some reason this tool isn't packaged for Debian it seems).<br>

You can see the source here, which shows how to use the API to extract<br>

text:<br>

<br>

<a href="https://sources.debian.org/src/libe-book/0.1.3-1/src/conv/text/ebook2text.cpp/" rel="noreferrer" target="_blank">https://sources.debian.org/src/libe-book/0.1.3-1/src/conv/text/ebook2text.cpp/</a><br>

<br>

This would add support for several popular formats we don't currently<br>

support at all, so seems another one to prioritise.<br>

<br>

> Libetonyek-dev:<br>

> * For Apple iWork documents<br>

> * Status: Beta<br>

> * Info: <a href="http://wiki.documentfoundation.org/DLP/Libraries/libetonyek" rel="noreferrer" target="_blank">wiki.documentfoundation.org/DLP/Libraries/libetonyek</a><br>

> * License: MPL 2.0+<br>

<br>

We use this via a command line tool currently.  I'd guess it's popular<br>

on Macs so this is probably a good candidate.<br>

<br>

> Libabw:<br>

> * For AbiWord documents<br>

> * Last release 2017-12<br>

> * Info: <a href="http://wiki.documentfoundation.org/DLP/Libraries/libabw" rel="noreferrer" target="_blank">wiki.documentfoundation.org/DLP/Libraries/libabw</a><br>

> * License: MPL 2.0<br>

<br>

This is an XML-based format which we have a built-in parser for, so<br>

there's probably not a lot to gain from using an external library.<br>

It's also not a very widely used format in my experience.<br>

<br>

> Other Options:<br>

> * libreoffice-dev (SDK)<br>

<br>

I guess this is "libreofficekit"?<br>

<br>

I actually maintain a command line tool which is a thin wrapper<br>

around that:<br>

<br>

<a href="https://gitlab.com/ojwb/lloconv" rel="noreferrer" target="_blank">https://gitlab.com/ojwb/lloconv</a><br>

<br>

It works pretty well, but it's rather slow even reusing the<br>

lok::Office() object (lloconv has a feature where it can fork a<br>

daemon process to allow such reuse).<br>

<br>

Much of the import code LibreOffice uses has now been split out into<br>

libraries (like libabw, libe-book and libetonyek from your list) and<br>

I think we'd do better to use such libraries directly.<br>

<br>

You can find a list of these libraries here:<br>

<br>

<a href="https://www.documentliberation.org/projects/#import-libs" rel="noreferrer" target="_blank">https://www.documentliberation.org/projects/#import-libs</a><br>

<br>

Cheers,<br>

    Olly<br>

</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">Regards, Bruno Baruffaldi<br></div></div>