Text-Extraction Libraries for Omindex

Sat Jun 15 04:49:31 BST 2019

On Fri, Jun 14, 2019 at 08:52:51AM -0300, Bruno Baruffaldi wrote:
> This is a list with some libraries that I have been looking at.
> 
> The idea is to discuss the advantages and disadvantages of adding some of
> these libraries to Xapian.

I think we should prioritise formats which are widely used (among
current and potential users of Omega particularly), and also formats
which we don't already support (or which we could support better by
using a library).

> 
> If anyone knows another library that could be added to the list it would be
> great!
> 
> Libfreexl:
> * For Excel (.xls)
> * Last release: 2018-02
> * Info: gaia-gis.it/fossil/freexl/index
> * License: MPL tri-license

I've not come across this before.  It looks like it is currently only
used in GIS software which is probably more interested in numbers than
text, so before we commit a lot of effort to supporting it I'd suggest
we try it out and compare how it does with the command line tool we
currently use (xls2csv).
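
For that comparison, a rough sketch of one way to score the two outputs
is below.  It isn't Omega code: terms() and term_coverage() are names
made up for illustration, and it assumes the text from each extractor
has already been captured into a string.  It reports what fraction of
the terms produced by the reference extractor also appear in the
candidate's output.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: split extracted text into a set of lower-cased
// ASCII alphanumeric terms (anything else is treated as a separator).
std::set<std::string> terms(const std::string& text) {
    std::set<std::string> result;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current += static_cast<char>(std::tolower(ch));
        } else if (!current.empty()) {
            result.insert(current);
            current.clear();
        }
    }
    if (!current.empty()) result.insert(current);
    return result;
}

// Hypothetical helper: fraction of the reference extractor's distinct
// terms which the candidate extractor also produced.  Returns 1.0 when
// the reference produced no terms at all.
double term_coverage(const std::string& reference,
                     const std::string& candidate) {
    const std::set<std::string> ref = terms(reference);
    const std::set<std::string> cand = terms(candidate);
    if (ref.empty()) return 1.0;
    std::size_t shared = 0;
    for (const std::string& t : ref) {
        if (cand.count(t)) ++shared;
    }
    return static_cast<double>(shared) / static_cast<double>(ref.size());
}
```

Running this with the xls2csv output as the reference and the library's
output as the candidate over a set of sample spreadsheets would show
whether the library misses text the current tool finds; swapping the
arguments shows the reverse.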

> Libzip:
> * For zip archives (C library)
> * Last release 2018-04
> * Info: libzip.org
> * License: 3-clause BSD
> 
> Libzipios++:
> * For zip archives
> * Last release 2019-04
> * Info: zipios.sourceforge.net
> * License: GNU Lesser General Public License (LGPL)
> 
> I have been thinking about unzip. It is widely used in omindex and it might
> be an option to replace unzip with one of these libraries. I know that it is
> not the best solution, but it could be something to consider for some
> formats.

I'd suggest libarchive for zip files - it's widely used, and supports
reading other archive formats rather than just zip files (I actually
wrote a prototype patch for omindex a while back to support indexing
files in archive files which used libarchive, though it hasn't been
merged yet).

I think this is probably one to prioritise since we use unzip for a
number of common formats.

> Djvulibre:
> * For DjVu files
> * Last release: 2015-02
> * Info: djvu.sourceforge.net
> * License: GNU General Public License version 2

While DjVu is an interesting format, it doesn't seem to be widely used
and we can already index these files using the command line djvutxt
tool.

> Libe-book:
> * For ebooks formats
> * Last release 2018-01
> * It shows little activity
> * Status: Beta
> * Info: sourceforge.net/projects/libebook/
> * License: GNU Lesser GPL 2.1+ and MPL 2.0+
> 
> I have been reading the code of this library, but it seems a bit complex.
> It could be a good option, but it will take a while to figure out how it
> works.

This is used by LibreOffice.

There's a command line tool in the libe-book source to extract text
(though for some reason this tool isn't packaged for Debian it seems).
You can see the source here, which shows how to use the API to extract
text:

https://sources.debian.org/src/libe-book/0.1.3-1/src/conv/text/ebook2text.cpp/

This would add support for several popular formats we don't currently
support at all, so seems another one to prioritise.

> Libetonyek-dev:
> * For Apple iWork documents
> * Status: Beta
> * Info: wiki.documentfoundation.org/DLP/Libraries/libetonyek
> * License: MPL 2.0+

We use this via a command line tool currently.  I'd guess it's popular
on Macs so this is probably a good candidate.

> Libabw:
> * For AbiWord documents
> * Last release 2017-12
> * Info: wiki.documentfoundation.org/DLP/Libraries/libabw
> * License: MPL 2.0

This is an XML-based format which we have a built-in parser for, so
there's probably not a lot to gain from using an external library.
It's also not a very widely used format in my experience.
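
To illustrate why an external library buys little for an XML-based
format, here is a minimal sketch of pulling the text out of XML.  This
is not Omega's actual parser: xml_text() and decode_entity() are
hypothetical names, and it deliberately ignores comments, CDATA
sections, numeric character references, and which element the text
came from.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode the five predefined XML entities.
// Anything else is passed through unchanged.
std::string decode_entity(const std::string& name) {
    if (name == "amp") return "&";
    if (name == "lt") return "<";
    if (name == "gt") return ">";
    if (name == "quot") return "\"";
    if (name == "apos") return "'";
    return "&" + name + ";";
}

// Hypothetical helper: return the character data of an XML document.
// Each tag is dropped, and a single space is emitted in its place when
// text has already been output and doesn't already end with a space.
std::string xml_text(const std::string& xml) {
    std::string out;
    std::size_t i = 0;
    while (i < xml.size()) {
        if (xml[i] == '<') {
            std::size_t close = xml.find('>', i);
            if (close == std::string::npos) break;  // Unterminated tag.
            if (!out.empty() && out.back() != ' ') out += ' ';
            i = close + 1;
        } else if (xml[i] == '&') {
            std::size_t semi = xml.find(';', i);
            if (semi == std::string::npos) {
                out += xml[i++];  // Stray '&' with no terminator.
                continue;
            }
            out += decode_entity(xml.substr(i + 1, semi - i - 1));
            i = semi + 1;
        } else {
            out += xml[i++];
        }
    }
    return out;
}
```

A real parser also needs to pick out metadata such as the title and
skip elements which don't hold document text, but that is still a
modest amount of code compared to adding a library dependency.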

> Other Options:
> * libreoffice-dev(SDK)

I guess this is "libreofficekit"?

I actually maintain a command line tool which is a thin wrapper
around that:

https://gitlab.com/ojwb/lloconv

It works pretty well, but it's rather slow even when reusing the
lok::Office() object (lloconv has a feature where it can fork a
daemon process to allow such reuse).

Much of the import code LibreOffice uses has now been split out into
libraries (like libabw, libe-book and libetonyek from your list) and
I think we'd do better to use such libraries directly.

You can find a list of these libraries here:

https://www.documentliberation.org/projects/#import-libs

Cheers,
    Olly