Text-Extraction Libraries for Omindex

Bruno Baruffaldi barufa1996 at gmail.com
Fri Jun 14 12:52:51 BST 2019

This is a list of some libraries that I have been looking at.

The idea is to discuss the advantages and disadvantages of adding some of
these libraries to Xapian.

If anyone knows another library that could be added to the list it would be

* For Excel (.xls)
* Last release: 2018-02
* Info: gaia-gis.it/fossil/freexl/index
* License: MPL tri-license


* For zip archives (C library)
* Last release: 2018-04
* Info: libzip.org
* License: 3-clause BSD

* For zip archives
* Last release: 2019-04
* Info: zipios.sourceforge.net
* License: GNU Lesser General Public License (LGPL)

I have been thinking about unzip. It is widely used in omindex and it might
be an option to replace unzip with one of these libraries. I know that it is
not the best solution, but it could be something to consider for some
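
To make the unzip idea a bit more concrete, here is a minimal sketch of
reading one archive member in-process instead of spawning the external tool.
It uses Python's standard zipfile module only as a stand-in, because it runs
without extra dependencies; in omindex the same role would fall to libzip or
zipios, whose APIs are not shown here. The helper name read_member and the
member name content.xml are hypothetical and chosen for illustration.

```python
import io
import zipfile


def read_member(archive, member):
    """Return the raw bytes of one archive member, or None if it is missing.

    `archive` may be a path or a seekable binary file object.
    (Hypothetical helper, not part of omindex.)
    """
    with zipfile.ZipFile(archive) as zf:
        try:
            return zf.read(member)
        except KeyError:
            # ZipFile.read raises KeyError when the member does not exist.
            return None


# Build a tiny in-memory archive so the sketch runs without any input file.
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as zf:
    zf.writestr("content.xml", "<text>hello</text>")

print(read_member(sample, "content.xml"))
print(read_member(sample, "styles.xml"))
```

The point of the sketch is only that a missing member becomes an ordinary
return value to check, rather than an exit status and output stream from a
child process that have to be interpreted.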


* For DjVu files
* Last release: 2015-02
* Info: djvu.sourceforge.net
* License: GNU General Public License version 2


* For ebook formats
* Last release: 2018-01
* It shows little activity
* Status: Beta
* Info: sourceforge.net/projects/libebook/
* License: GNU Lesser GPL 2.1+ and MPL 2.0+

I have been reading the code of this library, but it seems a bit complex.
It could be a good option, but it will take a while to figure out how it


* For Apple iWork documents
* Status: Beta
* Info: wiki.documentfoundation.org/DLP/Libraries/libetonyek
* License: MPL 2.0+


* For AbiWord documents
* Last release: 2017-12
* Info: wiki.documentfoundation.org/DLP/Libraries/libabw
* License: MPL 2.0


Other Options:
* libreoffice-dev (SDK)
* libmarkdown2-dev