Text-Extraction Libraries for Omindex

Bruno Baruffaldi baruffaldibruno at gmail.com
Sat Jun 15 18:13:43 BST 2019


Hello,

I have been looking at libarchive and it seems like a great candidate!

I think we can also add libstaroffice
<https://github.com/fosnola/libstaroffice> and libmarkdown2-dev. I wasn't
sure about adding libmarkdown2-dev to the list because I couldn't find much
information about it.
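
To make the unzip replacement idea a bit more concrete, here is a minimal
sketch of reading one member of a zip container in-process instead of running
the unzip command. It is only an illustration under some assumptions: it uses
Python's standard zipfile module as a stand-in for libarchive, and the function
names, the "content.xml" member name and the sample document are made up for
the example; none of this is taken from omindex or from Olly's prototype patch.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_text_from_zip_member(data: bytes, member: str = "content.xml") -> str:
    """Return the text of one XML member of a zip container, read in-process.

    Hypothetical helper for illustration; "content.xml" is an assumed default.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        with zf.open(member) as fh:
            root = ET.parse(fh).getroot()
    # itertext() yields every text node in document order; collapse whitespace.
    return " ".join(" ".join(root.itertext()).split())


def make_sample() -> bytes:
    """Build a tiny in-memory zip so the sketch runs without any real file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", "<doc><p>Hello</p><p>Omega  users</p></doc>")
    return buf.getvalue()


if __name__ == "__main__":
    print(extract_text_from_zip_member(make_sample()))
```

The point of the sketch is just that no external process is started and no
temporary files are written; a real omindex change would be in C++ and would
presumably go through the library's own reading API.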


On Sat, Jun 15, 2019 at 00:49, Olly Betts (olly at survex.com)
wrote:

> On Fri, Jun 14, 2019 at 08:52:51AM -0300, Bruno Baruffaldi wrote:
> > This is a list with some libraries that I have been looking at.
> >
> > The idea is to discuss the advantages and disadvantages of adding some of
> > these libraries to Xapian.
>
> I think we should prioritise formats which are widely used (among
> current and potential users of Omega particularly), and also formats
> which we don't already support (or which we could support better by
> using a library).
>
> >
> > If anyone knows of another library that could be added to the list, that
> > would be great!
> >
> > Libfreexl:
> > * For Excel (.xls)
> > * Last release: 2018-02
> > * Info: gaia-gis.it/fossil/freexl/index
> > * License: MPL tri-license
>
> I've not come across this before.  It looks like it is currently only
> used in GIS software which is probably more interested in numbers than
> text, so before we commit a lot of effort to supporting it I'd suggest
> we try it out and compare how it does with the command line tool we
> currently use (xls2csv).
>
> > Libzip:
> > * For zip archives (C library)
> > * Last release 2018-04
> > * Info: libzip.org
> > * License: 3-clause BSD
> >
> > Libzipios++:
> > * For zip archives
> > * Last release 2019-04
> > * Info: zipios.sourceforge.net
> > * License: GNU Lesser General Public License (LGPL)
> >
> > I have been thinking about unzip. It is widely used in omindex, and it might
> > be an option to replace unzip with one of these libraries. I know that it is
> > not the best solution, but it could be something to consider for some
> > formats.
>
> I'd suggest libarchive for zip files - it's widely used, and supports
> reading other archive formats rather than just zip files (I actually
> wrote a prototype patch for omindex a while back to support indexing
> files in archive files which used libarchive, though it hasn't been
> merged yet).
>
> I think this is probably one to prioritise since we use unzip for a
> number of common formats.
>
> > Djvulibre:
> > * For DjVu files
> > * Last release: 2015-02
> > * Info: djvu.sourceforge.net
> > * License: GNU General Public License version 2
>
> While DjVu is an interesting format, it doesn't seem to be widely used
> and we can already index these files using the command line djvutxt
> tool.
>
> > Libe-book:
> > * For e-book formats
> > * Last release 2018-01
> > * It shows little activity
> > * Status: Beta
> > * Info: sourceforge.net/projects/libebook/
> > * License: GNU Lesser GPL 2.1+ and MPL 2.0+
> >
> > I have been reading the code of this library, but it seems a bit complex.
> > It could be a good option, but it will take a while to figure out how it
> > works.
>
> This is used by libreoffice.
>
> There's a command line tool in the libe-book source to extract text
> (though for some reason this tool isn't packaged for Debian it seems).
> You can see the source here, which shows how to use the API to extract
> text:
>
>
> https://sources.debian.org/src/libe-book/0.1.3-1/src/conv/text/ebook2text.cpp/
>
> This would add support for several popular formats we don't currently
> support at all, so seems another one to prioritise.
>
> > Libetonyek-dev:
> > * For Apple iWork documents
> > * Status: Beta
> > * Info: wiki.documentfoundation.org/DLP/Libraries/libetonyek
> > * License: MPL 2.0+
>
> We use this via a command line tool currently.  I'd guess it's popular
> on Macs so this is probably a good candidate.
>
> > Libabw:
> > * For AbiWord documents
> > * Last release 2017-12
> > * Info: wiki.documentfoundation.org/DLP/Libraries/libabw
> > * License: MPL 2.0
>
> This is an XML-based format which we have a built-in parser for, so
> there's probably not a lot to gain from using an external library.
> It's also not a very widely used format in my experience.
>
> > Other Options:
> > * libreoffice-dev (SDK)
>
> I guess this is "libreofficekit"?
>
> I actually maintain a command line tool which is a thin wrapper
> around that:
>
> https://gitlab.com/ojwb/lloconv
>
> It works pretty well, but it's rather slow even reusing the
> lok::Office() object (lloconv has a feature where it can fork a
> daemon process to allow such reuse).
>
> Much of the import code libreoffice uses has now been split out into
> libraries (like libabw, libe-book and libetonyek from your list) and
> I think we'd do better to use such libraries directly.
>
> You can find a list of these libraries here:
>
> https://www.documentliberation.org/projects/#import-libs
>
> Cheers,
>     Olly
>


-- 
Regards, Bruno Baruffaldi


More information about the Xapian-devel mailing list