[Xapian-devel] PPT text extractor

jf at dockes.org jf at dockes.org
Thu Dec 12 20:05:30 GMT 2013


Olly Betts writes:
 > On Thu, Dec 12, 2013 at 03:11:29PM +0100, jf at dockes.org wrote:
 > > I've had a heads up from a user that catppt did not work at all on
 > > semi-recent PowerPoint files (ppt, not pptx). I checked, and indeed it
 > > misses most of the content on many files.
 > > 
 > > After looking around, I found Python code from the libreoffice project
 > > which makes a nice ppt text extractor after adding a very thin command line
 > > wrapper:
 > > 
 > >   http://cgit.freedesktop.org/libreoffice/contrib/mso-dumper/
 > > 
 > > It's pure Python, no other dependencies, orders of magnitude faster than
 > > unoconv, and unlike catppt, does extract the text...
 > > 
 > > Just in case this can be useful to Omega... I can provide more details of
 > > course.
 > 
 > Thanks, that is interesting.
 > 
 > Another option coming soon is liblibreoffice, which debuts in LibreOffice
 > 4.2 - currently in beta, due for release late January or early February
 > 2014:
 > 
 > http://cgit.freedesktop.org/libreoffice/core/tree/desktop/inc/
 > 
 > It looks like the current API requires saving to a temporary file.
 > 
 > I haven't tried this yet, so I'm not sure about speed, but it should
 > avoid a lot of the overhead of unoconv.

After doing a number of informal tests with unoconv, I have more or less
come to the conclusion that the abysmal performance when used on ppt files
is due to the time needed to process graphics, not the client-server
overhead (for example performance does not change a lot if the server is
already started). Plus the incessant crashes. Or maybe I just did not find
the right options.
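
For anyone wanting to repeat this kind of informal comparison, here is a
minimal sketch of a timing helper. It is not what I actually ran: the
function name time_command and the 120 second default timeout are only
illustrative assumptions, and it simply measures the wall-clock time of
one run of whatever command line it is given.

```python
import subprocess
import time


def time_command(argv, timeout=120):
    """Run argv once and return (elapsed_seconds, returncode, stdout_bytes).

    returncode is None and stdout_bytes is 0 if the command timed out.
    The name and the default timeout are illustrative assumptions.
    """
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,      # capture the extracted text
            stderr=subprocess.DEVNULL,   # ignore converter chatter
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return time.perf_counter() - start, None, 0
    return time.perf_counter() - start, proc.returncode, len(proc.stdout)


# Example (slides.ppt is a hypothetical file name):
#   elapsed, rc, nbytes = time_command(["catppt", "slides.ppt"])
```

Comparing the byte counts as well as the times is useful here, since a
fast extractor that misses most of the text is not much of a win.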

It will be interesting to see if liblibreoffice does better, but what I
like about the Python code is that I can ship it today (as a zip package +
script), without having to add dependencies and wait for packaging or
backporting.
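
As a rough illustration of the "zip package + script" idea, the sketch
below shows one way a launcher script could make modules inside a zip
archive importable, relying on Python's standard ability to import from
zip files on sys.path. The helper name add_bundle_to_path and the archive
name extractor-bundle.zip are hypothetical, and this is not necessarily
how the shipped script is organized.

```python
import os
import sys


def add_bundle_to_path(script_path, bundle_name="extractor-bundle.zip"):
    """Put a zip archive sitting next to script_path on sys.path.

    Both the function name and the default archive name are hypothetical.
    Returns the absolute path of the archive.
    """
    script_dir = os.path.dirname(os.path.abspath(script_path))
    bundle = os.path.join(script_dir, bundle_name)
    if not os.path.isfile(bundle):
        raise FileNotFoundError(bundle)
    if bundle not in sys.path:
        # Pure-Python modules inside the zip become importable from here on.
        sys.path.insert(0, bundle)
    return bundle


# Typical use at the top of the launcher script:
#   add_bundle_to_path(__file__)
#   import some_module_from_the_zip   # hypothetical module name
```

This only works for pure-Python code, which is the case here since there
are no other dependencies.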

For the sake of completeness there is also this:

http://silvercoders.com/en/products/doctotext/

It's commercial GPL, based on the wvWare libs, and works extremely well on
everything I tried it on. It is again an order of magnitude faster than the
Python version (and also a bit better at eliminating spurious text), but
the build system is abysmal and it's not packaged anywhere. So I'm going
with Python for now ...

Cheers,

jf


