[Xapian-devel] PPT text extracter

jf at dockes.org jf at dockes.org
Thu Dec 12 20:05:30 GMT 2013

Olly Betts writes:
 > On Thu, Dec 12, 2013 at 03:11:29PM +0100, jf at dockes.org wrote:
 > > I've had a heads up from a user that catppt did not work at all on
 > > semi-recent PowerPoint files (ppt, not pptx). I checked, and indeed it
 > > misses most of the content on many files.
 > > 
 > > After looking around, I found Python code from the LibreOffice project
 > > which makes a nice ppt text extractor after adding a very thin command line
 > > wrapper:
 > > 
 > >   http://cgit.freedesktop.org/libreoffice/contrib/mso-dumper/
 > > 
 > > It's pure Python, with no other dependencies, orders of magnitude faster than
 > > unoconv, and, unlike catppt, it does extract the text...
 > > 
 > > Just in case this can be useful to Omega... I can provide more details of
 > > course.
 > Thanks, that is interesting.
 > Another option coming soon is liblibreoffice, which debuts in LibreOffice
 > 4.2 - currently in beta, due for release late January or early February
 > 2014:
 > http://cgit.freedesktop.org/libreoffice/core/tree/desktop/inc/
 > It looks like the current API requires saving to a temporary file.
 > I haven't tried this yet, so I'm not sure about speed, but it should
 > avoid a lot of the overhead of unoconv.

After doing a number of informal tests with unoconv, I have more or less
come to the conclusion that the abysmal performance when used on ppt files
is due to the time needed to process graphics, not the client-server
overhead (for example performance does not change a lot if the server is
already started). Plus the incessant crashes. Or maybe I just did not find
the right options.
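
An informal test of this kind can be sketched roughly as below. This is only an illustration, not the exact procedure used for the numbers above: the helper name time_command is made up, and the command to time (the converter and a sample file) has to be supplied by whoever runs it.

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv `runs` times; return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded and failures are ignored: only elapsed time is measured.
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.perf_counter() - start)
    return durations


# Example call with a stand-in command; replace it with the real converter
# invocation and a test file to compare runs with and without a running server.
timings = time_command([sys.executable, "-c", "pass"], runs=2)
print(["%.3f" % t for t in timings])
```

Comparing the first run against the later ones gives a rough idea of how much of the total is startup cost rather than per-file processing.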

It will be interesting to see if liblibreoffice does better, but what I
like about the Python code is that I can ship it today (as a zip package +
script), without having to add dependencies and wait for packaging or

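For illustration, the "zip package + script" arrangement can rely on Python's standard ability to import modules straight from a zip archive placed on sys.path. The sketch below is a minimal, generic version of that idea; the function name import_from_zip is hypothetical, and nothing here reflects the actual layout or module names of the mso-dumper code.

```python
import importlib
import sys


def import_from_zip(zip_path, module_name):
    """Import module_name from the zip archive at zip_path and return the module."""
    # Python's built-in zipimport support handles .zip entries found on sys.path,
    # so no unpacking or installation step is needed.
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    return importlib.import_module(module_name)
```

A small launcher script sitting next to the archive would call this with the archive path and the name of the module to run, which keeps the whole thing to two files.
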
For the sake of completeness there is also this:


It's commercial GPL, based on the wvWare libs, and works extremely well on
everything I tried it on. It is again an order of magnitude faster than the
Python version (and also a bit better at eliminating spurious text), but
the build system is abysmal and it's not packaged anywhere. So I'm going
with Python for now ...


