[Xapian-devel] PPT text extracter
jf at dockes.org
Thu Dec 12 20:05:30 GMT 2013
Olly Betts writes:
> On Thu, Dec 12, 2013 at 03:11:29PM +0100, jf at dockes.org wrote:
> > I've had a heads up from a user that catppt did not work at all on
> > semi-recent PowerPoint files (ppt, not pptx). I checked, and indeed it
> > misses most of the content on many files.
> > After looking around, I found Python code from the libreoffice project
> > which makes a nice ppt text extractor after adding a very thin command line
> > wrapper:
> > http://cgit.freedesktop.org/libreoffice/contrib/mso-dumper/
> > It's pure Python, no other dependencies, orders of magnitude faster than
> > unoconv, and unlike catppt, does extract the text...
> > Just in case this can be useful to Omega... I can provide more details of
> > course.
> Thanks, that is interesting.
> Another option coming soon is liblibreoffice, which debuts in LibreOffice
> 4.2 - currently in beta, due for release late January or early February.
> It looks like the current API requires saving to a temporary file.
> I haven't tried this yet, so I'm not sure about speed, but it should
> avoid a lot of the overhead of unoconv.
After doing a number of informal tests with unoconv, I have more or less
come to the conclusion that the abysmal performance when used on ppt files
is due to the time needed to process graphics, not the client-server
overhead (for example performance does not change a lot if the server is
already started). Plus the incessant crashes. Or maybe I just did not find
the right options.
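
An informal comparison like this one can be scripted. The following is only a minimal sketch, not the procedure actually used for the tests above: the helper name time_command is hypothetical, and the command line to measure is left to the caller. It returns one wall-clock duration per run, so a slow first run (server start-up) can be told apart from the later runs.

```python
import subprocess
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times; return the wall-clock duration of each run in seconds.

    cmd is an argument list, for example ["catppt", "slides.ppt"]
    (the file name is a placeholder).
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded: only the elapsed time matters here.
        # check=False so that a crashing converter still gets timed.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.perf_counter() - start)
    return durations
```

Comparing the first entry of the returned list with the rest gives a rough idea of how much of the total is start-up cost rather than per-document processing.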
It will be interesting to see if liblibreoffice does better, but what I
like with the Python code is that I can ship it today (as a zip package +
script), without having to add dependencies and wait for packaging or
For the sake of completeness there is also this:
It's commercial GPL, based on the wvWare libs, and works extremely well on
everything I tried it on. It's another order of magnitude faster than the
Python version (and also a bit better at eliminating spurious text), but
the build system is abysmal and it's not packaged anywhere. So I'm going
with Python for now ...