[Xapian-devel] PPT text extracter

Olly Betts olly at survex.com
Thu Dec 12 19:05:04 GMT 2013


On Thu, Dec 12, 2013 at 03:11:29PM +0100, jf at dockes.org wrote:
> I've had a heads up from a user that catppt did not work at all on
> semi-recent PowerPoint files (ppt, not pptx). I checked, and indeed it
> misses most of the content on many files.
> 
> After looking around, I found Python code from the LibreOffice project
> which makes a nice ppt text extractor after adding a very thin command line
> wrapper:
> 
>   http://cgit.freedesktop.org/libreoffice/contrib/mso-dumper/
> 
> It's pure Python, no other dependencies, orders of magnitude faster than
> unoconv, and unlike catppt, does extract the text...
> 
> Just in case this can be useful to Omega... I can provide more details of
> course.

Thanks, that is interesting.

Another option coming soon is liblibreoffice, which debuts in LibreOffice
4.2 (currently in beta, due for release in late January or early February
2014):

http://cgit.freedesktop.org/libreoffice/core/tree/desktop/inc/

It looks like the current API requires saving to a temporary file.
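
A minimal sketch of that temporary-file pattern from a caller's point of
view, in Python. The convert callable is hypothetical: it stands in for
whatever conversion step ends up being used, and is not the liblibreoffice
API, which I haven't tried. The helper names are made up for illustration.

```python
import os
import tempfile


def extract_text_via_tempfile(convert, src_path):
    """Run convert(src_path, out_path), then return the text written to out_path.

    `convert` is a hypothetical callable standing in for a document-to-text
    conversion step; it is not a real liblibreoffice function.
    """
    # Create the temporary file up front and close our handle so the
    # converter can open and write it by name.
    fd, out_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        convert(src_path, out_path)
        with open(out_path, encoding="utf-8", errors="replace") as f:
            return f.read()
    finally:
        # Always remove the temporary file, even if conversion fails.
        os.remove(out_path)


def fake_convert(src_path, out_path):
    # Stand-in converter for demonstration only: writes a fixed string.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("slide text from " + os.path.basename(src_path))
```

Calling extract_text_via_tempfile(fake_convert, "deck.ppt") returns the
text the stand-in wrote; a real converter would be dropped in the same way.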

I haven't tried this yet, so I'm not sure about speed, but it should
avoid a lot of the overhead of unoconv.

Cheers,
    Olly


