[Xapian-discuss] Index Apple iWork docs

Olly Betts olly at survex.com
Tue Sep 18 05:42:23 BST 2012


On Tue, Sep 18, 2012 at 12:19:08PM +1000, linbloke wrote:
> For a given keynote file called testxyz.key:
> 
> cp testxyz.key testxyz.key.zip
> mkdir testxyz.key.tmp
> cd testxyz.key.tmp
> unzip ../testxyz.key.zip
> 
> All text within the keynote file is stored in an xml file called
> index.apxl. The following adds newlines after xml tag closures and
> then filters xml tags, filters some &gt garbage, leaving only the text
> from the keynote file.
> 
> cat index.apxl | perl -pe 's/>/>\n/g' | perl -pe 's/<(.*?)>//g' | strings | grep -v '\&gt' > testxyz.key.txt

You can unpack just that file on the fly from the .key file using
unzip -p, simplifying all the commands above to a single pipeline:

unzip -p testxyz.key index.apxl | perl -pe 's/>/>\n/g' | perl -pe 's/<(.*?)>//g' | strings | grep -v '\&gt' > testxyz.key.txt
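
If you'd rather do the extraction from a script than from the shell, a
rough equivalent of unzip -p is sketched below in Python, using the
standard zipfile module. The helper name read_member and the tiny
in-memory archive are made up for illustration; with a real Keynote file
you would pass its path instead of the buffer.

```python
import io
import zipfile


def read_member(archive, member="index.apxl"):
    """Return the raw bytes of one archive member, much like `unzip -p`."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(member)


# Stand-in archive built in memory so the sketch runs without a real
# .key file (hypothetical contents, not real Keynote markup).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("index.apxl", "<doc><t>Hello</t></doc>")
    zf.writestr("other.bin", "ignored")

data = read_member(buf)
```

Only the requested member is decompressed, and nothing is written to
disk.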

> Probably a better way to do it would be with an xml parser but that's
> beyond me. Please CC me with comments.

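Yeah, the &gt; is a ">" character escaped in XML, and you'll see other
characters escaped like this too (e.g. &lt; for "<" and &amp; for "&").
Note that grep -v drops every line containing an escape rather than
decoding it, so some real text gets lost.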
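
For what it's worth, an XML parser decodes these escapes for you. Here
is a minimal sketch using Python's xml.etree.ElementTree. It assumes the
text you want sits in element text nodes rather than in attributes,
which I haven't checked against a real index.apxl; the function name
extract_text and the sample markup are invented for the example.

```python
import xml.etree.ElementTree as ET


def extract_text(xml_bytes):
    """Return the non-empty text chunks of an XML document, in order."""
    root = ET.fromstring(xml_bytes)
    # itertext() walks every text node; entities are already decoded.
    chunks = (chunk.strip() for chunk in root.itertext())
    return [c for c in chunks if c]


# Invented sample, not real Keynote markup.
sample = (
    b"<slide><title>Speed &gt; 10 &amp; rising</title>"
    b"<body>R&amp;D notes</body></slide>"
)
lines = extract_text(sample)
```

With the sample above, the text containing &gt; is kept and decoded
rather than thrown away.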

Do you have any example keynote files with a liberal licence?

Cheers,
    Olly


