[Xapian-discuss] Using Open Office to convert documents.

Frank John Bruzzaniti frank.bruzzaniti at gmail.com
Mon Feb 2 15:05:14 GMT 2009


I wrote a little Python script (oOC.py) that can be dropped in as one of
the "helper" apps. It uses unoconv and OpenOffice to convert documents
to text. For example, I was having trouble converting *.doc files that
had been saved with WordPerfect, because antiword didn't decode them, so
I replaced the line in omindex that contains antiword with oOC.py.
Theoretically oOC.py can convert almost any format supported by
OpenOffice and unoconv.

I've done some initial testing and it seems to work OK. I wouldn't
recommend it in a production environment without lots of testing; I
decided to email it out of curiosity.

Basically it runs a headless copy of OpenOffice, which should stay
running and accept requests from unoconv, and prints the results to
stdout.
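
One way to check that the headless OpenOffice instance is actually
accepting requests, rather than waiting a fixed five seconds, is to poll
its socket. This is only a sketch: the helper name wait_for_listener and
its timeout and interval parameters are made up for illustration, and
the host and port are the ones passed to soffice in the script below.

```python
import socket
import time


def wait_for_listener(host="127.0.0.1", port=2002, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before `timeout`
    seconds have passed. (Hypothetical helper, not part of oOC.py.)
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            conn = socket.create_connection((host, port), timeout=interval)
        except OSError:
            # Nothing is listening yet (or the attempt timed out); retry.
            time.sleep(interval)
        else:
            conn.close()
            return True
    return False
```

A successful connection only shows that something is listening on the
port, not that it is OpenOffice, so this is a readiness hint rather than
a guarantee.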


#!/usr/bin/python
# Python script to convert documents via OpenOffice for Xapian-Omega
# By Frank J Bruzzaniti
# frank.bruzzaniti at gmail.com

import os, sys, time
from subprocess import *

# Get pid of any running soffice processes
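# (awk picks the PID column reliably; ps pads its columns with runs of
# spaces, which cut -d' ' treats as empty fields)
getpid = Popen(
    "ps -ef | grep -v grep | grep "
    "'/usr/lib/openoffice/program/soffice.bin -headless "
    "-accept=socket,host=127.0.0.1,port=2002;urp; -nofirststartwizard' "
    "| awk '{print $2}'",
    shell=True, stdout=PIPE).stdout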

# Save pid, might be useful
pid = getpid.read()
#print("PID=%s" % pid)

# If soffice is not running, start it and wait 5 secs
if not pid:
    Popen(['soffice', '-headless',
           '-accept=socket,host=127.0.0.1,port=2002;urp;',
           '-nofirststartwizard'])
    #print("I didn't find soffice running so I'm starting one now and waiting 5 secs")
    time.sleep(5)

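# Run unoconv; passing the file name as its own argument keeps paths
# containing spaces or shell metacharacters intact
call(['unoconv', '--stdout', '-f', 'text', sys.argv[1]])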
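
For anyone who wants the converted text inside Python instead of letting
it pass straight through to stdout, the output can be captured. This is
a minimal sketch, assuming unoconv is on the PATH; the function name
convert_to_text and its command parameter are hypothetical and exist
only so the converter can be swapped out when experimenting.

```python
import subprocess


def convert_to_text(path, command=("unoconv", "--stdout", "-f", "text")):
    """Run the converter on `path` and return what it wrote to stdout (bytes).

    Raises RuntimeError if the converter exits with a non-zero status.
    (Hypothetical helper, not part of oOC.py.)
    """
    # The file name is passed as its own argument, so no shell quoting
    # is needed even if it contains spaces.
    proc = subprocess.Popen(list(command) + [path], stdout=subprocess.PIPE)
    output, _ = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("converter exited with status %d" % proc.returncode)
    return output
```

Checking the exit status makes a failed conversion visible to the
caller, which the plain pass-through version does not do.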



