[Xapian-tickets] [Xapian] #324: A Script that uses OpenOffice to filter text for Xapian Omega
Xapian
nobody at xapian.org
Wed Feb 18 00:45:57 GMT 2009
#324: A Script that uses OpenOffice to filter text for Xapian Omega
-------------------------+--------------------------------------------------
Reporter: frankjb | Owner: olly
Type: enhancement | Status: new
Priority: normal | Milestone:
Component: Omega | Version:
Severity: normal | Resolution:
Keywords: | Blockedby:
Platform: Linux | Blocking:
-------------------------+--------------------------------------------------
Changes (by olly):
* component: Examples => Omega
Old description:
> This python script is an example of how to use openoffice to convert
> documents to text. It's starts an headless version of openoffice which
> should remain running and will attempt to start a new instance if it is
> not. It also uses Unoconv which can be downloaded from
> http://dag.wieers.com/home-made/unoconv/.
>
> Unoconv doesn't need to be told what format it is accepting so you should
> be able to slot the script anywhere in omindex without to much hassle.
> For example I replaced antiword in omindex.cc with oOC.py (this script)
> because antiword couldn't open .doc's saved via Word Perfect
>
> I would love to get some high end stability and performance testing using
> OpenOffice as a filter. I couldn't figure out how to get python to
> correctly marshal the soffice process hence I parsed the output of ps
> command. Maybe one of the python guru's could have a look :)
New description:
This python script is an example of how to use openoffice to convert
documents to text. It starts a headless version of openoffice, which
should remain running, and will attempt to start a new instance if it is
not. It also uses Unoconv, which can be downloaded from
http://dag.wieers.com/home-made/unoconv/.
Unoconv doesn't need to be told what format it is accepting, so you should
be able to slot the script anywhere in omindex without too much hassle. For
example, I replaced antiword in omindex.cc with oOC.py (this script)
because antiword couldn't open .doc files saved via WordPerfect.
I would love to get some high-end stability and performance testing using
OpenOffice as a filter. I couldn't figure out how to get python to
correctly marshal the soffice process, hence I parsed the output of the ps
command. Maybe one of the python gurus could have a look :)
--
Comment:
Sadly antiword isn't getting many updates now, so the option to use
something more actively maintained would be useful. Perhaps openoffice is
a bit heavyweight, but the ability to use a single instance in the
background should at least mean the runtime overhead isn't an issue.
As you suggest, the "ps" stuff really needs replacing with something
better (amongst other things, "ps -ef" isn't portable, and you're hard-
coding the install location).
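One possible replacement for the "ps" parsing, sketched here and not tested against a real soffice: keep the subprocess.Popen handle and ask it whether the child is still alive. The ManagedProcess class is hypothetical, and the command is passed in by the caller, so the install location would come from configuration rather than being hard-coded.

```python
import subprocess
import sys


class ManagedProcess:
    """Start a long-running command and restart it once it has exited."""

    def __init__(self, argv):
        self.argv = list(argv)
        self.proc = None

    def is_running(self):
        # poll() returns None while the child is still alive.
        return self.proc is not None and self.proc.poll() is None

    def ensure_running(self):
        """Start the command if needed; return True if a new child was started."""
        if self.is_running():
            return False
        self.proc = subprocess.Popen(self.argv)
        return True

    def stop(self):
        """Terminate the child, if any, and wait for it to exit."""
        if self.is_running():
            self.proc.terminate()
            self.proc.wait()


if __name__ == "__main__":
    # Stand-in for soffice: a child process that just sleeps for a while.
    demo = ManagedProcess([sys.executable, "-c", "import time; time.sleep(30)"])
    print(demo.ensure_running())  # a new child is started
    print(demo.ensure_running())  # the existing child is reused
    demo.stop()
```

This only tracks a child started by the same Python process. If the filter script is run once per document, the handle is lost between runs, so a process that has to outlive the script still needs some other mechanism.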
This script also assumes nothing is running on port 2002, and allows other
users on the system to do things with your openoffice process, which is a
potential security risk.
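The port assumption could at least be checked before anything is started. A minimal sketch, assuming the listener would be reached on the loopback interface; the port_in_use name is hypothetical, and 2002 is simply the port the script uses.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex() returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print(port_in_use(2002))
```

A check like this is racy (another process can grab the port between the check and the start), and it does nothing about other local users connecting to the socket, so it addresses only the first of the two problems.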
I had a quick look at unoconv, and it looks like you'd do better to do the
following (I've not tested either):
* use {{{unoconv --listener}}} to start a persistent openoffice process
* use {{{unoconv --pipe}}} to use a named pipe to communicate with
openoffice instead of a TCP socket
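The two suggestions might combine roughly as follows. This is an untested sketch: --listener and --pipe are the options named above, while the -f txt and --stdout options, the pipe name, and the function names are assumptions that should be checked against the --help output of the installed unoconv.

```python
import subprocess

PIPE_NAME = "omega_unoconv"  # hypothetical pipe name, not from the ticket


def listener_argv(pipe=PIPE_NAME):
    """Command line to start a persistent openoffice listener on a named pipe."""
    return ["unoconv", "--listener", "--pipe=" + pipe]


def convert_argv(path, pipe=PIPE_NAME):
    """Command line to write one document as plain text to stdout.

    Assumes unoconv accepts -f txt and --stdout.
    """
    return ["unoconv", "--pipe=" + pipe, "-f", "txt", "--stdout", path]


def document_text(path, pipe=PIPE_NAME):
    """Run unoconv against the listener and return the extracted text as bytes."""
    return subprocess.check_output(convert_argv(path, pipe))
```

The listener would be started once, for example with subprocess.Popen(listener_argv()), and document_text() would then be called for each file to index.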
--
Ticket URL: <http://trac.xapian.org/ticket/324#comment:1>
Xapian <http://xapian.org/>