[Xapian-tickets] [Xapian] #324: A Script that uses OpenOffice to filter text for Xapian Omega

Xapian nobody at xapian.org
Wed Feb 18 00:45:57 GMT 2009


#324: A Script that uses OpenOffice to filter text for Xapian Omega
-------------------------+--------------------------------------------------
 Reporter:  frankjb      |        Owner:  olly
     Type:  enhancement  |       Status:  new 
 Priority:  normal       |    Milestone:      
Component:  Omega        |      Version:      
 Severity:  normal       |   Resolution:      
 Keywords:               |    Blockedby:      
 Platform:  Linux        |     Blocking:      
-------------------------+--------------------------------------------------
Changes (by olly):

  * component:  Examples => Omega


Old description:

> This python script is an example of how to use openoffice to convert
> documents to text.  It's starts an headless version of openoffice which
> should remain running and will attempt to start a new instance if it is
> not. It also uses Unoconv which can be downloaded from
> http://dag.wieers.com/home-made/unoconv/.
>
> Unoconv doesn't need to be told what format it is accepting so you should
> be able to slot the script anywhere in omindex without to much hassle.
> For example I replaced antiword in omindex.cc with oOC.py (this script)
> because antiword couldn't open .doc's saved via Word Perfect
>
> I would love to get some high end stability and performance testing using
> OpenOffice as a filter.  I couldn't figure out how to get python to
> correctly marshal the soffice process hence I parsed the output of ps
> command. Maybe one of the python guru's could have a look :)

New description:

 This python script is an example of how to use openoffice to convert
 documents to text.  It starts a headless instance of openoffice, which
 should remain running, and will attempt to start a new instance if one is
 not. It also uses Unoconv, which can be downloaded from
 http://dag.wieers.com/home-made/unoconv/.

 Unoconv doesn't need to be told what format it is accepting, so you should
 be able to slot the script anywhere in omindex without too much hassle. For
 example, I replaced antiword in omindex.cc with oOC.py (this script)
 because antiword couldn't open .doc files saved via WordPerfect.

 I would love to get some high end stability and performance testing using
 OpenOffice as a filter.  I couldn't figure out how to get python to
 correctly marshal the soffice process, hence I parsed the output of the ps
 command. Maybe one of the python gurus could have a look :)

--

Comment:

 Sadly antiword isn't getting many updates now, so the option to use
 something more actively maintained would be useful.  Perhaps openoffice is
 a bit heavyweight, but the ability to use a single instance in the
 background should at least mean the runtime overhead isn't an issue.

 As you suggest, the "ps" stuff really needs replacing with something
 better (amongst other things, "ps -ef" isn't portable, and you're
 hard-coding the install location).
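
 As a rough, untested sketch of what "something better" might look like:
 look the binary up on PATH rather than hard-coding its location, and probe
 the listener by trying to connect to it rather than parsing "ps" output.
 The helper names and the candidate binary names are my own assumptions, and
 a successful connection only shows that something is listening on the port,
 not that it is openoffice.

```python
import shutil
import socket


def find_soffice(candidates=("soffice", "ooffice")):
    """Return the full path of the first candidate found on PATH, else None.

    The candidate names are assumptions; adjust them for the local install.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def listener_is_up(host="127.0.0.1", port=2002, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    This does not prove the listener is openoffice, only that the port
    is in use and accepting connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

 The script could then start a new instance only when listener_is_up()
 returns False and find_soffice() returns a path.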

 This script also assumes nothing is running on port 2002, and allows other
 users on the system to do things with your openoffice process, which is a
 potential security risk.

 I had a quick look at unoconv and it looks like you'd do better to (but
 I've not tested either):

  * use {{{unoconv --listener}}} to start a persistent openoffice process
  * use {{{unoconv --pipe}}} to use a named pipe to communicate with
    openoffice instead of a TCP socket
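
 Something along these lines, perhaps. This is equally untested: the pipe
 name is made up, the function names are mine, and the {{{-f txt}}} and
 {{{--stdout}}} options are what I believe unoconv offers for plain text
 output, so check {{{unoconv --help}}} for the installed version first.

```python
import subprocess

# Hypothetical pipe name; the listener and every client must agree on it.
PIPE_NAME = "omega_unoconv"


def listener_command(pipe=PIPE_NAME):
    """Build the argv for a persistent listener reachable via a named pipe."""
    return ["unoconv", "--listener", "--pipe", pipe]


def convert_command(path, pipe=PIPE_NAME):
    """Build the argv that converts one document to text on stdout."""
    return ["unoconv", "--pipe", pipe, "-f", "txt", "--stdout", path]


def start_listener(pipe=PIPE_NAME):
    """Start the listener; keep the returned Popen handle to stop it later."""
    return subprocess.Popen(listener_command(pipe))


def extract_text(path, pipe=PIPE_NAME):
    """Run one conversion and return the extracted text.

    Raises subprocess.CalledProcessError if unoconv exits with an error.
    Decoding as UTF-8 is an assumption about unoconv's text output.
    """
    result = subprocess.run(
        convert_command(path, pipe), stdout=subprocess.PIPE, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

 Keeping the Popen handle from start_listener() would also avoid any need to
 look the process up with "ps" afterwards.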

-- 
Ticket URL: <http://trac.xapian.org/ticket/324#comment:1>
Xapian <http://xapian.org/>