[Xapian-discuss] antiword

Olly Betts olly at survex.com
Thu May 14 04:28:10 BST 2009


On Wed, Apr 29, 2009 at 09:21:31PM +0930, Frank J Bruzzaniti wrote:
> I've been noticing more and more that antiword has trouble with many
> Word documents.
> It may look like it has converted a document, but it leaves out headings
> and bits of text.

It's always seemed to do OK for me, but I strive to avoid needing to deal
with such formats.

Do you have some example documents which show such problems?

There doesn't seem to be much development of antiword these days, but
then it's no longer chasing an evolving format so perhaps that's to be
expected.

If we're going to pick a better default converter, I'd rather do so
based on trying the various options on a set of sample documents and
comparing the output, time taken, and memory requirements rather than
relying on anecdotal reports that a particular option has trouble with
"many" documents.
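
A minimal sketch of what such a comparison could look like, in Python.
The names run_converter and compare are hypothetical, and the word count
is only an assumed rough proxy for "text was dropped".  It records
wall-clock time and whether the converter succeeded, but does not measure
memory, which would need something platform-specific on top.

```python
import subprocess
import time


def run_converter(argv_prefix, path, timeout=60):
    """Run one converter on one file; return (text, seconds, ok).

    argv_prefix is the command and its options; the file name is
    appended as the last argument.
    """
    start = time.perf_counter()
    try:
        proc = subprocess.run(argv_prefix + [path],
                              capture_output=True, timeout=timeout)
        ok = proc.returncode == 0
        out = proc.stdout
    except (OSError, subprocess.TimeoutExpired):
        # Converter not installed, not executable, or it hung.
        ok, out = False, b""
    elapsed = time.perf_counter() - start
    return out.decode("utf-8", errors="replace"), elapsed, ok


def compare(converters, paths):
    """converters maps a label to an argv prefix; returns one row per run."""
    rows = []
    for path in paths:
        for name, argv in converters.items():
            text, secs, ok = run_converter(argv, path)
            rows.append({
                "file": path,
                "converter": name,
                "ok": ok,
                "seconds": secs,
                # Crude proxy: a much lower count hints at missing text.
                "words": len(text.split()),
            })
    return rows


# Example converter table, assuming both programs are on PATH.  The
# abiword options are the ones quoted later in this message.
# converters = {
#     "antiword": ["antiword"],
#     "abiword": ["abiword", "--to=txt", "--to-name=fd://1"],
# }
# for row in compare(converters, ["test_word6.doc"]):
#     print(row)
```

Running this over a directory of sample documents would give per-file
numbers to compare, rather than an impression.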

> I've been looking into getting openoffice to do it in headless mode but 
> still have a way to go before it's stable.
> I was wondering if anyone else had any luck on this front?

It's rather a heavyweight solution though...

> One quick fix I have found for Word documents is to use abiword.
> 
> If you want to convert a file to text and display it to stdout:
> 
> abiword --to=txt --to-name=fd://1 <file to convert>
> 
> E.g. abiword --to=txt --to-name=fd://1 test_word6.doc

Again, that's quite a lot bigger than antiword - on my x86-64 Ubuntu
jaunty system, installing abiword requires 19.1MB of disk space (and
that's just the required dependencies - with the recommended ones I get
by default it's 27.6MB).  Installing antiword needs 811KB.

Perhaps it is worth the extra space, but I think more concrete evidence
is needed.  I looked at a large Word document I have to hand, and the
only differences from antiword's output, other than whitespace changes,
are that antiword sometimes puts a space after a hyphen (less good, but
won't affect indexing) and puts "[pic]" where an image would go (which
would make the term "pic" harder to search for, but is otherwise
harmless).
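
One way to do that kind of comparison while ignoring whitespace-only
changes is to diff the two outputs word by word.  This is only a sketch;
word_diff is a hypothetical helper, not part of any of the tools
discussed here.

```python
import difflib


def word_diff(a, b):
    """Return the non-equal opcodes between two texts, compared as
    sequences of whitespace-separated words, so that differences in
    line wrapping or spacing alone produce no output."""
    wa, wb = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, wa, wb, autojunk=False)
    return [(tag, wa[i1:i2], wb[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]
```

With this, word_diff("foo  bar\nbaz", "foo bar baz") is empty, while a
stray "[pic]" or a hyphenated word split by an extra space shows up as a
"delete" or "replace" entry.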

Cheers,
    Olly


