[Xapian-discuss] antiword

Olly Betts olly at survex.com
Fri May 15 05:00:44 BST 2009


On Thu, May 14, 2009 at 07:42:42PM +0200, Henry wrote:
> You may want to try wvware (http://wvware.sourceforge.net/).  It's  
> also becoming dated

It seems to be fairly active - wv2 v0.3.1 was released in March this
year.

> but still does a good job of converting msdoc  
> files.  Preserves the layout better than others too (even Abiword),  
> I've found.  This is not as important for indexing, but is for  
> displaying the cached (converted) version, etc.

I notice that abiword actually uses the wv library (and the wv site
steers you towards using abiword instead of the command line tools):

http://wvware.sourceforge.net/index.html#wv

But it says wvWare is supported still and this seems to produce
plausible output:

wvWare -cutf-8 -1 -xwvText.xml file.doc

It takes 4-5 times longer than antiword to extract text from a 12000
word document.  It does a similarly good job on my example, but the slower
extraction is probably an acceptable trade-off for most people if it
actually handles more documents better (and we might be able to reduce
the overhead by using wv as a library, which isn't an option for
antiword).
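
For anyone wanting to repeat that timing comparison on their own files,
something like the rough sketch below might do.  The wvWare arguments are
the ones quoted above; the antiword arguments, the EXTRACTORS table and the
helper names are assumptions made for illustration, and both tools are
assumed to be on PATH.

```python
import subprocess
import time

# Command prefixes for each extractor.  The wvWare arguments come from the
# message above; the antiword arguments are an assumption.
EXTRACTORS = {
    "wvWare": ["wvWare", "-cutf-8", "-1", "-xwvText.xml"],
    "antiword": ["antiword", "-mUTF-8.txt"],
}


def build_argv(name, path):
    """Return the full argument list for running extractor `name` on `path`."""
    return EXTRACTORS[name] + [path]


def time_extractor(name, path, runs=3):
    """Return the best wall-clock time in seconds over `runs` runs.

    The extracted text is discarded; a non-zero exit status raises
    subprocess.CalledProcessError.
    """
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            build_argv(name, path),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
        best = min(best, time.perf_counter() - start)
    return best


# Example use (needs both tools installed and a real file.doc):
#   for name in EXTRACTORS:
#       print(name, time_extractor(name, "file.doc"))
```

Taking the best of several runs reduces the effect of disk caching on the
first run, but this only measures process start-up plus extraction, so it
says nothing about how much using wv as a library would save.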

Though I notice that wv lists support for "Word 2000, 97, 95 and 6 file
formats. (These are the file formats known internally as Word 9, 8, 7
and 6.) There is some support for reading earlier formats as well: Word
2 docs are converted to plaintext." while antiword lists "Word 2, 6, 7,
97, 2000, 2002 and 2003", which on the surface suggests that wv doesn't
handle Word 2002 or 2003 files...

But I don't have many example files, and I've no idea how to tell what
format the few I have actually are.

> I also got this working pretty painlessly, but it's also resource  
> intensive as Olly says.  My concern on our cluster was to use  
> something as lightweight as possible, while striking a balance of file  
> format compatibility.  That being said, it's possible that Abiword  
> might be the best forward-looking solution...

I think a lightweight (or lighter-weight) extractor is a better default,
but we can fairly easily provide users with the ability to choose how
to extract text from particular formats.
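
One way to picture that sort of per-format choice is a table of commands
keyed by MIME type, with user-supplied overrides replacing the defaults.
The sketch below is purely hypothetical: the "mimetype:command" override
syntax, the function names and the antiword default are illustrative
assumptions, not an existing interface.

```python
import shlex

# Hypothetical default: a lightweight extractor for Word documents.
DEFAULT_FILTERS = {
    "application/msword": "antiword -mUTF-8.txt",
}


def parse_override(spec):
    """Split a 'mimetype:command' string into (mimetype, command)."""
    mimetype, sep, command = spec.partition(":")
    if not sep or not mimetype or not command:
        raise ValueError("expected 'mimetype:command', got %r" % spec)
    return mimetype, command


def build_filters(overrides=()):
    """Return the default table with any user overrides applied on top."""
    filters = dict(DEFAULT_FILTERS)
    for spec in overrides:
        mimetype, command = parse_override(spec)
        filters[mimetype] = command
    return filters


def command_for(filters, mimetype, path):
    """Return the argument list to extract text from `path`, or None if no
    extractor is configured for `mimetype`."""
    command = filters.get(mimetype)
    if command is None:
        return None
    return shlex.split(command) + [path]
```

With this, a user who prefers wvWare could pass
"application/msword:wvWare -cutf-8 -1 -xwvText.xml" as an override, while
everyone else keeps the lighter-weight default.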

Cheers,
    Olly


