[Xapian-discuss] antiword

Henry henka at cityweb.co.za
Thu May 14 18:42:42 BST 2009


Quoting "Olly Betts" <olly at survex.com>:
> If we're going to pick a better default converter, I'd rather do so
> based on trying the various options on a set of sample documents and
> comparing the output, time taken, and memory requirements rather than
> relying on anecdotal reports that a particular option has trouble with
> "many" documents.

You may want to try wvware (http://wvware.sourceforge.net/). It's also
becoming dated, but it still does a good job of converting MS Word (.doc)
files. I've found it preserves the layout better than the others too
(even Abiword). That matters less for indexing, but it does matter for
displaying the cached (converted) version, etc.
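
A minimal sketch of the kind of comparison Olly describes (output, time
taken, memory), assuming a Unix-like system with Python available. The
CONVERTERS table, the "{doc}" placeholder, and the function names are made
up for illustration; only antiword's plain "print text to stdout" form is
listed, and other converters would need their command lines checked against
their own documentation.

```python
import os
import subprocess
import time

# Hypothetical converter table: each entry is an argv list in which the
# placeholder "{doc}" is replaced by the document path.
CONVERTERS = {
    "antiword": ["antiword", "{doc}"],
}


def run_converter(argv_template, doc_path):
    """Run one converter on one document and collect basic measurements."""
    argv = [doc_path if arg == "{doc}" else arg for arg in argv_template]
    start = time.perf_counter()
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL)
    output = proc.stdout.read()          # converted text, read until EOF
    proc.stdout.close()
    # os.wait4 reaps this child and returns its own resource usage.
    _, status, rusage = os.wait4(proc.pid, 0)
    seconds = time.perf_counter() - start
    if os.WIFEXITED(status):
        exit_code = os.WEXITSTATUS(status)
    else:
        exit_code = -os.WTERMSIG(status)
    proc.returncode = exit_code          # tell Popen the child is already reaped
    return {
        "exit_code": exit_code,
        "seconds": seconds,
        # ru_maxrss is in kilobytes on Linux and in bytes on macOS.
        "peak_rss": rusage.ru_maxrss,
        "output": output,
    }


def compare(doc_paths, converters=CONVERTERS):
    """Run every converter on every sample document."""
    rows = []
    for name, template in converters.items():
        for doc in doc_paths:
            rows.append((name, doc, run_converter(template, doc)))
    return rows
```

Calling compare() with a list of sample .doc paths gives one row per
converter and document; the captured output can then be diffed or eyeballed
alongside the timing and peak-memory figures.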

>> I've been looking into getting openoffice to do it in headless mode but
>> still have a way to go before it's stable.
>> I was wondering if anyone else had any luck on this front?

I tried this, but crikey, it's dependency hell and requires some
hackery to get it running headless.

>> One quick fix I have found for word documents is by using abiword

I also got this working pretty painlessly, but it's also resource
intensive, as Olly says. My concern on our cluster was to use
something as lightweight as possible while still striking a balance on
file format compatibility. That being said, it's possible that Abiword
might be the best forward-looking solution...

Cheers
Henry