[Xapian-discuss] Tika memory problems. Omindex restrictions?

Charles xapian at catcons.co.uk
Sat Jun 18 15:53:28 BST 2011


Hello :-)

I don't know what the significant change was, but running Tika from 
omindex started failing on 247 out of a tree of 400 files with the error 
message "java.lang.OutOfMemoryError: requested <number> bytes for 
CHeapObj-new.  Out of swap space?".  The biggest file in the tree is 
~95 MB; most are under 1 MB.

The files triggering the error had extensions doc, pdf, ppt, rtf and
xls so the problem is probably not specific to the file type.

Running vmstat with a 1 second delay during the omindex run showed no 
swapping and consistently ~0.5 GB (of 1 GB) free memory, so the problem 
does not appear to be a shortage of system memory.

The bash ulimit command reported "unlimited" and 
/etc/security/limits.conf is all comments or empty lines.
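
A note on that check: bash's ulimit with no option reports only the 
file-size limit (the same as ulimit -f); ulimit -a lists them all.  
Limits are also per-process and inherited, so the shell's values say 
nothing about limits a parent process might set for its children.  A 
minimal sketch for printing the limits most relevant to a JVM, using 
Python's standard resource module (the function names are illustrative, 
not part of omindex or Tika):

```python
import resource

# Limits most likely to matter for a JVM started as a child process.
LIMITS = {
    "address space (ulimit -v)": resource.RLIMIT_AS,
    "data segment (ulimit -d)": resource.RLIMIT_DATA,
    "stack (ulimit -s)": resource.RLIMIT_STACK,
}


def fmt(value):
    """Render a limit the way the shell does."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def describe_limits():
    """Return {label: (soft, hard)} for the current process."""
    rows = {}
    for label, which in LIMITS.items():
        soft, hard = resource.getrlimit(which)
        rows[label] = (fmt(soft), fmt(hard))
    return rows


if __name__ == "__main__":
    for label, (soft, hard) in describe_limits().items():
        print(f"{label}: soft={soft} hard={hard}")
```

The values are in bytes, whereas ulimit -v and friends report kilobytes.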

Omindex ran Tika OK on this development system from installation on 
31 Mar 2011 until it was last used on 14 Apr 2011.  All system changes 
are logged but none of the changes since 14 Apr 2011 are obviously 
relevant.

The OS is Debian Squeeze 64 bit running in a virtual machine -- hence 
the small sample of 400 files and the 1 GB memory.

Changing the VirtualBox VM memory from ~1 GB to 3072 MB fixed the 
problem.  I then changed it back to 1024 MB and tried to reproduce the 
problem, but the behaviour had changed: the java.lang.OutOfMemoryError 
message no longer appeared.  Some files now generated std::bad_alloc 
messages, but most simply reported "Aborted" (I don't know whether that 
message comes from omindex or Tika).

For the file types for which omindex uses Tika as a filter:

doc files:  tried: 134, failed: 60  44.78%
docx files: tried:   1, failed:  0
odp files:  tried:   1, failed:  0
ods files:  tried:  23, failed:  0
odt files:  tried:  71, failed:  0
pdf files:  tried:  81, failed: 81 100.00%
ppt files:  tried:   4, failed:  4 100.00%
rtf files:  tried:   2, failed:  2 100.00%
xls files:  tried:  27, failed: 27 100.00%
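
The failure rates can be rechecked from the raw counts.  A minimal 
sketch, with the counts copied from the table above; the variable and 
function names are illustrative only:

```python
# (tried, failed) per extension, copied from the table above.
results = {
    "doc": (134, 60),
    "docx": (1, 0),
    "odp": (1, 0),
    "ods": (23, 0),
    "odt": (71, 0),
    "pdf": (81, 81),
    "ppt": (4, 4),
    "rtf": (2, 2),
    "xls": (27, 27),
}


def failure_rate(tried, failed):
    """Percentage of tried files that failed."""
    return 100.0 * failed / tried


total_tried = sum(tried for tried, _ in results.values())
total_failed = sum(failed for _, failed in results.values())

if __name__ == "__main__":
    for ext, (tried, failed) in results.items():
        print(f"{ext:5s} tried: {tried:3d}, failed: {failed:3d}  "
              f"{failure_rate(tried, failed):6.2f}%")
    print(f"total tried: {total_tried}, failed: {total_failed}")
```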

Taking a sample of failing Tika commands from omindex output and running 
them at the command prompt does not produce any errors.  It is beginning 
to look as if the problem is caused by the environment that omindex sets 
up for Tika to run in.  Does that make any sense?
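
One way to test that idea is to run a failing command from the prompt 
under a deliberately restricted environment and see whether the same 
errors appear.  The sketch below assumes the difference is an address 
space limit (RLIMIT_AS), which nothing above confirms; it is only a 
candidate because a JVM typically reserves address space for its heap 
at start-up.  The helper name is hypothetical and the command to run is 
whatever omindex printed:

```python
import resource
import subprocess
import sys


def run_with_address_space_limit(argv, limit_bytes):
    """Run argv with RLIMIT_AS capped at limit_bytes.

    Assumption: an address space cap is the relevant difference between
    the two environments.  Returns the subprocess.CompletedProcess.
    """
    def apply_limit():
        # Runs in the child after fork() and before exec(), so only the
        # child process is restricted.
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    return subprocess.run(argv, preexec_fn=apply_limit,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Usage: python limit_run.py <limit in MB> <command> [args...]
    limit_mb = int(sys.argv[1])
    result = run_with_address_space_limit(sys.argv[2:],
                                          limit_mb * 1024 * 1024)
    print("exit status:", result.returncode)
    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)
```

If the command succeeds unrestricted but fails with the same messages 
under a cap, that would support the environment explanation; if it 
never fails, the cause lies elsewhere.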

Best

Charles


