indexing pdf errors

Wed Mar 31 22:14:51 BST 2021

On Mon, Mar 29, 2021 at 10:10:39AM +0200, Henk L. wrote:
> I have indexed pdf's with omindex. In this batch not all went well,
> though the files got indexed. I get errors:
> 
> Syntax Warning: Couldn't link the profiles
> Syntax Warning: Can't create transform
> 
> These pdf's are the output of a scan and ocr process. So they contain
> text.  
> 
> Is there a way I can find out what happened?

Assuming Xapian 1.4.x where x >= 10, we extract text from PDFs by piping
them to:

    pdftotext -enc UTF-8 - -

For some PDF files pdftotext emits warning messages.

These are presumably due to invalid structure within the PDF file
(caused by bugs in the tool that made it, or by corruption of the file
since) or possibly to bugs in libpoppler (which pdftotext uses to do
most of the actual work).

Most of them sound like things that aren't important for extracting just
the text (e.g. "Can't create transform" sounds like a graphics
coordinate mapping problem.)

You can test by hand with the command above and see what text is
actually extracted.
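
As a rough sketch of that by-hand test (not something omindex itself
provides): the check_pdf function and the PDFTOTEXT override variable
below are names made up for illustration.  It feeds one PDF to the same
command and keeps the extracted text and the warnings in separate files,
so you can inspect each.

```shell
# Hypothetical helper: run the extraction command on one PDF, saving
# the text to FILE.txt and any messages from pdftotext to FILE.warn.
# PDFTOTEXT is an assumed override for trying a different binary.
check_pdf() {
    pdf=$1
    "${PDFTOTEXT:-pdftotext}" -enc UTF-8 - - < "$pdf" \
        > "$pdf.txt" 2> "$pdf.warn"
    status=$?
    # Summarise: exit status, bytes of text, lines of warnings.
    text_bytes=$(wc -c < "$pdf.txt" | tr -d ' ')
    warning_lines=$(wc -l < "$pdf.warn" | tr -d ' ')
    echo "$pdf: exit=$status text_bytes=$text_bytes warning_lines=$warning_lines"
    return "$status"
}
```

You could then run it over the batch with something like
`for f in *.pdf; do check_pdf "$f"; done` and look more closely at any
file which reports warnings but little or no text.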

Possibly we should run pdftotext with -q to disable such messages,
though that seems to silence errors as well, which is less helpful.
There doesn't seem to be any finer level of control.
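
When testing by hand, one possible workaround is to filter the messages
yourself rather than use -q.  This is only a sketch: filter_warnings is a
made-up name, and it assumes the messages you want to hide all start
with "Syntax Warning:" as the ones quoted above do.  It does not change
what omindex does.

```shell
# Hypothetical filter: drop lines starting with "Syntax Warning:" and
# pass everything else through unchanged.  The "|| true" stops grep's
# non-zero exit status (when nothing is left) from looking like failure.
filter_warnings() {
    grep -v '^Syntax Warning:' || true
}
```

For example, `pdftotext -enc UTF-8 - - < scan.pdf 2>&1 > scan.txt |
filter_warnings >&2` sends the text to scan.txt (scan.pdf being a
placeholder name) and only the remaining messages to the terminal.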

(Or if you're running Omega from git master you may be using the new
worker module for libpoppler, but mostly that just means we don't fork()
and exec() for each PDF file indexed - these messages are still coming
from libpoppler, and we could set the option that pdftotext -q sets to
disable them.)

Cheers,
    Olly