indexing pdf errors
henk l
gordevio2 at gmail.com
Fri Apr 2 08:51:56 BST 2021
Thanks Olly for your reply. You are right about the graphics issues.
Good idea to test it with pdftotext. I was not aware of that command.
Kind regards,
Henk
On Wed, 31 Mar 2021 at 23:14, Olly Betts <olly at survex.com> wrote:
> On Mon, Mar 29, 2021 at 10:10:39AM +0200, Henk L. wrote:
> > I have indexed PDFs with omindex. In this batch not all went well,
> > though the files got indexed. I get errors:
> >
> > Syntax Warning: Couldn't link the profiles
> > Syntax Warning: Can't create transform
> >
> > These PDFs are the output of a scan and OCR process, so they contain
> > text.
> >
> > Is there a way I can find out what happened?
>
> Assuming Xapian 1.4.x where x >= 10, we extract text from PDFs by piping
> them to:
>
> pdftotext -enc UTF-8 - -
>
> For some PDF files pdftotext emits warning messages.
>
> These are presumably due to either invalid structure within the PDF file
> (either due to bugs in the tool that made it, or corruption of the file
> since) or possibly bugs in libpoppler (which pdftotext uses to do most
> of the actual work.)
>
> Most of them sound like things that aren't important for extracting just
> the text (e.g. "Can't create transform" sounds like a graphics
> coordinate mapping problem.)
>
> You can test by hand with the command above and see what text is
> actually extracted.
>
> Possibly we should run pdftotext with -q to disable such messages,
> though that seems to make errors silent too, which is less
> helpful. There doesn't seem to be any finer level of control.
>
> (Or if you're running Omega from git master you may be using the new
> worker module for libpoppler, but mostly that just means we don't fork()
> and exec() for each PDF file indexed - these messages are still coming
> from libpoppler, and we could set the option that pdftotext -q sets to
> disable them.)
>
> Cheers,
> Olly
>
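
For anyone wanting to script the by-hand check suggested above, here is a
minimal sketch in Python. It is not how omindex does it internally. It
simply pipes a PDF to the same pdftotext command line quoted in the thread
and keeps the extracted text apart from the warning messages, on the
assumption that pdftotext writes its warnings to stderr. The function name
extract_text and its cmd parameter are made up for this illustration.

```python
import subprocess
import sys

# Command line quoted in the thread: read the PDF on stdin, write text to stdout.
PDFTOTEXT_CMD = ("pdftotext", "-enc", "UTF-8", "-", "-")


def extract_text(pdf_bytes, cmd=PDFTOTEXT_CMD):
    """Pipe PDF data to `cmd` and return (exit_status, text, warnings).

    `extract_text` and `cmd` are hypothetical names for this sketch.
    Assumption: the tool prints warnings such as "Syntax Warning: ..."
    on stderr, so they never end up mixed into the extracted text.
    """
    proc = subprocess.run(list(cmd), input=pdf_bytes, capture_output=True)
    text = proc.stdout.decode("utf-8", errors="replace")
    warnings = proc.stderr.decode("utf-8", errors="replace").splitlines()
    return proc.returncode, text, warnings


def report(path):
    """Print a short summary for one PDF file (illustrative helper)."""
    with open(path, "rb") as f:
        status, text, warnings = extract_text(f.read())
    print(f"{path}: exit status {status}, {len(text)} characters extracted")
    for line in warnings:
        print(f"  stderr: {line}", file=sys.stderr)
```

Calling report() on one of the scanned files should show whether a useful
amount of text comes out despite the "Syntax Warning" lines. If the text is
there and the exit status is 0, the warnings are probably harmless for
indexing purposes.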