[Xapian-discuss] Dealing with image PDF's

Olly Betts olly at survex.com
Thu Jul 31 13:26:20 BST 2008


On Thu, Jul 31, 2008 at 04:09:39AM +0930, Frank Bruzzaniti wrote:
> I was just playing around and added a bit of code to omindex.cc so I 
> could ocr tiff and tif with gocr which seems to work. Here's what it 
> looks like:
> 
>  // Tiff:
>     } else if (startswith(mimetype, "image/tif"))

Just test (mimetype == "image/tiff") instead -- "image/tif" isn't a valid
MIME type (there's a corrected sketch of this branch after the quoted code).

>     {
>     // Inspired by http://mjr.towers.org.uk/comp/sxw2text

This comment is not relevant here.

>     string safefile = shell_protect(file);
>     string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
>     try {
>         dump = stdout_to_string(cmd);
>     } catch (ReadError) {
>         cout << "\"" << cmd << "\" failed - skipping\n";
>         return;
>     }
>     // Tiff:End
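
So with those two tweaks the branch would look something like this (an
untested sketch -- it's just your code with the mimetype test corrected and
the unrelated comment dropped):

    // Tiff:
    } else if (mimetype == "image/tiff") {
        string safefile = shell_protect(file);
        string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
        try {
            dump = stdout_to_string(cmd);
        } catch (ReadError) {
            cout << "\"" << cmd << "\" failed - skipping\n";
            return;
        }
    // Tiff:End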

Interesting idea!  I tried it on the TIFF files I have here.  The
problems I noticed:

* On the TIFF icons I have from various packages, I get random junk from
  the OCR software, which we don't really want to be indexing.  I couldn't
  see an obvious option to tell gocr to "give up if there's nothing
  which looks like text".  Logos and graphics on pages of text also lead
  to random junk, so perhaps a filtering step to drop it (sketched after
  this list) would be better anyway.

* On the multi-page scanned document I have, I only get the text from
  the first page.  I guess that's tifftopnm, but it doesn't seem to have
  an option to do "all pages".  Perhaps something else to do this
  conversion would be better?

* It OCRed a barcode in my document, which is cute, but we don't really
  want to index the XML-like tag as plain text:

  <barcode type="39" chars="12" code="*N04456664M*" crc="E" error="0.049" />
 
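One way to handle both the random junk and the barcode tags might be a crude
post-filter on the OCR output before assigning it to dump.  This is only an
illustrative heuristic, not anything which exists in omindex.cc, and the
"at least half alphanumeric" threshold is a guess:

    #include <cctype>
    #include <string>

    // Illustrative only: strip gocr's <barcode .../> lines and give up
    // if what's left doesn't contain a reasonable amount of real text.
    static std::string
    filter_ocr_output(const std::string & ocr)
    {
        std::string out;
        std::string::size_type pos = 0;
        while (pos < ocr.size()) {
            std::string::size_type eol = ocr.find('\n', pos);
            if (eol == std::string::npos) eol = ocr.size();
            std::string line(ocr, pos, eol - pos);
            pos = eol + 1;
            // Drop gocr's pseudo-XML barcode tags rather than indexing them.
            if (line.compare(0, 9, "<barcode ") == 0) continue;
            out += line;
            out += '\n';
        }
        // If less than half of the output is alphanumeric, assume it's
        // just noise from a logo or icon and index nothing.
        std::string::size_type alnum = 0;
        for (std::string::size_type i = 0; i != out.size(); ++i)
            if (std::isalnum(static_cast<unsigned char>(out[i]))) ++alnum;
        if (alnum * 2 < out.size()) out.resize(0);
        return out;
    }

Then the TIFF branch would do dump = filter_ocr_output(stdout_to_string(cmd));
instead of assigning the raw gocr output directly.
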
> I don't really understand all the code in omindex.cc but was wondering 
> if I could OCR when no text was returned while trying to process PDF's 
> as a way of dealing with image only PDF's.
>
> Here's the bit in omindex.cc that deals with pdf's:
> 
> } else if (mimetype == "application/pdf") {
>     string safefile = shell_protect(file);
>     string cmd = "pdftotext -enc UTF-8 " + safefile + " -";
>     try {
>         dump = stdout_to_string(cmd);
>     } catch (ReadError) {
>         cout << "\"" << cmd << "\" failed - skipping\n";
>         return;
>     }

And then:

    if (dump.empty()) {
	// Do the OCR thing...
    }

Or, if you can get an "empty" dump which actually just contains whitespace,
then:

    if (dump.find_first_not_of(" \n\t") == string::npos) {
	// Do the OCR thing...
    }

> I wanted to change it so if nothing (or no strings) was returned from 
> "pdftotext -enc UTF-8 " + safefile + " -";   then run "pdftoppm " + 
> safefile + " | gocr -f UTF8 -";

pdftoppm seems to produce one ppm file per page, rather than output on
stdout, so you'll need to extract to a temporary directory and then
read files from it.  See the PostScript handling code for how to work
with a temporary directory.
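
Building on the empty-dump check above, inside the application/pdf branch
something like this might work.  It's a completely untested sketch: I've
guessed at pdftoppm's output filenames with the glob pattern, nothing cleans
up the temporary files yet, and you'd need <stdlib.h> and <glob.h> for
mkdtemp() and glob():

    if (dump.find_first_not_of(" \n\t") == string::npos) {
        // No text from pdftotext, so assume an image-only PDF and OCR it.
        char tmpdir[] = "/tmp/omindexXXXXXX";
        if (mkdtemp(tmpdir) == NULL) return;
        string base = string(tmpdir) + "/page";
        string cmd = "pdftoppm " + safefile + " " + shell_protect(base);
        try {
            (void)stdout_to_string(cmd);
        } catch (ReadError) {
            cout << "\"" << cmd << "\" failed - skipping\n";
            return;
        }
        // One PPM file per page - OCR each in turn and concatenate.
        glob_t g;
        if (glob((base + "*.ppm").c_str(), 0, NULL, &g) == 0) {
            for (size_t i = 0; i != g.gl_pathc; ++i) {
                string ocrcmd = "gocr -f UTF8 - < " + shell_protect(g.gl_pathv[i]);
                try {
                    dump += stdout_to_string(ocrcmd);
                } catch (ReadError) {
                    // Just skip pages which fail to OCR.
                }
            }
            globfree(&g);
        }
        // FIXME: remove the PPM files and the temporary directory here.
    }

glob() returns its matches sorted, so provided pdftoppm zero-pads the page
numbers the pages should come out in order.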

> P.S. I was able to write similar snippets of code to process docx and 
> xlsx, so far so good, if they test ok should I post them somewhere or 
> email them to someone?

Creating a new trac ticket and attaching the patch is probably best.

If you've not already done so, take a look at:

http://trac.xapian.org/wiki/FAQ/OmegaNewFileFormat

As that says, it's very helpful if you can update the documentation to
cover the new format(s) and supply some sample files which we can
redistribute for testing (my hope is to create an automated test suite
for omindex).

Cheers,
    Olly
