[Xapian-discuss] Dealing with image PDF's

Frank Bruzzaniti frank.bruzzaniti at gmail.com
Wed Jul 30 19:39:39 BST 2008


Guys,

I was just playing around and added a bit of code to omindex.cc so I 
could OCR TIFF files (.tiff and .tif) with gocr, which seems to work. 
Here's what it looks like:

// Tiff:
} else if (startswith(mimetype, "image/tif")) {
    // Inspired by http://mjr.towers.org.uk/comp/sxw2text
    string safefile = shell_protect(file);
    string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
    try {
        dump = stdout_to_string(cmd);
    } catch (ReadError) {
        cout << "\"" << cmd << "\" failed - skipping\n";
        return;
    }
// Tiff:End

I don't really understand all the code in omindex.cc, but I was wondering 
whether I could fall back to OCR when processing a PDF returns no text, 
as a way of dealing with image-only PDFs.

Here's the bit in omindex.cc that deals with PDFs:

} else if (mimetype == "application/pdf") {
    string safefile = shell_protect(file);
    string cmd = "pdftotext -enc UTF-8 " + safefile + " -";
    try {
        dump = stdout_to_string(cmd);
    } catch (ReadError) {
        cout << "\"" << cmd << "\" failed - skipping\n";
        return;
    }

I'd like to change it so that if nothing (or no strings) is returned from 
"pdftotext -enc UTF-8 " + safefile + " -", it then runs 
"pdftoppm " + safefile + " | gocr -f UTF8 -" instead.
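Roughly, the fallback I have in mind could be sketched like this. It is 
untested against real PDFs and only a sketch: has_text(), pdf_to_text() 
and CommandRunner are names I made up for illustration and are not part 
of omindex. The runner is passed in so the logic can be tried without 
pdftotext or gocr installed. The two command strings are the ones quoted 
above. I'm treating whitespace-only output as "no text" on the assumption 
that pdftotext may still print form feeds or newlines for image-only 
pages.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <functional>
#include <string>

// Hypothetical stand-in for whatever runs a shell command and returns its
// stdout (in omindex.cc that role is played by stdout_to_string).
using CommandRunner = std::function<std::string(const std::string&)>;

// True if the string contains at least one non-whitespace character.
// Form feeds, newlines and spaces alone do not count as text.
bool has_text(const std::string& s) {
    return std::any_of(s.begin(), s.end(), [](unsigned char c) {
        return !std::isspace(c);
    });
}

// Try pdftotext first; if it yields nothing but whitespace, fall back to
// rendering the PDF and running the result through gocr.
// `safefile` is expected to be shell-quoted already (e.g. by shell_protect).
std::string pdf_to_text(const std::string& safefile, const CommandRunner& run) {
    assert(run && "a command runner is required");
    std::string dump = run("pdftotext -enc UTF-8 " + safefile + " -");
    if (!has_text(dump)) {
        // Assumption: this pipeline behaves for PDFs the way the tifftopnm
        // one does for TIFFs; multi-page PDFs may need extra handling.
        dump = run("pdftoppm " + safefile + " | gocr -f UTF8 -");
    }
    return dump;
}
```

Inside omindex.cc the idea would be to keep the existing try/catch and 
call something like pdf_to_text(safefile, stdout_to_string) in place of 
the single stdout_to_string(cmd) call, so a ReadError from either command 
is still caught and the file skipped.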

P.S. I was also able to write similar snippets of code to process docx 
and xlsx files. So far so good. If they test OK, should I post them 
somewhere or email them to someone?

Thanks,

Frank



