[Xapian-discuss] Dealing with image PDF's
Frank Bruzzaniti
frank.bruzzaniti at gmail.com
Wed Jul 30 19:39:39 BST 2008
Guys,
I was just playing around and added a bit of code to omindex.cc so I
could OCR TIFF files (image/tiff and image/tif) with gocr, which seems
to work. Here's what it looks like:
    // Tiff:
    } else if (startswith(mimetype, "image/tif"))
    {
        // Inspired by http://mjr.towers.org.uk/comp/sxw2text
        string safefile = shell_protect(file);
        string cmd = "tifftopnm " + safefile + " | gocr -f UTF8 -";
        try {
            dump = stdout_to_string(cmd);
        } catch (ReadError) {
            cout << "\"" << cmd << "\" failed - skipping\n";
            return;
        }
    // Tiff:End
I don't really understand all the code in omindex.cc, but I was
wondering if I could run OCR when no text is returned while processing
a PDF, as a way of dealing with image-only PDFs.
Here's the bit in omindex.cc that deals with PDFs:
    } else if (mimetype == "application/pdf") {
        string safefile = shell_protect(file);
        string cmd = "pdftotext -enc UTF-8 " + safefile + " -";
        try {
            dump = stdout_to_string(cmd);
        } catch (ReadError) {
            cout << "\"" << cmd << "\" failed - skipping\n";
            return;
        }
I want to change it so that if nothing (or no strings) is returned from
"pdftotext -enc UTF-8 " + safefile + " -", omindex then runs
"pdftoppm " + safefile + " | gocr -f UTF8 -" instead.
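A rough sketch of that fallback, in case it helps make the question concrete. The names has_text and extract_with_fallback are hypothetical helpers, not anything in omindex.cc. The sketch assumes that "nothing returned" should also cover output that is only whitespace, on the guess that pdftotext may emit blank lines or form feeds for an image-only PDF. It is untested against real PDFs.

```cpp
#include <cctype>
#include <functional>
#include <string>

// Hypothetical helper: true if s has at least one non-whitespace character.
// std::isspace also treats form feed ('\f') as whitespace.
bool has_text(const std::string& s) {
    for (unsigned char ch : s) {
        if (!std::isspace(ch)) return true;
    }
    return false;
}

// Hypothetical helper: run the primary extractor, and only if it yields no
// usable text, run the fallback extractor (for example OCR).
// Exceptions thrown by either extractor propagate to the caller.
std::string extract_with_fallback(const std::function<std::string()>& primary,
                                  const std::function<std::string()>& fallback) {
    std::string dump = primary();
    if (has_text(dump)) return dump;
    return fallback();
}

// Assumed use inside the existing PDF branch of omindex.cc, with the same
// try/catch (ReadError) around it as before:
//
//   dump = extract_with_fallback(
//       [&] { return stdout_to_string("pdftotext -enc UTF-8 " + safefile + " -"); },
//       [&] { return stdout_to_string("pdftoppm " + safefile + " | gocr -f UTF8 -"); });
```

Passing the two commands as callables means the OCR pipeline is only started when pdftotext comes back empty, so ordinary text PDFs pay no extra cost.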
P.S. I was able to write similar snippets of code to process docx and
xlsx. So far so good. If they test OK, should I post them somewhere or
email them to someone?
Thanks,
Frank