[Xapian-discuss] Xapian with djvu files?

John Pye john at curioussymbols.com
Mon Jan 14 23:39:49 GMT 2008


Olly Betts wrote:
> On Mon, Jan 14, 2008 at 07:25:45AM +0000, James Aylett wrote:
>   
>> On Mon, Jan 14, 2008 at 05:33:38PM +1100, John Pye wrote:
>>
>>     
>>> I was wondering if there was any support in Xapian for DJVU files. These
>>> are a nice alternative to PDF files -- much smaller file size, typically.
>>>       
>
> I did actually write a patch a while back for djvu.  I think I didn't
> apply it because I only actually found a single example file with a text
> layer, and that only had 20 words of ASCII text.  I like to have a
> few decent test files (including some with non-ASCII characters) to give
> me some confidence that a filter program actually works well.  It
> doesn't seem to be a popular format (John is the first person to ask
> about support for it), so I just left the matter.
>   

You can use a free online OCR tool to generate DJVU files that include
a text layer:

http://any2djvu.djvuzone.org/

The OCR tool there is not particularly amazing, but it certainly does
well enough for the text to be used for search indexing.

I believe the OCR tool is also available as a commercial product.

>   
>> There isn't at the moment, but it would be fairly easy to add support
>> into omindex(1) to use djvutxt to convert for indexing. djvutxt uses
>> UTF-8 already, so something like the following in
>> omindex.cc:index_file() around line 308 *should* do the trick
>> (untested!):
>>     
> [...]
>
> Yes, that looks about right, except the mime-type I have listed in
> /etc/mime.types is "image/vnd.djvu", which is the one registered with
> IANA:
>
> http://www.iana.org/assignments/media-types/image/vnd-djvu
>
>   
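
For anyone wanting to experiment outside omindex, here is a rough
standalone sketch of the same idea in Python. It is not the patch
elided above, just an illustration under a few assumptions: djvutxt
(from djvulibre) is on the PATH and writes the text layer to stdout as
UTF-8 when no output file is given, and the extension table and the
function names (mime_type_for, djvutxt_command, extract_text) are made
up for this example rather than taken from omindex.

```python
import subprocess

# The MIME type registered with IANA, as noted in the quoted text.
DJVU_MIME = "image/vnd.djvu"

# Hypothetical extension table for this sketch only; it does not
# reproduce omindex's own table.
MIME_BY_EXT = {"djvu": DJVU_MIME, "djv": DJVU_MIME}


def mime_type_for(path):
    """Return the MIME type for a path based on its extension, or None."""
    _, dot, ext = path.rpartition(".")
    if not dot:
        return None  # no extension at all
    return MIME_BY_EXT.get(ext.lower())


def djvutxt_command(path):
    """Build the argument list used to dump a DjVu file's text layer."""
    # Passing a list (not a shell string) avoids quoting problems with
    # unusual filenames.
    return ["djvutxt", path]


def extract_text(path):
    """Run djvutxt and return the text layer, which may be empty."""
    # Assumption: djvutxt is installed; check=True raises if it fails.
    result = subprocess.run(djvutxt_command(path),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

A file with no text layer should simply give back little or no text,
so an indexer would probably want to treat empty output as "nothing to
index" rather than as an error.
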
>> However I have to wonder why you want to - djvu is primarily an image
>> file format, although it has support for mixed text and images. I
>> admit I hadn't heard of it before now though, so perhaps the website
>> [1] is a little misleading about the primary use.
>>     

I would say that DJVU is primarily a file format for scanned
*documents*, which may or may not include OCRed text. The same is true
of PDFs: many online journals have scanned their back catalogues and
incorporated OCR text into the PDF data. The same thing can be achieved
with DJVU, but with much smaller resulting files.

FWIW Evince supports DJVU files on Linux, too, provided you have the
djvulibre library installed.

Cheers
JP


