[Xapian-devel] [Xapian-discuss] Dealing with image PDF's

James Aylett james-xapian at tartarus.org
Sat Aug 2 17:08:20 BST 2008


On Sat, Aug 02, 2008 at 12:59:29AM +0100, Olly Betts wrote:

> > We could also use something similar to mailcap + mime.types, on
> > systems that support them.
> 
> The standard mailcap file entries are slanted too much towards human
> viewability rather than providing text in a suitable form for indexing
> without caring much about formatting.  And for images and video we
> want the meta-data rather than the content.  But the format might be
> a sane choice.

Yes, I meant more a text file configuration system in that style.
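
As a rough illustration of a text file configuration in that style, the
sketch below parses lines of the form "mime/type; command", with "%s"
standing for the file name as in mailcap. The format, the function name
parse_filter_config and the sample commands are all assumptions made for
the example; nothing here describes an existing Xapian feature.

```python
def parse_filter_config(text):
    """Parse hypothetical mailcap-style lines: 'mime/type; command'.

    Returns a dict mapping lower-cased MIME types to command templates.
    Blank lines, '#' comments and lines without ';' are ignored.
    """
    filters = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        mime_type, sep, command = line.partition(";")
        if not sep:
            continue  # not a 'type; command' line, so skip it
        filters[mime_type.strip().lower()] = command.strip()
    return filters


# Illustrative configuration only; the commands are examples.
SAMPLE_CONFIG = """\
# MIME type; command producing plain text on stdout
application/pdf; pdftotext %s -
application/msword; antiword %s
"""

filters = parse_filter_config(SAMPLE_CONFIG)
```

An indexer could then look up a document's MIME type in the resulting
dict and substitute the file name for "%s" before running the command.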

> Recoll uses a filter system which seems to be taken from Estraier.  It
> uses a shell script which does the work for each format, but it has to
> output HTML which often seems to require a run through sed to escape
> '<', '>', and '&', and then the indexer has to parse the HTML, which all
> seems a bit unnecessary.  But it might be nice to support such filter
> scripts as an option.

If there's a well-defined format it'd be nice, but it sounds a bit
messy. What we could do is write a script that takes that as input and
spits out whatever it is that we want, so you can chain any
Estraier/Recoll filters with the converter.
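
A minimal sketch of such a converter, assuming the filter emits ordinary
HTML and that plain text is what the indexer wants. The name html_to_text
is hypothetical, and the real filters may emit metadata (titles, meta
tags) that this sketch makes no attempt to keep separate.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    def __init__(self):
        # convert_charrefs=True decodes &lt;, &gt;, &amp; and friends for us.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text of an HTML document with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Wired up to read stdin and write stdout, something like this could sit at
the end of a pipeline after any filter that produces HTML.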

We went a little way down this road years ago with the XML indexer
system, but it was never a pleasant thing to use or work with. We
could probably get away with something along the lines of YAML as an
encapsulation; it's not like we need huge numbers of distinct blobs of
data to work with.
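
To make the idea concrete, here is a sketch that writes a handful of named
text blobs as YAML-style block scalars. The field names, the function name
encapsulate and the choice of the "|2-" block style are assumptions for
illustration; keys are assumed to be simple words, the values are assumed
to be printable text, and a real implementation would more likely lean on
a YAML library than on hand-rolled output.

```python
def encapsulate(fields):
    """Emit each field as 'key: |2-' followed by its text indented two spaces.

    The explicit indentation indicator (2) keeps text with leading spaces
    unambiguous; the '-' drops trailing newlines, which do not matter
    for indexing.
    """
    lines = []
    for key, value in fields.items():
        lines.append(f"{key}: |2-")
        for text_line in value.split("\n"):
            # Leave empty lines empty rather than emitting trailing spaces.
            lines.append("  " + text_line if text_line else "")
    return "\n".join(lines) + "\n"


record = encapsulate({
    "title": "A report",
    "body": "line one\nline two",
})
```

With only a few fields per document (say a title, some metadata and the
body text), the output stays readable by eye and easy to produce from a
shell script.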

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


