[Xapian-devel] [Xapian-discuss] Dealing with image PDF's
James Aylett
james-xapian at tartarus.org
Sat Aug 2 17:08:20 BST 2008
On Sat, Aug 02, 2008 at 12:59:29AM +0100, Olly Betts wrote:
> > We could also use something similar to mailcap + mime.types, on
> > systems that support them.
>
> The standard mailcap file entries are slanted too much towards human
> viewability rather than providing text in a suitable form for indexing
> without caring much about formatting. And for images and video we
> want the meta-data rather than the content. But the format might be
> a sane choice.
Yes, I meant more a text file configuration system in that style.
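To make the idea concrete, here is a rough sketch of what a mailcap-style text configuration could look like, and how an indexer might read it. Nothing here is an existing Xapian or Omega format: the `type; command` layout, the `%s` placeholder, the `image/*` wildcard, and the names `parse_filters` and `command_for` are all assumptions for illustration, and the three commands are just plausible examples of tools that write text or metadata to stdout.

```python
import shlex

# Hypothetical config: "MIME type; command", with %s standing for the input file.
SAMPLE_CONFIG = """\
# Commands are expected to write plain text (or metadata) to stdout.
application/pdf; pdftotext %s -
application/postscript; ps2ascii %s
image/*; exiftool %s
"""


def parse_filters(text):
    """Return a dict mapping MIME type (or 'major/*') to a command template."""
    filters = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        mime_type, _, command = line.partition(";")
        filters[mime_type.strip().lower()] = command.strip()
    return filters


def command_for(filters, mime_type, path):
    """Build an argv list for the given file, or None if no filter matches."""
    mime_type = mime_type.lower()
    template = filters.get(mime_type)
    if template is None:
        # Fall back to a wildcard entry such as "image/*".
        major = mime_type.split("/", 1)[0]
        template = filters.get(major + "/*")
    if template is None:
        return None
    # Substitute the path as a single argument, so no shell quoting is needed.
    return [path if arg == "%s" else arg for arg in shlex.split(template)]
```

The returned list could be handed straight to something like `subprocess.run(argv, capture_output=True)`, which avoids passing file names through a shell.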
> Recoll uses a filter system which seems to be taken from Estraier. It
> uses a shell script which does the work for each format, but it has to
> output HTML which often seems to require a run through sed to escape
> '<', '>', and '&', and then the indexer has to parse the HTML, which all
> seems a bit unnecessary. But it might be nice to support such filter
> scripts as an option.
If there's a well-defined format it'd be nice, but it sounds a bit
messy. What we could do is write a script that takes that as input and
spits out whatever it is that we want, so you can chain any
Estraier/Recoll filters with the converter.
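A converter of that kind might look roughly like the following. This is only a sketch: it assumes the filter emits ordinary HTML with an optional `<title>` and `<meta name=... content=...>` tags, which I have not checked against what the Estraier or Recoll filters actually produce, and the names `FilterOutputParser` and `convert` are made up for the example.

```python
from html.parser import HTMLParser


class FilterOutputParser(HTMLParser):
    """Collect title, meta fields and body text from a filter's HTML output."""

    def __init__(self):
        # convert_charrefs=True turns &lt; &gt; &amp; back into plain characters.
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name") and attrs.get("content") is not None:
                self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth:
            self.text_parts.append(data)


def convert(html_text):
    """Return the indexable pieces of a filter's HTML output as a dict."""
    parser = FilterOutputParser()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace; formatting does not matter for indexing.
    body = " ".join(" ".join(parser.text_parts).split())
    return {"title": parser.title.strip(), "meta": parser.meta, "text": body}
```

In a pipeline this would sit between the existing filter script and the indexer, reading the filter's stdout and passing on the resulting fields in whatever form we settle on.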
We went a little way down this road years ago with the XML indexer
system, but it was never a pleasant thing to use or work with. We
could probably get away with something along the lines of YAML as an
encapsulation; it's not like we need huge numbers of distinct blobs of
data to work with.
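As a rough illustration of how small such an encapsulation could be, here is a sketch that writes each document as a `---` line followed by `key: "value"` lines. It is YAML-like rather than a real YAML implementation: values are restricted to strings and written with JSON quoting, which YAML parsers generally accept as double-quoted scalars for ordinary text, and edge cases are not handled. The field names and the functions `dump_record` and `load_records` are hypothetical.

```python
import json


def dump_record(fields):
    """Serialise one document's fields as a small YAML-like block."""
    lines = ["---"]  # document separator, as in a YAML stream
    for key, value in fields.items():
        if not key.isidentifier():
            raise ValueError("unsupported field name: %r" % key)
        if not isinstance(value, str):
            raise TypeError("only string values are supported in this sketch")
        # JSON quoting escapes newlines and quotes, keeping one field per line.
        lines.append("%s: %s" % (key, json.dumps(value)))
    return "\n".join(lines) + "\n"


def load_records(text):
    """Read back blocks written by dump_record (not a general YAML parser)."""
    records = []
    current = None
    for line in text.splitlines():
        if line == "---":
            current = {}
            records.append(current)
        elif line and current is not None:
            key, _, value = line.partition(": ")
            current[key] = json.loads(value)
    return records
```

A filter would then only need to print a handful of fields, say `title`, `author` and `text`, and the indexer could read them back without parsing any markup.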
J
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james at tartarus.org uncertaintydivision.org