[Xapian-discuss] Omindex Filters

Frank J Bruzzaniti frank.bruzzaniti at gmail.com
Wed Sep 17 10:50:16 BST 2008


How about XML for the output, so we can incorporate any additional meta-data?
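
As a rough sketch of the idea, a filter could wrap its extracted text and meta-data like this. The <document>, <meta> and <text> element names and both function names are invented for illustration and are not an existing omindex format. The sketch also assumes the text coming in is already UTF-8.

```shell
#!/bin/sh
# Sketch only: the element names below are hypothetical, not an omindex format.
# Assumes the title, author and body text are already UTF-8.

# Escape the characters that are special in XML text and attribute values.
xml_escape() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

# $1 = title, $2 = author; the extracted body text is read from stdin.
emit_xml() {
    printf '<?xml version="1.0" encoding="UTF-8"?>\n'
    printf '<document>\n'
    printf '  <meta name="title">%s</meta>\n' "$(printf '%s' "$1" | xml_escape)"
    printf '  <meta name="author">%s</meta>\n' "$(printf '%s' "$2" | xml_escape)"
    printf '  <text>'
    xml_escape
    printf '</text>\n</document>\n'
}
```

A filter that already writes text to stdout could then be piped through it, for example `sometool file | emit_xml "Some title" "Some author"`, where `sometool` stands for whatever converter is in use.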

How about using tmpfs for processing? It could be faster than writing 
to disk all the time, and it could be wiped before or after each batch.
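
A rough sketch of how that could look for a converter that can only write to a file, while avoiding the predictable /tmp/$$.txt name Olly warns about below. It assumes /dev/shm is a tmpfs mount (typical on Linux, but not guaranteed). The function name filter_to_stdout is made up for this example, and foo2txt is the hypothetical converter from Olly's script.

```shell
#!/bin/sh
# Sketch only: assumes /dev/shm is tmpfs; filter_to_stdout is a made-up name.

# Run a converter that can only write to a file, then print what it wrote.
# $1 = converter command, $2 = input file
filter_to_stdout() {
    # Prefer tmpfs when it is present and writable, else fall back to TMPDIR or /tmp.
    base=/dev/shm
    [ -d "$base" ] && [ -w "$base" ] || base=${TMPDIR:-/tmp}

    # mktemp -d creates a fresh private directory with an unpredictable name,
    # so nobody else can plant a symlink at the output path beforehand.
    dir=$(mktemp -d "$base/omindex.XXXXXX") || return 1

    # Only print the output if the converter succeeded.
    "$1" "$2" "$dir/out.txt" && cat "$dir/out.txt"
    status=$?

    # Wipe the scratch directory whether or not the converter worked.
    rm -rf "$dir"
    return $status
}
```

A wrapper would then call it as `filter_to_stdout foo2txt "$1"`. This still costs extra processes per file, so it only addresses the temporary-file safety and disk I/O, not the forking overhead.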


Olly Betts wrote:
> On Mon, Sep 15, 2008 at 01:33:45PM +0100, James Aylett wrote:
>   
>> On Mon, Sep 15, 2008 at 09:59:16PM +0930, Frank J Bruzzaniti wrote:
>>
>>     
>>> I was wondering if it would be a bad idea to have a way to incorporate 
>>> plugins/filters in a way that would allow us to chop and change filters 
>>> without having to recompile and edit the source.
>>>       
>> We've discussed this in the past, and certainly I'm in favour of
>> it.
>>     
>
> Yes, it would be useful.
>
> However, I think it's important to think about the right way to specify
> such filters - an "executable or script which decodes the input file as
> text to stdout" isn't all that general.  Adding the apparently implicit
> assumption that the output is UTF-8, then of the existing formats
> supported, that covers only these:
>
> 	application/msword
> 	application/vnd.ms-excel
> 	application/vnd.ms-powerpoint
> 	application/vnd.ms-works
> 	application/vnd.wordperfect
> 	image/vnd.djvu
>
> I think you need to at least consider the character set and format of
> the output (plain text and HTML are common), and possibly also filters
> which can only produce output to a file, not stdout.  Meta-data is
> another issue (look at the PDF handling for example).
>
> It's true that some such issues can be handled to at least some extent
> with a wrapper script around the command, but then you're adding the
> overhead of forking several extra commands per file processed, which
> is better avoided.
>
> We also don't want to encourage hacky handling of temporary files as
> that's a route straight to security bugs via symlink attacks - an
> obvious but bad approach to handling output to a file is a wrapper
> script like this one:
>
>     #!/bin/sh
>     foo2txt "$1" /tmp/$$.txt
>     cat /tmp/$$.txt
>     rm /tmp/$$.txt
>
>   
>> Before going to a configuration file, I'd suggest building in a
>> CLI option to set filters, as that will require some internal plumbing
>> to get working anyway, and not having to use a configuration file is a
>> good thing IMO.
>>     
>
> I don't see this as a good thing to specify on the command line.  Nobody
> is going to type these in by hand for every run, so this just means you
> need a script to launch omindex with the right options.  And that's just
> a config file with a nasty syntax (since shell meta characters need
> protecting).  Also, if we have an explicit configuration file, we can 
> look at its timestamp to check for changes to it.
>
> And it really is time that omindex finally got a proper configuration
> file.
>
> Cheers,
>     Olly
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>   


