[Xapian-discuss] Mime2Text library, derived from omindex

Olly Betts olly at survex.com
Mon Jan 16 04:34:39 GMT 2012


I've created a ticket to track this, as I'm unlikely to get to it for a
while yet:

http://trac.xapian.org/ticket/583

On Fri, Nov 11, 2011 at 03:32:29PM -0800, Liam wrote:
> On Thu, Nov 10, 2011 at 9:19 PM, Olly Betts <olly at survex.com> wrote:
> 
> > Though this presumably isn't intended to go in the core API
> 
> I've been thinking... This could reasonably be a package of xapian-core;
> you probably shouldn't have to install the omega package just for this
> lib...

Extracting text from file types isn't very "core" though, and there
aren't any clear benefits from closer integration with core features
that I can see.

But there's no reason why it can't be packaged separately anyway.

On Fri, Nov 11, 2011 at 12:55:27AM -0800, Liam wrote:
> On Thu, Nov 10, 2011 at 9:19 PM, Olly Betts <olly at survex.com> wrote:
> 
> > On Wed, Nov 09, 2011 at 12:22:22PM -0800, Liam wrote:
> > > I've started a branch for a patch. Would love any feedback
> > >
> > > Header
> > >
> > https://github.com/networkimprov/xapian/blob/liam_mime2text-lib/xapian-applications/omega/mime2text.h
> >
> > Some thoughts from looking at just the header:
> >
> > I don't like that this API forces you to have a file on disk.  If you're
> > unpacking a container format you may well have data in memory, and while
> > it might be necessary to write it to disk to pass it to some extractors,
> > we really want to avoid that for extractors with a library API, or where
> > we can feed data to an external filter on stdin.
> 
> The existing code expects a filename, so I feel we should stick with that
> for the first version of this, although I agree it should take a stream
> type eventually.

For code only used in omindex, that's fine - we control the caller(s)
and can just change how things work if needed.

But if we're going to split this off into a library, we're committing to
support the API we provide for a significant length of time, and to
providing sane upgrade paths for any changes which later get made.

> > Why does the public API here take const char * parameters not const
> > std::string&?  If we really want const char * internally and the overhead
> > of a conversion to std::string matters for the case where the user has
> > a const char * to start with, we should provide overloaded forms which
> > call c_str() for the user.
> 
> I'm already converting to std::string for the filename internally.

Then just make the API take const std::string &, and that conversion
will happen at the call site if the user passes const char *.

> > Why is "command" in Mime2Text::Fields?  It doesn't seem to be a field.
> 
> It's for informational purposes, what external command produced these
> results, if any.

It isn't a field though.

On Tue, Nov 22, 2011 at 10:26:52PM -0800, Liam wrote:
> On Wed, Nov 9, 2011 at 12:22 PM, Liam <xapian at networkimprov.net> wrote:
> 
> > Continuing the "omindex library or daemon" thread in a new vein...
> 
> load_file() in omega/loadfile.cc (part of the pending Mime2Text lib) calls
> 
>   posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
> 
> once, before closing the fd. In order to minimally impact the filesystem
> cache, I suspect it should call that after each read()?

Benchmark and see?  It's a lot more syscalls to call it for every read,
which is an overhead for everyone, whereas there's probably only much
benefit from marking it as "DONTNEED" more often if you're tight on RAM
and/or processing a lot of huge files.  So it seems unwise to make this
change without some evidence.

> Also, the read buffer is only 4KB. It might be considerably more efficient
> if sized to the filesystem block size?

Benchmark and see?

Cheers,
    Olly



More information about the Xapian-discuss mailing list