[Xapian-discuss] GSoC 2012

Olly Betts olly at survex.com
Sun Feb 19 20:42:00 GMT 2012


On Wed, Feb 15, 2012 at 10:37:16PM -0800, Liam wrote:
> Re: Text-Extraction Libraries, starting a new process isn't expensive (on
> the order of 40usec for Linux, I believe), and prevents crashing the main
> program. So the benefit of libraries vs apps would be saving any
> extractor-specific initialization time, which I'd guess would be pretty
> low. If init time is a factor for some extractors, one could rev those
> programs (if source available) to accept a sequence of filenames via stdin
> or other input stream.

If you look at the prototype patch, you'll see this is pretty much what
it already does.

There's a small helper program which links to libwv2, takes a filename
on stdin, and sends back the text for the title, body, etc.  That's
better than we can achieve with an external extractor, where we'd have
to either run a separate command for the metadata or get it to output
HTML which we then have to parse.

The helper program is a separate process, so we don't crash omindex if
the extractor crashes, and the helper is restarted automatically if we
come to reuse it and find it isn't running.
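
To illustrate the arrangement, here's a minimal sketch in Python.  It is
not the code from the prototype patch: the one-line-per-request protocol,
the ExtractorHelper class and the stand-in helper script are all
assumptions made up for the example, and the real helper returns
separate fields (title, body, etc.) rather than a single line.

```python
import subprocess
import sys

# Hypothetical stand-in for the real helper: reads one filename per line on
# stdin and writes one line of "extracted text" back.  The special name
# "crash" makes it exit, simulating an extractor crash.
HELPER_SOURCE = r"""
import sys
for line in sys.stdin:
    name = line.rstrip("\n")
    if name == "crash":
        sys.exit(1)
    sys.stdout.write("body of " + name + "\n")
    sys.stdout.flush()
"""


class ExtractorHelper:
    """Keeps one helper process around and restarts it on demand."""

    def __init__(self):
        self.proc = None
        self.starts = 0  # how many times the helper has been launched

    def _ensure_running(self):
        # poll() returns None while the child is still alive.
        if self.proc is None or self.proc.poll() is not None:
            self.proc = subprocess.Popen(
                [sys.executable, "-c", HELPER_SOURCE],
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                text=True,
            )
            self.starts += 1

    def extract(self, filename):
        """Return the extracted text, or None if the helper died."""
        # Assumes filenames contain no newline (a limit of this toy protocol).
        self._ensure_running()
        try:
            self.proc.stdin.write(filename + "\n")
            self.proc.stdin.flush()
        except BrokenPipeError:
            self.proc.wait()
            return None
        reply = self.proc.stdout.readline()
        if reply == "":
            # EOF: the helper exited without answering.  Reap it so the
            # next call sees that it is gone and starts a fresh one.
            self.proc.wait()
            return None
        return reply.rstrip("\n")

    def close(self):
        # Closing stdin ends the helper's read loop, so it exits cleanly.
        if self.proc is not None and self.proc.poll() is None:
            self.proc.stdin.close()
            self.proc.wait()
```

The caller only sees a failed extraction for the file that triggered the
crash; the next call to extract() notices the helper has gone and starts
a new one, so the start-up cost is paid once per crash rather than once
per file.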

> Wouldn't handling archive files (tar, zip) be the more pressing need in
> this area?

I would say "more pressing" is a subjective assessment, but feel free
to add suitable project ideas to the list if you are (or have) someone
willing to mentor them.  Try to write the idea up so that it is easy
to understand for a student who isn't intimately familiar with the
area already, with some "resources" for further reading and a list of
required or useful skills.

> Re: Support Another Language, you might mention the Node.js binding I've
> been working on? It could use a LOT more Xapian features. I'd be glad to
> mentor for that. https://github.com/networkimprov/node-xapian

Again, if it's a suitably scoped project (I have little idea of what is
involved) and you are willing to mentor, feel free to add it to the
list.

Cheers,
    Olly


