[Xapian-devel] Gsoc- Text Extraction Libraries

Olly Betts olly at survex.com
Tue Mar 22 13:49:50 GMT 2011


On Mon, Mar 21, 2011 at 10:37:17PM -0700, Zongwei Li wrote:
> My name is Zongwei, and I'm a 2nd year computer science major at UCLA.
> I was interested in the text extraction library project, since I have
> almost 2 years experience with C++ and half a year with Linux/Unix.
> As I look at the formats that Omega already supports, I see that there
> are a lot of formats that only work if a certain program is installed.  What
> would be the most important formats to support first?  Based on the
> ideas page, it seems that .zip, .pdf, and .doc would be the most
> helpful to have.

The .zip format is used as a container format for some modern formats
(like Open Document Format).  Inside the .zip are various XML files
which we can already index with built-in code, so I think .zip is
probably a good one to do first as it will help with several formats.
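
To illustrate the container idea, here is a minimal sketch in Python
(not Omega's actual code, which is C++).  It opens a .zip held in memory,
parses one XML member, and returns its text with the markup dropped.  The
function names, the element names in the sample, and the choice of
"content.xml" as the default member are assumptions for illustration; the
sample is not a valid Open Document file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_text_from_zip_xml(data: bytes, member: str = "content.xml") -> str:
    """Return the text of one XML member of a zip container, markup removed."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        with zf.open(member) as f:
            root = ET.parse(f).getroot()
    # itertext() yields every text node in document order.
    # Joining on a space and re-splitting normalises whitespace.
    words = " ".join(root.itertext()).split()
    return " ".join(words)


def make_sample() -> bytes:
    """Build a tiny zip with one XML member (hypothetical, not a real .odt)."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<doc><p>Hello <b>Xapian</b></p><p>indexing test</p></doc>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", xml)
    return buf.getvalue()


if __name__ == "__main__":
    print(extract_text_from_zip_xml(make_sample()))
```

Joining text nodes with a space is a simplification: it would split a word
that has inline markup in the middle of it, so a real extractor would need
to know which elements are block-level.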

I just added a link to the patch the idea mentions, which adds support
for .doc via libwv2.  There are a few things to improve in the patch,
but the main issue I found is that libwv2 is a bit unreliable, and
will crash on some documents.  Perhaps libwv1 would be better.

PDF is certainly a popular format too.

> Which formats would be preferred to be implemented
> after those?  Roughly speaking, how many would be a feasible amount
> for 12 weeks?

I'd think you should be able to do quite a few in that time.  There's
some work needed on a framework for them (which my patch provides some
of) and you might find a different library throws up a reason to
tweak that (a new piece of metadata to index perhaps), but in general
you're likely to need less time on average for each additional format.

It would be good to add tests too.  Currently we don't have indexing
tests for this (which is a sad omission), so it would need a test
framework, and sample documents with a suitable licence (might be
simplest to just create some) in the various formats.  Again, this
should get easier for each additional format.
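
As a rough sketch of what a table-driven indexing test could look like
(again in Python for brevity, and entirely hypothetical: the extractor
functions, file names, and sample contents are made up for illustration),
each case creates its own sample document, so there is no licensing
question, and then checks that an expected word comes out of the matching
extractor.

```python
import tempfile
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_txt(path: Path) -> str:
    """Plain text needs no conversion."""
    return path.read_text(encoding="utf-8")


def extract_zip_xml(path: Path) -> str:
    """Pull the text out of content.xml inside a zip container."""
    with zipfile.ZipFile(path) as zf, zf.open("content.xml") as f:
        return " ".join(ET.parse(f).getroot().itertext())


# Hypothetical dispatch table: file extension -> extractor.
EXTRACTORS = {".txt": extract_txt, ".odt": extract_zip_xml}


def make_txt(path: Path) -> None:
    path.write_text("plain sample about badgers", encoding="utf-8")


def make_odt(path: Path) -> None:
    # Not a valid Open Document file, just a zip with one XML member.
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("content.xml", "<doc><p>zipped sample about otters</p></doc>")


# One row per format: (file name, sample generator, word that must be found).
CASES = [
    ("sample.txt", make_txt, "badgers"),
    ("sample.odt", make_odt, "otters"),
]


def run_cases() -> dict:
    """Generate each sample, extract its text, and record pass/fail by name."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, make, expected in CASES:
            path = Path(tmp) / name
            make(path)
            text = EXTRACTORS[path.suffix](path)
            results[name] = expected in text.split()
    return results


if __name__ == "__main__":
    print(run_cases())
```

With this shape, supporting another format means adding one extractor and
one row to the table, which is why each additional format should cost less.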

In general, it's a good idea to structure your project proposal as a
series of tasks each of which forms some sort of end point if you run
out of time.  So you'd implement, test, and document each part, and then
we can merge it and you can start on the next (which avoids a large and
potentially painful merge at the end).

You can then define which of the tasks you really should complete, and
which are "stretch goals" to try for if you have the time.

This doesn't work as well for some projects, but this one breaks down
naturally into a series of tasks.

Cheers,
    Olly


