[Xapian-devel] Gsoc- Text Extraction Libraries

Tue Mar 22 18:29:11 GMT 2011

Thanks for the informative reply.  I'll take a look at the patch you put up.  Would there be a preferred language for the testing environment?
Sincerely,
Zongwei

> Date: Tue, 22 Mar 2011 13:49:50 +0000
> From: olly at survex.com
> To: zli2009 at hotmail.com
> CC: xapian-devel at lists.xapian.org
> Subject: Re: [Xapian-devel] Gsoc- Text Extraction Libraries
> 
> On Mon, Mar 21, 2011 at 10:37:17PM -0700, Zongwei Li wrote:
> > My name is Zongwei, and I'm a 2nd year computer science major at UCLA.
> > I was interested in the text extraction library project, since I have
> > almost 2 years of experience with C++ and half a year with Linux/Unix.
> > As I look at the formats that Omega already supports, I see that there
> > are a lot of formats that only work if a certain program is included.  What
> > would be the most important formats to support first?  Based on the
> > ideas page, it seems that .zip, .pdf, and .doc would be the most
> > helpful to have.
> 
> The .zip format is used as a container format for some modern formats
> (like Open Document Format).  Inside the .zip are various XML files
> which we can already index with built-in code, so I think .zip is
> probably a good one to do first as it will help with several formats.
> 
> I just added a link to the patch the idea mentions, which adds support
> for .doc via libwv2.  There are a few things to improve in the patch,
> but the main issue I found is that libwv2 is a bit unreliable, and
> will crash on some documents.  Perhaps libwv1 would be better.
> 
> PDF is certainly a popular format too.
> 
> > Which formats would be preferred to be implemented
> > after those?  Roughly speaking, how many would be a feasible amount
> > for 12 weeks?
> 
> I'd think you should be able to do quite a few in that time.  There's
> some work needed on a framework for them (which my patch provides some
> of) and you might find a different library throws up a reason to
> tweak that (a new piece of metadata to index perhaps), but in general
> you're likely to need less time on average for each additional format.
> 
> It would be good to add tests too.  Currently we don't have indexing
> tests for this (which is a sad omission), so it would need a test
> framework, and sample documents with a suitable licence (might be
> simplest to just create some) in the various formats.  Again, this
> should get easier for each additional format.
> 
> In general, it's a good idea to structure your project proposal as a
> series of tasks each of which forms some sort of end point if you run
> out of time.  So you'd implement, test, and document each part, and then
> we can merge it and you can start on the next (which avoids a large and
> potentially painful merge at the end).
> 
> You can then define which of the tasks you really should complete, and
> which are "stretch goals" to try for if you have the time.
> 
> This doesn't work as well for some projects, but this one breaks down
> naturally into a series of tasks.
> 
> Cheers,
>     Olly
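
To illustrate the point above about .zip being a container for formats like
Open Document Format, here is a minimal Python sketch that opens such a
container, reads an XML member, and pulls out the plain text.  It is only an
illustration under stated assumptions, not how Omega does it: the function
names extract_odf_text and make_sample_document are hypothetical, and the
sketch assumes the document body lives in a member called content.xml, as it
does in ODF files.  The second helper builds a tiny sample document in memory,
in the spirit of the suggestion to create test documents rather than hunt for
suitably licensed ones.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespaces used by Open Document Format content.xml files.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_odf_text(source):
    """Return the plain text of content.xml inside a .zip container.

    `source` may be a filename or a binary file-like object.
    (Hypothetical helper, not part of Xapian or Omega.)
    """
    with zipfile.ZipFile(source) as container:
        xml_bytes = container.read("content.xml")
    root = ET.fromstring(xml_bytes)
    # itertext() yields every text node in document order;
    # split() and join() collapse runs of whitespace.
    return " ".join(" ".join(root.itertext()).split())


def make_sample_document(paragraphs):
    """Build a minimal ODF-like .zip in memory for use in tests.

    (Hypothetical helper: a real ODF file also has a mimetype member,
    a manifest, styles, and so on.)
    """
    body = "".join(f"<text:p>{escape(p)}</text:p>" for p in paragraphs)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}">'
        f"<office:body><office:text>{body}</office:text></office:body>"
        "</office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as container:
        container.writestr("content.xml", xml)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    sample = make_sample_document(["Hello world", "Xapian & Omega"])
    print(extract_odf_text(sample))
```

An indexing test could then create a sample document this way, run the
extractor over it, and check that the expected words come back out.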
