<html>
<head>
<style><!--
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
font-size: 10pt;
font-family:Tahoma
}
--></style>
</head>
<body class='hmmessage'>
Thanks for the informative reply. I'll take a look at the patch you put up. Would there be a preferred language for the testing environment?<div><br></div><div>Sincerely,</div><div>Zongwei<br><br>> Date: Tue, 22 Mar 2011 13:49:50 +0000<br>> From: olly@survex.com<br>> To: zli2009@hotmail.com<br>> CC: xapian-devel@lists.xapian.org<br>> Subject: Re: [Xapian-devel] Gsoc- Text Extraction Libraries<br>> <br>> On Mon, Mar 21, 2011 at 10:37:17PM -0700, Zongwei Li wrote:<br>> > My name is Zongwei, and I'm a 2nd year computer science major at UCLA.<br>> > I was interested in the text extraction library project, since I have<br>> > almost 2 years' experience with C++ and half a year with Linux/Unix.<br>> > As I look at the formats that Omega already supports, I see that there are a<br>> > lot of formats that only work if a certain program is included. What<br>> > would be the most important formats to support first? Based on the<br>> > ideas page, it seems that .zip, pdf, and .doc would be the most<br>> > helpful to have.<br>> <br>> The .zip format is used as a container format for some modern formats<br>> (like Open Document Format). Inside the .zip are various XML files<br>> which we can already index with built-in code, so I think .zip is<br>> probably a good one to do first as it will help with several formats.<br>> <br>> I just added a link to the patch the idea mentions, which adds support<br>> for .doc via libwv2. There are a few things to improve in the patch,<br>> but the main issue I found is that libwv2 is a bit unreliable, and<br>> will crash on some documents. Perhaps libwv1 would be better.<br>> <br>> PDF is certainly a popular format too.<br>> <br>> > Which formats would be preferred to be implemented<br>> > after those? Roughly speaking, how many would be a feasible amount<br>> > for 12 weeks?<br>> <br>> I'd think you should be able to do quite a few in that time.
There's<br>> some work needed on a framework for them (which my patch provides some<br>> of) and you might find a different library throws up a reason to<br>> tweak that (a new piece of metadata to index perhaps), but in general<br>> you're likely to need less time on average for each additional format.<br>> <br>> It would be good to add tests too. Currently we don't have indexing<br>> tests for this (which is a sad omission), so it would need a test<br>> framework, and sample documents with a suitable licence (might be<br>> simplest to just create some) in the various formats. Again, this<br>> should get easier for each additional format.<br>> <br>> In general, it's a good idea to structure your project proposal as a<br>> series of tasks each of which forms some sort of end point if you run<br>> out of time. So you'd implement, test, and document each part, and then<br>> we can merge it and you can start on the next (which avoids a large and<br>> potentially painful merge at the end).<br>> <br>> You can then define which of the tasks you really should complete, and<br>> which are "stretch goals" to try for if you have the time.<br>> <br>> This doesn't work as well for some projects, but this one breaks down<br>> naturally into a series of tasks.<br>> <br>> Cheers,<br>> Olly<br></div>                                            </body>
</html>