<html>

<head>

<style><!--

.hmmessage P

{

margin:0px;

padding:0px

}

body.hmmessage

{

font-size: 10pt;

font-family:Tahoma

}

--></style>

</head>

<body class='hmmessage'>

Thanks for the informative reply. &nbsp;I'll take a look at the patch you put up. &nbsp;Would there be a preferred language for the testing environment?<div><br></div><div>Sincerely,</div><div>Zongwei<br><br>&gt; Date: Tue, 22 Mar 2011 13:49:50 +0000<br>&gt; From: olly@survex.com<br>&gt; To: zli2009@hotmail.com<br>&gt; CC: xapian-devel@lists.xapian.org<br>&gt; Subject: Re: [Xapian-devel] Gsoc- Text Extraction Libraries<br>&gt; <br>&gt; On Mon, Mar 21, 2011 at 10:37:17PM -0700, Zongwei Li wrote:<br>&gt; &gt; My name is Zongwei, and I'm a 2nd year computer science major at UCLA.<br>&gt; &gt; I was interested in the text extraction library project, since I have<br>&gt; &gt; almost 2 years' experience with C++ and half a year with Linux/Unix.<br>&gt; &gt; As I look at the formats that Omega already supports, I see that there are a<br>&gt; &gt; lot of formats that only work if a certain program is included.  What<br>&gt; &gt; would be the most important formats to support first?  Based on the<br>&gt; &gt; ideas page, it seems that .zip, pdf, and .doc would be the most<br>&gt; &gt; helpful to have.<br>&gt; <br>&gt; The .zip format is used as a container format for some modern formats<br>&gt; (like Open Document Format).  Inside the .zip are various XML files<br>&gt; which we can already index with built-in code, so I think .zip is<br>&gt; probably a good one to do first as it will help with several formats.<br>&gt; <br>&gt; I just added a link to the patch the idea mentions, which adds support<br>&gt; for .doc via libwv2.  There are a few things to improve in the patch,<br>&gt; but the main issue I found is that libwv2 is a bit unreliable, and<br>&gt; will crash on some documents.  Perhaps libwv1 would be better.<br>&gt; <br>&gt; PDF is certainly a popular format too.<br>&gt; <br>&gt; &gt; Which formats would be preferred to be implemented<br>&gt; &gt; after those?  
Roughly speaking, how many would be a feasible number<br>&gt; &gt; for 12 weeks?<br>&gt; <br>&gt; I'd think you should be able to do quite a few in that time.  There's<br>&gt; some work needed on a framework for them (which my patch provides some<br>&gt; of) and you might find a different library throws up a reason to<br>&gt; tweak that (a new piece of metadata to index perhaps), but in general<br>&gt; you're likely to need less time on average for each additional format.<br>&gt; <br>&gt; It would be good to add tests too.  Currently we don't have indexing<br>&gt; tests for this (which is a sad omission), so it would need a test<br>&gt; framework, and sample documents with a suitable licence (might be<br>&gt; simplest to just create some) in the various formats.  Again, this<br>&gt; should get easier for each additional format.<br>&gt; <br>&gt; In general, it's a good idea to structure your project proposal as a<br>&gt; series of tasks each of which forms some sort of end point if you run<br>&gt; out of time.  So you'd implement, test, and document each part, and then<br>&gt; we can merge it and you can start on the next (which avoids a large and<br>&gt; potentially painful merge at the end).<br>&gt; <br>&gt; You can then define which of the tasks you really should complete, and<br>&gt; which are "stretch goals" to try for if you have the time.<br>&gt; <br>&gt; This doesn't work as well for some projects, but this one breaks down<br>&gt; naturally into a series of tasks.<br>&gt; <br>&gt; Cheers,<br>&gt;     Olly<br></div></body>

</html>