[Xapian-devel] Draft Application for GSoC 11 - Text extraction libraries - please review

nijil yes nonijil at yahoo.co.in
Mon Mar 28 09:26:22 BST 2011


 
Proposal for Google Summer of Code 2011 (draft)
Applying organisation: Xapian
Name : Nijil.Y
E-mail address: nijil.y at gmail.com
IRC nickname : laserbled


Biography
I am a 4th-year Computer Science and Engineering undergraduate student at CUSAT 
University in India. I am interested in open source and search engines; 
cluster computing, HPC and AI are my other areas of interest.
	* Analytical and detail oriented, with strong programming skills; work 
diligently on long, challenging assignments.
	* Maintain excellent interpersonal communication, time management, and 
problem resolution skills.
	* Quick to get accustomed to new technologies.
Eligibility
I am fully eligible under Google's rules and will be available throughout the 
timespan given for GSoC 2011.
Background Information


1:Have you taken part in GSoC and/or  GHOP and/or  GCI before? If so, in what 
role(s)? Tell us about how it went, and any areas you would have liked more help 
with. 

Ans: I applied for GSoC 2010 to an organization called the Berkman Center, which 
is a wing of Harvard University. It was for a news indexing application called 
Mediacloud. I was not selected as there were more deserving candidates.
2:Please tell us about any previous experience you have with Xapian, or other 
systems for indexed text search. 

Ans: I have long been interested in web search engines, indexing and crawling. 
I have written a small indexing and searching application in Java which could 
index from multiple servers and extract content from HTML, PDF, office 
formats, PPTs etc. I wanted to extend it to distributed indexing but was not 
able to continue. I have also built a small content extraction tool with Perl 
and CPAN modules. I have been reading and dissecting the Xapian code for the 
past 2 weeks and am feeling comfortable with it, as it is neat and organised. 
I have been concentrating on the Omega tool provided with the Xapian package 
and am looking forward to working with it. I have been reading omindex.cc for 
a while now and am trying to learn more about the system.
3:Do you have previous experience with Free Software and Open Source other than 
Xapian? 

Ans: I joined the Dreamwidth community a while back. It is a journaling tool, 
and that helped me learn Perl a bit. I have not been active though; I think I 
submitted one small patch. I am now part of linuxpmi (Linux Process Migration 
Infrastructure), a community which maintains kernel patches for process 
migration. The group has hardly 10 members, is called the new openMosix, and 
was in a hibernation state when I joined a couple of months back. I plan to 
work with them as it is in line with my interests.
4:Do you have any other relevant prior experience (courses taken at college, 
hobbies, holiday jobs, etc)? 

Ans: I have taken all the usual courses in Computer Science. I have built an 
office automation system to maintain fees, student details and admission logs 
for my university, which has around 3000 students, along with the details of 
faculty. The platforms used were Visual Studio, ASP.NET and MySQL.
My major project is a study of SSI virtualisation using code extraction so that 
multithreaded applications can be parallelised and run on cluster nodes. 
The platform used is C and QEMU, so thread migration etc. are possible 
implementations. Completing the whole project will take some time though. I 
also play around with my system hardware and do the tweaking myself.
5:What development platforms, tools and methods do you prefer to use? 
Ans: I have been using Linux exclusively for the past one and a half years. 
Before that I was a partial user. I work on an Ubuntu 10.10 box. My tools are 
vim, grep, geany etc.
6:Have you previously been responsible (as an employee/volunteer/student/etc) 
for a project of a similar size? 

Ans:Projects done till now are given at the end of this document.
7:What timezone will you be in during the coding period? 
Ans: UTC/GMT +5:30 hours. I will be in India during that time. My working hours 
can be flexible as I am mostly a night bird, so that won't be an issue.
8:Will your Summer of Code project be the main focus of your time during the 
program? 

Ans: Oh yes. Absolutely.
9:How many hours a week will you realistically be able to devote to your 
project? 

Ans: I would work on the project 8-9 hrs a day at the minimum. That comes to 
around 50-55 hrs over six days. I am planning to take Sundays off, but of 
course if I fall behind schedule I am more than willing to work on Sundays 
too. No issues.
10:Are you applying for other projects in GSoC 2011? If so, with which 
organisations? (We don't regard you applying for other projects negatively, but 
we like to know so that we can plan for possible scenarios when assigning 
mentors). 

Ans: No. I debated it a lot. Since I won't be able to do any preliminary work 
on other orgs' projects, I couldn't call myself committed to them, and I 
believe it takes more than just writing a proposal and submitting it. So I 
thought I had better stick to one org, work on that and try my luck.
Project
Title : Text-Extraction Libraries to index file-formats  
Summary 
Currently Omega has built-in support for HTML, plain text, and uncompressed 
AbiWord documents. Other files have their text extracted using external 
programs, which causes overhead. I am planning to use libraries to replace 
these external programs while preserving and improving the list of supported 
file types. Along with that I plan to add support for new file types: audio, 
email and, if possible, 3D file formats, archive formats, packaging formats 
and database files. Another nifty feature I plan to develop is a thumbnail 
generation system which will add an entry for the thumbnail generated from a 
file so that it can be viewed during retrieval. The main aim is to avoid 
external programs eating up CPU time and to reduce that cost with the help of 
libraries.
Why have you chosen this project?  
I am a supporter of open source and am interested in indexing and 
search-engine based services. I found Xapian interesting as it is something I 
could put to use daily and where I would be doing something I love. Also, 
since I have conceptual knowledge of these areas, I am hoping that will come 
to my aid.

Benefits
The open source community would definitely benefit, and so will the Xapian 
user base. As of now the Omega indexer doesn't support most file formats 
natively. It supports them by making use of external programs, thus increasing 
its CPU footprint. Also, for each format one or more external filter programs 
need to be run. We can completely avoid that once this project is over. The 
user base is also likely to increase, as users will be getting an indexer with 
better and more robust file indexing support. It could then be fitted with a 
GUI and made into a desktop indexing and search application.
Project Details
A: The project I plan to undertake can be summarized as below
1: Replace existing external filter programs with shared libraries.
2: Add new file-format support.
3: Add thumbnail generation feature.
4: Add a testing framework.
5: Minimize the 'ignore' file list.
B:What is new or different about your approach which hasn't been done or wasn't 
possible before? 

Currently we require external programs like xpdf, unzip, xls2csv, catdoc etc., 
which the user needs to have installed on their computer to index the 
corresponding formats. But this has a problem. It requires a new process to be 
started every time we come across such a file, and the external program is 
then run to extract the contents and metadata. Each new process started is 
extra load on the CPU and thus increases the footprint of the Xapian system. 
So the possible option is to replace those filters and use shared libraries 
instead. That would take care of that issue. Adding new file-format support 
would also definitely increase the usability and flexibility of Xapian and 
Omega.
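
To make the idea concrete, here is a rough sketch of how in-process 
extractors could be looked up by MIME type instead of starting an external 
filter for each file. This is only an illustration under my own assumptions: 
ExtractorRegistry, Extractor and strip_tags are hypothetical names, not 
existing Omega code, and strip_tags is a toy stand-in for a real library call.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical in-process extractor: raw file contents in, plain text out.
using Extractor = std::function<std::string(const std::string&)>;

// Hypothetical registry mapping a MIME type to a library-backed extractor.
class ExtractorRegistry {
  public:
    void add(const std::string& mimetype, Extractor fn) {
        assert(fn && "extractor must be callable");
        extractors_[mimetype] = std::move(fn);
    }

    // Returns true and fills `text` when an in-process extractor exists.
    // Returns false (leaving `text` untouched) when the caller would have
    // to fall back to an external filter program.
    bool extract(const std::string& mimetype, const std::string& data,
                 std::string& text) const {
        auto it = extractors_.find(mimetype);
        if (it == extractors_.end()) return false;
        text = it->second(data);
        return true;
    }

  private:
    std::map<std::string, Extractor> extractors_;
};

// Toy stand-in for a real library: drops everything between '<' and '>'.
std::string strip_tags(const std::string& html) {
    std::string out;
    bool in_tag = false;
    for (char c : html) {
        if (c == '<') in_tag = true;
        else if (c == '>') in_tag = false;
        else if (!in_tag) out += c;
    }
    return out;
}
```

With something like this, the indexer would register one extractor per 
supported type at startup and only start an external program when extract() 
returns false.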
Another point would be to build a testing framework which would test the 
effectiveness of the system and check that the indexing is flawless. A 
framework needs to be built as currently there isn't one.
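
As a rough idea of what such a framework could look like, the sketch below 
runs sample inputs through an extractor and reports the cases whose output 
does not match the expected text. All names here (ExtractionCase, run_cases, 
normalise_space) are hypothetical and chosen only for illustration; a real 
framework would read sample files from disk and call the actual extraction 
code.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One regression case: sample input and the text the extractor should yield.
struct ExtractionCase {
    std::string name;
    std::string input;
    std::string expected;
};

// Runs every case through `extract` and returns the names of failed cases.
std::vector<std::string> run_cases(
    const std::function<std::string(const std::string&)>& extract,
    const std::vector<ExtractionCase>& cases) {
    assert(extract && "extractor must be callable");
    std::vector<std::string> failures;
    for (const auto& c : cases) {
        if (extract(c.input) != c.expected) failures.push_back(c.name);
    }
    return failures;
}

// Toy extractor for the sketch: collapses runs of whitespace into one space
// and drops leading and trailing whitespace.
std::string normalise_space(const std::string& in) {
    std::string out;
    bool pending = false;
    for (char ch : in) {
        if (ch == ' ' || ch == '\n' || ch == '\t') {
            pending = !out.empty();
        } else {
            if (pending) out += ' ';
            pending = false;
            out += ch;
        }
    }
    return out;
}
```

An empty result from run_cases() would mean every sample was extracted as 
expected; any names returned point at the formats that need attention.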
Thumbnail generation would be a nifty feature which would make the search UI 
much easier to use. The site could make use of JavaScript and display the 
thumbnail along with each search result.
The file types that I plan to include are mainly zip archive formats, office 
formats and document formats. This could possibly be extended to secondary 
objectives like media formats, basically to index metadata, and to 3D and 2D 
files, as those too contain metadata. If all of the above gets completed in 
time, the next possible option would be to extend the indexing to programming 
languages and repositories, though that would require tinkering with other 
components.
C:Do you have any preliminary findings or results which suggest that your 
approach is possible and likely to succeed?
The approach was suggested on the ideas page and it seems perfectly fine. The 
possibility of success is very high and the chance of things going wrong is 
pretty low.

Project Timeline 
26th April - 12th May: Familiarization with mentor and the codebase of the 
present Xapian system. Study of patterns and conventions used. Specific study 
of the Omega app and its inter-dependencies. Discussing the drawbacks of the 
current system. Improving C and Perl coding skills.


13th May - 22nd May: Finding the required C libraries, comparing them and 
fixing on the best possible ones for each file type, favouring those with the 
minimum CPU usage. The requirements of the testing framework are also 
discussed. Review of the errors that occur in the present system. Full 
blueprint of exactly what is to be done.

23rd May - 24th May: Current project status and goals reviewed by mentor. 
Timeline is corrected if needed.

25th May - 15th June: Implementation of a skeletal system, checking its 
integration with Omega and testing whether it addresses the basic problems 
that were faced earlier. Support for basic formats is introduced. Errors are 
logged and suggestions to improve the system are taken.

17th June - 2th June: Solving bugs and errors and finishing the work on a 
basic working system. Interact with the development team to see whether it 
works well with the other modules and is error free. System testing and test 
runs done. Results evaluated.


2th June - 26th June: Adding a few extra formats and testing as mentioned 
above. Implementation of thumbnail generation.

26th June - 28th June: Review by mentor and suggestions by mentor on what 
should be added to the system. Prepare for adding tweaks and performance 
improvements.

30th June - 6th July: Incorporating extra changes and adding tweaks and 
performance improvements.


8th July - 10th July: Documentation of work till then, complete with analysis 
of the current and previous systems. Getting ready for the mid-term 
evaluation.

12th July: Submission for mid-term evaluation.

16th July - 19th July: Discussion of advanced features that need to be 
implemented. Exception handling and fault tolerance issues discussed. By this 
point I assume that 85% of the work is completed and most of the file formats 
are supported, including archives, office formats, documents and packages.
20th July - 6th August: Implementation of a thorough test case framework.

6th August - 10th August: Extensive testing of the new system. Run under all 
conditions. Logging of errors, bugs and exceptions.

10th August - 13th August: Final run of the system; ready for upload.

13th August - 18th August: Buffer time in case something still needs to be 
done or I lose a few days.


18th August - 20th August: Submission of final evaluations to Google by both 
students and mentors. Wonderful time. Take a few days to chill out and join 
the team for the remaining journey to achieve our ultimate goal.

Previous Discussion of your Project
I have discussed my project extensively with ojwb. I have also contacted 
Jean-Francois Dockes, who manages Recoll (an application using Xapian as a 
backend for searching).




Projects done till now


	* Project : Office Automation
Client : School of Engineering office, CUSAT
Team Size : 4
Cochin University of Science & Technology (CUSAT) is a government owned 
autonomous university in Kochi (Cochin), Kerala, India. The university awards 
degrees in various fields of engineering and allied subjects at the 
undergraduate, postgraduate and doctoral levels. Nearly 1,000 students enroll 
yearly in various areas of undergraduate and postgraduate study at this 
university (totalling about 4000).

The Office Automation project was to develop a web application to maintain 
the day-to-day activities of the School of Engineering office, CUSAT. The web 
application has to enable the management of student details such as fee 
management, issuing of certificates, staff management etc. through a single 
gateway.

Platform: ASP.NET/ C#, MySQL Server and Windows Server 2003  
Activities & Responsibility
	* Requirement Analysis
	* Prototype Development
	* Database Design
	* Environment Set Up
	* Development
	* Unit Testing
	* Project : Generic LAN Search
Team Size : 2
To develop a generic LAN search engine which can crawl the local servers and 
index files.

Activities & Responsibility
	* System Study
	* Finding Best Algorithms
	* Programming
Platform: Java, MySQL Server, JSP
	* Project : Single System Image over Virtualization
Client : FISAT CHPC (Center for High Performance Computing)
Team Size : 2
Description
An SSI over virtualization, for deployment on a cluster or supercomputer, 
which provides transparency to the application running on the server. It can 
be used to achieve parallel processing with minimal changes to the client 
programs, to utilize idle CPU time, to parallelize heavily threaded 
applications etc. The project is still ongoing and is not likely to finish 
any time soon.
Platform: C/C++, Assembly, QEMU (VM), Linux


Skill Set  
Languages 		: Java , C/C++ , HTML , Perl , ASP.NET/C#  
Database 		: MySQL  
Operating system	: Windows , Linux  
Software Packages	: Any software which follows usual conventions  
Personal Dossier  
Date of Birth 		: 26-March,1990  
Father's Name		: Yesudasan A  
Languages known 	: English, Malayalam, Hindi  



      