[Xapian-devel] Draft Application for GSoC 11 - Text extraction libraries - please review
nijil yes
nonijil at yahoo.co.in
Mon Mar 28 09:26:22 BST 2011
Proposal for Google Summer of Code 2011 (draft)
Applying organisation: Xapian
Name : Nijil.Y
E-mail address: nijil.y at gmail.com
IRC nickname : laserbled
Biography
I am a fourth-year Computer Science and Engineering undergraduate student at
CUSAT University in India. I am interested in open source, and my main areas of
interest are search engines, cluster computing, HPC and AI.
* Analytical and detail-oriented, with strong programming skills; work
diligently on long, challenging assignments.
* Maintain excellent interpersonal communication, time management, and problem
resolution skills.
* Quick to get accustomed to new technologies.
Eligibility
I am fully eligible under Google's rules and will be available throughout the
GSoC 2011 period.
Background Information
1:Have you taken part in GSoC and/or GHOP and/or GCI before? If so, in what
role(s)? Tell us about how it went, and any areas you would have liked more help
with.
Ans: I applied to GSoC 2010 with an organization called the Berkman Center,
which is part of Harvard University. It was for a news indexing application
called Mediacloud. I was not selected as there were more deserving candidates.
2:Please tell us about any previous experience you have with Xapian, or other
systems for indexed text search.
Ans: I have long been interested in web search engines, indexing and crawling as
a whole. I wrote a small indexing and search application in Java which could
index files from multiple servers and extract content from HTML, PDF, office
formats, PPTs etc. I wanted to extend it to distributed indexing but was not
able to continue. I have also built a small content extraction tool with Perl
and CPAN modules. I have been reading and dissecting the Xapian code for the
past 2 weeks and feel comfortable with it, as it is neat and organised. I have
been concentrating on the Omega tool provided with the Xapian package, reading
omindex.cc in particular, and am looking forward to working with it.
3:Do you have previous experience with Free Software and Open Source other than
Xapian?
Ans: I joined the Dreamwidth community a while back, which helped me learn Perl
a bit. It is a journaling tool. I have not been active though; I think I
submitted a small patch. I am now part of linuxpmi (Linux Process Migration
Infrastructure), a community which maintains kernel patches for process
migration. The group has hardly 10 members, is called the new openMosix, and was
in hibernation when I joined a couple of months back. I plan to work with them
as it is in line with my interests.
4:Do you have any other relevant prior experience (courses taken at college,
hobbies, holiday jobs, etc)?
Ans: I have taken all the usual courses in Computer Science. I built an office
automation system to maintain fees, student details and admission logs for my
university, which has around 3000 students, and also the details of faculty. The
platforms used were Visual Studio, ASP.NET and MySQL.
My major project is a study of SSI virtualisation using code extraction, so that
multithreaded applications can be parallelised and run on cluster nodes. The
platform used is C and QEMU, so thread migration etc. are possible
implementations. Completing the whole project will take some time though. I also
play around with my system hardware and do the tweaking myself.
5:What development platforms, tools and methods do you prefer to use?
Ans: I have been using Linux exclusively for the past one and a half years;
before that I was a partial user. I work on an Ubuntu 10.10 box. My tools are
vim, grep, geany etc.
6:Have you previously been responsible (as an employee/volunteer/student/etc)
for a project of a similar size?
Ans: The projects I have done so far are listed at the end of this document.
7:What timezone will you be in during the coding period?
Ans: UTC/GMT +5:30 hours. I will be in India during that time. My working hours
can be flexible as I am mostly a night bird, so this won't be an issue.
8:Will your Summer of Code project be the main focus of your time during the
program?
Ans: Oh yes. Absolutely.
9:How many hours a week will you realistically be able to devote to your
project?
Ans: I would work on the project 8-9 hrs a day at the minimum. That comes to
around 50-55 hrs over six days. I plan to take Sunday off, but of course if I
fall behind schedule I am more than willing to work on Sundays too. No issues.
10:Are you applying for other projects in GSoC 2011? If so, with which
organisations? (We don't regard you applying for other projects negatively, but
we like to know so that we can plan for possible scenarios when assigning
mentors).
Ans: No. I debated it a lot. Since I won't be able to do any preliminary work on
other orgs' projects, I couldn't call myself committed, and I believe it takes
more than just writing a proposal and submitting it. So I thought I had better
stick to one org, work on that, and try my luck.
Project
Title : Text-Extraction Libraries to index file-formats
Summary
Currently Omega has built-in support for HTML, plain text, and uncompressed
AbiWord documents. Text from other files is extracted using external programs,
which cause an overhead. I plan to use libraries to replace these external
programs while preserving and extending the list of supported file types. Along
with that, I plan to add support for new file types: audio, email and, if
possible, 3D file formats, archive formats, packaging formats and database
files. Another nifty feature I plan to develop is a thumbnail generation system,
which will add an entry for the thumbnail generated from a file so that it can
be viewed during retrieval. The main aim is to stop external programs eating up
CPU time and to reduce that cost with the help of library services.
Why have you chosen this project?
I am a supporter of open source and am interested in indexing and search engine
based services. I found Xapian interesting as it is something I could put to
daily use, and I would be doing something I love. Also, since I have conceptual
knowledge of these areas, I am hoping that will come to my aid.
Benefits
The open source community would definitely benefit, and so would the Xapian
user base. As of now the Omega indexer doesn't support most file formats
internally. It supports them by making use of external programs, which increases
its CPU footprint. Also, for each format one or more external filter programs
need to be run. We can completely avoid that once this project is over. The user
base is also likely to increase, as users will be getting an indexer with better
and more robust file format support. It could then be fitted with a GUI and
turned into a desktop indexing and search application.
Project Details
A: The project I plan to undertake can be summarised as below:
1: Replace existing external filter programs with shared libraries.
2: Add new file-format support.
3: Add thumbnail generation feature.
4: Add a testing framework.
5: Minimize the 'ignore' file list.
B:What is new or different about your approach which hasn't been done or wasn't
possible before?
Currently we require external programs like xpdf, unzip, xls2csv, catdoc etc.,
which the user needs to have installed on their computer in order to index the
corresponding formats. But this has a problem: a new process needs to be started
every time we come across a file in such a format, and the external program is
then run to extract the contents and metadata. Each new process started is extra
load on the CPU and thus increases the footprint of the Xapian system. So the
option would be to replace those filters and use shared libraries instead, which
would take care of that issue. Adding new file format support would also
definitely increase the usability and flexibility of Xapian and Omega.
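
A minimal sketch of the library-based idea, written in Python only for brevity
(the project itself would use C or C++ libraries inside Omega). It assumes
OpenDocument text as the example format and reads content.xml in-process with
the standard zipfile module instead of starting unzip for every file. The names
extract_odf_text, EXTRACTORS, extract_text and make_sample_odt are hypothetical
and are not part of Omega.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

ODT_MIME = "application/vnd.oasis.opendocument.text"


def extract_odf_text(data: bytes) -> str:
    """Return the plain text of an OpenDocument file held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("content.xml")
    root = ET.fromstring(xml_bytes)
    # itertext() yields text nodes in document order; join them and
    # normalise whitespace so the result is ready for indexing.
    return " ".join(" ".join(root.itertext()).split())


# Hypothetical dispatch table: MIME type -> in-process extractor.
EXTRACTORS = {ODT_MIME: extract_odf_text}


def extract_text(mimetype: str, data: bytes):
    """Use a registered in-process extractor, or return None if there is none.

    A None result is where an indexer could fall back to an external filter.
    """
    extractor = EXTRACTORS.get(mimetype)
    return extractor(data) if extractor else None


def make_sample_odt() -> bytes:
    """Build a tiny in-memory archive that looks like an ODT file (demo only)."""
    content = (
        '<office:document-content'
        ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
        ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
        "<office:body><office:text>"
        "<text:p>Hello</text:p><text:p>Xapian</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", content)
    return buffer.getvalue()


if __name__ == "__main__":
    print(extract_text(ODT_MIME, make_sample_odt()))
```

The dispatch table keeps the choice of extractor per MIME type in one place, so
formats can move from external filters to libraries one at a time.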
Another point would be to build a testing framework which would test the
effectiveness of the system and check that the indexing is flawless. A framework
needs to be built as currently there isn't one.
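
As a rough illustration of what such a framework could check, the sketch below
(again Python for brevity, with hypothetical names check_extraction,
plain_text_extractor and CASES) runs an extractor over sample inputs and reports
any sample whose expected words are missing or whose extraction fails outright.
The sample data is made up for the example.

```python
def check_extraction(extractor, cases):
    """Run extractor over (name, data, expected_words) cases.

    Returns a list of (name, reason) tuples, one per failing case.
    """
    failures = []
    for name, data, expected_words in cases:
        try:
            words = set(extractor(data).lower().split())
        except Exception as exc:
            # A filter that crashes on a sample counts as a failure.
            failures.append((name, f"raised {exc!r}"))
            continue
        missing = sorted(set(expected_words) - words)
        if missing:
            failures.append((name, f"missing words: {missing}"))
    return failures


def plain_text_extractor(data: bytes) -> str:
    """Trivial stand-in extractor: decode the bytes as UTF-8 text."""
    return data.decode("utf-8")


# Made-up samples: one valid file and one with invalid UTF-8 bytes.
CASES = [
    ("greeting.txt", b"Hello Xapian", ["hello", "xapian"]),
    ("broken.txt", b"\xff\xfe", ["anything"]),
]

if __name__ == "__main__":
    for name, reason in check_extraction(plain_text_extractor, CASES):
        print(f"FAIL {name}: {reason}")
```

Keeping the cases as plain data means a new file format only needs a sample file
and a list of expected words to be covered.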
Thumbnail generation would be a nifty feature which would make the search UI
much easier to use. The site could make use of JavaScript and display the
thumbnail along with each search result.
The file types that I plan to include are mainly zip archive formats, office
formats and document formats. This could be extended to secondary objectives
like media formats, basically to index their metadata, and to 3D and 2D files,
as those too contain metadata. If all of the above is completed in time, the
next possible option would be to extend the indexing to programming languages
and repositories, though that would require tinkering with other components.
C:Do you have any preliminary findings or results which suggest that your
approach is possible and likely to succeed?
The approach was suggested on the ideas page and it seems perfectly fine. The
chance of success is very high, and the things that could possibly go wrong are
few.
Project Timeline
26th April - 12th May: Familiarisation with the mentor and the present Xapian
codebase. Study of the patterns and conventions used. Specific study of the
Omega application and its inter-dependencies. Discussing the drawbacks of the
current system. Improving my C and Perl coding skills.
13th May - 22nd May: Finding the required C libraries, comparing them and fixing
on the best possible ones for each file type, preferring those with the lowest
CPU usage. The requirements of the testing framework are also discussed. Review
of the errors that occur in the present system. Full blueprint of exactly what
is to be done.
23rd May - 24th May: Current project status and goals reviewed by mentor.
Timeline is corrected if needed.
25th May - 15th June: Implementation of a skeletal system, checking its
integration with Omega and testing whether it addresses the basic problems that
were faced earlier. Support for basic formats is introduced. Errors are logged
and suggestions to improve the system are taken.
17th June - 2th June: Fixing bugs and errors and finishing a basic working
system. Interact with the development team to see whether it works well with
the other modules and is error free. System testing and test runs done. Results
evaluated.
2th June - 26th June: Adding a few extra formats and testing as mentioned
above. Implementation of thumbnail generation.
26th June - 28th June: Review by mentor, with suggestions on what should be
added to the system. Prepare for adding tweaks and performance improvements.
30th June - 6th July: Incorporating extra changes and adding tweaks and
performance improvements.
8th July - 10th July: Documentation of the work done so far, complete with
analysis of the current and previous systems. Getting ready for the mid-term
evaluation.
12th July: Submission for mid-term evaluation.
16th July - 19th July: Discussion of advanced features that need to be
implemented. Exception handling and fault tolerance issues discussed. By this
point I assume that 85% of the work is completed and most of the file formats
are supported, including archives, office formats, documents and packages.
20th July - 6th August: Implementation of a thorough test case framework.
6th August - 10th August: Extensive testing of the new system. Run under all
conditions. Logging of errors, bugs and exceptions.
10th August - 13th August: Final run of the system; ready for upload.
13th August - 18th August: Buffer time in case something still needs to be done
or I lose a few days.
18th August - 20th August: Submission of final evaluations to Google by both
students and mentors. Wonderful time. Take a few days to chill out, then join
the team for the remaining journey to achieve our ultimate goal.
Previous Discussion of your Project
I have discussed my project extensively with ojwb. I have also contacted
Jean-Francois Dockes, who manages Recoll (an application using Xapian as a
backend for searching).
Projects done till now
* Project : Office Automation
Client : School of Engineering office ,CUSAT
Team Size : 4
Cochin University of Science & Technology (CUSAT) is a government owned
autonomous university in Kochi (Cochin), Kerala, India. The university awards
degrees in various fields of engineering and allied subjects at the
undergraduate, postgraduate and doctoral levels. Nearly 1,000 students enroll
yearly in various areas of undergraduate and postgraduate study in this
university (totalling 4000).
The Office Automation project was to develop a web application to handle the
day-to-day activities of the School of Engineering office, CUSAT. The web
application had to enable the management of student details, fees, issuing of
certificates, staff management etc. through a single gateway.
Platform: ASP.NET/ C#, MySQL Server and Windows Server 2003
* Activities & Responsibility
* Requirement Analysis
* Prototype Development
* Database Design
* Environment Set Up
* Development
* Unit Testing
* Project : Generic LAN Search
Team Size : 2
To develop a generic LAN search engine which can crawl the local servers and
index files.
Activities & Responsibility
* System Study
* Finding Best Algorithms
* Programming
Platform:Java, MySQL Server, JSP
* Project:Single System Image over Virtualization
Client : FISAT CHPC (Center for High Performance Computing)
Team Size : 2
Description
An SSI over virtualization, for deployment on a cluster or supercomputer, which
provides transparency to the applications running on the server. It can be used
to achieve parallel processing with minimal changes to the client programs, to
utilize idle CPU time, to parallelize heavily threaded applications etc. The
project is still ongoing and is not likely to be finished any time soon.
Platform: C/C++, Assembly, Qemu(VM), Linux
Skill Set
Languages : Java , C/C++ , HTML , Perl , ASP.NET/C#
Database : MySQL
Operating system : Windows , Linux
Software Packages : Any software which follows usual conventions
Personal Dossier
Date of Birth : 26-March,1990
Father's Name : Yesudasan A
Languages known : English, Malayalam, Hindi