[GSoC] Bug tracker access

Olly Betts olly at survex.com
Sat Mar 30 03:18:11 GMT 2019


On Thu, Mar 14, 2019 at 11:04:14AM -0300, Bruno Baruffaldi wrote:
> I am interested in applying for the project "Text-Extraction Libraries" and
> I was wondering if you could recommend me something else to read in
> addition to the resources

I think (and hope) this has since been answered on IRC, but in case
not: I don't really have any more useful resources (or else I'd have
added them).

One thing you may come across is "LibreOfficeKit" - this effectively
allows loading LibreOffice as a library and extracting text from
files with it (e.g. see https://gitlab.com/ojwb/lloconv/) but it's quite
big and slow compared to other options, and the results generally aren't
a lot better for indexing purposes, so I'd not recommend that as an
approach here.

> or if there is a particular issue that I can solve to get familiar
> with the code of the project.

There's https://trac.xapian.org/ticket/771 but someone is already
working on it.

If you've not already found something to work on, I'd probably suggest
trying to add support for a new format to omindex.  There is an FAQ
entry which describes how to do this, but it's a bit out of date as
things are now simpler for most filters.  The commit which added
support for iWork documents is probably a better guide (I think I
pointed this out to you in another thread):

https://git.xapian.org/?p=xapian;a=commitdiff;h=10e2cf5e64c8acd0a135e54007e1ba8eff2c53ee

Updating the FAQ entry would also be very useful:

https://trac.xapian.org/wiki/FAQ/OmegaNewFileFormat

Cheers,
    Olly


