[Xapian-tickets] [Xapian] #517: omindex: could use libextractor for many formats

Fri Nov 12 04:50:06 GMT 2010

#517: omindex: could use libextractor for many formats
-------------------------+--------------------------------------------------
 Reporter:  olly         |       Owner:  olly 
     Type:  enhancement  |      Status:  new  
 Priority:  normal       |   Milestone:  1.3.0
Component:  Omega        |     Version:       
 Severity:  normal       |    Keywords:       
Blockedby:               |    Platform:  All  
 Blocking:               |  
-------------------------+--------------------------------------------------
 This is "son of #114" - that ticket was about using libmagic and
 libextractor, which are really two separate issues.  The libmagic part is
 now done, but the libextractor part remains.  This would be a potentially
 disruptive change, which I think isn't appropriate to make mid-1.2 series,
 so marking as milestone:1.3.0.

 ----

 [attachment:libextractor.patch:ticket:114 patch to use libmagic and
 libextractor]

 This is a horrible hack, but you get the idea. A better setup would not
 bother with fileext/mimetypes that are known already to have no extractors
 available.

 ----

 Summarising the relevant parts of #114:

 libextractor plus points:

  * Has plugins for many file types
  * Extracts metadata as well as text
  * Saves us having to maintain code to perform filtering for so many
 formats

 Issues:

  * Haven't compared output quality with existing filters
  * Current libextractor API (at least when #114 was filed) doesn't
 distinguish between not having a plugin for a format, and the format not
 having metadata to extract, which makes it hard to efficiently fall back
 to other filters.

 ----

 We could use libextractor as a toolbox of filters which we pick from
 ourselves:

 I see one option besides taking this speed hit.  (I believe forcing the
 speed hit upon the user would be contrary to the design of omindex, since
 avoiding it was the whole point of removing from the map those file
 extensions that are not handled by index_file.)

 The option is to map MIME types directly to libextractor plugins.

 The maintainer guarantees that the names of libextractor plugins are
 static.  So we keep the filename-to-MIME-type map, which saves opening the
 file if the user doesn't want to use libmagic (libmagic gives more accurate
 MIME type identification).

 Then we add a MIME-type-to-libextractor-plugin map, so that we check the
 MIME type of a file passed to index_file, and call libextractor with an
 !ExtractorList only including the plugin for that one file's MIME type.
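 The two-stage lookup described above could be sketched roughly as follows.
 This is only an illustration, not omindex code: the table contents, the
 plugin names, and the helper names mime_type_for() and plugin_for() are
 all assumptions made for the example, and the call into libextractor
 itself is deliberately left out.

```cpp
#include <map>
#include <string>

// Hypothetical filename-extension-to-MIME-type map (entries are examples).
static const std::map<std::string, std::string> ext_to_mime = {
    {"pdf", "application/pdf"},
    {"html", "text/html"},
    {"mp3", "audio/mpeg"},
};

// Hypothetical MIME-type-to-plugin map.  The plugin names are assumed for
// illustration and should be checked against the installed libextractor.
static const std::map<std::string, std::string> mime_to_plugin = {
    {"application/pdf", "libextractor_pdf"},
    {"text/html", "libextractor_html"},
    {"audio/mpeg", "libextractor_mp3"},
};

// Guess the MIME type from the filename extension without opening the file.
// Returns an empty string if there is no extension or it is not in the map.
std::string mime_type_for(const std::string& filename) {
    std::string::size_type dot = filename.rfind('.');
    if (dot == std::string::npos) return std::string();
    auto it = ext_to_mime.find(filename.substr(dot + 1));
    if (it == ext_to_mime.end()) return std::string();
    return it->second;
}

// Pick the single plugin to load for a MIME type.  An empty result means
// "no plugin known for this type", so the caller can skip libextractor
// entirely and fall back to another filter.
std::string plugin_for(const std::string& mime_type) {
    auto it = mime_to_plugin.find(mime_type);
    if (it == mime_to_plugin.end()) return std::string();
    return it->second;
}
```

 With tables like these, plugin_for(mime_type_for("report.pdf")) yields the
 one plugin to put in the extractor list, and an empty string tells us up
 front that there is no plugin for the format, which the libextractor API
 itself does not report.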

 Drawbacks:

  * Requires a priori knowledge of which plugins libextractor currently
 has in order to add new ones, but that shouldn't change very frequently.
 Essentially libextractor is a Swiss Army knife text extractor, and adding
 a new format it supports is conceptually similar to adding support for a
 new filter program or library (but less work!)
  * If the file extension is wrong, mime_map is wrong, or libmagic screws
 up fingerprinting the file, we extract an empty keyword set because the
 wrong libextractor plugin is called.  But that is already the case with
 omindex because it currently depends on the file extension being correct,
 and webservers typically pick the content-type to serve based on the
 extension anyway.

-- 
Ticket URL: <http://trac.xapian.org/ticket/517>
Xapian <http://xapian.org/>