[Xapian-tickets] [Xapian] #517: omindex: could use libextractor for many formats
Xapian
nobody at xapian.org
Fri Nov 12 04:50:06 GMT 2010
#517: omindex: could use libextractor for many formats
-------------------------+--------------------------------------------------
Reporter: olly | Owner: olly
Type: enhancement | Status: new
Priority: normal | Milestone: 1.3.0
Component: Omega | Version:
Severity: normal | Keywords:
Blockedby: | Platform: All
Blocking: |
-------------------------+--------------------------------------------------
This is "son of #114" - that ticket was about using libmagic and
libextractor, which are really two separate issues. The libmagic one is
now done, but the libextractor one remains. This would be a potentially
disruptive change, which I don't think is appropriate to make mid-1.2
series, so marking as milestone:1.3.0.
----
[attachment:libextractor.patch:ticket:114 patch to use libmagic and
libextractor]
This is a horrible hack, but you get the idea. A better setup would skip
file extensions/MIME types that are already known to have no extractors
available.
----
Summarising the relevant parts of #114:
libextractor plus points:
* Has plugins for many file types
* Extracts metadata as well as text
* Saves us having to maintain code to perform filtering for so many
formats
Issues:
* Haven't compared output quality with existing filters
* Current libextractor API (at least when #114 was filed) doesn't
distinguish between not having a plugin for a format, and the format not
having metadata to extract, which makes it hard to efficiently fall back
to other filters.
----
We could use libextractor as a toolbox of filters which we pick from
ourselves:
I see one option besides taking this speed hit (forcing it upon the user
would, I believe, be contrary to the design of omindex, since that was the
whole point of removing from the map the file extensions that are not
handled by index_file).
This would be to map MIME types directly to libextractor plugins.
The maintainer guarantees that the names of libextractor plugins are static.
So we keep the filename-to-MIME-type map, which saves opening the file if
the user doesn't want to use libmagic (libmagic gives more accurate MIME
type identification).
Then we add a MIME-type-to-libextractor-plugin map, so that we check the
MIME type of a file passed to index_file, and call libextractor with an
ExtractorList containing only the plugin for that one file's MIME type.
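A rough sketch of the two-step lookup described above, to make the idea
concrete. This is not taken from the patch: the table contents, the plugin
names and the helper names mime_type_for() and plugin_for() are all
illustrative assumptions, and the step of actually handing the chosen
plugin to libextractor is deliberately left out.

```cpp
#include <map>
#include <string>

// Hypothetical table: file extension -> MIME type. Stands in for the
// existing filename-to-MIME-type map, so the file need not be opened.
static const std::map<std::string, std::string> ext_to_mime = {
    {"pdf", "application/pdf"},
    {"mp3", "audio/mpeg"},
    {"ogg", "application/ogg"},
};

// Hypothetical table: MIME type -> libextractor plugin name. The plugin
// names here are assumptions for illustration only.
static const std::map<std::string, std::string> mime_to_plugin = {
    {"application/pdf", "libextractor_pdf"},
    {"audio/mpeg", "libextractor_mp3"},
    {"application/ogg", "libextractor_ogg"},
};

// Return the MIME type for a filename based on its extension, or an empty
// string if there is no extension or the extension is not in the map.
std::string mime_type_for(const std::string& filename) {
    std::string::size_type dot = filename.rfind('.');
    if (dot == std::string::npos) return std::string();
    auto it = ext_to_mime.find(filename.substr(dot + 1));
    return it == ext_to_mime.end() ? std::string() : it->second;
}

// Return the single plugin to load for this file, or an empty string if
// no plugin is known, in which case the caller can skip libextractor and
// fall back to another filter without paying for a failed extraction.
std::string plugin_for(const std::string& filename) {
    std::string mime = mime_type_for(filename);
    if (mime.empty()) return std::string();
    auto it = mime_to_plugin.find(mime);
    return it == mime_to_plugin.end() ? std::string() : it->second;
}
```

With tables like these, plugin_for("report.pdf") would select the PDF
plugin, while a file with an unmapped extension gets an empty result before
libextractor is ever called.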
Drawbacks:
- Requires a priori knowledge of which plugins libextractor currently has
in order to add new ones, but that shouldn't change very frequently.
Essentially libextractor is a Swiss Army knife text extractor, and adding a
new format it supports is conceptually similar to adding support for a new
filter program or library (but less work!)
- If the file extension is wrong, mime_map is wrong, or libmagic screws up
fingerprinting the file, we extract an empty keyword set because the wrong
libextractor plugin is called. But that is already the case with omindex,
because it currently depends on the file extension being correct, and
webservers typically pick the content-type to serve based on the
extension anyway.
--
Ticket URL: <http://trac.xapian.org/ticket/517>
Xapian <http://xapian.org/>