[Xapian-tickets] [Xapian] #743: omindex: delay libmagic checks

Xapian nobody at xapian.org
Wed Apr 26 07:00:17 BST 2017


#743: omindex: delay libmagic checks
--------------------+-------------------------------
 Reporter:  olly    |             Owner:  olly
     Type:  defect  |            Status:  new
 Priority:  normal  |         Milestone:  1.4.x
Component:  Omega   |           Version:  git master
 Severity:  normal  |        Resolution:
 Keywords:          |        Blocked By:
 Blocking:          |  Operating System:  All
--------------------+-------------------------------

Old description:

> Currently omindex's logic is:
>
> * map the extension to a mime type
> * if "ignore" or "skip" move on to next file
> * if no mapping, call libmagic to get a mime type
> * if libmagic doesn't recognise the file, move on to next file
> * check file size (requires `stat()` call, which we have avoided so far
> if the file system returns `d_type` from `readdir()`)
> * if 0 or > max_size then move on to next file
> * create `Document` object and set up a little
> * check timestamps from `stat()` and the DB for an existing entry and
> move on to next file if this has been indexed and hasn't changed
> * check for failed entry in DB and move on if we already tried and failed
> (needs file size and last mod from `stat()`)
>
> The ordering here isn't ideal - in particular:
>
> * The probing done by libmagic is potentially fairly expensive since it
> has to open and read the start of the file, so we should avoid calling
> libmagic if another cheap check which doesn't need the mime type could
> reject the file (e.g. size, possibly timestamps if we can uncouple those
> checks from the check for the existing DB entry).  If we have a mapping,
> checking the mimetype for "ignore" or "skip" is still a cheap early
> check.
> * We create and set up the `Document` object a bit early (though this
> shouldn't be very expensive).

New description:

 Currently omindex's logic is:

 * map the extension to a mime type
   * if "ignore" or "skip" move on to next file
 * check file size (requires `stat()` call, which we have avoided so far if
 the file system returns `d_type` from `readdir()`)
   * if 0 or > max_size then move on to next file
 * if no extension mapping, call libmagic to get a mime type
   * if libmagic doesn't recognise the file, move on to next file
 * create `Document` object and set up a little
 * check timestamps from `stat()` and the DB for an existing entry and move
 on to next file if this has been indexed and hasn't changed
 * check for failed entry in DB and move on if we already tried and failed
 (needs file size and last mod from `stat()`)
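
 The first three steps above can be sketched roughly as follows.  This is
 only an illustration, not omindex's actual code: `Outcome`, `FileInfo`,
 `classify` and the `probe` callback (standing in for the libmagic call) are
 all hypothetical names, and the size rule is taken literally from the list
 above.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Hypothetical names throughout; this is not omindex's real code.
enum class Outcome { SkipByExtension, SkipBySize, SkipUnknownType, Continue };

struct FileInfo {
    std::string ext;      // extension without the dot
    std::uintmax_t size;  // file size as reported by stat()
};

// Decide whether to carry on with a file.  On Outcome::Continue the MIME
// type is stored in `mimetype`.  `probe` stands in for the libmagic call
// and is only invoked when no cheaper check has rejected the file.
Outcome classify(const FileInfo& f,
                 const std::map<std::string, std::string>& ext_map,
                 std::uintmax_t max_size,
                 const std::function<std::string()>& probe,
                 std::string& mimetype)
{
    assert(static_cast<bool>(probe));  // a probe callback must be supplied

    // 1. Cheap: map the extension to a MIME type.
    auto it = ext_map.find(f.ext);
    if (it != ext_map.end() &&
        (it->second == "ignore" || it->second == "skip")) {
        return Outcome::SkipByExtension;
    }

    // 2. Needs stat(): reject empty and oversized files.
    if (f.size == 0 || f.size > max_size) return Outcome::SkipBySize;

    // 3. Potentially expensive: probe the content only if there was no
    //    extension mapping.
    mimetype = (it != ext_map.end()) ? it->second : probe();
    if (mimetype.empty()) return Outcome::SkipUnknownType;

    return Outcome::Continue;
}
```

 With this ordering the probe is never reached for a file that the
 extension map or the size check has already rejected.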

 The ordering here isn't ideal - in particular:

 * The probing done by libmagic is potentially fairly expensive since it
 has to open and read the start of the file, so we should avoid calling
 libmagic if another cheap check which doesn't need the mime type could
 reject the file (e.g. possibly timestamps if we can uncouple those checks
 from the check for the existing DB entry).  If we have a mapping, checking
 the mimetype for "ignore" or "skip" is still a cheap early check.
 * We create and set up the `Document` object a bit early (though this
 shouldn't be very expensive).
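
 One way the timestamp check might be uncoupled is sketched below, purely
 as an assumption about how it could look.  `IndexedTimes` and
 `unchanged_since_indexed` are hypothetical; a plain map from path to the
 last-modified time recorded at indexing stands in for whatever lookup the
 DB would really provide.

```cpp
#include <ctime>
#include <map>
#include <string>

// Hypothetical stand-in for a lookup of the last-modified time that was
// recorded for each path when it was previously indexed.
using IndexedTimes = std::map<std::string, std::time_t>;

// True if `path` has already been indexed and its mtime from stat() is no
// newer than the recorded one.  Only the path and stat() data are needed,
// so this can run before the libmagic probe and before any Document setup.
bool unchanged_since_indexed(const IndexedTimes& indexed,
                             const std::string& path,
                             std::time_t mtime)
{
    auto it = indexed.find(path);
    return it != indexed.end() && mtime <= it->second;
}
```

 A file for which this returns true could be skipped without ever being
 opened.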

--

Comment (by olly):

 In d32e13545e699e6920e0e5a3ebfe8a94206de32f in git master, caiyulun has
 moved the libmagic check after the file size checks (description updated
 to match).  That should be backported to 1.4.5 I think.

--
Ticket URL: <https://trac.xapian.org/ticket/743#comment:1>
Xapian <https://xapian.org/>