Question about the ticket #743 omindex: delay libmagic checks

YuLun Cai buptcyl at gmail.com
Sun Apr 23 17:22:26 BST 2017


>
> I'd suggest you start by just moving the libmagic check after
> the filesize checks, so you don't need to get into whether libmagic or
> the database check is cheaper on average.


Hi Olly, I have moved the libmagic check so that it runs directly after the
filesize check:

https://github.com/caiyulun/xapian/commit/3a97d9ee5397fa900a473aa9b3d8eeb720177a4e
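
To illustrate the reordering, here is a minimal sketch of the idea, not the code in the commit above. The names `Decision`, `decide` and `detect_mime` are hypothetical, and the particular size checks (skip empty files, skip files above an optional maximum) are assumptions made for the example. The point is only that the expensive libmagic lookup is reached after the checks that need nothing more than the stat() result.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical outcome of the pre-index checks; not omindex's real types.
enum class Decision { SkipEmpty, SkipTooBig, SkipUnknownType, Index };

// size:        file size as reported by stat().
// max_size:    0 means "no limit" (an assumption of this sketch).
// detect_mime: stand-in for the expensive libmagic lookup; returns "" when
//              the type cannot be determined.
// mimetype:    set only when detect_mime was actually called.
Decision decide(std::uint64_t size, std::uint64_t max_size,
                const std::function<std::string()>& detect_mime,
                std::string& mimetype)
{
    // Cheap checks first: these only need the stat() result.
    if (size == 0) return Decision::SkipEmpty;
    if (max_size != 0 && size > max_size) return Decision::SkipTooBig;

    // Expensive check last: only files that passed the size checks get here.
    mimetype = detect_mime();
    if (mimetype.empty()) return Decision::SkipUnknownType;
    return Decision::Index;
}
```

With this ordering, a file rejected on size never triggers the MIME type lookup at all.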


Could you comment on it and give me some advice about the next steps?

I think it is hard to say which is cheaper, the libmagic check or the
database check.

Thanks


2017-04-21 13:37 GMT+08:00 Olly Betts <olly at survex.com>:

> On Fri, Apr 21, 2017 at 01:52:38AM +0800, YuLun Cai wrote:
> > I'm working on ticket #743, "omindex: delay libmagic checks"
> > <https://trac.xapian.org/ticket/743>. As the ticket's description
> > mentions, calling libmagic is more expensive than calling stat, so we
> > can check the size via stat before calling libmagic to get a MIME type.
>
> Yes.
>
> > But how about the timestamp check? It needs to consult the DB to check
> > whether the file has already been indexed and hasn't changed (in the
> > `index_check_existing` function in omega/index_file.cc), so it is
> > expensive too. Should we call libmagic before or after the timestamp
> > check, or is there another way to check the timestamps?
>
> We also have an upper bound on the newest timestamp in the database at the
> start of the run, so we can often avoid this check for new files (at least
> if they were created since the end of the previous index run).
>
> But that just quickly tells us "yes" for such files (at least on the basis
> of timestamp) so we'd need to check them with libmagic anyway.  To get a
> "no" based on timestamp we need to check against the database.
>
> I'd suggest you start by just moving the libmagic check after
> the filesize checks, so you don't need to get into whether libmagic or
> the database check is cheaper on average.
>
> > What's more, how should we write tests to show that omindex works
> > correctly? Should we generate some realistic directories, index them
> > with omindex, and then check the contents of the DB?
>
> We don't (sadly) have any tests of omindex behaviour currently, but having
> some would be great.
>
> You'd need to work out what cases you're aiming to test and then script up
> suitable changes to the directory between the omindex runs.
>
> Cheers,
>     Olly
>
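
The timestamp upper bound described in the quoted reply can be sketched as follows. This is only an illustration under stated assumptions: `Freshness` and `classify` are hypothetical names, and the strict comparison is a choice made for the example rather than something taken from omindex. It shows why the bound can give a quick "yes" for new files but never a "no".

```cpp
#include <ctime>

// Hypothetical result names for this sketch.
enum class Freshness { DefinitelyNew, NeedDatabaseCheck };

// mtime:        the file's modification time from stat().
// newest_in_db: upper bound on the newest timestamp stored in the database,
//               captured at the start of the run.
Freshness classify(std::time_t mtime, std::time_t newest_in_db)
{
    // A file modified after everything recorded in the database cannot be
    // an unchanged, already-indexed file, so no database lookup is needed.
    if (mtime > newest_in_db) return Freshness::DefinitelyNew;

    // Otherwise only a lookup against the database can tell whether the
    // file is unchanged since it was last indexed.
    return Freshness::NeedDatabaseCheck;
}
```

A file classified as `DefinitelyNew` still has to go through the libmagic check, which is why the bound alone does not settle whether libmagic or the database check should come first.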