Question about the ticket #743 omindex: delay libmagic checks

Olly Betts olly at survex.com
Fri Apr 21 06:37:20 BST 2017


On Fri, Apr 21, 2017 at 01:52:38AM +0800, YuLun Cai wrote:
> I'm working on ticket #743, "omindex: delay libmagic checks"
> <https://trac.xapian.org/ticket/743>. As the ticket's description
> mentions, calling libmagic is more expensive than calling stat, so we
> can call stat and check the file size before calling libmagic to get
> the MIME type.

Yes.

> But what about the timestamp check? It needs to iterate over the DB to
> check whether the file has already been indexed and hasn't changed (in the
> `index_check_existing` function in omega/index_file.cc), so it is
> expensive too. Should we call libmagic before or after the timestamp
> check, or is there another way to check the timestamps?

We also have an upper bound on the newest timestamp in the database at the
start of the run, so we can often avoid this check for new files (at least
if they were created since the end of the previous index run).

But that just quickly tells us "yes" for such files (at least on the basis of
timestamp) so we'd need to check them with libmagic anyway.  To get a "no"
based on timestamp we need to check against the database.
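
As a rough sketch of that asymmetry (in Python rather than omindex's C++;
the names `needs_indexing`, `newest_in_db` and `lookup_indexed_mtime` are
made up for illustration and are not taken from the omindex source, and the
exact comparison omindex uses may differ):

```python
from typing import Callable, Optional


def needs_indexing(mtime: int,
                   newest_in_db: int,
                   lookup_indexed_mtime: Callable[[], Optional[int]]) -> bool:
    """Decide from timestamps alone whether a file needs (re)indexing.

    newest_in_db is an upper bound on the timestamps stored in the database,
    taken once at the start of the run.  lookup_indexed_mtime stands in for
    the per-file database check and returns None for an unknown file.
    """
    if mtime > newest_in_db:
        # Newer than anything recorded: a quick "yes", no database lookup.
        return True
    # A "no" can only come from comparing against the stored entry.
    indexed = lookup_indexed_mtime()
    return indexed is None or mtime > indexed
```

The lookup is passed in as a callable so that the quick path visibly never
touches the database.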

I'd suggest that to start with you just look at moving the libmagic check
after the file size checks, so you don't need to get into whether libmagic or
the database check is cheaper on average.
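
The reordering amounts to something like the following, sketched in Python
rather than omindex's C++.  The function name, the `max_size` parameter and
the `detect_mime` callable (standing in for the libmagic call) are
assumptions for illustration, as is the exact pair of size checks shown:

```python
import os
from typing import Callable, Optional


def mime_type_if_wanted(path: str,
                        max_size: int,
                        detect_mime: Callable[[str], str]) -> Optional[str]:
    """Return the MIME type, or None if the size checks reject the file."""
    # Cheap: a single stat() call gives us the size.
    size = os.stat(path).st_size
    if size == 0 or size > max_size:
        # Rejected on size alone, so the expensive detection never runs.
        return None
    # Expensive: only reached for files which passed the size checks.
    return detect_mime(path)
```

The point is only the ordering: a file rejected on size never reaches the
content-sniffing step.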

> What's more, how should we write tests to prove that omindex works
> correctly? Should we generate some realistic directories, index them
> with omindex, and then check what ends up in the DB?

We don't (sadly) have any tests of omindex behaviour currently, but having
some would be great.

You'd need to work out what cases you're aiming to test and then script up
suitable changes to the directory between the omindex runs.
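
One possible shape for such a script, as a sketch only (Python, with
hypothetical helper names; nothing like this exists in the tree).  The
`run_indexer` callable is a placeholder: in a real test it would run omindex
on the directory and then check the resulting database, which is left out
here:

```python
import os
import tempfile
from typing import Callable, List

Step = Callable[[str], None]


def add_file(name: str, text: str) -> Step:
    """Step which creates (or overwrites) a file in the test directory."""
    def step(root: str) -> None:
        with open(os.path.join(root, name), "w") as f:
            f.write(text)
    return step


def remove_file(name: str) -> Step:
    """Step which deletes a file from the test directory."""
    def step(root: str) -> None:
        os.remove(os.path.join(root, name))
    return step


def run_scenario(steps: List[Step],
                 run_indexer: Callable[[str], None]) -> None:
    """Apply each step to a scratch directory, indexing after each one."""
    with tempfile.TemporaryDirectory() as root:
        for step in steps:
            step(root)
            run_indexer(root)
```

Each test case is then just a list of steps, for example add a file, add
another, then remove the first, with checks made after every run.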

Cheers,
    Olly


