[Xapian-discuss] Mime2Text library, derived from omindex

Liam xapian at networkimprov.net
Thu Feb 9 23:50:13 GMT 2012


On Tue, Nov 22, 2011 at 10:26 PM, Liam <xapian at networkimprov.net> wrote:

>
> load_file() in omega/loadfile.cc (part of the pending Mime2Text lib) calls
>
>   posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
>
> once, before closing the fd. In order to minimally impact the filesystem
> cache, I suspect it should call that after each read()?
>
> Also, the read buffer is only 4KB. It might be considerably more efficient
> if sized to the filesystem block size?
>

I believe calling posix_fadvise() after each read() is wise: 100MB PDFs are
not uncommon, and reading one in full before advising the kernel would pollute
the filesystem cache. If, given the benchmarks below, you agree, I'll commit my
edits to loadfile.cc and the test program to my github branch.
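
For reference, the per-read approach can be sketched roughly as follows. This is a minimal illustration, not the actual load_file() code: the function name read_file_nocache() and the bufsize parameter are made up for the example, and the 32K default is just one of the sizes tried in the benchmarks below. The advice is applied only to the byte range just read, so pages are released as the loop proceeds rather than once at the end.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <string>

// Hypothetical helper for illustration; NOT the load_file() in omega/loadfile.cc.
// Reads the whole file at `path` into `out`, advising the kernel after each
// read() that the range just consumed is no longer needed in the page cache.
bool read_file_nocache(const std::string& path, std::string& out,
                       size_t bufsize = 32768) {
    assert(bufsize > 0);
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return false;

    std::string buf(bufsize, '\0');
    off_t offset = 0;
    for (;;) {
        ssize_t n = read(fd, &buf[0], buf.size());
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal, retry
            close(fd);
            return false;
        }
        if (n == 0) break;  // end of file
        out.append(buf.data(), static_cast<size_t>(n));
#ifdef POSIX_FADV_DONTNEED
        // Advisory only: the return value is ignored because a failure here
        // does not affect the data already read.
        (void)posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);
#endif
        offset += n;
    }
    close(fd);
    return true;
}
```

A caller would use it as read_file_nocache(pathname, output_string). The #ifdef guard is there because not every platform provides posix_fadvise(); on those the loop degrades to a plain read loop.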

Here are benchmarks from a test program that walks a tree calling
   load_file(pathname, output_string, NOCACHE | NOATIME)
The test machine is a Core 2 Duo with a low-end disk, running Linux kernel
2.6.32-32-generic.
Note: the pattern of alternating slower/faster runs repeats over many tries.


Current loadfile.cc, with 4K buffer
  (buffers of 8K, 16K, 32K and 64K showed only a 1-2s speedup)

$ time ./loadfile-test ~
total bytes read: 627344268

real    0m55.267s
user    0m0.424s
sys    0m2.504s

$ time ./loadfile-test ~
total bytes read: 627344268

real    0m18.937s
user    0m0.360s
sys    0m1.800s

------------

Moved posix_fadvise() into the read loop
  (the faster pass is somewhat slower than before, though only the first pass
  is relevant here)

$ time ./loadfile-test ~
total bytes read: 627344302

real    0m59.410s
user    0m0.532s
sys    0m2.696s

$ time ./loadfile-test ~
total bytes read: 627344302

real    0m42.393s
user    0m0.428s
sys    0m2.376s

------------

Increased the read() buffer to 32K to reduce the number of posix_fadvise()
calls

$ time ./loadfile-test ~
total bytes read: 627344305

real    0m56.894s
user    0m0.472s
sys    0m2.300s

$ time ./loadfile-test ~
total bytes read: 627344305

real    0m41.719s
user    0m0.408s
sys    0m1.948s

