[Xapian-discuss] Mime2Text library, derived from omindex
Liam
xapian at networkimprov.net
Thu Feb 9 23:50:13 GMT 2012
On Tue, Nov 22, 2011 at 10:26 PM, Liam <xapian at networkimprov.net> wrote:
>
> load_file() in omega/loadfile.cc (part of the pending Mime2Text lib) calls
>
> posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
>
> once, before closing the fd. In order to minimally impact the filesystem
> cache, I suspect it should call that after each read()?
>
> Also, the read buffer is only 4KB. It might be considerably more efficient
> if sized to the filesystem block size?
>
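For reference, the block-size idea in the quoted message could be sketched roughly as follows. This is only an illustration: preferred_read_size() is a hypothetical helper, not something in loadfile.cc, and it assumes the st_blksize field reported by fstat() is a reasonable proxy for the "filesystem block size" meant above.

```cpp
#include <sys/stat.h>
#include <cstddef>

// Hypothetical helper (not part of omega/loadfile.cc).
// Returns the preferred I/O block size the kernel reports for an open
// descriptor, or `fallback` if fstat() fails or reports a non-positive size.
std::size_t preferred_read_size(int fd, std::size_t fallback = 4096) {
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_blksize <= 0) {
        return fallback;
    }
    return static_cast<std::size_t>(st.st_blksize);
}
```

The result would then be used to size the read() buffer in place of the fixed 4KB.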
I believe calling posix_fadvise() after each read() is wise: 100MB PDFs are
not uncommon, and reading one would otherwise pollute the filesystem cache.
If, given the benchmarks below, you agree, I'll commit my edits to
loadfile.cc and the test program to my github branch.
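To make the proposal concrete, here is a minimal sketch of a read loop that issues the advice after every read(). It is not the actual load_file() code: read_file_nocache() and its bufsize parameter are hypothetical names, and advising only the byte range just read (rather than passing 0, 0 for the whole file each time) is an assumption of this sketch.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <string>

// Hypothetical helper, not the real load_file() from omega/loadfile.cc.
// Reads the whole file into `out`; after each read() it tells the kernel
// that the pages just consumed will not be needed again.
bool read_file_nocache(const std::string& path, std::string& out,
                       size_t bufsize = 32 * 1024) {
    if (bufsize == 0) bufsize = 4096;  // avoid a zero-length read loop
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return false;

    out.clear();
    std::string buf(bufsize, '\0');
    off_t offset = 0;
    for (;;) {
        ssize_t n = read(fd, &buf[0], buf.size());
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted, retry
            close(fd);
            return false;
        }
        if (n == 0) break;  // end of file
        out.append(buf.data(), static_cast<size_t>(n));
#ifdef POSIX_FADV_DONTNEED
        // Assumption: advise only the range just read. The return value is
        // ignored because the advice is purely a hint.
        posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);
#endif
        offset += n;
    }
    close(fd);
    return true;
}
```

A larger bufsize means fewer posix_fadvise() calls per file, which is the trade-off the last benchmark below looks at.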
Here are benchmarks from a test program that walks a directory tree, calling
load_file(pathname, output_string, NOCACHE | NOATIME)
The test machine is a Core 2 Duo with a low-end disk, running Linux kernel
2.6.32-32-generic.
Note: the pattern of alternating slower/faster runs repeats over many tries.
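For anyone who wants to reproduce something similar, a tree-walking harness along these lines could look like the sketch below. It is not the loadfile-test program itself: total_bytes_under(), visit() and read_and_count() are hypothetical names, read_and_count() is a plain read() loop standing in for load_file(), and the walk uses POSIX nftw() without following symlinks.

```cpp
#include <fcntl.h>
#include <ftw.h>
#include <sys/stat.h>
#include <unistd.h>

// Running total, kept at file scope because nftw() callbacks take no
// user-data argument.
static unsigned long long g_total_bytes = 0;

// Stand-in for load_file(): reads the file and returns the bytes read
// (0 if the file cannot be opened).
static unsigned long long read_and_count(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    char buf[32 * 1024];
    unsigned long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        total += static_cast<unsigned long long>(n);
    }
    close(fd);
    return total;
}

// Called by nftw() for every entry; only regular files are read.
static int visit(const char* path, const struct stat*, int typeflag,
                 struct FTW*) {
    if (typeflag == FTW_F) {
        g_total_bytes += read_and_count(path);
    }
    return 0;  // keep walking
}

// Walks `root` (symlinks are not followed) and returns the bytes read.
// Returns 0 if the root cannot be walked at all.
unsigned long long total_bytes_under(const char* root) {
    g_total_bytes = 0;
    nftw(root, visit, 16, FTW_PHYS);
    return g_total_bytes;
}
```

A small main() that prints total_bytes_under(argv[1]) and is run under time(1) would give output of the same shape as the runs below.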
Current loadfile.cc, with 4K buffer
(buffers of 8K, 16K, 32K and 64K showed only a 1-2s speedup)
$ time ./loadfile-test ~
total bytes read: 627344268
real 0m55.267s
user 0m0.424s
sys 0m2.504s
$ time ./loadfile-test ~
total bytes read: 627344268
real 0m18.937s
user 0m0.360s
sys 0m1.800s
------------
Moved posix_fadvise() into the read loop
The faster pass is somewhat slower than before, though only the first pass
is relevant here.
$ time ./loadfile-test ~
total bytes read: 627344302
real 0m59.410s
user 0m0.532s
sys 0m2.696s
$ time ./loadfile-test ~
total bytes read: 627344302
real 0m42.393s
user 0m0.428s
sys 0m2.376s
------------
Increased the read() buffer to 32K to reduce the number of posix_fadvise()
calls
$ time ./loadfile-test ~
total bytes read: 627344305
real 0m56.894s
user 0m0.472s
sys 0m2.300s
$ time ./loadfile-test ~
total bytes read: 627344305
real 0m41.719s
user 0m0.408s
sys 0m1.948s