[Xapian-devel] Proposed changes to omindex

Olly Betts olly at survex.com
Sat Sep 2 03:17:40 BST 2006


On Tue, Aug 29, 2006 at 04:22:29PM +0100, Olly Betts wrote:
> I've made some and I'm in the process of working through the rest.

OK, I've done fettling.

After some research, I went for the public domain MD5 implementation
written by Colin Plumb.  It's used widely (including in the Linux
kernel), compiles as C++, and doesn't add further relicensing obstacles.

I couldn't find a guarantee that std::string::c_str() will have the
correct alignment for access as a 32 bit integer (though I can believe
it typically will be), so I've used memcpy() instead of
reinterpret_cast<>.
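
As a rough illustration of the memcpy() approach (a minimal sketch, not
the actual omindex code; the helper name load_u32 and its offset
parameter are made up for this example), reading a 32 bit value out of
a std::string without assuming anything about alignment could look like:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper (not taken from omindex): read a 32 bit value from
// `s` starting at byte `offset`, in host byte order.
inline std::uint32_t load_u32(const std::string& s, std::size_t offset) {
    // Make sure four bytes are actually available at `offset`.
    assert(offset <= s.size() && s.size() - offset >= sizeof(std::uint32_t));
    std::uint32_t value;
    // memcpy() places no alignment requirement on its source, unlike
    // dereferencing reinterpret_cast<const std::uint32_t*>(s.data() + offset).
    std::memcpy(&value, s.data() + offset, sizeof value);
    return value;
}
```

Compilers typically turn a fixed-size memcpy() like this into a single
load on platforms where unaligned access is cheap, so it shouldn't cost
anything measurable compared to the cast.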

I've been wondering if there should be a command line option to
enable/disable the MD5 checksumming.  If you don't want to collapse
identical documents, it's just overhead (slower indexing, bigger
database, and probably some slowdown when sorting by lastmod with
the current way we store values).

So I did some simple benchmarking by indexing /usr/share/doc on my
Ubuntu box:

Without MD5:

real    1m56.279s
user    1m44.573s
sys     0m7.358s
58536   usrsharedoc

With MD5:

real    1m54.171s
user    1m45.104s
sys     0m7.631s
58784   usrsharedoc

I used the flint backend.  The last number in each set is the "du -sk"
output for the database directory.  Each run was timed after the same
index had already been generated once (so with a warm cache).

The box is also running my desktop, which is probably why the wall clock
time is lower with MD5.  Looking at user time, system time, their total,
and disk space, the overhead is:

user:  0.51%
sys:   3.71% 
total: 0.72%
disk:  0.42%
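
For anyone repeating this with their own numbers, the percentages are
just the relative increase over the run without MD5.  A minimal sketch
(the function name overhead_percent is made up for this example; times
are converted to seconds first, so 1m44.573s becomes 104.573):

```cpp
#include <cassert>

// Relative increase of the "with MD5" figure over the "without MD5"
// figure, as a percentage.  Both arguments must be in the same unit
// (seconds for times, kilobytes for "du -sk" output).
inline double overhead_percent(double without_md5, double with_md5) {
    assert(without_md5 > 0.0);
    return (with_md5 - without_md5) / without_md5 * 100.0;
}
```

For example, overhead_percent(104.573, 105.104) gives the user time
figure, and overhead_percent(58536, 58784) gives the disk figure.  The
total uses user plus sys for each run.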

So I'm inclined not to worry about the overhead.  Each extra command
line option makes it harder for the user to find the command line
options they want.  But if anyone benchmarks on a different platform
with different data and comes to a different conclusion, post here!

Cheers,
    Olly


