[Xapian-devel] Proposed changes to omindex
olly at survex.com
Sat Sep 2 03:17:40 BST 2006
On Tue, Aug 29, 2006 at 04:22:29PM +0100, Olly Betts wrote:
> I've made some and I'm in the process of working through the rest.
OK, I've done fettling.
After some research, I went for the public domain MD5 implementation
written by Colin Plumb. It's used widely (including in the Linux
kernel), compiles as C++, and doesn't add further relicensing obstacles.
I couldn't find a guarantee that std::string::c_str() will have the
correct alignment for access as a 32-bit integer (though I can believe
it typically will be), so I've used memcpy() instead of
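
To illustrate the alignment point, here is a minimal sketch of reading a 32-bit word out of a std::string with memcpy(). It is not the actual omindex code: the helper name load_u32 and its parameters are made up for this example, and it assumes the caller has already checked that pos + 4 <= data.size().

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper (not from omindex): read 4 bytes starting at byte
// offset `pos` of `data` as a 32-bit integer in the host's byte order.
// Assumes pos + 4 <= data.size(); no bounds checking is done here.
inline std::uint32_t load_u32(const std::string& data, std::size_t pos) {
    std::uint32_t word;
    // memcpy() makes no assumption about the alignment of the source
    // pointer, unlike casting data.data() + pos to a uint32_t pointer
    // and dereferencing it.
    std::memcpy(&word, data.data() + pos, sizeof word);
    return word;
}
```

Compilers generally turn a fixed-size memcpy() like this into a plain load on platforms which allow unaligned access, so the portable version shouldn't cost anything measurable there.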
I've been wondering if there should be a command line option to
enable/disable the MD5 checksumming. If you don't want to collapse
identical documents, it's just overhead (slower indexing, bigger
database, and probably some slowdown when sorting by lastmod with
the current way we store values).
So I did some simple benchmarking by indexing /usr/share/doc on my
I used the flint backend. The last number is "du -sk" on the database
directory. Times are having already generated the same index (so warm
The box is running my desktop, which is probably why the wall clock time
is less. Looking at the user, system, and disk space, the overhead is:
So I'm inclined not to worry about the overhead. Each extra command
line option makes it harder for the user to find the command line
options they want. But if anyone benchmarks on a different platform
with different data and comes to a different conclusion, post here!