[Xapian-devel] Proposed changes to omindex

Olly Betts olly at survex.com
Mon Aug 28 23:27:10 BST 2006


On Sun, Aug 27, 2006 at 01:24:42AM -0700, Michael Trinkala wrote:
> The tar file can be found here: https://www.trinkala.com/xapian/sort_collapse.tgz

OK, I'm currently working through this.  Some initial thoughts below.

> Change summary for omega
> ------------------------
> - Added the document’s last modified time to the value table (ID 0).
> It is stored as a 4 byte string in big endian format

The patch actually creates sizeof(time_t) byte strings, so they're 8
bytes on my dev box.  We want databases to be portable between
platforms, so I'll fix it to always give 4 bytes regardless of
sizeof(time_t).
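
A minimal sketch of such a fixed-width encoding, assuming a hypothetical
helper name (pack_time_be32 is not from the patch or from omega):

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Hypothetical helper (name not from the patch): encode a timestamp as
// exactly 4 bytes, most significant byte first, regardless of
// sizeof(time_t).  Values that don't fit in 32 bits are truncated to
// their low 32 bits.
std::string pack_time_be32(std::time_t t) {
    const std::uint32_t v = static_cast<std::uint32_t>(t);
    std::string out(4, '\0');
    out[0] = static_cast<char>((v >> 24) & 0xff);
    out[1] = static_cast<char>((v >> 16) & 0xff);
    out[2] = static_cast<char>((v >> 8) & 0xff);
    out[3] = static_cast<char>(v & 0xff);
    return out;
}
```

Because the width is fixed and the most significant byte comes first,
comparing two such strings byte by byte (as unsigned bytes) orders them
the same way as the timestamps themselves.  Times before 1970 would wrap
in this sketch.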

> - Added the document’s MD5 to the value table (ID 1) as a 16 byte
> string and C term prefix to allow collapsed documents to be easily
> expanded/searched

Is such a term useful?  I struggle to see why someone might want to
restrict a search to identical documents!

Also "term prefix corresponding to the collapse key" is rather a flawed
concept because there could be several values that could be collapsed on
in the same database (e.g. collapsing on hostname or category in some
hierarchy).  So if it's useful to have at all, I think it's better to
document the term prefix as the file's MD5 sum.  Or even have a "term
corresponding to value N" prefix.

And if we have this term, I wonder if it's better to just stick the
binary MD5 hash in - it seems a bit random to store it as binary
data in the value, but convert it to a hex string for the term.
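
For reference, the conversion in question is small either way.  A rough
sketch, assuming a hypothetical to_hex helper (not code from the patch),
of turning the 16 byte binary digest into its 32 character hex form:

```cpp
#include <string>

// Hypothetical helper: render each byte of `bytes` as two lowercase
// hex digits, so a 16 byte MD5 digest becomes a 32 character string.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (unsigned char ch : bytes) {
        out += digits[ch >> 4];    // high nibble
        out += digits[ch & 0x0f];  // low nibble
    }
    return out;
}
```

Storing the same representation in both the value and the term would
avoid needing this step in one place but not the other.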

> Added the following files from the GNU testutils-2.1 package
> md5.c
> md5.h
> unlocked-io.h

Currently everything in Xapian is deliberately compiled as C++.  Mixing
gcc and g++ seems fine, but with some of the vendor compilers the
results of configure tests for the C and C++ compilers aren't always the
same, which makes for fun games.

Another problem is that configure sometimes picks a "mixed" pair of
compilers (e.g. gcc and the vendor C++ compiler).  I eventually decided
it was simplest to sidestep all these problems and just tweak any C code
we want to use to compile as C++.

>    - added $md5 and $valuedate commands/documentation

I think it's better to be orthogonal here and instead add a function
which converts a 4 byte (or however many bytes, perhaps) big endian binary
number to a string.  Then you can just convert the value and use the
existing $date command.  That way other binary numbers can also be
converted and fed to other functions, rather than having to create
variants of each which take binary data.
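
The core of such a conversion could look something like this.  It is
only a sketch: the name be_binary_to_decimal and the 8 byte limit are
assumptions, and the hook-up to an OmegaScript command is not shown.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: interpret `bytes` as an unsigned big endian
// integer and return it as a decimal string.  Accepts 1 to 8 bytes;
// any other length yields an empty string.
std::string be_binary_to_decimal(const std::string& bytes) {
    if (bytes.empty() || bytes.size() > 8) return std::string();
    std::uint64_t v = 0;
    for (unsigned char ch : bytes) {
        v = (v << 8) | ch;  // shift in one byte at a time, MSB first
    }
    return std::to_string(v);
}
```

The decimal string it returns could then be fed to the existing $date
command, or to any other command which expects a number.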

Cheers,
    Olly


