[Xapian-devel] Proposed changes to omindex

Michael Trinkala mdt at trinkala.com
Tue Aug 29 05:32:29 BST 2006


> The patch actually creates sizeof(time_t) byte strings, so they're 8
> bytes on my dev box.  We want databases to be portable between
> platforms, so I'll fix it to always give 4 bytes regardless of
> sizeof(time_t).

Sounds good.  I thought time_t was a long int (32 bits even on an x64 platform).  Are you just
truncating the extra 4 bytes?
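
For illustration, here is a minimal sketch of packing a timestamp into a fixed four bytes, big endian, regardless of sizeof(time_t). The name pack_time_be32 is hypothetical and is not taken from the patch; it assumes that dropping everything above the low 32 bits is acceptable.

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Hypothetical helper (not from the patch): pack a timestamp into exactly
// four bytes, most significant byte first, whatever sizeof(time_t) is.
std::string pack_time_be32(std::time_t t) {
    // Keep only the low 32 bits; anything above them is discarded.
    std::uint32_t v = static_cast<std::uint32_t>(t);
    std::string out(4, '\0');
    out[0] = static_cast<char>((v >> 24) & 0xff);
    out[1] = static_cast<char>((v >> 16) & 0xff);
    out[2] = static_cast<char>((v >> 8) & 0xff);
    out[3] = static_cast<char>(v & 0xff);
    return out;
}
```

Because the most significant byte comes first, comparing two such strings byte by byte (as unsigned bytes) orders them the same way as the numbers they encode.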

> Is such a term useful?  I struggle to see why someone might want to
> restrict a search to identical documents!

Here is how I use it:
 - When I am indexing message threads, all the C terms for the related messages get the parent's
MD5.  This allows me to expand the entire message thread from any message that is displayed in the
search results (this cannot necessarily be done from our mailing list software because it doesn't
track message threads across lists).
 - When a duplicate is found, it provides the user a quick way to locate them all.

> Also "term prefix corresponding to the collapse key" is rather a flawed
> concept because there could be several values that could be collapsed on
> in the same database (e.g. collapsing on hostname or category in some
> hierarchy).  So if it's useful to have at all, I think it's better to
> document the term prefix as the file's MD5 sum.  Or even have a "term
> corresponding to value N" prefix.

I agree.

>
> And if we have this term, I wonder if it's better to just stick the
> binary MD5 hash in - it seems a bit random to store it as binary
> data in the value, but convert it to a hex string for the term.

I couldn't easily build a query string from the browser that could search the binary version of
that term, so I stuck with text.
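
As a rough sketch of the text form, the binary digest can be turned into a hex string that is safe to type or put in a URL. The function name hex_encode is made up for this example and is not the code from the patch.

```cpp
#include <string>

// Hypothetical helper: turn a binary digest (16 bytes for MD5) into
// lowercase hex so it can be used as printable term text.
std::string hex_encode(const std::string& bin) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(bin.size() * 2);
    for (unsigned char c : bin) {
        out += digits[c >> 4];     // high nibble
        out += digits[c & 0x0f];   // low nibble
    }
    return out;
}
```

A 16-byte MD5 digest becomes a 32-character string this way.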

>>    - added $md5 and $valuedate commands/documentation
>
> I think it's better to be orthogonal here and instead add a function
> which converts a 4 byte (or however many byte perhaps) big endian binary
> number to a string.  Then you can just convert the value and use the
> existing $date command.  That way other binary numbers can also be
> converted and fed to other functions, rather than having to create
> variants of each which take binary data.

Yes, that would be better.  Something like this: $date{$benumbertostring{$value{0}},"%D"}
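
To make the idea concrete, here is a minimal sketch of what such a conversion might do internally: read the bytes as an unsigned big endian number and return it in decimal. The function name be_number_to_string is hypothetical, and the sketch assumes the value is at most 8 bytes long.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the proposed $benumbertostring: interpret the
// bytes as an unsigned big endian number and return it as decimal text.
std::string be_number_to_string(const std::string& bin) {
    std::uint64_t v = 0;
    for (unsigned char c : bin) {
        v = (v << 8) | c;   // assumes at most 8 input bytes
    }
    return std::to_string(v);
}
```

The decimal string can then be handed to anything that expects a number as text, which is the point of keeping the conversion separate from $date.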

Also, if md5_file is going to return an int, it should be the file system errno; otherwise it
should just return a bool instead.

Thanks for the feedback.  Do you need me to submit the changes, or have you already made them?
Trink



