[Xapian-devel] Proposed changes to omindex

Olly Betts olly at survex.com
Tue Aug 29 16:22:29 BST 2006


On Mon, Aug 28, 2006 at 09:32:29PM -0700, Michael Trinkala wrote:
> > The patch actually creates sizeof(time_t) byte strings, so they're 8
> > bytes on my dev box.  We want databases to be portable between
> > platforms, so I'll fix it to always give 4 bytes regardless of
> > sizeof(time_t).
> 
> Sounds good.  I thought time_t was a long int (32 bits even on an x64 platform).

On my dev box (x86_64 Linux), long int is 64 bits too.  That's generally
true for 64-bit Unix platforms.  Microsoft apparently keep long as 32
bits on 64-bit Windows:

http://en.wikipedia.org/wiki/ILP64#64-bit_data_models

> Are you just truncating the extra 4 bytes?

For now, yes.
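
To make the idea concrete, here is a minimal sketch of packing a time_t
into exactly 4 bytes regardless of sizeof(time_t).  The helper name
encode_time_4bytes and the big-endian byte order are assumptions for
illustration, not necessarily what the patch ends up doing; big-endian
is chosen here only because the resulting strings then sort in the same
order as the timestamps.

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Hypothetical helper (not the actual omindex code): pack a time_t into
// exactly 4 bytes, most significant byte first, whatever sizeof(time_t) is.
std::string encode_time_4bytes(std::time_t t) {
    // Keep only the low 32 bits; any higher bits are simply dropped.
    std::uint32_t v = static_cast<std::uint32_t>(t);
    std::string out(4, '\0');
    out[0] = static_cast<char>((v >> 24) & 0xff);
    out[1] = static_cast<char>((v >> 16) & 0xff);
    out[2] = static_cast<char>((v >> 8) & 0xff);
    out[3] = static_cast<char>(v & 0xff);
    return out;
}
```

The result has the same length and contents on 32-bit and 64-bit
platforms, which is the portability property we want for the database.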

> > Is such a term useful?  I struggle to see why someone might want to
> > restrict a search to identical documents!
> 
> Here is how I use it:
>  - When I am indexing message threads, all the C terms for the related
> messages get the parent's MD5.  This allows me to expand the entire
> message thread from any message that is displayed in the search
> results (this cannot necessarily be done from our maillist software
> because it doesn't track message threads across lists).

OK, I can see the term is useful in this situation, but that's different
to what omindex is doing.  In your case the MD5 is really acting like a
message-id for the parent.  The situation you describe is much more like
collapsing on hostname than eliminating exact duplicates.

>  - When a duplicate is found it provides the user a quick way to
>  locate them all.

I guess a website admin might be interested to know where duplicates
of files are, but adding a fairly long and mostly unique term per
document will increase the size of the database and slow down indexing a
little, which is why I'm wondering if such a term is useful.

It's also worth noting that identical documents will get the same score
in a non-collapsed search, so cutting and pasting a paragraph of text
from a document as the query should list all duplicates together anyway
(a short query might give other documents the same weight, but a longer
one is unlikely to).

> Also, if md5_file is going to return an int, it should be the file
> system errno, or it should just return a bool instead.

Hmm yes, that would be more consistent with other APIs.
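
As a rough sketch of the bool-returning convention (true on success,
false on failure with errno left describing the error), something like
the following would do.  The function checksum_file is a hypothetical
stand-in for md5_file, and its "digest" is a trivial byte sum rather
than MD5; only the error-reporting shape is the point.  It also assumes
a POSIX-like platform where fopen() sets errno on failure.

```cpp
#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical stand-in for md5_file: returns true on success, or false
// on failure with errno describing the error.  The "digest" here is a
// plain byte sum, NOT MD5.
bool checksum_file(const std::string& path, unsigned long& sum) {
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return false;  // Assumes fopen() has set errno (POSIX does).
    sum = 0;
    int ch;
    while ((ch = std::fgetc(f)) != EOF) {
        sum += static_cast<unsigned char>(ch);
    }
    bool ok = !std::ferror(f);
    int saved_errno = errno;  // fclose() may overwrite errno.
    std::fclose(f);
    if (!ok) errno = saved_errno;
    return ok;
}
```

A caller can then test the return value and consult errno (or
strerror(errno)) only when it is false.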

> Thanks for the feedback, do you need me to submit the changes or have
> you made them?

I've made some and I'm in the process of working through the rest.

Cheers,
    Olly


