[Xapian-devel] Proposed changes to omindex
Olly Betts
olly at survex.com
Tue Aug 29 16:22:29 BST 2006
On Mon, Aug 28, 2006 at 09:32:29PM -0700, Michael Trinkala wrote:
> > The patch actually creates sizeof(time_t) byte strings, so they're 8
> > bytes on my dev box. We want databases to be portable between
> > platforms, so I'll fix it to always give 4 bytes regardless of
> > sizeof(time_t).
>
> Sounds good. I thought time_t was a long int (32 bits even on an x64 platform).
On my dev box (x86_64 linux), long int is 64 bits too. That's generally
true for 64 bit Unix platforms. Microsoft keep long as 32 bits
apparently:
http://en.wikipedia.org/wiki/ILP64#64-bit_data_models
> Are you just truncating the extra 4 bytes?
For now, yes.
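
A minimal sketch of what a fixed 4-byte encoding could look like, whatever sizeof(time_t) is. The function name is hypothetical and the big-endian byte order is an assumption (the byte order used by the patch isn't stated here). Big-endian has the side benefit that byte-wise string comparison matches numeric order for the truncated value.

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Hypothetical helper (not taken from the patch): encode a time_t as exactly
// 4 bytes, most significant byte first, regardless of sizeof(time_t).
// Any bits above the low 32 are simply truncated.
std::string encode_time_4bytes(std::time_t t) {
    std::uint32_t v = static_cast<std::uint32_t>(t);
    std::string s(4, '\0');
    s[0] = static_cast<char>((v >> 24) & 0xff);
    s[1] = static_cast<char>((v >> 16) & 0xff);
    s[2] = static_cast<char>((v >> 8) & 0xff);
    s[3] = static_cast<char>(v & 0xff);
    return s;
}
```

The result is the same 4-byte string on 32 bit and 64 bit platforms for any timestamp that fits in 32 bits.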
> > Is such a term useful? I struggle to see why someone might want to
> > restrict a search to identical documents!
>
> Here is how I use it:
> - When I am indexing message threads all the C terms for the related
> messages get the parent's MD5. This allows me to expand the entire
> message thread from any message that is displayed in the search
> results (this cannot necessarily be done from our maillist software
> because it doesn't track message threads across lists)
OK, I can see the term is useful in this situation, but that's different
to what omindex is doing. In your case the MD5 is really acting like a
message-id for the parent. The situation you describe is much more like
collapsing on hostname than eliminating exact duplicates.
> - When a duplicate is found it provides the user a quick way to
> locate them all.
I guess a website admin might be interested to know where duplicates
of files are, but adding a fairly long and mostly unique term per
document will increase the size of the database and slow down indexing a
little, which is why I'm wondering if such a term is useful.
It's also worth noting that identical documents will get the same score
in a non-collapsed search, so cutting and pasting a paragraph of text
from a document as the query should list all duplicates together anyway
(a short query might give other documents the same weight, but a longer
one is unlikely to).
> Also if md5_file is going to return an int it should be the file
> system errno or it should just return a bool instead.
Hmm yes, that would be more consistent with other APIs.
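
As a rough illustration of the bool-returning shape, here is a sketch under stated assumptions: the function name and signature are hypothetical, and the MD5 computation itself is left out, so this stand-in only reads the file's bytes. It returns false on failure and leaves errno as set by the failing call, which is the POSIX behaviour of fopen.

```cpp
#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical stand-in for md5_file(): returns true on success, false on
// failure with errno left as set by the failing call. The digest step is
// omitted; this sketch only collects the file's contents into `out`.
bool read_file_contents(const std::string& path, std::string& out) {
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return false;  // errno is set by fopen on POSIX systems

    out.clear();
    char buf[4096];
    std::size_t n;
    while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
        out.append(buf, n);
    }

    bool ok = !std::ferror(f);
    int saved_errno = errno;  // fclose may overwrite errno
    std::fclose(f);
    if (!ok) errno = saved_errno;
    return ok;
}
```

A caller can then test the return value directly and consult errno only on failure, rather than interpreting an int whose meaning is unclear.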
> Thanks for the feedback, do you need me to submit the changes or have
> you made them?
I've made some and I'm in the process of working through the rest.
Cheers,
Olly