[Xapian-discuss] Empty records & unique key...

Olly Betts olly at survex.com
Thu May 12 00:17:36 BST 2005


On Wed, May 11, 2005 at 07:13:27AM -0700, arjan holscher wrote:
> --- Olly Betts <olly at survex.com> wrote:
> > But some lines have multiple \r characters before
> > the \n,
> > not just one.  Which is rather odd, but shouldn't
> > actually cause
> > problems except that boolean terms will include
> > these extra characters!
> 
> So, it would be wise to get rid of these \r
> characters.

I'd say so.  It probably makes sense for scriptindex to strip
multiple trailing \r characters.  You wouldn't normally expect
them, but if they are there, removing them seems the right
thing to do.
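
As a rough illustration of that stripping (a Python sketch, not
scriptindex's actual code; the function name is made up for this
example), one trailing \n is removed first and then any run of
trailing \r characters:

```python
def strip_line_ending(line: str) -> str:
    """Drop one trailing newline, then every trailing carriage return.

    Hypothetical helper for illustration only.
    """
    if line.endswith("\n"):
        line = line[:-1]
    # rstrip("\r") removes a whole run of \r at the end, not just one,
    # and leaves any \r in the middle of the line untouched.
    return line.rstrip("\r")
```

With this, a line such as "id=42\r\r\r\n" comes out as "id=42", so
a boolean term built from it would no longer carry the stray
characters.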

> Are you sure that this doesn't cause the problem with the empty records?

I can't see how it could, and it doesn't for me.

> > Anyway, with the latest development version on
> > Linux, I get 4284 records
> > indexed.
> 
> Don't ask me why and how, but now I actually get 4k of
> documents added. However, some of the records are
> still empty. How is this possible?

When you say "4k" do you mean exactly 4000, or 4096, or
the same "about 4k" that I got (i.e. 4284)?

> So, apart from those \r characters no strange material
> is contained within the data dump?

Hmmm.  I wonder if the double blank lines between records are a problem.
Or that coupled with the extra \r characters.

Running under valgrind, double blank lines cause us to look at character
-1 of a string.  I'll fix that.  Maybe that's the cause of your blank
records.  It would explain why they seem to come and go...
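
To make the failure mode concrete, here is a minimal Python sketch
of a reader that tolerates runs of blank lines.  It is an
assumption-laden illustration rather than the real fix: the function
name and the sample field names are invented, and it only assumes
that records are groups of lines separated by blank lines.

```python
def split_records(lines):
    """Group non-blank lines into records.

    Any run of blank lines counts as a single separator, so a double
    blank line (in the middle or at the end of the input) never
    produces an empty record.  Hypothetical helper for illustration.
    """
    records = []
    current = []
    for raw in lines:
        # Strip the line ending, including any extra \r characters.
        line = raw.rstrip("\r\n")
        if not line:
            # Test for emptiness before inspecting any character of
            # the line; only close a record that has content.
            if current:
                records.append(current)
                current = []
            continue
        current.append(line)
    if current:
        records.append(current)
    return records


sample = ["id=1\n", "title=a\n", "\n", "\n", "id=2\r\r\n", "\n", "\n"]
print(split_records(sample))
```

The key point is that the blank-line test happens before anything
looks at the last character of the line, and that a separator seen
while no record is open is simply ignored.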

> I have removed the \r characters and so far the issue
> seems fixed. I have 2 remaining issues now:
> 
> - How do I sort a document by time. I could add a
> timestamp field which would contain a unix timestamp.

Put the timestamp in a document value (with Document::add_value()).

> Then one question remains, how do I sort it ascending
> or descending?

0.8.5 only allows sorting by value in one direction.  0.9.0 will add the
ability to reverse sort (previously people have worked around this by
storing <large number> - <timestamp> in another value).  I'm pretty sure
I'll get 0.9.0 released this week.
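
A sketch of that workaround in plain Python (no Xapian calls), on
the assumption that values are compared as strings, which is why
both keys are zero-padded to a fixed width.  LARGE and WIDTH are
hypothetical choices for this example; any bound larger than every
timestamp you will store would do.

```python
LARGE = 9_999_999_999  # hypothetical upper bound on timestamps
WIDTH = 10             # digits needed to print LARGE


def forward_key(timestamp: int) -> str:
    """Fixed-width key: string order matches ascending time."""
    if not 0 <= timestamp <= LARGE:
        raise ValueError("timestamp out of range")
    return f"{timestamp:0{WIDTH}d}"


def reverse_key(timestamp: int) -> str:
    """Fixed-width key: string order matches descending time."""
    if not 0 <= timestamp <= LARGE:
        raise ValueError("timestamp out of range")
    return f"{LARGE - timestamp:0{WIDTH}d}"


stamps = [1115853456, 1115800000, 1116000000]
# Sorting on the reversed key puts the newest timestamp first.
print(sorted(stamps, key=reverse_key))
```

Each string would then go into its own value slot via
Document::add_value(), and you sort on whichever slot gives the
direction you want.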

> - Furthermore, I can't find the internal which is
> empty. There has to be 1 document with an empty
> internal. However, if you could point me to it (since
> you have seem to found it :>) then i'd be glad.

I only deduced it must exist, since there's one more record than there
are internal fields in the file!

Actually, looking again I bet it's because there's a double blank line
at the end of the file.  And if I run update and look at the last record
(which is the one readded) it has no data and no terms, so that fits.

Cheers,
    Olly


