[Xapian-devel] Re: [Xapian-commits] 7283: trunk/xapian-core/ trunk/xapian-core/net/

Wed Oct 4 22:39:55 BST 2006

On Wed, Oct 04, 2006 at 08:56:32PM +0100, Olly Betts wrote:
> > http://www.tartarus.org/~richard/xapian-patches/queryparser-incremental.patch
> 
> I notice a bug - you need to restart the allterms iteration if there's
> no "I" term...

I'm not convinced it is actually a bug - the skip_to() will move to the
first term after the non-existent "I" term, but all the non-prefixed terms
will sort after the "I" term, so the second skip_to() will still move to
the right place.  (Assuming "name" begins with a lowercase letter, which I
think it always will at present.)

However, this is fragile (relies on the sort order of term iterators, and
"name" beginning with a lowercase letter), so I'll add a
    t = db.allterms_begin();
before the second skip_to().
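
To illustrate the point, here is a rough sketch of the lookup with the
restart added.  This is not the code from the patch: a sorted
std::vector stands in for the allterms list, std::lower_bound stands in
for skip_to(), and the names Terms, Lookup and lookup_partial are made up
for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the allterms list: terms sorted bytewise, so "I..." terms
// sort before any term starting with a lowercase letter.
using Terms = std::vector<std::string>;

// Forward-only skip, modelling skip_to(): moves to the first term >= target
// at or after pos.
Terms::const_iterator skip_to(Terms::const_iterator pos,
                              Terms::const_iterator end,
                              const std::string& target) {
    return std::lower_bound(pos, end, target);
}

struct Lookup {
    bool used_i_term;  // true if the I-term for `name` exists
    Terms matches;     // the I-term, or the expanded terms starting with `name`
};

Lookup lookup_partial(const Terms& terms, const std::string& name) {
    assert(std::is_sorted(terms.begin(), terms.end()));
    Lookup result{false, {}};
    const std::string i_term = "I" + name;

    auto t = skip_to(terms.begin(), terms.end(), i_term);
    if (t != terms.end() && *t == i_term) {
        result.used_i_term = true;
        result.matches.push_back(i_term);
        return result;
    }

    // Restart the iteration, so the second skip no longer depends on
    // `name` sorting after the missing I-term.
    t = terms.begin();
    t = skip_to(t, terms.end(), name);
    for (; t != terms.end() && t->compare(0, name.size(), name) == 0; ++t)
        result.matches.push_back(*t);
    return result;
}
```

Without the restart, a name starting with a byte below 'I' (an uppercase
"App", say) would be skipped past by the first skip, which is the fragile
case.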

> Interesting idea though.  How much size overhead do the I-terms add?

That depends - you don't need to generate I-terms for all possible prefixes
to remove the slow cases.  The current index strategy I'm using is to only
generate I-terms for the prefixes of length <= 3.  So: "categories"
generates:

Ic Ica Icat (plus the normal terms generated by "categories")

This means that the short prefix searches which can't reasonably be used
for a wildcard search (due to the immense number of matching terms) are
nice and efficient, but the longer prefix searches still work by expanding
all the matching terms.
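
As a minimal sketch of that indexing rule (the function name make_i_terms
and the max_len parameter are invented for the example, and it works on
bytes, so it ignores any UTF-8 issues):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Generate the I-terms for one word: "I" followed by each prefix of the
// word whose length is between 1 and max_len.
std::vector<std::string> make_i_terms(const std::string& word,
                                      std::size_t max_len = 3) {
    assert(max_len >= 1);
    std::vector<std::string> out;
    const std::size_t limit = std::min(max_len, word.size());
    for (std::size_t len = 1; len <= limit; ++len)
        out.push_back("I" + word.substr(0, len));
    return out;
}
```

For "categories" this gives Ic, Ica and Icat, as above; the normal terms
would be generated separately.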

With this setup, one test database increases its postlist size by 50%, and
its termlist size by 35%.

I'm wondering if the single letter terms (ie, "Pa" to "Pz") are actually
worthwhile, since so many documents match these.  I may discard them, and
change my incremental match code such that it doesn't bother trying to
search for 1 letter partial terms.  Discarding the 1 letter I-terms makes
the postlist size increase by 36% (instead of 50%) and the termlist size
increase by 29% (instead of 35%).  (All these measurements are using quartz
- the better compression in flint might make the differences smaller, I
suppose, since these terms will match many documents.)
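
If I do drop the 1 letter I-terms, the matcher's choice would look roughly
like the sketch below.  The Strategy enum, choose_strategy and the two
length limits are hypothetical names for the example, not anything in the
patch.

```cpp
#include <cassert>
#include <cstddef>

enum class Strategy {
    Skip,    // partial term too short: don't search for it at all
    ITerm,   // look up the single I-term for this prefix
    Expand   // expand all terms starting with the prefix
};

// Pick how to handle a partial term from its length.  min_len = 2 models
// discarding the 1 letter I-terms; max_len = 3 is the longest prefix for
// which I-terms are generated.
Strategy choose_strategy(std::size_t partial_len,
                         std::size_t min_len = 2,
                         std::size_t max_len = 3) {
    assert(min_len <= max_len);
    if (partial_len < min_len) return Strategy::Skip;
    if (partial_len <= max_len) return Strategy::ITerm;
    return Strategy::Expand;
}
```

Setting min_len to 1 gives the current behaviour, where the single letter
I-terms are still indexed and searched.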

Of course, for a database without any I-terms generated, the extra
skip_to() is a bit of a waste of time, but I wouldn't expect it to be a
noticeable problem - it will always skip to the same place in the
list of terms, so should get nicely cached.

> I think at this point I'd rather hold off on any "new functionality"
> changes.  I wasn't going to bother with a 0.9.7 release, but a number
> of useful fixes and improvements had accrued and people may not be able
> to migrate to 1.0 right away since databases will need rebuilding.
> 
> We can apply them for 1.0 I think.

Fair enough.

Perhaps, after the 0.9.7 release, we should put any further bug fixes for
0.9.7 on a separate branch, and then apply patches like these, and the
UTF-8 stuff, to the SVN trunk.

-- 
Richard