[Xapian-discuss] Rqt for Features

Olly Betts olly at survex.com
Mon Aug 9 12:16:59 BST 2004


On Fri, Jul 09, 2004 at 03:59:16PM +0100, Tim Brody wrote:
> Could xapian have the ability to specify docids? My system - as I'm sure
> many others do - maintains its own ids for people, docs etc. For the moment
> I've opted to rebuild the index from scratch every day, rather than
> maintaining a docid => myid mapping in order to perform incremental nightly
> changes.
> 
> The cleanest method from outside of the API would be if replace_document
> accepted a non-existent (to xapian) docid, in which case it adds the
> document rather than excepting (i.e. SQL's "REPLACE" behaviour).

This is something I'd noticed might be useful.  The main caveats are that
docid 0 is always invalid in Xapian, and that specifying sparse document
ids would subvert the compression techniques used in the quartz backend
to such an extent that you are likely to be better off using the unique
term approach.
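
To illustrate the compression caveat: posting lists are commonly stored as
the gaps between successive docids, written as variable-length integers, so
small gaps are cheap and large gaps are not.  The sketch below uses a generic
delta plus variable-byte scheme as an assumption; it is not quartz's actual
on-disk format, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed to store `value` as a variable-byte integer
// (7 payload bits per byte).
std::size_t varint_size(std::uint32_t value) {
    std::size_t bytes = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++bytes;
    }
    return bytes;
}

// Total bytes needed to store a strictly increasing docid list as gaps
// from the previous docid (the first gap is measured from 0).
std::size_t delta_encoded_size(const std::vector<std::uint32_t>& docids) {
    std::size_t total = 0;
    std::uint32_t prev = 0;
    for (std::uint32_t id : docids) {
        assert(id > prev);  // docid 0 is invalid; ids must be increasing
        total += varint_size(id - prev);
        prev = id;
    }
    return total;
}
```

Under this scheme the dense list {1, 2, 3, 4} costs 4 bytes, while the sparse
list {1000000, 2000000, 3000000, 4000000} costs 12 bytes for the same number
of entries.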

There's a comment in scriptindex.cc sketching out the idea of allowing
the hash of an external UID as the Xapian docid, but upon reflection I
think this is a very bad idea, especially as it risks collisions between
UIDs too.  I'll remove that comment shortly.
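
The collision risk can be estimated with the usual birthday approximation.
The sketch below assumes a uniformly distributed hash and a 32-bit docid
space; both are assumptions made for the sake of the estimate, and the
function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Birthday-bound approximation: probability that at least two of n_uids
// values collide when hashed uniformly into a space of 2^id_bits values.
double collision_probability(double n_uids, int id_bits) {
    assert(id_bits > 0 && id_bits <= 64);
    const double space = std::ldexp(1.0, id_bits);        // 2^id_bits
    const double pairs = n_uids * (n_uids - 1.0) / 2.0;   // distinct UID pairs
    return -std::expm1(-pairs / space);                   // 1 - exp(-pairs/space)
}
```

With these assumptions, collision_probability(10000, 32) comes out at roughly
1%, and collision_probability(100000, 32) at roughly 69%, so a collision is
more likely than not well before a collection reaches a hundred thousand
documents.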

> Having added wrappers for QueryParser I wonder whether it would be
> worthwhile revising Stopper. I can't think of a situation where a stopper
> would need to be more intelligent than containing a list of words to stop,
> so seems a little pointless distributing a class in Xapian that doesn't do
> this.

Two fairly reasonable examples:

(a) you might want to unconditionally stop all terms of N or fewer
characters.  Your approach would require specifying all 26^N terms to
stop (probably more actually since digits, etc are usually allowed in
terms).

(b) you might want to stop based on term frequency - for example any
term which occurs in more than M% of documents in the database could
be treated as a stopword (which provides a self-tuning,
application-specific stopword list!)
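
Both examples can be sketched as small function objects.  TermStopper below
is a hypothetical stand-in for a stopper interface, not Xapian's own class,
and the in-memory map of document frequencies is an assumption standing in
for whatever the database reports (in Xapian that would presumably come from
Database::get_termfreq() and Database::get_doccount()).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical interface for this sketch: returns true if `term`
// should be treated as a stopword.
struct TermStopper {
    virtual ~TermStopper() {}
    virtual bool operator()(const std::string& term) const = 0;
};

// (a) Stop every term of max_length or fewer characters
// (length is counted in bytes here).
class LengthStopper : public TermStopper {
    std::size_t max_length_;
public:
    explicit LengthStopper(std::size_t max_length)
        : max_length_(max_length) {}
    bool operator()(const std::string& term) const override {
        return term.size() <= max_length_;
    }
};

// (b) Stop any term that occurs in more than max_percent % of documents.
class FrequencyStopper : public TermStopper {
    std::map<std::string, std::size_t> doc_freq_;  // term -> documents containing it
    std::size_t doc_count_;
    double max_percent_;
public:
    FrequencyStopper(const std::map<std::string, std::size_t>& doc_freq,
                     std::size_t doc_count, double max_percent)
        : doc_freq_(doc_freq), doc_count_(doc_count),
          max_percent_(max_percent) {
        assert(max_percent >= 0.0 && max_percent <= 100.0);
    }
    bool operator()(const std::string& term) const override {
        std::map<std::string, std::size_t>::const_iterator it =
            doc_freq_.find(term);
        if (it == doc_freq_.end() || doc_count_ == 0) return false;
        return 100.0 * it->second > max_percent_ * doc_count_;
    }
};
```

Neither class needs a word list at all, which is the point: a stopper that is
only a list of words could not express either rule compactly.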

>         map<string,bool> terms;

I think set<string> is probably a more appropriate data structure here.
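
A minimal sketch of a list-based stopper built on a set follows; the class
and member names are hypothetical.  One practical difference from
map<string,bool> is that a lookup written as terms[word] inserts an entry
for every word tested, whereas set::count() never modifies the container.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical list-based stopper: a term is stopped if it is in the set.
class ListStopper {
    std::set<std::string> stopwords_;
public:
    void add(const std::string& word) {
        assert(!word.empty());   // an empty string is not a meaningful term
        stopwords_.insert(word); // inserting a duplicate is a no-op
    }
    bool operator()(const std::string& term) const {
        return stopwords_.count(term) != 0;
    }
    std::size_t size() const { return stopwords_.size(); }
};
```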

Cheers,
    Olly


