[Xapian-discuss] Splitting terms into separate B-Trees.

Wed May 23 08:43:45 BST 2007

On Tue, May 22, 2007 at 02:34:35PM -0700, Alexander Lind wrote:
> Kevin Duraj wrote:
> >I would like to have a feature in Xapian that would allow me to tell
> >Xapian to split B-Tree terms into 2 or more B-Trees that I could place
> >on different servers and then based on what term Xapian is searching
> >master server will issue search to particular worker server that holds
> >the term. Master server will received info from worker server which
> >documents IDs contain the term then master server retrieves document
> >from local database and return search result to client.

While possible, this is a major piece of work with a widespread
consequences.  It's unlikely to get implemented just because you'd
like to have it, although of course you're welcome to work on it
yourself.

But what is the problem you are hoping this change would address?  It's
likely there are easier ways to address it.

> Is it really a better approach to achieve load balancing this way, 
> rather than have identical indexes on all worker-machines, and just 
> round-robin out requests to them?

Kevin is essentially suggesting partitioning the data by term rather
than by document.

Both approaches are certainly possible to implement, but after some
consideration and experimentation we deliberately chose partitioning
by document for Xapian.

While partitioning by term does allow a single term search to be
serviced by a single database, for a multi-term query you will in
general need to combine information for multiple databases.  That's also
true for searching multiple databases in Xapian, but Xapian achieves
this by generating an MSet for each database and then merging these.
The merging required if you partition by term is at a much finer
granularity so is very sensitive to network performance.

Also, if a server becomes unresponsive, partitioning by document means
you can at least still provide some results.

> Seems like it could get pretty hairy to have the trees split like you 
> suggest. What if you want to add a third server?  What if most people 
> happen to be searching on things beginning with [a-m]?

Handling adding a server would indeed be somewhat clumsy, though it's
not too hard - you "just" need to repartition the entries in the two
databases between three.  There's no need to reinvert the file at
least.

I suspect you wouldn't actually want to partition on the first character
like that to try to avoid skewed load.  Using a hash of the term modulo
the number of servers is perhaps the most obvious approach, but it makes
it hard to iterate terms in order.  

Cheers,
    Olly