[Xapian-discuss] Splitting terms into separate B-Trees.

Thu May 31 19:25:48 BST 2007

I have feeling that developers who created  Google search engine
achieved their high search engine performance by partitioning the data
by terms rather then by document. For example, if you have billions of
documents and running 80,000 servers you cannot possibly execute
search for each user on all your servers and then merge the result
with MSet as we do with Xapian, but rather locate those servers that
holds B-Tree with the term that the user is searching for.

In this case it would be maximum one term for one server. If the user
enters 7 terms we would hit maximum < 7 servers. In our scenario
because we are partitioning data by documents and we would search
billions of documents on 80,000 servers we would hit 80,000 servers
and then merge the result with MSet. Because of this I feel that we
can improve our search engine and prepare to handle large scale of
documents.

Kevin Duraj

On 5/23/07, Olly Betts <olly at survex.com> wrote:
> On Tue, May 22, 2007 at 02:34:35PM -0700, Alexander Lind wrote:
> > Kevin Duraj wrote:
> > >I would like to have a feature in Xapian that would allow me to tell
> > >Xapian to split B-Tree terms into 2 or more B-Trees that I could place
> > >on different servers and then based on what term Xapian is searching
> > >master server will issue search to particular worker server that holds
> > >the term. Master server will received info from worker server which
> > >documents IDs contain the term then master server retrieves document
> > >from local database and return search result to client.
>
> While possible, this is a major piece of work with a widespread
> consequences.  It's unlikely to get implemented just because you'd
> like to have it, although of course you're welcome to work on it
> yourself.
>
> But what is the problem you are hoping this change would address?  It's
> likely there are easier ways to address it.
>
> > Is it really a better approach to achieve load balancing this way,
> > rather than have identical indexes on all worker-machines, and just
> > round-robin out requests to them?
>
> Kevin is essentially suggesting partitioning the data by term rather
> than by document.
>
> Both approaches are certainly possible to implement, but after some
> consideration and experimentation we deliberately chose partitioning
> by document for Xapian.
>
> While partitioning by term does allow a single term search to be
> serviced by a single database, for a multi-term query you will in
> general need to combine information for multiple databases.  That's also
> true for searching multiple databases in Xapian, but Xapian achieves
> this by generating an MSet for each database and then merging these.
> The merging required if you partition by term is at a much finer
> granularity so is very sensitive to network performance.
>
> Also, if a server becomes unresponsive, partitioning by document means
> you can at least still provide some results.
>
> > Seems like it could get pretty hairy to have the trees split like you
> > suggest. What if you want to add a third server?  What if most people
> > happen to be searching on things beginning with [a-m]?
>
> Handling adding a server would indeed be somewhat clumsy, though it's
> not too hard - you "just" need to repartition the entries in the two
> databases between three.  There's no need to reinvert the file at
> least.
>
> I suspect you wouldn't actually want to partition on the first character
> like that to try to avoid skewed load.  Using a hash of the term modulo
> the number of servers is perhaps the most obvious approach, but it makes
> it hard to iterate terms in order.
>
> Cheers,
>    Olly
>

-- 
Kevin Duraj
http://myhealthcare.com