[Xapian-discuss] Splitting terms into separate B-Trees.

Thu May 31 21:58:19 BST 2007

If we would partition databases by term's first characters and of
course some terms would generate larger B-tree then others, then we
could infinitely partition terms by more the one, two characters like
this … until we bring the size of the B-Tree database to small enough
to handle by many small servers ...

Example terms goes to:
-  google = server_go
-  good  = server_go
-  green = server_gr
-  kevin = server_ke
-  duraj = server_du

This would explain why Google in their documents claim to use cheap
servers, because partitioning B-Tree indexes by the terms can bring
the index down to any small size that the server can handle well, and
actually the servers would become the part of the B-Tree. There is no
limit how much you can search and size down the database.

Can you see the idea ?

Cheers
  Kevin Duraj

On 5/31/07, Kevin Duraj <kevin.softdev at gmail.com> wrote:
> I have feeling that developers who created  Google search engine
> achieved their high search engine performance by partitioning the data
> by terms rather then by document. For example, if you have billions of
> documents and running 80,000 servers you cannot possibly execute
> search for each user on all your servers and then merge the result
> with MSet as we do with Xapian, but rather locate those servers that
> holds B-Tree with the term that the user is searching for.
>
> In this case it would be maximum one term for one server. If the user
> enters 7 terms we would hit maximum < 7 servers. In our scenario
> because we are partitioning data by documents and we would search
> billions of documents on 80,000 servers we would hit 80,000 servers
> and then merge the result with MSet. Because of this I feel that we
> can improve our search engine and prepare to handle large scale of
> documents.
>
> Kevin Duraj
>
> On 5/23/07, Olly Betts <olly at survex.com> wrote:
> > On Tue, May 22, 2007 at 02:34:35PM -0700, Alexander Lind wrote:
> > > Kevin Duraj wrote:
> > > >I would like to have a feature in Xapian that would allow me to tell
> > > >Xapian to split B-Tree terms into 2 or more B-Trees that I could place
> > > >on different servers and then based on what term Xapian is searching
> > > >master server will issue search to particular worker server that holds
> > > >the term. Master server will received info from worker server which
> > > >documents IDs contain the term then master server retrieves document
> > > >from local database and return search result to client.
> >
> > While possible, this is a major piece of work with a widespread
> > consequences.  It's unlikely to get implemented just because you'd
> > like to have it, although of course you're welcome to work on it
> > yourself.
> >
> > But what is the problem you are hoping this change would address?  It's
> > likely there are easier ways to address it.
> >
> > > Is it really a better approach to achieve load balancing this way,
> > > rather than have identical indexes on all worker-machines, and just
> > > round-robin out requests to them?
> >
> > Kevin is essentially suggesting partitioning the data by term rather
> > than by document.
> >
> > Both approaches are certainly possible to implement, but after some
> > consideration and experimentation we deliberately chose partitioning
> > by document for Xapian.
> >
> > While partitioning by term does allow a single term search to be
> > serviced by a single database, for a multi-term query you will in
> > general need to combine information for multiple databases.  That's also
> > true for searching multiple databases in Xapian, but Xapian achieves
> > this by generating an MSet for each database and then merging these.
> > The merging required if you partition by term is at a much finer
> > granularity so is very sensitive to network performance.
> >
> > Also, if a server becomes unresponsive, partitioning by document means
> > you can at least still provide some results.
> >
> > > Seems like it could get pretty hairy to have the trees split like you
> > > suggest. What if you want to add a third server?  What if most people
> > > happen to be searching on things beginning with [a-m]?
> >
> > Handling adding a server would indeed be somewhat clumsy, though it's
> > not too hard - you "just" need to repartition the entries in the two
> > databases between three.  There's no need to reinvert the file at
> > least.
> >
> > I suspect you wouldn't actually want to partition on the first character
> > like that to try to avoid skewed load.  Using a hash of the term modulo
> > the number of servers is perhaps the most obvious approach, but it makes
> > it hard to iterate terms in order.
> >
> > Cheers,
> >    Olly
> >
>
>
> --
> Kevin Duraj
> http://myhealthcare.com
>

-- 
Kevin Duraj
http://myhealthcare.com