[Xapian-discuss] Splitting terms into separate B-Trees.

Richard Boulton richard at lemurconsulting.com
Wed May 23 08:43:05 BST 2007


Alexander Lind wrote:
> Kevin Duraj wrote:
>> Splitting terms into separate B-Trees.
>>
>> I would like to have a feature in Xapian that would allow me to tell
>> Xapian to split B-Tree terms into 2 or more B-Trees that I could place
>> on different servers and then based on what term Xapian is searching
>> master server will issue search to particular worker server that holds
>> the term. Master server will received info from worker server which
>> documents IDs contain the term then master server retrieves document
>> from local database and return search result to client.
>>
>> Example:
>>
>> Master holds database with documents Server0
>> - B-tree with bitmaps for all terms starting a-m are stored on ServerA
>> - B-tree with bitmaps for all terms starting n-z  are stored on ServerB

By "bitmaps", I assume you mean posting lists - ie, lists of matching 
documents.  For reference, Xapian doesn't use a bitmap to store this, it 
uses a list of (compressed) chunks of document IDs, together with other 
information (wdf, etc).

>> If I am looking for term: xapian, Master Server0 will issue search to
>> ServerB, result is then send back Master Server0 then Master Server0
>> retrieves documents details from local database and will send to
>> client.
>>
> 
> Is it really a better approach to achieve load balancing this way, 
> rather than have identical indexes on all worker-machines, and just 
> round-robin out requests to them?  Or if you want to get fancy, use some 
> sort of loadmonitor report to decide which machine should carry out the 
> search?
> 
> Seems like it could get pretty hairy to have the trees split like you 
> suggest. What if you want to add a third server?  What if most people 
> happen to be searching on things beginning with [a-m]?

Even worse - what if most people are performing two term searches where 
one term is in the first database, and the other term is in the second 
database?  Then, in order to find the highest ranked documents, the 
entire list of matching documents from each of the lists needs to be 
sent to the master and combined there.

Partitioning the database by term instead of by document might sometimes 
a reasonable thing to do, but I'm not sure it could be fitted into the 
Xapian framework very well.  Anyway, I'm not sure _why_ you'd want to, 
when Xapian allows you to partition the database by document, and allows 
for efficient combining of results in this situation.

-- 
Richard



More information about the Xapian-discuss mailing list