[Xapian-discuss] Converting MySQL database to Xapian

Oliver Flimm flimm at ub.uni-koeln.de
Fri Apr 26 15:27:17 BST 2013


Hi Michael,

On Thu, Apr 25, 2013 at 04:56:26PM -0500, Michael Lewis wrote:
> I am looking for some guidance on converting a large MySQL database to Xapian. The current structure is that the database is broken up into 160 "sub-databases". There are 50,000 or so records in each sub-database. Each record has content that I am full-text indexing. The average size of the text is about 59k characters. The database is broken up into sub-databases because the MySQL full-text handling is so abysmal. I use MySQL-Proxy and some Lua modules to make the searching and inserting transparent to the client programs. This data is written once and never (or once in a blue moon a record is deleted) modified. What I am looking to do is:

We have a similar fragmented setup for our library catalogs, 193 of
them at this time. See:

http://kug.ub.uni-koeln.de/

The data comes from different sources, not only from local library
systems but also from OpenLibrary, WikiSource and Project Gutenberg,
and is exported into a unified JSON format.

For each source catalog we create

a) a separate PostgreSQL database with all relevant information
(formerly we used MySQL, but we recently switched to PostgreSQL)

b) a separate Xapian index. The data stored with each document is a
predefined set of catalog fields in JSON format, used for display in
search result lists. We use pure Xapian, no Omega etc. (a minimal
indexing sketch follows below).
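
To illustrate b), here is a minimal indexing sketch using the Python
bindings for Xapian. The database path, the field names and the
record dict are invented placeholders, not our actual schema:

    import json
    import xapian

    # Hypothetical catalog record; the real field set depends on the catalog.
    record = {"title": "Faust", "author": "Goethe", "year": "1808"}

    db = xapian.WritableDatabase("/var/db/xapian/catalog42",
                                 xapian.DB_CREATE_OR_OPEN)

    doc = xapian.Document()
    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("de"))
    termgen.set_document(doc)

    # Index the searchable fields as free text.
    termgen.index_text(record["title"])
    termgen.index_text(record["author"])

    # Store the display fields as JSON in the document data, so result
    # lists can be rendered without going back to SQL.
    doc.set_data(json.dumps(record))

    db.add_document(doc)
    db.commit()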

With Xapian you can

a) join several or all indexes at search time. With a large number of
indexes we experienced problems with wildcard searches, though...

b) physically merge the indexes into a single index with
xapian-compact. This compacted index is optimized for searching.

c) export an index and access it as a remote backend (a search sketch
illustrating a) and c) follows below).
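
A minimal search sketch for a) and c) with the Python bindings
(paths, hostname and port are invented placeholders):

    import xapian

    # a) Join several indexes at search time; searches then see one
    #    virtual database.
    db = xapian.Database("/var/db/xapian/catalog1")
    db.add_database(xapian.Database("/var/db/xapian/catalog2"))

    # c) A remotely exported index (e.g. served by xapian-tcpsrv) can be
    #    joined in exactly the same way.
    db.add_database(xapian.remote_open("search.example.org", 33333))

    qp = xapian.QueryParser()
    qp.set_database(db)  # needed so wildcards expand over the joined terms
    query = qp.parse_query("goeth*", xapian.QueryParser.FLAG_WILDCARD)

    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    for match in enquire.get_mset(0, 10):
        print(match.rank, match.document.get_data().decode("utf-8"))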

We use both a) and b), and make it configurable for which combination
of databases (= a predefined search profile) a compacted index is
used. We don't use c), as all PostgreSQL databases and Xapian indexes
are present on each application server.

We run a setup of two clusters with two servers each. While one
cluster is searchable, the other is being updated. As soon as
updating has finished, they swap roles. For load balancing and
distribution of search requests we use haproxy (a config sketch
follows below).
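
The haproxy part can stay quite small. A minimal config sketch for
balancing over the two servers of the active cluster (hostnames and
ports are invented, and the logic for swapping clusters is left out):

    frontend search_in
        bind *:80
        default_backend active_cluster

    backend active_cluster
        balance roundrobin
        server search1 cluster1-a.example.org:8080 check
        server search2 cluster1-b.example.org:8080 check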

We have found that putting all information only into the search index
is - for our library catalogs - not advisable. The additional
PostgreSQL databases come in handy when analysing and mining our
data.

So a possible setup for you, provided you want to keep your
sub-databases, could be:

1) Build a Xapian index for each MySQL sub-database, preferably with
Omega.
2) Merge the indexes into a single index with xapian-compact.
3) Export this single index, again with Omega.
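
A hedged sketch of these three steps with the stock command line
tools (paths, the scriptindex script and the MySQL dump step are
invented placeholders; see the Omega docs for the index script
format):

    # 1) Index each sub-database, e.g. by dumping it to the field=value
    #    format that Omega's scriptindex reads (dump step is hypothetical).
    dump_mysql_subdb sub_db_001 > /tmp/sub_db_001.dump
    scriptindex /var/db/xapian/sub_db_001 catalog.script /tmp/sub_db_001.dump

    # 2) Merge all sub-indexes into one compacted index.
    xapian-compact /var/db/xapian/sub_db_* /var/db/xapian/combined

    # 3) Make the combined index searchable, e.g. via Omega's CGI, or
    #    export it as a remote backend with xapian-tcpsrv.
    xapian-tcpsrv --port 33333 /var/db/xapian/combined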

Compacting is very fast. For our 193 catalogs with currently
12,387,390 titles it takes around 40 minutes.

I wouldn't export each sub-index and then join them remotely at
search time, because I suspect

a) the setup would be a mess
b) performance would degrade

Try to export only one index.

Regards,

Oliver Flimm

-- 
Universitaet zu Koeln :: Universitaets- und Stadtbibliothek
IT-Dienste :: Abteilung Universitaetsgesamtkatalog
Universitaetsstr. 33 :: D-50931 Koeln
Tel.: +49 221 470-3330 :: Fax: +49 221 470-5166
flimm at ub.uni-koeln.de :: www.ub.uni-koeln.de


