[Xapian-discuss] Positive experiences with Xapian

Wed Aug 3 00:00:45 BST 2011

Hi Guys,

I just wanted to take a moment to give some positive feedback regarding my
experiences with Xapian.
I've been doing a fair amount of research into search engines recently, as
we have some fairly specific requirements for what we're attempting to do
with them. Long story short, after a few weeks of playing around with just
about everything under the sun (or at least, everything off the shelf:
Sphinx, Lucene, Solr, MySQL/Postgres fulltext, etc.), we settled on Xapian
because of its specific design characteristics, and because it's really,
really easy to use and alter.

The main reason we struggled to find something suitable was our large data
requirements. We're looking at indexing about 1TB of raw data (i.e.
excluding the size of any indexes or other metadata) for about 30,000
individual users. We're "top heavy" in terms of database design: a small
number of users, but a large amount of data. This gives rise to a few
different issues.
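
For a rough sense of scale, those two figures work out to a few tens of
megabytes of raw data per user on average. A back-of-envelope sketch in
Python, assuming decimal units (1TB = 10^12 bytes) and an even spread across
users, which real data almost certainly won't have:

```python
# Back-of-envelope only: both constants are the approximate figures quoted
# above, and the decimal terabyte is an assumption.
RAW_BYTES = 10**12     # about 1TB of raw data
USER_COUNT = 30_000    # about 30,000 individual users


def average_bytes_per_user(raw_bytes: int, users: int) -> float:
    """Mean raw data per user, ignoring indexes and other metadata."""
    return raw_bytes / users


avg_mb = average_bytes_per_user(RAW_BYTES, USER_COUNT) / 10**6
print(f"about {avg_mb:.1f} MB of raw data per user on average")
```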

Before we even begin on the search aspects, one thing that's important to us
is data separation. We're not running a blog or forum where you can mix
everyone's data in together and accept that there might be some errors from
time to time -- it's highly critical that nobody ever sees anyone else's
data. Lucene can deal with this in a general sense, because it fits into the
same niche that Xapian does in terms of how it integrates with everything
else: it's essentially just a library that you can use to create search
indexes. However, 'off the shelf' engines like Solr and Sphinx
fundamentally fail to handle the situation where you want physical
(as in, file system level) separation of data. Ever tried creating 30,000
individual indexes in Sphinx? Or Solr? I can tell you first hand that they
don't even come close to working. (Please note, I'm fully aware of the
argument that this could be considered "designing it wrong". However, I've
been designing these sorts of things for a long, long time, and I like to
think I know what I'm getting myself into.) Lucene can handle this sort of
thing in theory, but given we're a PHP / C shop, having to build and support
Java apps would just be a nightmare for us, not to mention that without
prototyping such a system there's no guarantee that it would even work.
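
To make "file system level" separation a bit more concrete, here is a
minimal sketch in Python of giving every user their own index directory. It
is an illustration only, not our actual layout: the root path, the
`index_dir_for_user` helper and the 256-way sharding are all hypothetical.
Each resulting directory could then hold one independent search database, so
a query for one user never opens another user's files.

```python
from pathlib import Path

# Hypothetical root directory; every name in this sketch is an assumption.
INDEX_ROOT = Path("/var/lib/search/indexes")


def index_dir_for_user(user_id: int, root: Path = INDEX_ROOT) -> Path:
    """Return the directory reserved for a single user's index.

    Only non-negative integers are accepted, so a user ID can never
    inject path components such as "../".
    """
    if isinstance(user_id, bool) or not isinstance(user_id, int) or user_id < 0:
        raise ValueError(f"invalid user id: {user_id!r}")
    # Spread users over 256 subdirectories so that no single directory
    # ends up with tens of thousands of entries.
    shard = f"{user_id % 256:02x}"
    return root / shard / f"user-{user_id}"


print(index_dir_for_user(12345))
```

Deriving the path from nothing but a validated integer ID is the design
choice here: there's no lookup table to get out of sync, and no way for one
user's ID to resolve to another user's directory.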

Hypothetically, even if you could get such systems to work in
Lucene/Solr/Sphinx, the other significant design flaw as far as I'm
concerned is the fact that they're designed to run "in memory". This just
flat out does not work for us at all. The /raw/ data set we're dealing with
is about 1TB. After you've cooked this, by indexing and whatever other
processes take place, it'll end up being a multiple of this. This means
that, were we to try to shove everything in memory for performance
reasons, we'd have to have stupidly massive amounts of RAM DEDICATED to the
searchd. Xapian, on the other hand, does what I consider to be the "right"
thing, and actually uses the OS to cache its file accesses. This approach
is totally superior as far as we're concerned. It allows us to throw as much
memory at a box as we require for performance reasons, without having to get
into the insanity of managing an individual service that needs to consume
99% of the available memory.

Another quick note on the database-based fulltext indexes - MyISAM fulltext
is just fundamentally unable to handle what we want to do from a performance
standpoint, end of story. I think we calculated it'd take us something like
3 months to build the indexes on a single development server. I'm aware
Postgres is a different story, but at the end of the day, it's really not
suitable either, for the same reason: these systems are designed as
databases, not as search engines.

In summary, Xapian ticks all of the boxes for us. It can integrate with just
about any modern language, it's easy to use, it "just works" and is
generally bug free, and it's made a great foundation for us to build our own
search services on. There are a hundred other design aspects I haven't
touched on here (the general feature set, stemming, and out of the box
search accuracy come to mind), but for the most part we haven't been let
down yet.

Leaving one negative bit for last, and it's not a huge one by any means - as
someone who's been building large scale web apps since the dawn of time, I
find the SWIG PHP classes fairly awful. Don't get me wrong - it's not a huge
issue, and I fully understand why they are the way they are (it's a C++
library, and is not PHP specific, therefore SWIG is a good fit). I'd like to
try to do something about it in the future, so if I come up with anything
worthwhile you'll be the first to know.

We haven't put the system into production yet, but at this stage I'm really
looking forward to finishing off development and seeing what happens.

Regards,
Peter