[Xapian-discuss] Fwd: Positive experiences with Xapian

Tue Aug 9 01:56:11 BST 2011

On 8 August 2011 19:38, Paul Boddie <paul.boddie at biotek.uio.no> wrote:

> On 03/08/11 01:00, Peter Van Dijk wrote:
>
>>
>> Hypothetically, even if you could get such systems to work in
>> Lucene/Solr/Sphinx, the other significant design flaw as far as i'm
>> concerned is the fact that they're designed to run "in memory".
>>
>
> I'm sorry, but I seriously doubt that Lucene was "designed to run in
> memory", presumably meaning that you have to load the index into memory to
> get it to work: the characteristics of the data format are specifically
> designed to work efficiently with disk I/O.
>
>
Let me start by saying thanks for the feedback :)

You are right, they aren't specifically designed that way, but to get the
levels of performance out of them that we require, we needed significantly
more memory than with Xapian (which could be due to other factors, I admit),
and I probably shouldn't have included Lucene in that statement at all.

Regarding Sphinx and Solr though, my explanation was a bit flawed - I wasn't
trying to imply that they are designed to be in-memory in the same way that
some database engines are - what I was referring to is that they require a
second layer of cache that's separate from the OS/FS cache. Using Sphinx as
an example, it really is designed to use significant amounts of memory for
caching in its searchd process, and Solr does the same sort of thing, I
believe (even though I'm not intimately familiar with it).
With Xapian we don't need to worry about that (i.e. memory management for
individual processes and such) since it simply relies on the OS cache, which
is the optimal approach for what we want.

Don't get me wrong, though, they all run just fine off of disk for a large
majority of use cases, and I'm not trying to scare anyone away from them.
It's just that when your data requirements get big enough, and your
performance requirements are high, some of the cracks really start to show
in terms of how they all fit together (and even then, I'm fairly sure our
data requirements aren't "that big" compared to a lot of other stuff out
there).

For what it's worth, we've been using Sphinx in other systems for years now
(and will continue using it), and it's great at what it does.

>> This approach is totally superior as far as we're concerned. It allows
>> us to throw as much memory at a box as we require for performance
>> reasons, without having to get into the insanity of managing an
>> individual service that needs to consume 99% of the available memory.
>>
>
> You can argue that Java itself imposes ridiculous memory management
> limitations - I was using PyLucene in the era when they supported a
> GCJ-compiled library - but that's a separate issue.
>
> I'm not using either Lucene or Xapian actively at the moment, and I can't
> really call myself a Lucene enthusiast either - I switched from Lucene to
> Xapian for various reasons, some of which I probably share with you - but
> no-one benefits from inaccurate information about supposed "competitors"
> when having accurate information about them could actually inform Xapian
> development.
>

My post was mainly intended to be a somewhat technical "thank you" to anyone
involved with Xapian who might see it.
I don't see anything as a competitor, as you put it. I don't advocate
anything - I'm happy to let people make up their own minds and use whatever
tool is right for the job.
Not to mention that we don't even have Xapian in production yet, so my
comments should all be taken with a grain of salt :)

Anyway, I'm far from an expert, but I just wanted to try to explain why it
works so well for us, and I figured some people might appreciate the
positive feedback.

>
>> Another quick note on the database based fulltext indexes - MyISAM
>> fulltext is just fundamentally unable to handle what we want to do from
>> a performance standpoint, end of story. I think we calculated it'd take
>> us something like 3 months to build the indexes on a single development
>> server. I'm aware postgres is a different story, but at the end of the
>> day, it's really not suitable either for the same reasons. They're
>> designed as databases not as search engines.
>>
>
> There's one thing that database systems are very good at, if configured
> appropriately, and that's determining the most optimal querying approach. I
> perform huge numbers of searches on indexed text in batches, and in such
> situations a database system would probably employ more efficient techniques
> transparently, mostly because they provide such facilities generally.
> Indeed, the general data management functions offered by systems like
> PostgreSQL have a lot more to bring to the table than people would have you
> believe.
>
> The only reason why I'm not playing with PostgreSQL's full-text support is
> that they omitted support for general regular-expression-based tokenisation
> in favour of a handful of hand-coded tokenisers, and I don't yet have the
> inclination to write one which provides such an obviously useful feature.
>

Well, I'm a MySQL nut from way back, so I would have loved to use an RDBMS of
any kind to solve my problems. It's just a shame it "doesn't work" for us. I
never got so far as to play with Postgres' tokenisers, but I can see why
that'd be an issue for a lot of people. Though I think one of the other
notable things about Postgres is that it has a lot more work being done on
it in the fulltext search realm than MySQL does.

Going back a few years, Sphinx held a lot of initial appeal for me - the
MySQL integration is a nice touch if you're working with a dev team that
uses MySQL daily.

Peter