[Xapian-discuss] Positive experiences with Xapian
Paul Boddie
paul.boddie at biotek.uio.no
Mon Aug 8 10:38:57 BST 2011
On 03/08/11 01:00, Peter Van Dijk wrote:
>
> Hypothetically, even if you could get such systems to work in
> Lucene/Solr/Sphinx, the other significant design flaw as far as i'm
> concerned is the fact that they're designed to run "in memory".
I'm sorry, but I seriously doubt that Lucene was "designed to run in
memory", presumably meaning that you have to load the index into memory
to get it to work: the data format is specifically designed to work
efficiently with disk I/O.
[...]
> searchd. Xapian, on the other hand, does what i consider to be the "right"
> thing, and actually uses the OS to cache its file accesses.
Filesystem caching works effectively enough with Lucene. In fact, when I
tested the "load it into RAM" approach using a "RAM directory" (Lucene's
RAMDirectory), it offered no real benefit over letting the OS do the
caching. This was with a large number of searches spread across an
index.
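
To make the distinction concrete, here is a minimal Python sketch of the
two access styles. It is not Lucene or Xapian code, and the function
names and the scratch file are hypothetical, chosen only for
illustration. The first function maps the file and lets the operating
system page in whatever is touched; the second copies the whole file
into the process before reading from it.

```python
import mmap
import os
import tempfile


def read_slice_via_mapping(path, offset, length):
    """Read bytes through a read-only memory mapping.

    Only the pages actually touched need to be resident, and the OS page
    cache decides what stays in memory between calls.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]


def read_slice_via_full_load(path, offset, length):
    """Read bytes after copying the entire file into process memory."""
    with open(path, "rb") as f:
        data = f.read()
    return data[offset:offset + length]


# Hypothetical stand-in for an index file, created only for this demo.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"0123456789" * 1000)
os.close(fd)

# Both styles return the same bytes; they differ in who manages the memory.
print(read_slice_via_mapping(demo_path, 10, 5))
print(read_slice_via_full_load(demo_path, 10, 5))

os.remove(demo_path)
```

Both calls return identical data. The sketch says nothing about relative
speed, which depends on the workload and on how much of the file is
already cached.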
> This approach
> is totally superior as far as we're concerned. It allows us to throw as much
> memory at a box as we require for performance reasons, without having to get
> into the insanity of managing an individual service that needs to consume
> 99% of the available memory.
You can argue that Java itself imposes ridiculous memory management
limitations - I was using PyLucene in the era when they supported a
GCJ-compiled library - but that's a separate issue.
I'm not using either Lucene or Xapian actively at the moment, and I
can't really call myself a Lucene enthusiast either - I switched from
Lucene to Xapian for various reasons, some of which I probably share
with you - but no-one benefits from inaccurate information about
supposed "competitors" when having accurate information about them could
actually inform Xapian development.
> Another quick note on the database based fulltext indexes - MyISAM fulltext
> is just fundamentally unable to handle what we want to do from a performance
> standpoint, end of story. I think we calculated it'd take us something like
> 3 months to build the indexes on a single development server. I'm aware
> postgres is a different story, but at the end of the day, it's really not
> suitable either for the same reasons. They're designed as databases not as
> search engines.
There's one thing that database systems are very good at, if configured
appropriately, and that's determining the best querying approach. I
perform huge numbers of searches on indexed text in batches, and in such
situations a database system would probably employ more efficient
techniques transparently, because query planning is a facility such
systems provide for all kinds of queries. Indeed, the general data
management functions offered by systems like PostgreSQL have a lot more
to bring to the table than people would have you believe.
The only reason why I'm not playing with PostgreSQL's full-text support
is that its developers omitted support for general
regular-expression-based tokenisation in favour of a handful of
hand-coded tokenisers, and I don't yet have the inclination to write a
tokeniser which provides such an obviously useful feature.
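
As a rough illustration of what regular-expression-based tokenisation
means here, the following is a minimal Python sketch. It is not
PostgreSQL's parser interface; the function name and both patterns are
assumptions made up for this example. The point is only that the caller
supplies the pattern that defines a token, rather than relying on a
fixed, hand-coded set of rules.

```python
import re


def regex_tokenise(text, pattern=r"\w+"):
    """Return lower-cased tokens matched by a caller-supplied pattern."""
    return [m.group(0).lower() for m in re.finditer(pattern, text)]


sample = "Full-text search in PostgreSQL 9.0"

# Default pattern: runs of word characters, so hyphens and dots split tokens.
print(regex_tokenise(sample))

# Hypothetical alternative: keep hyphenated words and dotted numbers whole.
print(regex_tokenise(sample, r"\w+(?:[-.]\w+)*"))
```

With the default pattern "Full-text" is split into two tokens, while the
second pattern keeps "full-text" and "9.0" intact, which is the kind of
control that a fixed set of tokenisers does not offer.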
Paul