[Xapian-discuss] Positive experiences with Xapian

Mon Aug 8 10:38:57 BST 2011

On 03/08/11 01:00, Peter Van Dijk wrote:
>
> Hypothetically, even if you could get such systems to work in
> Lucene/Solr/Sphinx, the other significant design flaw as far as I'm
> concerned is the fact that they're designed to run "in memory".

I'm sorry, but I seriously doubt that Lucene was "designed to run in 
memory", presumably meaning that you have to load the index into memory 
to get it to work: the characteristics of the data format are 
specifically designed to work efficiently with disk I/O.

[...]

> searchd. Xapian, on the other hand, does what I consider to be the "right"
> thing, and actually uses the OS to cache its file accesses.

Filesystem caching works effectively enough with Lucene. In fact, when I 
tested the "load it into RAM" approach using a "RAM directory" 
(RAMDirectory in Lucene's API), it offered no real benefit over letting 
the OS do the caching. This was with a large number of searches spread 
across an index.
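
To make the comparison concrete, here is a minimal Python sketch of the 
two approaches. It is not Lucene code: the function names are 
hypothetical, and a plain file stands in for an index segment. The first 
function copies the whole file into the process heap before reading 
from it; the second memory-maps the file and leaves caching to the OS.

```python
import mmap


def read_via_heap(path, offset, length):
    """Copy the whole file into process memory, then slice it.

    Stands in for the "load it into RAM" approach: the process itself
    holds a full copy of the data.
    """
    with open(path, "rb") as f:
        data = f.read()
    return data[offset:offset + length]


def read_via_page_cache(path, offset, length):
    """Memory-map the file and slice the mapping.

    Only the pages actually touched are read, and the OS page cache
    decides what stays resident. The file must not be empty, since an
    empty file cannot be mapped.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[offset:offset + length]
```

Both functions return the same bytes; the difference is who owns the 
memory. In the second case repeated reads of hot regions are served from 
the OS cache without the process reserving memory for the whole file.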

> This approach
> is totally superior as far as we're concerned. It allows us to throw as much
> memory at a box as we require for performance reasons, without having to get
> into the insanity of managing an individual service that needs to consume
> 99% of the available memory.

You can argue that Java itself imposes ridiculous memory management 
limitations - I was using PyLucene in the era when they supported a 
GCJ-compiled library - but that's a separate issue.

I'm not using either Lucene or Xapian actively at the moment, and I 
can't really call myself a Lucene enthusiast either - I switched from 
Lucene to Xapian for various reasons, some of which I probably share 
with you - but no-one benefits from inaccurate information about 
supposed "competitors" when having accurate information about them could 
actually inform Xapian development.

> Another quick note on the database based fulltext indexes - MyISAM fulltext
> is just fundamentally unable to handle what we want to do from a performance
> standpoint, end of story. I think we calculated it'd take us something like
> 3 months to build the indexes on a single development server. I'm aware
> postgres is a different story, but at the end of the day, it's really not
> suitable either for the same reasons. They're designed as databases not as
> search engines.

There's one thing that database systems are very good at, if configured 
appropriately, and that's choosing an efficient query plan. I perform 
huge numbers of searches on indexed text in batches, and in such 
situations a database system would probably employ more efficient 
techniques transparently, mostly because query planning is a facility 
such systems provide generally. Indeed, the general data management 
functions offered by systems like PostgreSQL have a lot more to bring to 
the table than people would have you believe.

The only reason why I'm not playing with PostgreSQL's full-text support 
is that they omitted support for general regular-expression-based 
tokenisation in favour of a handful of hand-coded tokenisers, and I 
don't yet have the inclination to write one which provides such an 
obviously useful feature.
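
For illustration, regular-expression-based tokenisation can be as small 
as the Python sketch below. This is not PostgreSQL code: the pattern and 
the names TOKEN_PATTERN and tokenise are assumptions chosen for the 
example, and the point is only that the token definition becomes a 
single expression the user can swap out.

```python
import re

# Hypothetical token definition: runs of ASCII letters and digits,
# optionally joined by internal hyphens or apostrophes.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def tokenise(text, pattern=TOKEN_PATTERN):
    """Return the lower-cased tokens matched by `pattern`, in order."""
    return [m.group(0).lower() for m in pattern.finditer(text)]


print(tokenise("Xapian's full-text search"))
# prints ["xapian's", 'full-text', 'search']
```

Passing a different compiled pattern as the second argument changes the 
tokenisation without touching any other code.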

Paul