[Xapian-discuss] Ranking and term proximity

Tue Sep 6 13:18:41 BST 2011

On 06/09/11 19:31, William Crawford wrote:
> On Tuesday 06 September 2011 07:35:32 goran kent wrote:
>> Reminds me of
>> http://trac.xapian.org/ticket/326 - chert (without patches, but even
>> with, it's still bad) is 7x SLOWER than the older flint format.
>> That's embarrassing.  Yes, one can argue that chert *may* perform
>> better with larger indexes, but hell, that's still a bad start...  Can
>> you imagine trying to justify/explain that kind of degradation in a
>> commercial product?  You'd be laughed right out the conference room.
> I'd like to pipe up from the back here and make an observation.
>
> I've got an index that's about 700M, on a server with 24G RAM, and I'd much
> rather have the faster search that comes from /not/ trying too hard to
> compress the data. I understand there are use-cases where everything can't be
> cached, but perhaps there's a need for either two backends (mono- and megalith
> would be good names?) or a flag to pass to the WritableDatabase when creating?
It would be awesome to have a backend that was more optimized for the
case where the data is all (or mostly all) in memory. That would affect
the choice of compression algorithms, data structures, etc.

I expect this would allow you to use more complex ranking, faceting,
large queries, etc., and also give more predictable response times,
because searches would not hit the disks.

This isn't only useful for small data - there are plenty of use cases
with large data where it's worth partitioning the index across many
servers so that each partition fits in memory.

I'm not aware of anything open source that does this well.