[Xapian-discuss] Incremental indexing limitations

Ron Kass ron at pidgintech.com
Sun Oct 14 14:12:19 BST 2007

>> We must make a priority what to do here.
>> - fast indexing
>> - fast searches
>> - anything else has low priority
> These are your priorities.  Other people have different priorities.
> For example, at the start of this thread Ron was surprised by the size of
> the index, so it would seem that he's worried about index size to at
> least some extent.  Simply stating boldly that your priorities must be
> ours won't change what other people's priorities are.
> Overall, I think fast indexing and searching are important, but so is
> index size, quality of results, useful features, and ease of use (both
> to end users, and of the API).  This list probably isn't exhaustive.
> We need to consider everyone's needs, or else you'd be the only Xapian
> user.
>> I am placing a higher priority on indexing than on searching. Size of
>> index is a low priority: tomorrow you will be able to buy a 1 Terabyte
>> hard drive for the price of a good lunch at the Ritz-Carlton. Why focus
>> on a small index when we want to have the fastest indexing? However,
>> developers tend to complain about the size of the index but never
>> complained that Xapian was indexing approx 10 times faster than Lucene,
>> MySQL or MSSQL. We got rid of the fastest indexing I have ever seen in
>> my life. Why? Developers complained about the size of the Quartz
>> database that we used to have as default.
> Also factor in the cost of backup for that 1TB, and the extra RAM you'll
> need to cache enough of it for searches to run fast.  Also, a disk that
> size is likely to run hotter and draw more power (as will the extra
> RAM), so you may need a beefier power supply and better fans.  The
> extra electricity will tend to increase your hosting costs too.  And
> being cutting edge technology, a 1TB drive will probably be more liable
> to failure.
Hi Olly and Kevin,

It would actually be interesting to run a poll of some sort to see how
people prioritize the importance of different features in Xapian.
I would say the most important for us is speed of search. It might have
implications for the size of the index, or it might not, but that's the
most important.
Second is the size of the index; indeed it is important to us, mainly
because we are interested in testing Xapian on a huge dataset. And by huge
I mean immensely, mind-bogglingly gigantic. I guess if we could somehow fit
the entire index on one big RAID, we wouldn't care if it's 3TB or 5TB.
But when you have to take it a few leaps ahead, "size does matter".
Third will be features, actually, which we feel Xapian covers quite
extensively, leaving the speed factor aside. But maybe others feel there
are fundamental ones missing, so it would be interesting to ask.
Only then comes speed of indexing. But let me explain this one: there are
two speeds here, first-time indexing speed and ongoing indexing speed. Both
are actually less critical (unless we are talking about a huge
difference). If it took twice as long to do the first indexing, that
would be acceptable. I would say the same about ongoing indexing, BUT
the most important factor here is that indexing keeps up with the
incoming data AND that new data doesn't take much longer to appear in the
index than people expect. For example, if it takes a day (after initial
crawling) for a page to show up in Google's index, that's no problem (even
a week). But if it took even 8 hours for a news piece to show up in
Google's news search, that would be way too much.

Our main concern right now is search speed. Reading some exchanges here
about people who deal with searches taking over a minute, that would be an
issue. Now, of course you can throw more RAM at the thing and get better
results, but that's not really the right criterion for speed. What is
important here is to get better speed than the alternatives on the SAME
hardware. This is the winning card here.
Now, for this we have two ideas we are happy to share and hear feedback on:
1. Auto warm-up: In many discussions it is stated that a warmed-up
database is faster. Of course it is. One nice thing would be to have a
warm-up (or auto-warm-up) mechanism that automatically loads
critical/popular parts of the B-trees into memory even before the first
searches are executed. It is better to let the machine do that than to
let the first users do it and see slow searches for a while.
2. Graceful time-out on searches: Allow a search to run for a MAXIMUM of
X seconds, and then return whatever results it has EVEN if they are not
complete/perfect. In many applications, it is better to show a partial
result after 5 seconds than a complete/accurate one after 50. Users
don't wait 50 seconds for a result. (Most don't wait even 5, so maybe a
timeout of 0.5 should be used by some. Anyway, it should be a parameter per
3. I know that Xapian's model is to trust the OS's cache. But has this
assumption been tested to prove that it really is sound? I am sure it saves
effort, but I am just wondering at what expense. If indeed the speed
difference is marginal, then it's a smart choice. But it's still a question.
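
To make ideas 1 and 2 a bit more concrete, here is a rough sketch in plain
Python. It does not use the Xapian API, and every name in it (warm_up,
search_with_deadline, scored_docs, max_seconds, top_k, clock) is
hypothetical. The warm-up part simply reads every file in a database
directory once so that the OS cache holds it, which is cruder than loading
only the popular B-tree blocks. The time-out part assumes candidates arrive
one at a time as (score, doc_id) pairs and checks the clock between
candidates.

```python
import heapq
import os
import time


def warm_up(db_dir, chunk_size=1 << 20):
    """Read every file under db_dir once so the OS page cache can hold it.

    Returns the number of bytes read. This warms everything, not just the
    popular blocks; picking those would need query statistics.
    """
    total = 0
    for root, _dirs, files in os.walk(db_dir):
        for name in files:
            with open(os.path.join(root, name), "rb") as fh:
                while True:
                    chunk = fh.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
    return total


def search_with_deadline(scored_docs, max_seconds, top_k=10,
                         clock=time.monotonic):
    """Consume (score, doc_id) pairs until exhausted or out of time.

    Returns (results, complete): the best top_k pairs seen so far, highest
    score first, and a flag telling whether all candidates were examined.
    """
    deadline = clock() + max_seconds
    best = []  # min-heap of the top_k highest-scoring pairs seen so far
    complete = True
    for score, doc_id in scored_docs:
        # The clock is only checked between candidates, so one slow
        # candidate can still overrun the deadline.
        if clock() >= deadline:
            complete = False
            break
        if len(best) < top_k:
            heapq.heappush(best, (score, doc_id))
        elif (score, doc_id) > best[0]:
            heapq.heapreplace(best, (score, doc_id))
    return sorted(best, reverse=True), complete
```

A caller could run warm_up("/path/to/db") once at start-up and then pass a
per-search limit such as max_seconds=0.5, using the returned flag to tell
users that the results may be partial.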

Any feedback on those 3 ideas/issues will be greatly appreciated.

(Last note, maybe we will test Quartz in the future, just so we have 
something (data) to contribute to this 'argument'. If we do, we will 
share our test results here.)

