[Xapian-discuss] Incremental indexing limitations

Olly Betts olly at survex.com
Sun Oct 14 04:45:07 BST 2007


On Sat, Oct 13, 2007 at 06:54:22AM -0700, Kevin Duraj wrote:
> If you read my previous posts I brought this issue several months ago
> about slow indexing using flint database. The indexing used to run so
> fast when using quartz database that I was indexing 20 million records
> in 1 hour or less. However developers did not like the large size of
> database and without testing large amount of data we moved to flint
> database that from my experience is the worst that happen to Xapian
> recently.

Kevin, please stop spreading this misinformation.

Both the developers and several users have tested flint with large
amounts of data, and have found it to perform at least as well as
quartz, and generally much better.

To my knowledge, there's one exception: one person who has reported
that flint doesn't perform as well for them as quartz does.  And that
person is you!  Clearly something is different about what you are
doing, so the way forward is to identify what that is, and work out
how we can make flint perform better than quartz for everyone.

The problem is that each time you report a problem, you only complain
in very general terms.  The lack of detail, coupled with the fact that
we simply don't see this ourselves and don't have access to the data you
are indexing (nor indeed the machine you are indexing on), means we
simply aren't able to reproduce what you report.

We'd love to help you resolve your issues, but you need to help us to
do this.  But if we ask you for more information, or suggest things you
might try in order to identify what is happening, you either don't
respond, or your reply doesn't contain any useful extra information.

As an example, here I suggest you could try oprofile to find where
time is spent, and also suggest you could try tuning the compression
threshold to see what the optimum value is for your situation:

http://thread.gmane.org/gmane.comp.search.xapian.general/4602/focus=4606

However, your reply doesn't address either of these points:

http://thread.gmane.org/gmane.comp.search.xapian.general/4602/focus=4642

And this isn't the only case like this, just the first I found in
the archives.

I actually thought you'd solved your problem and it turned out to be
that you weren't exporting XAPIAN_FLUSH_THRESHOLD correctly, but that
was only a guess because you didn't reply when I explicitly asked about
it.
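
One way to rule this out is to check the variable from inside a process
started the same way as the indexer.  The following is only a minimal
sketch in Python: the helper name read_flush_threshold is made up for
illustration and is not part of Xapian, and it only tests whether the
variable is visible and parses as a positive integer, not how Xapian
itself interprets it.

```python
import os


def read_flush_threshold(environ=os.environ):
    """Return XAPIAN_FLUSH_THRESHOLD as an int, or None if unset or invalid.

    Hypothetical helper for illustration only; it is not a Xapian API.
    """
    raw = environ.get("XAPIAN_FLUSH_THRESHOLD")
    if raw is None:
        # Not exported: a plain shell assignment without `export`
        # is not inherited by child processes.
        return None
    try:
        value = int(raw)
    except ValueError:
        return None
    return value if value > 0 else None


if __name__ == "__main__":
    threshold = read_flush_threshold()
    if threshold is None:
        print("XAPIAN_FLUSH_THRESHOLD is not visible to this process")
    else:
        print("XAPIAN_FLUSH_THRESHOLD =", threshold)
```

If this reports that the variable is not visible when run from the same
shell or script that launches the indexer, the setting is probably not
reaching the indexer either.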

> We must make a priority what to do here.
> - fast indexing
> - fast searches
> - anything else has low priority

These are your priorities.  Other people have different priorities.

For example, at the start of this thread Ron was surprised by the size of
the index, so it would seem that he's worried about index size to at
least some extent.  Simply stating boldly that your priorities must be
ours won't change what other people's priorities are.

Overall, I think fast indexing and searching are important, but so are
index size, quality of results, useful features, and ease of use (both
to end users, and of the API).  This list probably isn't exhaustive.

We need to consider everyone's needs, or else you'd be the only Xapian
user.

> I am placing highest priority on indexing than searching size of index
> is a low priority tomorrow you will be able to by 1 Terabyte hard
> drive for good lunch in Ritz Carlton. Why to focus on small index when
> we want to have fastest indexing. However developers tend to complain
> about the size of index but never complaint that Xapian was indexing
> approx 10 times faster than Lucene, MySQL or MSSQL. We got ridd of the
> fastest indexing I have ever seen in my life. Why? Developers
> complaint about the size of Quartz database that we use to have as
> default.

Also factor in the cost of backup for that 1TB, and the extra RAM you'll
need to cache enough of it for searches to run fast.  Also, a disk that
size is likely to run hotter and draw more power (as will the extra
RAM), so you may need a beefier power supply and better fans.  The
extra electricity will tend to increase your hosting costs too.  And
being cutting edge technology, a 1TB drive will probably be more liable
to failure.

And feel free to buy me lunch at the Ritz Carlton sometime!

> You cannot make happy everyone therefore we must put priority. What is
> the highest priority?

Speed of indexing is important, as is speed of searching.  To most
people, the size of the index is an issue too (but I know it isn't to
you).  But as I've tried to explain before, reducing the size of the
index on disk means reduced I/O and reduced VM pressure, so a smaller
index will often be faster.  Hurrah!  Sometimes we can have the best of
all worlds.

In both my tests and those of other people, flint indexes at least as
fast as quartz, runs searches faster than quartz, and produces databases
significantly smaller than quartz.

And yes, we know it doesn't seem to for you.  But if you want to have a
hope of addressing this situation, you need to actually help me
understand *why* this is different for you.  Index exactly the same data
with a recent version of Xapian, once with quartz and once with flint,
so we have some scientifically valid numbers to go on.  Run it again
with profiling so I can see where the extra time is spent.
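
A comparison along these lines could be driven by something like the
sketch below.  It is written in Python purely for illustration, and the
names time_indexing and compare_backends are hypothetical.  Each entry
in backends is assumed to be a callable you supply that indexes the
given records into a fresh database of one backend; no Xapian calls are
shown here.

```python
import time


def time_indexing(index_fn, records, repeats=3):
    """Run index_fn(records) `repeats` times; return wall-clock seconds per run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        index_fn(records)
        timings.append(time.perf_counter() - start)
    return timings


def compare_backends(backends, records, repeats=3):
    """Time each backend on the same records.

    `backends` maps a label (e.g. "quartz", "flint") to a user-supplied
    callable.  Returns a dict mapping each label to its best time.
    """
    return {
        name: min(time_indexing(index_fn, records, repeats))
        for name, index_fn in backends.items()
    }
```

The important points are that both runs see exactly the same records,
and that each callable starts from an empty database so earlier runs
don't skew later ones.  Reporting the raw per-run timings as well as
the best one makes it easier to spot caching effects.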

> I have installed new Xapian 1.0.3 and my search engine has been
> broken, tried to install back 1.0.2 search engine still broken. When I
> watch index each user is trying to modify iamflit file. Why? Because
> someone have complaint that cannot search while indexing or something
> similar. Now the whole release is broken.

This is a bug in 1.0.3.  I'm sorry that this bug slipped through, and
I've produced a patch to fix it which someone else has confirmed fixes
identical symptoms, and which I pointed you to several days ago.

Xapian has supported searching during indexing for many years, and it
has nothing to do with this bug.  The change which caused the bug was
actually adding support for user metadata.

Cheers,
    Olly


