[Xapian-discuss] scriptindex on an internet crawl

Thu Jun 23 13:35:29 BST 2005

On Thu, Jun 23, 2005 at 08:14:05AM +0200, Arjen van der Meijden wrote:
> >>On Wed, Jun 22, 2005 at 03:21:32PM -0400, Georges Dupret wrote:
> >>
> >>>In a first try, I inserted in the command file url : field=url 
> >>>boolean=XURL
> >>>unique=XURL and in the input file: url=www.dcc.uchile.cl/~gdupret for
> >>>example, but scriptindex starts using 100% of the CPU and never finishes.
[...]
> 
> Can't this be explained by just that scriptindex is very very slow?

In this particular case, I hope you mean...

> I can imagine that a unique-check for a relatively long identifier with a 
> relatively similar beginning can be very time consuming and/or result 
> in quite a bit more btree work. At least compared to more evenly 
> distributed identifiers.

The term length shouldn't make too much difference, but you could be
right that it's just being slow.  Checking a unique id does slow things
down (and there's scope for improvement there), and checking two for
each document could conceivably be worse than double the overhead of
checking one.

Georges: try adding "-v" to the scriptindex command line for verbose
output.  That will make it print a message each time it adds a document
so we'll see if it's actually making slow progress.
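
For reference, a rough sketch of such a run, reusing the index script
line and the sample record quoted above.  The file and database names
(crawl.script, crawl.dump, crawl.db) are made up for illustration, and
the argument order (database, then index script, then input file) is
assumed from scriptindex's usual usage message:

```shell
# Hypothetical file names; only "-v", the index script line and the
# sample record are taken from this thread.
cat > crawl.script <<'EOF'
url : field=url boolean=XURL unique=XURL
EOF

cat > crawl.dump <<'EOF'
url=www.dcc.uchile.cl/~gdupret
EOF

# Run only if scriptindex is installed.  With -v it should report each
# document as it is added, which shows whether it is progressing slowly
# or has stalled completely.
if command -v scriptindex >/dev/null 2>&1; then
    scriptindex -v crawl.db crawl.script crawl.dump
fi
```

If messages keep appearing, however slowly, then it is a speed problem
rather than a hang.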

Cheers,
    Olly