[Xapian-discuss] How to update DB concurrently?
oscaruser at programmer.net
Thu May 18 05:52:58 BST 2006
I set this up, but found that the 150 spiders were producing data at a much faster rate than the indexer could build the index. This is a serious performance bottleneck, since the setup doesn't scale. How can I raise the indexer's throughput to match the rate at which the spiders are processing URLs?
Thanks
> ----- Original Message -----
> From: "Olly Betts" <olly at survex.com>
> To: oscaruser at programmer.net
> Subject: Re: [Xapian-discuss] How to update DB concurrently?
> Date: Wed, 17 May 2006 20:17:16 +0100
>
>
> On Wed, May 17, 2006 at 10:08:06AM -0800, oscaruser at programmer.net wrote:
> > seems xapian-tcpsrv is what was needed, while increasing listen
> > backlog to 256 in tcpserver.cc
>
> The remote backend (which uses xapian-tcpsrv) only supports reading
> databases currently, though it looks like someone's going to commission
> me to implement a writable remote backend in the near future.
>
> But adding documents in batches is much more efficient - if you try
> to scale your current setup, you'll probably hit a limit to how fast
> you can add documents.
>
> I'd suggest that each spider should dump pages in a form suitable for
> feeding into scriptindex (in Perl you can just slurp the whole page
> into $html and then:
>
> $html =~ s/\n/ /g;
>
> then create a dump file entry like so:
>
> print DUMPFILE_TMP <<END;
> url=$url
> html=$html
>
> END
>
> You can include any other meta information you want - title,
> content-type, modification time, sitename, etc in other fields.
> A suitable index script would be something like:
>
> url : field=url hash boolean=Q unique=Q
> html : unhtml index truncate=250 field=sample
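If you dump extra fields such as title and sitename, the index script could be extended along these lines (a sketch only; the `S` and `H` prefixes follow Omega's usual conventions, but check the scriptindex documentation for the actions available in your version):

```
url : field=url hash boolean=Q unique=Q
title : unhtml index=S field=title
sitename : boolean=H field=sitename
html : unhtml index truncate=250 field=sample
```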
>
> And then when you've dumped 100 or 1000 or something you can switch
> to a new dump file and feed the old one into scriptindex. The way
> I'd do that is have a spool directory which dump files just get
> renamed into by the spiders, and an indexer process which does something
> like:
>
> chdir "spool" or die $!;
> while (1) {
>     my @files = glob "*.dump";
>     if (@files) {
>         # system returns 0 on success, so test the exit status explicitly
>         system("scriptindex", $database, $indexscript, @files) == 0
>             or die "scriptindex failed: $?";
>         unlink @files;
>     } else {
>         sleep 60;
>     }
> }
>
> This "spool directory" style of design is both simple and suitably
> robust.
>
> Cheers,
> Olly
>