[Xapian-discuss] How to update DB concurrently?
oscaruser at programmer.net
Thu May 18 05:52:58 BST 2006
I set this up, but found that the 150 spiders were producing data at a much faster rate than the indexer could build the index. This is a serious performance bottleneck, since the setup doesn't scale. How can I raise the indexer's throughput to match the rate at which the spiders are processing URLs?
Thanks
> ----- Original Message -----
> From: "Olly Betts" <olly at survex.com>
> To: oscaruser at programmer.net
> Subject: Re: [Xapian-discuss] How to update DB concurrently?
> Date: Wed, 17 May 2006 20:17:16 +0100
>
>
> On Wed, May 17, 2006 at 10:08:06AM -0800, oscaruser at programmer.net wrote:
> > seems xapian-tcpsrv is what was needed, while increasing listen
> > backlog to 256 in tcpserver.cc
>
> The remote backend (which uses xapian-tcpsrv) only supports reading
> databases currently, though it looks like someone's going to commission
> me to implement a writable remote backend in the near future.
>
> But adding documents in batches is much more efficient - if you try
> to scale your current setup, you'll probably hit a limit to how fast
> you can add documents.
>
> I'd suggest that each spider should dump pages in a form suitable for
> feeding into scriptindex (in Perl you can just slurp the whole page
> into $html and then:
>
> $html =~ s/\n/ /g;
>
> then create a dump file entry like so:
>
> print DUMPFILE_TMP <<END;
> url=$url
> html=$html
>
> END
>
> You can include any other meta information you want - title,
> content-type, modification time, sitename, etc in other fields.
> A suitable index script would be something like:
>
> url : field=url hash boolean=Q unique=Q
> html : unhtml index truncate=250 field=sample
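If you dump extra fields such as title and sitename, the index script could be extended along these lines (a sketch only; the `S` and `H` prefixes follow Omega's usual conventions, but check the scriptindex documentation for the actions available in your version):

```
url : field=url hash boolean=Q unique=Q
title : unhtml index=S field=title
sitename : boolean=H field=sitename
html : unhtml index truncate=250 field=sample
```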
>
> And then when you've dumped 100 or 1000 or something you can switch
> to a new dump file and feed the old one into scriptindex. The way
> I'd do that is have a spool directory which dump files just get
> renamed into by the spiders, and an indexer process which does something
> like:
>
> chdir "spool" or die $!;
> while (1) {
>     my @files = glob "*.dump";
>     if (@files) {
>         # system returns 0 on success, so test the exit status explicitly
>         system("scriptindex", $database, $indexscript, @files) == 0
>             or die "scriptindex failed: $?";
>         unlink @files;
>     } else {
>         sleep 60;
>     }
> }
>
> This "spool directory" style of design is both simple and suitably
> robust.
>
> Cheers,
> Olly
>