[Xapian-discuss] How to update DB concurrently?

Olly Betts olly at survex.com
Wed May 17 20:17:16 BST 2006


On Wed, May 17, 2006 at 10:08:06AM -0800, oscaruser at programmer.net wrote:
> seems xapian-tcpsrv is what was needed, while increasing listen
> backlog to 256 in tcpserver.cc

The remote backend (which uses xapian-tcpsrv) only supports reading
databases currently, though it looks like someone's going to commission
me to implement a writable remote backend in the near future.

But adding documents in batches is much more efficient - if you try
to scale your current setup, you'll probably hit a limit to how fast
you can add documents.

I'd suggest that each spider should dump pages in a form suitable for
feeding to scriptindex.  In Perl you can just suck the whole page
into $html and then do:

$html =~ s/\n/ /g;

then create a dump file entry like so:

print DUMPFILE_TMP <<END;
url=$url
html=$html

END

You can include any other meta information you want - title,
content-type, modification time, sitename, etc. - in other fields.
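
For example, a record carrying a few extra fields might look like
this (the field names and values are just illustrative, and each
extra field wants a matching rule in the index script):

url=http://www.example.com/page.html
title=Example page
type=text/html
modtime=1147872000
site=www.example.com
html=<html><head><title>Example page</title></head><body>...</body></html>
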
A suitable index script would be something like:

url : field=url hash boolean=Q unique=Q
html : unhtml index truncate=250 field=sample
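
If you do include extra fields, the index script just grows matching
lines.  Here's a rough sketch (the prefixes are my own choice,
following the usual Omega conventions - S for title, T for
content-type, H for hostname - so check the scriptindex documentation
for the actions available in your version):

url : field=url hash boolean=Q unique=Q
title : unhtml index=S field=title
type : boolean=T lower field=type
site : boolean=H lower field=site
modtime : field=modtime
html : unhtml index truncate=250 field=sample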

And then when you've dumped 100 or 1000 or so pages, you can switch
to a new dump file and feed the old one into scriptindex.  The way
I'd do that is to have a spool directory into which the spiders just
rename their completed dump files, and an indexer process which does
something like:

chdir "spool" or die $!;
while (1) {
    my @files = glob "*.dump";
    if (@files) {
	system "scriptindex", $database, $indexscript, @files or die $!;
	unlink @files;
    } else {
	sleep 60;
    }
}
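
On the spider side, the "rename into the spool" step might look
something like this.  It's only a sketch - write_batch, the temp
file handling and the final file naming are all just illustrative,
and it assumes the temp file lives on the same filesystem as the
spool directory so that rename() is atomic:

use File::Temp qw(tempfile);

# Sketch of a spider-side helper (not part of scriptindex): write a
# batch of records as a dump file, then rename it into the spool.
sub write_batch {
    my ($spooldir, @records) = @_;
    # Build the dump file under a name the indexer's "*.dump" glob
    # won't match yet.
    my ($fh, $tmpname) = tempfile("batch-XXXXXX", DIR => $spooldir,
                                  SUFFIX => ".tmp");
    foreach my $rec (@records) {
        # Each record is a hashref of field => value; write url first
        # to match the record layout shown earlier.
        foreach my $field ("url", grep { $_ ne "url" } sort keys %$rec) {
            my $value = $rec->{$field};
            next unless defined $value;
            $value =~ s/\n/ /g;    # the dump format is line-based
            print $fh "$field=$value\n";
        }
        print $fh "\n";    # blank line separates records
    }
    close $fh or die $!;
    # rename() is atomic, so the indexer only ever sees complete files.
    rename $tmpname, "$spooldir/" . time() . "-$$.dump" or die $!;
}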

This "spool directory" style of design is both simple and suitably
robust.

Cheers,
    Olly


