[Xapian-discuss] How to update DB concurrently?
Olly Betts
olly at survex.com
Wed May 17 20:17:16 BST 2006
On Wed, May 17, 2006 at 10:08:06AM -0800, oscaruser at programmer.net wrote:
> seems xapian-tcpsrv is what was needed, while increasing listen
> backlog to 256 in tcpserver.cc
The remote backend (which uses xapian-tcpsrv) only supports reading
databases currently, though it looks like someone's going to commission
me to implement a writable remote backend in the near future.
But adding documents in batches is much more efficient - if you try
to scale your current setup, you'll probably hit a limit on how fast
you can add documents.
I'd suggest that each spider should dump pages in a form suitable for
feeding into scriptindex. In Perl you can just suck the whole page
into $html and then:
$html =~ s/\n/ /g;
then create a dump file entry like so:
print DUMPFILE_TMP <<END;
url=$url
html=$html
END
You can include any other meta information you want - title,
content-type, modification time, sitename, etc in other fields.
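For contrast, here's what the spider side might look like in Python - a
minimal sketch, where the "title" field and the file name are illustrative
additions, not anything scriptindex requires:

```python
# Sketch of a spider writing one dump-file record. scriptindex reads
# "field=value" lines, with a blank line separating records, so any
# newlines inside a value have to be flattened first.

def write_entry(f, url, title, html):
    html = html.replace("\n", " ")    # same job as the Perl s/\n/ /g
    title = title.replace("\n", " ")
    f.write(f"url={url}\n")
    f.write(f"title={title}\n")
    f.write(f"html={html}\n")
    f.write("\n")  # blank line ends the record

with open("spider-1234.dump.tmp", "w") as f:
    write_entry(f, "http://example.org/", "Example", "<p>Hello\nworld</p>")
```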
A suitable index script would be something like:
url : field=url hash boolean=Q unique=Q
html : unhtml index truncate=250 field=sample
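If you do add extra fields to the dump, the index script grows one line
per field; something like the following sketch (the title, sitename and
modtime lines here are just illustrations of common scriptindex actions,
with H as a conventional hostname prefix):

```
url : field=url hash boolean=Q unique=Q
title : unhtml index field=title
sitename : boolean=H field=sitename
modtime : field=modtime
html : unhtml index truncate=250 field=sample
```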
And then when you've dumped 100 or 1000 or something you can switch
to a new dump file and feed the old one into scriptindex. The way
I'd do that is have a spool directory which dump files just get
renamed into by the spiders, and an indexer process which does something
like:
chdir "spool" or die $!;
while (1) {
    my @files = glob "*.dump";
    if (@files) {
        # system() returns 0 on success, so check the exit status
        # rather than the truth value of the return.
        system("scriptindex", $database, $indexscript, @files) == 0
            or die "scriptindex failed: $?";
        unlink @files;
    } else {
        sleep 60;
    }
}
This "spool directory" style of design is both simple and suitably
robust.
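The detail that makes it robust is the rename: a spider writes its dump
somewhere on the same filesystem and then renames it into spool/, and
since rename() is atomic the indexer's glob can never pick up a
half-written file. A sketch of that delivery step in Python (the paths
and naming scheme are illustrative):

```python
import os
import tempfile

def deliver_dump(spool_dir, data):
    # Write to a temporary name inside spool_dir (same filesystem),
    # then rename into place. The indexer only globs "*.dump", so the
    # ".tmp" file is invisible to it until the atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(data)
    final_path = tmp_path[: -len(".tmp")] + ".dump"
    os.rename(tmp_path, final_path)
    return final_path
```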
Cheers,
Olly