[Xapian-discuss] Adding a Web Spider

James Aylett james-xapian at tartarus.org
Fri Jul 2 12:17:52 BST 2004


On Fri, Jul 02, 2004 at 12:38:33PM +0200, rm at fabula.de wrote:

> I happen to call myself a programmer and have written some
> crawlers myself. No, writing a _good_ crawler is _far_ from
> simple (you need a rather error-tolerant HTML/XHTML parser,
> a good HTTP lib, smart tracking of ETag headers and content
> hash sums, and increasingly a rather capable ECMAScript
> interpreter for those stupid JavaScript links...).

I'd echo that; I haven't written a crawler for indexing, but I've
written similar systems at work, and they tend to be fairly painful
:-/
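
For what it's worth, the ETag / content-hash tracking the parent
mentions is one of the more tractable bits. A minimal sketch in
Python, standard library only (all names here are illustrative, not
from any existing spider):

    import hashlib
    import urllib.error
    import urllib.request

    # Previous-crawl state per URL: {url: (etag, sha256_of_body)}
    seen = {}

    def fetch_if_changed(url):
        """Return the page body, or None if it hasn't changed."""
        etag, old_hash = seen.get(url, (None, None))
        req = urllib.request.Request(url)
        if etag:
            # Conditional GET: the server answers 304 if the ETag
            # still matches, so we needn't download the body at all.
            req.add_header("If-None-Match", etag)
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as e:
            if e.code == 304:
                return None         # unchanged; skip re-indexing
            raise
        body = resp.read()
        # Hash the body as a backstop for servers with unreliable ETags.
        digest = hashlib.sha256(body).hexdigest()
        if digest == old_hash:
            return None
        seen[url] = (resp.headers.get("ETag"), digest)
        return body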
 
However, if we were to come up with some sort of modular design of
spider-crawler / indexer pair, and implement it well, it might indeed
help. But I do wonder how many people actually need something like
that? Surely most potential users of an IR system will be working with
local data? (Larger institutions need spiders, so I can see the appeal
for consultancy companies, and I'll certainly support and offer
suggestions if anyone is going to write one. Just don't think it's
going to be easy :-)
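
To make "modular" slightly less hand-wavy: the split I have in mind
(purely a sketch, every name invented) is for the spider to be nothing
but a producer of (url, text) pairs, so the indexer half neither knows
nor cares whether the documents came over HTTP or off the local disk:

    import os

    def walk_files(root):
        """A 'spider' for the local-data case: yield (path, text) pairs."""
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                with open(path, errors="replace") as f:
                    yield path, f.read()

    def index_pages(pages, indexer):
        """The indexer half: consumes any iterable of (url, text) pairs."""
        for url, text in pages:
            indexer.add(url, text)

An HTTP spider would then be interchangeable with walk_files(), which
also covers the local-data case above.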

> Use Perl with the LWP lib to fetch the documents,
> parse them with the Perl libxml2 parser (that has a pretty
> good HTML mode), use libxml2's Reader API to fetch all
> URLs and push them onto a stack of jobs. Use Xapian's
> Perl bindings to do the actual indexing. Nothing too
> hard. But: if the resources you grab aren't on your servers
> you might want to honor robots.txt and add delays to the
> job queue, check for dynamic content, etc.
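
That recipe maps almost line-for-line onto Python, too. A rough sketch
using the standard library plus Xapian's Python bindings (the URL and
database path are placeholders; error handling, politeness delays and
staying on one host are all omitted):

    import urllib.request
    import xapian
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class PageParser(HTMLParser):
        """Collect link targets and visible text from one page."""
        def __init__(self):
            super().__init__()
            self.links, self.text = [], []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)
        def handle_data(self, data):
            self.text.append(data)

    db = xapian.WritableDatabase("spider.db", xapian.DB_CREATE_OR_OPEN)
    jobs, done = ["http://www.example.org/"], set()

    while jobs:
        url = jobs.pop()            # the stack of jobs from the recipe
        if url in done:
            continue
        done.add(url)
        page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        p = PageParser()
        p.feed(page)
        doc = xapian.Document()
        doc.set_data(url)           # store the URL so searches can report it
        tg = xapian.TermGenerator()
        tg.set_document(doc)
        tg.index_text(" ".join(p.text))
        db.add_document(doc)
        for href in p.links:
            jobs.append(urljoin(url, href))   # resolve relative links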

On the robots.txt point: Python's standard library has an
implementation (the robotparser module), although IIRC it's buggy :-(
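
The interface is small, at any rate; in current Python 3 it lives in
urllib.robotparser (the host here is a placeholder):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.example.org/robots.txt")
    rp.read()
    # Only fetch if the site's robots.txt permits our user-agent.
    if rp.can_fetch("MySpider/0.1", "http://www.example.org/some/page"):
        ...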

J 

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


