[Xapian-discuss] Adding a Web Spider

Olly Betts olly at survex.com
Fri Jul 2 13:22:41 BST 2004


On Fri, Jul 02, 2004 at 12:17:52PM +0100, James Aylett wrote:
> On Fri, Jul 02, 2004 at 12:38:33PM +0200, rm at fabula.de wrote:
> 
> > I happen to call myself a programmer and have written some
> > crawlers myself. No, writing a _good_ crawler is _far_ from
> > simple (you need a rather error-tolerant HTML/XHTML parser,
> > a good http lib, smart tracking of Etag headers and content
> > hash sums, and more and more a rather capable ECMA-script
> > interpreter (for those stupid javascript links ....).
> 
> I'd echo that; I haven't written a crawler for indexing, but I've
> written similar systems at work, and they tend to be fairly painful
> :-/

Several years ago I bumped into someone I was at university with and we
were catching up with what we'd been up to.  I told him I was writing
web crawlers these days.  He responded, "Oh how dull, that's just a
big hash table!"

This always makes me think of the Monty Python sketch in which John
Cleese teaches us how to play the flute:

    "You blow there and you move your fingers up and down here."

For the full sketch see: http://orangecow.org/pythonet/sketches/toridof.htm

I've written several web crawlers, and it's hard to do well.  The fact
that I had to write several is a clue: in the process of writing one,
seeing how it performs and what problems it runs into, you learn a lot
and can then rework it into something better.

The web is also a moving target.  New technologies emerge and crawlers
need to adapt to cope with them.

The first thing to decide is whether you want a "constrained" or
"unconstrained" crawler.  If you want to index one to a few thousand
sites which you've hand-picked, you need to worry a lot less about
search engine spammers, URL black holes, etc.  If a site is causing
problems, you can just drop it from the crawl.

If you want to build the next Google, you need something a lot more
robust.  You're also going to appear on the search engine optimisers'
radar (assuming you do well) and get deliberately targeted by search
engine spam, rather than just hitting attempts to game Google.

If you just have a few sites to crawl, you could use the htdig crawler
and the script here to suck the results into Xapian:

http://thread.gmane.org/gmane.comp.search.xapian.general/465

If you need a better constrained crawler, bolting together Perl or
Python modules will get you further, along the lines of the sketch
below.
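
To give a flavour of that approach, here's a very rough sketch in
Python.  The seed URLs, allowed hosts and limits are just illustrative,
the Xapian indexing step is only marked by a comment, and robots.txt
handling is left to the separate sketch further down; a real crawler
needs much more care than this:

    import time
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    SEEDS = ["http://example.com/"]     # hand-picked sites (illustrative)
    ALLOWED_HOSTS = {"example.com"}     # stay within the chosen sites
    MAX_PAGES = 100                     # crawl limit for this sketch
    DELAY = 1.0                         # politeness delay in seconds

    class LinkExtractor(HTMLParser):
        """Collect href attributes from <a> tags.  A real crawler needs
        a far more error-tolerant parser, as noted above."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl():
        queue, seen, fetched = deque(SEEDS), set(SEEDS), 0
        while queue and fetched < MAX_PAGES:
            url = queue.popleft()
            if urlparse(url).netloc not in ALLOWED_HOSTS:
                continue
            # robots.txt checking is omitted; see the robotparser
            # sketch further down.
            try:
                with urlopen(url, timeout=10) as response:
                    page = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue        # skip pages which fail to fetch
            fetched += 1
            # This is where you'd strip the page down to text and add a
            # document to a Xapian database; that step is left out here.
            extractor = LinkExtractor()
            extractor.feed(page)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if (urlparse(absolute).scheme in ("http", "https")
                        and absolute not in seen):
                    seen.add(absolute)
                    queue.append(absolute)
            time.sleep(DELAY)   # don't hammer the sites being crawled

    if __name__ == "__main__":
        crawl()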

But for an unconstrained crawler, or perhaps even a large enough
constrained crawler, you're going to need to write a lot of your own
code.  You'll probably even need to replace some of the standard
modules with your own implementations, better tuned to your needs.

> If you use Python, there's a robots.txt implementation in the
> library. Although IIRC it's buggy :-(

All the standard robots.txt implementations I've seen implement the
spec.  Sadly almost nobody who writes robots.txt files seems to read
the spec...
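
For reference, checking a URL against robots.txt with the standard
module looks something like this (this uses the urllib.robotparser
module name; the host and user agent are made up):

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()        # fetch and parse robots.txt for this host
    if rp.can_fetch("MyCrawler", "http://example.com/some/page.html"):
        print("allowed to fetch")
    else:
        print("disallowed by robots.txt")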

Cheers,
Olly


