[Xapian-discuss] Adding a Web Spider

James Aylett james-xapian at tartarus.org
Fri Jul 2 14:05:28 BST 2004


On Fri, Jul 02, 2004 at 01:22:41PM +0100, Olly Betts wrote:

> > If you use Python, there's a robots.txt implementation in the
> > library. Although IIRC it's buggy :-(
> 
> All the standard robots.txt implementations I've seen implement the
> spec.  Sadly almost nobody who writes robots.txt files seems to read
> the spec...

ISTR something by Mark Pilgrim saying the Python one didn't behave
itself properly; he wrote his own for the Ultra Liberal Feed
Parser. I note that he fixes two bugs: one was fixed in Python 2.3a2;
the other, which he patches, appears to still be open in Python itself
(bug 690214 - although this doesn't appear to be a valid bug).
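
For reference, a minimal sketch of driving the standard library parser
directly, assuming a current Python 3 where the module lives at
urllib.robotparser (it was plain robotparser in Python 2). The rules,
the agent name "ExampleSpider" and the example.com URLs are made up for
illustration, and the text is parsed from memory rather than fetched:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/page.html"):
        print(url, allowed(ROBOTS_TXT, "ExampleSpider", url))
```

This only exercises the simple prefix-matching case; it says nothing
either way about the bugs mentioned above.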

It doesn't help that the robots.txt spec is fairly poorly written, and
exists only as an expired I-D. It also lacks some useful features,
which makes it tortuous or impossible to build certain types of robot
control policies :-(

Having said that, it's fairly simple even to deal with weird
robots.txt files. But the effort does all add up ... :-(
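
As a rough illustration of the kind of leniency involved, here is a
sketch of a tolerant line reader. What counts as "weird" here (a
leading byte-order mark, CRLF line endings, odd capitalisation, stray
spaces around the colon, trailing comments, junk lines) is my own
assumption, and tolerant_records is a hypothetical helper, not part of
any library:

```python
def tolerant_records(text):
    """Yield (field, value) pairs, skipping lines that can't be understood."""
    # Drop a leading byte-order mark; splitlines() copes with CRLF and LF.
    for raw in text.lstrip("\ufeff").splitlines():
        # Strip trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue  # blank, comment-only, or malformed line
        field, value = line.split(":", 1)
        # Field names are compared case-insensitively, so normalise them.
        yield field.strip().lower(), value.strip()


if __name__ == "__main__":
    sample = "\ufeffUSER-AGENT : *\r\nDisallow:/tmp # scratch\r\njunk line\r\n"
    for record in tolerant_records(sample):
        print(record)
```

Grouping the records into per-agent rule sets would still be needed on
top of this, which is where the effort starts to add up.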

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


