[Xapian-discuss] Adding a Web Spider
James Aylett
james-xapian at tartarus.org
Fri Jul 2 14:05:28 BST 2004
On Fri, Jul 02, 2004 at 01:22:41PM +0100, Olly Betts wrote:
> > If you use Python, there's a robots.txt implementation in the
> > library. Although IIRC it's buggy :-(
>
> All the standard robots.txt implementations I've seen implement the
> spec. Sadly almost nobody who writes robots.txt files seems to read
> the spec...
ISTR something by Mark Pilgrim saying the Python one didn't behave
itself properly; he wrote his own for the Ultra Liberal Feed
Parser. I note that he's fixing two bugs: one was fixed in Python
2.3a2, and the other, which he patches, appears to still be open in
Python itself (bug 690214, although this doesn't appear to be a valid bug).
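
For reference, a minimal sketch of checking URLs against the standard
library parser. It assumes Python 3, where the module lives at
urllib.robotparser (in Python 2 it was the top-level robotparser
module). The robots.txt content, the user-agent names and the
example.com URLs are all made up for illustration, and this says
nothing about the bugs mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from a string rather than a URL."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "somespider" and "examplebot" are hypothetical user-agent names.
allowed_public = parser.can_fetch("somespider", "http://example.com/index.html")
allowed_private = parser.can_fetch("somespider", "http://example.com/private/x.html")
allowed_examplebot = parser.can_fetch("examplebot", "http://example.com/index.html")

print(allowed_public, allowed_private, allowed_examplebot)
```

A real spider would more likely call set_url() and read() to fetch the
file from the site, then keep one parser per host.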
It doesn't help that the robots.txt spec is fairly poorly-written, and
exists only as an expired I-D. It also lacks some useful features that
make it tortuous or impossible to build certain types of robot control
policies :-(
Having said that, it's fairly simple even to deal with weird
robots.txt files. But the effort does all add up ... :-(
J
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james at tartarus.org uncertaintydivision.org