[Xapian-discuss] Adding a Web Spider

Lee Johnson atsina6 at yahoo.com
Fri Jul 2 09:45:29 BST 2004

I read the "future of Xapian" thread today. One item in
that thread is especially interesting to me: adding a web
spider. We all know that Xapian is not designed exclusively
for that purpose, but a web spider could greatly increase
the usage of Xapian. I'm not a programmer, but writing a web
spider is rather simple compared with writing Xapian itself.
In turn, Xapian could gain lots of users, who would become
familiar with it, use it in other areas, tell others about
it, and so on.

I'm saying this because I also need a crawler for
Xapian. I have a hand-picked, rather big list of URLs
(just the URLs, not the contents) and need a crawler to
crawl all pages beneath those URLs and put their
content into a db, so I can use Xapian to index and
search that db. I'm very open to suggestions. I looked
at Nutch, Heritrix and Larbin (this one probably just
fetches the URLs, not the contents; I asked the
developer about this but have had no answer yet), but
with those I cannot use Xapian (if I use one of them
then I will probably use mnoGoSearch). Another thing
with Nutch and Heritrix is that they are written in
Java, which, IMHO, is not a good idea.
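
To make the request concrete, here is a rough sketch in Python of the
kind of crawler described above. It is only an illustration under
stated assumptions, not how any of the tools mentioned work: "beneath
a URL" is taken to mean "the link starts with the seed URL", the db is
a single SQLite table, and `fetch` is a hypothetical callable supplied
by the caller that returns a page's HTML (or None on failure). The
names `crawl`, `LinkParser` and the `pages` table are made up for this
example.

```python
import sqlite3
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, conn, max_pages=1000):
    """Breadth-first crawl of the pages beneath each seed URL.

    fetch is a hypothetical callable (url -> HTML string, or None).
    Returns the number of pages stored in the 'pages' table.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)"
    )
    seen = set()
    stored = 0
    for seed in seeds:
        queue = deque([seed])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            html = fetch(url)
            if html is None:
                continue
            conn.execute(
                "INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, html)
            )
            stored += 1
            parser = LinkParser()
            parser.feed(html)
            for href in parser.links:
                # Resolve relative links and drop any #fragment.
                link = urldefrag(urljoin(url, href))[0]
                # Assumption: "beneath" means the link starts with the seed.
                if link.startswith(seed) and link not in seen:
                    queue.append(link)
    conn.commit()
    return stored
```

A real crawler would also need to honour robots.txt, pause between
requests and cope with non-HTML content; all of that is left out here.
The rows in the `pages` table would then be what gets handed to Xapian
for indexing.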

Also, for those interested, a good read may be
which devoted that month's issue to the topic of search.


