[Xapian-discuss] omega crawler: ht://dig or wget?

Peter Masiar peter.masiar at yale.edu
Fri Mar 17 05:55:47 GMT 2006


Quoting Olly Betts <olly at survex.com>:

> On Fri, Mar 17, 2006 at 12:01:45AM -0500, Peter Masiar wrote:
> > At wiki page: http://wiki.xapian.org/Omega
> > I added a comment that ht://Dig looks like dead.
>
> 3.x seems pretty lifeless, but they're working on 4.0:
>
> http://htdig.blogspot.com/

The blog looks like a dead link to me, too :-(

But you are right, htdig is a Debian package in stable.

> How long it will take until it's usable isn't clear yet - major rewrites
> have a habit of dragging out.
>
> But 3.x does the job - it crawls web pages and you can dump them into
> a file and build a Xapian index with it.  For that purpose it's not
> really important if htdig itself is being actively developed (apart
> from possible lack of support for new web technologies).

That's exactly what I wanted to know - whether anybody here uses it for something.

> > From brief glance at docs I had a feeling it is not easy to configure.
>
> It does have rather a lot of levers and knobs, but you don't have to
> touch most of them.
>
> > Maybe better crawler is GNU wget? Mature, stable, maintained?
>
> Really wget is a web page fetcher with a recursion feature rather than a
> fully featured crawler.  It can mirror part of a website locally, which
> you can then index with omindex.  Sometimes that may be enough, but
> htdig can do rather more.

How much more? I am not sure if I can tell the difference.

I need a program that I can feed a URL and that will get me all pages
linked from that URL on the same site. Is that a crawler? A page fetcher?
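
To make the requirement above concrete, here is a minimal sketch in Python
(standard library only) of the core of "get me all pages linked from a URL on
that site": extract the links from a page, keep the ones on the same host, and
follow them breadth-first. The names `LinkCollector`, `same_site_links`,
`crawl` and the `fetch` callable are hypothetical, chosen for illustration;
this is not how wget or htdig are implemented, and it ignores robots.txt,
redirects, and politeness delays that a real crawler would need.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url, html):
    """Return absolute, fragment-free http(s) links on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if (parts.scheme in ("http", "https")
                and parts.netloc == host
                and absolute not in links):
            links.append(absolute)
    return links


def crawl(start_url, fetch, limit=100):
    """Breadth-first walk of one site.

    `fetch` is an assumed callable that takes a URL and returns the page
    HTML as a string; `limit` caps the number of pages visited.
    """
    seen = []
    queue = [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.append(url)
        for link in same_site_links(url, fetch(url)):
            if link not in seen and link not in queue:
                queue.append(link)
    return seen
```

Passing `fetch` in as a parameter keeps the sketch testable without network
access; in practice it could wrap `urllib.request.urlopen` or any other HTTP
client.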

Maybe I am not asking the right questions. What other features do I need
to look into when selecting a crawler?

It would be nice if the fetcher could log in to some sites with a password or cookie.
Currently I do not see any more features I need. Can wget or htdig do that?
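
For the wget side of that question, the following Python sketch builds a
recursive wget command line with optional HTTP credentials or a cookie file.
It assumes a wget build that supports `--user`, `--password` and
`--load-cookies` (check `wget --help` on your system); the function name
`wget_fetch_command` and the example URL, directory, and file names are
hypothetical. It says nothing about htdig's equivalent settings.

```python
def wget_fetch_command(start_url, dest_dir, user=None, password=None,
                       cookie_file=None):
    """Build an argument list for a recursive, same-directory wget fetch."""
    cmd = [
        "wget",
        "--recursive",                   # follow links found in fetched pages
        "--no-parent",                   # never ascend above the start directory
        "--directory-prefix", dest_dir,  # store the local copy under dest_dir
    ]
    if user is not None and password is not None:
        # HTTP authentication; note the password is visible in the process list.
        cmd += ["--user", user, "--password", password]
    if cookie_file is not None:
        # Reuse session cookies exported in Netscape cookies.txt format.
        cmd += ["--load-cookies", cookie_file]
    cmd.append(start_url)
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, and the
directory it fills can then be indexed with omindex as described earlier in
the thread.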

> Incidentally, you say htdig is hard to configure but did you actually
> look at how many options wget has?

Yes, and I ran away scared :-)

-- 
Peter Masiar


