[Xapian-discuss] omega crawler: ht://dig or wget?

Olly Betts olly at survex.com
Fri Mar 17 05:20:11 GMT 2006


On Fri, Mar 17, 2006 at 12:01:45AM -0500, Peter Masiar wrote:
> At wiki page: http://wiki.xapian.org/Omega
> I added a comment that ht://Dig looks like dead.

3.x seems pretty lifeless, but they're working on 4.0:

http://htdig.blogspot.com/

How long it will take until it's usable isn't clear yet - major rewrites
have a habit of dragging out.

But 3.x does the job - it crawls web pages, and you can dump what it
fetched into a file and build a Xapian index from that dump.  For that
purpose it doesn't really matter whether htdig itself is being actively
developed (apart from a possible lack of support for new web technologies).

> From brief glance at docs I had a feeling it is not easy to configure.

It does have rather a lot of levers and knobs, but you don't have to
touch most of them.

> Maybe better crawler is GNU wget? Mature, stable, maintained?

Really wget is a web page fetcher with a recursion feature rather than a
fully featured crawler.  It can mirror part of a website locally, which
you can then index with omindex.  Sometimes that may be enough, but
htdig can do rather more.
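
As a rough sketch of the wget route (not a tested recipe), the two steps
could be scripted like this.  The URL, the "mirror" directory and the "db"
database path are made-up placeholders, and it assumes wget's default
layout of one subdirectory per host name under the download prefix, with
the site on the default port:

```python
import os
import subprocess
from urllib.parse import urlparse


def wget_mirror_command(start_url, mirror_dir):
    """Build a wget command that mirrors the site below start_url."""
    return [
        "wget",
        "--mirror",      # recursive fetch with timestamping
        "--no-parent",   # don't climb above the starting directory
        "--directory-prefix=" + mirror_dir,
        start_url,
    ]


def omindex_command(start_url, mirror_dir, db_path):
    """Build an omindex command that indexes the local mirror.

    Assumption: wget stored the pages under mirror_dir/<hostname>/.
    """
    parts = urlparse(start_url)
    base_url = "%s://%s/" % (parts.scheme, parts.hostname)
    site_dir = os.path.join(mirror_dir, parts.hostname)
    return ["omindex", "--db", db_path, "--url", base_url, site_dir]


def mirror_and_index(start_url, mirror_dir, db_path):
    """Run both steps in order; raises CalledProcessError if one fails."""
    subprocess.run(wget_mirror_command(start_url, mirror_dir), check=True)
    subprocess.run(omindex_command(start_url, mirror_dir, db_path), check=True)


# Hypothetical usage:
#   mirror_and_index("http://www.example.org/docs/", "mirror", "db")
```

Passing the site's root as --url means the URLs stored in the index should
point back at the live site rather than at the local copy.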

Incidentally, you say htdig is hard to configure, but did you actually
look at how many options wget has?

Cheers,
    Olly


