[Xapian-discuss] omega crawler: ht://dig or wget?
Peter Masiar
peter.masiar at yale.edu
Fri Mar 17 05:55:47 GMT 2006
Quoting Olly Betts <olly at survex.com>:
> On Fri, Mar 17, 2006 at 12:01:45AM -0500, Peter Masiar wrote:
> > On the wiki page: http://wiki.xapian.org/Omega
> > I added a comment that ht://Dig looks dead.
>
> 3.x seems pretty lifeless, but they're working on 4.0:
>
> http://htdig.blogspot.com/
The blog looks like a dead link to me, too :-(
But you are right, htdig is a Debian package in stable.
> How long it will take until it's usable isn't clear yet - major rewrites
> have a habit of dragging out.
>
> But 3.x does the job - it crawls web pages and you can dump them into
> a file and build a Xapian index with it. For that purpose it's not
> really important if htdig itself is being actively developed (apart
> from possible lack of support for new web technologies).
That's exactly what I wanted to know: whether anybody here uses it for something.
> > From a brief glance at the docs I had a feeling it is not easy to configure.
>
> It does have rather a lot of levers and knobs, but you don't have to
> touch most of them.
>
> > Maybe GNU wget is a better crawler? Mature, stable, maintained?
>
> Really wget is a web page fetcher with a recursion feature rather than a
> fully featured crawler. It can mirror part of a website locally, which
> you can then index with omindex. Sometimes that may be enough, but
> htdig can do rather more.
How much more? I am not sure I can tell the difference.
I need a program that I can feed a URL and that will fetch all pages
linked from that URL on the same site. Is that a crawler? A page fetcher?
Maybe I am not asking the right questions. What other features should I
look at when selecting a crawler?
It would be nice if the fetcher could log in to some sites with a password or cookie.
Currently I do not need any other features. Can wget or htdig do it?
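To make the requirement concrete (feed in a URL, get back every page linked from it on the same site), here is a rough sketch in Python. It is only an illustration of the idea, not how wget or htdig work internally. The names LinkCollector, same_site_links and crawl are made up for this sketch, and the fetch argument is assumed to be any function that takes a URL and returns the page's HTML as text.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(base_url, html):
    """Return absolute links in html that stay on base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


def crawl(start_url, fetch, limit=100):
    """Breadth-first walk over same-site pages, visiting at most limit URLs.

    fetch(url) is an assumed callable that returns the HTML text of url.
    """
    seen = [start_url]
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        for link in same_site_links(url, fetch(url)):
            if link not in seen and len(seen) < limit:
                seen.append(link)
                queue.append(link)
    return seen
```

In real use, fetch could be a small wrapper around urllib.request.urlopen. The sketch does not handle logins, cookies, robots.txt or non-HTML content, and those are the kind of features that separate a full crawler from a page fetcher.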
> Incidentally, you say htdig is hard to configure but did you actually
> look at how many options wget has?
Yes, and I ran away scared :-)
--
Peter Masiar