[Xapian-discuss] newbie help, group site crawl, index and search
Samuel Liddicott
sam at liddicott.com
Thu Feb 3 10:15:36 GMT 2005
Haniff Din wrote:
>Hello xapian-discuss,
>
> Forgive the newbie question.
>
> I want to crawl a subset of business sites on a regular basis, in a specific
> niche market (about 100-200 sites).
> In effect, this is for a parent charity organisation's website, so that you
> can search all the group's independent sites from the parent site as if they
> were one web-site.
> This means I always crawl and index the same group of sites, which is
> expanding slowly, so the crawler is always given the same set of URLs.
>
> What is the best approach to crawl and build an index
> for searching? Are crawl and/or index tools part
> of this tool-set? Are they available?
>
>
Something like this, perhaps (shell script stuff):
spider() {
  dir="$1"
  shift
  wget --progress=bar "--directory-prefix=$dir" --html-extension \
    --cookies=on --save-cookies "$dir/cookies" \
    --load-cookies "$dir/cookies" \
    --exclude-directories=logout,login,search \
    --recursive --convert-links --mirror "$@"
}
Then call it like this:
spider site/dump-directory http://site/
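Since you have a fixed group of sites, you could keep their URLs in a plain
text file (sites.txt is a hypothetical name, one URL per line) and derive each
dump directory from the hostname; a sketch, reusing the spider function above:

```shell
#!/bin/sh
# Spider every site listed in sites.txt (hypothetical filename),
# putting each mirror into dumps/<hostname>.
while read -r url; do
  host="${url#http://}"   # strip the scheme
  host="${host%%/*}"      # keep only the hostname part
  spider "dumps/$host" "$url"
done < sites.txt
```

Re-running the loop periodically keeps the mirrors fresh, since wget's
--mirror option only re-fetches pages that have changed.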
Then, once the files are on disk, you can index them with whatever
indexing tools can handle flat files.
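For instance, Xapian's own omindex tool (shipped with the Omega package) can
build a database from a tree of mirrored files; a sketch, where the database
path and base URL are placeholders you would adapt:

```shell
# Index the mirrored files into a Xapian database with omindex.
# The --url value maps the on-disk tree back to the site's URL space.
omindex --db /var/lib/xapian/group-db \
        --url http://site/ \
        site/dump-directory
```

Running omindex once per mirrored site into the same --db gives you a single
database you can search across the whole group with Omega or the Xapian API.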
Sam