[Xapian-discuss] newbie help, group site crawl, index and search

Samuel Liddicott sam at liddicott.com
Thu Feb 3 10:15:36 GMT 2005


Haniff Din wrote:

>Hello xapian-discuss,
>
>  Forgive the newbie question.
>  
>  I want to crawl a subset of business sites on a regular basis, in a
>  specific niche market (about 100-200 sites).
>  In effect, this is for a parent charity organisation's website, so you
>  can search all the group's independent sites from the parent as if
>  they were one web-site.
>  Of course this means I always crawl and index the same group of
>  sites, which is slowly expanding. So the crawler is always given
>  the same set of URLs to crawl.
>  
>  What is the best approach to crawl and build an index
>  for searching?  Are crawl and/or index tools part
>  of this tool-set? Are they available?
>  
>

Something like this, perhaps (shell script stuff):

spider() {
  dir="$1"
  shift

  wget --progress=bar "--directory-prefix=$dir" --html-extension \
    --cookies=on --save-cookies "$dir/cookies" --load-cookies "$dir/cookies" \
    --exclude-directories=logout,login,search \
    --recursive --convert-links --mirror "$@"
}

Then call:

  spider site/dump-directory http://site/
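Since the same fixed group of sites is crawled each run, the spider call can be driven from a list. A minimal sketch, assuming the spider() function above is defined; the example.org/example.com URLs and the hostname-based directory naming are illustrative assumptions, not from the post:

```shell
#!/bin/sh
# Hypothetical list of group sites; in practice, the real 100-200 URLs,
# perhaps read from a file with: while read -r url; do ...; done < sites.txt
sites="http://example.org/ http://example.com/"

for url in $sites; do
  # Derive a per-site dump directory from the hostname.
  dir=${url#http://}
  dir=${dir%%/*}
  echo "would spider $url into $dir"
  # spider "$dir" "$url"   # uncomment to actually crawl
done
```

Each site then lands in its own directory, ready for the indexing step below.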

Then, once the files are on disk, you can index them using whatever
indexing tools can handle flat files.
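For the indexing step, Xapian itself ships a tool for exactly this case: omindex, part of the Omega search frontend, builds a Xapian database from a directory tree of files. A sketch of one invocation; the database path is an assumption, and the URL/directory match the spider call above:

```shell
# omindex comes with Omega, Xapian's CGI search frontend.
# --db:  the Xapian database to create or update (path is an assumption)
# --url: the base URL the files were mirrored from, so search results
#        can link back to the live site
omindex --db /var/lib/xapian/group --url http://site/ site/dump-directory
```

Each crawled site's dump can be indexed into the same database by repeating the command with that site's --url and directory, giving one searchable index across the whole group.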

Sam


