[Xapian-discuss] newbie help, group site crawl, index and search
Jim Lynch
jwl at sgi.com
Thu Feb 3 13:13:23 GMT 2005
From the xapian web page,
" The indexer supplied can index HTML, PHP, PDF, PostScript, and plain
text. Adding support for indexing other formats is easy where conversion
filters are available (e.g. Microsoft Word). This indexer works using
the filing system, but we also provide a script to allow the htdig web
crawler to be hooked in, allowing remote sites to be searched using Omega."
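A minimal sketch of that filesystem-based indexing step, assuming Omega's omindex tool is installed; the helper name, database path, and directory below are placeholders, not anything from the Xapian docs:

```shell
# Hypothetical helper (all names are placeholders): index one locally
# mirrored site with omindex, the filesystem indexer shipped with Omega.
index_site() {
    db="$1"    # Xapian database directory to write to
    url="$2"   # base URL the pages were originally served from
    dir="$3"   # local directory holding the mirrored files
    # omindex walks $dir and stores each document under a URL formed
    # from $url plus the file's path relative to $dir.
    omindex --db "$db" --url "$url" "$dir"
}
```

You would then point Omega's search front-end at the same database directory.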
Jim.
Samuel Liddicott wrote:
> Haniff Din wrote:
>
>> Hello xapian-discuss,
>>
>> Forgive the newbie question.
>>
>> I want to crawl a subset of business sites, in a specific niche
>> market (about 100-200 sites), on a regular basis.
>> In effect, this is for a parent charity organisation's website, so
>> you can search all of the group's independent sites from the parent,
>> as if they were one web-site.
>> This means I always crawl and index the same group of sites, which
>> is slowly expanding, so the crawler is given essentially the same
>> set of URLs each time.
>>
>> What is the best approach to crawl and build an index
>> for searching? Are crawl and/or index tools part
>> of this tool-set? Are they available?
>>
>>
>
> Something like this, perhaps (shell script stuff):
> spider() {
>     dir="$1"
>     shift
>
>     wget --progress=bar "--directory-prefix=$dir" --html-extension \
>         --cookies=on --save-cookies "$dir/cookies" \
>         --load-cookies "$dir/cookies" \
>         --exclude-directories=logout,login,search \
>         --recursive --convert-links --mirror "$@"
> }
>
> then call:
> spider site/dump-directory http://site/
>
> then, once the files are on disk, you can index them with any
> indexing tool that can handle flat files.
>
> Sam
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
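Tying the two steps together, a hedged sketch: the spider() function is the one from Sam's message above, while omindex (Omega's filesystem indexer), the database path, and the directory layout are my own assumptions:

```shell
# Sketch only: mirror each site with spider() (from the quoted message),
# then index the mirror with omindex.  The "mirror/" layout, database
# path, and site list are placeholder assumptions.
crawl_and_index() {
    db="$1"; shift
    for site in "$@"; do
        spider "mirror/$site" "http://$site/"
        omindex --db "$db" --url "http://$site/" "mirror/$site"
    done
}

# e.g. crawl_and_index /var/lib/xapian/group site1.example site2.example
```

Re-running this on a schedule re-crawls the same fixed set of sites, which matches the "same URLs each time" requirement in the original question.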