[Xapian-discuss] newbie help, group site crawl, index and search
Jim Lynch
jwl at sgi.com
Thu Feb 3 13:13:23 GMT 2005
From the xapian web page,
" The indexer supplied can index HTML, PHP, PDF, PostScript, and plain
text. Adding support for indexing other formats is easy where conversion
filters are available (e.g. Microsoft Word). This indexer works using
the filing system, but we also provide a script to allow the htdig web
crawler to be hooked in, allowing remote sites to be searched using Omega."
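A minimal sketch of that filesystem-based indexing step, assuming Omega's omindex tool is installed; the helper name, database path, and directory below are placeholders, not anything from the Xapian docs:

```shell
# Hypothetical helper (all names are placeholders): index one locally
# mirrored site with omindex, the filesystem indexer shipped with Omega.
index_site() {
    db="$1"    # Xapian database directory to write to
    url="$2"   # base URL the pages were originally served from
    dir="$3"   # local directory holding the mirrored files
    # omindex walks $dir and stores each document under a URL formed
    # from $url plus the file's path relative to $dir.
    omindex --db "$db" --url "$url" "$dir"
}
```

You would then point Omega's search front-end at the same database directory.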
Jim.
Samuel Liddicott wrote:
> Haniff Din wrote:
>
>> Hello xapian-discuss,
>>
>> Forgive the newbie question.
>>
>> I want to crawl a subset of business sites, in a specific niche
>> market (about 100-200 sites), on a regular basis.
>> In effect, this is for a parent charity organisation's website, so
>> you can search all of the group's independent sites from the parent,
>> as if they were one web-site.
>> This means I always crawl and index the same group of sites, which
>> is slowly expanding, so the crawler is given essentially the same
>> set of URLs each time.
>>
>> What is the best approach to crawl and build an index
>> for searching? Are crawl and/or index tools part
>> of this tool-set? Are they available?
>>
>>
>
> Something like this, perhaps (shell script stuff):
> spider() {
>     dir="$1"
>     shift
>
>     wget --progress=bar "--directory-prefix=$dir" --html-extension \
>         --cookies=on --save-cookies "$dir/cookies" \
>         --load-cookies "$dir/cookies" \
>         --exclude-directories=logout,login,search \
>         --recursive --convert-links --mirror "$@"
> }
>
> then call:
> spider site/dump-directory http://site/
>
> then, once the files are on disk, you can index them with any
> indexing tool that can handle flat files.
>
> Sam
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
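Tying the two steps together, a hedged sketch: the spider() function is the one from Sam's message above, while omindex (Omega's filesystem indexer), the database path, and the directory layout are my own assumptions:

```shell
# Sketch only: mirror each site with spider() (from the quoted message),
# then index the mirror with omindex.  The "mirror/" layout, database
# path, and site list are placeholder assumptions.
crawl_and_index() {
    db="$1"; shift
    for site in "$@"; do
        spider "mirror/$site" "http://$site/"
        omindex --db "$db" --url "http://$site/" "mirror/$site"
    done
}

# e.g. crawl_and_index /var/lib/xapian/group site1.example site2.example
```

Re-running this on a schedule re-crawls the same fixed set of sites, which matches the "same URLs each time" requirement in the original question.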