[Xapian-discuss] htdig with omega for multiple URLs (websites)

Peter Masiar peter.masiar at yale.edu
Wed Mar 29 18:41:31 BST 2006


Many thanks for suggesting htdig; you saved me a lot of time.
Htdig looks better than my original idea (wget). You were right.

Using htdig, I can crawl and search a single website, but I need to 
integrate search of pages spread over 100+ sites. Learning, learning....

Htdig uses a separate document database for every website (one database 
per URL used to initiate crawling). Htdig can also merge the resulting 
databases to allow searching the integrated results.

If you still have the script you said you wrote to use htdig as a 
crawler front-end for omega, I would be really interested to see it.

My htdig setup crawls a single site. I need to learn how to crawl multiple 
sites and merge the results. Do you recall whether your htdig2omega script 
handled this merging? Or did you search a single htdig-crawled database? 
Or can I merge using htdig and then search using omega?

Thanks for any insight into where to start looking.

Also, if anyone on the list has experience using htdig to crawl multiple 
websites, I would really appreciate insight or sample scripts.
My current approach would be:
1) generate 100+ config files (one per URL), creating 100+ databases
2) generate a script to merge the results.
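
Step 1 could be sketched roughly as follows. This is only a minimal 
sketch in Python, not a tested setup: the helper names (site_name, 
render_config, write_configs) and the directory layout are hypothetical, 
and I am assuming the htdig config attributes database_dir, start_url 
and limit_urls_to behave as I understand them from the htdig 
documentation.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def site_name(url):
    """Turn a start URL into a filesystem-safe name (hypothetical naming scheme)."""
    host = urlparse(url).netloc or url
    return re.sub(r"[^A-Za-z0-9.-]", "_", host)


def render_config(url, db_root):
    """Return the text of one htdig config file for a single start URL.

    Assumption: database_dir, start_url and limit_urls_to are the only
    attributes needed; a real config would likely set more.
    """
    name = site_name(url)
    return (
        f"# generated for {url}\n"
        f"database_dir: {db_root}/{name}\n"
        f"start_url: {url}\n"
        "limit_urls_to: ${start_url}\n"
    )


def write_configs(urls, conf_dir, db_root):
    """Write one config file per URL and return the list of paths written."""
    conf_dir = Path(conf_dir)
    conf_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        name = site_name(url)
        # Each site gets its own database directory, hence 100+ databases.
        Path(db_root, name).mkdir(parents=True, exist_ok=True)
        path = conf_dir / f"{name}.conf"
        path.write_text(render_config(url, db_root))
        paths.append(path)
    return paths
```

Each generated file would then be passed to htdig with its -c option, 
one run per site. The merge in step 2 is not covered here, since how it 
is invoked seems to depend on the htdig version.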

Is there a better way?

Peter Masiar, Yale Center for Medical Informatics

