[Xapian-discuss] htdig with omega for multiple URLs (websites)
peter.masiar at yale.edu
Wed Mar 29 18:41:31 BST 2006
Many thanks for suggesting htdig; you saved me a lot of time.
You were right: htdig looks better than my original idea, wget.
Using htdig, I can crawl and search a single website, but I need to
integrate search of pages spread over 100+ sites. Learning, learning....
Htdig uses a separate document database for every website (one database
per URL used to initiate crawling). Htdig can also merge the resulting
databases to allow searching the integrated results.
If you still have the script you said you wrote to use htdig as a
crawler front-end for omega, I would be really interested to see it.
My htdig crawls a single site. I need to learn how to crawl multiple sites
and merge the results. Do you recall whether your htdig2omega script handled
this merging? Or did you search a single htdig-crawled database? Or can I
merge using htdig and then search using omega?
Thanks for any insight into where to start looking.
Also, if anyone on the list has experience using htdig to crawl multiple
websites, I would really appreciate insight or sample scripts.
My current approach would be:
1) generate 100+ config files (one per URL), creating 100+ databases
2) generate a script to merge the results.
Is there a better way?
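To make the two steps above concrete, here is a rough sketch of what I have in mind, not something I have run against a real crawl. It assumes the htdig config attributes database_dir, start_url and limit_urls_to, and it assumes that htmerge accepts "-c main.conf -m other.conf" to merge one database into another; that should be checked against the installed htdig version. The function names, the directory layout and the example URLs are all hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def site_name(url):
    """Derive a filesystem-safe name from the URL's host (hypothetical naming scheme)."""
    host = urlparse(url).netloc or url
    return host.replace(":", "_").replace("/", "_")


def render_config(url, db_root):
    """Return the text of a minimal per-site htdig config.

    Assumption: database_dir, start_url and limit_urls_to are enough for a
    basic crawl; a real config would likely set more attributes.
    """
    name = site_name(url)
    return (
        f"database_dir:   {db_root}/{name}\n"
        f"start_url:      {url}\n"
        "limit_urls_to:  ${start_url}\n"
    )


def write_configs(urls, conf_dir, db_root):
    """Step 1: write one config file per start URL and return the file paths."""
    conf_dir = Path(conf_dir)
    conf_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for url in urls:
        path = conf_dir / f"{site_name(url)}.conf"
        path.write_text(render_config(url, db_root))
        paths.append(path)
    return paths


def merge_commands(conf_paths):
    """Step 2: build the command lines that merge every database into the first.

    Assumption: 'htmerge -c main.conf -m other.conf' merges other into main.
    """
    main, *rest = conf_paths
    return [f"htmerge -c {main} -m {other}" for other in rest]
```

Calling write_configs() with the list of 100+ URLs and then printing the output of merge_commands() into a shell script would cover both steps; the crawl itself (htdig -c site.conf for each site) would still have to run in between.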
Peter Masiar, Yale Center for Medical Informatics
A: Because it messes up the flow of reading.
Q: Why is top-posting often frowned upon?