[Xapian-discuss] Xapian Terms vs. Document Partition.
Kevin Duraj
kevin.softdev at gmail.com
Wed Jun 4 00:23:31 BST 2008
Yeah, it is getting tougher to index 100 million web sites on one
machine, but I have learned a few things to make it faster. A big
problem on the Internet today is that roughly 20% of web sites are
dedicated entirely to spamdexing and another 20% to xxx, and they link
to each other and also to legitimate web sites. My crawlers used to
get stuck in cycles of spamdex or xxx sites and could not get out of
them. However, it is not hard to write pattern recognition software in
Perl that detects these sites and avoids crawling them; a sketch of
that filtering idea is below. The patterns do have to be evaluated and
often rewritten, though, which is time consuming.
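To give an idea of the sort of filter I mean (mine is in Perl; this is
only an illustrative Python sketch, and the patterns and helper names
here are made-up examples, not my real rules):

    import re

    # Made-up example patterns; the real lists are much larger and
    # need regular review and rewriting, as noted above.
    SPAM_URL_PATTERNS = [re.compile(r"casino|viagra|warez", re.I)]
    SPAM_TEXT_PATTERNS = [re.compile(r"cheap pills|free casino", re.I)]

    def looks_like_spam(url, page_text):
        """True if the URL or the page text matches a known spam pattern."""
        if any(p.search(url) for p in SPAM_URL_PATTERNS):
            return True
        return any(p.search(page_text) for p in SPAM_TEXT_PATTERNS)

    def crawl(frontier, fetch):
        """Skip spammy pages and never revisit a URL, so the crawler
        cannot get stuck cycling through interlinked spam or xxx sites."""
        seen = set()
        for url in frontier:
            if url in seen:
                continue
            seen.add(url)
            page_text = fetch(url)
            if looks_like_spam(url, page_text):
                continue
            yield url, page_text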
There are also many web sites with different URLs serving the same
content, so I run a SHA1 hash over the first part of each page's
content and use it as a unique key; that made the search faster and
removed the duplicate web sites. That part seems to be done, and you
can see that my index at http://myhealthcare.com is almost completely
clean of these kinds of web sites.
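Roughly, the deduplication works like this. This is only a sketch
using the Xapian Python bindings; the prefix length and the Q-prefixed
ID term are illustrative choices rather than exactly what I run:

    import hashlib
    import xapian

    HASH_PREFIX_BYTES = 4096  # illustrative: hash only the first part of the page

    def content_key(page_text):
        """SHA1 of the start of the page content, used as a unique document key."""
        head = page_text.encode("utf-8", "replace")[:HASH_PREFIX_BYTES]
        return hashlib.sha1(head).hexdigest()

    def index_page(db, url, page_text, termgen):
        """Index a page, replacing any earlier document with the same
        content hash, so duplicate pages under different URLs collapse
        into a single entry."""
        doc = xapian.Document()
        doc.set_data(url)
        termgen.set_document(doc)
        termgen.index_text(page_text)
        idterm = "Q" + content_key(page_text)  # "Q" is the usual unique-ID prefix
        doc.add_boolean_term(idterm)
        db.replace_document(idterm, doc)

    db = xapian.WritableDatabase("webindex", xapian.DB_CREATE_OR_OPEN)
    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("english"))
    index_page(db, "http://example.com/", "Example page text ...", termgen)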
Another thing is that my crawlers brought a lot of Asian web sites
into the index, and because those use different character sets they
make the postlist of index terms really big. So I am trying several
different scenarios for analyzing the text and detecting whether it is
readable English, using the Gunning fog and Flesch-Kincaid readability
formulas; that separates the good English text from bad text and from
text written in other scripts, such as Cyrillic and the Asian
languages. I ended up with a second search engine, running at
http://pacificair.com, where you can search Chinese, Japanese, Korean
and some other non-English web sites.
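For reference, the two readability scores can be computed roughly like
this. The syllable counter and the cut-off here are crude
illustrations rather than my exact code, but text that is not in the
Latin script falls far outside the normal English range, which is what
makes this usable as a filter:

    import re

    def count_syllables(word):
        """Very rough syllable estimate: runs of vowels in a word."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def readability(text):
        """Return (Flesch-Kincaid grade level, Gunning fog index)."""
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z]+", text)
        if not words:
            return float("inf"), float("inf")  # no Latin-script words at all
        syllables = sum(count_syllables(w) for w in words)
        complex_words = sum(1 for w in words if count_syllables(w) >= 3)
        wps = len(words) / sentences
        fk_grade = 0.39 * wps + 11.8 * (syllables / len(words)) - 15.59
        fog = 0.4 * (wps + 100.0 * complex_words / len(words))
        return fk_grade, fog

    def looks_like_english(text, max_grade=20.0):
        """Illustrative cut-off: wildly out-of-range scores mean non-English."""
        fk_grade, fog = readability(text)
        return fk_grade <= max_grade and fog <= max_grade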
There are a few things on my mind. One is that we could most probably
rewrite scriptindex so that the index is distributed into separate
indexes across many servers based on terms, and a query would then
only go to the particular servers that hold its search terms. Nobody
else seems to want to do it, and I am the only one left here who
believes in it, so I guess I will have to hack scriptindex to split
indexes by term rather than by document; a rough sketch of the routing
idea is below.

On the ranking side, I am looking to obtain major routers' logs,
collect their traffic to each web site, and then give a higher weight
to web sites with a high frequency of visits. I am wondering where
sites like Alexa.com and other search engines obtain their data about
the activity of other web sites.
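To make the term-partitioning idea concrete, here is a rough sketch of
how terms could be routed to separate shard databases. The hashing
scheme and the shard count are just illustrative assumptions, not
something scriptindex does today:

    import hashlib
    import xapian

    NUM_SHARDS = 4  # illustrative shard count

    def shard_for_term(term):
        """Route a term to a shard by hashing it, so each server holds
        one slice of the term space."""
        h = int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16)
        return h % NUM_SHARDS

    def open_shards():
        return [xapian.WritableDatabase("shard%d" % i, xapian.DB_CREATE_OR_OPEN)
                for i in range(NUM_SHARDS)]

    def index_by_term(shards, docid_term, terms):
        """Write a copy of the document into each shard that owns one of
        its terms; each copy carries only the terms that shard owns."""
        per_shard = {}
        for term in terms:
            per_shard.setdefault(shard_for_term(term), set()).add(term)
        for shard_no, shard_terms in per_shard.items():
            doc = xapian.Document()
            doc.add_boolean_term(docid_term)  # shared ID so results can be merged
            for term in shard_terms:
                doc.add_term(term)
            shards[shard_no].replace_document(docid_term, doc)

    # At search time, the query's terms tell you which shards to contact:
    # shards_needed = {shard_for_term(t) for t in query_terms}

The point of splitting by term rather than by document is exactly that
last comment: a query only has to touch the servers that own its terms.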
--
Kevin Duraj
http://myhealthcare.com
On Wed, May 14, 2008 at 2:25 PM, Chris Good <chris at g2.nu> wrote:
> "Kevin Duraj" wrote:
>> My index is growing to 100 million of documents at
>> http://myhealthcare.com and I need to implement some parallel
>> architecture, because it takes too long to update and add new
>> documents into index.
>
> Kevin, what sorts of timings, document update rates and what
> hardware are you running on? Scaling xapian isn't too hard
> provided that you get your hardware and system architecture
> right and 100m documents wouldn't concern me greatly if I were
> asked to implement it.
>
> Chris
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>