[Xapian-discuss] How to beat Google aka Xapian & Natural Language Processing.

Kevin Duraj kevin.softdev at gmail.com
Fri Oct 19 22:45:51 BST 2007


Google's search engine runs on a cluster of Linux machines: more than
250,000 of them in December 2004, according to an MIT paper. Judging
by Google's traffic as reported by Netcraft, Google might have 500,000
or more servers today.

When you closely examine Google's search engine, you find a few
interesting behaviors that suggest it is not a search engine but
rather a pre-populated information retrieval system based on
key-value pairs residing in the memory of a particular server, with
[key = search terms] and [value = 1000 ranked results]. This type of
key-value search can easily be implemented using the superfast
Berkeley DB, which uses memory and a B-tree to retrieve values by
key. When a user searches for a very unique term that is not found on
any server, Google shows no results or gives the user a suggestion.
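
To make the key/value idea concrete, here is a minimal sketch in
Python. It is only an illustration of the lookup described above, not
how Google does it: a plain dict stands in for Berkeley DB, and the
names precomputed, normalize and lookup, as well as the sample entry,
are made up for the example.

```python
# Hypothetical in-memory table standing in for a Berkeley DB B-tree.
# Key: normalized search terms. Value: pre-ranked list of result URLs.
precomputed = {
    "xapian search": ["http://xapian.org/", "http://lists.xapian.org/"],
}


def normalize(query):
    """Lowercase and collapse whitespace so equivalent queries share a key."""
    return " ".join(query.lower().split())


def lookup(query):
    """Return the stored ranked results, or None when the key is absent."""
    return precomputed.get(normalize(query))
```

A miss (None) is the case where the front end would show no results
or fall back to a suggestion.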

However, in the background, when there are enough requests for the
same search term, Google issues the search to its real search engine
cluster, and getting the result might take a long time. When Google
gets the results, it places them on a particular server whose name
corresponds to the [key = search terms]. The next time the same user
searches for the same terms, Google shows the results at lightning
speed, and the user is impressed, because the [key = search terms]
and [value = 1000 ranked results] are already sitting in the memory
of the pre-populated Berkeley DB. Keys and values that have been
sitting in Berkeley DB for a long time are refreshed against the real
search engine and ranking engine. This looks very simple and easy to
do, except for the ranking part.
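
The populate-on-demand and refresh behavior can be sketched the same
way. Again, this is a guess at the mechanism under stated
assumptions: QueryCache, min_requests, max_age and backend are
hypothetical names, a dict stands in for Berkeley DB, and the slow
"real search engine" is called synchronously here for simplicity
rather than in the background.

```python
import time


class QueryCache:
    """Serve stored results; populate a key only after enough misses."""

    def __init__(self, backend, min_requests=3, max_age=3600.0,
                 clock=time.monotonic):
        self.backend = backend            # slow search function: key -> results
        self.min_requests = min_requests  # misses needed before populating
        self.max_age = max_age            # seconds before an entry is stale
        self.clock = clock
        self.pending = {}                 # key -> number of misses so far
        self.store = {}                   # key -> (results, time stored)

    def search(self, key):
        """Return stored results, or None on a miss."""
        entry = self.store.get(key)
        if entry is not None:
            return entry[0]
        self.pending[key] = self.pending.get(key, 0) + 1
        if self.pending[key] >= self.min_requests:
            # Enough demand: ask the real engine and keep the answer
            # for the next request.
            self.store[key] = (self.backend(key), self.clock())
            del self.pending[key]
        return None

    def refresh_stale(self):
        """Re-run the backend for entries older than max_age; return count."""
        now = self.clock()
        refreshed = 0
        for key, (_, stored_at) in list(self.store.items()):
            if now - stored_at >= self.max_age:
                self.store[key] = (self.backend(key), now)
                refreshed += 1
        return refreshed
```

The ranking itself is hidden inside backend, which is exactly the
part that is not simple.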

Kevin Duraj
http://pacific-design.com


On 10/10/07, James Aylett <james-xapian at tartarus.org> wrote:
> On Tue, Oct 09, 2007 at 11:02:08AM -0700, Kevin Duraj wrote:
>
> > But I am puzzled about something. What makes you think that any
> > corporation can compete with open source and not fail in time? Or
> > what makes you think that any corporation can hire more programmers
> > than the open source community?
>
> The second question isn't really the right one. I think you actually
> mean:
>
>   What makes you think that any corporation can attract more
>   *motivated* programmers *with suitable skills* than the open source
>   community can attract?
>
> I think the jury's out on this. Sometimes the open source community
> seems to win, sometimes corporations. I'd say that corporations have
> the edge in extreme niche markets (there isn't an open source speech
> recogniser that comes to the level of the commercial ones, for
> instance), but this may change over time.
>
> The open source community is generally much, much smaller than the
> corpus of available developers, in any given society (at the
> moment). I suspect that most specialists would rather work
> commercially in their specialism than work in something else and hack
> on their specialism by night; there *are* ways of being paid for open
> source work, but they are either more hassle (eg contracting), or rare
> (eg working for someone like IBM, Sun or RedHat on open source projects).
>
> Incidentally, I and a GNU developer spent some time in 1998 arguing
> with a commercial developer that his company would disappear under the
> rolling stone of Apache. We were wrong: that company was Zeus, and
> they adapted to the changing environment; they now make one of the
> nicest layer 7 load balancers. In the long term, maybe someone will
> beat Zeus, NetScaler, Cisco and the rest... we'll have to see.
>
> J
>
> --
> /--------------------------------------------------------------------------\
>  James Aylett                                                  xapian.org
>  james at tartarus.org                               uncertaintydivision.org