[Xapian-discuss] Tryout patches for faster chert search: http://trac.xapian.org/ticket/326

Thu Sep 8 11:50:23 BST 2011

On 09/08/2011 12:46 PM, Chris wrote:
> On 09/08/2011 11:51 AM, Richard Boulton wrote:
>> Sources of realistic
>> query data are harder to come across - anyone got any good ideas for
>> that?  
>>
> Reminds me about the AOL fuckup a few years ago (they released the
> search queries of 650,000 users, by mistake).
> Mirror: http://www.gregsadetsky.com/aol-data/
>
> Combined with Wikipedia, Stackoverflow and product data from a few hundred
> online shops (affili.net et al), this could(?) provide a nice and diverse
> dataset.
>
> On the other hand, the database should probably be in-memory, to not be
> limited by disk I/O, which means a 40GB index if just using the online
> shop product data.
>
> Greets, Chris

Could be useful for a performance analysis tool, too:

Most commercial web search engines do not disclose their search logs, so
information about what users are searching for on the Web is difficult
to come by.^[2]
<http://en.wikipedia.org/wiki/Web_search_query#cite_note-1> Nevertheless,
a study in 2001 ^[3]
<http://en.wikipedia.org/wiki/Web_search_query#cite_note-2> that analyzed
the queries from the Excite <http://en.wikipedia.org/wiki/Excite> search
engine showed some interesting characteristics of web search:

    * The average length of a search query was 2.4 terms.
    * About half of the users entered a single query while a little less
      than a third of users entered three or more unique queries.
    * Close to half of the users examined only the first one or two
      pages of results (10 results per page).
    * Less than 5% of users used advanced search features (e.g., Boolean
      operators <http://en.wikipedia.org/wiki/Boolean_operators> like
      AND, OR, and NOT).
    * The top four most frequently used terms were /(empty search)/,
      /and/, /of/, and /sex/.

A study of the same Excite query logs revealed that 19% of the queries
contained a geographic term (e.g., place names, zip codes, geographic
features, etc.).^[4]
<http://en.wikipedia.org/wiki/Web_search_query#cite_note-3>

A 2005 study of Yahoo's query logs revealed 33% of the queries from the
same user were repeat queries and that 87% of the time the user would
click on the same result.^[5]
<http://en.wikipedia.org/wiki/Web_search_query#cite_note-4> This
suggests that many users use repeat queries to revisit or re-find
information. This analysis is confirmed by a Bing search engine blog
post stating that about 30% of queries are navigational queries.^[6]
<http://en.wikipedia.org/wiki/Web_search_query#cite_note-5>

In addition, much research has shown that query term frequency
distributions conform to the power law
<http://en.wikipedia.org/wiki/Power_law>, or /long tail/ distribution
curves. That is, a small portion of the terms observed in a large query
log (e.g. > 100 million queries) are used most often, while the
remaining terms are used less often individually.^[7]
<http://en.wikipedia.org/wiki/Web_search_query#cite_note-baezayates1-6>
This example of the Pareto principle
<http://en.wikipedia.org/wiki/Pareto_principle> (or /80-20 rule/) allows
search engines to employ optimization techniques
<http://en.wikipedia.org/w/index.php?title=Optimization_techniques&action=edit&redlink=1>
such as index or database partitioning
<http://en.wikipedia.org/wiki/Partition_%28database%29>, caching
<http://en.wikipedia.org/wiki/Cache> and pre-fetching.
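
If no real query log is at hand, the two figures quoted above (a power-law
term distribution and an average of 2.4 terms per query) might be enough
to fake a rough benchmark workload. The following is only a sketch under
assumptions of mine: the function name, the vocabulary list, the Zipf
exponent of 1.0 and the geometric query-length distribution are all
hypothetical choices, not taken from the Excite study or from Xapian.

```python
import random
from itertools import accumulate


def make_query_generator(vocabulary, exponent=1.0, mean_terms=2.4, seed=0):
    """Return a function producing synthetic queries (lists of terms).

    vocabulary must be ordered from most to least frequent term.
    Terms are drawn with probability proportional to 1 / rank**exponent,
    i.e. a Zipf-like power law (the exponent is an assumption).
    """
    rng = random.Random(seed)
    weights = [1.0 / rank ** exponent for rank in range(1, len(vocabulary) + 1)]
    cum_weights = list(accumulate(weights))
    # Geometric query length on 1, 2, 3, ...: stopping with probability
    # 1 / mean_terms after each term gives an expected length of mean_terms.
    p_stop = 1.0 / mean_terms

    def next_query():
        n_terms = 1
        while rng.random() > p_stop:
            n_terms += 1
        # Sampling is with replacement, so a term may repeat within a query.
        return rng.choices(vocabulary, cum_weights=cum_weights, k=n_terms)

    return next_query


if __name__ == "__main__":
    # Placeholder vocabulary; in practice this could be the terms of the
    # indexed collection sorted by descending frequency.
    vocab = ["term%d" % i for i in range(1000)]
    next_query = make_query_generator(vocab)
    for _ in range(5):
        print(" ".join(next_query()))
```

Each generated list would still have to be joined into a query string and
fed to whatever search code is being timed. Repeat queries and geographic
terms, which the studies above also mention, are not modelled here.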

http://en.wikipedia.org/wiki/Web_search_query