[Xapian-discuss] Tryout patches for faster chert search: http://trac.xapian.org/ticket/326
chris at s-4-u.net
Thu Sep 8 11:50:23 BST 2011
On 09/08/2011 12:46 PM, Chris wrote:
> On 09/08/2011 11:51 AM, Richard Boulton wrote:
>> Sources of realistic
>> query data are harder to come across - anyone got any good ideas for
> Reminds me about the AOL fuckup a few years ago (they released the
> search queries of 650.000 users, by mistake).
> Mirror: http://www.gregsadetsky.com/aol-data/
> Combined with Wikipedia, Stackoverflow and product-data of a few hundred
> online shops (affili.net et al) could(?) provide a nice and diverse
> On the other side, the database should probably be in-memory, to not be
> limited by disk io, which gives a 40GB index if just using the online
> shop product data.
> Greets, Chris
Could be useful for a performance analysis tool, too:
Most commercial web search engines do not disclose their search logs, so
information about what users are searching for on the Web is difficult
to come by <http://en.wikipedia.org/wiki/Web_search_query#cite_note-2>.
Nevertheless, a study in 2001 that analyzed the queries from the Excite
<http://en.wikipedia.org/wiki/Excite> search engine showed some
interesting characteristics of web search:
* The average length of a search query was 2.4 terms.
* About half of the users entered a single query while a little less
than a third of users entered three or more unique queries.
* Close to half of the users examined only the first one or two
pages of results (10 results per page).
* Less than 5% of users used advanced search features (e.g., Boolean
operators <http://en.wikipedia.org/wiki/Boolean_operators> like
AND, OR, and NOT).
* The top four most frequently used terms were /(empty search)/,
/and/, /of/, and /sex/.
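
Figures like the ones in the list above are cheap to compute for any
query log we can get hold of. A minimal sketch, assuming the log has
already been reduced to a plain list of query strings (the function name
and the sample queries are made up for illustration; this is not the
method used in the Excite study):

```python
from collections import Counter


def query_log_stats(queries, top_n=4):
    """Return (average terms per query, top_n most frequent terms)."""
    term_counts = Counter()
    total_terms = 0
    for query in queries:
        # Naive whitespace tokenisation; a real tool would reuse the
        # same query parsing that the search engine applies.
        terms = query.lower().split()
        total_terms += len(terms)
        if terms:
            term_counts.update(terms)
        else:
            # Count empty searches as a pseudo-term of their own.
            term_counts["(empty search)"] += 1
    avg_len = total_terms / len(queries) if queries else 0.0
    return avg_len, term_counts.most_common(top_n)


# Hypothetical sample data, not taken from any real log.
sample = ["xapian chert", "", "bank of england", "cats and dogs", "of mice and men"]
avg_len, top_terms = query_log_stats(sample)
print(avg_len, top_terms)
```

Running the same function over a benchmark query set and over a real log
would show quickly whether the benchmark is realistic.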
A study of the same Excite query logs revealed that 19% of the queries
contained a geographic term (e.g., place names, zip codes, geographic
A 2005 study of Yahoo's query logs revealed 33% of the queries from the
same user were repeat queries and that 87% of the time the user would
click on the same result. This
suggests that many users use repeat queries to revisit or re-find
information. This analysis is confirmed by a Bing search engine blog
post reporting that about 30% of queries are navigational queries.
In addition, much research has shown that query term frequency
distributions conform to the power law
<http://en.wikipedia.org/wiki/Power_law>, or /long tail/ distribution
curves. That is, a small portion of the terms observed in a large query
log (e.g. > 100 million queries) are used most often, while the
remaining terms are used less often individually.
This example of the Pareto principle
<http://en.wikipedia.org/wiki/Pareto_principle> (or /80-20 rule/) allows
search engines to employ optimization techniques
such as index or database partitioning, caching
<http://en.wikipedia.org/wiki/Cache> and pre-fetching.
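
To get a feel for why a skewed query distribution matters for
benchmarking, here is a rough simulation. It draws queries from a
Zipf-like distribution and replays them through a small LRU result
cache. All parameters (number of distinct queries, cache size, exponent)
are arbitrary assumptions picked for illustration, and the function is
hypothetical, not part of Xapian:

```python
import random
from collections import OrderedDict


def simulate_lru_hit_rate(num_distinct=10_000, num_queries=50_000,
                          cache_size=500, exponent=1.0, seed=42):
    """Replay a synthetic query stream through an LRU cache and
    return the fraction of queries answered from the cache."""
    rng = random.Random(seed)
    # Zipf-like weights: the query of rank r is drawn with probability
    # proportional to 1 / r**exponent (exponent=0 gives a uniform stream).
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_distinct + 1)]
    stream = rng.choices(range(num_distinct), weights=weights, k=num_queries)

    cache = OrderedDict()  # keys kept in least-recently-used order
    hits = 0
    for query_id in stream:
        if query_id in cache:
            hits += 1
            cache.move_to_end(query_id)    # mark as most recently used
        else:
            cache[query_id] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / num_queries


hit_rate = simulate_lru_hit_rate()
uniform_rate = simulate_lru_hit_rate(exponent=0.0)
print(f"zipf-like: {hit_rate:.2f}  uniform: {uniform_rate:.2f}")
```

With the skewed stream the cache answers far more of the queries than
with the uniform one, so a benchmark fed uniformly random queries would
likely understate the effect of caching compared to real traffic.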