[Xapian-tickets] [Xapian] #679: Memory and speed issues in wildcard searches

Wed May 6 14:01:44 BST 2015

#679: Memory and speed issues in wildcard searches
-------------------------+--------------------------
 Reporter:  dk           |             Owner:  olly
     Type:  defect       |            Status:  new
 Priority:  normal       |         Milestone:
Component:  QueryParser  |           Version:  1.3.2
 Severity:  normal       |        Resolution:
 Keywords:               |        Blocked By:
 Blocking:               |  Operating System:  All
-------------------------+--------------------------

Comment (by olly):

 In git master (since 1.3.2) there are 3 modes for limiting wildcard
 expansion:

 {{{
     enum {
         /** Throw an error if OP_WILDCARD exceeds its expansion limit.
          *
          *  Xapian::WildcardError will be thrown when the query is actually
          *  run.
          */
         WILDCARD_LIMIT_ERROR,
         /** Stop expanding when OP_WILDCARD reaches its expansion limit.
          *
          *  This makes the wildcard expand to only the first N terms (sorted
          *  by byte order).
          */
         WILDCARD_LIMIT_FIRST,
         /** Limit OP_WILDCARD expansion to the most frequent terms.
          *
          *  If OP_WILDCARD would expand to more than its expansion limit, the
          *  most frequent terms are taken.  This approach works well for cases
          *  such as expanding a partial term at the end of a query string which
          *  the user hasn't finished typing yet - as well as being less expensive
          *  to evaluate than the full expansion, using only the most frequent
          *  terms tends to give better results too.
          */
         WILDCARD_LIMIT_MOST_FREQUENT
     };
 }}}

 And you can tell `QueryParser` which mode(s) to use for wildcards and for
 partial terms.
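
 To make the difference between the three modes concrete, here is a toy
 model of the behaviour described in the enum comments.  It is only a
 sketch and not Xapian's implementation: `LimitMode`, `TermInfo` and
 `expand_wildcard` are hypothetical names, the term list is a plain
 in-memory vector rather than a database, and treating a limit of 0 as
 "no limit" is an assumption made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical names throughout: a toy model, not Xapian's own code.
enum class LimitMode { Error, First, MostFrequent };

struct TermInfo {
    std::string term;   // term text
    unsigned termfreq;  // number of documents containing the term
};

// Expand `prefix` against `terms`, which must be sorted by byte order.
// `max_expansion` caps the number of terms returned; 0 means "no limit"
// (an assumption of this sketch).
std::vector<std::string> expand_wildcard(const std::vector<TermInfo>& terms,
                                         const std::string& prefix,
                                         std::size_t max_expansion,
                                         LimitMode mode)
{
    // Precondition: the term list is in byte order.
    assert(std::is_sorted(terms.begin(), terms.end(),
                          [](const TermInfo& a, const TermInfo& b) {
                              return a.term < b.term;
                          }));

    std::vector<TermInfo> matches;
    for (const TermInfo& t : terms) {
        if (t.term.compare(0, prefix.size(), prefix) != 0)
            continue;  // does not start with the prefix
        if (max_expansion != 0 && matches.size() == max_expansion) {
            // One more match would exceed the limit.
            if (mode == LimitMode::Error)
                throw std::runtime_error("wildcard expands to too many terms");
            if (mode == LimitMode::First)
                break;  // keep only the first N terms in byte order
        }
        matches.push_back(t);
    }

    if (mode == LimitMode::MostFrequent && max_expansion != 0 &&
        matches.size() > max_expansion) {
        // Move the N most frequent terms to the front (ties broken by term
        // text so the result is deterministic), then drop the rest.
        std::partial_sort(matches.begin(), matches.begin() + max_expansion,
                          matches.end(),
                          [](const TermInfo& a, const TermInfo& b) {
                              if (a.termfreq != b.termfreq)
                                  return a.termfreq > b.termfreq;
                              return a.term < b.term;
                          });
        matches.erase(matches.begin() + max_expansion, matches.end());
        // Restore byte order for the surviving terms.
        std::sort(matches.begin(), matches.end(),
                  [](const TermInfo& a, const TermInfo& b) {
                      return a.term < b.term;
                  });
    }

    std::vector<std::string> result;
    for (const TermInfo& t : matches)
        result.push_back(t.term);
    return result;
}
```

 With the terms apple (5), apply (50) and apt (20) and a limit of 2,
 expanding "ap" keeps apple and apply in the "first" mode, keeps apply and
 apt in the "most frequent" mode, and throws in the "error" mode.  Note that
 the "most frequent" mode in this sketch still has to visit every matching
 term before it can choose which ones to keep.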

 Using the "glass" backend for the database should make a significant
 difference to memory usage for cases like this, as glass reference-counts
 cursor blocks, whereas chert just allocates fresh cursor blocks for all
 one million terms.

 I changed `create` to create the database like so:

 {{{
 #!perl
 my $db = Xapian::WritableDatabase->new(
     "index", DB_CREATE_OR_OPEN|Xapian::DB_BACKEND_GLASS );
 }}}

 And with current git master I get 111MB used by the search (taking off the
 memory allocated before we expand the wildcard, that's 713 bytes per term
 in the expanded wildcard):

 {{{
   time     vsz (  diff)    rss (  diff) shared (  diff)   code (  diff)   data (  diff)
      0   43544 ( 43544)  11688 ( 11688)   6672 (  6672)      8 (     8)   5200 (  5200) 18
    627  113160 ( 69616)  82316 ( 70628)   7676 (  1004)      8 (     0)  74816 ( 69616) 20
    627  113160 (     0)  82316 (     0)   7676 (     0)      8 (     0)  74816 (     0) 22
 }}}

 The search took ages to run (I didn't time it, but it felt like several
 minutes).  But my development tree is built with all possible assertions
 enabled, so I'm not sure how long it would take with a normal build
 anyway.

--
Ticket URL: <http://trac.xapian.org/ticket/679#comment:5>
Xapian <http://xapian.org/>