[Xapian-tickets] [Xapian] #679: Memory and speed issues in wildcard searches
Xapian
nobody at xapian.org
Wed May 6 14:01:44 BST 2015
#679: Memory and speed issues in wildcard searches
-------------------------+--------------------------
Reporter: dk | Owner: olly
Type: defect | Status: new
Priority: normal | Milestone:
Component: QueryParser | Version: 1.3.2
Severity: normal | Resolution:
Keywords: | Blocked By:
Blocking: | Operating System: All
-------------------------+--------------------------
Comment (by olly):
In git master (since 1.3.2) there are 3 modes for limiting wildcard
expansion:
{{{
enum {
    /** Throw an error if OP_WILDCARD exceeds its expansion limit.
     *
     *  Xapian::WildcardError will be thrown when the query is actually
     *  run.
     */
    WILDCARD_LIMIT_ERROR,
    /** Stop expanding when OP_WILDCARD reaches its expansion limit.
     *
     *  This makes the wildcard expand to only the first N terms (sorted
     *  by byte order).
     */
    WILDCARD_LIMIT_FIRST,
    /** Limit OP_WILDCARD expansion to the most frequent terms.
     *
     *  If OP_WILDCARD would expand to more than its expansion limit, the
     *  most frequent terms are taken.  This approach works well for cases
     *  such as expanding a partial term at the end of a query string which
     *  the user hasn't finished typing yet - as well as being less
     *  expensive to evaluate than the full expansion, using only the most
     *  frequent terms tends to give better results too.
     */
    WILDCARD_LIMIT_MOST_FREQUENT
};
}}}
And you can tell `QueryParser` which mode(s) to use for wildcards and for
partial terms.
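To make the difference between the three modes concrete, here's a rough standalone sketch in C++. It models the term list as a plain `std::map` and is not how Xapian implements OP_WILDCARD; `LimitMode`, `expand_prefix` and the sample terms are names made up for illustration, and the sketch treats a limit of 0 as "no limit".

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the three WILDCARD_LIMIT_* modes.
enum class LimitMode { Error, First, MostFrequent };

using Entry = std::pair<std::string, unsigned>;  // (term, frequency)

// Expand `prefix` against `index` (term -> frequency), keeping at most
// `limit` terms.  A limit of 0 is treated as "no limit" in this sketch.
std::vector<std::string>
expand_prefix(const std::map<std::string, unsigned>& index,
              const std::string& prefix,
              std::size_t limit,
              LimitMode mode)
{
    std::vector<Entry> matches;
    // std::map keeps keys sorted, so every term starting with `prefix`
    // forms one contiguous run beginning at lower_bound(prefix).
    for (auto it = index.lower_bound(prefix); it != index.end(); ++it) {
        if (it->first.compare(0, prefix.size(), prefix) != 0) break;
        matches.emplace_back(it->first, it->second);
        if (limit != 0 && matches.size() > limit) {
            if (mode == LimitMode::Error)
                throw std::runtime_error("wildcard expands to too many terms");
            if (mode == LimitMode::First) {
                // Keep only the first `limit` terms in sorted order.
                matches.pop_back();
                break;
            }
            // MostFrequent has to see every candidate before choosing.
        }
    }
    if (mode == LimitMode::MostFrequent && limit != 0 &&
        matches.size() > limit) {
        // Highest frequency first; ties keep their sorted order.
        std::stable_sort(matches.begin(), matches.end(),
                         [](const Entry& a, const Entry& b) {
                             return a.second > b.second;
                         });
        matches.resize(limit);
    }
    std::vector<std::string> terms;
    for (const Entry& e : matches) terms.push_back(e.first);
    return terms;
}
```

With an index holding search (50), searched (5), searcher (20) and searching (30) and a limit of 2, `LimitMode::First` yields search and searched, `LimitMode::MostFrequent` yields search and searching, and `LimitMode::Error` throws.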
Using the "glass" backend for the database should make a significant
difference to memory usage for cases like this, as glass reference counts
cursor blocks, whereas chert just allocates fresh cursor blocks for all
one million terms.
I changed `create` to create the database like so:
{{{
#!perl
my $db = Xapian::WritableDatabase->new(
    "index",
    DB_CREATE_OR_OPEN|Xapian::DB_BACKEND_GLASS
);
}}}
And with current git master I get 111MB used by the search (taking off the
memory allocated before we expand the wildcard, that's 713 bytes per term
in the expanded wildcard):
{{{
time     vsz (  diff)     rss (  diff)  shared (  diff)    code (  diff)    data (  diff)
   0   43544 ( 43544)   11688 ( 11688)    6672 (  6672)       8 (     8)    5200 (  5200) 18
 627  113160 ( 69616)   82316 ( 70628)    7676 (  1004)       8 (     0)   74816 ( 69616) 20
 627  113160 (     0)   82316 (     0)    7676 (     0)       8 (     0)   74816 (     0) 22
}}}
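The per-term figure is just the growth in memory divided by the number of terms the wildcard expanded to. A minimal sketch of that arithmetic, assuming the table's columns are in KiB (the unit isn't stated) and taking the term count as a parameter; `bytes_per_term` is a made-up helper name:

```cpp
#include <cstdint>

// Bytes of extra memory per expanded term.
// Assumes `delta_kib` is a memory increase measured in KiB (1024 bytes).
double bytes_per_term(std::uint64_t delta_kib, std::uint64_t term_count)
{
    return static_cast<double>(delta_kib) * 1024.0 /
           static_cast<double>(term_count);
}
```

Plugging in the vsz growth of 69616 from the table, `bytes_per_term(69616, 100000)` comes to about 713 and `bytes_per_term(69616, 1000000)` to about 71, so the result scales directly with the term count you assume.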
The search took ages to run (I didn't time it, but it felt like several
minutes). But my development tree is built with all possible assertions
enabled, so I'm not sure how long it would take with a normal build
anyway.
--
Ticket URL: <http://trac.xapian.org/ticket/679#comment:5>
Xapian <http://xapian.org/>