[Xapian-tickets] [Xapian] #272: QueryParser scales as O(N*N) with AUTO_MULTIWORD_SYNONYM
Xapian
nobody at xapian.org
Wed Jun 4 14:37:21 BST 2008
#272: QueryParser scales as O(N*N) with AUTO_MULTIWORD_SYNONYM
---------------------+------------------------------------------------------
Reporter: richard | Owner: olly
Type: defect | Status: new
Priority: low | Milestone:
Component: Other | Version: SVN HEAD
Severity: normal | Resolution:
Keywords: | Blockedby:
Platform: All | Blocking:
---------------------+------------------------------------------------------
Description changed by richard:
Old description:
> I've just fixed various cases where the QueryParser scaled as O(N*N)
> where N is the number of terms in the query. However, there is one such
> case left, which involves parsing queries with the multiword synonym
> option.
>
> Unlike the cases I've just fixed (which involved pair-wise construction
> of Query objects), the O(N*N) behaviour here appears to stem from
> searching for any of the synonyms in all the possible multiword
> substrings of the query. A callgrind run showed 500501 invocations of
> string::resize() from the query parser for a 1000 word query.
>
> The current algorithm works by, for each starting point:
> 1. build a string containing all the subsequent words in the query
> 2. test for a synonym matching this string
> 3. (if no synonym found) cut off the last word in the string and go back
> to 2.
>
> I'm not sure that the feature can be reduced to linear, but some possible
> optimisations might be:
>
> * Don't bother looking for synonyms of more than 250 characters (or
> thereabouts), since there is a limit on the length of synonym keys of
> about this amount.
>
> * keep track of the number of words in the longest multiword synonym key
> stored, and only bother testing for substrings of the query of at most
> this length. This alone should change the scaling to O(N*M), where M is
> the number of words in the longest synonym.
>
> * Synonyms are stored in the btree in sorted order, so we could add the
> ability to quickly check if any synonyms exist with a given prefix.
> Then, we could check for a synonym based on the first word, and only
> progress to searching for one based on the first and second word if a
> possible candidate was found.
New description:
I've just fixed various cases where the QueryParser scaled as O(N*N) where
N is the number of terms in the query. However, there is one such case
left, which involves parsing queries with the multiword synonym option.
Unlike the cases I've just fixed (which involved pair-wise construction of
Query objects), the O(N*N) behaviour here appears to stem from searching
for any of the synonyms in all the possible multiword substrings of the
query. A callgrind run showed 500501 invocations of string::resize() from
the query parser for a 1000 word query.
The current algorithm works by, for each starting point:
1. build a string containing all the subsequent words in the query
2. test for a synonym matching this string
3. (if no synonym found) cut off the last word in the string and go back
to 2.
I'm not sure that the feature can be reduced to linear, but some possible
optimisations might be:
* Don't bother looking for synonyms of more than 250 characters (or
thereabouts), since there is a limit on the length of synonym keys of
about this amount.
* keep track of the number of words in the longest multiword synonym key
stored, and only bother testing for substrings of the query of at most
this length. This alone should change the scaling to O(N*M), where M is
the number of words in the longest synonym.
* Synonyms are stored in the btree in sorted order, so we can use
synonym_keys_begin(prefix) to check if any synonyms exist with a given
prefix. Then, we could check for a synonym based on the first word, and
only progress to searching for one based on the first and second word if a
possible candidate was found.
--
--
Ticket URL: <http://trac.xapian.org/ticket/272#comment:1>
Xapian <http://xapian.org/>
Xapian
More information about the Xapian-tickets
mailing list