[Xapian-tickets] [Xapian] #719: Tokenized CJK query terms wrongly combined with respect to prefixes

Xapian nobody at xapian.org
Wed May 4 13:29:52 BST 2016


#719: Tokenized CJK query terms wrongly combined with respect to prefixes
-------------------------+---------------------------
 Reporter:  liweitianux  |             Owner:  olly
     Type:  defect       |            Status:  new
 Priority:  normal       |         Milestone:
Component:  QueryParser  |           Version:  1.2.23
 Severity:  normal       |        Resolution:
 Keywords:  CJK, prefix  |        Blocked By:
 Blocking:               |  Operating System:  Linux
-------------------------+---------------------------

Old description:

> I first came across this issue when querying CJK with `mu`
> (https://github.com/djcb/mu) and reported the issue there
> (https://github.com/djcb/mu/issues/123#issuecomment-180999233).  However,
> after some further investigations into `mu` and `xapian` recently, I find
> it is a bug in `xapian`.
>
> ----
>
> Here I demonstrate this issue with `python-xapian`:
> {{{
> #!python
> qp = xapian.QueryParser()
>
> qp.add_prefix("subject", "S")
> qp.add_prefix("s", "S")
> qp.add_prefix("body", "B")
> qp.add_prefix("b", "B")
> qp.add_prefix("", "B")
> qp.add_prefix("", "S")
>
> qstr1 = "中文"
> qstr2 = "b:中文"
> qstr3 = "hello AND world"
>
> q1 = qp.parse_query(qstr1)
> q2 = qp.parse_query(qstr2)
> q3 = qp.parse_query(qstr3)
>
> print(q1)
> # Xapian::Query((B中:(pos=1) AND S中:(pos=1) AND
> #                B中文:(pos=1) AND S中文:(pos=1) AND
> #                B文:(pos=1) AND S文:(pos=1)))
>
> print(q2)
> # Xapian::Query((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)))
>
> print(q3)
> # Xapian::Query(((Bhello:(pos=1) OR Shello:(pos=1)) AND
> #                (Bworld:(pos=2) OR Sworld:(pos=2))))
> }}}
>
> The parsed queries for `qstr2` and `qstr3` are right, while the parsed
> query `q1` for (the CJK query string **without a prefix**) `qstr1` is
> **wrongly combined** with `OP_AND` with respect to the **prefixes**.
> Therefore, I have the CJK search problem in `mu` which gives me wrong or
> empty results.
>
> ----
>
> The **expected** parsed query for `qstr1` should look like this:
> {{{
> Xapian::Query(((B中:(pos=1) OR S中:(pos=1)) AND
>                (B中文:(pos=1) OR S中文:(pos=1)) AND
>                (B文:(pos=1) OR S文:(pos=1))))
> }}}
> where the **same** tokenized CJK term should be `OP_OR` combined with
> respect to the **prefixes**, and then be `OP_AND` combined with respect
> to each tokenized CJK term.
>
> On the other hand, the query may also look like this (i.e.,
> `qstr1 = "b:中文 OR s:中文"` for the above example):
> {{{
> Xapian::Query(((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)) OR
>                (S中:(pos=2) AND S中文:(pos=2) AND S文:(pos=2))))
> }}}
> which seems to be more intuitive and maybe more logical to me.
>
> ----
>
> Environment:
> * Linux: Debian, testing, amd64
> * Xapian: `libxapian22v5`, version 1.2.23
> * `python-xapian`: version 1.2.23-1
> * environment variable: `XAPIAN_CJK_NGRAM=1`
>

> Best regards!
>
> Aly

New description:

 I first came across this issue when querying CJK text with `mu`
 (https://github.com/djcb/mu) and reported it there
 (https://github.com/djcb/mu/issues/123#issuecomment-180999233).  However,
 after further investigation into `mu` and `xapian`, I believe it is a bug
 in `xapian`.

 ----

 Here I demonstrate this issue with `python-xapian`:
 {{{
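 #!python
 import xapian

 # Run with XAPIAN_CJK_NGRAM=1 set (see Environment below).
 qp = xapian.QueryParser()

 # Named fields: "subject"/"s" map to term prefix S, "body"/"b" to B.
 qp.add_prefix("subject", "S")
 qp.add_prefix("s", "S")
 qp.add_prefix("body", "B")
 qp.add_prefix("b", "B")
 # Empty field name: unprefixed query terms should search both B and S.
 qp.add_prefix("", "B")
 qp.add_prefix("", "S")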

 qstr1 = "中文"
 qstr2 = "b:中文"
 qstr3 = "hello AND world"

 q1 = qp.parse_query(qstr1)
 q2 = qp.parse_query(qstr2)
 q3 = qp.parse_query(qstr3)

 print(q1)
 # Xapian::Query((B中:(pos=1) AND S中:(pos=1) AND
 #                B中文:(pos=1) AND S中文:(pos=1) AND
 #                B文:(pos=1) AND S文:(pos=1)))

 print(q2)
 # Xapian::Query((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)))

 print(q3)
 # Xapian::Query(((Bhello:(pos=1) OR Shello:(pos=1)) AND
 #                (Bworld:(pos=2) OR Sworld:(pos=2))))
 }}}

 The parsed queries for `qstr2` and `qstr3` are correct, but the parsed
 query `q1` for `qstr1` (the CJK query string **without a prefix**) is
 **wrongly combined** with respect to the **prefixes**.
 As the output shows, the **same** tokenized CJK term (e.g., `中`) is
 `OP_AND` combined across the prefixes (i.e., `B` and `S` here), when it
 should instead be `OP_OR` combined.
 As a result, CJK searches in `mu` give me wrong or empty results.
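
 To illustrate why the `OP_AND` combination can produce empty results,
 here is a minimal pure-Python sketch. It does not use Xapian: the
 `matches` helper, the tuple-based query representation and the
 `body_only_doc` message are all hypothetical and only model plain boolean
 matching over a set of indexed terms.

```python
# Toy model of boolean query matching; this is NOT Xapian's implementation.
# A query is either a term string or a tuple ("AND" | "OR", [subqueries]).

def matches(query, doc_terms):
    """Return True if the set of terms in a document satisfies the query."""
    if isinstance(query, str):
        return query in doc_terms
    op, subqueries = query
    results = [matches(sub, doc_terms) for sub in subqueries]
    return all(results) if op == "AND" else any(results)

# Hypothetical message whose body contains 中文 but whose subject does not,
# so only the B-prefixed n-gram terms are present for it.
body_only_doc = {"B中", "B中文", "B文"}

# Shape of the query currently produced for the unprefixed string (q1).
actual_q1 = ("AND", ["B中", "S中", "B中文", "S中文", "B文", "S文"])

# Shape of the query this report expects for the same string.
expected_q1 = ("AND", [
    ("OR", ["B中", "S中"]),
    ("OR", ["B中文", "S中文"]),
    ("OR", ["B文", "S文"]),
])

actual_hit = matches(actual_q1, body_only_doc)
expected_hit = matches(expected_q1, body_only_doc)
print(actual_hit, expected_hit)  # False True
```

 Under this model, the current query shape only matches a message that
 contains the text in both the subject and the body.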

 ----

 The **expected** parsed query for `qstr1` should look like this:
 {{{
 Xapian::Query(((B中:(pos=1) OR S中:(pos=1)) AND
                (B中文:(pos=1) OR S中文:(pos=1)) AND
                (B文:(pos=1) OR S文:(pos=1))))
 }}}
 where the **same** tokenized CJK term is first `OP_OR` combined across
 the **prefixes**, and the resulting per-term groups are then `OP_AND`
 combined.
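
 The expected combination order can be sketched in plain Python as
 follows. This is an illustration only, not Xapian code: `cjk_ngrams` and
 `combine` are hypothetical helpers, the tokenization is simply modelled
 on the terms visible in the output above (each character plus the bigram
 starting at it), and the rendered string omits the position information.

```python
def cjk_ngrams(text):
    """Emit each character, followed by the bigram starting at it (if any)."""
    terms = []
    for i, ch in enumerate(text):
        terms.append(ch)
        if i + 1 < len(text):
            terms.append(text[i:i + 2])
    return terms

def combine(terms, prefixes):
    """OR each term across all prefixes, then AND the per-term groups."""
    per_term = [
        "(" + " OR ".join(prefix + term for prefix in prefixes) + ")"
        for term in terms
    ]
    return "(" + " AND ".join(per_term) + ")"

terms = cjk_ngrams("中文")
rendered = combine(terms, ["B", "S"])
print(terms)
print(rendered)
```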

 On the other hand, the query may also look like this (i.e.,
 `qstr1 = "b:中文 OR s:中文"` for the above example):
 {{{
 Xapian::Query(((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)) OR
                (S中:(pos=2) AND S中文:(pos=2) AND S文:(pos=2))))
 }}}
 which seems to be more intuitive and maybe more logical to me.
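
 The two candidate shapes are not logically equivalent, which the
 following pure-Python sketch tries to make concrete. It is a toy boolean
 model, not Xapian: `matches`, the tuple-based queries and `mixed_doc` are
 hypothetical names introduced for illustration.

```python
from itertools import combinations

def matches(query, doc_terms):
    """Return True if the set of terms in a document satisfies the query."""
    if isinstance(query, str):
        return query in doc_terms
    op, subqueries = query
    results = [matches(sub, doc_terms) for sub in subqueries]
    return all(results) if op == "AND" else any(results)

# First shape: OR across prefixes per term, then AND across terms.
per_term = ("AND", [
    ("OR", ["B中", "S中"]),
    ("OR", ["B中文", "S中文"]),
    ("OR", ["B文", "S文"]),
])

# Second shape: AND across terms per prefix, then OR across prefixes.
per_prefix = ("OR", [
    ("AND", ["B中", "B中文", "B文"]),
    ("AND", ["S中", "S中文", "S文"]),
])

# Enumerate every possible set of the six terms a document could contain.
all_terms = ["B中", "B中文", "B文", "S中", "S中文", "S文"]
subsets = [
    set(combo)
    for size in range(len(all_terms) + 1)
    for combo in combinations(all_terms, size)
]

# Every document matched by the second shape is also matched by the first.
per_prefix_implies_per_term = all(
    matches(per_term, doc) for doc in subsets if matches(per_prefix, doc)
)

# Hypothetical document with the terms spread over both fields: it is
# matched by the first shape only.
mixed_doc = {"B中", "S中文", "B文"}
print(per_prefix_implies_per_term)
print(matches(per_term, mixed_doc), matches(per_prefix, mixed_doc))
```

 In this model the second shape is the stricter one: it requires all
 terms to occur under the same prefix.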

 ----

 Environment:
 * Linux: Debian, testing, amd64
 * Xapian: `libxapian22v5`, version 1.2.23
 * `python-xapian`: version 1.2.23-1
 * environment variable: `XAPIAN_CJK_NGRAM=1`


 Best regards!

 Aly

--

Comment (by liweitianux):

 Explained the CJK query parsing issue more clearly.

--
Ticket URL: <https://trac.xapian.org/ticket/719#comment:1>
Xapian <//xapian.org/>