[Xapian-tickets] [Xapian] #719: Tokenized CJK query terms wrongly combined with respect to prefixes
Xapian
nobody at xapian.org
Wed May 4 13:29:52 BST 2016
#719: Tokenized CJK query terms wrongly combined with respect to prefixes
-------------------------+---------------------------
Reporter: liweitianux | Owner: olly
Type: defect | Status: new
Priority: normal | Milestone:
Component: QueryParser | Version: 1.2.23
Severity: normal | Resolution:
Keywords: CJK, prefix | Blocked By:
Blocking: | Operating System: Linux
-------------------------+---------------------------
Old description:
> I first came across this issue when querying CJK with `mu`
> (https://github.com/djcb/mu) and reported the issue there
> (https://github.com/djcb/mu/issues/123#issuecomment-180999233). However,
> after some further investigations into `mu` and `xapian` recently, I find
> it is a bug in `xapian`.
>
> ----
>
> Here I demonstrate this issue with `python-xapian`:
> {{{
> #!python
> qp = xapian.QueryParser()
>
> qp.add_prefix("subject", "S")
> qp.add_prefix("s", "S")
> qp.add_prefix("body", "B")
> qp.add_prefix("b", "B")
> qp.add_prefix("", "B")
> qp.add_prefix("", "S")
>
> qstr1 = "中文"
> qstr2 = "b:中文"
> qstr3 = "hello AND world"
>
> q1 = qp.parse_query(qstr1)
> q2 = qp.parse_query(qstr2)
> q3 = qp.parse_query(qstr3)
>
> print(q1)
> # Xapian::Query((B中:(pos=1) AND S中:(pos=1) AND
> # B中文:(pos=1) AND S中文:(pos=1) AND
> # B文:(pos=1) AND S文:(pos=1)))
>
> print(q2)
> # Xapian::Query((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)))
>
> print(q3)
> # Xapian::Query(((Bhello:(pos=1) OR Shello:(pos=1)) AND
> # (Bworld:(pos=2) OR Sworld:(pos=2))))
> }}}
>
> The parsed queries for `qstr2` and `qstr3` are right, while the parsed
> query `q1` for (the CJK query string **without a prefix**) `qstr1` is
> **wrongly combined** with `OP_AND` with respect to the **prefixes**.
> Therefore, I have the CJK search problem in `mu` which gives me wrong or
> empty results.
>
> ----
>
> The **expected** parsed query for `qstr1` should look like this:
> {{{
> Xapian::Query(((B中:(pos=1) OR S中:(pos=1)) AND
> (B中文:(pos=1) OR S中文:(pos=1)) AND
> (B文:(pos=1) OR S文:(pos=1))))
> }}}
> where the **same** tokenized CJK term should be `OP_OR` combined with
> respect to the **prefixes**, and then be `OP_AND` combined with respect
> to each tokenized CJK term.
>
> On the other hand, the query may also look like this (i.e., `qstr1 =
> "b:中文 OR s:中文"` for the above example):
> {{{
> Xapian::Query(((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)) OR
> (S中:(pos=2) AND S中文:(pos=2) AND S文:(pos=2))))
> }}}
> which seems to be more intuitive and maybe more logical to me.
>
> ----
>
> Environment:
> * Linux: Debian, testing, amd64
> * Xapian: `libxapian22v5`, version 1.2.23
> * `python-xapian`: version 1.2.23-1
> * environment variable: `XAPIAN_CJK_NGRAM=1`
>
> Best regards!
>
> Aly
New description:
I first came across this issue when querying CJK text with `mu`
(https://github.com/djcb/mu) and reported it there
(https://github.com/djcb/mu/issues/123#issuecomment-180999233). However,
after further investigation into `mu` and `xapian`, I found that it is a
bug in `xapian`.
----
Here I demonstrate this issue with `python-xapian`:
{{{
#!python
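import xapian

# Run with XAPIAN_CJK_NGRAM=1 set in the environment (see Environment below).
qp = xapian.QueryParser()

# Named prefixes for the subject and body fields.
qp.add_prefix("subject", "S")
qp.add_prefix("s", "S")
qp.add_prefix("body", "B")
qp.add_prefix("b", "B")
# Unprefixed terms should search both the body and the subject.
qp.add_prefix("", "B")
qp.add_prefix("", "S")

qstr1 = "中文"
qstr2 = "b:中文"
qstr3 = "hello AND world"

q1 = qp.parse_query(qstr1)
q2 = qp.parse_query(qstr2)
q3 = qp.parse_query(qstr3)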
print(q1)
# Xapian::Query((B中:(pos=1) AND S中:(pos=1) AND
# B中文:(pos=1) AND S中文:(pos=1) AND
# B文:(pos=1) AND S文:(pos=1)))
print(q2)
# Xapian::Query((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)))
print(q3)
# Xapian::Query(((Bhello:(pos=1) OR Shello:(pos=1)) AND
# (Bworld:(pos=2) OR Sworld:(pos=2))))
}}}
The parsed queries for `qstr2` and `qstr3` are correct, but the parsed
query `q1` for `qstr1` (the CJK query string **without a prefix**) is
**wrongly combined** with `OP_AND` across the **prefixes**.
As the output shows, the **same** tokenized CJK term (e.g., `中`) is
combined with `OP_AND` across the prefixes (`B` and `S` here), whereas it
should be combined with `OP_OR`.
As a result, CJK searches in `mu` return wrong or empty results.
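To illustrate why the `OP_AND` combination leads to empty results, here is a minimal toy model in plain Python. It does not use Xapian at all: a document is simply assumed to be the set of prefixed terms it was indexed with, and the names `body_only_doc`, `observed` and `expected` are made up for this sketch.

```python
# Toy model, not Xapian: a document is the set of prefixed terms it contains.
# Here "中文" occurs in the body (prefix B) but not in the subject (prefix S).
body_only_doc = {"B中", "B中文", "B文"}

terms = ["中", "中文", "文"]
prefixes = ["B", "S"]

# Observed q1: every prefixed variant of every term is required (all AND).
observed = all(p + t in body_only_doc for t in terms for p in prefixes)

# Expected q1: each term must occur under at least one prefix (OR inside AND).
expected = all(any(p + t in body_only_doc for p in prefixes) for t in terms)

print(observed)  # False
print(expected)  # True
```

Under this model, a message matches the observed query only if the text occurs in both the body and the subject, which would explain the missing results.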
----
The **expected** parsed query for `qstr1` should look like this:
{{{
Xapian::Query(((B中:(pos=1) OR S中:(pos=1)) AND
(B中文:(pos=1) OR S中文:(pos=1)) AND
(B文:(pos=1) OR S文:(pos=1))))
}}}
where the **same** tokenized CJK term is combined with `OP_OR` across
the **prefixes**, and the resulting groups are then combined with `OP_AND`
across the tokenized CJK terms.
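The combination order described here can be sketched in plain Python as follows. This only builds a readable string, not a real `Xapian::Query`; the helper name `expected_query` is hypothetical, and the term list is taken from the parser output shown earlier rather than produced by an actual tokenizer.

```python
def expected_query(terms, prefixes):
    """OR each term across all prefixes, then AND the per-term groups."""
    groups = []
    for term in terms:
        # One OR group per tokenized term, covering every prefix.
        alternatives = " OR ".join(prefix + term for prefix in prefixes)
        groups.append("(" + alternatives + ")")
    # All per-term groups are required together.
    return " AND ".join(groups)


print(expected_query(["中", "中文", "文"], ["B", "S"]))
# (B中 OR S中) AND (B中文 OR S中文) AND (B文 OR S文)
```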
Alternatively, the query could look like this (i.e., `qstr1 =
"b:中文 OR s:中文"` for the above example):
{{{
Xapian::Query(((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)) OR
(S中:(pos=2) AND S中文:(pos=2) AND S文:(pos=2))))
}}}
which seems more intuitive, and perhaps more logical, to me.
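Note that the two proposed forms are not logically equivalent. The following toy model in plain Python (again not Xapian, with a document assumed to be a set of prefixed terms, and with hypothetical helper names) shows a case where they differ: the per-prefix form requires all tokens to occur under the same prefix.

```python
terms = ["中", "中文", "文"]
prefixes = ["B", "S"]


def per_term_form(doc):
    """(B中 OR S中) AND (B中文 OR S中文) AND (B文 OR S文)"""
    return all(any(p + t in doc for p in prefixes) for t in terms)


def per_prefix_form(doc):
    """(B中 AND B中文 AND B文) OR (S中 AND S中文 AND S文)"""
    return any(all(p + t in doc for t in terms) for p in prefixes)


# Artificial document whose tokens are spread over both fields.
mixed_doc = {"B中", "S中文", "S文"}
print(per_term_form(mixed_doc))    # True
print(per_prefix_form(mixed_doc))  # False

# A document with all tokens in the body matches both forms.
body_doc = {"B中", "B中文", "B文"}
print(per_term_form(body_doc), per_prefix_form(body_doc))  # True True
```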
----
Environment:
* Linux: Debian, testing, amd64
* Xapian: `libxapian22v5`, version 1.2.23
* `python-xapian`: version 1.2.23-1
* environment variable: `XAPIAN_CJK_NGRAM=1`
Best regards!
Aly
--
Comment (by liweitianux):
Explained the CJK query parsing issue more clearly.
--
Ticket URL: <https://trac.xapian.org/ticket/719#comment:1>
Xapian <//xapian.org/>