[Xapian-tickets] [Xapian] #719: Tokenized CJK query terms wrongly combined with respect to prefixes
Xapian
nobody at xapian.org
Wed May 4 13:04:13 BST 2016
#719: Tokenized CJK query terms wrongly combined with respect to prefixes
--------------------------------+-------------------------
Reporter: liweitianux | Owner: olly
Type: defect | Status: new
Priority: normal | Milestone:
Component: QueryParser | Version: 1.2.23
Severity: normal | Keywords: CJK, prefix
Blocked By: | Blocking:
Operating System: Linux |
--------------------------------+-------------------------
I first ran into this issue when searching CJK text with `mu`
(https://github.com/djcb/mu) and reported it there
(https://github.com/djcb/mu/issues/123#issuecomment-180999233). However,
after further investigation into both `mu` and `xapian`, I believe
it is a bug in `xapian`.
----
Here I demonstrate this issue with `python-xapian`:
{{{
#!python
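# -*- coding: utf-8 -*-
# Run with XAPIAN_CJK_NGRAM=1 set in the environment.
import xapian

qp = xapian.QueryParser()
# Explicit field prefixes.
qp.add_prefix("subject", "S")
qp.add_prefix("s", "S")
qp.add_prefix("body", "B")
qp.add_prefix("b", "B")
# Unprefixed terms search both body and subject.
qp.add_prefix("", "B")
qp.add_prefix("", "S")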
qstr1 = "中文"
qstr2 = "b:中文"
qstr3 = "hello AND world"
q1 = qp.parse_query(qstr1)
q2 = qp.parse_query(qstr2)
q3 = qp.parse_query(qstr3)
print(q1)
# Xapian::Query((B中:(pos=1) AND S中:(pos=1) AND
# B中文:(pos=1) AND S中文:(pos=1) AND
# B文:(pos=1) AND S文:(pos=1)))
print(q2)
# Xapian::Query((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)))
print(q3)
# Xapian::Query(((Bhello:(pos=1) OR Shello:(pos=1)) AND
# (Bworld:(pos=2) OR Sworld:(pos=2))))
}}}
The parsed queries for `qstr2` and `qstr3` are correct. However, the parsed
query `q1` for `qstr1` (the CJK query string **without a prefix**) is
wrong: the terms generated for the different **prefixes** are combined with
`OP_AND` instead of `OP_OR`, so a document matches only if every n-gram
occurs in both the body and the subject. This is what causes the CJK search
problem in `mu`, which gives me wrong or empty results.
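To make the consequence concrete, here is a minimal sketch in plain Python that does not use Xapian at all. It treats a document as the set of prefixed terms it was indexed with and checks the flat `OP_AND` query shown for `q1` against it. The names `flat_and_terms`, `matches_flat_and`, and the two sample documents are hypothetical and exist only for illustration.

```python
# Toy model only: a document is the set of prefixed terms it was indexed with.
# Term names mirror the ticket (B = body, S = subject).
TOKENS = ["中", "中文", "文"]   # n-grams generated for the query "中文"
PREFIXES = ["B", "S"]           # both are mapped to the empty query prefix


def flat_and_terms(tokens, prefixes):
    """Terms of the query as currently parsed: every prefix/token pair."""
    return {prefix + token for token in tokens for prefix in prefixes}


def matches_flat_and(doc_terms, tokens, prefixes):
    """A flat AND matches only if the document holds every one of the terms."""
    return flat_and_terms(tokens, prefixes) <= doc_terms


# Hypothetical message: "中文" appears in the body, the subject is English.
body_only_doc = {"B中", "B中文", "B文", "Shello"}
# Hypothetical message: "中文" appears in both the body and the subject.
both_fields_doc = {"B中", "B中文", "B文", "S中", "S中文", "S文"}

print(matches_flat_and(body_only_doc, TOKENS, PREFIXES))    # False
print(matches_flat_and(both_fields_doc, TOKENS, PREFIXES))  # True
```

In this model a message that mentions 中文 only in its body is never returned, which is consistent with the wrong or empty results seen in `mu`.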
----
I would **expect** the parsed query for `qstr1` to look like this:
{{{
Xapian::Query(((B中:(pos=1) OR S中:(pos=1)) AND
(B中文:(pos=1) OR S中文:(pos=1)) AND
(B文:(pos=1) OR S文:(pos=1))))
}}}
That is, each tokenized CJK term is first combined across the **prefixes**
with `OP_OR`, and the resulting groups are then combined across the
tokenized CJK terms with `OP_AND`.
Alternatively, the query could look like this (i.e., as if `qstr1` were
`"b:中文 OR s:中文"` in the above example):
{{{
Xapian::Query(((B中:(pos=1) AND B中文:(pos=1) AND B文:(pos=1)) OR
(S中:(pos=2) AND S中文:(pos=2) AND S文:(pos=2))))
}}}
which seems more intuitive, and perhaps more logical, to me.
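The two proposed shapes can be sketched in plain Python as nested `(operator, children)` tuples. This is only a model of the query structure, not Xapian's implementation or API. The helpers `or_across_prefixes`, `and_within_prefix`, `matches`, and `render`, as well as the sample document, are hypothetical names introduced for illustration, and positions (`pos=...`) are left out.

```python
TOKENS = ["中", "中文", "文"]   # n-grams generated for the query "中文"
PREFIXES = ["B", "S"]           # both are mapped to the empty query prefix


def or_across_prefixes(tokens, prefixes):
    """First form: (B t1 OR S t1) AND (B t2 OR S t2) AND ..."""
    return ("AND", [("OR", [p + t for p in prefixes]) for t in tokens])


def and_within_prefix(tokens, prefixes):
    """Second form: (B t1 AND B t2 AND ...) OR (S t1 AND S t2 AND ...)"""
    return ("OR", [("AND", [p + t for t in tokens]) for p in prefixes])


def matches(query, doc_terms):
    """Evaluate a nested query against a document's set of terms."""
    if isinstance(query, str):
        return query in doc_terms
    op, children = query
    results = [matches(child, doc_terms) for child in children]
    return all(results) if op == "AND" else any(results)


def render(query):
    """Render a nested query in a notation close to the ticket's output."""
    if isinstance(query, str):
        return query
    op, children = query
    return "(" + (" %s " % op).join(render(child) for child in children) + ")"


# Hypothetical message: "中文" appears in the body, the subject is English.
body_only_doc = {"B中", "B中文", "B文", "Shello"}

for build in (or_across_prefixes, and_within_prefix):
    query = build(TOKENS, PREFIXES)
    print(render(query), matches(query, body_only_doc))
# ((B中 OR S中) AND (B中文 OR S中文) AND (B文 OR S文)) True
# ((B中 AND B中文 AND B文) OR (S中 AND S中文 AND S文)) True
```

In this toy model both forms match a message that contains 中文 only in its body. With the real bindings, similar shapes could presumably be built by hand by passing `xapian.Query.OP_OR` or `xapian.Query.OP_AND` and a list of subqueries to the `xapian.Query` constructor, although that is not tested here.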
----
Environment:
* Linux: Debian, testing, amd64
* Xapian: `libxapian22v5`, version 1.2.23
* `python-xapian`: version 1.2.23-1
* environment variable: `XAPIAN_CJK_NGRAM=1`
Best regards!
Aly
--
Ticket URL: <https://trac.xapian.org/ticket/719>
Xapian <//xapian.org/>