[Xapian-tickets] [Xapian] #636: get_docid() and multiple databases

Xapian nobody at xapian.org
Wed Apr 30 01:42:10 BST 2014


#636: get_docid() and multiple databases
----------------------+-----------------------------
 Reporter:  jeffrand  |             Owner:  olly
     Type:  defect    |            Status:  assigned
 Priority:  normal    |         Milestone:  1.3.2
Component:  Other     |           Version:  1.2.12
 Severity:  normal    |        Resolution:
 Keywords:            |        Blocked By:
 Blocking:            |  Operating System:  Linux
----------------------+-----------------------------
Changes (by olly):

 * status:  new => assigned
 * milestone:   => 1.3.2


New description:

 I'm using the Python bindings for Xapian 1.2.12 and I'm seeing unexpected
 behavior which I believe is a bug. While searching multiple databases I
 get inconsistent values from doc.get_docid() when using a KeyMaker
 subclass for custom sorting. The id value stored in each document's data
 is the same as the docid that was set for that document.

 The behavior is expected when searching only one database:
 {{{doc.get_docid() == int(json.loads(doc.get_data())['id'])}}} .

 When searching more than one database, doc.get_docid() returns a
 value that is not the same as {{{int(json.loads(doc.get_data())['id'])}}}.

 According to the docs:
 docid Xapian::Document::get_docid() const

 Get the document id which is associated with this document (if any).
 NB If multiple databases are being searched together, then this will be
 the document id in the individual database, not the merged database!

 Here's my sample code and some output:

 {{{
 #!python
 import xapian as x
 import simplejson as json

 db = x.Database()
 db.add_database(x.Database('/var/xapian/db1.db')) #has XTYPA

 q = x.Query('XTYPA')
 q = x.Query(x.Query.OP_OR, q, x.Query('XTYPB'))

 class WhatsTheId(x.KeyMaker):
     def __init__(self):
         super(WhatsTheId, self).__init__()
     def __call__(self, doc):
         my_doc_id = json.loads(doc.get_data())['id']
         if my_doc_id <= 10:
             # Print the docid Xapian reports, the id stored in the data, and the type.
             print doc.get_docid(), my_doc_id, json.loads(doc.get_data())['type']
         return x.sortable_serialise(1)

 e = x.Enquire(db)
 e.set_query(q)
 e.set_sort_by_key(WhatsTheId())
 e.get_mset(0, 1000000000, 0, None)
 }}}

 # Expected results

 2 2 A
 3 3 A
 4 4 A
 5 5 A
 6 6 A
 7 7 A
 8 8 A
 9 9 A
 10 10 A

 {{{
 #!python
 db.add_database(x.Database('/var/xapian/db2.db')) #has XTYPB

 e = x.Enquire(db)
 e.set_query(q)
 e.set_sort_by_key(WhatsTheId())
 r = e.get_mset(0, 1000000000, 0, None)
 }}}

 # After adding a second database: unexpected results

 3 2 A
 5 3 A
 7 4 A
 9 5 A
 11 6 A
 13 7 A
 15 8 A
 17 9 A
 19 10 A
 2 1 B
 4 2 B

 # The value returned by get_docid() changes again, in a consistent
 pattern, when more databases are added:

 {{{
 #!python
 q = x.Query(x.Query.OP_OR, q, x.Query('XTYPC'))
 db.add_database(x.Database('/var/xapian/db3.db')) #has XTYPC
 e = x.Enquire(db)
 e.set_query(q)
 e.set_sort_by_key(WhatsTheId())
 r = e.get_mset(0, 1000000000, 0, None)
 }}}

 4 2 A
 7 3 A
 10 4 A
 13 5 A
 16 6 A
 19 7 A
 22 8 A
 25 9 A
 28 10 A
 2 1 B
 5 2 B
 3 1 C
 6 2 C
 9 3 C
 12 4 C
 15 5 C
 18 6 C
 21 7 C
 24 8 C
 27 9 C
 30 10 C
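
 The numbers printed above are consistent with documents from the
 sub-databases being interleaved in the combined database, i.e. get_docid()
 appears to return the merged docid rather than the per-database one. The
 following is a minimal sketch of that arithmetic, inferred from the output
 in this ticket. The helper names merged_docid and split_docid are
 hypothetical and are not part of the Xapian API; shard_index is assumed to
 be the zero-based order in which each database was passed to add_database().

```python
def merged_docid(sub_docid, shard_index, n_shards):
    """Map a docid in one sub-database to the docid seen in the combined database.

    Assumption (inferred from the ticket's output): docids are interleaved,
    with shard_index counted from 0 in add_database() order.
    """
    return (sub_docid - 1) * n_shards + shard_index + 1


def split_docid(docid, n_shards):
    """Inverse mapping: return (shard_index, sub_docid) for a merged docid."""
    shard_index = (docid - 1) % n_shards
    sub_docid = (docid - 1) // n_shards + 1
    return shard_index, sub_docid


# Check the sketch against rows reported in the ticket.
# Two databases: "3 2 A", "19 10 A", "2 1 B", "4 2 B"
assert merged_docid(2, 0, 2) == 3
assert merged_docid(10, 0, 2) == 19
assert merged_docid(1, 1, 2) == 2
assert merged_docid(2, 1, 2) == 4

# Three databases: "4 2 A", "28 10 A", "5 2 B", "30 10 C"
assert merged_docid(2, 0, 3) == 4
assert merged_docid(10, 0, 3) == 28
assert merged_docid(2, 1, 3) == 5
assert merged_docid(10, 2, 3) == 30

# Recover the per-database docid from a merged one.
assert split_docid(30, 3) == (2, 10)
```

 If this reading is right, code that needs the per-database id inside a
 KeyMaker could apply something like split_docid() to the reported value,
 given the number of databases that were added.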

--

Comment:

 Also discussed on the mailing list:
 http://thread.gmane.org/gmane.comp.search.xapian.general/9612

--
Ticket URL: <http://trac.xapian.org/ticket/636#comment:1>
Xapian <http://xapian.org/>