[Xapian-discuss] How to speed up indexing ?

mark markkicks at gmail.com
Sat Aug 23 03:16:57 BST 2008


On Thu, Aug 21, 2008 at 2:26 AM, Charlie Hull <charlie at juggler.net> wrote:
> cel tix44 wrote:
>> I'm new to Xapian & need some help, many thanks if anyone replies.
>>
>> I did a release build from xapian-core-1.0.7 with VS2008 by using
>> Charlie Hull's makefiles.
>>
>> I'm trying to test-index my dataset -- some 200'000 docs, each
>> document being (on average) 50 bytes long and having 6 words.
>>
>> I tried (a) not to use stemmer, (b) commit_transaction() on every
>> 50/100/etc. docs, (c) not to use transactions at all -- but in all
>> scenarios indexing goes at ~10 doc/sec or 500 bytes per second.
>>
>> This should probably be ~400 times faster, I'm clearly doing something
>> wrong. Can anyone give me a hint or direct me to a source on the net
>> to do some reading?
>
> If you could let us know the platform you're using, and how you're
> accessing Xapian (which bindings for example, or directly using C/C++?),
> and even post the code you're using for your indexer, that would help
> hugely.

I have exactly the same problem on x86_64 Fedora Core 9 Linux (16GB
RAM, dual quad-core), using the Python xappy library.
This is the code. It indexes only about 10 docs/second, and there
are about 3 million docs to be indexed.

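import re
import time

import xappy

# NOTE: `db` (used in main) is a database handle that is not shown in this post.

def clean(text):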
    if not text:
        return u''
    return re.sub(r'[~@\^+=<>/&%$!*.,:? ;()\'\"\\-]', ' ', text)

def createdocument(t):
    doc = xappy.UnprocessedDocument()
    doc.fields.append(xappy.Field("id", t.id))
    doc.fields.append(xappy.Field("title", clean(t.title)))
    doc.fields.append(xappy.Field("description", clean(t.description)))
    doc.fields.append(xappy.Field("category_name", t.category_name))
    if t.added:
        doc.fields.append(xappy.Field("added", t.added.strftime('%Y%m%d')))
    doc.fields.append(xappy.Field("number_a", t.number_a))
    doc.fields.append(xappy.Field("number_b", t.number_b))
    doc.fields.append(xappy.Field("number_c", t.number_c))
    doc.fields.append(xappy.Field("number_d", t.number_d))
    doc.id = 'd_' + repr(t.id)

    # Progress marker every 100000 ids.
    if t.id % 100000 == 0:
        print t.id, time.ctime()
    return doc

def main():
    conn = xappy.IndexerConnection('db1')
    conn.add_field_action('id', xappy.FieldActions.STORE_CONTENT)
    conn.add_field_action('title', xappy.FieldActions.INDEX_FREETEXT,
                          language='en')
    conn.add_field_action('title', xappy.FieldActions.SORTABLE)
    conn.add_field_action('title', xappy.FieldActions.STORE_CONTENT)

    conn.add_field_action('description', xappy.FieldActions.INDEX_FREETEXT,
                          language='en')
    conn.add_field_action('description', xappy.FieldActions.STORE_CONTENT)

    conn.add_field_action('category_name', xappy.FieldActions.INDEX_EXACT)
    conn.add_field_action('category_name', xappy.FieldActions.SORTABLE)
    conn.add_field_action('category_name', xappy.FieldActions.STORE_CONTENT)

    conn.add_field_action('added', xappy.FieldActions.SORTABLE, type='date')
    conn.add_field_action('added', xappy.FieldActions.STORE_CONTENT)

    conn.add_field_action('number_a', xappy.FieldActions.SORTABLE, type='float')
    conn.add_field_action('number_a', xappy.FieldActions.STORE_CONTENT)

    conn.add_field_action('number_b', xappy.FieldActions.SORTABLE, type='float')
    conn.add_field_action('number_b', xappy.FieldActions.STORE_CONTENT)

    conn.add_field_action('number_c', xappy.FieldActions.SORTABLE, type='float')
    conn.add_field_action('number_c', xappy.FieldActions.STORE_CONTENT)

    conn.add_field_action('number_d', xappy.FieldActions.SORTABLE, type='float')
    conn.add_field_action('number_d', xappy.FieldActions.STORE_CONTENT)

    ds = db.query('select id, title, added, description, number_a, '
                  'number_b, number_c, number_d, category_name '
                  'from ds order by id')

    # Plain loop: avoids building a throwaway list with one entry per document.
    for d in ds:
        conn.replace(createdocument(d))
    conn.flush()
    conn.close()

if __name__ == '__main__':
    main()
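
To narrow down where the time goes (the database query, createdocument(),
or conn.replace()), a small timing wrapper along these lines could be put
around the loop. This is only a sketch: timed, rate_per_second and
estimate_hours are hypothetical helper names, not part of xappy or Xapian,
and the batch size of 1000 is an arbitrary choice. For scale, at a steady
10 docs/second, 3 million docs would take roughly 83 hours.

```python
import time


def rate_per_second(count, seconds):
    """Items per second; 0.0 if no measurable time has elapsed."""
    return count / float(seconds) if seconds > 0 else 0.0


def estimate_hours(total_docs, docs_per_second):
    """Hours needed to process total_docs at a steady docs_per_second."""
    return total_docs / float(docs_per_second) / 3600.0


def timed(items, every=1000, clock=time.time, log=None):
    """Yield items unchanged; after every `every` items have been consumed,
    append (items_so_far, overall_items_per_second) to `log` if given.

    `clock` is injectable so the helper can be checked without sleeping.
    """
    start = clock()
    count = 0
    for item in items:
        yield item          # the caller processes the item here
        count += 1
        if count % every == 0 and log is not None:
            log.append((count, rate_per_second(count, clock() - start)))
```

With the names from the code above, the loop would become
"for d in timed(ds, every=1000, log=samples)", where samples is an empty
list that is inspected afterwards. Wrapping only the query iteration, or
only createdocument(), in the same way would show which stage is slow.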


