[Xapian-discuss] How to speed up indexing ?
mark
markkicks at gmail.com
Sat Aug 23 03:16:57 BST 2008
On Thu, Aug 21, 2008 at 2:26 AM, Charlie Hull <charlie at juggler.net> wrote:
> cel tix44 wrote:
>> I'm new to Xapian & need some help, many thanks if anyone replies.
>>
>> I did a release build from xapian-core-1.0.7 with VS2008 by using
>> Charlie Hull's makefiles.
>>
>> I'm trying to test-index my dataset -- some 200'000 docs, each
>> document being (on average) 50 bytes long and having 6 words.
>>
>> I tried (a) not to use stemmer, (b) commit_transaction() on every
>> 50/100/etc. docs, (c) not to use transactions at all -- but in all
>> scenarios indexing goes at ~10 doc/sec or 500 bytes per second.
>>
>> This should probably be ~400 times faster, I'm clearly doing something
>> wrong. Can anyone give me a hint or direct me to a source on the net
>> to do some reading?
>
> If you could let us know the platform you're using, and how you're
> accessing Xapian (which bindings for example, or directly using C/C++?),
> and even post the code you're using for your indexer, that would help
> hugely.
I have the exact same problem on x86_64 Fedora Core 9 Linux (16GB
RAM, dual quad-core), using the Python xappy library.
This is the code. It indexes only about 10 docs/second, and there
are about 3 million docs to be indexed.
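import re
import time

import xappy


def clean(text):
    # Replace punctuation with spaces; return an empty string for None/empty.
    if not text:
        return u''
    return re.sub(r'[~@\^+=<>/&%$!*.,:? ;()\'\"\\-]', ' ', text)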
def createdocument(t):
    # Build one unprocessed xappy document from a database row.
    doc = xappy.UnprocessedDocument()
    doc.fields.append(xappy.Field("id", t.id))
    doc.fields.append(xappy.Field("title", clean(t.title)))
    doc.fields.append(xappy.Field("description", clean(t.description)))
    doc.fields.append(xappy.Field("category_name", t.category_name))
    if t.added:
        doc.fields.append(xappy.Field("added", t.added.strftime('%Y%m%d')))
    doc.fields.append(xappy.Field("number_a", t.number_a))
    doc.fields.append(xappy.Field("number_b", t.number_b))
    doc.fields.append(xappy.Field("number_c", t.number_c))
    doc.fields.append(xappy.Field("number_d", t.number_d))
    doc.id = 'd_' + repr(t.id)
    # Progress report every 100,000 ids.
    if t.id % 100000 == 0:
        print t.id, time.ctime()
    return doc
def main():
    conn = xappy.IndexerConnection('db1')
    conn.add_field_action('id', xappy.FieldActions.STORE_CONTENT)
    conn.add_field_action('title', xappy.FieldActions.INDEX_FREETEXT,
                          language='en')
    conn.add_field_action('title', xappy.FieldActions.SORTABLE)
    conn.add_field_action('title', xappy.FieldActions.STORE_CONTENT)
    conn.add_field_action('description',
                          xappy.FieldActions.INDEX_FREETEXT, language='en')
    conn.add_field_action('description', xappy.FieldActions.STORE_CONTENT)
    conn.add_field_action('category_name', xappy.FieldActions.INDEX_EXACT)
    conn.add_field_action('category_name', xappy.FieldActions.SORTABLE)
    conn.add_field_action('category_name', xappy.FieldActions.STORE_CONTENT)
    conn.add_field_action('added', xappy.FieldActions.SORTABLE, type='date')
    conn.add_field_action('added', xappy.FieldActions.STORE_CONTENT)
    # The four numeric fields are all sortable floats with stored content.
    for name in ('number_a', 'number_b', 'number_c', 'number_d'):
        conn.add_field_action(name, xappy.FieldActions.SORTABLE, type='float')
        conn.add_field_action(name, xappy.FieldActions.STORE_CONTENT)
    # db is my database handle; its setup is not shown here.
    ds = db.query('select id, title, added, description, number_a, '
                  'number_b, number_c, number_d, category_name '
                  'from ds order by id')
    # A plain loop avoids building a throwaway list with one entry per doc.
    for d in ds:
        conn.replace(createdocument(d))
    conn.flush()
    conn.close()


if __name__ == '__main__':
    main()
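
At 10 docs/second, 3 million docs would take about 300,000 seconds, or
roughly three and a half days, so it seems worth finding out which stage
is slow before tuning anything. The following is only a minimal sketch
of one way to do that, not part of the script above: profile_indexing
and its parameters are hypothetical names, and the stand-in callables
in the usage example exist only to make it self-contained. It times
building each document and indexing it separately, and attributes the
remaining time to fetching rows from the iterator.

```python
import time


def profile_indexing(rows, build, index, clock=time.time):
    """Time build(row) and index(doc) separately for every row.

    Whatever time is left over after those two stages is reported as
    fetch_secs, i.e. time spent pulling rows out of the iterator.
    """
    start = clock()
    build_secs = 0.0
    index_secs = 0.0
    count = 0
    for row in rows:
        t0 = clock()
        doc = build(row)
        t1 = clock()
        index(doc)
        t2 = clock()
        build_secs += t1 - t0
        index_secs += t2 - t1
        count += 1
    total_secs = clock() - start
    return {
        "docs": count,
        "build_secs": build_secs,
        "index_secs": index_secs,
        "fetch_secs": total_secs - build_secs - index_secs,
        "total_secs": total_secs,
    }


if __name__ == "__main__":
    # Stand-in stages (assumptions for illustration only).
    stats = profile_indexing(range(1000), lambda row: str(row), lambda doc: None)
    print(stats)
```

With the script above, rows would be ds, build would be createdocument
and index would be conn.replace. If most of the time shows up in
fetch_secs, the database query is the bottleneck rather than the indexer.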