How to make xapian run in hadoop

Fri Nov 22 05:45:23 GMT 2019

On Thu, Nov 21, 2019 at 10:20:19AM +0800, 程苏珺 wrote:
> We use xapian as the backend of our system. Now the data need be
> indexed ever-increasing, and the local mode is hard to maintain, so we
> plan to move the index builder to hadoop. We try to make xapian can be
> run in hadoop, and now met a problem that there are many seek
> operations when xapian writes the index files, but the method seek()
> in hadoop c api only support read, and we blocked by that now

Updating a glass backend database fundamentally requires a way to
"write block N".  We don't actually require the ability to seek
arbitrarily, but if Hadoop writes are limited to appending to a file,
your approach is just not going to work for updating.
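
To make "write block N" concrete, here is a minimal Python sketch
against an ordinary local POSIX file.  This is not Xapian's code: the
block size and the helper names (write_block, read_block, demo) are
assumptions made up for illustration.  It only shows the kind of
positional overwrite an append-only file API cannot provide.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # illustrative value, an assumption for this sketch


def write_block(fd, n, data):
    """Overwrite block n in place using a positional write."""
    if len(data) != BLOCK_SIZE:
        raise ValueError("data must be exactly one block")
    os.pwrite(fd, data, n * BLOCK_SIZE)


def read_block(fd, n):
    """Read block n back using a positional read."""
    return os.pread(fd, BLOCK_SIZE, n * BLOCK_SIZE)


def demo():
    fd, path = tempfile.mkstemp()
    try:
        # Initial creation: blocks 0, 1, 2 happen to be written in order.
        for n in range(3):
            write_block(fd, n, bytes([n]) * BLOCK_SIZE)
        # An update rewrites an earlier block; the file does not grow.
        # An append-only file has no equivalent of this step.
        write_block(fd, 1, b"\xff" * BLOCK_SIZE)
        first_bytes = [read_block(fd, n)[0] for n in range(3)]
        return first_bytes, os.fstat(fd).st_size
    finally:
        os.close(fd)
        os.unlink(path)
```

Note that no arbitrary seek-then-stream is needed here, only "put
these bytes at block offset N".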

It might be possible to buffer up everything in RAM and then write out a
glass database in one go with such a limitation, but if you're having
scaling problems then forcing a situation where the whole database needs
to be created in RAM before it can be written is not going to help.
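
As a rough sketch of that buffering idea (again Python, with a
hypothetical BufferedBlockFile class and an assumed block size, not
anything Xapian provides): block writes are collected in memory and
only emitted at the end as one sequential stream, which is all an
append-only sink needs.  The memory cost is the size of the whole
file, which is exactly the scaling problem described above.

```python
import io

BLOCK_SIZE = 8192  # illustrative value, an assumption for this sketch


class BufferedBlockFile:
    """Hypothetical helper: holds every block in RAM until flushed."""

    def __init__(self):
        self._blocks = {}  # block number -> bytes

    def write_block(self, n, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        # Rewriting a block is just a dictionary update while in memory.
        self._blocks[n] = data

    def flush_to(self, out):
        """Emit all blocks in order using only sequential writes.

        Returns the number of bytes written.  Blocks never written
        are emitted as zero-filled blocks.
        """
        if not self._blocks:
            return 0
        written = 0
        for n in range(max(self._blocks) + 1):
            block = self._blocks.get(n, b"\0" * BLOCK_SIZE)
            out.write(block)
            written += len(block)
        return written
```

Here `out` can be any object with a write() method, for example an
io.BytesIO in a test or a stream wrapping an append-only file.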

> It looks a big work to rewrite the xapian database backend to
> adapter the hadoop c api. Could you please give us some suggestions?

The in-development backend (honey) would probably be easier to get
to work here once finished, but currently it doesn't support
writing directly, so that's no help if you want a solution now.

Perhaps you could elaborate on the problem you're actually trying
to solve here.

What does "the local mode is hard to maintain" actually mean?

Cheers,
    Olly