[Xapian-devel] Lucene 3.6.2 backend for xapian (#25)
jiangwen127 at gmail.com
Thu Oct 31 02:24:26 GMT 2013
*I think that perhaps the best way to map this into Xapian is for each
Lucene "segment" to be handled as a database in Xapian, and use the
multi-database support to search them together*
Yes, there were two choices at the beginning:
1. Use multi-database.
2. Treat the Lucene database as a single database.
In the end I chose 2. That was a long time ago and I am not quite sure why
I made this decision. Maybe:
1. We can handle multiple Lucene databases.
2. I was not sure whether multi-database could meet the requirements, such as
getting the doc_freq (how many documents contain the term) of a particular
term. What I actually want is the sum of the doc_freq of a particular term
across all Lucene segments, and I am not sure Xapian's multi-database
support does it this way.
Do you think multi-database is a better way to handle a Lucene database?
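To make the doc_freq requirement concrete, here is a minimal sketch of the
sum I have in mind. It is not Xapian or Lucene code: SegmentDocFreqs and
total_doc_freq are hypothetical names, each segment is modelled as a plain
map from term to per-segment document count, and deleted documents are
ignored.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one Lucene segment: maps a term to the number
// of documents in that segment which contain it.
using SegmentDocFreqs = std::map<std::string, std::uint32_t>;

// Sum the per-segment document frequencies of `term`.
// Assumption: every document lives in exactly one segment, so the
// per-segment counts can be added without double counting.
std::uint64_t total_doc_freq(const std::vector<SegmentDocFreqs>& segments,
                             const std::string& term) {
    std::uint64_t total = 0;
    for (const auto& segment : segments) {
        auto it = segment.find(term);
        if (it != segment.end()) {
            total += it->second;  // term is present in this segment
        }
    }
    return total;
}
```

Whichever layer ends up combining the segments would need to return this
total for a term, and 0 for a term that appears in no segment.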
*But we already handle merging allterms lists for multiple databases.*
If term lists are merged, I think it is the most appropriate way to solve
2013/10/31 Olly Betts <olly at survex.com>
> [Replying to xapian-devel, as I think a wider audience would be useful]
> On Mon, Oct 21, 2013 at 11:24:51PM +0800, jiangwen jiang wrote:
> > yes, it's less efficient. A Lucene database has multiple segments, and each
> > segment can be treated as an independent database. The same term may exist
> > in more than one segment.
> Sorry for taking a while to respond - I've been both busy and mulling
> this over.
> I think that perhaps the best way to map this into Xapian is for each
> Lucene "segment" to be handled as a database in Xapian, and use the
> multi-database support to search them together.
> That's likely to need some adjustments to the multi-database support,
> but I think otherwise we'll end up duplicating a lot of that machinery
> in the Lucene backend anyway.
> I've not looked at the Lucene file structure with this in mind yet
> though - do you see any obvious problems with this approach?
> > Xapian::TermIterator it = db_in.allterms_begin();
> > This method traverses all terms in the first segment, then the second
> > segment, and so on until the last segment.
> Iteration over all terms should return the terms in sorted order (by
> byte value) and without duplicates, neither of which is achieved by
> handling each segment in turn like this. But we already handle merging
> allterms lists for multiple databases.
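For illustration, the ordering requirement described above (sorted by byte
value, no duplicates) can be met with a k-way merge over the per-segment
term lists. This is only a sketch and not Xapian's actual multi-database
code: merge_all_terms is a hypothetical name, terms are plain std::string
values, and each segment's list is assumed to be sorted by byte value
already.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Merge per-segment term lists into one list in ascending byte order with
// duplicates removed. std::string comparison orders by unsigned byte value.
std::vector<std::string> merge_all_terms(
        const std::vector<std::vector<std::string>>& segments) {
    // Heap entry: (next unread term, index of the segment it came from).
    using Entry = std::pair<std::string, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> pos(segments.size(), 0);

    for (std::size_t i = 0; i < segments.size(); ++i) {
        // Precondition (assumption): each segment's list is sorted.
        assert(std::is_sorted(segments[i].begin(), segments[i].end()));
        if (!segments[i].empty()) {
            heap.emplace(segments[i][0], i);
        }
    }

    std::vector<std::string> merged;
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        // Emit the term only if it differs from the last one emitted.
        if (merged.empty() || merged.back() != top.first) {
            merged.push_back(top.first);
        }
        // Advance the segment this term came from.
        std::size_t seg = top.second;
        if (++pos[seg] < segments[seg].size()) {
            heap.emplace(segments[seg][pos[seg]], seg);
        }
    }
    return merged;
}
```

Walking the segments one after another, by contrast, would repeat any term
that occurs in more than one segment and would restart the ordering at each
segment boundary.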