[Xapian-discuss] patch proposal: omindex library or daemon

Mon Oct 24 09:55:21 BST 2011

On 24 October 2011 07:30, Liam <xapian at networkimprov.net> wrote:
> Ah, right; still coming up to speed here. I suppose you could wrap it in a
> MimeDocument subclass? But a pure helper function would be fine.
>
> The omindex routines currently depend on a global Database variable, I
> believe? Maybe that's not an issue for the lib tho...

I think maybe you'd want an interface similar to TermGenerator:
http://xapian.org/docs/apidoc/html/classXapian_1_1TermGenerator.html

Specifically, the things that I think you'd want to copy from TermGenerator are:

 - ability to set the document that the TermGenerator is going to
write data to. This means you can put some data into the document
yourself, and then use the indexing routines to add more data.

 - ability to get the document that the TermGenerator has written data
to; this avoids you having to pass the document around along with the
TermGenerator in some situations.

 - ability to set the database that the TermGenerator should use when
writing spelling data.  The only reason you can set a database for
the TermGenerator is to store spelling information about words seen.

(Actually, I think there should really be some kind of class hierarchy
based around TermGenerator, to implement lots of different indexing
strategies, but that's an unnecessary distraction right now.)

This would be a reasonable start; ideally you'd want to be able to
control how each field extracted from the document is indexed (i.e.,
some format parsers can spit out titles separately from body text,
etc).  For more flexibility, I'd quite like a library which had a very
simple interface, something like:

/** Parse the file at filepath, returning a set of data found in fields
 * in the file.
 *
 * Should always produce a "body" field; other fields produced would
 * depend on the document and the abilities of the parser.
 *
 * @param fields A map from fieldname to field contents, used to
 *               return the result of parsing.
 */
void parse(const std::string & filepath,
           std::map<std::string, std::string> & fields);

i.e., something which doesn't do anything apart from get data,
potentially in multiple fields, from the file.  This would allow you
to index the extracted data in any way you desired, which seems like
looser coupling.  It would also produce a library which wouldn't depend
on any of Xapian core, which could be more generally useful.
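
To make the shape of that interface concrete, here is a minimal sketch
of one possible parser behind it, for plain-text files only.  This is
an illustration rather than anything taken from omindex: the decision
to throw std::runtime_error on an unreadable file is an assumption,
and a real library would dispatch on file type to parsers which can
also fill in fields such as a title.

```cpp
#include <fstream>
#include <iterator>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical plain-text implementation of the proposed parse().
// Plain text carries no separate title or author, so the only field
// it can produce is the mandatory "body".
void parse(const std::string & filepath,
           std::map<std::string, std::string> & fields)
{
    std::ifstream in(filepath.c_str(), std::ios::binary);
    if (!in) {
        // Assumption: report failure by throwing; the proposal itself
        // does not say how errors should be signalled.
        throw std::runtime_error("cannot open " + filepath);
    }
    // Read the whole file into the "body" field.
    fields["body"].assign(std::istreambuf_iterator<char>(in),
                          std::istreambuf_iterator<char>());
}
```

Note that nothing here includes a Xapian header.  The caller gets back
a plain map and can then decide, field by field, what to do with the
contents, for instance passing each one to a TermGenerator with
whatever prefix it likes.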

-- 
Richard