[Xapian-discuss] patch proposal: omindex library or daemon

Mon Oct 24 13:10:52 BST 2011

On Mon, Oct 24, 2011 at 4:22 AM, Richard Boulton <richard at tartarus.org> wrote:

> On 24 October 2011 11:39, Liam <xapian at networkimprov.net> wrote:
> > Yes, there we go. Also needs arguments for parse options and (optional)
> > mime-type.
>
> True; the function I suggested would probably be better as a method of
> a DocumentParser class (or some better name), which allowed settings
> like the mime-mappings to be supplied, and could also keep some state
> (eg, I think ifilter type stuff on windows works best if you maintain
> a persistent connection to the filters - my memory may be inaccurate,
> but it seems likely that some filters could benefit from some
> persistent state being kept and reused for subsequent parse
> operations, so the API should allow that to be implemented).
>
> > There's a second routine which does the default Document ops for values &
> > data:
> >
> >  void Document::set_values_and_data(const std::map<std::string,
> > std::string>& fields, const std::vector<std::string>& omit_fields =
> > std::vector<std::string>());
> >  // omit_fields is a list of field names to omit from Document values
> >  // might live in class MimeDocument : public Document
>
> I'm not so convinced by this; and it's certainly not something that I
> think is needed to make a useful library around omindex.  Given the
> text data from the fields, it's very easy to use TermGenerator to
> index the content, or to call your own routines.
>

Use TermGenerator? Wouldn't the user typically call Document::set_data()?
Forgive my inexperience...

> I'm not at all convinced there's a good case for having a MimeDocument
> class (or at least, not as a subclass of Document), but I'm also not
> sure what you're thinking its use is.  A DocumentFields class of some
> kind to help manage document data as suggested in ticket #53 could be
> a useful addition to core Xapian, but I don't think that's quite what
> you're thinking of (and anyway, our most recent thinking (from 4 years
> ago, ahem!) is that this might be best done as methods of
> Xapian::Document).
>
> To summarise; getting the data out of arbitrary documents in a set of
> fields seems like a good aim for a library.  Hardcoding some default
> indexing behaviours for that data seems like feature creep,
> particularly since I usually find myself wanting custom behaviours.
>

Well, omindex has this logic, and it seems generally useful, especially if
it pushes parsed meta-data into Document values.

Hm, my omit_fields argument should instead be a list of fields to either
include or exclude, selected by a flag:

void Document::set_values_and_data(
    const std::map<std::string, std::string>& fields,
    bool include, const std::vector<std::string>& list);
// include == true: store only the fields named in list as values
// include == false: store only the fields not named in list as values
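
To make the include/exclude semantics concrete, here is a minimal sketch of
just the selection step, written as a free function so it does not depend on
Xapian. The names select_fields and FieldMap are hypothetical and not part of
any Xapian API; how the selected fields would then be mapped to value slots is
left open.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical alias for the parsed field-name -> text map.
using FieldMap = std::map<std::string, std::string>;

// Hypothetical helper: returns the subset of `fields` that the proposed
// set_values_and_data() would store as values.
//   include == true:  keep only the fields named in `list`
//   include == false: keep only the fields not named in `list`
FieldMap select_fields(const FieldMap& fields, bool include,
                       const std::vector<std::string>& list) {
    FieldMap selected;
    for (const auto& field : fields) {
        const bool listed =
            std::find(list.begin(), list.end(), field.first) != list.end();
        if (listed == include) {
            selected.insert(field);
        }
    }
    return selected;
}

// Small usage check: with include == false and an empty list, every
// field is kept, which matches the "omit nothing" default.
void check_select_fields() {
    const FieldMap fields = {{"author", "Liam"}, {"body", "text"}};
    assert(select_fields(fields, false, {}).size() == 2);
    assert(select_fields(fields, true, {"author"}).count("author") == 1);
}
```

With this shape, an empty list plus include == false behaves like the earlier
omit_fields default, while an empty list plus include == true stores nothing.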