[Xapian-tickets] [Xapian] #282: Assorted enhancements to omindex
Xapian
nobody at xapian.org
Tue Dec 6 13:23:52 GMT 2011
#282: Assorted enhancements to omindex
-------------------------+--------------------------------------------------
 Reporter:  olly         |      Owner:  olly
     Type:  enhancement  |     Status:  assigned
 Priority:  normal       |  Milestone:  1.2.x
Component:  Omega        |    Version:  SVN trunk
 Severity:  normal       |   Keywords:
Blockedby:               |   Platform:  All
 Blocking:               |
-------------------------+--------------------------------------------------
Old description:
> A patch from Reini Urban at AVL which was pasted into the wiki a while
> back, but a ticket is really a more appropriate way to track it. We
> should look at folding some of these improvements in, though some others
> we probably don't want to include, at least in the form in this patch.
>
> I've updated the patch to compile with latest Omega SVN HEAD, dropping
> parts which Omega now supports anyway, and splitting out some features
> into separate tickets. I've not run-tested it at all.
>
> The remaining features in this patch are:
>
> * Unpacking "container file types" (e.g. archives like .zip, email
> folders like .mbox, email messages with attachments) so we can index the
> sub-parts
> * Logging stderr from filters to a file
> * Defaulting to adding the size and lastmod time of the dump file in
> scriptindex. In general, the size of the dump file seems misleading
> (though if you put one document per dump, less so). The lastmod isn't
> particularly helpful in many cases either.
New description:
A patch from Reini Urban at AVL which was pasted into the wiki a while
back, but a ticket is really a more appropriate way to track it. We
should look at folding some of these improvements in, though some others
we probably don't want to include, at least in the form in this patch.

I've updated the patch to compile with latest Omega SVN HEAD, dropping
parts which Omega now supports anyway, and splitting out some features
into separate tickets. I've not run-tested it at all.

The remaining features in this patch are:

* Unpacking "container file types" (e.g. archives like .zip, email
  folders like .mbox, email messages with attachments) so we can index the
  sub-parts
* Logging stderr from filters to a file
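
To illustrate the second item, here is a minimal sketch of running an external filter while appending anything it writes to stderr to a log file. It is not the code from the patch (Omega itself is C++); the function name run_filter and the log path are hypothetical, chosen only for this example.

```python
import subprocess


def run_filter(argv, log_path):
    """Run an external filter command.

    Returns (stdout_text, exit_status). Whatever the filter writes to
    stderr is appended to log_path instead of reaching the terminal.
    """
    # Open in append mode so output from successive filters accumulates.
    with open(log_path, "ab") as log:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,  # extracted text comes back to the caller
            stderr=log,              # diagnostics go to the log file
            check=False,             # a failing filter is reported via the status
        )
    return result.stdout.decode("utf-8", errors="replace"), result.returncode
```

A caller would pass the filter's argument list and a log path, then index the returned text only if the exit status is zero.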
--
Comment(by olly):
I've updated the patch to current trunk. I haven't tested that it builds
this time.
I also dropped the code added to scriptindex to add the size and lastmod
time of the dump file to every document created from it. I don't think
this makes sense in most cases, though it may if you feed one document per
dump file. Either way, I think it's better to be explicit and put this
data in the records in the dump file if you want it.
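
As a rough sketch of the explicit approach, whatever generates the dump file could write the size and modification time of each source file into that file's own record. This assumes the usual scriptindex input layout (name=value lines, a newline inside a value continued with a leading "=", and a blank line between records). The field names url, size, lastmod and text are hypothetical, and the index script would need matching rules for them.

```python
import os


def format_record(fields):
    """Format a mapping as one dump record terminated by a blank line."""
    lines = []
    for name, value in fields.items():
        # A newline inside a value is continued by starting the next line with '='.
        lines.append(f"{name}=" + str(value).replace("\n", "\n="))
    return "\n".join(lines) + "\n\n"


def record_for_file(path, text):
    """Build a record carrying the source file's own size and lastmod time."""
    st = os.stat(path)
    return format_record({
        "url": path,                   # hypothetical field name
        "size": st.st_size,            # size of the source file in bytes
        "lastmod": int(st.st_mtime),   # modification time as a Unix timestamp
        "text": text,
    })
```

This way each document gets the metadata of the file it came from, not that of the dump file.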
I also dropped the excel2text script, as we already have XL handling and
this script adds nothing beyond stripping out all numbers. I can see the
motivation for that, but it isn't consistent with how we handle numbers in
other formats, and it isn't helpful for users wanting to search for a
number.
--
Ticket URL: <http://trac.xapian.org/ticket/282#comment:10>
Xapian <http://xapian.org/>