[Xapian-tickets] [Xapian] #282: Assorted enhancements to omindex

Xapian nobody at xapian.org
Tue Dec 6 13:23:52 GMT 2011


#282: Assorted enhancements to omindex
-------------------------+--------------------------------------------------
 Reporter:  olly         |       Owner:  olly     
     Type:  enhancement  |      Status:  assigned 
 Priority:  normal       |   Milestone:  1.2.x    
Component:  Omega        |     Version:  SVN trunk
 Severity:  normal       |    Keywords:           
Blockedby:               |    Platform:  All      
 Blocking:               |  
-------------------------+--------------------------------------------------

Old description:

> A patch from Reini Urban at AVL which was pasted into the wiki a while
> back, but a ticket is really a more appropriate way to track it.  We
> should look at folding some of these improvements in, though some others
> we probably don't want to include, at least in the form in this patch.
>
> I've updated the patch to compile with latest Omega SVN HEAD, dropping
> parts which Omega now supports anyway, and splitting out some features
> into separate tickets.  I've not run-tested it at all.
>
> The remaining features in this patch are:
>
>  * Unpacking "container file types" (e.g. archives like .zip, email
> folders like .mbox, email messages with attachments) so we can index the
> sub-parts
>  * Logging stderr from filters to a file
>  * Defaulting to adding the size and lastmod time of the dump file in
> scriptindex. In general, the size of the dump file seems misleading
> (though if you put one document per dump, less so). The lastmod isn't
> particularly helpful in many cases either.

New description:

 A patch from Reini Urban at AVL which was pasted into the wiki a while
 back, but a ticket is really a more appropriate way to track it.  We
 should look at folding some of these improvements in, though some others
 we probably don't want to include, at least in the form in this patch.

 I've updated the patch to compile with latest Omega SVN HEAD, dropping
 parts which Omega now supports anyway, and splitting out some features
 into separate tickets.  I've not run-tested it at all.

 The remaining features in this patch are:

  * Unpacking "container file types" (e.g. archives like .zip, email
 folders like .mbox, email messages with attachments) so we can index the
 sub-parts
  * Logging stderr from filters to a file

--

Comment(by olly):

 I've updated the patch to current trunk.  Not tested building this time.

 I also dropped the code added to scriptindex to add the size and lastmod
 time of the dump file to every document created from it.  I don't see this
 making sense in most cases.  Perhaps if you feed one document per dump
 file it does.  But anyway, I think it's better to be explicit and put this
 data in the records in the dump file if you want it.
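
 As a rough sketch of what "being explicit" could look like, the generator
 of the dump file can emit the size and modification time itself.  This
 assumes scriptindex's name=value record layout (records separated by a
 blank line, continuation lines prefixed with "=").  The field names url,
 size, lastmod and text, and the dump_record helper, are made up for
 illustration; the index script would need matching rules for them.

```python
import os


def dump_record(path, text):
    """Build one dump-file record for `path`, with explicit size and lastmod.

    Field names here are hypothetical; they must match whatever the
    index script used with scriptindex expects.
    """
    st = os.stat(path)
    fields = [
        ("url", path),
        ("size", str(st.st_size)),           # size of the source file in bytes
        ("lastmod", str(int(st.st_mtime))),  # mtime as seconds since the epoch
        ("text", text),
    ]
    lines = []
    for name, value in fields:
        first, *rest = value.split("\n")
        lines.append(f"{name}={first}")
        # A value spanning several lines is continued by starting each
        # following line with "=".
        lines.extend("=" + part for part in rest)
    # A blank line terminates the record.
    return "\n".join(lines) + "\n\n"
```

 Writing the concatenated records for each source file to one dump file
 then gives every document its own size and lastmod, rather than those of
 the dump file.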

 I also dropped the excel2text script, as we already handle Excel files and
 this script adds nothing beyond stripping out all numbers.  I can see the
 motivation for that, but it isn't consistent with how we handle numbers in
 other formats, and it isn't helpful for users wanting to search for a
 number.

-- 
Ticket URL: <http://trac.xapian.org/ticket/282#comment:10>
Xapian <http://xapian.org/>