[Xapian-discuss] Newbie question: How to extract 'date modified' from path when indexing?

Deron Meranda deron.meranda at gmail.com
Wed Apr 1 04:55:47 BST 2009


On Tue, Mar 31, 2009 at 8:36 PM, Bill Hutten <bill at hutten.org> wrote:
> I've successfully set up Xapian/Omega as the search engine on a client
> website. ...
>
> The files are stored in a consistent structure, for instance file
> "foo.html" might be in "archives/2006/07/foo.html"  In this example, I
> would like to be able to extract the 2006/07 value from the path during
> indexing and use that as the date that Xapian/Omega uses to search on.

Do you have access to the webserver files at all?  If so, the best
solution is simply to change the timestamps of the underlying files.  That
would benefit not only your Xapian indexing, but also all the other HTTP
goodness, such as working with whatever other types of spiders or
indexers may be crawling the site, HTTP proxies and caches, etc.

If it's Unix/Linux, changing the file timestamps would be quite easy.
You want to look at the "touch" command (its -t option sets an explicit
date).  Or I could provide you a little script to do that.
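
A script along those lines might look like the sketch below.  It is only
an illustration: it assumes every file sits under a .../YYYY/MM/ directory
as in the "archives/2006/07/foo.html" example, and it picks the first day
of the month (midnight UTC) as the timestamp since the path carries no day.
The function names and the regular expression are made up for this example.

```python
import os
import re
from datetime import datetime, timezone

# Assumed layout: .../<4-digit year>/<2-digit month>/<file>
PATH_DATE = re.compile(r"(?:^|/)(\d{4})/(\d{2})/[^/]+$")

def mtime_from_path(path):
    """Return a Unix timestamp for the first day of the month named in
    the path, or None if the path does not encode a valid year/month."""
    m = PATH_DATE.search(path.replace(os.sep, "/"))
    if not m:
        return None
    try:
        stamp = datetime(int(m.group(1)), int(m.group(2)), 1,
                         tzinfo=timezone.utc)
    except ValueError:  # e.g. month 13
        return None
    return stamp.timestamp()

def retouch(root):
    """Set atime and mtime of every file under root whose path encodes
    a date. Returns the number of files changed."""
    changed = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            stamp = mtime_from_path(full)
            if stamp is not None:
                os.utime(full, (stamp, stamp))
                changed += 1
    return changed
```

Calling retouch("archives") from the document root would then rewrite the
timestamps in place, so try it on a copy of the tree first.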


As a second choice, if this is, say, an Apache webserver and you
can add some configuration (either the main config file or the
per-directory .htaccess files), then you can force Apache to
lie about the file's date.  This is easiest if you only have a
few directories (with one directory per month it's still doable).
Again, since the webserver would be sending out the correct
date, it also benefits other spiders, indexers, HTTP caches, etc.


As a last resort, you're going to have to modify the indexer itself
to overrule what it learns from the HTTP date, and instead extract
a date out of the URL pattern.
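
To give an idea of the logic involved, here is a rough sketch in Python
(the real change would have to be made in whatever language your indexer
is written in).  It prefers a year/month found in the URL and only falls
back to the HTTP date when the URL has none.  The function names are
invented for the example, and the D/M/Y date term prefixes are my
understanding of Omega's conventions, so check them against the Omega
documentation before relying on them.

```python
import re
from datetime import date
from urllib.parse import urlparse

# Assumed URL layout: .../<4-digit year>/<2-digit month>/<file>
URL_DATE = re.compile(r"/(\d{4})/(\d{2})/[^/]+$")

def document_date(url, http_date=None):
    """Return the date encoded in the URL path (first of the month),
    or http_date (a datetime.date or None) if the URL has no date."""
    m = URL_DATE.search(urlparse(url).path)
    if m:
        try:
            return date(int(m.group(1)), int(m.group(2)), 1)
        except ValueError:  # e.g. month 13: ignore and fall back
            pass
    return http_date

def date_terms(d):
    """Build day, month and year terms for a date.
    Assumption: the D/M/Y prefixes match what Omega expects."""
    return [
        f"D{d.year:04d}{d.month:02d}{d.day:02d}",
        f"M{d.year:04d}{d.month:02d}",
        f"Y{d.year:04d}",
    ]
```

The point of keeping the HTTP date as a fallback is that pages outside
the dated directories still get indexed with a sensible date.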
-- 
Deron Meranda


