[Xapian-discuss] Getting custom field data from the page through crawling

Matt Barnicle mattb at wageslavery.org
Thu Feb 8 07:21:37 GMT 2007


Now on to my next question. I've got the search and indexing working well for now.
My next goal is to add custom fields to the index. Our site
is fully dynamic: every page is generated in PHP, and there are enough
different kinds of pages that I don't want to get into the business of indexing the
DB directly, so I think using htdig to crawl the site is the best way to go. But
I would like to be able to search by field, such as 'type', 'category',
'name', 'city', etc. I've thought about it a lot and done a lot of reading in
the list archives, but I couldn't come up with a way of passing this
information from the generated pages to the search database. I was hoping I could
store it in meta tags, like:

<meta name="myorg.item.type" content="event" />
<meta name="myorg.item.category" content="theatre" />
<meta name="myorg.item.name" content="The Nutcracker Suite" />
<meta name="myorg.item.start_date" content="2007-02-10" />
<meta name="myorg.item.end_date" content="2007-02-16" />

That won't work as-is, though, because htdig doesn't store that information in a
form I can retrieve later in order to set the fields myself. The
one workaround I could come up with is to edit the htdig2omega
script so that, for each doc read from db.docs, it makes an HTTP request for the URL,
parses these tags, and prints the fields, which would map to the settings I specify
in htdig2omega.script. The drawback is that every page gets fetched twice when I
spider the site: once for the main htdig crawl, and a second time during the db
conversion.
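
To make the tag-parsing step of that workaround concrete, here is a rough sketch in
Python. It is only an illustration, not a patch to htdig2omega: the names
MetaFieldParser, extract_fields and to_field_lines are made up, and the name=value
output format is an assumption about what the conversion step would print. Fetching
the page is left out; the function just takes the HTML as a string.

```python
from html.parser import HTMLParser

# Prefix taken from the example meta tags; everything else here is hypothetical.
PREFIX = "myorg.item."


class MetaFieldParser(HTMLParser):
    """Collect (field, value) pairs from <meta name="myorg.item.*"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags like <meta ... />, because the
        # default handle_startendtag() calls handle_starttag().
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        if name.startswith(PREFIX) and content is not None:
            # Strip the prefix so "myorg.item.type" becomes the field "type".
            self.fields.append((name[len(PREFIX):], content))


def extract_fields(html):
    """Return the custom fields found in one page, in document order."""
    parser = MetaFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


def to_field_lines(fields):
    """Format fields as name=value lines (assumed output format)."""
    return "\n".join(f"{name}={value}" for name, value in fields)
```

Calling to_field_lines(extract_fields(page_html)) for each URL would give one line per
custom field, and meta tags without the myorg.item. prefix are ignored.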

Is there a better way to achieve this result?

A second, related question: can I have multiple field values with
the same name? For example, sometimes an event falls under more than one category, like
'theatre' and 'performing arts'. That's a basic example, but there are others with
many options. If the page type is 'venue', for instance, there is the list of services
that venue offers, such as wheelchair accessibility, closed captioning, braille, and
more. Our site visitors will be searching on these attributes to find, for example,
events happening on a certain date at venues that offer certain services.
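
To show what I mean by repeated fields, here is a small Python sketch of the data
shape I'm after. It says nothing about how Xapian or Omega would actually store
this; group_fields and offers_all are hypothetical helpers, and the 'service' field
name is made up for the example.

```python
from collections import defaultdict


def group_fields(pairs):
    """Group (name, value) pairs so a repeated name keeps every value."""
    grouped = defaultdict(list)
    for name, value in pairs:
        grouped[name].append(value)
    return dict(grouped)


def offers_all(record, field, wanted):
    """True if the record's values for `field` include every wanted value."""
    return set(wanted) <= set(record.get(field, []))


# A venue page with the hypothetical 'service' field repeated once per service.
venue = group_fields([
    ("type", "venue"),
    ("service", "wheelchair accessibility"),
    ("service", "braille"),
])
```

With this shape, offers_all(venue, "service", ["braille"]) is true, while asking for
"closed captioning" as well is false, which is the kind of filtering our visitors
would need.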

- Matt



