[Xapian-discuss] scriptindex on an internet crawl

Wed Jun 22 21:14:12 BST 2005

On Wed, Jun 22, 2005 at 03:21:32PM -0400, Georges Dupret wrote:
> 1) what should I input in the search field to search only the title? I
> was expecting that something like "title: plane" would work, but it
> doesn't.

You need to index the title with a suitable prefix for this to work.
Make the index script line for the title:

    title : field=title index=S index

And then in your omegascript template (i.e. templates/query), add this
(at the start is best):

    $setmap{prefix,title,S}

Omega's docs/termprefixes.txt and the $setmap section of
docs/omegascript.txt explain this, but could do with being less
disjoint.  I'll fix that.
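
As a rough sketch of how this looks from the search box (the word
"plane" is just your example), once the index script line and the
$setmap line are in place and the data has been reindexed, a title-only
search should be entered with no space after the colon:

    title:plane

Because the index script line also has a bare "index" action, the title
words are indexed without a prefix too, so an ordinary query such as

    plane

should still match words in titles.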

> 2) how should I do to see the original url of the documents retrieved
> such that if I click on the hyperlink, I am redirected to the original
> document (i.e. not the document I have in my copy of the crawl).

Assuming you want to store the original url in the database (which
probably is the best approach) then you'll need to modify your
omegascript template.  Look for where it uses $field{url} and modify
that so that the link given is for the locally cached copy.

> In a first try, I inserted in the command file url : field=url boolean=XURL
> unique=XURL and in the input file: url=www.dcc.uchile.cl/~gdupret for
> example, but scriptindex start using 100% of the CPU and never finishes.

You probably don't want to specify both uid and url as unique fields,
but this shouldn't cause a hang - I'll see if I can reproduce this.  Just
this rule should do:

    url : field=url
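
For illustration only, a matching record in the input file might look
something like this (the title text is made up, and I'm assuming you
store the full URL including the "http://" so that it works as a link
target when the template outputs $field{url}):

    url=http://www.dcc.uchile.cl/~gdupret
    title=Example page title

Records in the input file are separated by a blank line.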

Cheers,
    Olly