[Xapian-discuss] some questions with scriptindex

Sabrina Shen hm2shen at yahoo.com
Sun Mar 27 20:45:21 BST 2005


I just learned about probabilistic retrieval (PR) in my IR course, and I find
Xapian a fantastic open-source project for understanding how PR can be
implemented in practice. I am trying to build a local PR system for articles
with Xapian and Omega, with an eye on extending it into a larger text
categorization project. My index script is below:

****************************************************
UID: field=UID boolean=XUID unique=Q 
JN:  boolean=XJN field=JN
PY:  boolean=XPY field=PY
TI:  index=XTA truncate=200 field=TI
AU:  boolean=XAU truncate=200 field=AU
CA:  boolean=XCA field=CA
AB: index=XTA truncate=200  field=AB
****************************************************
(UID: unique ID for each article; JN: journal; PY: publication year; TI: title;
AU: author; CA: classification; AB: abstract. I want to search TI and AB, and
possibly use JN, PY, and CA as boolean filters on some occasions.)

Here are my questions. (To clarify: when I write field without quotation marks,
I mean a field in the db; "field=" refers to the action that adds a field to the
Xapian record.)

(1) Do I really need "field=" for each? Isn't "field=" just for displaying web
search results (as Sam described in an earlier message: "Fields are used to
retrieve per-record text for summaries and things like for Omega.")? Can't I
get these values with "get_document().get_data()" using an MSetIterator in my
local system even without "field=", say, to output search results to a text
file?

(2) How does "truncate=" work? Does it work for both probabilistic and boolean
fields? Does it truncate each word while indexing, e.g. truncate a term if it is
longer than 200 characters? Or does it truncate the whole field when performing
the "field=" action?

(3) During indexing, I got the following error message:
"Exception: Key too long: length was 264 bytes, maximum length of a key is
Btree::max_key_len bytes". I understand this means a single term is too long.
But a term in which field: the primary field UID, or any field such as JN, CA,
or AB? If it is a term in a probabilistic field, which I would like to keep as
it is and searchable, what should I do?
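
For question (3), one way to narrow down which field holds the overlong value
is to scan the scriptindex input file before indexing. The following is only a
sketch: it assumes the input uses name=value lines, with blank lines between
records and lines starting with "=" continuing the previous value. The helper
names parse_dump and find_long_values and the 200-byte limit are made up for
illustration and are not part of Xapian.

```python
def parse_dump(dump_text):
    """Split dump text into records, each a list of (field_name, value) pairs.

    Assumed format: one name=value per line, blank lines between records,
    and a line starting with "=" continuing the previous field's value.
    """
    records = []
    current = []
    for line in dump_text.splitlines():
        if not line.strip():
            # A blank line closes the current record, if there is one.
            if current:
                records.append(current)
                current = []
        elif line.startswith("=") and current:
            # Continuation line: append to the value of the previous field.
            name, value = current[-1]
            current[-1] = (name, value + "\n" + line[1:])
        else:
            name, _, value = line.partition("=")
            current.append((name, value))
    if current:
        records.append(current)
    return records


def find_long_values(dump_text, limit=200):
    """Return (record_number, field_name, byte_length) for values over limit.

    The limit is a hypothetical threshold chosen for illustration; lengths
    are counted in UTF-8 bytes because the error message reports bytes.
    """
    hits = []
    for number, record in enumerate(parse_dump(dump_text), start=1):
        for name, value in record:
            size = len(value.encode("utf-8"))
            if size > limit:
                hits.append((number, name, size))
    return hits


if __name__ == "__main__":
    sample = "UID=1\nAU=" + "x" * 264 + "\nTI=Short title\n\nUID=2\nAU=Smith J\n"
    for number, name, size in find_long_values(sample):
        print(f"record {number}: field {name} is {size} bytes")
```

Fields flagged this way are only candidates: the sketch reports long raw
values and does not reproduce how scriptindex turns a value into terms.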

Any ideas or suggestions are highly appreciated.

Sabrina