[Xapian-devel] indexing and searching of timed events

Michal Fapso ifapso at fit.vutbr.cz
Thu Jun 5 15:33:10 BST 2008


Hello,

I am working on an indexing/search engine for speech and I would like to
try using Xapian for it. I have an idea of how to do this in Xapian, but I
am not sure whether it is correct, since I have only had a quick look at
the Xapian code.

Tokens I need to index:
Each speech audio recording, once processed by a speech recognizer, is
converted to a directed graph of hypotheses. Each hypothesis contains the
recognized word, start time, end time and confidence score. These
hypotheses overlap in time, so there are generally several hypotheses at
each point in time.

A simple graph of hypotheses (output of speech recognizer):
http://www.research.ibm.com/journal/sj/404/brown1.gif

So I suppose that the main thing I need to change in the Xapian code is
the termpos type (in types.h), which is just an unsigned integer. For
speech indexing I need to change it to a struct containing the start
time, end time and score of each recognized word.
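
A minimal sketch of such a struct, to make the idea concrete. The name
TimedPos, the millisecond unit and the choice of ordering are all
assumptions for illustration; nothing here is existing Xapian code. The
ordering operator is included on the assumption that position lists must
stay sorted, as they are for integer positions.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical replacement for the integer termpos: one recognized-word
// hypothesis taken from the recognizer's graph. Names are illustrative.
struct TimedPos {
    std::uint32_t start_ms;  // start time (milliseconds are an assumption)
    std::uint32_t end_ms;    // end time, same unit
    float score;             // recognizer confidence score

    TimedPos(std::uint32_t start, std::uint32_t end, float s)
        : start_ms(start), end_ms(end), score(s) {
        assert(start <= end);  // a hypothesis cannot end before it starts
    }
};

// Order by start time, then end time. The score is deliberately ignored,
// so two hypotheses covering the same interval compare as equal.
inline bool operator<(const TimedPos& a, const TimedPos& b) {
    return std::tie(a.start_ms, a.end_ms) < std::tie(b.start_ms, b.end_ms);
}

inline bool operator==(const TimedPos& a, const TimedPos& b) {
    return a.start_ms == b.start_ms && a.end_ms == b.end_ms;
}
```

A struct like this is several times larger than a single unsigned
integer, so the on-disk encoding of position lists would probably need
to change along with the type.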

Then, to be able to search for phrases correctly, I have to change the
code in ./matcher/phrasepostlist.cc to take start and end time into
account.
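
A rough sketch of what a time-aware phrase check could look like,
independent of Xapian's actual phrase matching code. With integer
positions, adjacency means consecutive position numbers; here it is
assumed to mean that the next word starts at or shortly after the end of
the previous one. The names TimedPos, follows, phrase_matches and the
max_gap_ms tolerance are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical timed position: one recognized-word hypothesis.
struct TimedPos {
    std::uint32_t start_ms;
    std::uint32_t end_ms;
    float score;
};

// True if `next` may directly follow `prev` in a phrase: it must not
// start before `prev` ends, and the silence between them must not
// exceed max_gap_ms (an assumed tolerance).
inline bool follows(const TimedPos& prev, const TimedPos& next,
                    std::uint32_t max_gap_ms) {
    return next.start_ms >= prev.end_ms &&
           next.start_ms - prev.end_ms <= max_gap_ms;
}

// words[i] holds all hypotheses of the i-th phrase word in one document.
// Returns true if some chain of hypotheses, one per word, is adjacent in
// time. Quadratic per word pair; kept simple for clarity.
bool phrase_matches(const std::vector<std::vector<TimedPos>>& words,
                    std::uint32_t max_gap_ms) {
    assert(!words.empty());
    // Hypotheses of the current word that end a valid chain so far.
    std::vector<TimedPos> reachable = words[0];
    for (std::size_t i = 1; i < words.size() && !reachable.empty(); ++i) {
        std::vector<TimedPos> next_reachable;
        for (const TimedPos& cand : words[i]) {
            for (const TimedPos& prev : reachable) {
                if (follows(prev, cand, max_gap_ms)) {
                    next_reachable.push_back(cand);
                    break;  // one valid predecessor is enough
                }
            }
        }
        reachable = std::move(next_reachable);
    }
    return !reachable.empty();
}
```

The confidence scores are carried along but not used here; combining
them into the document weight would be a separate change.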

Please correct me if I am wrong or if I have missed something. I am quite
new to Xapian, so I would be grateful for any hints on this problem
(tutorial, code snippet, doxygen page, ...).

Thank you,
Miso




