[Xapian-devel] indexing and searching of timed events

Fri Jun 6 00:12:06 BST 2008

Michal Fapso wrote:
> I am working on an indexing/search engine for speech and I would like to
> try to use Xapian for that. I have an idea how to do it in Xapian, but I
> am not sure, if it is correct since I have just quickly looked at the
> Xapian code.

Are you aiming for the textual queries and spoken documents, or spoken 
queries and spoken documents?  (Spoken queries and textual documents is 
also probably an interesting case, but from the next paragraph it sounds 
like your documents are definitely spoken.)

> Tokens I need to index:
> Each speech audio record, processed by a speech recognizer is converted
> to an oriented graph of hypotheses. Each hypothesis contains the
> recognized word, start time, end time and confidence score. These
> hypotheses are overlapped in time, so there is generally a bunch of
> hypotheses in each point of time. 
> 
> A simple graph of hypotheses (output of speech recognizer):
> http://www.research.ibm.com/journal/sj/404/brown1.gif

Out of interest, what speech recogniser are you using?  If it's an open 
source one, others here might be interested in collaborating.  Sphinx-4 
is the only open source recogniser I know of, but there may be others 
(I've not looked for them).

> So I suppose that the main thing I need to change in Xapian code is the
> termpos type (in types.h), which is just an unsigned integer. For speech
> indexing I need to change it to a struct containing start time, end time
> and score of recognized words.

If you can possibly work around it, I recommend avoiding changing the 
termpos type - not only will you have to change the type in the header, 
you'll need to modify the code which stores the list of positions to 
store this header.  Currently, this uses quite a bit of compression, 
which you'd probably have to rip out and replace (possibly replacing 
with no compression, but that would make databases quite a bit bigger, 
and correspondingly slower).  It's going to be messy, and there is a 
better way.

The current phrase checking support in xapian works by performing an 
"AND" search for all the terms in the phrase.  Any document which 
doesn't match the AND can't possibly match the phrase, so it is 
immediately discarded.  Any document which does match the AND needs to 
be checked further - at this point, the position lists for the terms in 
the phrase are opened, and searched for any occurrences of the words in 
the correct order.

Xapian's phrase searching, as you've noticed, depends on each term 
having a position specifiable by a single integer.  It then looks for 
occurrences of the words, in increasing order of position, such that the 
difference in the position of the first and last word is less than a 
given window size.  This means that (unless the window size is equal to 
the number of terms), phrases can contain gaps.  This is no good for 
you, if I understand correctly, because you don't want phrases to be 
able to contain gaps.  However the lattices coming out of the recogniser 
don't allow the positions to be numbered such that all possible paths 
involve a difference of 1 between successive words.  Therefore, you do 
indeed need to store different information.

However, I believe that all you need to store is the links in the 
network coming out of the recogniser.

One approach to the problem is to implement your own phrase matching. 
Instead of using Xapian's built in support for storing the position 
list, you can store an arbitrary piece of data associated with each 
document.  I would recommend storing this in a value slot (see 
Document::add_value()).  If you store data describing the lattice from 
the recogniser, you should then be able to attempt to find a path 
through the lattice which corresponds to the query.  You can use a 
MatchDecider (or, if you use SVN HEAD on the trunk branch, you can use a 
PostingSource) to perform arbitrary calculations using this data at 
search time, so you can implement whatever check algorithm you need. 
It'll be fiddly, mainly in serialising and unserialising the data, but 
it should be doable without needing to modify any of the Xapian core code.

> Then to be able to search for phrases correctly, I have to change the
> code in ./matcher/phrasepostlist.cc to take start and end time into
> account.
> 
> Please, correct me if I am wrong or if I missed something. I am really
> new to Xapian, so I will be grateful for any hint on this problem
> (tutorial, code snippet, doxygen page, ...).

-- 
Richard