[Xapian-devel] indexing and searching of timed events

Fri Jun 6 09:38:01 BST 2008

Hi Richard,

thank you for your quick reply. I am aiming for textual queries and
spoken documents. 

The recognizer:
---------------
In our speech processing group, we have our own recognizer
(http://speech.fit.vutbr.cz/en/software/hmm-toolkit-stk-speech-fit). It
is released under GPL. It can produce recognition lattices in HTK format
(http://labrosa.ee.columbia.edu/doc/HTKBook21/node262.html), which is
quite a standard. So the indexer should work with the output lattices
and will not be hard-wired to any particular recognizer. Using it with
any other recognizer would be just a matter of writing another lattice
file parser.

Phrase searching:
-----------------
If I had a lattice stored with the indexed document, I could check the
phrase by traversing the lattice structure which would be quite
inefficient, since the lattices can be really huge. However, for
generating a context of search results, it would be helpful. I can
imagine another way. If I store term's start times in
OmDocumentTerm::term_positions vector and create another vector storing
the end times in the same order, OmDocumentTerm::term_positions_end, it
will be quite efficient to check the phrases. The terms will be inserted
in time-increasing order for the log complexity of search. Do you think,
that it would be easier to use the PostingSource for storing the end
times than adding another vector to OmDocumentTerm?

Best regards,
Miso

On Fri, 2008-06-06 at 00:12 +0100, Richard Boulton wrote:
> Michal Fapso wrote:
> > I am working on an indexing/search engine for speech and I would like to
> > try to use Xapian for that. I have an idea how to do it in Xapian, but I
> > am not sure, if it is correct since I have just quickly looked at the
> > Xapian code.
> 
> Are you aiming for the textual queries and spoken documents, or spoken 
> queries and spoken documents?  (Spoken queries and textual documents is 
> also probably an interesting case, but from the next paragraph it sounds 
> like your documents are definitely spoken.)
> 
> > Tokens I need to index:
> > Each speech audio record, processed by a speech recognizer is converted
> > to an oriented graph of hypotheses. Each hypothesis contains the
> > recognized word, start time, end time and confidence score. These
> > hypotheses are overlapped in time, so there is generally a bunch of
> > hypotheses in each point of time. 
> > 
> > A simple graph of hypotheses (output of speech recognizer):
> > http://www.research.ibm.com/journal/sj/404/brown1.gif
> 
> Out of interest, what speech recogniser are you using?  If it's an open 
> source one, others here might be interested in collaborating.  Sphinx-4 
> is the only open source recogniser I know of, but there may be others 
> (I've not looked for them).
> 
> > So I suppose that the main thing I need to change in Xapian code is the
> > termpos type (in types.h), which is just an unsigned integer. For speech
> > indexing I need to change it to a struct containing start time, end time
> > and score of recognized words.
> 
> If you can possibly work around it, I recommend avoiding changing the 
> termpos type - not only will you have to change the type in the header, 
> you'll need to modify the code which stores the list of positions to 
> store this header.  Currently, this uses quite a bit of compression, 
> which you'd probably have to rip out and replace (possibly replacing 
> with no compression, but that would make databases quite a bit bigger, 
> and correspondingly slower).  It's going to be messy, and there is a 
> better way.
> 
> The current phrase checking support in xapian works by performing an 
> "AND" search for all the terms in the phrase.  Any document which 
> doesn't match the AND can't possibly match the phrase, so it is 
> immediately discarded.  Any document which does match the AND needs to 
> be checked further - at this point, the position lists for the terms in 
> the phrase are opened, and searched for any occurrences of the words in 
> the correct order.
> 
> Xapian's phrase searching, as you've noticed, depends on each term 
> having a position specifiable by a single integer.  It then looks for 
> occurrences of the words, in increasing order of position, such that the 
> difference in the position of the first and last word is less than a 
> given window size.  This means that (unless the window size is equal to 
> the number of terms), phrases can contain gaps.  This is no good for 
> you, if I understand correctly, because you don't want phrases to be 
> able to contain gaps.  However the lattices coming out of the recogniser 
> don't allow the positions to be numbered such that all possible paths 
> involve a difference of 1 between successive words.  Therefore, you do 
> indeed need to store different information.
> 
> However, I believe that all you need to store is the links in the 
> network coming out of the recogniser.
> 
> One approach to the problem is to implement your own phrase matching. 
> Instead of using Xapian's built in support for storing the position 
> list, you can store an arbitrary piece of data associated with each 
> document.  I would recommend storing this in a value slot (see 
> Document::add_value()).  If you store data describing the lattice from 
> the recogniser, you should then be able to attempt to find a path 
> through the lattice which corresponds to the query.  You can use a 
> MatchDecider (or, if you use SVN HEAD on the trunk branch, you can use a 
> PostingSource) to perform arbitrary calculations using this data at 
> search time, so you can implement whatever check algorithm you need. 
> It'll be fiddly, mainly in serialising and unserialising the data, but 
> it should be doable without needing to modify any of the Xapian core code.
> 
> > Then to be able to search for phrases correctly, I have to change the
> > code in ./matcher/phrasepostlist.cc to take start and end time into
> > account.
> > 
> > Please, correct me if I am wrong or if I missed something. I am really
> > new to Xapian, so I will be grateful for any hint on this problem
> > (tutorial, code snippet, doxygen page, ...).
>