[Xapian-devel] indexing and searching of timed events
Richard Boulton
richard at lemurconsulting.com
Fri Jun 6 00:12:06 BST 2008
Michal Fapso wrote:
> I am working on an indexing/search engine for speech and I would like to
> try to use Xapian for that. I have an idea how to do it in Xapian, but I
> am not sure, if it is correct since I have just quickly looked at the
> Xapian code.
Are you aiming for the textual queries and spoken documents, or spoken
queries and spoken documents? (Spoken queries and textual documents is
also probably an interesting case, but from the next paragraph it sounds
like your documents are definitely spoken.)
> Tokens I need to index:
> Each speech audio record, processed by a speech recognizer is converted
> to an oriented graph of hypotheses. Each hypothesis contains the
> recognized word, start time, end time and confidence score. These
> hypotheses are overlapped in time, so there is generally a bunch of
> hypotheses in each point of time.
>
> A simple graph of hypotheses (output of speech recognizer):
> http://www.research.ibm.com/journal/sj/404/brown1.gif
Out of interest, what speech recogniser are you using? If it's an open
source one, others here might be interested in collaborating. Sphinx-4
is the only open source recogniser I know of, but there may be others
(I've not looked for them).
> So I suppose that the main thing I need to change in Xapian code is the
> termpos type (in types.h), which is just an unsigned integer. For speech
> indexing I need to change it to a struct containing start time, end time
> and score of recognized words.
If you can possibly work around it, I recommend avoiding changing the
termpos type - not only will you have to change the type in the header,
you'll need to modify the code which stores the list of positions to
store this header. Currently, this uses quite a bit of compression,
which you'd probably have to rip out and replace (possibly replacing
with no compression, but that would make databases quite a bit bigger,
and correspondingly slower). It's going to be messy, and there is a
better way.
The current phrase checking support in xapian works by performing an
"AND" search for all the terms in the phrase. Any document which
doesn't match the AND can't possibly match the phrase, so it is
immediately discarded. Any document which does match the AND needs to
be checked further - at this point, the position lists for the terms in
the phrase are opened, and searched for any occurrences of the words in
the correct order.
Xapian's phrase searching, as you've noticed, depends on each term
having a position specifiable by a single integer. It then looks for
occurrences of the words, in increasing order of position, such that the
difference in the position of the first and last word is less than a
given window size. This means that (unless the window size is equal to
the number of terms), phrases can contain gaps. This is no good for
you, if I understand correctly, because you don't want phrases to be
able to contain gaps. However the lattices coming out of the recogniser
don't allow the positions to be numbered such that all possible paths
involve a difference of 1 between successive words. Therefore, you do
indeed need to store different information.
However, I believe that all you need to store is the links in the
network coming out of the recogniser.
One approach to the problem is to implement your own phrase matching.
Instead of using Xapian's built in support for storing the position
list, you can store an arbitrary piece of data associated with each
document. I would recommend storing this in a value slot (see
Document::add_value()). If you store data describing the lattice from
the recogniser, you should then be able to attempt to find a path
through the lattice which corresponds to the query. You can use a
MatchDecider (or, if you use SVN HEAD on the trunk branch, you can use a
PostingSource) to perform arbitrary calculations using this data at
search time, so you can implement whatever check algorithm you need.
It'll be fiddly, mainly in serialising and unserialising the data, but
it should be doable without needing to modify any of the Xapian core code.
> Then to be able to search for phrases correctly, I have to change the
> code in ./matcher/phrasepostlist.cc to take start and end time into
> account.
>
> Please, correct me if I am wrong or if I missed something. I am really
> new to Xapian, so I will be grateful for any hint on this problem
> (tutorial, code snippet, doxygen page, ...).
--
Richard
More information about the Xapian-devel
mailing list