Integration of xapian in a framework

Fri Mar 11 12:33:09 GMT 2016

On Fri, Mar 11, 2016 at 01:54:43AM +0530, Ayush Gupta wrote:

> 1. Construct a weighted undirected graph with words as nodes (weights as
>    the number of times those 2 words have been searched together or found
>    together in the documents).
> 
> 2. Each node also keeps track of the 5 words most commonly used with
>    it (redundancy to save time from sorting its neighbours).
> 
> 3. Whenever a user starts typing his query, this graph is queried
>    and the first word is predicted by prefix matching. The results of
>    the matching are shown in order of their searched frequency.
> 
> 4. When the user starts typing the next word, the graph is visited and the
>    neighbours of this word and any other word are retrieved and shown to the
>    user in order of frequency (weight of edges).
> 
> [snip] is a rough overview of what I think can be used as an algorithm for
> autocomplete. I am going to read some research papers and improve on this.
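
To make the four steps quoted above concrete, here is a rough in-memory
sketch. It is only an illustration of the proposed graph, not anything
Xapian provides: the class and method names (CooccurrenceGraph, add_words,
complete_prefix, suggest_next) are made up for this example, ties are
broken alphabetically as an arbitrary choice, and the per-node "top 5"
cache from step 2 is left out in favour of sorting on demand.

```python
from collections import Counter, defaultdict


class CooccurrenceGraph:
    """Toy model of the weighted word graph (hypothetical, not a Xapian API)."""

    def __init__(self):
        self.word_freq = Counter()         # number of inputs each word appeared in
        self.edges = defaultdict(Counter)  # edges[a][b] == edges[b][a] == weight

    def add_words(self, words):
        """Step 1: record one query or document and bump every pairwise edge."""
        unique = sorted(set(words))
        self.word_freq.update(unique)
        for i, a in enumerate(unique):
            for b in unique[i + 1:]:
                self.edges[a][b] += 1
                self.edges[b][a] += 1

    def complete_prefix(self, prefix, limit=5):
        """Step 3: words starting with `prefix`, most frequent first."""
        matches = [w for w in self.word_freq if w.startswith(prefix)]
        matches.sort(key=lambda w: (-self.word_freq[w], w))
        return matches[:limit]

    def suggest_next(self, previous, prefix="", limit=5):
        """Step 4: neighbours of `previous`, heaviest edge first."""
        neighbours = [
            (word, weight)
            for word, weight in self.edges.get(previous, {}).items()
            if word.startswith(prefix)
        ]
        neighbours.sort(key=lambda item: (-item[1], item[0]))
        return [word for word, _ in neighbours[:limit]]
```

Feeding it a handful of queries with add_words() and then calling
complete_prefix("s") or suggest_next("open", "s") shows the two lookups
the proposal describes.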

Okay, for a Xapian project I think you'd need to use some part of
Xapian rather than building something completely separate (which would
be a separate project). Does it make sense to use btree tables to make
a new database type? Could it be implemented on top of a Xapian
database, perhaps with terms indexing documents representing the
weighted edges? How would you then store search frequencies to enable
that sorting?

However, you haven't answered one of my questions that I think is
fundamental: what data are you working with here? You seem to be
building a language model. Does autocomplete work principally on
phrases found across the entire corpus, or should it prefer, say,
document titles over phrases from the body text? That may affect the
way you want to tackle the problem. Try to think about a few concrete
examples of autocomplete, and see what data they'd be working with.

> Learning from user queries: The method I have suggested is pretty basic.
> Incrementing the weight of edges whenever 2 or more words are searched
> together.

In the algorithm (in step 1) you've said the weights are based on 'the
number of times...searched together or found together in the
documents'. I assume you're suggesting blending the two weights
together into one. Do you have an idea of how you'd do this? Also,
what functionality would you provide to make it easy to feed those
search weights in (for instance, are you going to incorporate a
logging system for all queries)?
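
As one possible shape for an answer (purely a sketch, not something the
proposal specifies): keep the document and query co-occurrence counts
separate, and blend them with a linear interpolation when ranking. The
function names, the query_share parameter and the log damping are all
assumptions made for this example; the damping is there only because
document counts will usually dwarf query counts.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(word_lists):
    """Count how often each unordered word pair occurs together."""
    counts = Counter()
    for words in word_lists:
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


def blend(doc_counts, query_counts, query_share=0.5):
    """Interpolate log-damped counts; query_share is a hypothetical knob in [0, 1]."""
    blended = {}
    for pair in set(doc_counts) | set(query_counts):
        doc_part = math.log1p(doc_counts.get(pair, 0))
        query_part = math.log1p(query_counts.get(pair, 0))
        blended[pair] = (1 - query_share) * doc_part + query_share * query_part
    return blended
```

Here pair_counts() would be run once over the indexed documents and once
over whatever query log is available, so the "feeding in" question becomes
one of how that list of logged queries is collected.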

> Stop Words: Since the prediction is quite basic right now, I don't think
> that stop words can be integrated in providing query predictions in the
> scope of this project. What are your thoughts?

I generally avoid using stop words at index time, and rely on the weighting
algorithm to de-emphasise them based on corpus frequency. However, I'm
not usually dealing with enormous datasets, and there are good reasons
in other cases to use stop words at index time. You'd need to at least
document what the behaviour is and how it will affect the user
experience of auto-completion in that case.
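
To illustrate what de-emphasising by corpus frequency could mean for
suggestions, here is a small sketch. It uses a simplified, smoothed
inverse document frequency rather than Xapian's actual weighting schemes,
and the function names and the idea of multiplying the edge weight by
that factor are assumptions made for the example.

```python
import math


def idf(term, documents):
    """Smoothed inverse document frequency: lower for terms in more documents."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log((len(documents) + 1) / (containing + 1)) + 1.0


def damp_suggestions(neighbour_weights, documents):
    """Rank candidate words by edge weight scaled by idf, so common words sink."""
    scored = {
        word: weight * idf(word, documents)
        for word, weight in neighbour_weights.items()
    }
    return sorted(scored, key=lambda word: (-scored[word], word))
```

With this kind of scaling a word like "the" can still be suggested, but it
drops below rarer words that have a similar edge weight, which is the
behaviour you would then need to document.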

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org