query regarding matcher optimisation and proposal submission

Olly Betts olly at survex.com
Tue Mar 15 00:34:01 GMT 2016


Hi Sanat,

On Tue, Mar 15, 2016 at 12:29:41AM +0530, Sanat Jain wrote:
> 1) In the ticket #215 (Boolean OR could be optimised further)
> https://trac.xapian.org/ticket/215
> 
>  (i)  Is there a predefined function to sort the posting lists in order of
> term frequency? If yes then where can I find it?

You'll need to learn to find your own way around the code.  We aim
to answer questions promptly, but we aren't available 24/7 - if you
stop, ask a question and have to wait for an answer every time you hit
an unknown, that's going to seriously reduce the amount you can get
done.

Tools like "git grep" are very useful for finding answers to such
questions:

$ git grep -i postlist|grep -i sort|grep -i freq
api/queryinternal.cc:    // Sort the postlists so that the postlist with the greatest term frequency
api/queryinternal.cc:    sort(pls.begin(), pls.end(), ComparePostListTermFreqAscending());

Admittedly I have the advantage of having some idea of what the answer
looks like, but even just a search for all the places that call sort()
would give you a manageable list of places to look:

$ git grep '\<sort('
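
To give a feel for what that first grep hit is doing, here's a minimal
sketch of sorting with a comparator functor.  ToyTermInfo,
CompareToyTermFreqAscending, sort_by_termfreq() and rarest_term() are
all names made up for this example - the real comparator in
api/queryinternal.cc works on Xapian's internal postlist objects, so
treat this as an illustration of the pattern rather than the actual
code:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a postlist: just a term and the number of
// documents it occurs in.  This is NOT Xapian's real PostList class.
struct ToyTermInfo {
    std::string term;
    unsigned termfreq;
};

// Functor in the style of the comparator in the grep output above;
// orders entries by ascending term frequency.
struct CompareToyTermFreqAscending {
    bool operator()(const ToyTermInfo& a, const ToyTermInfo& b) const {
        return a.termfreq < b.termfreq;
    }
};

// Sort in place so that the rarest term comes first.
void sort_by_termfreq(std::vector<ToyTermInfo>& pls) {
    std::sort(pls.begin(), pls.end(), CompareToyTermFreqAscending());
}

// Return the term with the lowest term frequency (works on a copy).
std::string rarest_term(std::vector<ToyTermInfo> pls) {
    assert(!pls.empty());
    sort_by_termfreq(pls);
    return pls.front().term;
}
```

With entries for "the" (900), "xapian" (3) and "search" (40), calling
sort_by_termfreq() leaves them ordered xapian, search, the.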

>   (ii)  What does the following paragraph means as given in the above link:
> 
>  “We'd need to keep track of which sub-postlists have been moved up to the
>     current position, and which haven't. When next() is called, we'd call
> next() on any sub-postlists which are up-to-date, but we would need to call
> skip_to() on any other sub-postlists which are further behind.”
> 
> (iii) And can you please tell me what is the difference between next() and
> skip_to()?

Look at the header where the class you're interested in is defined, and
in most cases the methods have documentation comments - in this case,
see api/postlist.h.

It would be hard to explain the quoted paragraph if you don't understand
the purpose of these methods.
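
That said, here's a toy model of the general idea, as a rough sketch
only.  ToyPostList is a made-up class over a sorted vector of document
ids, and it assumes the usual postlist convention that the list starts
before its first entry; the documentation comments in api/postlist.h
are the authority on what the real methods do:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy postlist over a sorted vector of document ids.
// The method names mirror next() and skip_to(), but the class is
// invented for illustration and is not part of Xapian.
class ToyPostList {
    std::vector<unsigned> docids_;  // sorted ascending
    std::size_t pos_ = 0;
    bool started_ = false;  // assumption: list starts before entry 0

  public:
    explicit ToyPostList(std::vector<unsigned> docids)
        : docids_(std::move(docids)) {}

    bool at_end() const { return started_ && pos_ >= docids_.size(); }

    // Only valid once positioned on an entry.
    unsigned get_docid() const {
        assert(started_ && pos_ < docids_.size());
        return docids_[pos_];
    }

    // Advance by exactly one entry.
    void next() {
        if (!started_) {
            started_ = true;  // first call moves onto the first entry
            return;
        }
        if (pos_ < docids_.size()) ++pos_;
    }

    // Advance to the first entry with docid >= did.  Never moves
    // backwards, and may skip many entries in one call.
    void skip_to(unsigned did) {
        started_ = true;
        auto first = docids_.begin() + static_cast<std::ptrdiff_t>(pos_);
        auto it = std::lower_bound(first, docids_.end(), did);
        pos_ = static_cast<std::size_t>(it - docids_.begin());
    }
};
```

In terms of the quoted paragraph: a sub-postlist which is already at
the current position only needs next(), while one which has been left
behind can jump straight to where it's needed with skip_to() rather
than stepping through every entry in between.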

> 2) is there any explanation for ticket #394 (Speed up phrase queries with a
> "settling pond")

There's an explanation in the ticket's description...

I assume you've read that and it wasn't what you wanted, but it's hard
for me to know what you are after.

It's generally very hard to give a helpful answer to vague or general
questions.  Try to ask precise questions if you want helpful answers.

> 3) Also can you please tell me where can I find some explanation of
> OP_SYNONYM as required in
> 
> Ticket #400 (Optimise AND_MAYBE when the RHS has a maxweight of 0)
> 
>                https://trac.xapian.org/ticket/400

In the API docs for the Xapian::Query class (or you can find everywhere
it is used with: git grep OP_SYNONYM)

> 4) I am new to GSOC so can you please guide me, where should I submit my
> first draft proposal to you for your feedback, should it be this email or
> should I submit it on GSOC main website and then edit it later?

The GSoC website has been completely redone this year, and I'm not
yet sure what the new workflow for this is.

I know the final proposals are submitted as PDFs this year, and I can
see "Draft Proposals" and "PDF Proposals" in the dashboard, so I guess
you can submit a "Draft Proposal" as something other than PDF, but we
don't yet have any proposals submitted, so I can't see what they look
like.

PDF isn't the most helpful format for review, as it's not simple for
us to "diff" two versions of a proposal in PDF form to see what's
changed, and having to reread a whole proposal every time you make some
changes is inefficient.

If you're working in some text source format (LaTeX, reStructuredText,
etc) then showing us the source is more helpful.  You might want to
just stick the source in a git repo, which means you can't lose it
if your computer crashes, and we can easily diff versions, etc.

You're welcome to send drafts to the mailing list.  Please don't send
them by private email to mentors though, as it's impossible for us to
track what's going on then.

You don't need to be concerned about plagiarism - it is very obvious
if we get two proposals with text in common, and it'll be clear who
actually wrote the text in question.  Passing off other people's
work as your own is particularly taboo in the FOSS world.

>  5) i am planning to take ticket #215 before mid term evaluation and ticket
> #400 or # 394 after it, please guide me if this is acceptable approach or
> suggest any changes.

You don't say much about your level of experience, but two tickets in
three months seems a little unambitious on the face of it.

We expect a proposal to analyse the work to be done and break it down
into small enough jobs that you can sensibly reason about how long they
might take.  I'd suggest taking each ticket in turn and doing that until
you think you have 3 months' worth of work.

It's also a good idea to include some "stretch goals", so you have a
plan for how to fill the time if things go faster than expected.

Cheers,
    Olly


