[Xapian-discuss] Xapian and indexing text with layout

Sun Apr 30 12:43:08 BST 2006

Hi:
     First, sorry if this question has been discussed before; I did check
the list archives but I didn't find an answer. Second, I'm relatively new to
Xapian. I managed to install it, use the Python bindings and play with
the examples; it was all pretty straightforward.

I'm trying to index a large number of PDF documents, all coming from
publications like newspapers and magazines, where each PDF file has
multiple columns and a complex layout. That in itself is not a problem: my
PDFs are properly structured by the OCR.
Using pdftotext -layout I get the text with the original layout from the
PDF file and I pass it through stdin to the indexer; nothing complicated
there. For now I'm using the example indexer for Python (simpleindex.py) and
I'm getting confused by the paragraphs and document IDs.
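
For reference, the extraction step described above can be sketched roughly
as follows. This is only an illustration: the helper names
pdftotext_command and extract_text are made up for this example, and the
-enc UTF-8 option is added on the assumption that the indexer expects UTF-8
input. The -layout option and the "-" output argument (write to stdout) are
ordinary pdftotext arguments.

```python
import subprocess


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the pdftotext command line (hypothetical helper)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]  # UTF-8 output is an assumption
    if keep_layout:
        cmd.append("-layout")  # keep the original physical layout
    cmd += [pdf_path, "-"]  # "-" sends the extracted text to stdout
    return cmd


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext and return the text of the whole PDF as one string."""
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        stdout=subprocess.PIPE,
        check=True,  # raise if pdftotext exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("issue.pdf", keep_layout=False) would give the
column-to-paragraph output instead; "issue.pdf" is just a placeholder name.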

If I search for a term, it is found OK, but each paragraph is returned as a
separate document ID. That is not good, because for me and my application
logic each PDF is a single document.

If I remove the -layout option, pdftotext generates a pretty well organized
text file, converting each column to paragraphs. That is still handy,
because it preserves the original distance, so the "NEAR" search won't be
affected. But again, after indexing, the same problem: paragraphs are
indexed as separate documents.

I know that this is exactly where the code does it:

                    # At each point, find the next alnum character (i), then
                    # find the first non-alnum character after that (j). Find
                    # the first non-plusminus character after that (k), and if
                    # k is non-alnum (or is off the end of the para), set j=k.
                    # The term generation string is [i,j), so len = j-i
                    i = 0
                    while i < len(para):
                        i = find_p(para, i, p_alnum)
                        j = find_p(para, i, p_notalnum)
                        k = find_p(para, j, p_notplusminus)
                        if k == len(para) or not p_alnum(para[k]):
                            j = k
                        if (j - i) <= MAX_PROB_TERM_LENGTH and j > i:
                            term = stemmer.stem_word(string.lower(para[i:j]))
                            doc.add_posting(term, pos)
                            pos += 1
                        i = j
                    database.add_document(doc)

If I remove the part of the code that splits the input into paragraphs, will
that cause any problem with the indexing? I'm really getting confused.

I definitely need each PDF document to be represented as a single document
in Xapian; otherwise it will return duplicate hits for the same file.
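
One possible shape for this, as a minimal sketch only: tokenize the whole
text of a PDF in one pass, with a single running position counter, and
build one list of postings per file. The function name postings_for_text,
the regular expression and the value 64 for MAX_PROB_TERM_LENGTH are
assumptions made for this example; the regex only approximates the
alnum/plus-minus rule of the loop quoted above, for ASCII text. It does not
touch Xapian itself, so the stemmer is passed in as a plain function.

```python
import re

MAX_PROB_TERM_LENGTH = 64  # assumed value; use the one from simpleindex.py

# A term is a run of ASCII alphanumerics, optionally followed by a run of
# '+' / '-' characters that is not itself followed by an alphanumeric
# (so "C++" is one term, while "a+b" yields "a" and "b").
TERM_RE = re.compile(r"[A-Za-z0-9]+(?:[+-]+(?![A-Za-z0-9+-]))?")


def postings_for_text(text, stem=lambda word: word):
    """Return (term, position) pairs for the whole text of one PDF.

    Paragraph breaks are treated like any other separator, so positions
    keep increasing across the entire text.
    """
    postings = []
    pos = 0
    for match in TERM_RE.finditer(text):
        word = match.group(0)
        if len(word) <= MAX_PROB_TERM_LENGTH:
            postings.append((stem(word.lower()), pos))
            pos += 1
    return postings


# Intended use with the calls from the quoted code (not run here):
#   doc = xapian.Document()
#   for term, pos in postings_for_text(text, stemmer.stem_word):
#       doc.add_posting(term, pos)
#   database.add_document(doc)   # called once per PDF, not per paragraph
```

With this shape, add_document would be called once per PDF, so a search
should return at most one hit per file.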

Any suggestions or ideas?

Thank you.