[Xapian-discuss] get the title from the document
James Aylett
james-xapian at tartarus.org
Mon Nov 12 10:05:22 GMT 2012
On 5 Nov 2012, at 03:07, jack young <young.2004 at yahoo.com> wrote:
> Then another question comes up immediately: which part is going to be used for terms?
> For instance, in my json data, I store two parts:
> 1. filename
> 2. content (from file)
> Then given a specific keyword, the program is supposed to ONLY look for this keyword via the content, *NOT* via the filename. In other words, how can I build my database and search the information only from content?
Searches work using terms, so just don't put the filename in as a term or terms. Anything you put in document data is *not* used in searching.
> This is the typical code for building the index:
> ******************************
> # Load content
> content = open(filePath).read()
> # Get the file name
> fileName = os.path.basename(filePath)
> # save in json and document
> json_data = content + fileName
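I don't think you understand what JSON is: concatenating the two strings doesn't produce JSON, and you won't be able to separate the filename from the content again afterwards. You seem to be using Python, so check out the `json` module.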
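As a rough sketch of what that could look like with the standard `json` module: the helper names `make_document_data` and `read_document_data` and the field names `filename` and `content` are made up for illustration, not anything Xapian requires.

```python
import json
import os


def make_document_data(file_path, content):
    """Pack the file name and the content into one JSON string.

    The result is what you would pass to document.set_data().
    The keys "filename" and "content" are illustrative choices.
    """
    return json.dumps({
        "filename": os.path.basename(file_path),
        "content": content,
    })


def read_document_data(json_data):
    """Turn the stored JSON string back into a dict of fields."""
    return json.loads(json_data)
```

With this, `read_document_data(...)["filename"]` gives you the filename back at display time, which plain string concatenation can't do reliably.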
> document = xapian.Document()
> document.set_data(json_data)
>
> # Index document
> indexer.set_document(document)
> indexer.index_text(content)
> # Store indexed content in database
> database.add_document(document)
>
> ******************************
>
> what else do I need to process?
Nothing. You've indexed the content (using a `TermGenerator`, I'm assuming). All should be well.
> Do I need to change
> indexer.index_text(json_data)
>
> to:
> indexer.index_text(content)
You want to index the content, not whatever you've put in `json_data`. So your earlier code is correct.
> OR:
> doc.add_term(content)
This would (try to) add a single term containing the entire content. That generally won't work as there is a limit on the length of the term. It also doesn't make much sense with text documents. Using `TermGenerator` is the correct approach here.
J
--
James Aylett, occasional trouble-maker
xapian.org