How to use Xapian Omega directly (i.e., without using `recoll` and `xapiandb`) ... Full Set Of Questions Below:
Susmita/Rajib
bkpsusmitaa at gmail.com
Mon Apr 22 05:17:54 BST 2024
Dear senior ML members and developers of Xapian Omega,
Mr. Olly has helped me cross the bump of the initial learning curve.
(ref: https://lists.xapian.org/pipermail/xapian-discuss/2024-April/010034.html)
How can I use Xapian Omega directly (i.e., without `recoll` and
`xapiandb`, and without a Recoll database) to index a directory of
text files and produce an index text file of the kind that typically
appears at the end of a book, listing every string longer than 3
characters together with its location in the specific files? I want
to first create an extensive list with Xapian Omega, then post-process
that list to keep all strings longer than 3 characters, along with the
indexing data showing where those strings appear in each text document
(position, document name, etc.). How could I include phrases? Could
`omgrep` or a `python` script be used for specific phrases?
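
To make the intended output concrete, here is a minimal sketch of such a back-of-book style index in plain Python. It does not use Xapian or Omega at all; it only illustrates the kind of listing described above. The function names `build_index` and `write_index` are hypothetical, and the sketch assumes the directory holds UTF-8 `*.txt` files and that a "string" means a run of word characters.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumption: a "string" is a run of 4 or more word characters.
WORD_RE = re.compile(r"\w{4,}")


def build_index(directory):
    """Map each lowercased word to a list of (file name, line, column)."""
    index = defaultdict(list)
    # Assumption: only *.txt files directly inside `directory` are indexed.
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.splitlines(), start=1):
            for match in WORD_RE.finditer(line):
                # Columns are 1-based, like the line numbers.
                place = (path.name, line_no, match.start() + 1)
                index[match.group().lower()].append(place)
    return index


def write_index(index, out_path):
    """Write one line per word: 'word: file:line:column, ...'."""
    with open(out_path, "w", encoding="utf-8") as out:
        for word in sorted(index):
            places = ", ".join(
                f"{name}:{line}:{col}" for name, line, col in index[word]
            )
            out.write(f"{word}: {places}\n")
```

For example, `write_index(build_index("/path/to/text/files/directory"), "book_index.txt")` would write one alphabetically sorted entry per word, each followed by its file:line:column locations.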
Would the following steps help, or am I planning this wrongly?
1. Index the text files within a directory: use the omindex command
to index the text files. An example command to index all text files
in a directory:
$omindex -d /path/to/index_directory /path/to/text/files/directory
2. Once the text files are indexed, use the omindex command with the
-i option to generate an extensive list of all words. An example command:
$omindex -d /path/to/index_directory -i > extensive_list.txt
This would generate a plain text file named extensive_list.txt
containing all the words extracted from the indexed text files.
3. Post-process for strings longer than 3 characters: after
generating the extensive list, post-process it to keep only the
strings longer than 3 characters. I can use various tools and
scripting languages like grep, awk, or Python to accomplish this task.
For example, using grep:
$grep -E '\b\w{4,}\b' extensive_list.txt > filtered_list.txt
This command should, in theory, keep only the words with more than 3
characters from extensive_list.txt and save the result in
filtered_list.txt.
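
One caveat on this step: without the `-o` option, grep prints every whole line that contains a match, not the matching words themselves. As a hedged alternative, the filtering could be sketched in Python as below. The function name `long_words` is hypothetical, and the sketch assumes the input is plain text with words separated by non-word characters.

```python
import re

# Runs of 4 or more word characters, i.e. strings longer than 3 characters.
WORD_RE = re.compile(r"\w{4,}")


def long_words(lines):
    """Return the sorted, distinct, lowercased words of 4+ characters."""
    found = set()
    for line in lines:
        found.update(word.lower() for word in WORD_RE.findall(line))
    return sorted(found)
```

Since an open text file iterates line by line, the function can be given the file object for extensive_list.txt directly, and the returned list can then be written to filtered_list.txt one word per line.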
But how do I create an index for a predetermined set of phrases?
Would I require a specific script using omgrep, such as the following?
$omgrep "my phrase" /path/to/index/directory
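
For the Python route, a minimal sketch of locating a fixed set of phrases could look like the following. It is plain Python, not Xapian or Omega. The function name `find_phrases` is hypothetical, the sketch assumes UTF-8 `*.txt` files, and it only finds phrases that sit on a single line (a phrase broken across a line ending is missed).

```python
import re
from pathlib import Path


def find_phrases(directory, phrases):
    """Yield (phrase, file name, line number, column) for each occurrence."""
    # Build one case-insensitive pattern per phrase. Any run of whitespace
    # may separate the words, and \b keeps "my phrase" from matching
    # inside "tommy phrases".
    patterns = {
        phrase: re.compile(
            r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b",
            re.IGNORECASE,
        )
        for phrase in phrases
    }
    # Assumption: only *.txt files directly inside `directory` are searched.
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.splitlines(), start=1):
            for phrase, pattern in patterns.items():
                for match in pattern.finditer(line):
                    # Columns are 1-based, like the line numbers.
                    yield phrase, path.name, line_no, match.start() + 1
```

Calling `list(find_phrases("/path/to/text/files/directory", ["my phrase"]))` would return one tuple per occurrence, which could be merged into the same listing as the single-word entries.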
Please suggest extensive code-lines, considering me a novice.
Best wishes,
Rajib
Etc.