How to use Xapian Omega directly (i.e., without using `recoll` and `xapiandb`) ... Full Set Of Questions Below:

Susmita/Rajib bkpsusmitaa at gmail.com
Mon Apr 22 05:17:54 BST 2024


Dear senior ML members and developers of Xapian Omega,

Mr. Olly has helped me get over the initial bump of the learning curve.
(ref: https://lists.xapian.org/pipermail/xapian-discuss/2024-April/010034.html)

How can I use Xapian Omega directly (i.e., without using `recoll` and
`xapiandb`, and without a Recoll database) to index a directory of text
files and produce an index text file like the one typically found at
the end of a book, covering all strings longer than 3 characters and
giving their locations in the specific files? I want to create an
extensive list with Xapian Omega first, then post-process it to keep
all strings longer than 3 characters together with the indexing data
showing where those strings appear in each text document (position,
document name, etc.). How could I include phrases? Could `omgrep` or a
`python` script be used for specific phrases?

Would the following steps help, or is my plan wrong?

1.  Index the text files within a directory: use the omindex command
to index the text files. An example command to index all text files
in a directory:
$ omindex -d /path/to/index_directory /path/to/text/files/directory

2.  Once the text files are indexed, use the omindex command with the
-i option to generate an extensive list of all words. Example command:
$ omindex -d /path/to/index_directory -i > extensive_list.txt
This should generate a plain text file named extensive_list.txt
containing all the words extracted from the indexed text files.

3.  Post-processing for strings longer than 3 characters: after
generating the extensive list, post-process it to keep only the
strings longer than 3 characters. I can use various tools and
scripting languages like grep, awk, or Python to accomplish this task.
For example, using grep:
$ grep -E '\b\w{4,}\b' extensive_list.txt > filtered_list.txt
This command should, in theory, select the words with more than 3
characters from extensive_list.txt and save the result in
filtered_list.txt.
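
To make the intended result of steps 1 to 3 concrete, here is a minimal
sketch in plain Python. It does not use Xapian or Omega at all; it only
illustrates the kind of back-of-book listing being asked for. The names
`build_index` and `format_index`, the `*.txt` file pattern, the UTF-8
encoding, and the `file:line:column` output layout are all assumptions
made for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# Runs of 4 or more word characters, i.e. strings longer than 3 characters.
# Note that \w also matches digits and underscores.
WORD_RE = re.compile(r"\b\w{4,}\b")


def build_index(directory):
    """Map each lowercased word to a list of (file name, line, column)."""
    index = defaultdict(list)
    # Assumption: the documents are the *.txt files directly in `directory`.
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.splitlines(), start=1):
            for match in WORD_RE.finditer(line):
                # Columns are reported 1-based, like line numbers.
                index[match.group().lower()].append(
                    (path.name, line_no, match.start() + 1)
                )
    return dict(index)


def format_index(index):
    """Render the index alphabetically, one word per line."""
    lines = []
    for word in sorted(index):
        places = ", ".join(
            f"{name}:{line}:{col}" for name, line, col in index[word]
        )
        lines.append(f"{word}  {places}")
    return "\n".join(lines)
```

Calling `print(format_index(build_index("/path/to/text/files/directory")))`
would then list every word of four or more characters with the places
where it occurs.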

But how do I create an index for a predetermined set of phrases?
Would I need a specific script using omgrep, for example:
$ omgrep "my phrase" /path/to/index/directory
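
As an illustration of what a phrase index could look like, here is a
minimal plain-Python sketch that does not depend on Xapian, Omega, or
`omgrep`. The function name `find_phrases`, the `*.txt` file pattern,
and the choice to match case-insensitively and across line breaks are
assumptions made for illustration only.

```python
import re
from pathlib import Path


def find_phrases(directory, phrases):
    """Return {phrase: [(file name, line number), ...]} for each phrase."""
    hits = {phrase: [] for phrase in phrases}
    # Build one pattern per phrase; any run of whitespace (including a
    # line break) is accepted between the words of the phrase.
    patterns = {
        phrase: re.compile(
            r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b",
            re.IGNORECASE,
        )
        for phrase in phrases
    }
    # Assumption: the documents are the *.txt files directly in `directory`.
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for phrase, pattern in patterns.items():
            for match in pattern.finditer(text):
                # Line number of the first word of the match (1-based).
                line_no = text.count("\n", 0, match.start()) + 1
                hits[phrase].append((path.name, line_no))
    return hits
```

For example, `find_phrases("/path/to/text/files/directory", ["my phrase"])`
would report each file and line where "my phrase" begins.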

Please suggest detailed code examples, bearing in mind that I am a novice.

Best wishes,
Rajib
Etc.


