How to use Xapian Omega directly (i.e., without using `recoll` and `xapiandb`) ... Full Set Of Questions Below:
Olly Betts
olly at survex.com
Fri Apr 26 05:23:54 BST 2024
On Mon, Apr 22, 2024 at 09:47:54AM +0530, Susmita/Rajib wrote:
> How can I use Xapian Omega directly (i.e., without using `recoll` and
> `xapiandb`) to index a directory of text files with all strings
> greater than 3 characters, to create an index text file typically
> occurs in the End of a Book, with location in specific files, without
> using Recoll database? I want to create an extensive list first with
> xapian omega, then have the list post-processed for all strings
> greater than 3 characters, along with the indexing data, as to where
> those texts appear in all specific text documents, like position,
> document name, etc. How could I include phrases? Could `omgrep` or
> `python` script be used for specific phrases?
Um, there isn't an "omgrep" tool...
You could write a Python script to extract the information needed to
do this from a Xapian database though.
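As a rough sketch of what such a script might look like (untested against your
data, so treat it as a starting point): it walks every term in the database and
records the documents and word positions where each one occurs. The function
name and the min_length parameter are just illustrative. It assumes the
database was built by omindex with positional information, and that terms
starting with a capital letter are prefixed terms (stemmed forms, boolean
filters) rather than plain words.

```python
from collections import defaultdict


def build_word_index(db, min_length=4):
    """Return {word: {docid: [positions]}} for plain words in the database.

    `db` is expected to behave like a xapian.Database: only allterms(),
    postlist() and positionlist() are used.
    """
    index = defaultdict(dict)
    for item in db.allterms():
        term = item.term
        # The Python 3 bindings return terms as bytes.
        if isinstance(term, bytes):
            term = term.decode("utf-8", "replace")
        # Skip short words. Also skip terms starting with a capital letter:
        # ASSUMPTION: those are prefixed terms, not words from the text.
        if len(term) < min_length or term[0].isupper():
            continue
        for posting in db.postlist(term):
            positions = list(db.positionlist(posting.docid, term))
            index[term][posting.docid] = positions
    return dict(index)
```

With the bindings installed, something like
build_word_index(xapian.Database("/path/to/db")) should give you a dictionary
you can sort and print (the path here is a placeholder).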
> 2. Once the text files are indexed, use the omindex command with the
> -i option to generate an extensive list of all words. example command:
> $omindex -d /path/to/index_directory -i > extensive_list.txt
> to generate a plain text file named extensive_list.txt containing all
> the words extracted from the indexed text files.
No, "-i" means "ignore meta robots tags and similar exclusions" -
omindex will still index, and this option only affects decisions about
which files to index.
If you want a command line tool to extract this sort of information from
the database, look at "xapian-delve" (part of xapian-core; at least for
Debian and Ubuntu it's in the xapian-tools binary package). You could
also do it from Python or any other supported language.
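For the plain word list you describe, a minimal Python sketch could look like
the following. It writes one word per line along with the number of documents
containing it. The function name is made up for illustration, and the filter
on capital letters again assumes that such terms are prefixed terms rather
than words from the text.

```python
def write_word_list(db, out, min_length=4):
    """Write 'word<TAB>document count' lines to `out`; return lines written.

    `db` is expected to behave like a xapian.Database (only allterms() is
    used) and `out` like an open text file.
    """
    written = 0
    for item in db.allterms():
        term = item.term
        # The Python 3 bindings return terms as bytes.
        if isinstance(term, bytes):
            term = term.decode("utf-8", "replace")
        # ASSUMPTION: terms starting with a capital letter are prefixed
        # (stemmed or boolean) terms, so they are left out of the list.
        if len(term) < min_length or term[0].isupper():
            continue
        out.write(f"{term}\t{item.termfreq}\n")
        written += 1
    return written
```

You would call it with an open xapian.Database and a file opened for writing,
e.g. open("extensive_list.txt", "w", encoding="utf-8").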
> But how do I create an index for a pre-determined set of phrases?
> Would I require a specific script using omgrep, like using?:
> $omgrep "my phrase" /path/to/index/directory
You can just run a query to find the documents matching a phrase, or
any other query.
The "quest" command line tool may be useful for that if you want an
existing tool whose output you can post-process.
E.g. this gives the top ten document ids matching the phrase:

    quest -d data/default '"stemming algorithm"' |sed 's/^\([0-9]\+\): \[.*\]$/\1/p;d'
It's probably better to write some Python using Xapian's Python
bindings rather than trying to parse output that was never intended
to be machine-readable.
The "getting started" guide shows how to write a simple search script:
https://getting-started-with-xapian.readthedocs.io/en/latest/
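For the phrase case, a sketch along these lines might do as a starting point.
The function names are illustrative only. It assumes the database has
positional information, that lowercasing and splitting on whitespace is close
enough to how the text was tokenised at index time, and that the document data
holds Omega-style "name=value" lines with the file location in a "url" field.

```python
def parse_fields(data):
    """Split 'name=value' lines of document data into a dict."""
    if isinstance(data, bytes):
        data = data.decode("utf-8", "replace")
    fields = {}
    for line in data.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            fields[name] = value
    return fields


def phrase_matches(db_path, phrase, limit=10):
    """Return up to `limit` (docid, url) pairs for documents with the phrase."""
    # Imported here so parse_fields() is usable without the bindings.
    import xapian

    db = xapian.Database(db_path)
    # ASSUMPTION: simple lowercasing and whitespace splitting matches the
    # terms in the index; OP_PHRASE requires the terms adjacent and in order.
    terms = phrase.lower().split()
    query = xapian.Query(xapian.Query.OP_PHRASE, terms)
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    results = []
    for match in enquire.get_mset(0, limit):
        fields = parse_fields(match.document.get_data())
        # ASSUMPTION: the "url" field identifies the indexed file.
        results.append((match.docid, fields.get("url", "")))
    return results
```

For example, phrase_matches("data/default", "stemming algorithm") would be the
rough equivalent of the quest command above.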
Cheers,
Olly