Indexing stack overflow

Wed Mar 20 21:40:27 GMT 2024

On Wed, Mar 20, 2024 at 11:08:36AM +0530, Sagar Acharya wrote:
> So, xapian and omega just converts documents into a searchable
> database right? It organizes the content.

Omega's omindex finds documents by recursing a directory tree,
determines if they need adding/updating, extracts text and feeds
it to the xapian-core C++ API.

The xapian-core C++ API takes text as input and indexes it into
a searchable database, which it then allows searching.

There are other parts to Omega such as scriptindex which provides
more customisable batch indexing, and a CGI search front-end
(also called omega).  There's no web crawler though.

You can also mix and match pieces of Omega with other things
using the xapian-core C++ API, so you could index with omindex
and search with your own frontend, or index with your own code
and search with the omega CGI.  There are a few conventions
Omega assumes about how the xapian-core API is used which the
API itself doesn't enforce.

Finally there are bindings to the C++ API for various other
programming languages, mostly in xapian-bindings but there are
a few maintained separately by other developers.

> My website below is fully in html and no javascript but today,
> practically all website use js. In such cases, in order to search for
> the right content, managing js is necessary.

Currently omindex will see HTML pages much like a browser with
Javascript disabled, so you'll get a poor indexing experience
if much of your content is built in the browser by Javascript.  It's
also worth bearing in mind you may get a similarly poor experience with
some web search engines, and building the content client-side also means
a slower page load for human users.  Even web search engines which
execute Javascript may penalise
pages with it (AIUI it's pretty common to give a ranking boost to pages
which load fast).

Potentially omindex could attempt to execute Javascript but it seems
complicated to do and I don't know of anybody with plans to work on it.

If you have such a site and want to index it with Omega, the best
approach is probably to write a bit of code to pull the content from
whatever your backend store is and write it out in the scriptindex dump
format (which is essentially stanzas of NAME=VALUE) and index it that
way.  If your backend store can provide a feed of new/changed documents
or tracks last modified timestamps you can do incremental updating with
this approach.
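
As a rough illustration of that approach, here's a minimal Python sketch
which turns records from a backend store into dump-format stanzas and
only emits those modified since the last run.  The field names (id,
title, text), the sample records and the helper names are all made up
for the example; the field names need to match whatever your index
script expects.  It assumes records are separated by a blank line and
that a newline inside a value is continued by starting the next line
with "=", so check that against the scriptindex documentation for the
version you're using.

```python
from datetime import datetime, timezone


def to_dump_record(fields):
    """Format one document as a stanza of NAME=VALUE lines."""
    lines = []
    for name, value in fields.items():
        # Assumption: a newline inside a value is continued by starting
        # the following line with '='.
        lines.append(name + "=" + str(value).replace("\n", "\n="))
    return "\n".join(lines) + "\n"


def dump_changed(docs, since):
    """Return dump text for the documents modified after `since`."""
    stanzas = [
        to_dump_record({"id": d["id"], "title": d["title"], "text": d["text"]})
        for d in docs
        if d["modified"] > since
    ]
    # Each stanza ends with a newline, so joining with another newline
    # leaves a blank line between records.
    return "\n".join(stanzas)


# Hypothetical records standing in for rows pulled from a backend store.
DOCS = [
    {
        "id": "post-1",
        "title": "Hello",
        "text": "First line\nSecond line",
        "modified": datetime(2024, 3, 1, tzinfo=timezone.utc),
    },
    {
        "id": "post-2",
        "title": "Old",
        "text": "Unchanged",
        "modified": datetime(2024, 1, 1, tzinfo=timezone.utc),
    },
]

# Hypothetical timestamp of the previous indexing run.
LAST_RUN = datetime(2024, 2, 1, tzinfo=timezone.utc)

if __name__ == "__main__":
    print(dump_changed(DOCS, LAST_RUN), end="")
```

You'd write that output to a file and pass it to scriptindex along with
the database and your index script.  Making the id field a unique term
in the index script should mean a changed document replaces its old
version rather than being added a second time.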

> Which web languages are targeted by xapian?

We have bindings for several languages commonly used for websites (e.g.
PHP, Perl, Python, Ruby) in xapian-bindings and there are some
third-party nodejs bindings.

We also support compilation with emscripten which allows running
Xapian in the browser - you can read more about that here:

https://blog.runbox.com/2019/01/the-secret-behind-runbox-7s-speed/

Cheers,
    Olly