Indexing stack overflow

Olly Betts olly at survex.com
Tue Mar 19 21:44:29 GMT 2024


On Tue, Mar 19, 2024 at 09:10:37PM +0530, Sagar Acharya wrote:
> I am using omindex to prepare a database.
> 
> While omindex has a way to index a local website, what is the right way
> to index every subpage of stackoverflow?

We don't provide a crawler.

The simple approach is just to mirror the site locally (e.g. wget
--mirror can do this but there may well be better options) and index
with omindex from that local mirror.  If the mirroring tool you use
supports incremental updates and only touches the timestamps of the
new/changed files then omindex should be able to incrementally update.
It'll have to scan the directory tree to find the new/changed files but
that's not usually the slow part.
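
As a rough illustration of the mirror-then-index approach, here is a minimal
Python sketch that drives wget and omindex.  The site URL, the directory names
and the helper function names are all made up for the example.  It assumes the
site lives at the root of its host on the default port, so that wget stores it
under a directory named after the hostname.

```python
import subprocess
from pathlib import Path
from urllib.parse import urlsplit


def wget_mirror_command(site_url, mirror_root):
    """Build a wget command line that mirrors site_url under mirror_root."""
    return [
        "wget",
        "--mirror",       # recursive download with timestamping
        "--no-parent",    # never climb above the starting URL
        "--wait=1",       # pause between requests to be polite
        f"--directory-prefix={mirror_root}",
        site_url,
    ]


def omindex_command(site_url, mirror_root, db_path):
    """Build an omindex command line that indexes the local mirror."""
    # Assumption: wget saved the pages in <mirror_root>/<hostname>/.
    site_dir = Path(mirror_root) / urlsplit(site_url).hostname
    return [
        "omindex",
        "--db", str(db_path),
        "--url", site_url,   # base URL which the directory represents
        str(site_dir),
    ]


if __name__ == "__main__":
    site = "https://example.org/"   # hypothetical site
    # wget exits non-zero if any single request failed (e.g. a dead link),
    # so its exit status is deliberately not treated as fatal here.
    subprocess.run(wget_mirror_command(site, "mirror"))
    subprocess.run(omindex_command(site, "mirror", "db/example"), check=True)
```

Running the script again later repeats both steps, which is where the
incremental behaviour described above should come into play.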

Or find an existing web crawler and write a bit of code to feed the
pages it crawls into the Xapian API.
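
A minimal sketch of what that bit of code might look like, using the Xapian
Python bindings.  It assumes the crawler hands over (url, html) pairs; the
function names, the "Q"-prefixed ID term and storing just the URL as the
document data are choices made for this example, not something omindex or
Omega requires, so a database built this way is only meant to be searched via
the Xapian API directly.

```python
import hashlib
from html.parser import HTMLParser

# Conservative cap; Xapian's disk-based backends limit term length to
# roughly 245 bytes.
MAX_TERM_BYTES = 240


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text content of an HTML string, space-separated."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def unique_term(url):
    """Build an ID term for a URL, hashing URLs too long to be a term."""
    term = "Q" + url
    if len(term.encode("utf-8")) > MAX_TERM_BYTES:
        term = "Q" + hashlib.sha1(url.encode("utf-8")).hexdigest()
    return term


def index_pages(db_path, pages):
    """Index an iterable of (url, html) pairs into a Xapian database."""
    # Imported here so the helpers above can be used without the bindings.
    import xapian

    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("en"))
    for url, html in pages:
        doc = xapian.Document()
        termgen.set_document(doc)
        termgen.index_text(html_to_text(html))
        doc.set_data(url)
        idterm = unique_term(url)
        doc.add_boolean_term(idterm)
        # Replaces any existing document for this URL, so re-crawled
        # pages are updated rather than duplicated.
        db.replace_document(idterm, doc)
    db.commit()
    db.close()
```

Keying each document on a term derived from its URL is what lets a re-crawl
update pages in place.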

> Which markups does xapian support, namely, html, javascript, reactjs,
> nodejs, etc.?

Of those, only HTML is actually a markup language (and is supported).
We don't attempt to execute javascript in pages, but nodejs is server
side so would effectively be supported when crawling a website.

There's a full list of supported formats in the Omega docs (search in
the page for `formats`):

https://xapian.org/docs/omega/overview.html

The code on git master supports a few additional formats, so it's worth
checking there if the one you really want isn't in that list.

If there's an existing extractor for a format (can be a command line
tool, or git master also supports C/C++ libraries) then it shouldn't be
hard to hook up.  So if you really want client-side javascript support
then see if you can find a tool or library to render a webpage which
runs client-side javascript.
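
For the command line tool case, a sketch of the hook-up might look like the
following, which builds an omindex command line using its --mime-type and
--filter options.  The ".foo" extension, the "text/x-foo" MIME type and the
"foo2txt" extractor are hypothetical, and it assumes the extractor takes the
filename as its argument and writes plain text to stdout; check
`omindex --help` for the exact option syntax in your version.

```python
def omindex_with_filter(db_path, base_url, directory,
                        extension, mime_type, extractor):
    """Build an omindex command line which uses an external extractor."""
    return [
        "omindex",
        "--db", db_path,
        "--url", base_url,
        # Map the file extension to a MIME type...
        f"--mime-type={extension}:{mime_type}",
        # ...and tell omindex which command extracts text for that type.
        f"--filter={mime_type}:{extractor}",
        directory,
    ]


if __name__ == "__main__":
    import subprocess

    # All of these values are placeholders for illustration only.
    command = omindex_with_filter(
        "db/example", "/", "mirror/example.org",
        "foo", "text/x-foo", "foo2txt",
    )
    subprocess.run(command, check=True)
```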

Cheers,
    Olly



More information about the Xapian-discuss mailing list