Indexing stack overflow

Emmanuel Engelhart kelson at kiwix.org
Wed Mar 20 07:01:46 GMT 2024


I strongly suggest looking at https://github.com/openzim/sotoki.

On 20.03.24 06:38, Sagar Acharya wrote:
> I thank you for your help.
> 
> So, Xapian and Omega just convert documents into a searchable database, right? They organize the content.
> 
> My website below is plain HTML with no JavaScript, but today practically all websites use JS. In such cases, handling JS is necessary in order to search for the right content.
> 
> Which web languages are targeted by Xapian?
> 
> On 20 March 2024 03:14:29 GMT+05:30, Olly Betts <olly at survex.com> wrote:
>> On Tue, Mar 19, 2024 at 09:10:37PM +0530, Sagar Acharya wrote:
>>> I am using omindex to prepare a database.
>>>
>>> omindex has a way to index a local website, but what is the right way
>>> to index every subpage of Stack Overflow?
>>
>> We don't provide a crawler.
>>
>> The simple approach is just to mirror the site locally (e.g. wget
>> --mirror can do this but there may well be better options) and index
>> with omindex from that local mirror.  If the mirroring tool you use
>> supports incremental updates and only touches the timestamps of the
>> new/changed files then omindex should be able to incrementally update.
>> It'll have to scan the directory tree to find the new/changed files but
>> that's not usually the slow part.
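To make the mirror-then-index approach concrete, here is a minimal sketch that only builds the two command lines. The site URL, the directory names and the helper `build_commands` are hypothetical; `wget --mirror`, `--no-parent`, `--directory-prefix` and omindex's `--db` and `--url` are existing options, but check them against your installed versions before relying on this.

```python
import shlex
from pathlib import Path
from urllib.parse import urlsplit


def build_commands(site_url, mirror_root, db_path):
    """Return (wget_cmd, omindex_cmd) argument lists for a mirror-then-index run.

    Hypothetical helper: it only assembles the commands, it does not run them.
    """
    host = urlsplit(site_url).netloc
    wget_cmd = [
        "wget",
        "--mirror",        # recursive download with timestamping
        "--no-parent",     # stay below the starting URL
        "--directory-prefix", str(mirror_root),
        site_url,
    ]
    omindex_cmd = [
        "omindex",
        "--db", str(db_path),
        "--url", "/",      # base URL that the indexed directory maps to
        # Assumption: wget stores the pages under <prefix>/<host>/.
        str(Path(mirror_root) / host),
    ]
    return wget_cmd, omindex_cmd


# Example values are placeholders, not a recommendation to mirror a large site.
wget_cmd, omindex_cmd = build_commands("https://example.org/", "mirror", "site.db")
print(shlex.join(wget_cmd))
print(shlex.join(omindex_cmd))
```

Running the same pair of commands again later is what gives the incremental update described above, provided the mirroring step leaves the timestamps of unchanged files alone.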
>>
>> Or find an existing web crawler and write a bit of code to feed the
>> pages it crawls into the Xapian API.
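As a rough sketch of the "bit of code" that would sit between a crawler and Xapian: the crawler itself is left out, and `TextExtractor`, `extract` and `index_page` are hypothetical names. The text extraction uses only the Python standard library; `index_page` assumes the Xapian Python bindings are installed. The document data layout and the use of the `Q` and `S` prefixes follow common Xapian conventions but are assumptions here, not what Omega's templates necessarily expect.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the title and visible text, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # JavaScript and CSS are not executed or indexed
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def extract(html):
    """Return (title, text) with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    text = " ".join(" ".join(parser.text_parts).split())
    return title, text


def index_page(db, url, html):
    """Add or replace one crawled page in a xapian.WritableDatabase."""
    import xapian  # assumption: Xapian Python bindings are available

    title, text = extract(html)
    doc = xapian.Document()
    doc.set_data(url + "\n" + title)  # assumed layout, adapt to your front end

    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("en"))
    termgen.set_document(doc)
    termgen.index_text(title, 1, "S")  # title terms under the "S" prefix
    termgen.increase_termpos()         # keep phrases from spanning title/body
    termgen.index_text(text)

    # Unique ID term so a re-crawl replaces the page instead of duplicating it.
    # Note: very long URLs can exceed Xapian's term length limit.
    idterm = "Q" + url
    doc.add_boolean_term(idterm)
    db.replace_document(idterm, doc)
```

A caller would open the database once, for example with `xapian.WritableDatabase("pages.db", xapian.DB_CREATE_OR_OPEN)`, and call `index_page` for each URL and HTML body the crawler hands over.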
>>
>>> Which markups does xapian support, namely, html, javascript, reactjs,
>>> nodejs, etc.?
>>
>> Of those, only HTML is actually a markup language (and is supported).
>> We don't attempt to execute javascript in pages, but nodejs is server
>> side so would effectively be supported when crawling a website.
>>
>> There's a full list of supported formats in the Omega docs (search in
>> the page for `formats`):
>>
>> https://xapian.org/docs/omega/overview.html
>>
>> The code on git master supports a few additional formats, so it's worth
>> checking there if there's one you really want that isn't in that list.
>>
>> If there's an existing extractor for a format (can be a command line
>> tool, or git master also supports C/C++ libraries) then it shouldn't be
>> hard to hook up.  So if you really want client-side javascript support
>> then see if you can find a tool or library to render a webpage which
>> runs client-side javascript.
>>
>> Cheers,
>>     Olly
> 
> Thanking you
> Sagar Acharya
> https://dumbdevices.in

-- 
Kiwix - Wikipedia Offline & more
* Web: https://kiwix.org/
* Mastodon: https://mastodon.social/@kiwix
* Wiki: https://wiki.kiwix.org/


