xapian-core and Windows non-ASCII paths
jf at dockes.org
Tue Jun 9 07:24:56 BST 2020
Olly Betts writes:
> On Thu, Jun 04, 2020 at 12:49:58PM +0200, Jean-Francois Dockes wrote:
> > I am attaching a patch against the xapian-core 1.4 branch.
> Patches need to go to git master first (unless they're only relevant to
> 1.4.x, which this clearly isn't).
Understood. This was intended mostly as a proof of concept, to show how the
file interface calls would need to be changed for general Unicode path
access to work, and not intended for direct merging, as I mentioned.
> > The idea of the patch is that a conversion to a Windows Unicode wide char
> > string is attempted prior to performing a relevant system call. If the
> > conversion succeeds, the wide version of the call is used, else, the
> > previous narrow call is used. This should ensure that existing
> > applications are undisturbed, and provides a way to tunnel a Unicode path
> > by using utf-8.
> I think this needs input from people with deeper knowledge of this
Sure. That's also why I first asked if somebody on the list had an idea of
the right approach. When nobody answered, I just applied an equivalent of
the changes which were needed in Recoll.
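To make the idea concrete, here is a rough sketch of the convert-then-fall-back
logic described above. It is not the code from the patch: the names
utf8_to_utf16 and open_existing are made up for illustration, the decoder is
hand-written only so that the sketch is self-contained, and only the case of
opening an existing file is shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <fcntl.h>
#ifdef _WIN32
# include <io.h>
#else
# include <unistd.h>
#endif

// Hypothetical helper: strict UTF-8 to UTF-16 conversion.
// Returns false if the input is not well-formed UTF-8 (for example a path
// encoded in a legacy code page), so that the caller can fall back.
bool utf8_to_utf16(const std::string& in, std::u16string& out) {
    static const std::uint32_t min_cp[] = {0, 0x80, 0x800, 0x10000};
    out.clear();
    const std::size_t n = in.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (c < 0x80) { cp = c; extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else return false;  // stray continuation byte or invalid lead byte
        if (n - i - 1 < extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(in[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        // Reject overlong forms, surrogates and out-of-range values.
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) return false;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    return true;
}

// Hypothetical call site: open an existing file (no O_CREAT, so no mode).
int open_existing(const std::string& path, int flags) {
#ifdef _WIN32
    std::u16string wide;
    if (utf8_to_utf16(path, wide)) {
        // Valid UTF-8: use the wide variant of the call.
        const std::wstring w(wide.begin(), wide.end());
        return _wopen(w.c_str(), flags);
    }
    // Not UTF-8: keep the previous narrow behaviour.
    return _open(path.c_str(), flags);
#else
    return ::open(path.c_str(), flags);
#endif
}
```

A pure ASCII path is valid UTF-8 too, so it takes the wide branch and
behaves as before; only byte strings which are not valid UTF-8 reach the
narrow call.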
> The approach of patching every affected call site doesn't really seem
> workable to me - the maintenance and development overhead just seems too
> high. We do need platform-specific code for some things, but no other
> platform needs platform-specific code for something as pervasive as
> working on a filename. We'll just end up fighting an ongoing battle
> against newly introduced places that also need this special handling,
> and because it works fine without for common uses such issues can too
> easily go undetected for a long time (yours is the first report of this
> problem, but it's always been there).
> If it's really necessary to use these wide-character variants of
> everything which takes a filename, I think the only way to sensibly
> deal with that is to have a set of wrappers which present them as
> the non-wide variants to the rest of the code - that way this at
> least only needs addressing once per such function (though even that
> is a maintenance pain as a patch making use of a currently-unwrapped C
> library function taking a filename would require a new wrapper).
After reading a bit in this area, I have the impression that most
experienced people think that the only sane approach to non-ASCII file
names on Windows is to use the wide interfaces.
I would be very happy if someone could indicate another approach.
The problem has surfaced with Recoll, because it is an end-user tool, so I
have to make their life easy and work with their home directory as it
is. When Xapian is used with a web site or some other configuration set up
by professionals, obviously, they can avoid storing stuff in C:/users/시냇물.
A set of wrappers is how it is implemented in Recoll, and, yes, you end up
wrapping the whole file name interface from directory reading to opening
files to unlink etc., and you keep reintroducing bare calls which fail, and
the whole thing is a PITA. The Recoll wrappers are not pretty enough to be
reused, which is why I just changed the call sites in the patch.
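For comparison, a wrapper layer of the kind discussed above might look
roughly like this. This is only a sketch under assumptions: the pathwrap
namespace and the to_wide helper are invented names (they are neither the
Recoll wrappers nor anything in xapian-core), and only two calls are
wrapped. The rest of the code keeps calling narrow-looking functions.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
# include <windows.h>
# include <io.h>
#else
# include <unistd.h>
#endif

namespace pathwrap {  // hypothetical namespace, for illustration only

#ifdef _WIN32
// Strict UTF-8 to UTF-16 conversion using the Windows API.
// Returns false if the input is not valid UTF-8.
inline bool to_wide(const char* s, std::wstring& out) {
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  s, -1, nullptr, 0);
    if (len <= 0) return false;
    out.resize(len);
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, s, -1, &out[0], len);
    out.resize(len - 1);  // drop the terminating NUL counted by the API
    return true;
}
#endif

// Same signature as unlink(); the wide call is a hidden detail.
inline int unlink(const char* path) {
#ifdef _WIN32
    std::wstring w;
    if (to_wide(path, w)) return _wunlink(w.c_str());
    return _unlink(path);
#else
    return ::unlink(path);
#endif
}

// Same signature as rename(); falls back to the narrow call if either
// name is not valid UTF-8.
inline int rename(const char* from, const char* to) {
#ifdef _WIN32
    std::wstring wf, wt;
    if (to_wide(from, wf) && to_wide(to, wt))
        return _wrename(wf.c_str(), wt.c_str());
#endif
    return std::rename(from, to);
}

}  // namespace pathwrap
```

Each wrapped function is dealt with once, but every file name call which is
not wrapped yet remains a bare narrow call, which is exactly the
maintenance issue mentioned above.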
This said, Xapian does far fewer file manipulations than Recoll, as
demonstrated by the limited number of changes in the patch, so I don't
think that this should be really unworkable. As another consequence, it
should be easy for me to maintain the thing for my own use.
It all depends on how fully you want to support the platform.
I guess that there is a nice library somewhere to do this, and actually I
think it's std::filesystem, which was unfortunately not really there when I
> In terms of workarounds, simply changing directory to where the database
> lives and then using a relative non-wide path should work.
It quite probably would, assuming that Xapian never computes an
absolute path (which you know, but I don't without scanning the code), and
also that there are no getcwd/chdir Windows pitfalls waiting for me...
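On the application side, the workaround could be sketched as below, assuming
C++17 std::filesystem is available. ScopedChdir is a made-up name, and
whether a relative path is then enough for Xapian is precisely the open
question above.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical RAII guard: change to a directory, restore the previous
// working directory when the guard goes out of scope.
class ScopedChdir {
    fs::path saved_;
public:
    explicit ScopedChdir(const fs::path& dir) : saved_(fs::current_path()) {
        fs::current_path(dir);  // throws fs::filesystem_error on failure
    }
    ~ScopedChdir() {
        std::error_code ec;
        fs::current_path(saved_, ec);  // must not throw from a destructor
    }
    ScopedChdir(const ScopedChdir&) = delete;
    ScopedChdir& operator=(const ScopedChdir&) = delete;
};
```

The application would create the guard on the database directory (built
from UTF-8 with fs::u8path() in C++17, for example) and then hand Xapian a
relative name such as ".". One known pitfall: the working directory is
process-wide, so this is not safe if other threads use relative paths at the
same time.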