Idea/Discussion of unicode filepath support for C++ STL on Windows


Emily Leiviskä
First off, I'm not sure I'm on the right list... if this is not the place please pardon me and
I would be much obliged if you could point me in the right direction.

I've exhausted my Google-Fu trying to find a good solution to reading files with Unicode
file paths and complex characters on MinGW. A short summary of what I've found:

1) MSVC provides non-standard wchar_t* overloads for the relevant file-opening methods
     in the C++ STL. These are not available in GCC/libstdc++ or MinGW. I can't find the link
     anymore, but I recall someone submitting a patch to libstdc++ that implemented
     these overloads; it was ultimately rejected.
2) One can use the "short name hack" to convert the name to DOS 8.3 format and pray that
     there will not be a name collision in the converted namespace. This is brittle at best, and
     I heard a rumour that this functionality might be going away in the future.
3) The Microsoft CRT provides _wfopen that takes UTF-16 encoded wchar_t paths which can
     be used for opening complex paths. One suggestion is to use this and go back to C-style IO.

Out of the above, #3 seems the most reliable but possibly the least desirable solution.
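
For reference, option 3 can be wrapped in a small helper. This is only a sketch under assumptions: the name `fopen_utf8` is made up for illustration, the input path is assumed to be UTF-8, and the non-Windows branch simply forwards to fopen.

```cpp
#include <cstdio>
#include <cstring>
#include <string>
#ifdef _WIN32
#include <wchar.h>
#include <windows.h>
#endif

// Hypothetical helper: open a file whose path is given as UTF-8.
// On Windows the path is converted to UTF-16 and handed to _wfopen;
// elsewhere the bytes are passed to fopen unchanged.
std::FILE* fopen_utf8(const std::string& path, const char* mode) {
#ifdef _WIN32
    // First call: ask how many wchar_t units are needed (includes the null).
    int wlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                   path.c_str(), -1, nullptr, 0);
    if (wlen == 0) return nullptr;  // invalid UTF-8 or other failure
    std::wstring wpath(static_cast<std::size_t>(wlen), L'\0');
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            path.c_str(), -1, &wpath[0], wlen) == 0)
        return nullptr;
    // fopen mode strings are plain ASCII, so widening char by char is enough.
    std::wstring wmode(mode, mode + std::strlen(mode));
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    return std::fopen(path.c_str(), mode);
#endif
}
```

The caller gets a plain FILE* back, so this stays in C-style IO territory, which is exactly the drawback of option 3.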

I've poked around the libstdc++ source and, after some detective work, found that fstreams
(through some intermediaries) eventually refer to `__basic_file` for the actual file IO operations.

See: https://github.com/gcc-mirror/gcc/blob/e11be3ea01eaf8acd8cd86d3f9c427621b64e6b4/libstdc%2B%2B-v3/config/io/basic_file_stdio.cc#L230

To me it looks like an easy patch to add something to the tune of:

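#if defined _WIN32
    auto __inlen = strlen(__file_name) + 1; // Include the null byte so it is converted too
    wchar_t* __buffer = new wchar_t [__inlen]; // UTF-8 input yields at most as many UTF-16 code units as bytes.
    if(0 == MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, __file_name, __inlen, __buffer, __inlen)){
        delete [] __buffer;
        set_fail(); // error path: report failure to the caller
        return;
    }
    wchar_t __wmode[8] = {0}; // _wfopen takes a wide mode string; fopen modes are short ASCII
    for(int __i = 0; __i < 7 && __c_mode[__i]; ++__i) __wmode[__i] = __c_mode[__i];
    _M_cfile = _wfopen(__buffer, __wmode);
    delete [] __buffer;
    if(_M_cfile)
#elif defined _GLIBCXX_USE_LFS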
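
As a sanity check on the buffer-size comment in the snippet above, here is a small sketch that counts the UTF-16 code units a UTF-8 string needs. The helper name `utf16_units_needed` is made up for illustration, and the input is assumed to be well-formed UTF-8 (no validation is done).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-16 code units needed to hold a
// well-formed UTF-8 string (not counting a terminating null).
std::size_t utf16_units_needed(const std::string& utf8) {
    std::size_t units = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) == 0x80) continue;  // continuation byte, already accounted for
        units += (b >= 0xF0) ? 2 : 1;      // 4-byte sequences become a surrogate pair
    }
    return units;
}
```

Each lead byte contributes one unit (two for a 4-byte sequence) and continuation bytes contribute none, so the result can never exceed the byte count, which is why a wchar_t buffer with one element per input byte is large enough.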
   
If I'm not mistaken, this change would make any function that accepts a const char* filename
UTF-8 aware. The downside is that it changes the current (undocumented and unspecified)
behaviour: const char* filenames would be interpreted as UTF-8 instead of in the current
Active Code Page.

This might break some applications that rely on this undocumented feature; however, it might also fix
some applications that currently assume that fstreams etc. are UTF-8 capable, as they are on Linux.

I'm looking for comments: what do you think of such a change, and is it worth trying to
get it into MinGW (or upstream)?

Would you find it justified to accept some (hopefully minor) breakage of undocumented behaviour wrt
ACP paths in exchange for UTF-8 support? That would be in line with Microsoft's recommendation to
use UTF-8 or UTF-16 when possible.
If not, would you consider it OK if it could be enabled by setting the locale or codepage? For example, if
std::locale() contains "UTF-8" then use the above conversion, otherwise use the old behaviour?
The standard locale on startup is "C" on Windows AFAICT. Other ideas?
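
To make the opt-in idea concrete, the check could look roughly like this. It is a sketch only: both function names are hypothetical, and matching the substring "UTF-8" in the locale name is an assumed convention, not something libstdc++ or MinGW defines.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: does a locale name advertise a UTF-8 codeset?
bool locale_name_is_utf8(const std::string& name) {
    return name.find("UTF-8") != std::string::npos;
}

// Hypothetical policy: treat const char* paths as UTF-8 only when the
// global C++ locale asks for it; otherwise keep the ACP behaviour.
bool paths_are_utf8() {
    return locale_name_is_utf8(std::locale().name());
}
```

With the startup locale "C", paths_are_utf8() is false, so existing programs would keep the old behaviour unless they set a UTF-8 locale themselves.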

(p.s. the archive search isn't working for me, search.gmane.org can't be reached)


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
MinGW-users mailing list
[hidden email]

This list observes the Etiquette found at
http://www.mingw.org/Mailing_Lists.
We ask that you be polite and do the same.  Disregard for the list etiquette may cause your account to be moderated.

_______________________________________________
You may change your MinGW Account Options or unsubscribe at:
https://lists.sourceforge.net/lists/listinfo/mingw-users
Also: mailto:[hidden email]?subject=unsubscribe

Re: Idea/Discussion of unicode filepath support for C++ STL on Windows

Eli Zaretskii
> From: Emily Leiviskä <[hidden email]>
> Date: Fri, 7 Oct 2016 07:59:18 +0000
>
> #if defined _WIN32
>     auto __inlen = strlen(__file_name) + 1; // Include the null byte so it is converted too
>     wchar_t* __buffer = new wchar_t [__inlen]; // UTF-8 input yields at most as many UTF-16 code units as bytes.
>     if(0 == MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, __file_name, __inlen, __buffer, __inlen)){
>         delete [] __buffer;
>         set_fail(); // error path: report failure to the caller
>         return;
>     }
>     wchar_t __wmode[8] = {0}; // _wfopen takes a wide mode string; fopen modes are short ASCII
>     for(int __i = 0; __i < 7 && __c_mode[__i]; ++__i) __wmode[__i] = __c_mode[__i];
>     _M_cfile = _wfopen(__buffer, __wmode);
>     delete [] __buffer;
>     if(_M_cfile)
> #elif defined _GLIBCXX_USE_LFS
>    
> If I'm not mistaken, this change would make any function that accepts a const char* filename
> UTF-8 aware. The downside is that it changes the current (undocumented and unspecified)
> behaviour: const char* filenames would be interpreted as UTF-8 instead of in the current
> Active Code Page.
>
> This might break some applications that rely on this undocumented feature; however, it might also fix
> some applications that currently assume that fstreams etc. are UTF-8 capable, as they are on Linux.
>
> I'm looking for comments: what do you think of such a change, and is it worth trying to
> get it into MinGW (or upstream)?
>
> Would you find it justified to accept some (hopefully minor) breakage of undocumented behaviour wrt
> ACP paths in exchange for UTF-8 support? That would be in line with Microsoft's recommendation to
> use UTF-8 or UTF-16 when possible.
> If not, would you consider it OK if it could be enabled by setting the locale or codepage? For example, if
> std::locale() contains "UTF-8" then use the above conversion, otherwise use the old behaviour?
> The standard locale on startup is "C" on Windows AFAICT. Other ideas?

Doing this only for file-open functionality is a start, but it is a
partial solution at best.  Applications that manipulate file names
almost always need to do string processing with file names, like
finding only the base part of a file name, constructing a full
absolute file name from a directory, a file name, and an extension,
comparing file names in a case-insensitive manner, etc.  All of this
will become subtly broken if you use UTF-8 encoded strings, because
Windows locales cannot use UTF-8 as their codeset, which means
functions like isalpha, isupper, strcasecmp, strcoll, mbstowcs,
etc. will not work for any non-ASCII character encoded as a UTF-8
sequence.
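
A small sketch of the problem, assuming the default "C" locale; the helper names are made up for illustration. Byte-oriented functions see a non-ASCII character only as separate UTF-8 bytes, none of which is classified as a letter or case-folded.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Count the bytes that isalpha accepts. The unsigned char loop variable
// avoids undefined behaviour for bytes >= 0x80.
std::size_t count_alpha_bytes(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if (std::isalpha(b)) ++n;
    return n;
}

// Naive case-insensitive comparison, one byte at a time.
bool bytewise_iequals(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        unsigned char x = a[i], y = b[i];
        if (std::tolower(x) != std::tolower(y)) return false;
    }
    return true;
}
```

In the "C" locale, count_alpha_bytes finds no letters in the UTF-8 encoding of "ä" (bytes C3 A4), and bytewise_iequals treats "Ä.txt" and "ä.txt" as different names even though "A.txt" and "a.txt" compare equal.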

So if we want to make MinGW Unicode-compatible, we need to have
locale-aware functions that support UTF-8, which means replacements
for all of them, starting with 'setlocale'.  Anything less than that
will get us a semi-broken implementation full of caveats.

I do agree that this is the right direction, though.  I just think
that more than a single API needs to be fixed for it to become a
reliable feature.


Re: Idea/Discussion of unicode filepath support for C++ STL on Windows

Emily Leiviskä
> Doing this only for file-open functionality is a start, but it is a partial solution at best.  Applications that manipulate
> file names almost always need to do string processing with file names, like finding only the base part of a file name,
> constructing a full absolute file name from a directory, a file name, and an extension, comparing file names in
> a case-insensitive manner, etc.  All of this will become subtly broken if you use UTF-8 encoded strings, because
> Windows locales cannot use UTF-8 as their codeset, which means functions like isalpha, isupper, strcasecmp,
> strcoll, mbstowcs, etc. will not work for any non-ASCII character encoded as a UTF-8 sequence.

Agreed that it's a partial solution and that a full solution would be desirable. I argue that MinGW is already subtly
broken because std::locale("") doesn't work. On MSYS2 it crashes because LANG=en_US.UTF-8 by default.
On CMD it reports "C" when it should in fact be something with a Latin-1 encoding, which is what my ACP is set to.

The input from std::cin is encoded in the ACP on CMD (which is expected), which will break isalpha and company
for anything other than 7-bit ASCII unless std::locale is fixed. The input from std::cin is encoded as UTF-8 on MSYS2,
so we have breakage anyway: we can't get the right locale, and isalpha etc. are equally broken in the default
settings. What's even weirder is that std::wcin stores each UTF-8 byte in the low byte of a wchar_t, leaving the rest
zero. That is totally unexpected and a waste of space, and it is inconsistent with WinAPI, which expects wchar_t to
hold UTF-16.
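
To illustrate the difference, here is a sketch that contrasts the byte-widening described above with a proper UTF-8 to UTF-16 conversion. Both function names are made up, char16_t stands in for the 16-bit wchar_t on Windows, and the decoder assumes well-formed input and does no validation.

```cpp
#include <cstddef>
#include <string>

// The behaviour described above: every input byte lands in its own wide unit.
std::u16string widen_bytes(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) out.push_back(static_cast<char16_t>(b));
    return out;
}

// What WinAPI expects: decode UTF-8 code points and re-encode them as UTF-16.
std::u16string utf8_to_utf16(const std::string& s) {
    std::u16string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = s[i];
        int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Payload bits of the lead byte: 7, 5, 4 or 3 depending on length.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (int k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

For "ä" (UTF-8 bytes C3 A4), widen_bytes yields the two units 00C3 00A4, while utf8_to_utf16 yields the single unit 00E4 that a wide WinAPI call would expect.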

So I argue that the proposed change will not make the situation any worse, but rather fix one of the already broken
APIs. Alternatively, we could provide the wchar_t overloads that MSVC has and preserve the old behaviour for chars.
 
> So if we want to make MinGW Unicode-compatible, we need to have locale-aware functions that support UTF-8,
> which means replacements for all of them, starting with 'setlocale'.  Anything less than that will get us a semi-broken
> implementation full of caveats.
>
> I do agree that this is the right direction, though.  I just think that more than a single API needs to be fixed for it to
> become a reliable feature.

Don't get me wrong, I would LOVE to see all the other broken APIs fixed and have MinGW be fully Unicode compatible.
However, I do believe that fixing the file open APIs is low-hanging fruit that would help a lot of people, even if it isn't
full-blown Unicode compatibility. And as was said, it is a first step. For example, we have a particular case where we get
Unicode file paths from Java into a JNI DLL compiled with MinGW; we just want to pass these strings through and
open the file.

Do you think there is a reasonable chance to get this into MinGW?



Re: Idea/Discussion of unicode filepath support for C++ STL on Windows

Eli Zaretskii
> From: Emily Leiviskä <[hidden email]>
> Date: Thu, 13 Oct 2016 09:41:22 +0000
>
> Don't get me wrong, I would LOVE to see all the other broken APIs fixed and have MinGW be fully Unicode compatible.
> However, I do believe that fixing the file open APIs is low-hanging fruit that would help a lot of people, even if it isn't
> full-blown Unicode compatibility. And as was said, it is a first step. For example, we have a particular case where we get
> Unicode file paths from Java into a JNI DLL compiled with MinGW; we just want to pass these strings through and
> open the file.
>
> Do you think there is a reasonable chance to get this into MinGW?

That's for Keith to answer, not me.  I presume that presenting a
working patch would be a useful first step for a practical discussion.
If the patch goes into libmingwex, it's entirely a MinGW decision,
AFAIU.  If you need to patch libstdc++, you will have to talk to GCC
maintainers, where that library is developed and maintained.

TIA
