WorstFit: Unveiling Hidden Transformers in Windows ANSI
183 comments
·January 9, 2025
vessenes
This is a tough one. It's systemic: MS provides a "best fit" code mapping from wide Unicode to ASCII, which is a known, published, "vibes-based" mapper. This best-fit mapping is used in a lot of places, and I'm sure it's required for ongoing inclusion based on how MS views backward compatibility. It's linked in by default everywhere, whether or not you know you included it.
The exploits largely revolve around speccing an unusual code point that "vibes" into, say, a slash or a hyphen or quotes. These code points are typically evaluated one way (correct full Unicode evaluation) inside a modern programming language, but are vibes-downed when passed to shell commands or other Win32 API things. Crucially this happens after you check them, since it's when you've passed control.
To quote the curl maintainer, "curl is a victim" here — but who is the culprit? It seems certain that curl will be used to retrieve user-supplied data automatically by a server in the future. When that server mangles user input one way for validation and another way when applying it to system libraries, you're going to have a problem.
It seems to me like maybe the solution is to provide an opt-out of "best fit" munging in the Win32 space, but I'm not a Windows guy, so I speculate. At least then open source providers could just add the opt-out to best practices, and deal with the many terrible problems that things like a Unicode wide variant of " or \ deliver to them.
And of course even if you do that, you'll interact with officially shipped APIs and software that has not opted out.
wongarsu
The opt-out is to use the Unicode Windows APIs (the functions ending in "W" instead of "A"). This also magically fixes all issues with paths longer than 260 characters (if you add a "\\?\" prefix or set your manifest correctly), and has been available and recommended since Windows XP.
I'm not sure why the non-Unicode APIs are still so commonly used. I can't imagine it's out of a desire to support Windows 98 or Windows 2000.
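For illustration, a minimal sketch of the W-API route (the path here is hypothetical); the "\\?\" long-path form is only honored by the wide functions, and only with a fully qualified path:

    #include <windows.h>

    int main() {
        // Opens a path that may exceed MAX_PATH; the "A" functions
        // have no equivalent escape hatch.
        HANDLE h = CreateFileW(
            L"\\\\?\\C:\\some\\very\\long\\path\\file.txt",
            GENERIC_READ, FILE_SHARE_READ, nullptr,
            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
        return 0;
    }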
comex
_Or_ set your application to use UTF-8 for the "A" APIs. Apparently this is supported as of a Windows 10 update from 2019. [1]
[1] https://learn.microsoft.com/en-us/windows/apps/design/global...
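Per the linked docs, the opt-in is a fragment in the application manifest along these lines:

    <?xml version="1.0" encoding="utf-8"?>
    <assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
      <application>
        <windowsSettings>
          <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
        </windowsSettings>
      </application>
    </assembly>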
asveikau
It should have been supported approximately 20 years earlier than that. I was coding against Win32 looong before 2019 and wondering for years why they wouldn't let you.
An explanation I heard ~10 years prior is that doing so exposed bugs in the CRT and nobody wanted to fix them.
kazinator
Does that mean that in this UTF-8 mode, GetCommandLineA would, when the full-width double quote occurs in the command line, return the UTF-8 bytes for that double quote, rather than steamrolling it to an ASCII double quote with the WorstFit mapping?
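One way to see the steamrolling kazinator is asking about is to invoke the conversion directly. A sketch (the exact result depends on the active ANSI code page, e.g. 1252):

    #include <windows.h>
    #include <stdio.h>

    int main() {
        const wchar_t wide[] = L"\uFF02";  // FULLWIDTH QUOTATION MARK
        char narrow[8] = {0};

        // Default flags permit best-fit: U+FF02 typically comes back
        // as a plain ASCII double quote (0x22).
        WideCharToMultiByte(CP_ACP, 0, wide, -1, narrow,
                            sizeof(narrow), nullptr, nullptr);
        printf("default:     %02x\n", (unsigned char)narrow[0]);

        // WC_NO_BEST_FIT_CHARS opts out: unmappable characters become
        // the default char (usually '?') instead of a look-alike.
        WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS, wide, -1,
                            narrow, sizeof(narrow), nullptr, nullptr);
        printf("no best fit: %02x\n", (unsigned char)narrow[0]);
        return 0;
    }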
account42
UTF-8 ACP might fix these exploits but it doesn't fix the root issue that your application encoding can't represent the whole internal system encoding (WTF-16, NOT UTF-16 despite what it claims).
Sharlin
As mentioned elsewhere in this discussion, 99% of the time the cause is likely the use of standard C functions (or C++ `std::string`…) instead of MS's nonstandard wide versions. Which of course is a ubiquitous practice in portable command-line software like curl.
smatija
A lot of detail is in the linked curl HackerOne report: https://hackerone.com/reports/2550951
account42
std::string is not an issue, how you get strings from the environment into it is.
You can use W functions and convert the WTF-16 strings you get to WTF-8 and use that in std::string without problems.
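A sketch of that flow using WideCharToMultiByte. Note this produces plain UTF-8, not WTF-8, so unpaired surrogates are not preserved; a true WTF-8 conversion needs your own encoder:

    #include <windows.h>
    #include <string>

    // Fetch the command line via the W API and store it as UTF-8.
    // (Error handling omitted.)
    std::string command_line_utf8() {
        const wchar_t* cmd = GetCommandLineW();
        int len = WideCharToMultiByte(CP_UTF8, 0, cmd, -1,
                                      nullptr, 0, nullptr, nullptr);
        std::string out(len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, cmd, -1, out.data(), len,
                            nullptr, nullptr);
        out.resize(len - 1);  // drop the trailing NUL
        return out;
    }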
pishpash
So the culprit is still the software writer. They should have wrapped the C++ library for OS-specific behavior on Windows. Because they are publishing buggy software and calling it cross-platform.
Thorrez
>I'm not sure why the non-unicode APIs are still so commonly used.
Even argv is affected on Windows. That's part of the C and C++ standard, not really a Windows API. Telling all C/C++ devs they need to stop using argv is kind of a tough ask.
shakna
You also have to use wmain instead of main, with a wchar_t argv, otherwise the compiled-in argument parser will be calling the ANSI version. In other words... anyone using MSVC and the cross-platform, standardised, normal C setup is hit by this.
Oh, and wmain is a VisualC thing. It isn't found on other platforms. Not standardised.
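For reference, the MSVC-specific wide entry point looks like this; the arguments arrive as UTF-16 and never pass through the ANSI conversion:

    #include <wchar.h>

    // MSVC-specific; not part of standard C.
    int wmain(int argc, wchar_t* argv[]) {
        (void)argc; (void)argv;
        return 0;
    }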
mort96
Writing cross platform code which consistently uses UCS-2 wchar_t* on Windows and UTF-8 char* on UNIX-like systems sounds like absolute hell
vessenes
I think the issue is that native OS things like the Windows command line, say, don't always do this. Check the results of their 'cd' commands with Japanese Yen characters introduced. You can see that the path descriptor has somehow updated to a directory name with Yen (or a wide backslash) in it, while the file system underneath has munged them and put them into an actual directory. It's precisely the fact that you can't force the rest of the API surface to use W that is the source of the difficulties.
ack_complete
Using \\?\ has a downside: since it bypasses Win32's path processing, it also prevents relative paths like d:test.txt from working. Kind of annoying on the command line with tools like 7z.exe.
account42
Sounds more like an upside TBH.
cesarb
> I'm not sure why the non-unicode APIs are still so commonly used. I can't imagine it's out of a desire to support Windows 98 or Window 2000.
Nowadays, it's either for historical reasons (code written back when supporting Windows 9x was important, or even code migrated from Windows 3.x), or out of a desire to support non-Windows systems. Most operating systems use a byte-based multi-byte encoding (nowadays usually UTF-8) as their native encoding, instead of UTF-16.
asveikau
I share your recommendation of always using PWSTR when using Windows APIs.
> I'm not sure why the non-unicode APIs are still so commonly used
I think because the rest of the C world uses char* with utf-8, so that is what people are habituated to. Setting the ACP to CP_UTF8 would have solved a lot of problems, but I believe that's only been supported for a short period of time, bafflingly.
account42
> Setting the ACP to CP_UTF8 would have solved a lot of problems, but I believe that's only been supported for a short period of time, bafflingly.
It wouldn't solve all encoding problems though because most Windows APIs can store/return invalid UTF-16 which you can't represent in CP_UTF8 - you'd need a CP_WTF8 for that which doesn't even exist so you have to use the W APIs and do the conversion yourself.
captainmuon
Windows has had a way of opting out of legacy behavior since Windows XP - manifest files. If you don't include a manifest, even GetVersionEx will not return the current version IIRC. It shouldn't be too hard to add an opt-out in there (and at some point make it the default in Visual Studio).
I think what is also needed is some kind of linting - there is usually no need to call ANSI WinAPI functions in a modern application (unless you set the locale to UTF-8 and only use the 8-bit functions, but I don't know how well that works). I think there are also a couple of settings and headers to include to make everything "just work" - meaning argv, printf and std::cout work with UTF-8, you get no strange conversions, and you just have functions to convert between UTF-8 and UTF-16 to use WinAPI. I'm pretty sure I have a Visual Studio project lying around somewhere where it works. But all those steps necessary need to be documented and put in one place by MS.
account42
> If you don't include a manifest, even GetWindowsVersion will not return the current version IIRC.
Worse than that, even reading relevant registry keys will be faked.
Arwill
Using UTF8 internally and converting strings for W API calls is a way to gain some performance.
cryptonector
More like it's a way to keep your Windows port code to a minimum so that the rest can run on Unix. I.e., you want to use UTF-8 because that's the standard on Unix, and you don't want to have completely different versions of your code for Windows and Unix because now you have twice the maintenance trouble.
account42
*WTF-8, unless you don't mind being unable to handle all possible filenames.
account42
> To quote the curl maintainer “curl is a victim” here — but who is the culprit?
Security vulnerability or not, it's a bug in curl on Windows, as it doesn't correctly handle Unicode arguments.
UltraSane
The loosey-goosey mapping of code points to characters has always bothered me about Unicode.
cryptonector
This isn't about Unicode having "loosey-goosey" anything. It's about a mapping that Microsoft came up with to map Unicode to non-Unicode.
SAI_Peregrinus
Yeah, they could have mapped code points to their textual descriptions. That'd require reallocations, but converting ＂ to "UNICODE_FULLWIDTH_QUOTATION_MARK_U+FF02" would be unambiguous. Ugly, but obvious what happened. Better than � IMO!
mmastrac
This is kind of unsurprising, but still new to me even as someone who did Windows development (and some Wine API hacking) for a decade around when this W/A mess came about.
Windows is like the card game Munchkin, where a whole bunch of features can add up to a completely, unbelievably random over-powered exploit because of unintentional synergy between random bits.
I'm happy to see that they are converting the ANSI subsystem to UTF-8, which should, in theory, mitigate a lot of these problems.
I wonder if the Rust team is going to need YetAnotherFix to the process spawning API to fix this...
ChrisSD
Rust's standard library basically never uses ANSI APIs. The article doesn't demonstrate any attack that works against Rust. If they do have one I'd highly recommend reporting it.
Of course, Rust can't control what happens on the other side of a process boundary. So if an application invoked by Rust uses ANSI APIs then they'll have a problem. But also that's their responsibility.
okanat
What about the entry point? One of the issues mentioned in the article is that mainCRTStartup calls an ANSI API. Most Rust programs are linked with the C runtime. Does Rust make sure that the C library initialization is also done with Unicode APIs?
ChrisSD
No but it doesn't use C library values so it doesn't matter. E.g. getting the command line arguments is done via calling `GetCommandLineW` so it doesn't use argv or argc.
This is actually necessary because Rust cannot assume it owns the entry point. E.g. a Rust library could be called from a C++ application or in a DLL, etc. So when someone calls `std::env::args` it asks the OS directly for the arguments instead of getting them from C.
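C and C++ code that can't own the entry point can do the same thing; a sketch:

    #include <windows.h>
    #include <shellapi.h>  // CommandLineToArgvW (link with Shell32)

    void use_os_args() {
        // Ask the OS directly, bypassing the CRT's ANSI argv.
        int wargc = 0;
        wchar_t** wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
        if (wargv) {
            // ... use the UTF-16 arguments ...
            LocalFree(wargv);
        }
    }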
Joker_vD
> the only thing we can do is to encourage everyone, the users, organizations, and developers, to gradually phase out ANSI and promote the use of the Wide Character API,
This has been Microsoft's official position since NT 3.5, if I remember correctly.
Sadly, one of the main hurdles is the way Microsoft's own C/C++ runtime library (msvcrt.dll) is implemented. Its non-standard "wide" functions like _wfopen(), _wgetenv(), etc. internally use W-functions from Win API. But the standard, "narrow" functions like fopen(), getenv(), etc., instead of using the "wide" versions and converting to-from Unicode themselves (and reporting conversion failures), simply use A-functions. Which, as you see, generally don't report any Unicode conversion failures but instead try to gloss over them using best-fit approach.
And of course, nobody who ports software, written in C, to Windows wants to rewrite all of the uses of standard functions to use Microsoft's non-portable functions because at this point, it becomes a full-blown rewrite.
terinjokes
The position I got reading documentation Microsoft has written in the last two years is the opposite: set activeCodePage in your application manifest to UTF-8 and only ever use the "ANSI" functions.
ziml77
Yes that does seem to be the way going forward. Makes it a lot easier to write cross-platform code. Though library code still has to use the Wide Character APIs because it's up to the application as a whole to opt into UTF-8. Also if you're looking for maximal efficiency, the WChar APIs still make sense because it avoids the conversion of all the string inputs and outputs on every call.
terinjokes
Many libraries I've encountered have defines available now to use the -A APIs; previously they were using -W APIs and converting to/from UTF-8 internally.
As for my application, any wchar conversions being done by the runtime are a drop in the bucket compared to the actual compute.
account42
> Also if you're looking for maximal efficiency, the WChar APIs still make sense because it avoids the conversion of all the string inputs and outputs on every call.
OTOH you need ~twice as much memory / copy ~twice as much data around than if you converted to WTF-8 internally.
Joker_vD
Ah, so they've finally given up? Interesting to hear. But I guess app manifests do give them a way to move forward while maintaining the backward-compatible behaviour (for apps without this setting in their manifests).
dataflow
Despite whatever Microsoft may seem to be suggesting, you don't want to do this. Just use the wide APIs. Lots of reasons why UTF-8'ing the narrow APIs is a bad idea:
- The wide APIs accept and/or produce invalid UTF-16 in some places (like filesystems). There's no corresponding UTF-8 for invalid UTF-16, meaning there are cases that lose information and that you simply cannot handle. (A concrete sketch follows this list.)
- You have no control over all the DLLs loaded in your process. If a user DLL loads that can't handle UTF-8 narrow APIs, you're just praying it won't break.
- Some APIs simply don't have narrow versions. Like CommandLineToArgvW() or GetFileInformationByHandleEx() (e.g., FILE_NAME_INFO). You will not avoid wide APIs by doing this if you need to use enough of the APIs; you're just going to have to perform conversions that have dubious semantics anyway (see point #1 above).
- Compatibility with previous Windows versions, obviously.
- Performance
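The first bullet above is easy to demonstrate; a sketch using an unpaired surrogate (legal in a Windows filename, unrepresentable in UTF-8):

    #include <windows.h>
    #include <stdio.h>

    int main() {
        const wchar_t bad[] = { 0xD800, 0 };  // lone high surrogate
        char buf[8];

        // Strict conversion fails outright...
        int n = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                    bad, -1, buf, sizeof(buf),
                                    nullptr, nullptr);
        printf("strict: %d, error %lu\n", n, GetLastError());

        // ...while the lenient default silently substitutes U+FFFD,
        // losing the original name.
        WideCharToMultiByte(CP_UTF8, 0, bad, -1, buf, sizeof(buf),
                            nullptr, nullptr);
        return 0;
    }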
cesarb
> You have no control over all the DLLs loaded in your process. If a user DLL loads that can't handle UTF-8 narrow APIs, you're just praying it won't break.
I want to emphasize this point. From what I've heard, on Windows it's very common for DLLs from who knows where to end up loaded in your process. Not only the things you'd also find on other operating systems like the user-space component of graphics APIs like OpenGL and Vulkan, but also things like printer drivers, shell extensions, "anti-malware" stuff, and I've even heard of things like RGB LED control software injecting their DLLs into every single process. It's gotten so bad that browsers like Firefox and Chrome use fairly elaborate mechanisms to try to prevent arbitrary DLLs from being injected into their sandbox processes, since they used to be a common source of crashes.
account42
> The wide APIs accept and/or produce invalid UTF-16 in some places (like filesystems). There's no corresponding UTF-8 for invalid UTF-16. Meaning there are cases that lead to loss of information and that you simply cannot handle.
There's WTF-8 - too bad that's not what Microsoft chose to use for their universal 8-bit codepage.
cryptonector
Disagree. At least in the context of Unix utilities portable to Windows. We are NOT going to be forking those to use wchar_t on Windows and char on Unix (that's a non-starter), and we're also not going to be switching to wchar_t on both because wchar_t is a second-class citizen on Unix.
Using UTF-8 with the "A" Windows APIs is the only reasonable solution, and Microsoft needs to commit to that.
> - The wide APIs accept and/or produce invalid UTF-16 in some places (like filesystems). There's no corresponding UTF-8 for invalid UTF-16. Meaning there are cases that lead to loss of information and that you simply cannot handle.
This is also true of `char` on Unix systems. Yes, that means there will be loss of information for paths that have garbage in them. And again, if you want to write code for Windows _and_ Unix, using wchar_t won't spare you this loss on Unix. You're damned if you do and damned if you don't, so just accept this loss and say "don't do that".
> - You have no control over all the DLLs loaded in your process. If a user DLL loads that can't handle UTF-8 narrow APIs, you're just praying it won't break.
In some cases you do have such control, but if some DLL unknown to you uses "W" APIs then... it doesn't matter, because if it's unknown to you then you're not interacting with it, or if you are interacting with it via another DLL that is known to you then it's that DLL's responsibility to convert between char and wchar_t as needed. I.e., this is not your problem -- I get that other people's bugs have a way of becoming your problem, but strictly speaking it's their problem, not yours.
> - Some APIs simply don't have narrow versions. Like CommandLineToArgvW() or GetFileInformationByHandleEx() (e.g., FILE_NAME_INFO). You will not avoid wide APIs by doing this if you need to use enough of the APIs; you're just going to have to perform conversions that have dubious semantics anyway (see point #1 above).
True, but these can be wrapped with code that converts as needed. This is a lot better from a portability point of view than to fork your entire code into Windows and Unix versions.
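A sketch of such a wrapper (the helper name is made up): portable code keeps UTF-8 strings everywhere and converts only at the Win32 boundary.

    #include <windows.h>
    #include <string>

    // Hypothetical helper: open a file given a UTF-8 path.
    HANDLE open_utf8(const std::string& path) {
        int len = MultiByteToWideChar(CP_UTF8, 0, path.c_str(), -1,
                                      nullptr, 0);
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, path.c_str(), -1,
                            wide.data(), len);
        return CreateFileW(wide.c_str(), GENERIC_READ, FILE_SHARE_READ,
                           nullptr, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL, nullptr);
    }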
> - Compatibility with previous Windows versions, obviously.
Sigh. At some point people (companies, contractors/consultants, ...) need to put their feet down and tell the U.S. government to upgrade their ancient Windows systems.
> - Performance
The performance difference between UTF-8 and UTF-16 is in the noise, and it depends greatly on context. But it doesn't matter. UTF-8 could be invariably slower than UTF-16 and it would still be better to move Windows code to UTF-8 than to move Unix to UTF-16 or lose portability between Windows and Unix.
In case you and others had not noticed Linux has a huge share of the market on servers while Windows has a huge share of the market on laptops, which means that giving up on portability is not an option.
The advice we give developers here has to include advice we give to developers who have to write and look after code that is meant to be portable to Windows and Unix. Sure, if you're talking to strictly-Windows-only devs, the advice you give is alright enough, but if later their code needs porting to Unix they'll be sad.
The reality is that UTF-8 is superior to UTF-16. UTF-8 has won. There's just a few UTF-16 holdouts: Windows and JavaScript/ECMAScript. Even Java has moved to UTF-8. And even Microsoft seems to be heading in the direction of making UTF-8 a first-class citizen on Windows.
SleepyMyroslav
In gamedev a lot of people read those docs, but not a lot of them shipped anything using it. The reason is that file paths are not the only thing with A/W versions. There is user input, window message handling... The API is a maze.
I really would like to learn otherwise. But when I have to suggest fixes, my old opinion stands: dropping any C runtime use and going from the API macros or A versions to W is the solution to all the weird and hard-to-repro problems on platforms from MS.
7bit
Not a programmer. Wouldn't manifests risk the application breaking if the manifest is not copied with the exe file? As a power user, I see the manifests sometimes, but honestly, if I download e.g. bun.exe I would just copy the bun.exe without any manifest that the downloaded archive would contain.
That does not sound like a good solution.
lmz
You can embed manifests in the exe.
masfuerte
In my portable code I #define standard functions like main and fopen to their wide equivalents when building on Windows.
This does mean I can't just use char* and unadorned string literals, so I define a tchar type (which is char on Linux and wchar_t on Windows) and an _T() macro for string literals.
This mostly works without thinking about it.
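A sketch of that setup (names illustrative; note that MSVC's <tchar.h> already defines its own TCHAR and _T, so distinct names are used here):

    #include <stdio.h>

    #ifdef _WIN32
      typedef wchar_t tchar;
      #define T(s) L##s
      #define tmain wmain
      #define tfopen _wfopen
    #else
      typedef char tchar;
      #define T(s) s
      #define tmain main
      #define tfopen fopen
    #endif

    int tmain(int argc, tchar* argv[]) {
        FILE* f = tfopen(T("data.txt"), T("r"));
        if (f) fclose(f);
        return 0;
    }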
dblohm7
What really annoys me these days is that if you search for a Win32 API on Google, it will always come up with the -A variant, not the -W variant. I don't know if they've got something weird in their robots.txt or what, but I find it bizarre that an API whose guidelines desire developers to use the -W variants in all greenfield code, instead returns the legacy APIs by default.
ack_complete
They did a strange reorg of the API docs at one point. Not only does it now have functions split by A/W (mostly unnecessarily), it also groups them by header file instead of feature reference, which is kind of annoying. It used to be just that the function doc would note at the bottom if A/W variants were present and they were grouped under Functions in the feature/subsystem area of the docs tree.
dblohm7
Yeah, that new content management system is awful too -- it doesn't grok preprocessor stuff at all, so sometimes you get nonsensical struct definitions, kernel-mode structs instead of user-mode structs, etc.
delta_p_delta_x
> Microsoft's own C/C++ runtime library (msvcrt.dll) is implemented
This has been superseded by the Universal C runtime (UCRT)[1] which is C99-compliant.
pjmlp
Mostly C99 compliant, some things are left out.
https://learn.microsoft.com/en-us/cpp/c-runtime-library/comp...
sigsev_251
I think the documentation is outdated given that C11 atomics [1] and threads [2] have been available for more than a year now. Same goes for pretty much all MSVC-frontend-related stuff (I've yet to try which C++23 features are supported at the moment, but they've secretly added support for C23 features like typeof and attributes, as well as GNU statement expressions).
[1]: https://devblogs.microsoft.com/cppblog/c11-atomics-in-visual...
[2]: https://devblogs.microsoft.com/cppblog/c11-threads-in-visual...
nialv7
Windows really should provide an API that treats path names as just bytes, without any of this stupid encoding stuff. They could probably have done that when they introduced UNC paths.
Dwedit
Ever since Windows 95 Long File Names for FAT, filenames have been 16-bit characters in their on-disk format. So passing "bytes" means that they need to become wide characters before the filesystem can act on them. And case-sensitivity is still applied, stupidly enough, using locale-specific rules. (Change your locale, and you change how case-insensitive filenames work!)
It is possible to request for a directory to contain case-sensitive files though, and the filesystem will respect that. And if you use the NT Native API, you have no restrictions on filenames, except for the Backslash character. You can even use filenames that Win32 doesn't allow (name with a ":", name with a null byte, file named "con" etc), and every Win32 program will break badly if it tries to access such a file.
It's also possible to use unpaired surrogate characters (D800-DFFF without the matching second part) in a filename. Now you have a file on the disk whose name can't be represented in UTF-8, but the filename is still sitting happily in the filesystem. So people invented "WTF-8" encoding to allow those characters to be represented.
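For instance, this sketch creates such a file; NTFS stores the name happily, but it has no valid UTF-8 spelling (in WTF-8, the lone U+D800 would encode as the bytes ED A0 80):

    #include <windows.h>

    int main() {
        // L"\xD800" is an unpaired high surrogate.
        HANDLE h = CreateFileW(L"\xD800.txt", GENERIC_WRITE, 0, nullptr,
                               CREATE_NEW, FILE_ATTRIBUTE_NORMAL,
                               nullptr);
        if (h != INVALID_HANDLE_VALUE) CloseHandle(h);
        return 0;
    }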
cesarb
> And case-sensitivity is still applied, stupidly enough, using locale-specific rules. (Change your locale, and you change how case-insensitive filenames work!)
AFAIK, it's even worse: it uses the rules for the locale which was in use when the filesystem was created (it's stored in the $UpCase table in NTFS, or its equivalent in EXFAT). So you could have different case-insensitive rules in a single system, if it has more than one partition and they were formatted with different locales.
IMO, case-insensitive filesystems are an abomination; the case-insensitivity should have been done in the user interface layer, not in the filesystem layer.
userbinator
And of course making everything twice as big as it needs to be is also extremely repugnant.
Joker_vD
Not everyone uses Latin-based scripts, you know. Most of the symbols in the BMP (including Brahmic scripts) take two bytes in either UTF-8 or UTF-16, and CJK symbols take 3 bytes in UTF-8 instead of 2 in UTF-16. Emojis, again, are 4 bytes long in either encoding. So for the most people in the world, UTF-16 is either slightly more compact encoding, or literally the same as UTF-8.
account42
> Not everyone uses Latin-based scripts
Actually, everyone does use Latin-based scripts extensively. Maybe not exclusively, but almost all your text-like data intended to be consumed by programs will mainly be Latin-based scripts. So even for languages where you have characters that need 3 bytes in UTF-8 but two in UTF-16, you can still end up saving memory with UTF-8 because all the boilerplate syntax around your fancy characters is ASCII.
Dwedit
There are two ways to force the "Ansi" codepage to actually be UTF-8 for an application that you write (or an EXE that you patch).
One way is with a manifest file, which works as of a particular build of Windows 10. This can also be applied to any EXE after building it, so if you want a program to gain UTF-8 support, you can hack it in. Most useful for console-mode programs.
The other way is to use the hacks that "App Locale" type tools use. One way involves undocumented function calls from NTDLL. I'm not sure exactly which functions you need to call, but I think it might involve "RtlInitNlsTables" and "RtlResetRtlTranslations" (not actually sure).
layer8
> until Microsoft chooses to enable UTF-8 by default in all of their Windows editions.
I don’t know how likely this is. There are a lot of old applications that assume a particular code page, or assume 1 byte per character, that this would break. There are also more subtle variations of this, like applications assuming that converting from wide characters to ANSI can’t increase the number of bytes (and hence an existing buffer can be safely reused), which isn’t the case for UTF-8 (but for all, or almost all, existing code pages). It can open up new vulnerabilities.
It would probably cause much less breakage to remove the Best-Fit logic from the win32 xxxA APIs, and instead have all unmappable characters be replaced by a character without any common meta semantics, like “x”.
tambre
One example of such an application is Adobe After Effects [0]. Or at least used to be, I no longer use Windows.
[0] https://tambre.ee/blog/adobe_after_effects_windows_utf-8/
kgeist
Maybe they can introduce OS API versions (if there's no such thing yet) and require new (or updated) apps targeting new API versions/newer SDKs to assume UTF-8 by default? So everything below a certain API version is emulated legacy mode. Windows already has the concept of shims to emulate the behavior of different Windows versions.
layer8
Apps can already opt-in to UTF-8 for the ANSI APIs (see https://news.ycombinator.com/item?id=42649122), or use the wide-character APIs.
cryptonector
You already had this problem pre-UTF-8 in Windows: changing your default codepage could cause app fuckiness. So giving the user the option to use UTF-8 is reasonable. Making it the default is also reasonable given the problems that the Best-Fit mapping are causing, though Microsoft would have to do something to help users easily figure out how to run older code.
Another not-so-reasonable thing would be to drop all mappings to "special" ASCII characters from the Best-Fit mappings, though this wouldn't help apps linked statically with the CRT. Not so reasonable because it doesn't fix the vulnerabilities.
Sometimes security vulnerabilities motivate backwards-compatibility breakage.
garganzol
Microsoft was aware of this issue at least a year ago. I know this because they released a special code analysis rule, CA2101 [1], that explicitly discouraged the use of best-fit mapping. They mentioned security vulnerabilities in the rule's description, though they were purposefully vague about the details.
[1] https://learn.microsoft.com/en-us/dotnet/fundamentals/code-a...
cesarb
> However, resolving this problem isn't as simple as just replacing the main() with its wide-character counterpart. Since the function signature has been changed, maintainers would need to rewrite all variable definitions and argument parsing logic, converting everything from simple char * to wchar_t *. This process can be painful and error-prone.
You don't need to convert everything from char * to wchar *. You can instead convert the wide characters you received to UTF-8 (or to something like Rust's WTF-8, if you want to also allow invalid sequences like unpaired surrogates), and keep using "char" everywhere; of course, you have to take care to not mix ANSI or OEMCP strings with UTF-8 strings, which is easy if you simply use UTF-8 everywhere. This is the approach advocated by the classic https://utf8everywhere.org/ site.
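A minimal sketch of that approach (app_main here stands for the project's renamed main; plain UTF-8 conversion shown, so unpaired surrogates become U+FFFD rather than WTF-8):

    #include <windows.h>
    #include <stdlib.h>

    int app_main(int argc, char* argv[]);  // the old main(), renamed

    int wmain(int argc, wchar_t* wargv[]) {
        // Convert the UTF-16 arguments to UTF-8 once, at the boundary,
        // and run the unmodified char*-based program.
        char** argv = (char**)calloc(argc + 1, sizeof(char*));
        for (int i = 0; i < argc; i++) {
            int len = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                          nullptr, 0, nullptr, nullptr);
            argv[i] = (char*)malloc(len);
            WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, argv[i], len,
                                nullptr, nullptr);
        }
        return app_main(argc, argv);
    }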
segasaturn
I've been inadvertently safe from this bug on my personal Windows computer for years thanks to having the UTF-8 mode set, as shown at the bottom of the article. I had it set due to some old, foreign games showing garbled nonsense text on my computer. Have not noticed any bugs or side effects despite it being labelled as "Beta".
numpad0
Interesting; for me that checkbox has done nothing but crash too many random apps. I guess whether it works depends on the user's home codepage when it's off.
UltraSane
I just enabled the "Beta: Use Unicode UTF-8 for worldwide language support" option. Going to be interesting to see how many apps this breaks.
scoopr
I was wondering if the beta checkbox is the same thing as setting the ActiveCodePage to UTF-8 in the manifest, but the docs [0] clarify that GDI doesn't adhere to the per-process codepage, only a single global one, which is what the checkbox sets.
Bit of a shame that you can't fully opt in to UTF-8 with the *A APIs for your own apps. But for the issues highlighted in the post, I think it would still be a valid workaround/defence-in-depth measure.
[0] https://learn.microsoft.com/en-us/windows/apps/design/global...
lifthrasiir
Oh, my, freaking, god. I knew the Windows API provides this sort of best-fit conversion, but didn't realize it was the default behavior for several ANSI functions in my native code page (949 [1])! At this point they should just be banned, like gets.
[1] Yes, I know there is a UTF-8 code page (65001). That was really unusable for a long time and still is suffering compatibility issues to this day.
mouse_
Unicode on modern systems is absolutely terrifying. Anyone remember the black dot of death? https://mashable.com/article/black-dot-of-death-unicode-imes...
kazinator
HN, help! Before I dive into this, does anyone know whether this affects the argument parsing in Cygwin that prepares the arguments for a regular int main(int argc, char **argv)?
TXR Lisp uses wchar_t strings, and the "W" functions on Windows. So that's well and good. But it does start with a regular C main, relying on the Cygwin run-time for that.
If that's vulnerable, I will hack it to have its own argument parsing, using the wide char command line.
Maybe I should ask this on the Cygwin mailing list.
shakna
Cygwin has been using the W variant of most things. [0]
Unfortunately, they're using it on "__argv", not "__wargv".
Which means they are probably vulnerable to this.
[0] https://cygwin.com/cgit/newlib-cygwin/tree/winsup/cygwin/ker...
kazinator
Cygwin evidently contains its own versions of certain Win32 functions, like in this source file:
https://github.com/cygwin/cygwin/blob/main/winsup/cygwin/ker...
If we jump to the very bottom, we see that GetCommandLineA is one of the functions implemented in this file. It uses something called RtlUnicodeStringToAnsiString.
Microsoft declares that one in <wdm.h> and places it in NTOSKRNL.EXE.
That very function is mentioned in the submitted article as having the problem!
If Cygwin handles arguments through this function in preparation for main(), and it's the Microsoft one, it has the issue.
cryptonector
Starting with main() instead of wmain() is enough to make it vulnerable :(
kazinator
It's enough to make it suspicious, but there is a way for a C compiler on Windows to support regular main startup such that it is not vulnerable. It depends on which/whose piece of code takes the WCHAR-based command line and produces the char-based argv[]. Does that code do the "WorstFit" thing, or not.
In my program, I could fix it (if necessary) by taking the WCHAR command line and parsing it myself, using my own UTF8 encoding routines to make the multi-byte strings.
cryptonector
Yes, that's what I'd do: write a `wmain()` that converts UTF-16 arguments to UTF-8 then calls the real `main()` (which you'll have to rename).