Understanding Surrogate Pairs: Why Some Windows Filenames Can't Be Read
16 comments
February 24, 2025
qingcharles
Aha! Found the name of my next album. Try downloading me on Napster now!
n_plus_1_acc
I think it's hilarious that the event viewer XML gets borked.
feldrim
I am not 100% sure but mmc.exe has not been updated for years and it must be relying on WebBrowser control of Internet Explorer. Yes, IE is still alive in Windows.
https://learn.microsoft.com/en-us/previous-versions/windows/...
kzrdude
I remember that in Mac OS X times, sometime between OS X v10.1 and 10.4, a system upgrade caused a bunch of unicode named files to become inaccessible/untouchable (but still present with a directory listing). At the time I didn't have the skills to figure out what had happened. I'm still curious to know if it was an intended breaking change.
Devasta
Stuff like this is why UTF and any attempt at trying to encode all characters is a mistake.
The real solution is to force the entire world population to use the Rotokas language of Papua New Guinea.
theiebrjfb
Yet another reason to use Linux everywhere. It is 2025 and Windows (and probably Mac) users have to deal with weird Unicode filesystem issues. Good luck putting Chinese characters or emoticons into filenames.
Ext4 filenames have a maximum length of 255 bytes. That is the only legacy limit you have to deal with as a Linux user. And even that can be avoided by using more modern filesystems.
And we get filesystem level snapshots etc...
layer8
You have the same, if not worse, issue on Linux with filenames that aren’t valid UTF-8 sequences. Not to mention that on Linux switching the locale may change the interpretation of filenames as characters, which isn’t the case with NTFS.
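To make that concrete, here is a minimal Python sketch (not tied to any particular filesystem; the file names are just illustrative). It shows the same bytes read under two encodings, and a byte sequence that is not valid UTF-8 at all. The `surrogateescape` error handler is what Python itself uses to carry such names around without losing bytes.

```python
# The same on-disk bytes, interpreted under two different character encodings.
raw_name = "fusée.txt".encode("utf-8")       # b'fus\xc3\xa9e.txt'

as_utf8 = raw_name.decode("utf-8")           # 'fusée.txt'
as_latin1 = raw_name.decode("latin-1")       # 'fusÃ©e.txt' (mojibake)

# A name written by a Latin-1 program: 0xE9 is 'é' there, but invalid as UTF-8 here.
legacy_name = b"fus\xe9e.txt"


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw file name decodes as strict UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN,
# so the original bytes can be recovered exactly when encoding again.
escaped = legacy_name.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")
```

Here `is_valid_utf8(legacy_name)` is false, yet the kernel would store that name without complaint; only the program displaying it has to decide what to do with the stray byte.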
kwertzzz
> Not to mention that on Linux switching the locale may change the interpretation of filenames as characters, which isn’t the case with NTFS.
If you change the locale to an uninstalled one, then yes. But if the locale is installed, then I don't see a problem.
echo $LANG
# output: en_US.UTF-8
touch fusée.txt
LANG=fr_FR.UTF-8 ls
# output: 'fus'$'\303\251''e.txt'
sudo locale-gen fr_FR.UTF-8
sudo update-locale
LANG=fr_FR.UTF-8 ls
# output: fusée.txt
Are you maybe using non-UTF-8 locale?
feldrim
I see two points here. First, you did not read the article and did not see the footnote that these are valid in Linux as well.
Second, your comment shows you are lacking knowledge of Linux as well. Linux, as I have written in the footnote, accepts anything but 0x00 (null) and 0x2F (“/”). Other than that, all byte values are valid in paths. If you consider these a problem, I'd like to remind you that the 2048 surrogate code points are a really small subset of the unrenderable combinations allowed in Linux.
Anyone is free to have their opinion, but at least do your due diligence before making bold claims.
skissane
> In Linux, as I have written in the foot note, accepts anything but 0x00 (null) and 0x2F (“/”)
POSIX 2024 encourages (but doesn’t require) implementations to disallow newline in file names, returning EILSEQ if you try to create a new file or directory with a name containing a newline. Thus far Linux hasn’t adopted that recommendation, but I personally hope it does some day.
For backward compatibility, it would have to be a mount option. It could be done at VFS level so it applies to all filesystems.
Personally I would go even further and introduce a “require_sane_filenames” mount option, which would block you (at the VFS layer) from creating any file name containing invalid UTF-8 (including overlong sequences and UTF-8 encoded surrogates), C0 controls or (UTF-8 encoded) C1 controls.
Also I think it would be great if filesystems had a superblock bit that declared they only supported “sane filenames”. Then even accessing such a file would error because it would be a sign of filesystem corruption.
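As a rough userspace illustration of the rules listed for “require_sane_filenames”, here is a Python sketch. The function name `is_sane_filename` is hypothetical, and no such kernel option exists today; this only shows what the check would accept and reject. It relies on Python's strict UTF-8 decoder, which already refuses overlong sequences and UTF-8 encoded surrogates.

```python
def is_sane_filename(name: bytes) -> bool:
    """Hypothetical check: valid UTF-8, no C0 controls, no C1 controls."""
    try:
        # Strict decoding rejects invalid bytes, overlong forms,
        # and UTF-8 encoded surrogates (U+D800..U+DFFF).
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    for ch in text:
        cp = ord(ch)
        if cp <= 0x1F:              # C0 controls, including newline and NUL
            return False
        if 0x80 <= cp <= 0x9F:      # C1 controls (UTF-8 encoded as 0xC2 0x80..0x9F)
            return False
    return True
```

For example, `is_sane_filename(b"bad\nname")` and `is_sane_filename(b"\xed\xa0\x80")` (an encoded U+D800) are both false, while an ordinary UTF-8 name such as `"fusée.txt".encode("utf-8")` passes.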
feldrim
This I did not know. I know that ZFS has "utf8only" option, but not sure about others.
rob74
Or otherwise said: Surrogate Pairs are used in UTF-16 (which uses 16-bit code units, so a single unit can take only 65536 distinct values) to encode Unicode characters whose code points can't be encoded using just two bytes.
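The arithmetic behind that is short enough to sketch in Python. The helper names below are made up for illustration; the constants come from the UTF-16 definition (high surrogates U+D800..U+DBFF, low surrogates U+DC00..U+DFFF).

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # fits in 20 bits
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a valid (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For U+1F600 this gives the pair (0xD83D, 0xDE00), which matches the bytes Python produces for `"\U0001F600".encode("utf-16-be")`.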
feldrim
Yep. The quirk here is that the surrogates, that are merely enablers for other characters, can be paired with each other. With the absence of other valid characters, they are not enabling anything. One assumes there is a validation but it does not exist here.
layer8
There is no validation on the file system level because file names in NTFS are sequences of arbitrary 16-bit values, similar to how on Unix file systems, file names are sequences of arbitrary 8-bit values. Arguably the situation on Unix is worse, because there the interpretation and validity depends on the current locale.
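For anyone who wants to see the difference between "arbitrary 16-bit values" and well-formed UTF-16, here is a small Python sketch. It is only an illustration of the encoding rules, not of anything NTFS or Windows does internally, and both function names are made up.

```python
import struct


def is_well_formed_utf16(units: list[int]) -> bool:
    """Return True if every surrogate in the 16-bit sequence is properly paired."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no high surrogate before it.
            return False
        else:
            i += 1
    return True


def decodes_strictly(units: list[int]) -> bool:
    """Cross-check using Python's own strict UTF-16 decoder."""
    raw = struct.pack(f"<{len(units)}H", *units)   # little-endian 16-bit units
    try:
        raw.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False
```

A name made of two high surrogates, `[0xD800, 0xD800]`, fails both checks, while `[0xD83D, 0xDE00]` passes both. A file system that stores raw 16-bit values accepts either one.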
feldrim
Totally. These are design choices made by development teams. But as users, we "assume" all are readable until one day we learn that it does not work that way. Until I came across this issue, I assumed them to be all valid, renderable characters.
Hi all. OP here. I added a postscript about the surrogate pairs and their status in Linux. I used WSL to access those files under Windows, and generated the same on Linux. You can see that the behavior differs for the same file names:
1. On Windows, accessed by WSL
2. On Linux (WSL), using UTF-8 locale
3. On Linux (WSL), using POSIX locale
The difference is weird for me as a user. I'd like to know about the decisions made behind these. If anyone has information, please let me know.