Smuggling arbitrary data through an emoji

205 comments

·February 12, 2025

riskable

Oh this is just the tip of the iceberg when it comes to abusing Unicode! You can use a similar technique to this to overflow the buffer on loads of systems that accept Unicode strings. Normally it just produces an error and/or a crash but sometimes you get lucky and it'll do all sorts of fun things! :)

I remember doing penetration testing waaaaaay back in the day (before Python 3 existed) and using mere diacritics to turn a single character into many bytes that would then overflow the buffer of a back-end web server. This only ever caused it to crash (and usually auto-restart) but I could definitely see how this could be used to exploit certain systems/software with enough fiddling.

sourque

This was the premise of a Google CTF quals 2024 challenge ("encrypted runner").

capitainenemo

Yeah. Zalgo text is a common test for input fields on websites. But it usually doesn't do anything interesting. Maybe an exception trigger on some database length limit. Doesn't typically even kill any processes. The exception is normally just in your thread. You can often trigger it just by disabling JS on even modern forms, but,, at best you're maybe leaking a bit of info if they left debug on and print the stack trace or a query. Another common slip-up is failing to count \n vs \r\n in text strings since JS usually usually counts a carriage return as 1 byte, but HTTP spec requires two.

unescape(encodeURIComponent("ç")).length is the quick and dirty way to do a JS byte length check. The \r\n thing can be done just by cleaning up the string before length counting.

dotancohen

Does Zalgo even work on HN? I've never thought of using it to test my systems, thank you. I've got some new testing to do tonight.

Edit: No, Zalgo doesn't work on HN. This comment itself was an experiment to try.

Rendello

A few months ago I made a post which I (should've) named "Unicode codepoints that expand or contract when case is changed in UTF-8". A decent parser shouldn't have any issues with things like this, but software that makes bad Unicode assumptions might.

https://news.ycombinator.com/item?id=42014045

n0id34

Sorry n00b here, can you explain more about this or how you did this? I feel like this is definitely a loophole that would be worth testing for.

ComputerGuru

This is cute but unnecessary - Unicode includes a massive range called PUA: the private use area. The codes in this range aren’t mapped to anything (and won’t be mapped to anything) and are for internal/custom use, not to be passed to external systems (for example, we use them in fish-shell to safely parse tokens into a string, turning an unescaped special character into just another Unicode code point in the string, but in the PUA area, then intercept that later in the pipeline).

You’re not supposed to expose them outside your api boundary but when you encounter them you are prescribed to pass them through as-is, and that’s what most systems and libraries do. It’s a clear potential exfiltration avenue, but given that most sane developers don’t know much more about Unicode other than “always use Unicode to avoid internationalization issues”, it’s often left wide open.

paulgb

I just tested and private use characters render as boxes for me (󰀀), the point here was to encode them in a way that they are hidden and treated as "part of" another character when copy/pasting.

diggan

> the point here was to encode them in a way that they are hidden and treated as "part of" another character when copy/pasting

AKA "Steganography" for the curious ones: https://en.wikipedia.org/wiki/Steganography

reaperducer

Like when we used to encode the phone numbers of warez boards in GIFs.

bruce343434

On my Android phone,that displays "Go[][]" in the Google logo font.

layer8

The difference is that PUA characters are usually rendered in some way that is rather visible, whereas the variation selectors aren’t.

lolinder

Context that some may be missing is that this was inspired by discussion surrounding the Open Heart Protocol submission:

https://news.ycombinator.com/item?id=42791378

People immediately began discussing the applications for criminal use given the constraint that only emoji are accepted by the API. So for that use case the PUA wouldn't be an option, you have to encode it in the emoji.

Sniffnoy

Isn't this more what the designated noncharacters are for, rather than the private-use area? Given how the private-use area sometimes gets for unofficial encodings of scripts not currently in Unicode (or for things like the Apple logo and such) I'd be worried about running into collisions with that if I used the PUA in such a way.

Note that designated noncharacters includes not only 0xFFFF and 0xFFFE, and not only the final two code points of every plane, but also an area in the middle of Arabic Presentation Forms that was at some point added to the list of noncharacters specifically so that there would be more noncharacters for people using them this way!

juped

I'll be h󠄾󠅟󠅠󠅕󠄜󠄐󠅞󠅟󠄐󠅣󠅕󠅓󠅢󠅕󠅤󠅣󠄐󠅘󠅕󠅢󠅕onest, I pasted this comment in the provided decoder thinking no one could miss the point this badly and there was probably a hidden message inside it, but either you really did or this website is stripping them.

You can't invisibly watermark an arbitrary character (I did it to one above! If this website isn't stripping them, try it out in the provided decoder and you'll see) with unrecognized PUA characters, because it won't treat them as combining characters. You will cause separately rendered rendered placeholder-box characters to appear. Like this one:  (may not be a placeholder-box if you're privately-using the private use area yourself).

egypturnash

j󠄗󠅄󠅧󠅑󠅣󠄐󠅒󠅢󠅙󠅜󠅜󠅙󠅗󠄜󠄐󠅑󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅣󠅜󠅙󠅤󠅘󠅩󠄐󠅤󠅟󠅦󠅕󠅣󠄴󠅙󠅔󠄐󠅗󠅩󠅢󠅕󠄐󠅑󠅞󠅔󠄐󠅗󠅙󠅝󠅒󠅜󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠅧󠅑󠅒󠅕󠄫󠄱󠅜󠅜󠄐󠅝󠅙󠅝󠅣󠅩󠄐󠅧󠅕󠅢󠅕󠄐󠅤󠅘󠅕󠄐󠅒󠅟󠅢󠅟󠅗󠅟󠅦󠅕󠅣󠄜󠄱󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅝󠅟󠅝󠅕󠄐󠅢󠅑󠅤󠅘󠅣󠄐󠅟󠅥󠅤󠅗󠅢󠅑󠅒󠅕󠄞󠄒󠄲󠅕󠅧󠅑󠅢󠅕󠄐󠅤󠅘󠅕󠄐󠄺󠅑󠅒󠅒󠅕󠅢󠅧󠅟󠅓󠅛󠄜󠄐󠅝󠅩󠄐󠅣󠅟󠅞󠄑󠅄󠅘󠅕󠄐󠅚󠅑󠅧󠅣󠄐󠅤󠅘󠅑󠅤󠄐󠅒󠅙󠅤󠅕󠄜󠄐󠅤󠅘󠅕󠄐󠅓󠅜󠅑󠅧󠅣󠄐󠅤󠅘󠅑󠅤󠄐󠅓󠅑󠅤󠅓󠅘󠄑󠄲󠅕󠅧󠅑󠅢󠅕󠄐󠅤󠅘󠅕󠄐󠄺󠅥󠅒󠅚󠅥󠅒󠄐󠅒󠅙󠅢󠅔󠄜󠄐󠅑󠅞󠅔󠄐󠅣󠅘󠅥󠅞󠅄󠅘󠅕󠄐󠅖󠅢󠅥󠅝󠅙󠅟󠅥󠅣󠄐󠄲󠅑󠅞󠅔󠅕󠅢󠅣󠅞󠅑󠅤󠅓󠅘󠄑󠄒󠄸󠅕󠄐󠅤󠅟󠅟󠅛󠄐󠅘󠅙󠅣󠄐󠅦󠅟󠅢󠅠󠅑󠅜󠄐󠅣󠅧󠅟󠅢󠅔󠄐󠅙󠅞󠄐󠅘󠅑󠅞󠅔󠄪󠄼󠅟󠅞󠅗󠄐󠅤󠅙󠅝󠅕󠄐󠅤󠅘󠅕󠄐󠅝󠅑󠅞󠅨󠅟󠅝󠅕󠄐󠅖󠅟󠅕󠄐󠅘󠅕󠄐󠅣󠅟󠅥󠅗󠅘󠅤󠇒󠅰󠆄󠅃󠅟󠄐󠅢󠅕󠅣󠅤󠅕󠅔󠄐󠅘󠅕󠄐󠅒󠅩󠄐󠅤󠅘󠅕󠄐󠅄󠅥󠅝󠅤󠅥󠅝󠄐󠅤󠅢󠅕󠅕󠄜󠄱󠅞󠅔󠄐󠅣󠅤󠅟󠅟󠅔󠄐󠅑󠅧󠅘󠅙󠅜󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅟󠅥󠅗󠅘󠅤󠄞󠄱󠅞󠅔󠄐󠅑󠅣󠄐󠅙󠅞󠄐󠅥󠅖󠅖󠅙󠅣󠅘󠄐󠅤󠅘󠅟󠅥󠅗󠅘󠅤󠄐󠅘󠅕󠄐󠅣󠅤󠅟󠅟󠅔󠄜󠅄󠅘󠅕󠄐󠄺󠅑󠅒󠅒󠅕󠅢󠅧󠅟󠅓󠅛󠄜󠄐󠅧󠅙󠅤󠅘󠄐󠅕󠅩󠅕󠅣󠄐󠅟󠅖󠄐󠅖󠅜󠅑󠅝󠅕󠄜󠄳󠅑󠅝󠅕󠄐󠅧󠅘󠅙󠅖󠅖󠅜󠅙󠅞󠅗󠄐󠅤󠅘󠅢󠅟󠅥󠅗󠅘󠄐󠅤󠅘󠅕󠄐󠅤󠅥󠅜󠅗󠅕󠅩󠄐󠅧󠅟󠅟󠅔󠄜󠄱󠅞󠅔󠄐󠅒󠅥󠅢󠅒󠅜󠅕󠅔󠄐󠅑󠅣󠄐󠅙󠅤󠄐󠅓󠅑󠅝󠅕󠄑󠄿󠅞󠅕󠄜󠄐󠅤󠅧󠅟󠄑󠄐󠄿󠅞󠅕󠄜󠄐󠅤󠅧󠅟󠄑󠄐󠄱󠅞󠅔󠄐󠅤󠅘󠅢󠅟󠅥󠅗󠅘󠄐󠅑󠅞󠅔󠄐󠅤󠅘󠅢󠅟󠅥󠅗󠅘󠅄󠅘󠅕󠄐󠅦󠅟󠅢󠅠󠅑󠅜󠄐󠅒󠅜󠅑󠅔󠅕󠄐󠅧󠅕󠅞󠅤󠄐󠅣󠅞󠅙󠅓󠅛󠅕󠅢󠄝󠅣󠅞󠅑󠅓󠅛󠄑󠄸󠅕󠄐󠅜󠅕󠅖󠅤󠄐󠅙󠅤󠄐󠅔󠅕󠅑󠅔󠄜󠄐󠅑󠅞󠅔󠄐󠅧󠅙󠅤󠅘󠄐󠅙󠅤󠅣󠄐󠅘󠅕󠅑󠅔󠄸󠅕󠄐󠅧󠅕󠅞󠅤󠄐󠅗󠅑󠅜󠅥󠅝󠅠󠅘󠅙󠅞󠅗󠄐󠅒󠅑󠅓󠅛󠄞󠄒󠄱󠅞󠅔󠄐󠅘󠅑󠅣󠅤󠄐󠅤󠅘󠅟󠅥󠄐󠅣󠅜󠅑󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠄺󠅑󠅒󠅒󠅕󠅢󠅧󠅟󠅓󠅛󠄯󠄳󠅟󠅝󠅕󠄐󠅤󠅟󠄐󠅝󠅩󠄐󠅑󠅢󠅝󠅣󠄜󠄐󠅝󠅩󠄐󠅒󠅕󠅑󠅝󠅙󠅣󠅘󠄐󠅒󠅟󠅩󠄑󠄿󠄐󠅖󠅢󠅑󠅒󠅚󠅟󠅥󠅣󠄐󠅔󠅑󠅩󠄑󠄐󠄳󠅑󠅜󠅜󠅟󠅟󠅘󠄑󠄐󠄳󠅑󠅜󠅜󠅑󠅩󠄑󠄒󠄸󠅕󠄐󠅓󠅘󠅟󠅢󠅤󠅜󠅕󠅔󠄐󠅙󠅞󠄐󠅘󠅙󠅣󠄐󠅚󠅟󠅩󠄞󠄗󠅄󠅧󠅑󠅣󠄐󠅒󠅢󠅙󠅜󠅜󠅙󠅗󠄜󠄐󠅑󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅣󠅜󠅙󠅤󠅘󠅩󠄐󠅤󠅟󠅦󠅕󠅣󠄴󠅙󠅔󠄐󠅗󠅩󠅢󠅕󠄐󠅑󠅞󠅔󠄐󠅗󠅙󠅝󠅒󠅜󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠅧󠅑󠅒󠅕󠄫󠄱󠅜󠅜󠄐󠅝󠅙󠅝󠅣󠅩󠄐󠅧󠅕󠅢󠅕󠄐󠅤󠅘󠅕󠄐󠅒󠅟󠅢󠅟󠅗󠅟󠅦󠅕󠅣󠄜󠄱󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅝󠅟󠅝󠅕󠄐󠅢󠅑󠅤󠅘󠅣󠄐󠅟󠅥󠅤󠅗󠅢󠅑󠅒󠅕󠄞 is for Jabberwocky. Does this decode?

edit: Yes, it does.

omnibrain

10 years or so ago I shocked coworkers with using U+202D LEFT-TO-RIGHT OVERRIDE mid in filenames on windows. So funnypicturegnp.exe became funnypictureexe.png Combined with a custom icon for the program that mimics a picture preview it was pretty convincing.

mdup

I worked in phishing detection. This was a common pattern used by attackers, although .exe are blocked automatically most of the time, .html is the new malicious extension (often hosting an obfuscated window.location redirect to a fake login page).

RTL abuse like cute-cat-lmth.png was relatively common, but also trivial to detect. We would immediately flag such an email as phishing.

omoikane

The source code version of that is CVE-2021-42574, and they have a website:

https://trojansource.codes/

Basically it's possible to hide some code that looks like comments but compiles like code. I seem to recall the CVE status was disputed since many text editors already make these suspicious comments visible.

taneq

I’d never heard of this particular trick but I’m glad my decades of paranoia-fueled “right click -> open with” treatment of any potentially sketchy media file was warranted! :D

oefnak

I created a guitar_tab.txt which was a bat file.

hosteur

Wow this is a clever trick.

rexxars

For a real-world use case: Sanity used this trick[0] to encode Content Source Maps[1] into the actual text served on a webpage when it is in "preview mode". This allows an editor to easily trace some piece of content back to a potentially deep content structure just by clicking on the text/content in question.

It has it's drawbacks/limitations - eg you want to prevent adding it for things that needs to be parsed/used verbatim, like date/timestamps, urls, "ids" etc - but it's still a pretty fun trick.

[0] https://www.sanity.io/docs/stega

[1] https://github.com/sanity-io/content-source-maps

vessenes

I love the idea of using this for LLM output watermarking. It hits the sweet spot - will catch 99% of slop generators with no fuss, since they only copy and paste anyway, almost no impact on other core use cases.

I wonder how much you’d embed with each letter or token that’s output - userid, prompt ref, date, token number?

I also wonder how this is interpreted in a terminal. Really cool!

fennecfoxy

Why does anybody think AI watermarking will ever work? Of course it will never work, any watermarking can be instantly & easily stripped...

The only real AI protection is to require all human interaction to be signed by a key verified by irl identity and even then that will: A never happen, B be open to abuse by countries with corrupt governments and countries with corrupt governments heavily influenced by private industry (like the US).

teruakohatu

> any watermarking can be instantly & easily stripped...

I think it took a while before printer watermarking (yellow dots) was discovered. It certainly cannot be stripped. It was possibly developed in the mid-80s but not known to the public until mid 2000s.

zos_kia

With the amount of pre processing that is done before integrating stuff in a dataset I'd be surprised if those kinds of shenanigans even worked

capitainenemo

In most linux terminals, what you pass it is just a sequence of bytes that is passed unmangled. And since this technique is UTF-8 compliant and doesn't use any extra glyphs, it is invisible to humans in unicode compliant terminals. I tried it on a few. It shows up if you echo the sentence to, say, xxd ofc.

(unlike the PUA suggestion in the currently top voted comment which shows up immediately ofc)

Additional test corrections: While xxd shows the message passing through completely unmangled on pasting it into the terminal, when I selected from the terminal (echoed sentence, verified unmangled in xxd, then selected and pasted the result of echo), it was truncated to a few words using X select in mate terminal and konsole - I'm not sure where that truncation happens, whether it's the terminal or X. In xterm, the final e was mangled, and the selection was even more truncated.

The sentence is written unmangled to files though, so I think it's more about copying out of the terminal dropping some data. Verified by echoing the sentence to a test file, opening it in a browser, and copying the text from there.

vessenes

On MacOS, kitty shows an empty box, then an a for the "h󠅘󠅕󠅜󠅜󠅟󠄐󠅖󠅕󠅜󠅜󠅟󠅧󠄐󠅘󠅑󠅓󠅛󠅕󠅢󠄐󠄪󠄙a" post below. I think this is fair and even appreciated. Mac Terminal shows "ha". That "h󠅘󠅕󠅜󠅜󠅟󠄐󠅖󠅕󠅜󠅜󠅟󠅧󠄐󠅘󠅑󠅓󠅛󠅕󠅢󠄐󠄪󠄙a" (and this one!) can be copied and pasted into the decoder successfully.

ChadNauseam

There are other possible approaches to LLM watermarking that would be much more robust and harder to detect. They exploit the fact that LLMs work by producing a probability distribution that gives a probability for each possible next token. These are then sampled randomly to produce the output. To add fingerprints when generating, you could do some trickery in how you do that sampling that would then be detectable by re-running the LLM and observing its outputs. For example, you could alternate between selecting high-probability and low-probability tokens. (A real implementation of this would be much more sophisticated than that obviously, but hopefully you get the idea)

vessenes

This is not a great method in a world with closed models and highly diverse open models and samplers. It’s intellectually appealing for sure! But it will always be at best a probabilistic method, and that’s if you have the llm weights at hand.

ChadNauseam

What makes it not a good method? Of course if a model's weights are publicly available, you can't compel anyone using it to add fingerprinting at the sampler stage or later. But I would be shocked if OpenAI was not doing something like this, since it would be so easy and couldn't hurt them, but could help them if they don't want to train on outputs they generated. (Although they could also record hashes of their outputs or something similar as well – I would be surprised if they don't.)

OutOfHere

Just you wait until AI starts calling human output to be slop.

vessenes

That's already happening - my kids have had papers unfairly blamed on chatgpt by automated tools. Protect yourself kids, use an editor that can show letter by letter history.

neom

2 people I worked with had this happen and one of them is going to war over it as it was enough to lower the kids grade for college or something. Crazy times.

red369

Do you have any examples of editors that show letter by letter history? I have never looked for that as a feature.

Edit: I've been looking, and Google Docs seems to have version history to the minute.

roguecoder

There are of course human writers who are less-communicative than AI, called "shit writers", and humans who are less accurate than AI, called "liars".

The difference is humans are responsible for what they write, whereas the human user who used an AI to generate text is responsible for what the computer wrote.

ethin

It's worth noting, just as a curiosity, that screen readers can detect these variation selectors when I navigate by character. For example, if I arrow over the example he provided (I can't paste it here lol), I here: "Smiling face with smiling eyes", "Symbol e zero one five five", "Symbol e zero one five c", "Symbol e zero one five c", "Symbol e zero one five f". This is unfortunately dependent on the speech synthesizer used, and I wouldn't know if the characters were there if I was just reading a document, so this isn't much of an advantage all things considered.

llm_trw

Ironically enough I have a script that strips all non-ascii characters from my screen reader because I found that _all_ online text was polluted with invisible and annoying to listen to characters.

ethin

Mine (NVDA) isn't annoying about non-ASCII symbols, interestingly enough. But for something like this form of Unicode "abuse" (?), if you throw a ton of them into a message or something, they become "lines" I have to arrow past because my screen reader will otherwise remain silent on those lines unless I use say-all (continuous reading for those who don't use screen readers).

kevinsync

StegCloak [0] is in the same ballpark and takes this idea a step further by encrypting the hidden payload via AES-256-CTR -- pretty neat little trick

[0] https://github.com/KuroLabs/stegcloak

giancarlostoro

There's a Better Discord plugin that I think uses this or something similar, so you could send completely encrypted messages, that look like nothing to everyone else. You'd need to share a password secret for them to decode it though.

johnisgood

Could probably add OTR, too.

putna

wow, thats neat.

Wanted to try on Cloudflare DNS TXT record. But Cloudflare is smart enough to decode when pasting in TXT field.

UltraSane

DNS only supports ASCII for record values. It has a hack to support unicode domain names using Punycode

vzaliva

The title lis little misleading: "Note that the base character does not need to be an emoji – the treatment of variation selectors is the same with regular characters. It’s just more fun with emoji."

Using this approach with non-emoji characters makes it more stealth and even more disturbing.

LorenPechtel

I don't see this as all that disturbing. A detector for it wouldn't be hard to write (variant on something that doesn't actually have variants, flag it!), it seems to me it could be useful for signing things.

HanClinto

Even more than just simply watermarking LLM output, it seems like this could be a neat way to package logprobs data.

Basically, include probability information about every token generated to give a bit of transparency to the generation process. It's part of the OpenAI api spec, and many other engines (such as llama.cpp) support providing this information. Normally it's attached as a separate field, but there are neat ways to visualize it (such as mikupad [0]).

Probably a bad idea, but this still tickles my brain.

* [0]: https://github.com/lmg-anon/mikupad

wunderwuzzi23

This is cool. There are also the Unicode Tag characters that mirror ASCII and are often invisible in UI elements (especially web apps).

The unique thing about Tag characters is that some LLMs interpret the hidden text as ASCII and follow instructions, and they can even write them:

https://embracethered.com/blog/posts/2024/hiding-and-finding...

Here an actual exploit POC that Microsoft fixed in Copilot: https://embracethered.com/blog/posts/2024/m365-copilot-promp...

nonameiguess

More generally, you can use encoding formats that reserve uninterpreted byte sequences for future use to pass data that is only readable by receivers who know what you're doing, though note this not a cryptographically secure scheme and any sort of statistical analysis can reveal what you're doing.

The png spec, for instance, allows you to include as many metadata chunks as you wish, and these may be used to hold data that cannot be used by any mainstream png reader. We used this in the Navy to embed geolocation and sensor origin data that was readable by specialized viewers that only the Navy had, but if you opened the file in a browser or common image viewer, it would either ignore or discard the unknown chunks.

dkarl

Lots of image formats store arbitrary metadata (and data data) either by design or by application-specific extensions. I remember seeing seismic and medical images that contained data for display in specialized applications and writing code to figure out if binary metadata was written in big-endian or little-endian byte order (the metadata often did not have the same endianness as the image data!) For example, TIFF files containing 3d scans as a sequence of slices, with binary metadata attached to each slice. If you opened it up in your system default image viewer, you'd only see the first slice, but a specialized viewer (which I did not have) would display it as a 3d model. Luckily (IIRC) the KDE file browser let you quickly flip through all the images in a directory using the keyboard, so I was able to dump all the layers into separate files and flip through them to see the 3d image.

themoonisachees

i've found a genuinely useful application of this that isn't just "talk secretively": I'm part of a community that has a chatroom on both discord and another platform. we have developped a bot that can make the bridge between the two, but it needs to maintain a table in a database of all the bridged messages, so that people use the "reply" feature on either platform, the reply is also performed on the other platform. With this, each platform's bridged message can contain hidden data that points to the other's platform ID, without the need to store it ourselves, and without it being visible to users.

obviously we won't be tearing down the infra already in place anytime soon, so we won't actually be doing that, but that's definitely a useful application for when you don't want to be hosting a database and just wish you could attach additional data to a restrictive format.