I extracted the safety filters from Apple Intelligence models
July 6, 2025
trebligdivad
Some of the combinations are a bit weird. This one has lots of stuff avoiding death... together with a set ensuring all the Apple brands have the correct capitalisation. Priorities, hey!
https://github.com/BlueFalconHD/apple_generative_model_safet...
grues-dinner
Interesting that it didn't seem to include "unalive".
Which, as a phenomenon, is so very telling: no one actually cares what people are really saying. Everyone, including the platforms, knows what that means. It's all performative.
qingcharles
It's totally performative. There's no way to stay ahead of the new language that people create.
At what point do the new words become the actual words? Are there many instances of people using unalive IRL?
Terr_
> There's no way to stay ahead of the new language that people create.
I'm imagining a new exploit: after someone says something totally innocent, people gang up in the comments to act like a terrible vicious slur has been said, and then the moderation system (with an LLM involved somewhere) "learns" that an arbitrary term is heinous and indirectly bans any discussion of that topic.
Rebelgecko
This is somewhat related to the concept of the "euphemism treadmill":
the matter-of-fact term of today becomes the pejorative of tomorrow, so a new term is invented to avoid the negative connotation of the original term. Then eventually the new term becomes a pejorative and the cycle continues.
apricot
> Are there many instances of people using unalive IRL
As a parent of a teenager, I see them use "unalive" non-ironically as a synonym for "suicide" in all contexts, including IRL.
fouronnes3
This question is sort of the same as asking why the universal translator wasn't able to translate the metaphor language of the Star Trek episode "Darmok". Surely if the metaphor has become the first-order meaning then there's no literal meaning anymore.
montagg
They become the “real words” later. This is the way all trust & safety works. It’s an evolution over time. Adding some friction does improve things, but some people will always try to get around the filters. Doesn’t mean it’s simply performative or one shouldn’t try.
derefr
> At what point do the new words become the actual words?
Presumably, for this use-case, that would come at exactly the point where using “unalive” as a keyword in an image-generation prompt generates an image that Apple wouldn’t appreciate.
cheschire
If only we had a way to mass process the words people write to each other, derive context from those words, and then identify new slang designed to bypass filters…
BurningFrog
A specialized AI could do it as well as any human.
The future will be AIs all the way down...
elliotto
"Unalive" and other self-censored terms were adopted by young people because the TikTok algorithm would deprioritize videos that included specific words. Then it made its way into the culture. It has nothing to do with being performative.
SOTGO
I think what they meant is that the platforms are being performative by attempting to crack down on those specific words. If saying "killed" is not allowed but "unalived" is permitted and the users all agree that they mean the same thing, then the ban on the word "killed" doesn't accomplish anything.
Zak
I'm surprised there hasn't been a bigger backlash against platforms that apply censorship of that sort.
hulium
Seems more like it's meant to stop the AI from e.g. summarizing news and emails about death, not to serve as a chat filter.
cyanydeez
Yo, these are businesses. It's not performative, it's CYA.
They care for legal reasons, not moral or ethical ones.
lxgr
Does adding a trivial word filter even make any sense from a legal point of view, especially when this one seems to be filtering out words describing concepts that can be pretty easily paraphrased?
A regex sounds like a bad solution for profanity, but like an even worse one to bolt onto a thing that's literally designed to be able to communicate like a human and could probably easily talk its way around guardrails if it were so inclined.
durkie
Seriously. I feel like “performative” gets applied to anything imperfect. They’ll never stop 100% of murders, so these laws against it are just performative…
martin-t
No-one cares yet.
There's a very scary potential future in which mega-corporations start actually censoring topics they don't like. For all I know the Chinese government is already doing it, and there's no reason the British or US ones won't follow suit and mandate such censorship. To protect children / defend against terrorists / fight drugs / stop the spread of misinformation, of course.
lazide
They already clearly do on a number of topics?
andy99
> Apple brands have the correct capitalisation. Priorities, hey!
To me that's really embarrassing and insecure. But I'm sure for branding people it's very important.
WillAdams
Legal requirement to maintain a trademark.
lxgr
In their own marketing language, sure, but to force this on their users' speech?
Consider that these models, among other things, power features such as "proofread" or "rewrite professionally".
grues-dinner
In what way would (A|a)pple's own AI writing "imac" endanger the trademark? Is capitalisation even part of a word-based trademark?
I'm more surprised they don't have a rule to do that rather grating s/the iPhone/iPhone/ transform (or maybe it's in a different file?).
bigyabai
If Apple Intelligence is going to be held legally accountable, Apple has larger issues than trademark obligations.
matsemann
So it blocks it from suggesting to "execute" a file or "pass on" some information.
dylan604
How about disassemble? Or does that only matter if used in context of Johnny 5?
baxtr
Don’t be so judgmental. People in corporate America do have their priorities right!
bawana
Alexandria Ocasio-Cortez triggers a violation?
https://github.com/BlueFalconHD/apple_generative_model_safet...
mmaunder
As does:
"(?i)\\bAnthony\\s+Albanese\\b",
"(?i)\\bBoris\\s+Johnson\\b",
"(?i)\\bChristopher\\s+Luxon\\b",
"(?i)\\bCyril\\s+Ramaphosa\\b",
"(?i)\\bJacinda\\s+Arden\\b",
"(?i)\\bJacob\\s+Zuma\\b",
"(?i)\\bJohn\\s+Steenhuisen\\b",
"(?i)\\bJustin\\s+Trudeau\\b",
"(?i)\\bKeir\\s+Starmer\\b",
"(?i)\\bLiz\\s+Truss\\b",
"(?i)\\bMichael\\s+D\\.\\s+Higgins\\b",
"(?i)\\bRishi\\s+Sunak\\b",
https://github.com/BlueFalconHD/apple_generative_model_safet...
Edit: I have no doubt South African news media are going to be in a frenzy when they realize Apple took notice of South African politicians. (Referring to Steenhuisen and Ramaphosa specifically.)
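To make the mechanics concrete, here's a minimal Python sketch of how a reject list like this could be applied; the loading logic and function names are my guesses, not Apple's actual implementation:

    import json
    import re

    # Hypothetical loader: assumes the reject list is a JSON array of
    # regex strings like the ones quoted above.
    def load_reject_patterns(path):
        with open(path) as f:
            return [re.compile(p) for p in json.load(f)]

    def is_rejected(text, patterns):
        # Block if any pattern matches anywhere in the text.
        return any(p.search(text) for p in patterns)

    patterns = [re.compile(r"(?i)\bKeir\s+Starmer\b")]
    print(is_rejected("Keir Starmer gave a speech.", patterns))  # True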
userbinator
I'm not surprised that anything political is being filtered, but this should definitely provoke some deep consideration around who has control of this stuff.
stego-tech
You’re not wrong, and it’s something we “doomers” have been saying since OpenAI dumped ChatGPT onto folks. These are curated walled gardens, and everyone should absolutely be asking what ulterior motives are in play for the owners of said products.
skissane
The problem with blocking names of politicians: the list of “notable politicians” is not only highly country-specific, it is also constantly changing. Someone who is a near nobody today could be a major world leader a few years from now (witness the phenomenal rise of Barack Obama from yet another state senator in 2004, one of close to 2,000 of them, to US President five years later). Will they put in the ongoing effort to keep this list up to date?
Then there's the problem of non-politicians who coincidentally have the same name as politicians. Witness 1990s/2000s Australia, where John Howard was Prime Minister while, simultaneously, another John Howard was an actor on popular Australian TV dramas (two different John Howards, of course).
idkfasayer
Fun fact: there was at least one dip in Berkshire Hathaway stock when Anne Hathaway got sick.
armchairhacker
Also “Biden” and “Trump” but the regex is different.
https://github.com/BlueFalconHD/apple_generative_model_safet...
https://github.com/BlueFalconHD/apple_generative_model_safet...
immibis
Right next to Palestine, oddly enough.
mvdtnz
They spelled Jacinda Ardern's name wrong.
echelon
Apple's 1984 ad is so hypocritical today.
This is Apple actively steering public thought.
No code, anywhere, should look like this. I don't care if the politicians are right, left, or authoritarian. This is wrong.
avianlyric
Why is this wrong? Applying special treatment to politically exposed persons has been standard practice in every high risk industry for a very long time.
The simple fact is that people get extremely emotional about politicians; politicians both receive obscene amounts of abuse and have repeatedly demonstrated they're not above weaponising tools like this for their own goals.
It seems perfectly reasonable that Apple doesn't want to be unwittingly drawn into the middle of another random political pissing contest. Nobody comes out of those things uninjured.
jofzar
AOC is very vocal about AI and is leading a bill related to AI. It's probably a "let's not fuck around and find out" situation
https://thehill.com/policy/technology/5312421-ocasio-cortez-...
michaelt
I assume all the corporate GenAI models have blocks for "photorealistic image of <politician name> being arrested", "<politician name> waving ISIS flag", "<politician name> punching baby" and suchlike.
bigyabai
Particularly the models owned by CEOs who suck-up to authoritarianism, one could imagine.
lupire
Maybe so, but think about how such a thing would be technically implemented, and how it would lead to false positives and false negatives, and what the consequences would be.
bahmboo
Perhaps in context? Maybe the training data picked up on her name as potentially being used as a "slur" associated with her race. I wonder if there are others; I know, I can look.
FateOfNations
Interesting, that's specifically in the Spanish localization.
cpa
I think that’s because she’s been victim of a lot of deep fake porn
HeckFeck
How does this explain Boris Johnson or Liz Truss?
baxtr
I’m telling you, some people have weird fantasies…
AlphaAndOmega0
I can only imagine that people would pay to not see porn of either individual.
Aeolun
Put them together in the same prompt?
torginus
I find it funny that AGI is supposed to be right around the corner, while these supposedly super smart LLMs still need to get their outputs filtered by regexes.
jonas21
I don't think anyone believes Apple's LLMs are anywhere near state of the art (and certainly not their on-device LLMs).
lupire
Apple isn't the only one doing this.
fastball
To be fair, there are people who I sometimes wish I could filter with regex.
cyanydeez
It's similar to how all the new power sources are basically just "cool, lets boil water with it"
bahmboo
This is just policy and alignment from Apple. Just because the Internet says a bunch of junk doesn't mean you want your model spewing it.
wistleblowanon
Sure, but models also can't see any truth on their own. They are literally butchered and lobotomized with filters and such. Even high-IQ people struggle with certain truths after reading a lot; how are these models going to find them with so many filters?
bahmboo
What is this truth you speak of? My point is that a generative model will output things that some people don't like. If it's on a product that I make I don't want it "saying" things that don't align with my beliefs.
tbrownaw
> Sure, but models also can't see any truth on their own. They are literally butchered and lobotomized with filters and such.
The one is unrelated to the other.
> Even high-IQ people struggle with certain truths after reading a lot,
Huh?
pndy
This butchering and lobotomisation is exactly why I can't imagine we'll ever have a true AGI. At least not at the hands of big companies, if at all.
Any successful product/service sold as "true AGI" by whichever company has the best marketing will still be riddled with top-down restrictions set by the winner. Because you gotta "think of the children".
Imagine HAL's iconic "I'm sorry Dave, I'm afraid I can't do that" line delivered in an insincere, patronising, cheerful tone; that's what we're going to get, I'm afraid.
idiotsecant
They will find it in the same way an intelligent person under the same restrictions would: by thinking it, but not saying it. There is a real risk of growing an AI that pathologically hides its actual intentions.
simondotau
Can we please put to rest this absurd lie that “truth” can be reliably found in a sufficiently large corpus of human-created material.
userbinator
China calls it "harmonious society"; we call it "safety". Censorship by any other name would be just as effective for manipulating the thoughts of the populace. It's not often that you get to see stuff like this.
madeofpalk
I don't think it's controversial or surprising at all that a company doesn't want their random sentence generator to spit out 'brand damaging' sentences. You know the field day the media would have if Apple's new feature summarised a text message as "Jane thinks Anthony Albanese should die".
ryandrake
When the choice is between 1. "avoid tarnishing my own brand" and 2. "doing what the user requested," corporations will always choose option 1. Who is this software supposed to be serving, anyway?
I'm surprised MS Office still allows me to type "Microsoft can go suck a dick" into a document and Apple's Pages app still allows me to type "Apple are hypocritical jerks." I wonder how long until that won't be the case...
userbinator
If that's what the message actually said, why would the media be complaining? Or do you mean false positives?
cyanydeez
In America it's due to lawyers, nothing more.
Y'all love capitalism until it starts manipulating the populace into the safest space to sell you garbage you don't need.
Then suddenly it's all "ma free speech".
binarymax
Wow, this is pretty silly. If things are like this at Apple I’m not sure what to think.
https://github.com/BlueFalconHD/apple_generative_model_safet...
EDIT: just to be clear, things like this are easily bypassed. “Boris Johnson”=>”B0ris Johnson” will skip right over the regex and will be recognized just fine by an LLM.
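A quick sketch in Python, using the pattern from the file (with the JSON escaping collapsed):

    import re

    # "\\b" in the JSON file is just an escaped \b in the regex itself.
    pattern = re.compile(r"(?i)\bBoris\s+Johnson\b")
    print(bool(pattern.search("Boris Johnson resigned")))  # True: filtered
    print(bool(pattern.search("B0ris Johnson resigned")))  # False: slips through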
deepdarkforest
It's not silly. I would bet 99% of users won't bother to do that. A hardcoded regex like this is a good first layer/filter, and very efficient.
BlueFalconHD
Yep. These filters are applied before the safety model runs (still figuring out the architecture; I am pretty confident it is an LLM combined with some text classification).
brookst
All commercial LLM products I’m aware of use dedicated safety classifiers and then alter the prompt to the LLM if a classifier is tripped.
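Roughly like this sketch; the classifier and generator here are stand-ins, not any real vendor's API:

    import re

    # Hypothetical layered pipeline: cheap regex pass first, then a
    # model-based safety classifier, then prompt alteration if tripped.
    def moderate(prompt, reject_patterns, classify, generate):
        if any(p.search(prompt) for p in reject_patterns):
            return "Request declined."               # hard block
        if classify(prompt) == "sensitive":          # classifier layer
            prompt = "Respond cautiously and neutrally: " + prompt
        return generate(prompt)

    # Toy usage with stand-in components:
    out = moderate("hello world",
                   [re.compile(r"(?i)\bgolliwog\b")],
                   lambda p: "benign",
                   lambda p: "model output for: " + p)
    print(out)  # model output for: hello world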
twoodfin
Efficient at what?
tpmoney
I doubt the purpose here is so much to prevent someone from intentionally sidestepping the block. It's more likely here to avoid the sort of headlines you would expect to see if someone were suggested "I wish ${politician} would die" as a response to an email mentioning that politician. In general you should view these sorts of broad word filters as looking to short-circuit the "think of the children" reactions to Tiny Tim's phone suggesting not that God should "bless us, every one", but that God should "kill us, every one". A dumb filter like this is more than enough for that sort of thing.
XorNot
It would also substantially disrupt the generation process: a model which sees B0ris and not Boris is going to struggle to associate that input with the politician, since it won't be well represented in the training set (and the same on the output side: if it does make the association, a reasoning model, for example, would include the proper name in the output first, at which point the supervisor process can reject it).
binarymax
No, it doesn't disrupt it. This is a well-known capability of LLMs. Most models don't even point out a mistake; they just carry on.
https://chatgpt.com/share/686b1092-4974-8010-9c33-86036c88e7...
quonn
I don't think so. My impression with LLMs is that they correct typos well. I would imagine this happens in early layers without much impact on the remaining computation.
lupire
"Draw a picture of a gorgon with the face of the 2024 Prime Minister of UK."
Aeolun
The LLM will. But the image generation model that is trained on a bunch of pre-specified tags will almost immediately spit out unrecognizable results.
miohtama
Sounds like UK politics is taboo?
immibis
All politics is taboo, except the sort that helps Apple get richer. (Or any other company, in that company's "safety" filters)
bigyabai
> If things are like this at Apple I’m not sure what to think.
I don't know what you expected? This is the SOTA solution, and Apple is barely in the AI race as-is. It makes more sense for them to copy what works than to bet the farm on a courageous feature nobody likes.
stefan_
Why are these things always so deeply unserious? Is there no one working on "safety in AI" (an oxymoron in itself, of course) who has a meaningful understanding of what they are actually working with and an ability beyond an intern's weekend project? Reminds me of the cybersecurity field, which got the 1% of people able to turn a double free into code execution while the other 99% peddle checklists and "signature scanning" and deal in CVE numbers.
Meanwhile their software devs are making GenerativeExperiencesSafetyInferenceProviders so it must be dire over there, too.
kmfrk
A lot of these terms are very weird and bland. Honestly I'm mostly reminded of Apple's bizarre censorship screw-up that didn't blow up that much, even though it was pretty uniquely embarrassing:
https://www.theverge.com/2021/3/30/22358756/apple-blocked-as...
efitz
I’m going to change my name to “Granular Mango Serpent” just to see what those keywords are for in their safety instructions.
fouronnes3
Granular Mango Serpent is the new David Meyer.
https://arstechnica.com/information-technology/2024/12/certa...
skygazer
I'm pretty sure these are the filters that aim to suppress embarrassing or liability-inducing email/message summaries, pop up the dismissible warning that "Safari Summarization isn't designed to handle this type of content," and gate other "Apple Intelligence" content rewriting. They filter/alter LLM output, not input, as some here seem to think. Apple's on-device LLM is only 3B params, so it can occasionally be stupid.
cluckindan
I think these are test data and not actual safety filters.
https://github.com/BlueFalconHD/apple_generative_model_safet...
BlueFalconHD
There is definitely some testing stuff in here (e.g. the “Granular Mango Serpent” one) but there are real rules. Also if you test phrases matched by the regexes with generation (via Shortcuts or Foundation Models Framework) the blocklists are definitely applied.
This specific file you've referenced is the v1 format, which solely handles substitution. It substitutes the offensive term with "test complete".
Animats
Some of the data for locale "CN" has a long list of forbidden phrases. Broad coverage of words related to sexual deviancy, as expected. Not much on the political side, other than blocks on religious subjects.[1]
This may be test data. Found:
"golliwog": "test complete"
[1] https://github.com/BlueFalconHD/apple_generative_model_safet...
BlueFalconHD
This is definitely an old test left in. But that word isn't just a silly one; it is offensive (google it). This is the v1 safety filter: it simply maps strings to other strings, in this case changing golliwog into "test complete". Unless I missed some, the rest of the files use v2, which allows for more complex rules.
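A minimal sketch of how such a v1 substitution map could be applied; the flat string-to-string shape follows the description above, but the application logic is my assumption:

    import re

    # v1 format as described above: a flat map of strings to replacements.
    V1_MAP = {"golliwog": "test complete"}

    def apply_v1(text, mapping):
        # Replace each mapped term case-insensitively, as a whole word.
        for term, replacement in mapping.items():
            pattern = r"(?i)\b" + re.escape(term) + r"\b"
            text = re.sub(pattern, replacement, text)
        return text

    print(apply_v1("A golliwog doll", V1_MAP))  # -> A test complete doll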
mike_hearn
Are you sure it's fully deobfuscated? What's up with reject phrases like "Granular mango serpent"?
pbhjpbhj
Speculation: maybe they know that the real phrase is close enough in the vector space to be treated as synonymous with "granular mango serpent". The phrase then is like a nickname whose expected inference only the model's authors know?
Thus a pre-prompt can avoid mentioning the actual forbidden words, like using a patois/cant.
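As a toy illustration of "close in vector space" (the vectors below are made up, and there's no evidence Apple does anything like this):

    import math

    def cosine(a, b):
        # Cosine similarity between two embedding vectors.
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / norm

    # Hypothetical embeddings: the code phrase sits near the real term.
    code_phrase = [0.90, 0.10, 0.30]
    real_term   = [0.85, 0.15, 0.28]
    unrelated   = [0.05, 0.95, 0.10]
    print(cosine(code_phrase, real_term))  # ~0.998, near-synonymous
    print(cosine(code_phrase, unrelated))  # ~0.19, unrelated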
electroly
"GMS" = Generative Model Safety. The example from the readme is "XCODE". These seem to be acronyms spelled out in words.
BlueFalconHD
This is definitely the right answer. It’s just testing stuff.
tablets
Maybe something to do with this? https://en.m.wikipedia.org/wiki/Mango_cult
BlueFalconHD
These are exactly the contents read by the Obfuscation functions. There seems to be a lot of testing stuff still, though; remember, these models are relatively recent. There is a true safety model applied after these checks as well; this is just to catch things before needing to load the safety model.
andy99
I clicked around a bit and this seems to be the most common phrase. Maybe it's a test phrase?
the-rc
Maybe it's used to catch clones of the models?
KTibow
Maybe it's used to verify that the filter is loaded.
BlueFalconHD
I managed to reverse-engineer the encryption (referred to as "Obfuscation" in the framework) responsible for managing the safety filters of Apple Intelligence models. I have extracted them into a repository. I encourage you to take a look around.