
Extracting AI models from mobile apps

241 comments · January 5, 2025

ipsum2

This is cool, but it's only the first part of extracting an ML model for use. The second part is reverse engineering the tokenizer and the input transformations that are needed before passing data to the model, and turning its output into a human-readable format.

mentalgear

Would be interesting if someone could detail the approach to decoding the pre- and post-processing steps applied before data enters the model, and how to find the correct input encoding.

refulgentis

Boils down to "use Frida to find the arguments to the TensorFlow call beyond the model file"

The key here is that a binary model is just a bag of floats with primitively typed inputs and outputs.
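For illustration, a minimal frida-python sketch of that idea; the package name is a placeholder, and the hook assumes the app calls TFLite's C API, which it may well not:

    # Sketch only: hook TfLiteInterpreterInvoke to observe inference calls;
    # from there you can backtrace to the code that fills the input tensors.
    import frida

    session = frida.get_usb_device().attach("com.example.app")  # placeholder package
    script = session.create_script("""
    const invoke = Module.getExportByName("libtensorflowlite_jni.so",
                                          "TfLiteInterpreterInvoke");
    Interceptor.attach(invoke, {
      onEnter(args) {
        // args[0] is the TfLiteInterpreter*; the backtrace points at the
        // caller that prepared (tokenized/normalized) the input.
        console.log("Invoke @", args[0]);
        console.log(Thread.backtrace(this.context, Backtracer.ACCURATE)
                          .map(DebugSymbol.fromAddress).join("\\n"));
      }
    });
    """)
    script.load()
    input("Tracing, press Enter to quit...")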

It's ~impossible to write up more than what's here because either:

A) you understand reverse engineering and model basics, and thus the current content makes it clear you'd use Frida to figure out how the arguments are passed to TensorFlow,

or

B) you don't understand that this is a binary reverse engineering problem, even when shown Frida. If more content were provided, you'd see it as specific to a particular problem, which it has to be. You'd also need a hand-held walkthrough of batching, tokenization, and so on and so forth: too much for a write-up, and it'd be too confusing to follow for another model.

TL;DR: a request for more content is asking a reverse engineering article to give you a full education on model inference.

TeMPOraL

> It's ~impossible to write up more than what's here

Except you just did - or at least you wrote an outline for it, which is 80% of the value already.

littlestymaar

> TL;DR: a request for more content is asking a reverse engineering article to give you a full education on model inference

I don't understand what you mean: I have no clue about anything related to reverse engineering, but I ported the mistral tokenizer to Rust and also wrote a basic CPU Llama training and inference implementation in Rust, so I definitely wouldn't need an intro to model inference…

refulgentis

This is a good comment, but only in the sense that it documents that a model file doesn't run the model by itself.

An analogous situation would be seeing a blog post that purports to "show you code", where the code returns an object, and commenting: "This is cool, but it doesn't show you how to turn a function return value into a human-readable format." More noise than signal.

The techniques in the article are trivially understood to also apply to discovering the input tokenization format, and Netron shows you the types of inputs and outputs.
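For example, the same information Netron renders can be read programmatically; a sketch assuming the dumped file is a TFLite model and TensorFlow is installed (the path is illustrative):

    import tensorflow as tf

    # Print the model's I/O signature: tensor names, shapes, dtypes,
    # and quantization parameters.
    interpreter = tf.lite.Interpreter(model_path="model.tflite")
    interpreter.allocate_tensors()
    for d in interpreter.get_input_details() + interpreter.get_output_details():
        print(d["name"], d["shape"], d["dtype"], d["quantization"])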

Thanks for the article OP, really fascinating.

ipsum2

Just having the shapes of the input and output is not sufficient; the image (in this example) needs to be normalized. It's presumably not difficult to find the exact numbers, but it is a source of errors when reverse engineering an ML model.
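Concretely, for a vision model the missing preprocessing usually looks something like the sketch below; the mean/std constants here are the common ImageNet ones and purely an assumption, the real values have to be recovered from the app:

    import numpy as np

    def preprocess(rgb_u8: np.ndarray) -> np.ndarray:
        # Assumed normalization: guessing plain [0, 1] scaling when the app
        # actually uses mean/std normalization (or vice versa) silently
        # wrecks accuracy.
        x = rgb_u8.astype(np.float32) / 255.0
        mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
        return (x - mean) / std  # shape (H, W, 3), ready to batch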

refulgentis

Right, you get it: it's a Frida problem.

rob_c

If you can't fix this with a little help from ChatGPT or Google, you frankly shouldn't be building the models, let alone mucking with other people's...

garyfirestorm

A lot of comments here seem to think that there's no novelty. I disagree. As a new ML engineer I'm not very familiar with any reverse engineering techniques, and this is a good starting point. It's about ML yet it's simple enough to follow, and my 17-year-old cousin, who is ambitious to start in cybersecurity, would love this article. Maybe it's too advanced for him!

biosboiii

Thanks a lot :)

My general writing style is directed mainly at my non-technical colleagues, whom I hope to inspire to learn about computers.

This is no novelty, by far; it's a pretty standard use case of Frida. But I think many people, even software developers, don't grasp the concept of "what runs on your device is yours, you just don't have it yet".

Especially in mobile apps, many devs get sloppy with their mobile APIs because you can't just open the developer tools.

UnreachableCode

I'm a mobile developer and I'm new to using Frida and other such tools. Do you have any tips or reading material on how to use things like Frida?

biosboiii

I think you are starting off from the perfect direction, being a forward-engineer first, and then a reverse-engineer.

The community around Frida is a) a bit small and b) a bit unorganized/shadowy. You cannot find that many resources, at least I have not found them.

I would suggest using Objection: explore an app, enumerate the classes with android hooking list classes or android hooking search classes, then dynamically watch and unwatch them. That is the quickest way to start; once you start developing your own scripts you can always check out code at https://codeshare.frida.re/.
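A first session might look like this; the package name is a placeholder:

    $ objection --gadget com.example.app explore
    # then, inside the Objection REPL:
    android hooking search classes tflite
    android hooking watch class org.tensorflow.lite.Interpreter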

For everything else, join the Frida Telegram chat; most knowledge sits there. I am also there, feel free to reach out to @altayakkus.

Oh, and by the way, I would start with Android, even though iOS is fun too, and I would really, really suggest getting a rooted phone/emulator. For the Android Studio emulator you can use rootAVD (on GitHub) and just install Magisk Frida. Installing the Frida gadget into APKs is a mess which you won't miss when you go root.

nthingtohide

One thing I noticed in Gboard is that it uses homeomorphic encryption to do federated learning of common words used among the public, to make encrypted suggestions.

E.g. there are two common spellings of bizarre which are popular on Gboard: bizzare and bizarre.

Can something similar help in model encryption?

antman

Had to look it up, this seems to be the paper https://research.google/pubs/federated-learning-for-mobile-k...
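The core mechanism there is federated averaging: only locally computed model updates leave the device, never the raw typing data. A toy sketch of the server-side step (heavily simplified; the real system adds secure aggregation and more):

    import numpy as np

    def federated_average(updates: list[np.ndarray],
                          sizes: list[int]) -> np.ndarray:
        # Weight each client's model update by its local sample count.
        total = float(sum(sizes))
        return sum(u * (n / total) for u, n in zip(updates, sizes))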

1oooqooq

They have a very "interesting" definition of private data in the paper. It's so outlandish that if you buy their definition, there's zero value in the trained data. Heh.

They also claim that unsupervised users typing away is better than tagged training data, which explains the wild grammar suggestions in the top comment. Guess the age of quantity over quality is finally peaking.

In the end it's the same as Grammarly but without any verification of the ingested data, and calling the collection of user data "federation".

nthingtohide

Actually, letting users type whatever they want is good because there are many dialects of English: Chinglish, Thailish, Singlish, Hinglish, and so on.

They have made the system so general that it can handle any quirk users throw at it.

biosboiii

Author here. No clue about homeomorphic (or whatever) encryption; what could certainly be done is some sort of encryption of the model tied to the inference engine.

So, e.g.: Apple CoreML issues a public key, the model is encrypted with that public key, and somewhere in a trusted computing environment the model is decrypted using the private key and then run.
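A sketch of what such a scheme could look like, as plain envelope encryption; everything here is hypothetical and not how CoreML actually implements it:

    import os
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    # Stand-in for a key that would really live in a secure element.
    device_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    OAEP = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)

    def encrypt_model(model_bytes: bytes):
        # Encrypt the model with a fresh AES key, then wrap that key
        # to the device's public key.
        aes_key = AESGCM.generate_key(bit_length=256)
        nonce = os.urandom(12)
        blob = AESGCM(aes_key).encrypt(nonce, model_bytes, None)
        wrapped = device_key.public_key().encrypt(aes_key, OAEP)
        return wrapped, nonce, blob

    def decrypt_model(wrapped, nonce, blob) -> bytes:
        # Conceptually, this runs inside the trusted environment only.
        aes_key = device_key.decrypt(wrapped, OAEP)
        return AESGCM(aes_key).decrypt(nonce, blob, None)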

They should of course use multiple keypairs etc., but in the end this is just another obstacle in your way. When you own the device, root it, or even gain JTAG access to it, you can access and control everything.

And matrix multiplication is a computationally expensive process; I guess they won't add some sort of encryption technique to each and every cycle of it.

miki123211

In principle, device manufacturers could make hardware DRM work for ML models.

You usually run inference for those on GPUs anyway, and GPUs usually have some kind of hardware DRM support for video already.

The way hardware DRM works is that you pass some encrypted content to the GPU and get a blob containing the content key from somewhere, encrypted in a way that only this GPU can decrypt. This way, even if the OS is fully compromised, it never sees the decrypted content.

biosboiii

But then you could compromise the GPU, probably :)

Look at the bootloader, can you open a console?

If not, can you desolder the flash and read the key?

If not, can you access the bootloader when the flash is not detected anymore?

...

Can you solder off the capacitors and glitch the power line, to do a [Voltage Fault Injection](https://www.synacktiv.com/en/publications/how-to-voltage-fau...)?

Can you solder a shunt resistor to the power line, observe the fluctuations and do [Power analysis](https://en.wikipedia.org/wiki/Power_analysis)?

There are a lot of doors and every time someone closes them a window remains tilted.

umeshunni

Homomorphic, not homeomorphic

hyperbovine

`enc(coffee cup) == enc(donut)` would be an interesting guarantee.

vlovich123

In theory yes, in practice right now no. Homomorphic encryption is too computationally expensive.
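For the additively homomorphic case there are usable libraries; a quick sketch with the python-phe library, purely to show the property (and even this simple addition is orders of magnitude slower than plaintext arithmetic):

    from phe import paillier  # pip install phe

    public_key, private_key = paillier.generate_paillier_keypair()

    a = public_key.encrypt(3.5)
    b = public_key.encrypt(2.5)
    total = a + b  # addition happens on ciphertexts, no decryption needed
    print(private_key.decrypt(total))  # 6.0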

janalsncm

I’m a huge fan of ML on device. It’s a big improvement in privacy for the user. That said, there’s always a chance for the user to extract your model, so on-device models will need to be fairly generic.

Zambyte

Maybe someday we will build a society where standing on the shoulders of giants is encouraged, even when they haven't been dead for 100 years yet.

andrewfromx

this would be yellow in https://en.wikipedia.org/wiki/Spiral_Dynamics but we are still a mix of orange and green.

avg_dev

pretty cool; that frida tool seems really nice. https://frida.re/docs/home/

(And a bunch of people seem to be interested in the "IP" note, but I took it as just trying not to run into legal trouble for advertising "here's how you can 'steal' models!")

frogsRnice

Frida is an amazing tool; it has empowered me to do things that would have otherwise taken weeks or even months. This video is a little old, but the creator is also cracked: https://www.youtube.com/watch?v=CLpW1tZCblo

It's supposed to be "free-IDA" and the work put in by the developers and maintainers is truly phenomenal.

EDIT: This isn't really an attack imo. If you are going to take "secrets" and shove them into a mobile app, they can't really be considered secret. I suppose it's a tradeoff: if you want to do this kind of thing client-side, the secret sauce isn't so secret.

JTyQZSnP3cQGa8B

> Keep in mind that AI models [...] are considered intellectual property

Is it ironic or missing a /s? I can't really tell here.

SunlitCat

To be honest, that was my first thought on reading the headline as well. Given that especially the large companies got a huge amount of backlash for their unprecedented collection of data all over the web, and not just there but everywhere else (and who knows how the smaller ones got their training data), it's kind of ironic to talk about intellectual property.

If you use one of those AI models as a basis for your own AI model, the real danger could be that the owners of the originating data come after you at some point as well.

ToucanLoucan

Standard corporate hypocrisy. "Rules for thee, not for me."

If you actually expected anything to be open about OpenAI's products, please get in touch, I have an incredible business opportunity for you in the form of a bridge in New York.

xdennis

They got backlash, but (if I'm not mistaken) it was ruled that it's okay to use copyrighted works in your model.

So if a model is copyrighted, you should still be able to use it if you generate a different one based on it. I.e. copyright laundering. I assume this would be similar to how fonts work: you can copyright a font file, but not the actual shapes. So if you re-encode the shapes with different points, that's legal.

But, I don't think a model can be copyrighted. Isn't it the case that something created mechanically can't be copyrighted? It has to be authored by a person.

I find it weird that so many hackers go out of their way to approve of the legal claims of Big AI before it's even settled, instead of undermining Big AI. Isn't the hacker ethos all about decentralization?

Freak_NL

Standard disclaimer. Like inserting a bunch of "hypothetically"s into a comment telling someone where to find some piece of abandoned media, where using an unsanctioned channel would entail infringing upon someone's intellectual property.

biosboiii

Hey, author here.

I understand that it's not very clear whether the neural net and its weights & biases are considered IP. I personally think that if some OpenAI employee just leaked GPT-4o, it wouldn't magically become public domain for everyone to use. I think lawmakers would start to sue AWS if they just re-hosted ChatGPT. Not that I endorse it, but especially in IP, and in law in general, "judge law" ("Richterrecht" in German) is prevalent, and laws are not a DSL with a few ifs and whiles.

But it is also a "cover my ass" notice, as others said; I live in Germany and our law regarding "hacking" is quite ancient.

GuB-42

For now, it is better to assume it is the truth.

The simple fact that models are released under licenses, which may or may not be free, implies that they are intellectual property. You can't license something that is not intellectual property.

It is a standard disclaimer, if you disagree, talk to your lawyer. The legal situation of AI models is such a mess that I am not even sure that a non-specialist professional will be of great help, let alone random people on the internet.

npteljes

I think it's all of these. It's:

1. the current, unproven-in-court legal understanding,

2. a standard disclaimer to cover OP's ass,

3. a tongue-in-cheek reference to the prevalent argument that training AI on data, and then offering it via AI, is being a parasite on that original data.

TeMPOraL

> reference to the prevalent argument that training AI on data, and then offering it via AI is being a parasite on that original data

Prevalent or not, phrased this way it's clear how nonsensical it is. The data isn't hurt or destroyed in the process of being trained on, nor does the process deprive the data owners of their data or of the opportunity to monetize it the way they ordinarily would.

The right terms here are "learning from", "taking inspiration from", not "being a parasite".

(Now, feeling entitled to rent because someone invented something useful and your work accidentally turned out to be useful, infinitesimally, in making it happen - now that is wanting to be a parasite on society.)

npteljes

I think the bad part of it is stripping consent from the original creators after they published their work. I personally see it as an unfortunate side effect of change. The artists of the future can create with AI already in mind, but this was not the privilege of the current and previous generations.

Getting back to "learning from", I think the issue is not the learning part but the recreation part. AI can churn out content at rates orders of magnitude higher than before, even compared to the age of Fiverr and similar tools and opportunities. This changes the dynamics of the interaction: previously it took someone tens of hours to create something; now it takes AI minutes. That is not participating in the same playing field, it's absolutely dominating it, completely changing it. That is something to have feelings about, especially if one's livelihood is also impacted. Data is not destroyed, and neither is its ownership, but people don't usually want the exact thing; they are content with a good-enough thing, and this takes away a lot of power from the artists, whose work is the lifeblood of artistic AI in the first place.

So I don't think it's as nonsensical as you state. But I do understand that it's not cut and dried the other way around either. Gatekeeping culture is definitely not a humane thing to do. Culture comes and goes, intermingles, inspires, and changes all the time, and people take from it and add to it all the time. Preserving copyright perfectly would neuter it, and slant the landscape even more towards the already powerful.

boothby

If I understand the position of major players in this field, downloading models in bulk and training a ML model on that corpus shouldn't violate anybody's IP.

zitterbewegung

IANAL, but this is not true: the model would be a piece of the software. If there is a copyright on the app itself, it would extend to the model. Models even have their own licenses; for example, LLaMA is released under this license [1].

[1] https://github.com/meta-llama/llama/blob/main/LICENSE

dragonwriter

The fact that model creators assert that models are protected by copyright and offer licenses does not mean:

(1) That they are actually protected by copyright in the first place, or

(2) That the particular act described does not fall into an exception to copyright like fair use, exactly as many model creators assert the exact same act does when applied to the materials their models are trained on, rendering the restrictions of the offered license moot for that purpose.

boothby

LLMs are trained on works -- software, graphics and text -- covered by my copyright. What's the difference?

TeMPOraL

The difference is that you pulling out a model is you potentially violating copyright, while the model itself being trained on copyrighted works is potentially them violating copyright.

I.e. the first one concerns you, the other is none of your business.

blitzar

If I understand the position of major players in this field, copyright itself is optional (for them at least).

zitterbewegung

True, I think there has to be a case that sets precedent for this issue.

rusk

They claim “safe harbour” - if nobody complains it’s fair game

Drakim

Is there a material difference between the copyright laws for software and the copyright laws for images and text?

_DeadFred_

Yeah no.

An example for legal reference might be convolution reverb. Basically, it's a way to record what a fancy reverb machine does (using copyrighted, complex math algorithms) and cheaply recreate the reverb on my computer. It seems companies can do this as long as they distribute the protected reverbs separately from the commercial application. So Liquidsonics (https://www.liquidsonics.com/software/) sells reverb software but offers the "protected" convolution reverbs, specifically the Bricasti ones in dispute, as free downloads (https://www.liquidsonics.com/fusion-ir/reverberate-3/).

Also, while a SQL server can be copyright protected, that copyright does not extend to give the SQL server's creators protection/ownership over a database it hosts.

Fragoel2

There's an interesting research paper from a few years ago that extracted models from Android apps on a large scale: https://impillar.github.io/files/ccs2022advdroid.pdf

asciii

That's pretty cool! I'm impressed by the Frida tool, especially reading in the binary and dumping it to disk by overwriting the native method.

The author only mentions APK for Android, but what about iOS IPA? Is there an alternative method for handling that archive?

1vuio0pswjnm7

"Keep in mind that AI models, like most things, are considered intellectual property. Before using or modifying any extracted models, you need the explicit permission of their owner."

Is that really true? Is the law settled in this area? Is it the same everywhere, or does it vary from jurisdiction to jurisdiction?

See, e.g., https://news.ycombinator.com/item?id=42617889

Ekaros

Can you launder an AI model by feeding it to some other model or training process? After all, that is how it was originally created, so it cannot be any less legal...

benreesman

There is a family of techniques, often called something like "distillation". There are also various synthetic training data strategies; it's a very active area of research.
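For the distillation part, a toy sketch of the usual objective: temperature-softened KL divergence between teacher and student outputs, per Hinton et al.; PyTorch assumed, names illustrative:

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
        # KL divergence between temperature-softened distributions,
        # scaled by T^2 as in the original formulation.
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        return F.kl_div(log_p_student, p_teacher,
                        reduction="batchmean") * (T * T)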

As for the copyright treatment? As far as I know it’s a bit up in the air at the moment. I suspect that the major frontier vendors would mostly contend that training data is fair use but weights are copyrighted. But that’s because they’re bad people.

qup

The weights are my training data. I scraped them from the internet

benreesman

That sentiment is ethically sound and logically robust and directionally consistent with any uniform application of the law as written.

But there is a group of people, growing daily in influence, who utterly reject such principles as either worthy or useful. This group of people is defined by the ego necessary to conclude that when the stakes are this high, the decisions should be made by them; that the ends justify the means on arbitrary antisocial behavior (cf. the behavior of their scrapers) as long as this quasi-religious orgasm of singularity is steered by a firm hand that is willing and able to see it through.

That doesn't distress me: L. Ron Hubbard has that.

It distresses me that HN as a community refuses to stand up to these people.

bangaladore

To some extent this is how many models are being produced today.

Basically, it's just a synthetic loop: using a previously developed, formerly SOTA model like GPT-4 to train your model.

This can produce models with seemingly similar performance at a smaller size, but to some extent, fewer bits will be less good.