Reverse engineering the obfuscated TikTok VM

kleiba

I've been using a shitty streaming website whose player interrupts the playback of a video at irregular intervals and presents a cryptic error message. I've started looking into the JavaScript code to see if I can't code up a work-around mechanism (basically debugging their garbage implementation), and of course (why actually?) their player code is also obfuscated.

And I've gotta say, employing an AI assistant has proven to be an invaluable help in trying to understand obfuscated code. It's actually really cool to take a function of gobbledegook JavaScript and ask the AI to rewrite it in a more canonical and easily understandable way, with inline comments. Of course, there are flaws every now and then, but the ability to do this has been such a game changer for reverse engineering, IMO.

I can even ask it to take a guess at finding better variable/function names, and the AI can infer from the code (maybe it has seen the unobfuscated libraries during training?) what this code is actually doing on a high level and turn something like e.g(e.g) into player.initialize(player.state), which is nothing short of amazing.

So for anyone doing similar work, I cannot recommend highly enough to have an AI agent as another tool in your tool belt.

lukan

Which AI agents did you use?

kleiba

I've tried different ones, they all seem to do a great job.

sureIy

Could you name a couple?

klabetron

Out of curiosity (as someone disappointingly new to prompt engineering), what’s an example prompt you used with some success?

godelski

This seems like quite a lot of work to hide the code. What would the legitimate reasons for this be? Because it looks like it would make the program less optimized and more complexity just leads to more errors.

I understand the desire to make it harder for bots, but 1) it doesn't seem to be effective and bots seem to be going a very different route 2) there's got to be better ways that are more effective. It's not like you're going to stop clones through this because clones can replicate by just seeing how things work and reverse engineer blackbox style.

noduerme

A generous take would be that they have their own internal GUI tools that make it easier for non-programmers to set up visual elements in this. That was historically the reason to invent VMs like Flash. A less generous take would account for the enormous potential for hiding nefarious code inside such a thing, and account for the nature of the government which deployed it, and conclude that it was a national security / defense project disguised as a candy-coated trojan horse.

supriyo-biswas

VM-based architectures are really common in the obfuscation space, which is why you have executable packers[1], JS packers[2] and bot management products[3][4] leveraging similar techniques.

As for why the obfuscation is needed: bot management products suffer from a fundamental weakness in that ultimately, all of them simply collect static data from the environment, therefore it would make much more sense to make the steps involved as difficult to reverse engineer as possible. Once that is done, all you need to do is slightly change the schematics of your script every few weeks and publish a new bundle, and you've got yourself a pretty unsubvertible* protection scheme.

Regarding the "trojan horse", I think someone has yet to show proof that it's a JavaScript exploit.

(*Unsubvertible is obviously relative, but raising the cost of the attack from, say, $0.01/1000 requests to $10/1000 requests would massively cut down on abuse.)

[1] https://vmpsoft.com/

[2] https://jscrambler.com/

[3] https://github.com/neuroradiology/InsideReCaptcha

[4] https://www.zenrows.com/blog/bypass-cloudflare#_qEu5MvVdnILJ...

davidsojevic

Making it harder for bots usually means that it drives up the cost for the bots to operate; so if they need to run in a headless browser to get around the anti-bot measures it might mean that it takes, for example, 1.5 seconds to execute a request as compared to the 0.1 seconds it would without them in place.

On top of that 1.5 seconds, there is also a much larger CPU and memory cost from having to run that browser, compared to a simple direct HTTP request, whose cost is near negligible.

So while you'll never truly defeat a sufficiently motivated actor, you may be able to drive their costs up high enough that it makes it difficult to enter the space or difficult to turn a profit if they're so inclined.
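
To put rough numbers on the 1.5 second vs 0.1 second example, here is a minimal sketch of the throughput a single worker would get in each case. It assumes the worker issues requests strictly one after another and ignores the extra CPU and memory cost mentioned above; the constant and function names are made up for illustration.

```typescript
// Latencies from the example above, in milliseconds (integers avoid float noise).
const DIRECT_MS = 100;    // plain HTTP request, no anti-bot measures
const HEADLESS_MS = 1500; // same request through a headless browser
const MS_PER_HOUR = 3_600_000;

// Assumption: one worker, requests issued sequentially, no parallelism.
function requestsPerHour(latencyMs: number): number {
  return Math.floor(MS_PER_HOUR / latencyMs);
}

const directRate = requestsPerHour(DIRECT_MS);
const headlessRate = requestsPerHour(HEADLESS_MS);

// How many times more workers the bot operator needs for the same volume.
const slowdown = directRate / headlessRate;

console.log({ directRate, headlessRate, slowdown });
```

Under those assumptions the operator needs 15 times as many workers for the same request volume, before even counting that each worker is now a full browser.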

rfoo

Google has been doing this since forever for recaptcha. And, to be fair, it seems to be fairly effective for bot detection.

https://github.com/neuroradiology/InsideReCaptcha

> bots seem to be going a very different route

If the "very different route" means running a headless browser, then it's a success for this tech. Because the bot must run a blackbox JS now, and this gives people a whole new street of ways to run bot detection, using the bot's CPU.

throwaway48476

Makes it easier to hide code that does browser fingerprinting.

davidsojevic

Very impressive work! I always enjoy a good write up about reverse engineering efforts and yours was really simple to follow.

Many popular/large websites and bot protection services usually have environment checking as a baseline and mouse-movement tracking in some of the more aggressive anti-bot checks.

It's always interesting to see how long it takes from when the measures have been defeated/publicised until the service ends up making changes to their mechanism to make you start over (hopefully not from scratch).

ronsor

There is no legitimate reason for a social media platform to employ this much obfuscation.

fidotron

If you believe this you underestimate how adversarial the software world really is. TikTok will be on the receiving end of botnets run by everyone from commercial entities to state-backed groups and criminals.

They won't be betting that this stops that entirely, but it adds a layer of friction that is easy for them to change on a continuous basis. These things are also very good for leaving honeypots in where if someone is found to still be using something after a change you can tag them as a bot or otherwise hacking. Both of those approaches are also widely used in game anti-cheat mechanisms, and as shown there the lengths people will go to anyway are completely insane.
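
A minimal sketch of the honeypot idea, assuming the server can tell which bundle version a request was produced by. Everything here (the function name, the version numbers, the grace period) is hypothetical and only illustrates the logic, not how TikTok or any anti-cheat system actually does it.

```typescript
type Verdict = "ok" | "flag";

// retiredAtMs maps a retired bundle version to the time it was replaced.
// Assumption: versions not in the map are still current.
function checkBundle(
  version: number,
  nowMs: number,
  retiredAtMs: Map<number, number>,
  graceMs: number,
): Verdict {
  const retired = retiredAtMs.get(version);
  if (retired === undefined) return "ok";
  // Real browsers may hold a cached bundle for a while, so allow a grace
  // period; anything still using the old scheme after that is suspicious.
  return nowMs - retired > graceMs ? "flag" : "ok";
}

// Hypothetical data: version 1 was retired at t = 0, grace period is 1000 ms.
const retiredVersions = new Map<number, number>([[1, 0]]);
const withinGrace = checkBundle(1, 500, retiredVersions, 1000);
const afterGrace = checkBundle(1, 2000, retiredVersions, 1000);
const currentVersion = checkBundle(2, 2000, retiredVersions, 1000);

console.log({ withinGrace, afterGrace, currentVersion });
```

The point is that a bot hard-coded against last month's scheme keeps working from its own point of view while quietly identifying itself.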

krackers

The legitimate reason could be bot protection, the same way recaptcha uses a similar VM technique for obfuscation.

vasco

You not being able to come up with one is different from there not being any possible reason.

miohtama

It's to keep bots away and not turn into another Twitter.

dns_snek

That's probably not the goal. There are bots advertising illegal services (e.g. ads for "hacking services", illegal drugs) in most comment sections. If you report these comments, 99.9% of the time the report will be rejected with "no violations found" and the spam stays up.

bolognafairy

That doesn’t mean that it’s “probably not the intention”.

yard2010

This is not a social media platform but a government backed tool for doing stuff for the government.

0xDEADFED5

this is cool. i briefly worked on a TikTok bot a while back and it was a huge pain in the ass.

RexM

Is this VM somehow related to Lynx (their cross-platform dev tooling)?

https://lynxjs.org/

Also discussed on HN

https://news.ycombinator.com/item?id=43264957

weinzierl

Is there also a VM in their iOS app? I thought a VM would be against Apple's policies?

xmodem

Apple's policies prevent using JIT compilation; they don't ban VMs outright.

heinternets

Is TikTok so obfuscated to prevent people from knowing the full extent of data collection and device fingerprinting?

sylware

What's terrible are the humans writing such software...

But if AI can help to fight those people's work, good for humanity I guess.

That said... Is AI going to de-obfuscate/reverse engineer their obfuscated AI prompts or web apps?

domfie

Looks like a lot of work. I recently discovered webcrack and the tool jehna/humanify for such deobfuscation tasks.

3abiton

It could be interesting to see a comparison to OP's work.

worldsavior

That's a very strong obfuscation. Takes a lot of work to deobfuscate such a thing. Great writeup.

xfeeefeee

[flagged]

noduerme

Is calling a massive embedded JS obfuscator a "VM" a bit of a stretch? Ultimately it's not translating anything to a lower-level language.

Still, I had no idea. This is really taking JS obfuscation to the next level.

One kind of wonders, what is the purpose of that level of obfuscation? The naive take is that obfuscation is usually to protect intellectual property... but this is client-side code that wouldn't give away anything about their secret sauce algorithm.

MonkeyClub

> Is calling a massive embedded JS obfuscator a "VM" a bit of a stretch? Ultimately it's not translating anything to a lower-level language.

From the Repo's README:

"TikTok is using a full-fledged bytecode VM, if you browse through it, it supports scopes, nested functions and exception handling. This isn't a typical VM and shows that it is definitely sophiscated."

throwaway48476

VM obfuscation is a common technique for malware developers.

The VM term is applied because the obfuscator creates a custom instruction set and executes custom byte code. This is generated per build.
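
As a rough illustration of what "custom instruction set" and "generated per build" mean, here is a toy stack-based interpreter. It is a sketch under my own assumptions, not TikTok's VM: the four operations and both opcode tables are invented, and a real obfuscating VM has far more instructions (scopes, calls, exception handling).

```typescript
type OpName = "push" | "add" | "mul" | "halt";
// Maps a numeric opcode to its meaning. A per-build obfuscator would
// emit a different table (and matching bytecode) for every build.
type OpcodeTable = Record<number, OpName>;

function run(bytecode: number[], table: OpcodeTable): number {
  const stack: number[] = [];
  let pc = 0; // program counter

  // Read the next number from the bytecode (an opcode or an operand).
  const next = (): number => {
    const value = bytecode[pc];
    if (value === undefined) throw new Error("ran off the end of the bytecode");
    pc += 1;
    return value;
  };
  const pop = (): number => {
    const value = stack.pop();
    if (value === undefined) throw new Error("stack underflow");
    return value;
  };

  for (;;) {
    const op = next();
    const name: OpName | undefined = table[op];
    switch (name) {
      case "push": stack.push(next()); break; // operand follows the opcode
      case "add": stack.push(pop() + pop()); break;
      case "mul": stack.push(pop() * pop()); break;
      case "halt": return pop();
      default: throw new Error(`unknown opcode ${op}`);
    }
  }
}

// Two hypothetical builds: same program, (2 + 3) * 4, different opcode numbering.
const buildA: OpcodeTable = { 0: "push", 1: "add", 2: "mul", 3: "halt" };
const buildB: OpcodeTable = { 7: "push", 4: "add", 9: "mul", 1: "halt" };
const programA = [0, 2, 0, 3, 1, 0, 4, 2, 3];
const programB = [7, 2, 7, 3, 4, 7, 4, 9, 1];

console.log(run(programA, buildA), run(programB, buildB));
```

Both programs compute the same thing, but bytecode captured from one build is meaningless to the interpreter of the other, so whoever reverse engineered the opcode table has to start over after each new build.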

userbinator

You are replying to a comment that looks extremely unhuman.

codetrotter

It looks like OP filled out the text area along with the URL when submitting the post.

HN takes that text and turns it into a comment. I’ve seen it happen before.

The unfortunate outcome of that IMO is that sometimes text that makes sense as a description of a submission feels a bit out of place as a comment due to how it is worded. And these comments sometimes then end up getting downvoted.

I wouldn’t be completely sure it was not human written. Even though it feels a bit weird to read it as a comment.