
Beating Google's kernelCTF PoW using AVX512

pittma

Cool stuff! This method is very similar to how AVX-512-optimized RSA implementations work too, as they also have to do Very Large Exponentiations. This paper[1] covers how RSA does its windowing, including the formula showing that the window size can be arbitrary. One additional thing AVX-512 RSA implementations do, though, is store the results of the multiplications for the range [0..2^{window-size}) in a table; then, for each window, that result is retrieved from the table[2] and only the shifting/repositioning has to be done. A rough sketch of the windowed approach is below the links.

1. https://dpitt.me/files/sime.pdf (hosted on my domain because it's pulled from a journal)

2. https://github.com/aws/aws-lc/blob/9c8bd6d7b8adccdd8af4242e0...
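
To make the windowing concrete, here is a minimal sketch of fixed-window exponentiation with a precomputed table in plain Python; the real AVX-512 IFMA code does the same thing with vectorized multi-limb arithmetic, and the function name and window size here are just illustrative:

    def pow_fixed_window(base, exponent, modulus, window_size=5):
        # Table of base^i mod modulus for i in [0, 2^window_size).
        table = [1] * (1 << window_size)
        for i in range(1, 1 << window_size):
            table[i] = (table[i - 1] * base) % modulus

        nwindows = -(-exponent.bit_length() // window_size)  # ceiling division
        result = 1
        for w in reversed(range(nwindows)):
            # "Shift" the accumulator left by one window (window_size squarings)...
            for _ in range(window_size):
                result = (result * result) % modulus
            # ...then multiply in the table entry selected by this window's bits.
            bits = (exponent >> (w * window_size)) & ((1 << window_size) - 1)
            result = (result * table[bits]) % modulus
        return result

    assert pow_fixed_window(7, 65537, 2**127 - 1) == pow(7, 65537, 2**127 - 1)

Constant-time RSA code typically also touches every table entry on each lookup, so the selected window value doesn't leak through the memory access pattern.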

anematode

Ooh interesting, I should have looked at this while developing.... Looks like that code could definitely use another version for e.g. Zen 5, where using zmm registers would lead to 2x multiplication throughput. Also, the mask registers are bounced to GPRs for arithmetic, which is suboptimal on Zen 4/5.

Separately, I'm wondering whether the carries really need to be propagated in one step. (At least I think that's what's going on?) The chance that a carry-in leads to an additional carry-out beyond what's already there in the high 12 bits is very small, so in my code I assume that carries only happen once and then loop back if necessary. That reduces the latency in the common case. I guess with a branch there could be timing-attack issues, though.
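
For the curious, here is a rough Python model of that lazy-carry idea over radix-2^52 limbs (the 52-bit-limbs-in-64-bit-lanes layout IFMA code typically uses); the wrap-around of the top carry assumes the limbs exactly cover a Mersenne modulus width, so treat it as illustrative only:

    def propagate_carries(limbs, radix_bits=52):
        mask = (1 << radix_bits) - 1
        # One "parallel" step: keep each limb's low 52 bits and hand its high
        # bits to the next limb. Usually a single step clears all overflow;
        # loop back only in the rare case a carry-in itself overflows a limb.
        while any(v >> radix_bits for v in limbs):
            highs = [v >> radix_bits for v in limbs]
            limbs = [(v & mask) + (highs[i - 1] if i else 0)
                     for i, v in enumerate(limbs)]
            if highs[-1]:
                limbs[0] += highs[-1]  # top carry wraps around, valid mod 2^n - 1
        return limbs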

pittma

ymms were used here on purpose! With full-width registers, the IFMA insns have a deleterious effect on frequency, at least in the Icelake timeframe.

anematode

Yeah, hence a separate version for CPUs which don't have that problem. Although maintaining so many of these RSA kernels does seem like a pain. Didn't realize you wrote that code; super cool that it's used in practice!

ignoramous

> dpitt.me/files/sime.pdf (hosted on my domain because it's pulled from a journal)

One can also upload to archive.org: https://archive.org/download/sime_20250531/sime.pdf

SunlitCat

There is something off with

> ...and despite supporting it [AVX512] on consumer CPUs for several generations...

I dunno. Before Rocket Lake (11th gen), AVX512 was only available in those enthusiast CPUs, Xeon CPUs, or some mobile processors (which I wouldn't really call consumer CPUs).

With the 12th gen (and that performance/efficiency core concept), they disabled it after a few months in those cores and it was never seen again.

I am pretty sure, tho, that after AMD has some kind of success with AVX512, Intel will reintroduce it.

btw, I am still rocking an Intel i9-11900 CPU in my setup here. ;)

oparin10

That tracks. Intel's updated AVX10 whitepaper[1] from a few months back seems to confirm this. It explicitly states that 512-bit AVX will be standard for both P and E cores, moving away from 256-bit-only configs. This strongly implies AVX-512 is making a proper comeback, not just on servers but on future consumer CPUs with E-cores too. Probably trying to catch up with AMD's wider AVX-512 adoption.

[1] - https://cdrdv2.intel.com/v1/dl/getContent/784343 (PDF)

SunlitCat

Nice find! I really hope so, as AVX512 is something interesting to play around with, and I am pretty sure it will play a bigger role in the future, especially with AI and all the stuff around it!

Aurornis

> With the 12th gen (and that performance/efficiency core concept), they disabled it after a few months in those cores and it was never seen again.

The 12th gen CPUs with performance cores didn't even advertise AVX512 support or have it enabled out of the box. They didn't include AVX512 on the efficiency cores for space reasons, so the entire CPU was considered to not have AVX512.

It was only through a quirk of some BIOS options that you could disable the efficiency cores and enable AVX512 on the remaining CPU. You had to give up the E-cores as a tradeoff.

adgjlsfhk1

IMO this is more a sign of complete dysfunction at Intel. They definitely could have made avx512 instructions trigger a switch to p-cores, and honestly probably could have supported them completely (if slightly slowly) by splitting the same way AMD does on Zen4 and Zen5 C cores. The fact that they shipped P cores and E cores with different instruction sets is what you get when you have separate teams competing with each other rather than working together to make a good product.

pitust2

> They definitely could have made avx512 instructions trigger a switch to p-cores

Not really, no. OS-level schedulers are complicated as is with only P vs E cores to worry about, let alone having to dynamically move tasks because they used a CPU feature (and then moving them back once they no longer need it).

> and honestly probably could have supported them completely by splitting the same way AMD does on Zen4 and Zen5 C cores.

The issue with AVX512 is not (just) that you need a very wide vector unit, but mostly that you need an incredibly large register file: you go up from 16 * 256 bit = 4096 bits (AVX2) to 32 * 512 bit = 16384 bits (AVX512), and on top of that you need to add a whole bunch of extra registers for renaming purposes.

johnklos

> They definitely could have made avx512 instructions trigger a switch to p-cores,

That'd be an OS thing.

This is a problem that has been solved in the mainframe / supercomputing world and which was discussed in the BSD world a quarter of a century ago. It's simple, really.

Each CPU offers a list of supported features (cpuctl identify), and the scheduler keeps track of whether a program advertises use of certain features. If it does want features that some CPUs don't support, that process can't be scheduled on those CPUs.
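
A toy sketch of that feature-aware scheduling idea in Python (the CPU list and feature names are made up for illustration):

    # A task declares the ISA extensions it uses; the scheduler only
    # considers CPUs that advertise all of them.
    CPU_FEATURES = {
        0: {"sse4.2", "avx2", "avx512f"},   # P-core
        1: {"sse4.2", "avx2", "avx512f"},   # P-core
        2: {"sse4.2", "avx2"},              # E-core, no AVX-512
        3: {"sse4.2", "avx2"},              # E-core, no AVX-512
    }

    def eligible_cpus(required):
        return [cpu for cpu, feats in CPU_FEATURES.items() if required <= feats]

    print(eligible_cpus({"avx2"}))             # [0, 1, 2, 3]
    print(eligible_cpus({"avx2", "avx512f"}))  # [0, 1] -> only the P-cores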

I remember thinking about this way back when dual Nintendo cartridge Pentium motherboards came out. To experiment, I ran a Pentium III and a Celery on an adapter card, which, like talking about self hosting email, infuriated people who told me it can't be done. Different clock speeds, different CPU features, et cetera, worked, and worked well enough to wonder what scheduler changes would make using those different features work properly.

anematode

fixed :) thx

SunlitCat

Awesome! I always love to read about such stuff, especially if it includes playing around with AVX512!

fitzn

They won in 3.6 secs, but second place took 3.73 (or 3.74 if being consistent with the winning time). So, did second place also optimize the PoW or were they presumably on an FPGA as well?

The previous submission, which the author describes as some expensive FPGA one, was 4+ seconds. You'd think he'd mention something about how second place in his week was potentially the second-fastest submission of all time, no?

underdeserver

The image says (dupe). I'm guessing it's the OP's team, trying to submit as themselves multiple times in parallel.

yorwba

From the article: "Only the first team who is able to connect to and exploit the server, and submit the flag to a Google Form, receives a payout; any subsequent submissions are marked as duplicates."

bluelightning2k

This is impressive. But it just strikes me as optimizing the wrong thing. A CTF shouldn't be about submission ops. Surely it would be better for everyone if all teams who send the flag within a submission window shared the prize.

kimixa

I'd also say that this encourages people to sit on exploits instead of reporting them immediately, so they can try to submit them next time if they didn't get the payout, even without submission timing shenanigans.

So it may be actively encouraging the "wrong" thing as well.

rfoo

That would be a different meta-game, and while I haven't thought deeply about it, I believe the likely outcome is that people are just discouraged and won't consider submitting to kernelCTF.

bawolff

So if I understand correctly, it's a 4-second proof of work, and there is a payout once a month.

Are there really so many exploits that people are racing to get it every month?

Aurornis

The server was open every 2 weeks. The proof of work was to slow down connections slightly to reduce incentives to spam as many connection requests as possible.

Public CTFs are hard because inevitably some team will try something resembling a DDoS as part of racing to the finish.

Google removed the proof of work step after this.

philodeon

Yes, the myth of Linux kernel security is just that, a myth.

JackSlateur

Is it, tho?

The bug/feature ratio is what matters (systems with no bugs but no features are nice, but we have work to do).

jeffbee

How does this myth even exist? There has never been a moment in the history of Linux where it lacked an exploitable flaw. The only uncertainty is that for the last couple of days we aren't sure exactly what the exploit is.

thephyber

The myth has some truth to it.

The impact of the average Windows exploit was higher than that of the average Linux exploit because non-NT Windows didn't use best practices such as multiple accounts + least privilege. And for years there were daemons on Windows that would get exploited with RCE while just idling on a network (e.g. Conficker).

It took Microsoft several years of being a laughing stock of security before Bill Gates made the decision to slow new feature development while the company prioritized security. Windows security then increased. I believe that was around the time that Microsoft started reusing the NT kernel in the consumer versions of Windows.

Also, the myth ignores the fact that cybersecurity risk has components of likelihood (a percentage) and impact (an absolute number, sometimes a dollar value). The conflation of these two components invites lots of arguments and confusion, as commonly seen when certain CVEs are assigned non-obvious CVSS scores.

aidenn0

This exists because:

1. When Linux was "coming of age" (around the turn of the millennium), Windows security was really bad.

2. At the same time, there were far more Windows machines than Linux

3. #1 and #2 together made Windows such an attractive target that exploits for Linux were less of a thing.

landr0id

A myth propelled by people who don't understand security continually saying "Anyone can read the code, therefore it's more secure".

api

But we don’t need safe languages.

dist-epoch

These are local escalation of privilege exploits (becoming root from a regular user), not remote code execution. Escalation of privilege bugs are a dime a dozen.

singron

I think this also requires user (i.e. unprivileged) namespaces since you have to manipulate traffic control queue disciplines (tc qdisc). You normally need to be root to do this, so it's only useful as an escalation if you can do it within a namespace or you can convince some root daemon to do the qdisc manipulation for you (I think unlikely?).

User namespaces opened up an enormous surface area for local privilege escalation since they made a ton of root-only APIs available to non-root users.

I don't think user namespaces are available on Android, and they're sometimes disabled on Linux distributions, although I think more are starting to enable them.

Relatedly, kernelCTF just announced they will be disabling user namespaces, which is probably a good idea if they are inundated with exploits: https://github.com/google/security-research/commit/7171625f5...

internetter

It wouldn't be reasonable to expect it to be an RCE bug. That wouldn't be a kernel bug; it would be in the networking stack or the software running on it.

eptcyka

Where is the networking stack running on a typical Linux deployment? (:

mmsc

Amazing stuff, but it also reads like a comedy due to the obstacles to winning this challenge. Real Rube Goldberg stuff.

dcrazy

For more on the base-52 representation that this article mentions, see this other post from today’s front page: https://news.ycombinator.com/item?id=44132673

mshockwave

Nitpick: static linking won't give you inlining; it only eliminates the PLT overhead. LTO will give you more opportunities for inlining.

anematode

Consulted with Will to see exactly what he did and fixed, thx!

vlovich123

I don't understand the purpose of the race. Why not just pay out per unique exploit?

rfoo

The bosses want a strictly fixed budget for running these cool programs. The rationale behind these programs is (at least partially) about measuring exploit and mitigation dynamics, not buying bugs. And Linux is just so buggy that if you pay for every 0-day, it's basically out of control. Google tried that once (and tried to drain people's hoarded bugs): they ran a limited-time promotion with no race where every 0-day counted, and got flooded.

And at the same time you don't want to piss off the community, so here we go.

0x0

If Linux kernel security really is so bad that Google had to add a proof-of-work to introduce a 4-second race for 0-day submissions, I'm surprised they're OK with still using the Linux kernel as the base for Android.

udev4096

Android has a vastly improved version of the Linux kernel: https://old.reddit.com/r/GrapheneOS/comments/bddq5u/os_secur...

bonzini

What's the alternative that is efficient, feature complete(*) and more secure?

(*) For example, Android uses SELinux to confine apps, virtual machines (pKVM) to run DRM code, and so on. All these increase the overall security of the system and decrease the cost of kernel bugs, so there's a tradeoff that's not easy to evaluate.

kevincox

What is the alternative? I suspect all modern kernels are more or less just as vulnerable? They did start https://fuchsia.dev/ so maybe they are hedging against this problem? But making a fully-featured OS is a huge undertaking, especially if you need compatibility with existing apps and a wide variety of hardware.

imtringued

Google isn't doing the race thing. They just pay out to whoever submits a valid submission. The people doing the racing are the submitters, who want to get ahead of their competition and are stockpiling their exploits. If they were altruists, they would just submit their exploits for no remuneration. Hence the race isn't something Google is doing.

The proof of work isn't there to add a "four-second race". It's there to prevent DDoS-like spam.

pas

is there some more info somewhere about that flood? how many teams? how many submissions? how many unique? how much money G paid out?

rfoo

So, here is a tweet from Eduardo highlighting that they got a lot of 0day submissions immediately after they launched the promotion: https://x.com/sirdarckcat/status/1681491274489282565

And the spreadsheet is public [0]; I counted 7 unique hoarded bugs (submitted immediately after launch) after deduplicating by patch commit hash. Then, in the month following, another 9 unique bugs were submitted.

As for how much was paid, I don't remember. Could be around $200-300k in total.

[0] https://docs.google.com/spreadsheets/d/e/2PACX-1vS1REdTA29OJ...

joelthelion

It's a bit depressing that after all of these years, professionals only need 3 seconds to own a Linux box...

supriyo-biswas

Can anyone tell me how they equated the Python function in the blog to what was shown in Google's PoW implementation? It seems a little difficult to follow.

bonzini

The "exponent = (p + 1) // 4" in the Google code is 2^1277. To raise a number to such a huge power, you square it 1277 times (and that is what the Python function does). That works because if the initial value is "x", each squaring doubles the number of "x"s, and at the end you have 2^1277 of them.

quasiconnected

Beautiful use of number theory for the mersenne number modulo!