Constant-time coding will soon become infeasible

rwmj

RISC-V has a constant-time coding extension (Zkt, https://riscv-software-src.github.io/riscv-unified-db/manual...). I believe the idea is that your CPU will have a dedicated core that implements Zkt and performs a known set of operations in constant time. I'm not sure anyone has actually implemented it.

Do other ISAs have anything similar?

camel-cdr

No, Zkt is part of the application profiles, so it's expected to be implemented by high-performance out-of-order (OoO) cores. It just defines that a list of instructions must have "data-independent timing".

tromp

The article details how current techniques for preventing timing attacks are becoming ineffective, for the following reasons:

> 1. Compilers are applying optimization techniques that are heuristically good for performance on general software, but happen to leak information through timing-based sidechannels when used on secret data. Since general-purpose CPUs and compilers are applied to a wide variety of tasks, compilers have no reason to stop doing such optimizations.

> 2. Constant-time coding techniques mostly aim at fooling the compiler, to prevent it from applying these optimizations. Since compilers keep getting smarter, these techniques lose their efficiency.

> 3. Just-in-time (JIT) compilers have access to runtime data; they can, and will, use such information to perform extra optimizations that static compilers cannot do, and further destroy the developer’s attempt at achieving constant-time code.

> 4. JIT compilation is becoming pervasive and is now employed in various contexts, in particular inside CPUs. Such compilers can do the same optimizations as “external” JIT compilers, but are not publicly documented, and the output of their “machine” code translation cannot be inspected to detect deviations from constant-time execution.

> 5. Modern hardware and software stacks use a very layered structure, with each layer striving to hide its internal functioning details from upper layers. The socio-economic context which allowed such hardware to exist inherently relies on such abstraction (under the name of “industrial secrecy”). An effective constant-time coding process would require a model of computation with strong guarantees on timing-related characteristics, that would be maintained through all these layers, down to the semiconductors that implement the logic gates. This industry-wide vertical cooperation is unlikely to happen.

hinkley

> The intent is to produce the mask values without using a plain comparison (“x != 0”) so that the compiler does not notice that we are really making a conditional move. The compiler is not fooled.

So practice your martial arts on the bow of a rowboat. Balance, Daniel-san.

The whole timing thing is essentially an issue of getting off-balance, as in martial arts, and the solution is that there are actions you simply never do, and ones that look like they lack economy of motion because so much of the work goes into keeping your center of gravity from shifting outside your base. Tai chi is infinitely better at protecting you from concussing yourself on black ice than it is at protecting you from a mugger.

So if you have a conditional move for one calculation, then move a to b if !c and move it to d if c. And when the compilers chase that down in another generation, or someone notices that the L1 cache timing is different when d goes cold, then use both calculations in the output. Calculate complements so that the runtime of a plus the runtime of !a equals 1 to several decimal places.

Will it work for eternity? Unlikely. Will it work for a decade? Probably.
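
A minimal C sketch of the "do both moves" idea above (variable names follow the comment; as the article argues, nothing prevents a future compiler from seeing through the masks):

```c
#include <stdint.h>

/* c is 0 or 1.  Build an all-zeros or all-ones mask from it, then
 * write both destinations every time: a lands in b when c == 0 and
 * in d when c == 1, so the work done does not depend on c. */
static void move_both_ways(uint64_t a, uint64_t c, uint64_t *b, uint64_t *d)
{
    uint64_t mask = (uint64_t)0 - c;      /* 0x00...00 or 0xFF...FF */
    *b = (*b & mask) | (a & ~mask);       /* b = a if c == 0        */
    *d = (*d & ~mask) | (a & mask);       /* d = a if c == 1        */
}
```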

Terr_

So basically compilers, and optimizations inside CPUs, will keep aggressively optimizing code to create the vulnerability.

Perhaps this indicates a missing concept in how we compile and execute code: metadata that demands that a section not be optimized.

colonial

At least on the compiler level, LLVM already offers a "black box" intrinsic that nudges the optimizer to be maximally pessimistic about whatever it wraps.

It doesn't carry any security guarantees (Rust's documentation on their wrapper for it explicitly strikes out "cryptographic or security" purposes), but one could imagine a similar intrinsic with stronger abilities.
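
For comparison, the classic C-level version of the same idea is an empty inline-asm "value barrier", a GCC/Clang idiom used by several crypto libraries; it is best-effort only, not a guarantee, and the function names here are just illustrative:

```c
#include <stdint.h>

/* The empty asm tells the compiler nothing about x, so it must treat
 * the value as unknown afterwards and cannot fold it into a branch. */
static inline uint32_t value_barrier_u32(uint32_t x)
{
    __asm__("" : "+r"(x));
    return x;
}

/* Example: turn a 0/1 bit into an all-zeros/all-ones mask without the
 * optimizer recognizing the pattern as a conditional. */
static inline uint32_t ct_mask_u32(uint32_t bit)
{
    return (uint32_t)0 - value_barrier_u32(bit);
}
```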

ForOldHack

Everyone will be forced to start vibe-coding, a fad type of coding where you don't know the language but hope that the AI does, and it will spin out reams of useless, unintelligible, random gibberish that no one understands, hopefully compiling into something that only by luck produces any answer at all, for investors to shout with glee, until CrowdStrike strikes again and there is no one left to explain how it happened. Computers were not that useful in the first place.

lovich

> Everyone will be forced to start vibe-coding, a fad type of coding where you don't know the language but hope that the AI does, and it will spin out reams of useless, unintelligible, random gibberish that no one understands, hopefully compiling into something that only by luck produces any answer at all, for investors to shout with glee, until CrowdStrike strikes again and there is no one left to explain how it happened. Computers were not that useful in the first place.

I mean, I can walk you through how my C# gets compiled down to IL, but start going lower and, certainly by the time you're in assembly and machine code, it's as inscrutable to me as how AI spits out answers.

The days when someone could explain how every piece of these machines worked are long gone.

neonsunset

It's surprisingly straightforward. If you run your application with `DOTNET_JitDisasm='methodName'` you will get the exact code that gets executed for a given method. It's pretty well annotated and readable. You can also control JIT behavior with e.g. the AggressiveOptimization flag (which can, ironically, make performance worse), which bypasses tiered compilation and PGO, always producing the same output.

But other than that, yes: with techniques like load value prediction and data memory-dependent prefetching, there is little you can do at the codegen level.

briansmith

At https://rwc.iacr.org/2025/program.php you can see there is a talk scheduled to be given in a couple weeks titled "Testing Side-channel Security of Cryptographic Implementations against Future Microarchitectures" with the following bits in the abstract: "Using this framework, we conduct an empirical study of 18 proposed microarchitectural optimizations on 25 implementations of eight cryptographic primitives in five popular libraries. We find that every implementation would contain secret-dependent leaks, sometimes sufficient to recover a victim’s secret key, if these optimizations were realized. Ironically, some leaks are possible only because of coding idioms used to prevent leaks under the standard constant-time model."

kibwen

Give me hardware with a discrete cryptographic processing unit, which is just a CPU that has no speculation, caching, or other nonsense. In other words, a CPU that actually works the way it's supposed to work. I don't care that idiots want their wrong software to be as fast as possible; I want correct software that's as fast as possible, and no faster.

layer8

As the paper notes (page 21 last paragraph), this isn’t just about cryptographic algorithms though, it’s about the processing of any “secret” data.

pjc50

A TPM, HSM, or Intel AES-NI type solution?

hinkley

You know, given the rise of chiplets, a custom core isn’t a stupid idea…

hinkley

Are we going to end up doing all our crypto on the big.LITTLE cores?

I wonder if they’re fast enough…

adamc

But they want to sell as many CPUs as possible, right? So that market wins.

Sesse__

You can easily buy such CPUs in the microcontroller market, just go ahead. They're being sold by the billions.

adamc

Presumably, performance also matters.

i2km

One whole technique not mentioned in the paper or comments is bitslicing. For non-branching code (e.g. symmetric ciphers) it's guaranteed constant-time, and it would be a remarkable compiler indeed that could introduce optimizations and timing variations into bitsliced code...
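
A tiny illustration of why (types and function name are mine): with bitslicing, bit i of each word belongs to instance i, and the circuit is evaluated with plain boolean operations, so there is nothing data-dependent for a compiler or CPU to exploit.

```c
#include <stdint.h>

/* 64 independent 1-bit full adders evaluated at once: only AND, OR
 * and XOR are used, so execution time cannot depend on the data. */
static void bitsliced_full_adder(uint64_t a, uint64_t b, uint64_t cin,
                                 uint64_t *sum, uint64_t *cout)
{
    uint64_t t = a ^ b;
    *sum  = t ^ cin;
    *cout = (a & b) | (t & cin);
}
```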

jjmarr

Can an integrated CPU-FPGA design (like the ones AMD sells) be a solution to this problem? Point 5 of the paper says:

> Modern hardware and software stacks use a very layered structure, with each layer striving to hide its internal functioning details from upper layers. The socio-economic context which allowed such hardware to exist inherently relies on such abstraction (under the name of “industrial secrecy”). An effective constant-time coding process would require a model of computation with strong guarantees on timing-related characteristics, that would be maintained through all these layers, down to the semiconductors that implement the logic gates. This industry-wide vertical cooperation is unlikely to happen

It's not mentioned in the paper, but hardware description languages provide strong guarantees on timing-related characteristics at the semiconductor level. Hardware companies are willing to provide user access to FPGAs, as evidenced by the user-space FPGA subsystem in the Linux kernel.[1] You can buy Versal SoCs right now that combine ARM CPUs with FPGAs on one die, run Linux on them, and outsource your crypto tasks to the FPGA.

I see a future where we give up on constant-time coding for CPUs and move crypto libraries to FPGAs, given the hundreds of billions of dollars in our economy reliant on strong encryption. That'd be a world in which every corporate laptop is required to have an FPGA for compliance purposes.

It'd be interesting to know if anyone has tried this.

[1] https://www.phoronix.com/news/FPGA-User-Space-Interface-AMD

NoahZuniga

FPGAs are very expensive, so a TPU that implements the most important algorithms might be much cheaper, because the design cost can be amortized over the huge number of chips required.

jjmarr

I believe the challenge that FPGAs can address is sociotechnical, because the developer of a crypto library will have much more control over the stack and does not need to trust others.

Many high-frequency trading companies use FPGAs over ASICs for similar reasons. FPGAs are more expensive but allow them to have full control over the algorithms implemented, and don't require giving secret information to a foundry.

In other words, eliminate the impedance mismatch between responsibility and control for developing a secure system.

It'll be cheaper to implement cryptography on an ASIC. But the author of this paper wants to control every single aspect of the encryption/decryption process. Said developer can confidently say the system is secure. You can't say you've delivered a secure system if you're getting your ASICs from another company that doesn't provide implementation details because it'd (justifiably) give others a competitive advantage.

The question I'd have is whether the cost difference between ASICs and FPGAs is worth it for the majority of applications. $1 or $10 on every CPU might mean a world in which every laptop has an FPGA, but $100? What about for server-side applications? Would a hyperscaler spend $1000 extra on every rack if it allowed guaranteed constant-time encryption?

burch45

It’s not about “giving secret information to a foundry”. It’s entirely about the field-programmable (FP) part. It’s also not really programmable in the sense that you would be sending in new instructions in real time; reconfigurable is a better word. So giving everyone an FPGA in their laptop isn’t really going to help anyone except some enthusiast who wants to try out different algorithms.

hinkley

My impression is that there’s a lot of mental shorthand in the chip design community and FPGAs are used for prototyping and then translated into ASICs for any niche or larger applications. I presume there’s a pretty straightforward translation process between the two, though no one has ever tried to explain it to me.

boothby

A very simple description of an FPGA is that it's got a bunch of switches on the ends of wires. Some of the switches can connect wires to logic elements and some can connect wires to other wires. In this view, programming an FPGA is just toggling switches so that some of them are "on" and the rest are "off".

The easiest migration from FPGA to ASIC is to just make a chip with the same circuit elements and wire segments, but instead of making switches, you just short out connections in the "on" state and leave the rest open.

pjc50

The FPGA idea raises security questions of its own - how do you get the bitstream over securely, then the data and results, without leaking the data you're trying to protect or being vulnerable to evil clones of the bitstream? Or the FPGA debug JTAG interface?

Meanwhile, Windows is requiring that every laptop have a small crypto processor for its own compliance processes (i.e., BitLocker).

jjmarr

I would assume the bitstream would only contain non-secret implementation details and keys would be loaded at runtime rather than being synthesized into the design.

In terms of moving the data over to the FPGA, I have no idea. But if it's all on the same die, it doesn't seem like there's a big physical attack surface that's different than just doing stuff on the CPU.

hinkley

I’m confused by the quoted text because timing attacks rely on at-most behavior and abstraction layers on at-least behavior.

Abstractions cannot send messages before they receive them. So any delay you add at the top must be magnified as you go down. The exception to this is if the contents of the message require different behaviors for different payloads, in which case they are correct. But encrypted payloads are opaque to the layers they are sent through. You can observe who the message was sent to, and know the maximum amount of data the conversation might have contained. But not a very clear idea of how long it took to build the message, unless you’ve already compromised the machine.

manwe150

Recent timing attacks rely on the observation that modern CPUs send some messages faster than other messages, based on predicting what they might contain, so some delays get magnified (those that deviate from expectations) while other delays (those that match prior data) get minimized as you go down. An encrypted payload leaks this same information too (the process is independent of what data is being transferred), although that leaked information is (hopefully) not useful since it just leaks the encrypted data, which (hopefully) looks like random noise. But that data has to be encrypted and decrypted at some point somewhere, which gives a point to attack.

kmeisthax

x86 and ARM both have options for executing certain instructions with data-independent timing. The problem here is that languages and compilers need to expose and honor data types that compile down to only those instructions.

This would be something like a 'secret' keyword that would qualify integer types; i.e. in C you could have a variable of type 'secret unsigned int' and the compiler would reject optimizations that would reveal timing information about the variable's contents, while optimizing the rest of the program's non-secret variables. Values can be promoted to secret but not demoted. Code that intends to reveal secrets (e.g. to write them out to secure storage) must explicitly cast away the secretness.

AFAIK Golang has crypto/subtle, but that's more a set of special functions for doing certain things in a data-independent way than pervasive language support (and, presumably, they have to keep updating the compiler to not optimize that specific module).
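
Without compiler support, the proposed discipline can only be approximated at the library level; a hypothetical C sketch of the idea (all names are made up):

```c
#include <stdint.h>

/* Stand-in for a 'secret' qualifier: the raw value is only reachable
 * through operations intended to be constant-time, plus one explicit,
 * auditable declassification step. */
typedef struct { uint32_t v; } secret_u32;

static inline secret_u32 secret_from(uint32_t x)  { return (secret_u32){ x }; }

static inline secret_u32 secret_xor(secret_u32 a, secret_u32 b)
{
    return (secret_u32){ a.v ^ b.v };
}

/* Select a when bit == 1, b when bit == 0, without branching. */
static inline secret_u32 secret_select(secret_u32 bit, secret_u32 a, secret_u32 b)
{
    uint32_t mask = (uint32_t)0 - bit.v;
    return (secret_u32){ (a.v & mask) | (b.v & ~mask) };
}

/* The only way back to a plain value; every call site can be audited. */
static inline uint32_t secret_declassify(secret_u32 x)  { return x.v; }
```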

jiehong

So far, SIMD operations are constant-time, since the whole vector needs to be computed, not just a single element of it.

ChaCha20 or Kyber might be using that. Although I don’t know if any cryptographic algorithm depends on vector-only calculation as a basis for its strength.
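
As an illustration of the ChaCha20 case (a sketch using SSE2 intrinsics; a real implementation also handles block setup and lane rotation): the quarter-round is nothing but additions, XORs and fixed rotations applied to every lane, which is why it is usually assumed to run in data-independent time.

```c
#include <emmintrin.h>  /* SSE2 */

/* Rotate each 32-bit lane left by n bits. */
#define ROTL32(x, n) _mm_or_si128(_mm_slli_epi32((x), (n)), \
                                  _mm_srli_epi32((x), 32 - (n)))

/* One ChaCha20 quarter-round over four parallel lanes: only adds,
 * XORs and constant shifts, applied to the whole vector at once. */
static void chacha_quarter_round(__m128i *a, __m128i *b, __m128i *c, __m128i *d)
{
    *a = _mm_add_epi32(*a, *b); *d = _mm_xor_si128(*d, *a); *d = ROTL32(*d, 16);
    *c = _mm_add_epi32(*c, *d); *b = _mm_xor_si128(*b, *c); *b = ROTL32(*b, 12);
    *a = _mm_add_epi32(*a, *b); *d = _mm_xor_si128(*d, *a); *d = ROTL32(*d, 8);
    *c = _mm_add_epi32(*c, *d); *b = _mm_xor_si128(*b, *c); *b = ROTL32(*b, 7);
}
```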

Alternatively, I suppose that specialised cryptographic CPUs could be made to follow strict standards ensuring constant-time execution.

viraptor

> SIMD operations are constant-time, since the whole vector needs to be computed

Is that actually guaranteed? Having to process the whole vector doesn't necessarily mean the time can't vary for different values in complex operations (like division).

I can't find the details, but https://gleissen.github.io/papers/iodine.pdf at least mentions "Dually, if timing variability is unavoidable, e.g., in SIMD or floating-point units, making this variability explicit can better inform...", so it sounds like SIMD is at least situation-specific here.

manwe150

I wonder if you could have a compiler that intentionally pads every instruction to use a vector register pair of <value, slow-const>, so that every value gets paired with whatever constant or other expression will cause the slowest execution of that vector instruction and the maximum latency chain. Otherwise, it does look like the VDIV instructions can have variable latencies based on the data itself (and not just Spectre-like speculation issues on the memory access patterns).

From https://uops.info/table.html

hinkley

In MD5, there are calculations that, if they result in certain data ranges in certain bytes, trigger a secondary calculation to spread that change into other bytes. Like a non-Euclidean carry operation.

So I would think if you treat all of the bytes as one SIMD instruction, do a SIMD compare, and then fire another operation as a result… if you could guarantee that the third operation would have to fire for at least one of the vector values every time, then I think you’d have your solution.

pclmulqdq

Most constant-time code today is done in assembly, which is a mild downgrade from when it was done in C.

pjc50

Yes. I keep arguing that C is further and further from even pretending to be a "portable assembler", and rather than trying increasingly elaborate tricks to generate the series of machine instructions you want, you should just emit those instructions yourself in the first place.

The mention of CMOV is good. Additional tricks are available through vectorization - necessarily all elements of the vector have to execute at the same speed, which gives you extra options to put crafted data in a different column to force computation to proceed at a known rate.
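
In that spirit, a small x86-64 GCC/Clang inline-assembly sketch (the function name is mine) that pins the selection to an actual CMOV instead of hoping the compiler emits one:

```c
#include <stdint.h>

/* Returns a when cond != 0, otherwise b, via an explicit CMOV. */
static inline uint64_t ct_select_u64(uint64_t cond, uint64_t a, uint64_t b)
{
    __asm__("test %[c], %[c]\n\t"
            "cmovnz %[a], %[b]"
            : [b] "+r"(b)
            : [a] "r"(a), [c] "r"(cond)
            : "cc");
    return b;
}
```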

We should not forget that some CPUs offer dedicated instructions - e.g. Intel AES-NI.

Another solution is to just stop trying to do it on the general purpose processor altogether and move it to some sort of HSM or TPM. Put your eggs in a smaller, more defensible basket. In a world of GPUs and AI accelerators it's easier to ask for a crypto accelerator (or rather decelerator!).

Is there any good work on constant-time crypto on the GPU?

If anyone manages to make homomorphic encryption viable, that provides another solution, at a huge performance penalty.

pclmulqdq

I don't think it's anywhere close to viable to move the cryptographic parts of the data plane into HSMs/TPMs. There's just too much work to do there, and you have to move plaintext over unsecured channels to do it. That means that you have to put at least an ephemeral key into the CPU, and the rest follows.

AES-NI, the SHA instructions, and constant-time subsets of instructions are generally good enough that you can do this in assembly.
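
For instance, the AES-NI path boils down to a handful of data-independent instructions. A sketch using intrinsics rather than raw assembly (function name is mine; key expansion omitted; needs AES-NI enabled at compile time, e.g. -maes on GCC/Clang):

```c
#include <wmmintrin.h>  /* AES-NI intrinsics */

/* AES-128 block encryption given 11 pre-expanded round keys; every
 * aesenc/aesenclast executes in data-independent time. */
static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
{
    block = _mm_xor_si128(block, rk[0]);
    for (int i = 1; i < 10; i++)
        block = _mm_aesenc_si128(block, rk[i]);
    return _mm_aesenclast_si128(block, rk[10]);
}
```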

zokier

Editorialized title; the original is "Constant-Time Code: The Pessimist Case".

Most of the points brought up in the article have already been happening for a decade or more; it's not something that 'will soon' happen.