Compiling a neural net to C for a speedup
91 comments · May 28, 2025 · enricozb
hansvm
One of the easier solutions that does the rounds periodically (in many forms above and beyond logic gates, such as symbolic regression) is just densely connecting everything and ensuring that an identity function exists. Anneal a penalty around non-identity nodes, use L1 for the penalty, and you can learn a sparse representation.
There are a number of details to work through, such as making an "identity" for 2 inputs and 1 output (just don't offer those; use gates like a half adder instead of AND or XOR, adding a post-processing step removing extra wires you don't care about), or defining "densely connected" in a way that doesn't explode combinatorially (many solutions, details only matter a little), but it's the brute-force solution, and you only pay the cost during training.
There are lots of other fun ways to handle that problem though. One of my favorites is to represent your circuit as "fields" rather than discrete nodes. Choose your favorite representation for R2->Rn (could be a stack of grids, could be a neural net, who cares), and you conceptually represent the problem as a plane of wire density, a plane of "AND" density, a plane of "XOR" density, etc. Hook up the boundary conditions (inputs and outputs on the left and right side of the planes) and run your favorite differentiable PDE solver, annealing the discreteness of the wires and gates during training.
mochomocha
Ha! I have spent the last 2 years on this idea as a pet research project and have recently found a way of learning the wiring in a scalable fashion (arbitrary number of input bits, arbitrary number of output bits). Would love to chat with someone also obsessed with this idea.
andy12_
I'm also very interested. I played around a lot with Differentiable Logic Networks a couple of months ago, including how to make the learned wiring scale to a bigger number of gates. I had a couple of ideas that seemed to work at a smaller scale, but that had trouble converging with deeper networks.
UncleOxidant
Also very interested. Do you have any code on github?
mattdesl
I think the techniques in “Weight Agnostic Neural Networks” should be applicable here, too. It uses a variant of NEAT, I believe. This would allow for learning the topology and wiring rather than just the gates. In practice, though, it is probably pretty slow, and may not be all that different from a pruned and optimized DLGN.
jimkoen
To ruin it for everyone: They're also patented :) https://patents.google.com/patent/WO2023143707A1/en?inventor...
Lerc
What's the innovation here?
Using logic operators? Picking something from a range of options with SoftMax? Having a distribution to pick from?
I remember reading about adaptive boolean logic networks in the 90's. I remember a paper about them using the phrase "Just say no to backpropagation". It probably goes back considerably earlier.
Fuzzy logic was all the rage in the 90's too. Almost at the level of marketers sticking the label on everything the way AI is done today. Most of that was just 'may contain traces of stochasticity' but the academic field used actual defined logical operators for interpolated values from zero to one.
A quick look into picking from a selection turned up https://psycnet.apa.org/record/1960-03588-000 but these days softmax is just about ubiquitous.
jimkoen
> What's the innovation here?
> Having a distribution to pick from?
As I understand it, it's exactly this. Specifically, representing neurons in a neural network via a probability distribution of logic gates and then collapsing the distribution into the optimal logic gate for a given neuron via hyper-parameter tuning in the form of gradient descent. The author has a few more details in their thesis:
https://arxiv.org/abs/2209.00616
Specifically it's the training approach that's patented. I'm glad to see that people are trying to improve on his method, so the patent will likely become irrelevant in the future as better methods emerge.
The author also published an approach on applying their idea onto convolutional kernels in CNN's:
https://arxiv.org/abs/2411.04732
In the paper they promise to update their difflogic library with the resulting code, but they seem to have conveniently forgotten to do this.
I also think their patent is too broad, but I guess it speaks for the entire ML community that we haven't seen more patents in this area. I could also imagine that, given that the approach promises some very impressive performance improvements, they're somewhat afraid that this will be used for embedded military applications.
genewitch
My Zojirushi rice cooker says "fuzzy logic" on it, and it's 15 years old, so that phrase was still being marketed 15 years after its "inception".
JonChesterfield
If you replace the uint64_t cell with an __attribute__((vector_size(32))) and build with -march=native, the bitwise ops will work exactly as before but you'll light up the vector units on the x64 machine.
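Roughly like this, as a sketch (assuming GCC/Clang vector extensions; cell_t and gate are made-up names, not from the post):

    #include <stdint.h>

    /* Sketch only: with vector_size, the same bitwise expressions operate
       lane-wise on 4 x 64-bit lanes, i.e. 256 independent instances per
       instruction once the backend picks vector registers (-march=native). */
    typedef uint64_t cell_t __attribute__((vector_size(32)));

    static inline cell_t gate(cell_t a, cell_t b) {
        return a & ~b;  /* example gate: A AND NOT B, applied per lane */
    }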
Good blog post, thanks!
isaacimagine
Glad you enjoyed it, and thanks for the tip!
nine_k
The interesting thing here is that it's not a straightforward port. JAX is already very fast, for the architecture it implements. The point is that the network is heavily contracted by removing nodes that only do pass-through, and then hugely parallelizing the computations using bitwise operations on 64 bits at once. Hence this incredible speedup.
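For illustration, a hypothetical sketch (not the article's actual generated code) of what that looks like: prune the pass-through nodes, keep straight-line bitwise code, and pack 64 independent inputs into each word.

    #include <stdint.h>

    /* Hypothetical sketch: each bit position of a uint64_t carries an
       independent instance, so one bitwise instruction evaluates the same
       gate for all 64 of them at once. */
    static inline uint64_t tiny_circuit(uint64_t a, uint64_t b, uint64_t c) {
        uint64_t t0 = a ^ b;    /* XOR gate across all 64 bit-lanes */
        uint64_t t1 = t0 & ~c;  /* "A AND NOT B" gate */
        return t0 | t1;         /* OR gate */
    }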
gwern
How much of a speedup is the C compiler optimization able to achieve in terms of compiling it down to a hand-written C equivalent vs the -O0 non-optimized assembler? What does the optimized C/assembler do which isn't actually necessary and accounts for the remaining inefficiency?
isaacimagine
There are 163 lines of C. Of them, with -O3, 104 lines are present in the assembly output. So the C compiler is able to eliminate an additional ~36.2% of the instructions. It doesn't do anything fancy, like autovectorization.
I profiled just now:
| flags | instrs (aarch64) | time 100k (s) | conway samples (%) |
| -O0   | 606              | 19.10         | 78.50              |
| -O3   | 135              | 3.45          | 90.52              |
The 3.45s surprises me, because it's faster than the 4.09s I measured earlier. Maybe I had a P core vs an E core. For -O0, the compiler is emitting machine code like:

    0000000100002d6c ldr x8, [sp, #0x4a0]
    0000000100002d70 ldr x9, [sp, #0x488]
    0000000100002d74 orn x8, x8, x9
    0000000100002d78 str x8, [sp, #0x470]
Which is comically bad. If I try with e.g. -Og, I get the same disassembly as -O3. Even -O1 gives me the same disassembly as -O3. The assembly (-Og, -O1, -O3) looks like a pretty direct translation of the C. Better, but also nothing crazy (e.g. no autovectorization):

    0000000100003744 orr x3, x3, x10
    0000000100003748 orn x1, x1, x9
    000000010000374c and x1, x3, x1
    0000000100003750 orr x3, x8, x17
Looking more closely, there's actually surprisingly little register spilling. I think the real question you're asking is, as I wrote:
> If we assume instruction latency is 1 cycle, we should expect 2,590 fps. But we measure a number nearly 10× higher! What gives?
Part of this is due to counting the instructions in the disassembly wrong. In the blog post I used 349 instructions, going off Godbolt, but in reality it's 135. If I redo the calculations with these new numbers, I get 2.11 instructions per bit and 0.553 million instructions per step; dividing that into 3.70 Gcycles/s gives 6,690 fps. That's better than 2,590 fps, but still 3.6× slower than 24,400. I think that 3.6× is a factor you can chalk up to instruction-level parallelism.
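Spelled out (just restating the numbers above, and assuming one instruction retired per cycle):

    \frac{135\ \text{instructions}}{64\ \text{bits}} \approx 2.11\ \text{instructions/bit},
    \qquad
    \frac{3.70 \times 10^{9}\ \text{cycles/s}}{0.553 \times 10^{6}\ \text{instructions/step}} \approx 6{,}690\ \text{fps}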
Hope that answers your questions. Love your writing Gwern.
gwern
Thanks for checking. It sounds like the C compiler isn't doing a great job here of 'seeing through' the logic gate operations and compiling them down to something closer to optimal machine code. Maybe this is an example of how C isn't necessarily great for numerical optimization, or the C compiler is just bailing out of analysis before it can fix it all up.
A full-strength symbolic optimization framework like an SMT solver might be able to boil the logic gates down into something truly optimal, which would then be a very interesting proof of concept to certain people, but I expect that might be an entire project in its own right for you and not something you could quickly check.
Still, something to keep in mind: there's an interesting neurosymbolic research direction here in training logic gates to try to extract learned 'lottery tickets' which can then be turned into hyper-optimized symbolic code achieving the same task performance while possibly being far more energy-efficient or formally verifiable.
JonChesterfield
Something like this should be hitting the instruction-level vectoriser (the basic-block-at-a-time one) nearly bang on. It's a lot of the same arithmetic op interleaved. It might be a good test case for LLVM; I would have expected almost entirely vector instructions from this.
isaacimagine
z3 has good Python bindings, which I've messed around with before. My manual solution uses 42 gates; I would be interested to see how close to optimal it is. I didn't ask the compiler to vectorize anything; doing that explicitly might yield a better speedup.
Re: neurosymbolics, I'm sympathetic to wake-sleep program synthesis and that branch of research. In a draft of this blog post, I had an aside about the possibility of extracting circuits and reusing them, and another about doing student-teacher training to replace stable subnets of standard (e.g. dense ReLU) networks with optimized DLGNs during training, to free up parameters for other things.
gomoboo
Relevant post from a few years ago: https://news.ycombinator.com/item?id=25290112
“ NN-512 is an open-source Go program that generates fully AVX-512 vectorized, human-readable, stand-alone C implementations of convolutional neural nets”
andy12_
I also worked a long time ago in recreating the original Deep Differentiable Logic Network paper [1], so I have a couple of additions to make.
> I wanted to see if I could learn the wires in addition to the gates. I still think it’s possible, but it’s something I had to abandon to get the model to converge.
Actually, I read some other paper where they also learned the wiring, but they did so by alternating the training of the gates and the wires (in some iterations they learned the wiring while keeping the gates frozen, and in others they learned the gates while keeping the wiring frozen). The problem with this approach is that it inherently doesn't scale: you need a lot of gates to approximate the behavior of a simple MLP, and if you need a full NxM learned matrix to encode the wiring, the memory needed to learn, for example, MNIST gets huge quickly. I think there are 2 fixes for this:
- You actually don't need to learn a full NxM matrix to increase the expressivity of the network. You can, for each output gate, select a random subset of possible input gates of size K, and then you only need a learned matrix of size KxM. I did the numbers, and even a moderately small K, like 16 or 32, wildly increases the number of circuits you can learn with a smaller number of layers and gates.
- You could use a LoRA-style factorization. Instead of an NxM matrix, use a pair of matrices NxK and KxM, where K << N,M. (The parameter counts for both fixes are spelled out just below.)
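To spell out the parameter counts behind these two fixes (with N possible input gates, M output gates, and K the fan-in or rank):

    \text{full wiring: } N \times M
    \qquad
    \text{random subset of size } K:\; K \times M
    \qquad
    \text{low-rank (LoRA-style): } N \times K + K \times M, \quad K \ll N, M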
Learning the wiring also has other benefits. As the output gate can learn to swap its inputs if needed, you can remove some learnable gates that are "mirrors" or "permutations" of each other (a and not b, not a and b; a or not b, not a or b), which can help scale the networks to gates with more inputs (I tried with 3-input gates and 4-input gates).
Also, as the author pointed out, it was very difficult to get the models to converge. It was very frustrating that I never managed to get a working model that performed really well on MNIST. In the end, I gave up on that and worked on making the network consistently learn simple 3-input or 4-input functions with perfect accuracy, and I managed to get it to learn them within a couple dozen iterations, which was nice.
isaacimagine
Very cool, thank you for sharing!
Vox_Leone
Well done — really enjoyed this. We could use this kind of optimization in our library[0], which builds differentiable logic networks out of gates like AND, XOR, etc.
It focuses on training circuit-like structures via gradient descent using soft logic semantics. The idea of compiling trained models down to efficient bit-parallel C is exactly the kind of post-training optimization we’ve been exploring — converting soft gates back into hard boolean logic (e.g. by thresholding or symbolic substitution), then emitting optimized code for inference (C, WASM, HDL, etc).
The Game of Life kernel is a great example of where logic-based nets really shine.
kookamamie
> 1,744x speedup
Is that 1744x or 1.7x?
genewitch
The former; also, there are too many digits of precision for it to be the latter.
isaacimagine
Author here. Any questions, ask away.
mlajtos
That part about passthrough strongly reminded me of Turing’s Unorganized Machines (randomly wired NAND-gate networks): https://weightagnostic.github.io/papers/turing1948.pdf (worth a read from page 9)
viraptor
Is there an expanded explanation coming for "Of course it is biased! There’s no way to train the network otherwise!"?
I'm still struggling to understand why that is the case. As far as I understand the training, in a bad case (probably mostly at the start) you could happen to learn the wrong gate early and then have to revert it. Why isn't the same thing happening without the biasing to pass-thru? I get why pass-thru would make things faster, but not why the network would fail to converge without the bias.
GloamingNiblets
Thank you for the excellent writeup of some extremely interesting work! Do you have any opinions on whether binary networks and/or differentiable circuits will play a large role in the future of AI? I've long had this hunch that we'll look back on current dense vector representations as an inferior way of encoding information.
isaacimagine
Thank you, I'm glad you enjoyed it!
Well, I'm not an expert. I think that this research direction is very cool. I think that, at the limit, for some (but not all!) applications, we'll be training over the raw instructions available to the hardware, or perhaps even the hardware itself. Maybe something as in this short story[0]:
> A descendant of AutoML-Zero, “HQU” starts with raw GPU primitives like matrix multiplication, and it directly outputs binary blobs. These blobs are then executed in a wide family of simulated games, each randomized, and the HQU outer loop evolved to increase reward.
I also think that different applications will require different architectures and tools, much like how you don't write systems software in Lua, nor script game mods with Zsh. It's fun to speculate, but who knows.
NooneAtAll3
how do the ~300 gates you got compare to modern optimal implementations?
iirc it's around 30-40?
djmips
Was this result surprising?
isaacimagine
Yes and no. I wasn't expecting to be able to reproduce the work, so I'm just content that it works. I was very surprised by how much hyperparameter finagling I had to do to get the DLGN converging; the tiny relu network I trained at the beginning, in comparison, converged with dead-simple SGD in a third of the epochs.
The speedup was surprising in the sense that the bit-level parallelism fell out naturally: that 64× speedup alone was unexpected and pretty sweet. There's likely still a lot of speed left on the table. I just did the bare minimum to get the C code working: it's single-threaded, there's no vectorization, lots of register spilling, etc. Imagine the speedup you'd get running the circuit on e.g. an FPGA.
But no, it was not surprising in the sense that yeah, multiplying billions of floats is going to be much slower than a handful of parallel bitwise ops. Physics is physics, doesn't matter how good your optimizer is.
jgord
what percentage of ops were passthru ?
ps. superb writeup and project
Twirrim
You've made some mistakes with the Game of Life rules. You've missed out the overpopulation rule:
Any live cell with more than three live neighbours dies
Nit: > I guess there’s a harsh third rule which is, “if the cell is dead, it stays dead”.
That phrasing is inaccurate: if a dead cell stayed dead, the first rule wouldn't work. I'm not sure that particular sentence adds much to the flow, honestly.
nightpool
You're thinking about the cells as toggles on a stateful grid, TFA is thinking about them as pure functions that take in an input state and output a new state (with "off" being the default).
From that perspective, there's no point in "killing" a cell, it's simpler to only write out the 0 -> 1 and 1 -> 1 transition cases and leave all of the other cases as implicitly 0
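A hypothetical sketch (not TFA's code) of that pure-function view:

    /* Hypothetical sketch: the next state depends only on the current cell
       and its live-neighbour count, so only the cases that produce 1 are
       written out; every other case is implicitly 0. */
    static inline int next_state(int alive, int neighbours) {
        return neighbours == 3 || (alive && neighbours == 2);
    }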
thirtygeo
That approach is bananas! I had seen the source inspiration paper from Google, but it's neat to see it replicated and extended so shortly after.
isaacimagine
+10 respect, thank you <3
AmazingTurtle
I recently read about DLGNs on HN and instantly thought: damn, that's some hot take. But I was too stupid to implement it from the paper. Glad you got it working and documented it! Thanks!
hermitShell
This is very fascinating as a limit case, which always serves as a good example of the bound. I think it highlights that "efficiency isn't everything", just like in so many other systems like healthcare and justice. In this case we could figure out the activation functions by analysis, which is impossible for problems of higher dimensionality. The magic of AI isn't in its efficiency, it's in making things computable that simply aren't by other means.
enricozb
Differentiable Logic Gate Networks [0] are super interesting. However, I still don't like that the wiring is fixed initially rather than learned.
I did some extremely rough research into doing learnable wirings [1], but couldn't get past even learning ~4-bit addition.
[0]: https://arxiv.org/abs/2210.08277
[1]: https://ezb.io/thoughts/program_synthesis/boolean_circuits/2...