Weight-sparse transformers have interpretable circuits [pdf]
8 comments
November 14, 2025 · edvardas
HTML version: https://arxiv.org/html/2511.13653v1
lambdaone
I find this fascinating, as it raises the possibility of a single framework that can unify neural and symbolic computation by "defuzzing" activations into what are effectively symbols. Has anyone looked at the possibility of going the other way, by fuzzifying logical computation?
calebh
Yes, you can relax logic gates into continuous versions, which makes the system differentiable. An AND gate can be constructed with the function x*y and NOT with 1-x (on inputs in the range [0,1]). From there you can construct a NAND gate, which is universal and can be used to build all other gates. A sigmoid can be used to squash the inputs into [0,1] if necessary; a rough sketch is below.
This paper lists out all 16 possible logic gates in Table 1 if you're interested in this sort of thing: https://arxiv.org/abs/2210.08277
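A minimal sketch of those relaxations in plain Python/NumPy (function names are mine, not from the paper above):

    import numpy as np

    def soft_not(x):
        # NOT relaxes to 1 - x on inputs in [0, 1]
        return 1.0 - x

    def soft_and(x, y):
        # AND relaxes to the product x * y
        return x * y

    def soft_nand(x, y):
        # NAND = NOT(AND); universal, so every other gate can be built from it
        return soft_not(soft_and(x, y))

    def soft_or(x, y):
        # OR via De Morgan: NOT(AND(NOT x, NOT y))
        return soft_not(soft_and(soft_not(x), soft_not(y)))

    def squash(z):
        # sigmoid squashes arbitrary real-valued inputs into [0, 1]
        return 1.0 / (1.0 + np.exp(-z))

    x, y = 0.9, 0.2
    print(soft_and(x, y), soft_nand(x, y), soft_or(x, y))  # 0.18, 0.82, 0.92

Everything here is smooth in x and y, so gradients flow through the gates, which is the property the differentiable-logic work linked above relies on.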
smokel
Do you mean fuzzy logic [1]? It was all the rage in the 1990s.
radarsat1
> fuzzifying logical computation?
Isn't that basically what the sigmoid operator does? Or, more in the direction of averaging many logical computations, there are random forests.
oli5679
This ties directly into the superposition hypothesis.
The idea is that dense models cram many features into shared weights and neurons, making circuits hard to interpret.
Sparsity reduces that pressure by giving features more isolated space, so individual neurons are more likely to represent a single, interpretable concept.
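As a rough illustration of the kind of weight-sparsity constraint involved (a hand-rolled top-k magnitude mask, not the paper's actual training procedure):

    import numpy as np

    def sparsify_topk(W, k):
        # Keep only the k largest-magnitude incoming weights per neuron, zero the rest.
        W_sparse = np.zeros_like(W)
        for row in range(W.shape[0]):
            keep = np.argsort(np.abs(W[row]))[-k:]   # indices of the k biggest |w|
            W_sparse[row, keep] = W[row, keep]
        return W_sparse

    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 64))          # dense layer: each neuron sees all 64 inputs
    W_sparse = sparsify_topk(W, k=4)      # now each neuron keeps only 4 connections
    print((W_sparse != 0).sum(axis=1))    # -> [4 4 4 4 4 4 4 4]

With only a handful of connections per neuron, there is much less room to pack several unrelated features into the same unit, which is the intuition behind the interpretability gain.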
peter_d_sherman
>"To assess the interpretability of our models, we isolate the small sparse circuits that our models use to perform each task using a novel pruning method. Since interpretable models should be easy to untangle, individual behaviors should be implemented by compact standalone circuits.
Sparse circuits are defined as a set of nodes connected by edges."
...which could also be considered/viewed as Graphs...
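(To make that graph view concrete, here's a toy sketch, nothing from the paper itself: nodes are neurons/residual channels, edges are the surviving nonzero weights.)

    import numpy as np

    # Hypothetical pruned weight matrix for one layer: rows = target nodes, cols = source nodes.
    W_pruned = np.array([
        [0.0, 1.3, 0.0],
        [0.0, 0.0, -0.7],
        [2.1, 0.0, 0.0],
    ])

    sources = [f"in_{j}" for j in range(W_pruned.shape[1])]
    targets = [f"out_{i}" for i in range(W_pruned.shape[0])]

    # The circuit as an edge list: (source node, target node, weight) for every nonzero entry.
    edges = [(sources[j], targets[i], W_pruned[i, j])
             for i, j in zip(*np.nonzero(W_pruned))]
    print(edges)
    # [('in_1', 'out_0', 1.3), ('in_2', 'out_1', -0.7), ('in_0', 'out_2', 2.1)]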
(Then from earlier in the paper):
>"We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them.
And (jumping around a bit more in the paper):
>"A major difficulty for interpreting transformers is that the activations and weights are not directly comprehensible; for example, neurons activate in unpredictable patterns that don’t correspond to human-understandable concepts. One hypothesized cause is superposition (Elhage et al., 2022b), the idea that dense models are an approximation to the computations of a much larger untangled sparse network."
A very interesting paper -- and a very interesting postulated relationship with superposition! (which could also be related to data compression... and if so, in turn, potentially to entropy as well...)
Anyway, great paper!
We really need new hardware optimized for sparse compute. Deep learning models would work much better with much higher-dimensional sparse vectors, but current hardware only excels at dense GEMMs and structured sparsity.
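For a sense of why unstructured sparsity is awkward on dense-matmul hardware, here's a toy CSR sparse matrix-vector product in plain Python (illustrative only, not a real kernel): the irregular, data-dependent gather is exactly what dense GEMM pipelines dislike.

    import numpy as np

    def csr_matvec(values, col_idx, row_ptr, x):
        # y = A @ x with A stored in CSR form (only nonzeros kept).
        y = np.zeros(len(row_ptr) - 1)
        for i in range(len(y)):
            # Nonzeros of row i live in values[row_ptr[i]:row_ptr[i+1]];
            # the indexed read of x via col_idx is irregular and hard to vectorize.
            for k in range(row_ptr[i], row_ptr[i + 1]):
                y[i] += values[k] * x[col_idx[k]]
        return y

    # A = [[2, 0, 0],
    #      [0, 0, 3],
    #      [0, 4, 0]]
    values  = np.array([2.0, 3.0, 4.0])
    col_idx = np.array([0, 2, 1])
    row_ptr = np.array([0, 1, 2, 3])
    print(csr_matvec(values, col_idx, row_ptr, np.array([1.0, 1.0, 1.0])))  # [2. 3. 4.]

Dense GEMM streams contiguous tiles through tensor cores; the gather above breaks that regularity, which is roughly why current hardware support tends to be limited to structured patterns like 2:4 sparsity.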