Differentiable Programming from Scratch
18 comments
· April 17, 2025
FilosofumRex
Historical fact: Differentiable programming was a little known secret back in the 90's, used mainly by engineers simulating numerically stiff systems like nukes and chemicals in FORTRAN 95. It then disappeared for nearly 30 yrs before rediscovery by the ML/AI researchers!
taeric
Computer Algebra Systems (CAS) were not really a secret. And they often have many, many tricks that we are constantly relearning. Some of this relearning, of course, is by design: a lot of what they do are things we teach, such as how to calculate different functions. Repeated squaring and the like are fun topics.
A lot of the current wave of relearning is happening because we now have the compute power to do these things in more places. It is also something that has long been done in expensive environments that many of us just don't have access to.
kxyvr
Automatic differentiation has been actively and continuously used in some communities for the last 40 years. Louis Rall published an entire book about it in 1981. One of the more popular books on AD, written by Griewank, was published in 2000. I learned about it at university in the early 2000s. I do agree that the technology was not as well used as it should have been until more recently, but it was well known within the numerical math world and used continuously over the years.
constantcrying
It wasn't forgotten; I learned it at university outside of any AI context. It just had most of its applications exhausted and ceased being a particularly active research topic.
thechao
My PhD dissertation included a chapter (originally from a 2006 paper) on generic programming in CASes on the algorithmic differentiable ring (and operator). By the 1990s, algorithmic differentiation was easily 30–40 years old. Griewank & Monagan both knew guys who had built early electromagnetic naval targeting computers that used the methodology "by hand" (very literally) back in at least the early 60s.
I watched the ML/AI bros actively ignore previous research — even when they were requested to properly cite sources they were plagiarizing — in real time. The race to publish (even for big journals) was so important that it was easier to ignore the rank dishonesty than it was to correct their misbehavior. I'm 1000x happier to not have stayed around for all that crap.
srean
> on the algorithmic differentiable ring (and operator)
That sounds like an interesting read. Do you have the chapter or a reference to the paper that you can share?
Regarding the crop of deep neural network researchers, their self-serving and willful blindness has a reputation that's well deserved.
A grad student from Hinton's lab mentioned one researcher who would misspell a citation on purpose so that the citation count of the cited paper would not go up.
thechao
My PhD and its chapters are junk. Just read Griewank; I don't have anything more to say than his monograph. (We could never generalize the AD operator to the hybrid mode, so it's useless for real workloads.)
constantcrying
>Unfortunately, finite differences have a big problem: they only compute the derivative of f in one direction. If our input is very high dimensional, computing the full gradient of f becomes computationally infeasible, as we would have to evaluate f for each dimension separately.
And it is terribly unstable numerically: f(x) and f(x+h) are very similar because h is very small, so you have to expect catastrophic cancellation. For black boxes it is the only real option, though; you can do a bit better by taking a central difference, i.e. evaluating in both directions.
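A quick sketch of the trade-off (my own example, not from the article): forward vs. central differences on f(x) = sin(x), whose exact derivative is cos(x). Shrinking h first reduces the truncation error, then cancellation takes over.

```typescript
// Forward difference: O(h) truncation error, plus cancellation error ~ eps/h.
function forwardDiff(f: (x: number) => number, x: number, h: number): number {
  return (f(x + h) - f(x)) / h;
}

// Central difference: O(h^2) truncation error, but cancellation still bites
// once h gets small enough.
function centralDiff(f: (x: number) => number, x: number, h: number): number {
  return (f(x + h) - f(x - h)) / (2 * h);
}

const f = Math.sin;
const x0 = 1.0;
for (const h of [1e-2, 1e-6, 1e-10, 1e-14]) {
  console.log(
    h.toExponential(),
    Math.abs(forwardDiff(f, x0, h) - Math.cos(x0)),
    Math.abs(centralDiff(f, x0, h) - Math.cos(x0)),
  );
}
// The error first shrinks as h shrinks, then blows up again: at h = 1e-14
// both estimates have lost most of their digits to cancellation.
```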
weinzierl
"Now that we understand differentiation, let’s move on to programming. So far, we’ve only considered mathematical functions, but we can easily translate our perspective to programs. For simplicity, we’ll only consider pure functions, i.e. functions whose output depends solely on its parameters (no state)."
I think I've seen this notion that the constraint is purity in the documentation of autodiff libraries as well, but that can't be a strong enough condition, right?
It is easy enough to come up with functions that are nowhere differentiable. So my question is: what are the actual requirements a state-of-the-art autodiff library has for the input function, and why do people focus on the purity aspect if that is probably the least of the problems?
yorwba
If the function is pure, autodiff can produce a result. If the function is not differentiable for a given input, the result produced by autodiff of course cannot be the derivative (since none exists) but it could be another number (e.g. a subderivative) that is useful in some of the ways a derivative would have been useful.
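A tiny illustration (my own hypothetical rule, not from the article): relu(x) = max(x, 0) has no derivative at x = 0, but a forward-mode rule can simply pick a slope there, and 0 is a valid subderivative.

```typescript
// A dual number carries a value and a derivative with respect to the input.
type Dual = { value: number; deriv: number };

// relu is not differentiable at 0; we pick slope 0 there, which is one valid
// subderivative (anything in [0, 1] would also be defensible).
function relu(a: Dual): Dual {
  const slope = a.value > 0 ? 1 : 0;
  return { value: Math.max(a.value, 0), deriv: slope * a.deriv };
}

console.log(relu({ value: 0, deriv: 1 })); // { value: 0, deriv: 0 }
```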
grandempire
State is just values that are not considered to be variable function inputs; in other words, each time you change state you have a new function.
For example, take f(x, y) = xy and then define a differentiable function g(x) = f(x, a). You can imagine "a" being a state variable.
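Concretely, a small sketch of that idea (my own example):

```typescript
// "State" a is captured rather than differentiated: for each fixed a,
// g_a(x) = f(x, a) is an ordinary function of x whose derivative is a.
const f = (x: number, y: number) => x * y;

function makeG(a: number) {
  const g = (x: number) => f(x, a);
  const dg = (_x: number) => a; // hand-written d/dx of x*a for this fixed a
  return { g, dg };
}

const { g, dg } = makeG(3); // change the state and you get a new function
console.log(g(2), dg(2));   // 6, 3
```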
hansvm
Purity, interestingly, is not a requirement. You can represent any closure as something implicitly captured, and you can represent mutable state as an immutable input with a different, related, immutable output. For any autodiff library that doesn't also throw in dense+sparse decompositions (none of the major ones do any such optimization; the programmer has to manually choose how to represent things), that doesn't even waste any more space for reverse-mode autodiff than your other options: you need to maintain the before and after state anyway, give or take your particular checkpointing scheme, and if you don't have a way to represent the delta cheaply then the differentiation is probably quite expensive contrasted with the underlying mutable forward pass, which was probably faster and cheaper than the alternatives to begin with. Purity is often easier to code around, especially in "complicated" languages with many features you'd like to support, but it's not mandatory.
In terms of actual requirements, something that's sufficient [0] is for every sub-component to be differentiable and for no dynamic control flow to depend on the things being differentiated. In practice, most libraries wind up requiring something like this, mostly because it's very hard to do anything else. As an example, define f(x) := 0 for floats with an even LSB and 1 for floats with an odd LSB. Define g(x) := 1 - f(x). Neither of these are meaningfully differentiable, but g(x) + f(x) is identically equal to one. Autodiff relies crucially on the fact that it can perform local transformations, and that sort of whole-program analysis is (a) impossible in general, and (b) hard even when it's possible.
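To make that f/g example concrete (my sketch, not the commenter's code): both functions read the least-significant bit of the float's bit pattern, so neither is meaningfully differentiable, yet their sum is identically 1 and therefore has derivative 0. A local, per-operation autodiff never sees the whole program, so it cannot recover that.

```typescript
// f and g below are nowhere meaningfully differentiable, but f + g === 1.
function lsb(x: number): number {
  const buf = new DataView(new ArrayBuffer(8));
  buf.setFloat64(0, x);        // big-endian IEEE-754 encoding
  return buf.getUint8(7) & 1;  // least-significant bit of the mantissa
}

const f = (x: number) => lsb(x);      // 0 or 1, wildly discontinuous
const g = (x: number) => 1 - lsb(x);  // also discontinuous
const h = (x: number) => f(x) + g(x); // identically 1, so h'(x) = 0

console.log(h(0.1), h(Math.PI), h(12345.678)); // always 1
```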
For local-only autodiff (the only thing people ever do), what's necessary is for every sub-component to have a derivative-like operator defined such that, if the sub-components are composed into a differentiable function, then the normal chain rule and other autodiff compositions of those operators are also differentiable and represent the derivative in question (along with some requirements on dynamic control flow -- they don't have to be quite as strict as I described, but it's impossible in general to relax them with local-only autodiff, so the dynamic requirement from the above paragraph is also necessary).
There are few (zero?) components where that's possible unconditionally -- an adversary can always come up with a composition for which the computed derivative is incorrect. However, for some interesting functions (like eigenvalues and eigenvectors), in the normal way people use them, these sorts of operators can be defined. E.g., the derivative of an eigendecomposition is not unique (eigenvectors are only defined up to a choice of phase), but if your composition also doesn't depend on the phase then you're still fine.
[0] Even for things like differentiating through a loop that converges to a value, this property holds, with one meaningful exception: the error in the derivative, compared with the derivative of the true function you're approximating, will still converge to zero with enough iterations, but that number of iterations can be much higher than what you need for the function itself to converge. You _will_, however, get the derivative of the approximation exactly.
hwpythonner
I’m not deep into autodiff (just recall some calculus from university), but the syntax in this post reminds me a lot of ML (the programming language, not machine learning)
I know autodiff isn’t lambda calculus, but the expression-based structure and evaluation rules feel similar. Couldn’t this be implemented in something like ML or Clojure? Just wondering what the custom DSL adds that existing functional languages wouldn’t already support
hansvm
I didn't see a DSL anywhere, just normal JS code.
As to what it adds?
- It's more accessible to a wider audience (and looks like how you'd implement autodiff in most languages)
- It runs in the browser trivially (powering those demos)
- The author (potentially) didn't have to learn a new language just to get started
- Programs are not fully differentiable, or at the very least there are some crazy edge cases and dragons lurking if you attempt to make them so. A dedicated whitelist of supported operations isn't necessarily a bad design, contrasted with an implicit whitelist in Clojure (depending on the implementation of course, but there wasn't a lot of source-to-source boilerplate even in this example, so I assume the benefit of a functional language would be stripping away some of the characteristics I think are important).
fire_lake
There is an F# implementation (Microsoft flavoured ML) called DiffSharp
constantcrying
Automatic differentiation can be implemented in essentially any language. Some just make it look "nicer".
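For instance, here is a minimal forward-mode sketch (my own toy example, not any particular library): a Dual type carries a value and a derivative through each primitive, and without operator overloading it's just method calls.

```typescript
class Dual {
  constructor(public value: number, public deriv: number) {}
  static variable(x: number) { return new Dual(x, 1); } // d/dx x = 1
  static constant(c: number) { return new Dual(c, 0); } // d/dx c = 0
  add(b: Dual) { return new Dual(this.value + b.value, this.deriv + b.deriv); }
  mul(b: Dual) { // product rule
    return new Dual(this.value * b.value,
                    this.deriv * b.value + this.value * b.deriv);
  }
  sin() { return new Dual(Math.sin(this.value), Math.cos(this.value) * this.deriv); }
}

// f(x) = x * sin(x) + 2, so f'(x) = sin(x) + x * cos(x)
const x = Dual.variable(1.5);
const y = x.mul(x.sin()).add(Dual.constant(2));
console.log(y.value, y.deriv); // y.deriv matches sin(1.5) + 1.5 * cos(1.5)
```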
kgwgk
For example, it looks nice in Common Lisp: https://people.eecs.berkeley.edu/~fateman/papers/overload-AD...