Three Hundred Years Later, a Tool from Isaac Newton Gets an Update
40 comments · March 24, 2025
magicalhippo
yorwba
I think the result is primarily relevant from a complexity-theory perspective, showing that this is possible at all in time polynomial in the number n of variables. A big hint is that the paper never explicitly states the degree of that polynomial, but it would need to be at least as large as the degree d of the Taylor expansion used.
The convergence speed depends in a complicated way on the Lipschitz constant of the d-th derivative, but if you ignore that, the number of iterations to reach a fixed target precision is proportional to 1/ln(d). So to halve the number of iterations, you would need to square d, which is bad news if the runtime is already at least exponential in d. It would only make sense if the function you're trying to minimize is extremely costly to compute but the cost of differentiating it d times is negligible in comparison. Even then, there'll be an optimal d beyond which further increases only slow it down.
In the univariate case, if you're doing 1 + d work per iteration for 1/ln(d) iterations, the optimal integer value of d is 4. So at least in that case, going a bit higher than Newton's method can be beneficial.
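A quick sanity check of that last claim, using the toy cost model above (total work proportional to (1 + d)/ln(d)); this is just my back-of-the-envelope, not from the paper:

```python
import math

# Toy cost model from the comment above: (1 + d) units of work per
# iteration, and a number of iterations proportional to 1 / ln(d),
# so total work is proportional to (1 + d) / ln(d).
def total_cost(d: int) -> float:
    return (1 + d) / math.log(d)

for d in range(2, 8):  # d = 2 is ordinary Newton's method
    print(d, round(total_cost(d), 3))
# The minimum over integers lands at d = 4 (~3.607),
# just below d = 3 (~3.641) and well below d = 2 (~4.328).
```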
casualrandomcom
Maybe I just had my Gell-Mann amnesia moment (https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect).
The illustration of the Newton method is wrong, isn't it? The approximating second-order polynomials that were drawn are not really graphs of second-order polynomials; they are not even functions. They are parabolas in the 2D plane, but Newton's method won't produce curves like that. Besides, the convexity of the target function looks negative near the first guess, so Newton's method would probably fail miserably there.
jonlong
For late arrivals to this thread: note that the illustration being discussed has been updated (see the Wayback Machine for the old version). The new version probably still does not show the best second-order approximations, but the obvious qualitative errors have been corrected.
magicalhippo
Perhaps the graphic designer didn't get the memo. You can see it's the same parabola in every panel; they've just rotated it in the first panels so it seems to fit the shape of the local curve "better".
casualrandomcom
Yes, that's what I guess has happened.
mrkeen
I'm not following your objection. To my eye those approximation graphs are indeed 2nd order functions in the 2d plane. But they are perhaps not parabolic for lack of symmetry?
casualrandomcom
They look symmetric to me, but that's not the point anyway. I guess they are parabolas, only rotated, so their axis is not parallel to the y axis.
If those are 2nd order functions in the 2d plane, then we don't agree on terminology.
The reason I feel like an expert here is that it is clear to me at first sight that if you actually implement Newton's method in that situation, the approximating functions you get are totally different from those drawn in the illustration.
The third parabola from the left in the lower-left figure is definitely not a second-order approximation to the target function: the convexity is reversed!
MITSardine
Well, now that you mention it, even the first one is on the wrong side of the graph. The second derivative is apparently (to my eyes) negative there, so the approximating parabola should be turning downwards. Whichever the sign, its value is close to zero anyways, the approximation would look very flat there.
Anyways, it's a little graph to illustrate the idea of a second order approximation, I think it does the job.
seanhunter
When you say they are not even functions, are you saying because they are not everywhere bijective? Because they look like rotated parabolas to me which means they will have continuous first and second derivatives which is (I think) all you need for Newton’s method isn’t it?
casualrandomcom
You are thinking of those parabolas as parametric curves, I guess, but Newton's method is about approximating an arbitrary function by its Taylor expansion truncated to the second degree, which is a polynomial function of second degree. The graphs of these functions can be thought of as parametric curves, but (my point is also that) these are not the ones drawn in the illustration.
MITSardine
I think the person meant that, to a single x, they map more than one value. A function must assign at most one image to each antecedent. In other words, its graph can't have "vertical stretches". This is different from bijectivity (specifically, injectivity here), which asks that two different x values never map to the same y.
emmanueloga_
I spent way too long figuring this one out, so this is what I got:
An improvement on [1], which I vaguely remember using with pen and paper to find minima of differentiable functions. The original algorithm runs "in a loop" (iteratively) and uses the first and second derivatives of a function (f', f''). From the article:
> Newton did it for degree 2. He did that because nobody knew how to minimize higher-order polynomials
The improved version looks a lot more complex but seems to sacrifice simplicity to converge faster to the minimum when implemented as a program.
Trivia: a "single loop" (one iteration) of Newton's method is famously used in Quake's Fast InvSqrt() implementation [2].
--
1: https://en.wikipedia.org/wiki/Newton's_method
2: https://en.wikipedia.org/wiki/Fast_inverse_square_root#Newto...
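To spell out the classical "loop" for anyone following along, here is a minimal univariate sketch of the iteration x ← x − f'(x)/f''(x) from [1]. The code and the example quartic are just my own illustration, not anything from the article:

```python
def newton_minimize(f_prime, f_double_prime, x0, tol=1e-10, max_iter=50):
    """Classical Newton's method for optimization: apply Newton's
    root-finding iteration to f' to locate a stationary point of f."""
    x = x0
    for _ in range(max_iter):
        step = f_prime(x) / f_double_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: minimize f(x) = x^4 - 3x^2 + x, an arbitrary test function.
# f'(x) = 4x^3 - 6x + 1 and f''(x) = 12x^2 - 6.
x_min = newton_minimize(lambda x: 4 * x**3 - 6 * x + 1,
                        lambda x: 12 * x**2 - 6,
                        x0=1.5)
print(x_min)  # converges to the local minimum near x ≈ 1.13
```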
Sniffnoy
There's actually a separate Wikipedia article about the variant/application of Newton's method being improved upon here: https://en.wikipedia.org/wiki/Newton%27s_method_in_optimizat... Note the italic text at the top of the "Newton's method" article!
thrdbndndn
This article is really badly written IMHO. And I feel the content should just be merged back to the main article.
emmanueloga_
Awesome thank you for the clarification!
nyeah
I wish we could call the Quake algorithm "fast reciprocal square root". But it's far too late now.
quanto
> “Our algorithm right now is provably faster, in theory,” Ahmadi said. He’s hopeful, he added, that in 10 to 20 years, it will also be so in practice.
I don't see what a decade or two will add to this technique.
My understanding is that the proposed method is faster in the sense of sampling efficiency (of the cost function to construct the Taylor series), but not in the sense of FLOPS. The higher derivatives do not come for free.
It was nevertheless interesting to see the mathematical maneuvering needed to massage the Taylor series into a more digestible form and prove the error bounds.
MITSardine
This is highly problem dependent. Sometimes, you get second derivatives almost for free. It all depends on your cost function.
It also depends on your stopping tolerance. Computing a higher order derivative incurs a constant cost, whereas a higher order method converges faster, but not just by a constant. Hence, as you ask for more and more precise results, the ratio of cost of your high order to your low order method will go to zero.
The cost of Newton's method is not so much computing the second derivative anyways, but solving the resulting linear system... So, for small dimensional problems, Newton's method is still the go-to optimization method, it is stupidly efficient. (with a good line search, this partly answers your "I don't see what a decade or two will add")
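To make the "solving the resulting linear system" point concrete, here is a rough damped-Newton sketch with a crude backtracking line search. The function names are mine and the Rosenbrock test problem is just a standard illustration, nothing specific to the paper:

```python
import numpy as np

def damped_newton(f, grad, hess, x0, tol=1e-8, max_iter=100):
    """Damped Newton: solve H p = -g for the step, then backtrack until f
    decreases. The O(n^3) linear solve is the dominant per-iteration cost
    once the Hessian is available."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess(x), -g)          # Newton direction
        t = 1.0
        while f(x + t * p) > f(x) and t > 1e-10:  # crude backtracking
            t *= 0.5
        x = x + t * p
    return x

# Example: the Rosenbrock function, a classic small-dimensional test problem.
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
hess = lambda x: np.array([[2 - 400 * (x[1] - 3 * x[0]**2), -400 * x[0]],
                           [-400 * x[0], 200.0]])
print(damped_newton(f, grad, hess, [-1.2, 1.0]))  # -> approximately [1. 1.]
```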
larodi
Weirdly, the original algo is showcased visually and is very easy to understand, even by 9th graders. While the new one... apparently can't be, or Quanta would have shown it to us just like that?
What I'm trying to say is that if the new algo is so much better but can't be explained visually, then perhaps its simplicity was sacrificed, and simplicity is what kept the original algo going for 300 years...
magicalhippo
The paper has some examples, and a very clear one for the univariate case[1].
Yes, weird of Quanta to not include such an example.
asplake
The illustrated example isn’t multivariate and can be drawn in 2 dimensions. The harder examples are also harder to visualise.
vlovich123
> My understanding is that the proposed method is faster in the sense of sampling efficiency (of the cost function to construct the Taylor series), but not in the sense of FLOPS. The higher derivatives do not come for free.
Sure, but as long as that extra cost stays cheaper than the iterations it saves, this would still be a net win. For example, the article talks about how AI training uses gradient descent, and I'm pretty sure the gradient-descent part is a tiny fraction of the time spent training versus evaluating the math kernels in all the layers; therefore taking fewer steps should be a substantial win.
djoldman
> Like the original version of Newton’s method, each iteration of this new algorithm is still computationally more expensive than methods such as gradient descent. As a result, for the moment, the new work won’t change the way self-driving cars, machine learning algorithms or air traffic control systems work. The best bet in these cases is still gradient descent.
Unfortunately not.
alexchamberlain
Each iteration is (much) more expensive, but you should be able to get to the final answer in fewer iterations.
ErigmolCt
What I find fascinating is that, after 300 years, Newton's core idea still underpins so much of modern optimization
teleforce
Please check out this excellent video describing and explaining 200 years of logic, optimization, and constraint programming since the pioneering work of George Boole [1].
Towards the end of the video you can also see Donald Knuth asking questions on the presentation.
[1] Logic, Optimization, and Constraint Programming: A Fruitful Collaboration (2023) [video]:
JDEW
> “Our algorithm right now is provably faster, in theory,” Ahmadi said. He’s hopeful, he added, that in 10 to 20 years, it will also be so in practice.
imtringued
Using semidefinite programming to find a better Taylor expansion in your Newton step sounds like crazy talk to me. No way this could ever be practical.
Remember that semidefinite programming is often implemented via the interior point method, which is an application of Newton's method with inequality constraints.
DrNosferatu
Is this reporting on the Newton approximation method or on an update to the Newton method?
magicalhippo
The paper is about Newton's method in optimization.
user3939382
The TI-83+?
bbor
Very, very cool -- thanks for posting! Quanta is yet again proving itself to be the very best casual science publication in the English-speaking world. This is a short one, but as always, clean, original graphics + big pictures of the participants is a winning combo IMO.
> Other iterative methods, like gradient descent, converge toward the true minimum at a linear rate. Newton’s method converges toward it... at an [exponential] rate...
> The rate of convergence would scale with the number of derivatives used: Just as using two derivatives allowed Newton to approach the true minimum at a quadratic rate, using three derivatives enabled the researchers to approach it at a cubic rate, and so on.
Was not aware of this reality myself from my ~undergrad ML math education. Obviously we're getting very, very good at optimizing these cheap linear algorithms, but as someone who thinks DL represents the intuitive faculties of human cognition--and thus is missing/badly-recreating our unique deliberative faculties--this gives me hope for future qualitative breakthroughs. I'm sure it's not this simple, but working with ~1500-dimension spaces (or, sometimes, in the millions!) makes this description sound pretty damn promising.
I'm still confident that medium-term progress on deliberative cognition lies in the very uncool Neat/symbolic direction, but this is the first time I'm really thinking hard about what it would take to crack it the hard, more durable way (i.e. biological verisimilitude).
In terms of the paper itself (https://arxiv.org/html/2311.06374v2), this is by far the most interesting part for this novice, in 1.2:
> We prove that our algorithm is well-defined in the sense that the semidefinite programs it executes are always feasible and that the next iterate is always uniquely defined...[, and that the program] has local convergence of order d. Compared to the classical Newton method, this leads to fewer calls to the Taylor expansion oracle (a common oracle in this literature...) at the price of requiring higher-order derivatives.
I've long said math is just a collection of intellectual tools, but it's very satisfying to read what feels like an unusually flexible application of this premise, relying on "programs" that invoke "oracles". More fundamentally, the specifics of SDP are immediately promising for any among us inclined to extend connectionism to some of the workings of the cosmos itself, IMO: https://en.wikipedia.org/wiki/Semidefinite_programming
Sniffnoy
Why did you replace "quadratic" with "exponential"? Quadratic isn't exponential.
Straw
The number of correct digits in the answer grows exponentially with the iteration count: "quadratic convergence" means we roughly square the error each time, so after many iterations we have raised the initial error to an exponentially high power.
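A concrete toy example (mine, not from the article): Newton's iteration for √2 squares the error at every step, so the number of correct digits roughly doubles:

```python
from math import sqrt

# Newton's method for the root of f(x) = x^2 - 2, i.e. sqrt(2):
# x_{k+1} = (x_k + 2 / x_k) / 2
x = 1.5
for _ in range(4):
    x = (x + 2 / x) / 2
    print(f"{x:.15f}  error = {abs(x - sqrt(2)):.1e}")
# Errors go roughly 2e-3 -> 2e-6 -> 2e-12 -> machine precision:
# the error is squared each step, so the correct digits double.
```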
sfpotter
People have tried applying Newton's method and other quasi-Newton methods to ML problems, and my understanding is that it isn't worth it for large problems.
In ML, you care about getting a few digits for a very large, very nonlinear, high-dimensional optimization problem. The cost function is gnarly, it's expensive to evaluate, computing derivatives is even more expensive, and there are many different optima. You don't even really care too much which optimum you get.
The theory of Newton's method (Kantorovich's theorem) says that you obtain quadratic convergence once you're in a small ball surrounding your optimum. How big this ball is depends on the smoothness of the cost function. E.g., if you try to minimize a positive definite quadratic function with Newton's method, you will converge to the minimum in one step (the basin is all of R^n), with the first order necessary equations being equivalent to solving a symmetric positive definite linear system. But if the function is more wild and crazy, the ball may be quite small; if you try to run an unadulterated Newton's method outside this ball, the iteration can easily diverge.
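The one-step claim for a positive definite quadratic is easy to verify numerically; a throwaway sketch under those assumptions (unrelated to the paper itself):

```python
import numpy as np

# Minimize f(x) = 0.5 * x^T A x - b^T x with A symmetric positive definite.
# The gradient is A x - b and the Hessian is A, so a single Newton step
# from any starting point lands exactly on the minimizer x* = A^{-1} b.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)                 # symmetric positive definite
b = rng.standard_normal(5)

x0 = rng.standard_normal(5)                 # arbitrary starting point
x1 = x0 - np.linalg.solve(A, A @ x0 - b)    # one full Newton step
print(np.allclose(x1, np.linalg.solve(A, b)))  # True
```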
With this in mind, Newton's method is usually used when you want 1) lots of digits (or ALL the digits), 2) you have a good reason to believe you're within the basin of convergence, 3) you know your function has a single global optimum and you're willing to use a damped Newton's method and wait out some linear convergence until you get to the basin of contraction. At least if you're training a neural network, none of this is true. ($)
That said, even if you're minimizing a nice, smooth, univariate function with exactly one optimum, the secant method will get the same result with fewer FLOPs. So, although the order of convergence of secant is lower (in fact, the order is ~1.618), the fact that each iteration is more economical than Newton's means that secant will deliver a target error with fewer FLOPs even though it will take more iterations. In my experience, though, the distinction between "Newton" and "secant" (i.e. quasi-Newton) in low dimensions is less clear cut. Sometimes your Hessian is easy to evaluate and it's simpler to program and run Newton. The difference between these approaches is also pretty minimal in low dimensions... secant might be faster but not that much faster.
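For reference, the secant variant described here just replaces f'' with a finite difference of successive f' values, so each iteration needs only one new derivative evaluation. A rough sketch of mine, with an arbitrary example function:

```python
def secant_minimize(f_prime, x0, x1, tol=1e-12, max_iter=100):
    """Secant method applied to f': the second derivative is replaced by
    a finite difference of successive gradients, so no f'' is needed."""
    g0, g1 = f_prime(x0), f_prime(x1)
    for _ in range(max_iter):
        x2 = x1 - g1 * (x1 - x0) / (g1 - g0)  # secant step on f'
        if abs(x2 - x1) < tol:
            return x2
        x0, g0 = x1, g1
        x1, g1 = x2, f_prime(x2)
    return x1

# Example: stationary point of the arbitrary quartic f(x) = x^4 - 3x^2 + x,
# found using only first-derivative evaluations.
print(secant_minimize(lambda x: 4 * x**3 - 6 * x + 1, 1.0, 1.5))  # ≈ 1.131
```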
This last point also relates to the original paper. It seems likely that this algorithm will have a worse "iteration vs FLOPs" tradeoff even than Newton.
Now, I think the one place where Newton really shines is in interval arithmetic. There are interval Newton methods which can be used to not only compute but prove the existence and uniqueness of a zero for f(x) in R^n. For this reason, they get used for verified numerics and things like rigorously finding all zeros of a function in a compact domain (e.g. by combining with subdivision).
($): Although, actually, I've seen people train neural networks like this. I believe sometimes, if you have a very SMALL neural network (like, tens to thousands of parameters), you once again enter the regime of "modern" optimization and can train a neural network potentially very accurately using Gauss-Newton.
webprofusion
Is that like a Palm Pilot? Seems like 300 years ago.
ineedasername
300 years? Listen if you can’t ship product then it doesn’t matter how good you are or what your rep is. Pack it in and hit code academy or coursera or whatever and skill up.
In the paper[1], they note that their method seems much more well-behaved for multi-variate problems. In the case they numerically examined, the regular Newton's method had a fractal pattern of which starting points converged, while their method had a nice contiguous area with fairly sharp boundary.
That is, if using Newton's method one could start close to the solution but still not converge, while their method seems much more well-behaved.
I guess that might be a more attractive property than "just" being faster. However, I'm not an expert, so I'm not sure whether other, cheaper alternatives also have this property.
That said, the whole point of their method is to use higher-order derivatives. Admittedly I don't do optimization problems that often, but when I do, my function is seldom analytical, so getting good higher-order derivatives is usually expensive at best and problematic at worst.
They do touch on this in the section about future work, wondering if a quasi-Newton method[2] could be created in a similar vein, using only sparse sampling of the higher-order derivatives.
[1]: https://arxiv.org/abs/2311.06374
[2]: https://en.wikipedia.org/wiki/Quasi-Newton_method