Are polynomial features the root of all evil? (2024)

ForceBru

In another post (https://alexshtf.github.io/2025/03/27/Free-Poly.html) the author fits a degree-10000 (ten thousand!) polynomial using the Legendre basis. The polynomial _doesn't overfit_, demonstrating double descent. "What happened to our overfitting from ML 101 textbooks? There is no regularization. No control of the degree. But “magically” our high degree polynomial is not that bad!"

So... are _all_ introductions to machine learning just extremely wrong here? I feel like I've seen tens of reputable books and courses that introduce overfitting and generalization using severe overfitting and terrible generalization of high-degree polynomials in the usual basis (1,x,x^2,...). Seemingly everyone warns of the dangers of high-degree polynomials, yet here the author just says "use another basis" and proves everyone wrong? Mind blown, or is there a catch?
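
For concreteness, here is roughly the kind of fit the linked post describes, as a minimal numpy sketch of my own (not the author's code, and with made-up data and degree): scale the inputs into [-1, 1], build Legendre features, and solve plain least squares. With far more terms than data points, np.linalg.lstsq returns the minimum-norm solution, which is typically what keeps such fits tame.

    # Minimal sketch (not the author's code): a very high-degree
    # polynomial in the Legendre basis, fit with plain least squares.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 50)                   # inputs already scaled to [-1, 1]
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

    degree = 1000                                # far more terms than data points
    X = np.polynomial.legendre.legvander(x, degree)   # Legendre design matrix
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)      # minimum-norm least squares

    x_grid = np.linspace(-1, 1, 200)
    y_hat = np.polynomial.legendre.legval(x_grid, coef)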

StableAlkyne

> So... are _all_ introductions to machine learning just extremely wrong here?

It's more of a heuristic. Most people have their first experience in Excel, where you can fit a polynomial. Cranking up the degree will always improve R², so it's a very common mistake for new students to make.

It's much more understandable at the beginner level to say "you'll overfit if you crank up the degree" than it is to explain regularization and bases. Later on you can introduce them, but early on it's confusing and distracting to students who might not even know how to solve for an ordinary least squares model.
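
(A throwaway sketch of that heuristic, with made-up data: in the plain 1, x, x^2, ... basis, training R^2 can only go up with degree, while held-out error usually gets worse past some point.)

    # Naive monomial basis: train R^2 rises with degree,
    # held-out error typically worsens at high degree.
    import numpy as np

    rng = np.random.default_rng(1)
    x_train = np.sort(rng.uniform(-1, 1, 20))
    y_train = x_train**2 + 0.2 * rng.standard_normal(20)
    x_test = np.linspace(-1, 1, 200)
    y_test = x_test**2

    for degree in (1, 3, 9, 15):
        coef = np.polyfit(x_train, y_train, degree)
        resid = y_train - np.polyval(coef, x_train)
        train_r2 = 1 - resid.var() / y_train.var()
        test_mse = np.mean((y_test - np.polyval(coef, x_test))**2)
        print(f"degree {degree:2d}  train R^2 {train_r2:.3f}  test MSE {test_mse:.3f}")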

jordigh

Well, one catch is that you're doing degree ten thousand. That alone already increases your computational costs and can run afoul of numerical error as your numbers start losing precision. You'll have to start worrying about overflow and underflow. Horner's method or another representation of your polynomial might start to become important. You'll also start to notice your CPU spiking more. In essence, this is kind of fitting more and more points until you get rid of every oscillation you don't want.
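
(For reference, Horner's method is just the nested-multiplication rewrite; a tiny sketch:)

    # Horner's method: evaluate a_n*x^n + ... + a_1*x + a_0 without
    # ever forming x^n explicitly; n multiplications and n additions.
    def horner(coeffs, x):
        """coeffs ordered from the highest-degree term down to the constant."""
        acc = 0.0
        for c in coeffs:
            acc = acc * x + c
        return acc

    # p(x) = 2x^3 - 6x^2 + 2x - 1, evaluated at x = 3
    print(horner([2, -6, 2, -1], 3.0))   # -> 5.0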

The other catch, which the author mentions, is that extrapolation is still essentially impossible with polynomials. This is easy to see: once you step outside the interval of interest, the highest-degree term quickly dominates all the others by orders of magnitude, and every non-constant polynomial grows without bound.
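
A quick illustration of that point (my own toy numbers, not from the article): fit on [-1, 1], then evaluate the same polynomial just outside the data range and watch the leading terms take over.

    # Fit a noisy sample on [-1, 1], then evaluate slightly outside it.
    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(-1, 1, 40)
    y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(40)

    coef = np.polynomial.polynomial.polyfit(x, y, 20)
    for x0 in (0.9, 1.0, 1.1, 1.3, 2.0):
        print(f"p({x0}) = {np.polynomial.polynomial.polyval(x0, coef):.3g}")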

Honestly, though, what the author is doing isn't much different than a Taylor approximation. If you're okay fitting any polynomial in a small neighbourhood of the data, then go whole hog and fit the best possible polynomial: a Taylor polynomial. You wouldn't usually need to go to that high degree to get what you want in a neighbourhood either.

wbl

Taylor is for points. For functions on an interval you want Chebyshev interpolation, and you probably want to throw in some poles to do better if L1, not L2, matters.
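
A small sketch of what Chebyshev interpolation looks like in practice (numpy ships it; the pole/rational part is beyond a few lines):

    # Chebyshev interpolation on [-1, 1]: sampling at Chebyshev points
    # avoids the Runge phenomenon of equally spaced nodes.
    import numpy as np
    from numpy.polynomial import chebyshev as C

    f = lambda x: 1.0 / (1.0 + 25.0 * x**2)      # Runge's function
    coef = C.chebinterpolate(f, 30)              # degree-30 interpolant

    xs = np.linspace(-1, 1, 1000)
    print(f"max error on [-1, 1]: {np.max(np.abs(C.chebval(xs, coef) - f(xs))):.2e}")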

mmillin

> So... are _all_ introductions to machine learning just extremely wrong here?

Maybe? https://www.argmin.net/p/thou-shalt-not-overfit

FabHK

This article is much better, more informative, and more factual than I'd have expected from the title. Note that it's part of an 8-article series.

Worth a read if you're ever fitting functions.

creata

A well-known related paper that I didn't see mentioned in the article (although Trefethen was mentioned) is "Six Myths of Polynomial Interpolation and Quadrature".

https://people.maths.ox.ac.uk/trefethen/mythspaper.pdf

ziofill

In this case, wouldn't a Fourier-type approach work better? At least there's no risk of the function blowing up, and it might need fewer parameters?
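
(For reference, a Fourier-feature fit is the same least-squares machinery with a bounded basis; a sketch of my own below, with the caveat that it implicitly assumes the function is periodic on the interval, so non-periodic targets pick up edge artifacts.)

    # Fourier features: bounded basis functions, fit by least squares.
    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.uniform(0, 1, 100)
    y = np.sin(2 * np.pi * x) + 0.3 * np.cos(6 * np.pi * x) + 0.05 * rng.standard_normal(100)

    K = 10                                       # number of harmonics
    def fourier_features(x):
        cols = [np.ones_like(x)]
        for k in range(1, K + 1):
            cols.append(np.cos(2 * np.pi * k * x))
            cols.append(np.sin(2 * np.pi * k * x))
        return np.column_stack(cols)

    coef, *_ = np.linalg.lstsq(fourier_features(x), y, rcond=None)
    y_hat = fourier_features(np.linspace(0, 1, 200)) @ coef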

programjames

This is why I believe a numerical methods course should be a requirement for any AI majors.

stuxnet79

I'm exactly in the category of AI majors who are not familiar with numerical methods. Can you broadly explain where the gap in AI pedagogy is and how students can fill it?

The series of articles posted here is interesting, and I plan to review them in more detail. But I'm concerned about what the "unknown-unknowns" are.

fancyfredbot

Wanted to add my voice to the chorus of appreciation for this article (actually a series of 8). Very informative and engaging.

petters

This part about double descent is really good: https://alexshtf.github.io/2025/03/27/Free-Poly.html#fnref:2

ForceBru

Right? I'm no stranger to ML, but this feels like magic, so cool! It clearly explains why normalizing features can be important (some polynomials blow up outside their range), provides a simple example of double descent and fits extremely high-degree polynomials without loss of generalization - amazing stuff!
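
A quick check of the normalization point (my own, not from the article): the degree-50 Legendre polynomial is bounded by 1 on [-1, 1] but grows enormously just outside it, which is why the features have to be scaled into that interval first.

    import numpy as np
    from numpy.polynomial import legendre as L

    coef = np.zeros(51)
    coef[50] = 1.0                     # coefficient vector selecting P_50
    for x0 in (0.5, 1.0, 1.05, 1.2):
        print(x0, L.legval(x0, coef))  # |P_50| <= 1 on [-1, 1], huge beyond it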

PaulHoule

See also https://en.wikipedia.org/wiki/Chebyshev_polynomials

They're the kind of math that's non-obvious and a bit intricate, yet if you knew the basics of how they worked and you were bored, you could sit down with pencil and paper and derive everything about them. Then you wouldn't be bored anymore.
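
For the pencil-and-paper route, the usual starting point is the defining identity, and the recurrence drops out of the cosine addition formula:

    T_n(\cos\theta) = \cos(n\theta),
    \qquad
    \cos\big((n+1)\theta\big) + \cos\big((n-1)\theta\big) = 2\cos\theta\,\cos(n\theta)
    \;\Longrightarrow\;
    T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x),
    \quad T_0(x) = 1,\ T_1(x) = x.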

ComplexSystems

Great article! Very curious how this orthogonalization + regularization idea could be extended to other kinds of series, such as Fourier series.

tc4v

Good article, but I am very bothered by the "standard basis"... it's called the canonical basis in math. I don't think "standard" is the right name in any context.

LegionMammal978

If you treat the ring of polynomials as a vector space in their coefficients, then the unit monomials 1, x, x^2, etc. form the standard basis of that vector space. Wikipedia has an example of "standard basis" in this sense [0].

[0] https://en.wikipedia.org/wiki/Standard_basis#Generalizations

xg15

Great article and clever use of linkbait!

SkyBelow

One thought I had while looking into regression recently is to consider the model created with a given regularization coefficient not as a line on a 2-dimensional graph, but as a slice of a surface in a 3-dimensional graph where the third dimension is the regularization coefficient.

In my case the model was logistic regression and what I was drawing were the classification boundary lines, but the thought is largely the same: view it as a 3D shape formed by the boundary lines, with hilltops as the areas where entire classification boundaries disappear once the regularization coefficient grows large enough to eliminate them. Impractical to do on models of any size and only useful when looking at two features at a time, but a fun consideration.

More on topic with the article: how well does this work when considering multiple features and the different combinations of them? Instead of sigma(n => 50) of x^n, what happens if you have sigma(n => 50) of sigma(m => 50) of (x^n)*(y^m)? Well, probably less than 50 in the second example; maybe it's fairer to have n and m go up to 7, so there are 49 total terms compared to the original 50, instead of the 2500 terms you'd get if both went up to 50.
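
A rough sketch of that tensor-product construction (my own, using Legendre features and degree 7 per input; it includes the degree-0 terms, so 8 x 8 = 64 columns rather than 49):

    # Tensor-product features for two inputs: every pairwise product of
    # a degree-<=7 Legendre basis in x with a degree-<=7 basis in y.
    import numpy as np
    from numpy.polynomial import legendre as L

    rng = np.random.default_rng(5)
    x = rng.uniform(-1, 1, 200)
    y = rng.uniform(-1, 1, 200)
    target = np.sin(np.pi * x) * np.cos(np.pi * y) + 0.05 * rng.standard_normal(200)

    deg = 7
    Vx = L.legvander(x, deg)                                         # shape (200, 8)
    Vy = L.legvander(y, deg)                                         # shape (200, 8)
    features = np.einsum('ni,nj->nij', Vx, Vy).reshape(len(x), -1)   # shape (200, 64)

    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    print(features.shape, coef.shape)                                # (200, 64) (64,)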