Low responsiveness of ML models to critical or deteriorating health conditions
56 comments
March 26, 2025 · wswope
AlotOfReading
Calling EKFs "ML" is certainly a choice.
getnormality
It is a reasonable choice, and especially with the quotes around it, completely understandable.
The distinction between statistical inference and machine learning is too blurry to police Kalman filters onto one side.
idiotsecant
It's machine learning until you understand how it works, then it's just control theory and filters again.
whatshisface
Diffusion models are a happy middle ground. :-)
Epa095
Is it less ML than linear regression?
klodolph
If you want to draw the line between ML and not ML, I think you’ll have to put Kalman filters and linear regression on the non-ML side. You can put support vector machines and neural networks on the ML side.
In some sense the exact place you draw the distinction is arbitrary. You could try to characterize where the distinction is by saying that models with fewer parameters and lower complexity tend to be called “not ML”, and models with more parameters and higher complexity tend to be called “ML”.
wswope
Hence the quotes ;).
CamperBob2
EKFs work by 'learning' the covariance matrix on the fly, so I don't see why not?
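To make the "learning on the fly" point concrete, here is a minimal linear Kalman filter update in NumPy, showing how the error covariance P and the gain K adapt online as measurements arrive. The matrices and measurements below are toy placeholders, not anything from an actual monitoring system.

```python
import numpy as np

# Toy linear Kalman filter update: the error covariance P (and hence the
# gain K) is refined online as each measurement arrives. F, H, Q, R are
# arbitrary placeholders, not from any real monitoring system.
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])        # state transition (position, velocity)
H = np.array([[1.0, 0.0]])        # we only measure position
Q = 0.01 * np.eye(2)              # process noise covariance
R = np.array([[0.5]])             # measurement noise covariance

x = np.zeros((2, 1))              # state estimate
P = np.eye(2)                     # error covariance, updated every step

def kf_step(x, P, z):
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: K weighs prediction vs. measurement according to P
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

for z in [1.2, 1.9, 3.1, 4.05]:   # fake position readings
    x, P = kf_step(x, P, np.array([[z]]))
```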
nyrikki
As an intuition for why many people see these as different: PAC learning is about compression, while a KF/EKF is more like a Taylor expansion.
The specific types of PAC learning that this paper covers have problems with a simplicity bias and fairly low sensitivity.
While it is based on UHATs, the following paper may provide some insight:
https://arxiv.org/abs/2502.02393
Obviously LLMs and LRMs are the most studied, but even the recent posts on here from Anthropic show that without a few high-probability entries in the top-k results, confabulations are difficult for transformers.
Obviously there are PAC learning methods that target anomaly detection, but they are very different from even EKF + Mc
You will note in this paper that even highly weighted features exhibited low sensitivity.
While the industry may find some pathological cases that make the approach usable, autograd and the need for parallelism make applying this paper's methods to tiny variations in multivariate problems ambitious.
They also trained only on medical data. Part of the reason the foundation models do so well is that they encode verifiers from a huge corpus, which invalidates the traditional bias-variance tradeoffs from the early-'90s papers.
But they are still selecting from the needles and don't have access to the hay in the haystack.
The following paper is not really related, except that it shows how compression exacerbates that problem:
https://arxiv.org/abs/2205.06977
Chaitin's constant, which encodes the halting problem and is both normal and uncomputable, sits at the extreme top end of computability, but it relates to the compression idea.
EKFs have access to the computable reals, and while the underlying dynamics may be non-linear, KFs and EKFs can be thought of through the lens of linearized approximations.
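To make that "lens" concrete, here is a toy sketch (my own, not from the paper) of the EKF linearization step: a nonlinear measurement function h(x) is replaced at each update by its Jacobian, i.e. a first-order Taylor expansion around the current estimate.

```python
import numpy as np

# Toy illustration of "linearization as a lens": an EKF handles a nonlinear
# measurement h(x) by re-linearizing it (a first-order Taylor expansion)
# around the current estimate at every update. The range measurement here is
# arbitrary, not from the paper.

def h(x):
    # nonlinear measurement: range from the origin to a 2D position
    return np.array([[np.hypot(x[0, 0], x[1, 0])]])

def H_jacobian(x):
    # Jacobian of h at the current estimate (the local linear model)
    r = np.hypot(x[0, 0], x[1, 0])
    return np.array([[x[0, 0] / r, x[1, 0] / r]])

x = np.array([[3.0], [4.0]])      # current state estimate
P = np.eye(2)                     # current error covariance
R = np.array([[0.1]])             # measurement noise

z = np.array([[5.2]])             # observed range
Hx = H_jacobian(x)                # linearize h around x
S = Hx @ P @ Hx.T + R
K = P @ Hx.T @ np.linalg.inv(S)
x = x + K @ (z - h(x))            # ordinary KF update on the linearized model
P = (np.eye(2) - K @ Hx) @ P
```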
If the diagnostic indicators were both ergodic and Markovian, this paper's approach would probably be fairly reliable.
But these efforts are really about finding a many-to-one reduction that works.
I am skeptical about it in this case for PAC ML, but perhaps they will find a pathological case.
But the tradeoffs between statistical learning and expansive methods are quite different.
Obviously hype cycles drive efforts; I encourage you to look at this year's AAAI conference report and see that you are not alone in your frustration with the single-minded approach.
IMHO this paper is a net positive, showing that we are moving from a broad exploration to targeted applications.
But that is just my opinion.
jvanderbot
Parameter estimation is ML now?
klodolph
I think ML is in quotes for a reason—the reason is because the usage is not typical.
bbstats
Am I missing something or is this just "We built models that are bad"?
ohgr
Bad model, bad method or bad paper?
magicalhippo
For IHM prediction, LSTM models and transformer models were trained for 100 epochs using the MIMIC-III and eICU datasets separately.
I might be blind, but I don't see any mention of loss. Did they stop at 100 because it was a nice round number or because it was a good place to stop?
The LSTM model they used had 7k trainable parameters, the CW-LSTM model 153k while the transformer model had 800k parameters (300k trainable parameters and 600k optimizer parameters as they say).
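(For reference, those trainable-parameter figures are what the usual PyTorch idiom reports; the tiny LSTM below is just a stand-in at roughly the 7k scale, not the architecture from the paper.)

```python
import torch.nn as nn

# Counting trainable parameters, as in the 7k / 153k / 300k figures above.
# This LSTM is only a stand-in at roughly the 7k scale, not the paper's model.
model = nn.LSTM(input_size=17, hidden_size=32, batch_first=True)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,}  total: {total:,}")
```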
I don't follow the field close enough, but is it reasonable these models all converged at the same time, given the large difference in size?
They mention the transformer model outperforming the LSTMs, but I wonder if it could have done a lot better.
rakejake
A 7k-param LSTM is very tiny. I'm not sure LSTMs would even work at that scale, although someone with more theoretical knowledge can correct me on this.
As an aside, I'm trying to train transformers for some classification tasks on audio data. The models are "small" (1M-15M params at most), and I find they are very finicky to train. Below 1M parameters I find them hard to train at all. I have thrown all sorts of learning rate schedules at them, and the best I can get is that the network learns for a bit and then plateaus, after which I can't do anything to get it out of that minimum. Training an LSTM/GRU on the same data gives me a much better loss value.
I couldn't find many papers on training transformers at that scale. The only one I was able to find was MS's TinyStories [0], but that paper didn't delve much into how they trained the models and whether they trained from scratch or distilled from a larger model.
At those scales, I find LSTMs and CNNs are a lot more stable. The few online threads I've found comparing LSTMs and transformers said the same thing: transformers need a lot more data and larger models to reach parity with and then exceed LSTMs/GRUs/CNNs, maybe because the inductive bias those architectures provide is hard to beat at small scale. Others can comment on what they've seen.
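For what it's worth, the recipe most often suggested for stabilizing small transformers is linear warmup followed by cosine decay; a minimal PyTorch sketch is below. The model, peak learning rate, and step counts are arbitrary placeholders, and there's no guarantee this fixes the plateau described above.

```python
import math
import torch

# Linear warmup + cosine decay, a common stabilizer for small transformers.
# The model, peak LR, warmup length, and total steps are placeholders.
model = torch.nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

warmup_steps, total_steps = 1_000, 50_000

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                       # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, after each optimizer.step():
#     scheduler.step()
```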
Al-Khwarizmi
I don't have much help to offer, but just to echo your experience: at my group we have tried to train Transformers from scratch for various NLP tasks, and we have always found them extremely brittle, with BiLSTMs working better. We only succeeded by following a pre-established recipe (e.g. training a BERT model from scratch for a new language, where the architecture, parameters and tasks are as in BERT), or of course by fine-tuning existing models. But just throwing some layers at a problem and training them from scratch... nope, that won't work without arcane knowledge that doesn't seem to be written anywhere accessible. This is one of the reasons why I dislike Transformers and root for the likes of RWKV to take the throne.
rakejake
I think the "arcane knowledge" is true for LLMs (billions). But there are lots of people who train models in the open in the hundreds of millions realm, but never below. Maybe transformers simply don't work as well below a size and data threshold.
PaulHoule
What ever happened to early stopping?
I see so many papers where people train neural networks with half-baked recipes. I think I first saw early stopping around 1990, but it is still so common for people to pick some arbitrary number of epochs to run. I have to admit I never liked the term "early stopping"; I think people should have called it just "stopping", because the current name makes it seem optional.
Back when I was training LSTM networks it was straightforward to train nets reliably with early stopping...
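For anyone who hasn't seen it, the patience-based version is only a few lines; a generic sketch follows, with the model, training, and evaluation callables left as placeholders (any PyTorch-style module with state_dict/load_state_dict will do).

```python
import copy

# Generic patience-based ("smart") stopping: train until the validation loss
# stops improving, then restore the best weights. The train/eval callables
# and the patience value are placeholders, not any particular paper's recipe.

def train_with_stopping(model, train_one_epoch, eval_val_loss,
                        max_epochs=1000, patience=5):
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = eval_val_loss(model)
        if val_loss < best_loss:
            best_loss, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())  # remember best weights
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                # no improvement for `patience` epochs: stop
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_loss
```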
Al-Khwarizmi
I'm also annoyed by this. I suppose the main reason is that 20 years ago, if you didn't use early stopping, your accuracy would typically plummet. In earlier, smaller neural networks, overfitting was a huge issue, and the lack of dropout, batch normalization, etc. made learning much more brittle.
Now the young'uns don't bother, because you can just set 100 epochs, or whatever, and the result might not be optimal but it will generally be fine. Still, it's a pity, because you're often wasting computational resources that could be spent on trying alternative architectures, exploring hyperparameters, or whatever.
BTW, I also think "early stopping" is a terrible name. If you don't know the term, it suggests that you're going to undertrain the network, sacrificing some accuracy for efficiency. No one wants undertrained networks. I don't think it's an overstatement to say that if it were called "adaptive stopping", "validation-guided stopping", or even something catchier like "smart stopping", more people would probably use it.
PaulHoule
I have a smart RSS reader, YOShInOn, which uses BERT + a probability-calibrated SVM as its main model; I want to make a general-purpose model trainer for text classification that can handle harder problems.
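(For concreteness, the probability-calibrated-SVM piece of such a pipeline looks roughly like the scikit-learn sketch below; the embeddings are random stand-ins rather than YOShInOn's actual BERT features.)

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Stand-in for document embeddings: in the real pipeline X would come from a
# BERT-family encoder and y from labeled feed items.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 384))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Linear SVM wrapped in probability calibration (Platt scaling via "sigmoid").
clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
clf.fit(X_tr, y_tr)

probs = clf.predict_proba(X_te)[:, 1]
print("AUC-ROC:", roc_auc_score(y_te, probs))
```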
People who hold court on ML forums will tell you fine-tuned BERT is the way to go, but BERT fine-tuning doesn't seem to be compatible with early stopping under anything like the training recipes I see in the literature. Compared to the old days these networks soak up knowledge like a sponge; my hunch is that with N=10,000 samples or so you don't benefit from running more than one epoch, because the network doesn't have the capacity to learn from that many samples.
I find it depressing to see arXiv papers where people copy a BERT training recipe from other papers and compare it on 5-15 different text classification problems with maybe N=500 samples. My BERT experiments take about 30 minutes, so it's no small thing to do parametric scans on them, particularly when the epoch count is one of the parameters. With "smart stopping" I wouldn't be afraid of undertraining models, so I could run trainings all night and trust that I'm seeing representative performance as I vary parameters.
My plan is to couple ModernBERT to an LSTM or Bi-LSTM model, as the literature seems to show that this frequently ties or beats fine-tuned BERT, and my experience so far is that I can build reliable trainers for LSTMs, whereas team fine-tuned-BERT seems indifferent to the very idea of "reliable".
Another pet peeve is all the papers with N=500 samples, when I regularly get N=10,000+ in systems I use every day; on a rainy weekend I can lie in bed with my iPad, switch to an Android tablet when the battery runs out, and get N=5,000 samples. [1] When I wrote my first text classification paper we found we needed N=10,000 to get really good models. Sure, the world knowledge in BERT helps models learn fast, and that's great (a problem I worried about in 2005 and still worry about, because I think the average person wants good results at N<10!), but I need calibrated, usable accuracy, and I look at AUC-ROC as my metric, not "accuracy", F1, or anything like that.
Then there's the effort people waste on things that can't possibly work, like Word2Vec; it seems people can read a lot of papers and still not see what's right in front of them: Word2Vec is useless! I want to write a meta-analysis, but instead I'm writing a diatribe, and I'm not going to be happy until I repeat the paradigm with methods that are... repeatable, not for the science but for the engineering.
[1] with hallucinations as a side effect if it is a visual task but so what
rakejake
Nowadays, even what counts as an "epoch" is not well defined. Traditionally it meant a full pass over the training set, but datasets are so massive today that many now define an epoch as X steps, where a step is one minibatch (of whatever size) drawn from the training set. So one "epoch" is a random sample of X minibatches. I'd guess the logic is that datasets are so massive that you pick as much data as you can fit in VRAM.
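Concretely, that step-based notion of an epoch amounts to something like the sketch below, where `infinite_batches` is assumed to be any iterator that yields shuffled minibatches indefinitely and `train_step` does the usual forward/backward/update on one minibatch.

```python
import itertools

# An "epoch" defined as a fixed number of optimizer steps rather than a full
# pass over the data. `infinite_batches` is assumed to yield shuffled
# minibatches forever; `train_step` handles one minibatch.
STEPS_PER_EPOCH = 1_000

def run_epoch(infinite_batches, train_step):
    for batch in itertools.islice(infinite_batches, STEPS_PER_EPOCH):
        train_step(batch)
```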
Karpathy's Zero To Hero series also uses this.
MeteorMarc
Maybe the training set had too many zero-shot patients.
amelius
Maybe start a Kaggle competition?
timewizard
All of this seems designed to make the hospital more labor-efficient. None of this seems designed to improve long-term outcomes for patients.
My suspicion that this technology gets the hype it does as part of an effort to reduce wages for all workers continues to grow.
chiph
[flagged]
anthk
LLMs are organic? They somehow obey the laws of thermodynamics through some parallel in the algorithms and underlying math? It would be amazing if there were some parallel between biology (especially fungi, with their emergent properties) and neural networks...
wswope
I work in the ICU monitoring field, on the R&D team of a company with live systems at dozens of hospitals and multiple FDA approvals. We use extended Kalman filters (i.e. non-blackbox "ML") to estimate certain lab values that are highly indicative of a patient crashing, based on live data from whatever set of monitors they're hooked up to - and it's highly robust.
What the authors of this paper are doing is throwing stuff at the wall to see if it works, and publishing the results. That's not necessarily a bad thing at all, but I say this to underline that their results are not at all reflective of SOTA capabilities, and they're not doing much exploration of prior art.