
We investigated Amsterdam's attempt to build a 'fair' fraud detection model

bananaquant

What nobody seems to talk about is that the resulting models are basically garbage. If you look at the last confusion matrix provided, their model is right in about 2/3 of cases when it makes a positive prediction, while the actual positives are about 60%. So any improvement is marginal at best and a far cry from the ~90% accuracy you would expect from a model in such a high-stakes scenario. They could have thrown out half of the cases at random and had about the same reduction in case load without introducing any bias into the process.
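A minimal sketch of that arithmetic, using only the rough figures quoted above (the exact confusion-matrix counts are in the article):

    # Rough figures from the comment above, not the article's exact counts.
    base_rate = 0.60        # share of cases that are actually positive
    model_precision = 0.67  # share of the model's positive flags that are correct

    # Flagging half of all cases uniformly at random would have an expected
    # precision equal to the base rate, with the same reduction in case load.
    random_precision = base_rate

    lift = model_precision / random_precision - 1
    print(f"relative improvement over random selection: {lift:.0%}")

That works out to roughly a 12% relative improvement, which is what makes the gain look marginal.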

A part of me wonders how such unsatisfactory projects get greenlit and kept alive at all. I can imagine a higher-level official who knows almost nothing about machine learning getting bamboozled by a bright-eyed university graduate who has only worked on toy projects before. After some time, everyone is afraid to admit that the project has basically failed, but it gets pushed through anyway to save face. There needs to be a process to kill such crap fast and without blame, so that the powers that be don't mount unreasonable opposition to removing something that is provably garbage and a waste of tax money.

tbrownaw

> But the model designers were aware that features could be correlated with demographic groups in a way that would make them proxies.

There's a huge problem with people trying to use umbrella usage to predict flooding. Some people are trying to develop a computer model that uses rainfall instead, but watchdog groups have raised concerns that rainfall may be used as a proxy for umbrella usage.

(It seems rather strange to expect a statistical model trained for accuracy to infer and route through a shadow variable that makes it less accurate, simply because that variable is easy for humans to observe directly and use as a lossy shortcut, or to promote goals that aren't part of the labels being trained on.)

> These are two sets of unavoidable tradeoffs: focusing on one fairness definition can lead to worse outcomes on others. Similarly, focusing on one group can lead to worse performance for other groups. In evaluating its model, the city made a choice to focus on false positives and on reducing ethnicity/nationality based disparities. Precisely because the reweighting procedure made some gains in this direction, the model did worse on other dimensions.

Nice to see an investigation that's serious enough to acknowledge this.

tripletao

They correctly note the existence of a tradeoff, but I don't find their statement of it very clear. Ideally, a model would be fair in the senses that:

1. In aggregate over any nationality, people face the same probability of a false positive.

2. Two people who are identical except for their nationality face the same probability of a false positive.

In general, it's impossible to achieve both properties. If the output and at least one other input correlate with nationality, then a model that ignores nationality fails (1). We can add back nationality and reweight to fix that, but then it fails (2).

This tradeoff is most frequently discussed in the context of statistical models, since those make it explicit. It applies to any decision-making process though, including human decisions.
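A toy simulation makes the tension concrete. Everything below is synthetic (a made-up group variable, a correlated input, a simple threshold rule), but it shows a group-blind rule failing (1) and a group-aware fix failing (2):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    group = rng.integers(0, 2, n)                         # stand-in for nationality
    x = rng.normal(loc=group.astype(float), scale=1.0)    # input correlated with group
    y = rng.random(n) < 1.0 / (1.0 + np.exp(-(x - 1.0)))  # outcome correlated with x

    def fpr(pred, mask):
        neg = mask & ~y                        # actual negatives in this group
        return (pred & neg).sum() / neg.sum()

    # One threshold for everyone: identical people are treated identically (2),
    # but the groups' false positive rates differ (fails 1).
    pred_a = x > 0.5
    print("single threshold, FPR by group:",
          fpr(pred_a, group == 0), fpr(pred_a, group == 1))

    # Per-group thresholds chosen to equalise the FPR (fixes 1), which means two
    # people with the same x but different group can get different flags (fails 2).
    target = 0.10
    thr = {g: np.quantile(x[(group == g) & ~y], 1 - target) for g in (0, 1)}
    pred_b = x > np.where(group == 0, thr[0], thr[1])
    print("per-group thresholds:", thr)
    print("equalised FPR by group:",
          fpr(pred_b, group == 0), fpr(pred_b, group == 1))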

londons_explore

> Two people who are identical except for their nationality face the same probability of a false positive

It would be immoral to disadvantage one nationality over another. But we also cannot disadvantage one age group over another. Or one gender over another. Or one hair colour over another. Or one brand of car over another.

So if we update this statement:

> Two people who are identical except for any set of properties face the same probability of a false positive.

With that new constraint, I don't believe it is possible to construct a model which outperforms a data-less coin flip.

drdaeman

I think you took too big a jump by treating all properties the same, as if the only way to make the system fair is to make it entirely blind to the applicant.

We tend to distinguish between ascribed and achieved characteristics. It is considered to be unethical to discriminate upon things a person has no control over, such as their nationality, gender, age or natural hair color.

However, things like a car brand are entirely dependent on one's own actions, and if there's a meaningful, statistically significant correlation between owning a Maserati and fraudulently applying for welfare, I'm not entirely sure it would be unethical to consider such a factor.

And it also depends on what a false positive means for the person in question. Fairness (like most things social) is not binary: outright rejections can be very unfair, while additional scrutiny is less so, even though still not fair (it causes delays and extra stress). If things are working normally, I believe there's a sort of unspoken social agreement (ever-changing, of course, as times and circumstances evolve) on the balance between fairness and abuse that can be afforded.

Borealid

I think the ethical desire is not to remove bias across all properties. Properties that result from an individual's conscious choices are allowed to be used as factors.

One can't change one's race, but changing marital status is possible.

Where it gets tricky is things like physical fitness or social groups...

kurthr

This is a really key result. You can't effectively be "blind" to a parameter that is significantly correlated with multiple inputs and with your output prediction. By using those inputs to minimize false positives you are not statistically blind, and you can't correct the statistics while staying blind.

My suspicion is that in many situations you could build a detector/estimator which was fairly close to being blind without a significant total increase in false positives, but how much is too much?

I'm actually more concerned that where I live even accuracy has ceased to be the point.

thatguymike

Congrats Amsterdam: they funded a worthy and feasible project; put appropriate ethical guardrails in place; iterated scientifically; then didn’t deploy when they couldn’t achieve a result that satisfied their guardrails. We need more of this in the world.

tbrownaw

What were the error rates for the various groups with the old process? Was the new process that included the model actually worse for any group, or was it just uneven in how much better it was?

jaoane

[flagged]

nxobject

> because I don't even need to look at the data to know that some groups are more likely to commit fraud.

That is by definition prejudice: bias without evidence. Perhaps they want to avoid that.

jaoane

Thankfully, this project got evidence. Unfortunately, it was shelved.

3abiton

> A more concerning limitation is that when the city re-ran parts of its analysis, it did not fully replicate its own data and results. For example, the city was unable to replicate its train and test split. Furthermore, the data related to the model after reweighting is not identical to what the city published in its bias report and although the results are substantively the same, the differences cannot be explained by mere rounding errors.

Very well written, but that last part is concerning and points to one question: did they hire interns? How come they don't have systems for this? It casts serious doubt on the whole experiment.

wongarsu

A big part of the difficulty of such an attempt is that we don't know the ground truth. A model is fair or unbiased if its performance is equally good for all groups. Meaning e.g. if 90% of cases of Arabs committing fraud are flagged as fraud, then 90% of cases of Danish people committing fraud should be flagged as fraud. The paper agrees on this.

The issue is that we don't know how many Danish commit fraud, and we don't know how many Arabs commit fraud, because we don't trust the old process to be unbiased. So how are we supposed to judge if the new model is unbiased? This seems fundamentally impossible without improving our ground truth in some way.

The project presented here instead tries to do some mental gymnastics to define a version of "fair" that doesn't require that better ground truth. They were able to evaluate their results on the false-positive rate by investigating the flagged cases, but they were completely in the dark about the false-negative rate.
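For concreteness, the per-group recall that definition asks for could be computed like this (made-up records, not the city's data); the catch is the denominator, which needs every actual fraud case, including the ones the old process never flagged:

    from collections import defaultdict

    # hypothetical records: (group, actually_fraud, flagged_by_model)
    records = [
        ("A", True, True), ("A", True, False), ("A", False, False),
        ("B", True, True), ("B", True, True), ("B", False, True),
    ]

    flagged_fraud = defaultdict(int)
    actual_fraud = defaultdict(int)
    for group, fraud, flagged in records:
        if fraud:
            actual_fraud[group] += 1   # requires knowing about un-flagged fraud too
            if flagged:
                flagged_fraud[group] += 1

    for group in actual_fraud:
        print(group, "recall:", flagged_fraud[group] / actual_fraud[group])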

In the end, the new model was just as biased, but in the other direction, and performance was simply worse:

> In addition to the reappearance of biases, the model’s performance in the pilot also deteriorated. Crucially, the model was meant to lead to fewer investigations and more rejections. What happened instead was mostly an increase in investigations, while the likelihood to find investigation worthy applications barely changed in comparison to the analogue process. In late November 2023, the city announced that it would shelve the pilot.

tomp

Key point:

> The model is considered fair if its performance is equal across these groups.

One can immediately see why this is problematic by considering an equivalent example in a less controversial (i.e. less emotionally charged) situation.

Should basketball performance be equal across racial, or sex groups? How about marathon performance?

It’s not unusual that relevant features are correlated with protected features. In the specific example above, being an immigrant is likely correlated with not knowing the local language, therefore being underemployed and hence more likely to apply for benefits.

atherton33

I think they're saying something more subtle.

In your basketball analogy, it's more like they have a model that predicts basketball performance, and they're saying that model should predict performance equally well across groups, not that the groups should themselves perform equally well.

tomp

You’re right, I misinterpreted it.

Jimmc414

Amsterdam reduced bias by one measure (False Positive Share) and bias increased by another measure (False Discovery Rate). This isn’t a failure of implementation; it’s a mathematical reality that you often can’t satisfy multiple fairness criteria simultaneously.

Training on past human decisions inevitably bakes in existing biases.
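A small made-up example of how the two metrics can pull in opposite directions, taking "false positive share" as a group's share of all false positives and FDR as FP/(FP+TP) within a group (the article defines its metrics precisely):

    # Hypothetical per-group counts before and after reweighting.
    counts = {
        "before": {"G1": {"tp": 90, "fp": 30}, "G2": {"tp": 30, "fp": 10}},
        "after":  {"G1": {"tp": 50, "fp": 20}, "G2": {"tp": 20, "fp": 20}},
    }

    for stage, groups in counts.items():
        total_fp = sum(c["fp"] for c in groups.values())
        for g, c in groups.items():
            fp_share = c["fp"] / total_fp        # group's share of all false positives
            fdr = c["fp"] / (c["fp"] + c["tp"])  # false discovery rate within the group
            print(f"{stage:6} {g}: FP share {fp_share:.2f}, FDR {fdr:.2f}")

In these numbers the false positive shares end up equal while the FDR gap widens, which is the shape of the tradeoff.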

ncruces

I have a growing feeling that the only way to be fair in these situations is to be completely random.

LorenPechtel

Why is there so much focus on "fair" even when reality isn't?

Not all misdeeds are equally likely to be detected. What matters is minimizing the false positives and false negatives. But it sounds like they don't even have a ground truth to compare against, making the whole thing an exercise in bureaucracy.

Fraterkes

Who says reality isn't fair? Isn't that up to us, the people inhabiting reality?

BonoboIO

The article talks a lot about fairness metrics but never mentions whether the system actually catches fraud.

Without figures for true positives, recall, or financial recoveries, its effectiveness remains completely in the dark.

In short: great for moral grandstanding in the comments section, but zero evidence that taxpayer money or investigative time was ever saved.

stefan_

It also doesn't mention what amounts we are even talking about; given the expansive size of the Dutch government, that's what would determine whether this is a useful thing at all.

zeroCalories

Does anyone know what they mean by reweighting demographics? Are they penalizing incorrect classifications more heavily for those demographics, making sure that each demographic is equally represented, or something else? Putting aside the model's degraded performance, I think it's fair to try to make sure the model performs well for all demographics.
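One common technique that fits the description, though I don't know whether it's exactly what the city did, is reweighing in the Kamiran & Calders sense: each training example gets a sample weight so that group membership and the label look statistically independent, and the classifier is then fit with those weights. A minimal sketch:

    from collections import Counter

    # hypothetical training rows: (group, label)
    rows = [("A", 1), ("A", 0), ("A", 0), ("B", 1), ("B", 1), ("B", 0)]
    n = len(rows)

    p_group = Counter(g for g, _ in rows)   # marginal counts per group
    p_label = Counter(y for _, y in rows)   # marginal counts per label
    p_joint = Counter(rows)                 # joint counts per (group, label)

    # w(g, y) = P(g) * P(y) / P(g, y): upweights under-represented combinations.
    weights = [
        (p_group[g] / n) * (p_label[y] / n) / (p_joint[(g, y)] / n)
        for g, y in rows
    ]
    print(weights)  # e.g. pass as sample_weight to most classifiers' fit()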
