A Developer Accidentally Found CSAM in AI Data. Google Banned Him for It
12 comments · December 11, 2025
jsnell
As a small point of order, they did not get banned for "finding CSAM," as the outrage- and clickbait title claims. They got banned for uploading child porn to Google Drive. They did not find it, and their later reporting of the data set to an appropriate organization is not why they got banned.
jfindper
>They got banned for uploading child porn to Google Drive
They uploaded the full "widely-used" training dataset, which happened to include CSAM (child sexual abuse material).
While the title of the article is not great, your wording here implies that they purposefully uploaded some independent CSAM pictures, which is not accurate.
AdamJacobMuller
No, but "They got banned for uploading child porn to Google Drive" is a correct framing, and "Google banned a developer for finding child porn" is incorrect.
There is important additional context around it, of course, which mitigates (should remove) any criminal legal implications, and should also result in Google unsuspending his account in a reasonable timeframe, but what happened is also reasonable. Google does automated scans of all data uploaded to Drive, caught CP images being uploaded (presumably via hashes from something like NCMEC?), and banned the user (a rough sketch of that kind of check is below). Totally reasonable thing. Google should have an appeal process where a reasonable human can look at it and say "oh shit, the guy just uploaded 100M AI training images and 7 of them were CP, he's not a pedo, unban him, ask him not to do it again, and report this to someone."
The headline frames it like the story was "A developer found CP in AI training data and Google banned him in retaliation for reporting it." Totally disingenuous framing of the situation.
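For a rough sense of what that automated scan might look like, here's a minimal sketch that just checks uploaded files against a set of known-bad hashes. The function names and the plain SHA-256 matching are my own simplification; real pipelines use proprietary perceptual hashes (PhotoDNA-style) distributed by clearinghouses like NCMEC, not exact digests.

```python
import hashlib
from pathlib import Path

# Hypothetical set of hex digests for known-bad images, e.g. supplied by a
# clearinghouse like NCMEC. Real systems use perceptual hashes, not SHA-256;
# plain digests are used here only to keep the sketch short.
KNOWN_BAD_HASHES: set[str] = {
    "placeholder-digest-1",
    "placeholder-digest-2",
}

def file_matches_known_bad(path: Path) -> bool:
    """Hash one uploaded file and check it against the known-bad set."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest in KNOWN_BAD_HASHES

def scan_upload(root: Path) -> list[Path]:
    """Scan every file in an upload and return the ones that matched.

    A handful of hits out of 100M files still trips the check; nothing at
    this layer knows or cares about the uploader's intent.
    """
    return [p for p in root.rglob("*") if p.is_file() and file_matches_known_bad(p)]
```

An appeal process would then be the human layer sitting on top of a purely mechanical match like this.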
jeffbee
Literally every headline that 404 media has published about subjects I understand first-hand has been false.
deltoidmaximus
Back when the first moat-creation gambit for AI failed (the claim that they were creating SkyNet, so the government needed to block anyone else from working on SkyNet, since only OpenAI could be trusted to control it, not just any rando), they moved on to the safety angle with the same idea. I recall seeing an infographic that all the major players were signed onto some kind of safety pledge: Meta, OpenAI, Microsoft, etc. Basically they didn't want anyone else training on the whole world's data because only they could be trusted not to do nefarious things with it. The infographic had a statement about not training on CSAM and revenge porn and the like, but the corpospeak it was worded in made it sound like they were promising not to do it anymore, not that they had never done it.
I've tried to find this graphic again several times over the years, but it's either been scrubbed from the internet or I just can't remember enough details to find it. Amusingly, it only just occurred to me that maybe I should ask ChatGPT to help me find it.
jsheard
> The infographic had a statement about not training on CSAM and revenge porn and the like but the corpospeak it was worded in made it sound like they were promising not to do it anymore, not that they never did.
We know they did: an earlier version of the LAION dataset was found to contain CSAM after everyone had already trained their image generation models on it.
https://www.theverge.com/2023/12/20/24009418/generative-ai-i...
bsowl
More like "A developer accidentally uploaded child porn to his Google Drive account and Google banned him for it".
jkaplowitz
The penalties for unknowingly possessing or transmitting child porn are far too harsh, both in this case and in general (far beyond just Google's corporate policies).
Again, to avoid misunderstandings, I said unknowingly - I'm not defending anything about people who knowingly possess or traffic in child porn, other than for the few appropriate purposes like reporting it to the proper authorities when discovered.
giantg2
This raises an interesting point. Do you need to train models using CSAM so that the model can self-enforce restrictions on CSAM? If so, I wonder what moral/ethical questions this brings up.
jsheard
It's a delicate subject but not an unprecedented one. Automatic detection of already known CSAM images (as opposed to heuristic detection of unknown images) has been around for much longer than AI, and for that service to exist someone has to handle the actual CSAM before it's reduced to a perceptual hash in a database.
Maybe AI detection is more ethically fraught since you'd need to keep hold of the CSAM until the next training run, rather than hashing it then immediately destroying it.
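To make "reduced to a perceptual hash" concrete, here's a minimal sketch of one very simple perceptual hash (a difference hash). The actual matching hashes used for this (PhotoDNA etc.) are proprietary and far more robust; the point is only that an image collapses into a small fingerprint, and matching becomes a bit-distance comparison rather than keeping the image around.

```python
from PIL import Image  # Pillow

def dhash(image_path: str, hash_size: int = 8) -> int:
    """Difference hash: shrink to greyscale, compare neighbouring pixels.

    Each bit records whether a pixel is brighter than the one to its right,
    so visually similar images produce similar bit patterns.
    """
    img = Image.open(image_path).convert("L").resize((hash_size + 1, hash_size))
    pixels = list(img.getdata())
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits = (bits << 1) | int(left > right)
    return bits

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; a small distance means a near-duplicate."""
    return bin(a ^ b).count("1")

# Matching against a database is then "is the distance to any stored hash
# below some threshold?" -- the original image never has to be retained once
# its hash has been recorded.
```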
Just a few days ago I was doing some low-paid (well, not so low) AI classification task - akin to Mechanical Turk ones - for a very big company and was - involuntarily, since I guess they don't review them before showing - shown an AI image by the platform depicting a naked man and a naked kid, though it was more Barbie-like than anything else. I didn't really enjoy the view tbh; I contacted them but got no answer back.