We in-housed our data labelling
28 comments · February 27, 2025
turtlebits
"and assess financial penalties for failed tests"
That's an immediate nope for me. I don't care if I can file a dispute; unless I can resolve it then and there, I'm not going to be at the whim of some faceless escalation system or an uninformed CS agent.
mlsu
Where I'm from, "in house" means employees. I see "contractors" and "negative earnings" in the same article.
They do say that reviewers have to have some kind of aviation experience. I'd be more curious to read an article about how they source the talent here.
gpvos
> All labellers are either licensed pilots or controllers (or VATSIM pilots/controllers).
I would think such people can make better money by actually working as a pilot or controller?
JamesSwift
And doing the quick math based on the UI saying that 25,150 pts == $50.30, along with them saying 600 pts ~= 15 minutes of work... this comes out to ~$4.80/hr. No thanks.
EDIT: and that assumes perfect accuracy; the actual pay will be lower if you miss anything
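(Same back-of-the-envelope calculation spelled out, using the point values quoted above; the per-hour figure is approximate:)

    # Rough rate check using the numbers quoted above
    usd_per_point = 50.30 / 25_150          # 25,150 pts == $50.30  ->  $0.002 per point
    points_per_15_min = 600                 # 600 pts ~= 15 minutes of work
    hourly = usd_per_point * points_per_15_min * 4
    print(f"~${hourly:.2f}/hr")             # ~$4.80/hr, before any penalties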
repiret
Not all pilots have a commercial pilot's license, without which you can't get paid to fly at all.
Early career professional pilots make surprisingly little money flying.
And professional pilots of all sorts often find themselves in a hotel in a city away from home with time to kill.
llm_trw
Data is king. Even when a new, better model comes along, a high-quality dataset is still just as valuable.
Paying top performers above market rates to do nothing but data labelling is a moat that just keeps getting deeper.
xnx
Good data and good evals are two legs of the 3-legged stool that a lot of AI teams are missing.
antognini
It also can't really be overstated how helpful it is as an ML engineer to simply spend the time going through thousands of examples yourself. If you abstract yourself away from the data and just "make metric go up" you'll be missing out on valuable insights about how and why your model might be failing.
neilv
I think this didn't age well, for HN, and it prompts some serious questions about our techbro startup culture.
> Obvious but necessary: to incentivize productive work, we tie compensation to the number of characters transcribed, and assess financial penalties for failed tests (more on tests below). Penalties are priced such that subpar performance will result in little to no earnings for the labeller.
So, these aren't employees? The writeup talks about not trusting gig workers, but it sounds like they have gig workers, and a particularly questionable kind.
Not like independent contractors with the usual freedoms. But rather, under a punishing set of Kafkaesque rules, like someone was thinking only of computer programs, oops. "Gamified", with huge negative points penalties and everything. To be under threat of not getting paid at all.
I see that this article is dated the 16th, so it's before the HN outrage last week, over the founders who demoed a system for monitoring factory worker performance, and were ripped a new one online for dehumanizing employees.
And that despite the factory system being less invasive, dehumanizing, and potentially labor-law-violating than what's described in this article: moment-to-moment whip-cracking of gig workers, and even not paying them at all.
I'm not even sure you'd get away with calling them "independent contractors", under these conditions, when workers save copies of this blog post, to show to labor lawyers and state regulators.
(Incidentally, I wasn't aware that a company working in aviation gets skilled workers this way. The usual way I've seen is to hire someone, with all the respect, rights, and benefits that entails. Or to hire a consultant who is decidedly not treated like a gig worker in a techno-dystopian sweatshop.)
I don't want Internet mob justice here, but I want to ask who is advising these startups regarding how they think of their place in the world, relative to other humans?
I can understand getting as far as VC pitches while fixated on other aspects of the business problem, and still passing the VCs' "does this person have a good enough chance at a big exit" gut-feel test. But are there no ongoing checks and advising, so that people don't miss everything else?
anon373839
> To be under threat of not getting paid at all.
If they are operating as described, it’s almost certainly illegal. They deserve to be hit with a nice, fat PAGA lawsuit. These workers would have to satisfy the “ABC test” to be exempt from minimum wage obligations, and it’s a difficult standard to meet: https://www.labor.ca.gov/employmentstatus/abctest/
> I want to ask who is advising these startups regarding how they think of their place in the world, relative to other humans?
To me, this has been one of the most dispiriting things to witness in the last few years: not just the normalization, but the outright glorification, of indecency. Shameful.
michaelt
> But rather, under a punishing set of Kafkaesque rules, like someone was thinking only of computer programs, oops. "Gamified", with huge negative points penalties and everything. To be under threat of not getting paid at all.
I'm not defending these practices, but to share some context:
One of the problems with getting workers to review ML output is it's incredibly, unbelievably boring. When the task is to review model output you're going to hit the 'approve' button 99% of the time - and when you're being paid for speed, nothing's faster than hitting the approve button.
So understandably a decent number of folks will just zone out, maybe put youtube on in another window, and sit there hitting approve 100% of the time. That's just human nature when dealing with such an incredibly dull task - I know I don't pay attention when I have to do my annual refresher training on how to sit in a chair.
This sort of thing is a big problem for things like airport baggage scanner operators; pilots with their planes on autopilot; lifeguards; casino CCTV operators; and suchlike. There are loads of studies about this kind of stuff.
This makes getting good quality ML output reviews quite tricky. There are ways to do it, though, and you don't have to resort to negative income!
robertlagrant
Stack Overflow does this by sometimes prompting you with known-bad changes that you shouldn't approve. But then they're managing volunteers rather than paying reviewers, so they have no money to waste on bad reviews.
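Something like this sketch, roughly (the item structure and the 5% rate are made up for illustration, not taken from either site):

    import random

    def build_review_queue(real_items, gold_items, catch_rate=0.05):
        """Mix known-bad 'gold' items into the queue so inattentive reviewers get caught."""
        queue = []
        for item in real_items:
            if gold_items and random.random() < catch_rate:
                gold = dict(random.choice(gold_items))
                gold["is_gold"] = True           # known answer, used only for scoring
                queue.append(gold)
            queue.append(dict(item, is_gold=False))
        return queue

    def catch_item_accuracy(responses):
        """Score a reviewer only on the catch items they actually saw."""
        catch = [r for r in responses if r["is_gold"]]
        if not catch:
            return None
        return sum(r["answer"] == r["expected"] for r in catch) / len(catch)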
pglevy
Seems like some of the techniques described here could be part of a larger "accuracy-based commission" form of compensation (as opposed to what is apparently presented).
robertlagrant
> I'm not even sure you'd get away with calling them "independent contractors", under these conditions, when workers save copies of this blog post, to show to labor lawyers and state regulators.
An independent contractor is more likely, not less, to go unpaid for failing to meet mutually agreed terms.
wodenokoto
What do you mean "didn't age well"? It's a brand new article. It hasn't aged at all.
tbrownaw
Some of the big-dollar contracts $employer has involve financial penalties if performance metrics aren't up to standard.
michaelt
At the same time, many organisations getting work done through platforms like Mechanical Turk set their piece rate to make sure all but the worst workers will make at least minimum wage.
totetsu
Anyone have the link to that thread?
neilv
The posts last week about the factory worker monitoring startup? There were at least 3 posts (and I think dang let the dupes through, since some mentioned YC, and HN moderates less in such cases):
https://news.ycombinator.com/item?id=43175023
rrr_oh_man
How else would you do this?
v9v
> Still, expert reviewers will occasionally disagree in their labelling. To ensure quality, an audio clip [box characters], at which point [...]
Have they censored their own article?
beardedwizard
I noticed the same thing, very strange.
blitzar
Pivot to sweat shop.
> Failing a test will cost a user 600 points, or roughly the equivalent of 15 minutes of work on the platform. A correctly tuned penalty system removes the need for setting reviewer accuracy minimums; poor performers will simply not earn enough money to continue on the platform.
This still sets a reviewer accuracy minimum, but it is determined implicitly by the arbitrary test penalty instead of being consciously chosen based on application requirements. I don't see how that's an improvement. If you absolutely want to have negative earnings, it would make more sense to choose a reviewer accuracy minimum to aim for and then determine the penalty that would achieve that target, instead of the other way around.
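To make that concrete, here's the break-even accuracy the quoted numbers imply; the test frequency is a guess on my part, not something from the article:

    points_per_15_min = 600          # earnings per 15-minute block (from the article)
    penalty_points = 600             # cost of one failed test (from the article)

    def breakeven_fail_rate(tests_per_15_min):
        """Fail rate at which expected penalties exactly cancel earnings."""
        return points_per_15_min / (tests_per_15_min * penalty_points)

    # With, say, 4 hidden tests per 15 minutes, pay hits zero at a 25% fail rate,
    # i.e. an implied ~75% accuracy minimum -- set by penalty size and test
    # frequency, not by what the application actually requires.
    print(breakeven_fail_rate(4))    # 0.25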
Moreover, a reviewer earning nothing on expectation under this scheme (they work for 15 minutes, then fail a test, and have all their earnings wiped out) could team up with a second reviewer with the same problem, submitting their answer only when both agree. As long as their errors aren't 100% correlated, they would end up with positive expected earnings to split between them.
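As a toy model of that (the 25% fail rate and the test frequency are illustrative assumptions, and errors are treated as independent):

    def expected_net_per_hour(fail_rate, tests_per_block=4, block_points=600,
                              penalty=600, blocks_per_hour=4):
        """Expected net points per hour: earnings minus expected test penalties."""
        earnings = blocks_per_hour * block_points
        penalties = blocks_per_hour * tests_per_block * fail_rate * penalty
        return earnings - penalties

    # A solo reviewer who fails 25% of hidden tests earns nothing on expectation:
    print(expected_net_per_hour(0.25))            # 0.0

    # Two such reviewers labelling the same clips and submitting only answers they
    # agree on: with independent errors they jointly miss a test 0.25 * 0.25 of the time.
    print(expected_net_per_hour(0.25 * 0.25))     # 1800.0 points/hour to split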
This clearly indicates that the incentive scheme as designed doesn't capture the full economic value of even lower-quality data when processed appropriately. Of course you can't expect random reviewers to spontaneously work together in this way, so it's up to the data consumer to combine the work of multiple reviewers as appropriate.
Trying to get reliable results from humans by exclusively hiring the most reliable ones can only get you so far; you can do much better by designing systems to use redundancy to correct errors when they inevitably do appear. Ironically, this is a case where treating humans as fallible cogs in a big machine would be more respectful.