Where I'm from, "in house" means employees. I see "contractors" and "negative earnings" in the same article.
They do say that reviewers have to have some kind of aviation experience. I'd be more curious reading an article about how they source the talent here.
Data labeling has been moving to onshore / higher paid work. There's still a lot offshore, but for LLMs in particular and various specialized models, there's a massive trend toward hiring highly educated, highly paid specialists in the US.
But as other commenters have warned: beware of labor laws, especially in CA/NY/MA.
I've had a front-row seat to this...our company hires + employs contract W2 and 1099 workers for the tech industry. Two years ago we started to get a ton of demand from data labeling companies and more recently foundation model cos who are doing DIY data labeling. Companies are converting 1099 workforces to W2 to avoid misclassification. Or they're trying to button up their use of 1099 to avoid being offside.
> Failing a test will cost a user 600 points, or roughly the equivalent of 15 minutes of work on the platform. A correctly tuned penalty system removes the need for setting reviewer accuracy minimums; poor performers will simply not earn enough money to continue on the platform.
This still sets a reviewer accuracy minimum, but it is determined implicitly by the arbitrary test penalty instead of consciously chosen based on application requirements. I don't see how that's an improvement. If you absolutely want to have negative earnings, it would make more sense to choose a reviewer accuracy minimum to aim for, and then determine the penalty that would achieve that target, instead of the other way around.
Moreover, a reviewer earning nothing on expectation under this scheme (they work for 15 minutes, then fail a test, and have all their earnings wiped out) could team up with a second reviewer with the same problem, submitting their answer only when both agree, and as long as their errors aren't 100% correlated, they would end up with positive expected earnings they could split between them.
This clearly indicates that the incentive scheme as designed doesn't capture the full economic value of even lower-quality data when processed appropriately. Of course you can't expect random reviewers to spontaneously work together in this way, so it's up to the data consumer to combine the work of multiple reviewers as appropriate.
Trying to get reliable results from humans by exclusively hiring the most reliable ones can only get you so far; you can do much better by designing systems to use redundancy to correct errors when they inevitably do appear. Ironically, this is a case where treating humans as fallible cogs in a big machine would be more respectful.
The screenshot shows that 25,000 points is about $50, so 500 points is about $1. If 600 points is about 15 minutes work, that means reviewers are getng paid less than $4 per hour?
All the numbers in the article are made up (I coded up a quick JSON so we could render the page without showing real user information). Our reviewers make ~10x that number. In hindsight, we shouldn't have made up numbers that ended up with this math. That's on me!
For sure they can! The vast majority of our reviewers make a few extra $hundred every week around their day jobs in aviation. Our target audience for this is the folks who enjoy spending their free time watching YouTube videos like VASAviation or PilotDebrief (like me). We get a kick out of listening to air traffic control audio and hearing what's going on in the airspace system.
And doing the quickmath based on the UI saying that 25,150 pts == $50.30 along with them saying 600 pts ~= 15 minutes of work... this is coming out to be ~$4.80/hr. No thanks.
EDIT: and that assumes perfect accuracy, the actual pay will be lower if you miss anything
That's an immediate nope for me. I don't care if I can file a dispute, unless I can resolve it then and there, I'm not going to be at the whim of some faceless escalation system, or an uninformed CS agent.
It also can't really be overstated how helpful it is as an ML engineer to simply spend the time going through thousands of examples yourself. If you abstract yourself away from the data and just "make metric go up" you'll be missing out on valuable insights about how and why your model might be failing.
For my one foray into ML, in 2020, I also built my own labeling system. It was stupidly simple; IIRC, it was a Jupyter Notebook that presented you with text to label, and you’d do so by hitting 1-5, which were mapped to sentiments / emotions. If you got bored, or just wanted to see how it performed with X% training, you could save progress and quit. It worked well enough, and I think I labeled a couple of thousand entries using it.
So they are building a system which has all the hallmarks of an extremely addictive game. But that's ok because they pay the players a small amount of money?
They didn't even address the wellbeing of players, managing addiction and overwork etc.
Some context, the average labeller on our platform puts in a single digit number of hours on our platform, works whenever and how much they want, and earn significantly more than an uber driver.
I think this didn't age well, for HN, and it prompts some serious questions about our techbro startup culture.
> Obvious but necessary: to incentivize productive work, we tie compensation to the number of characters transcribed, and assess financial penalties for failed tests (more on tests below). Penalties are priced such that subpar performance will result in little to no earnings for the labeller.
So, these aren't employees? The writeup talks about not trusting gig workers, but it sounds like they have gig workers, and a particularly questionable kind.
Not like independent contractors with the usual freedoms. But rather, under a punishing set of Kafkaesque rules, like someone was thinking only of computer programs, oops. "Gamified", with huge negative points penalties and everything. To be under threat of not getting paid at all.
I see that this article is dated the 16th, so it's before the HN outrage last week, over the founders who demoed a system for monitoring factory worker performance, and were ripped a new one online for dehumanizing employees.
Despite the factory system being not as invasive, dehumanizing, and potentially labor law-violating as what's described in this article: about whip-cracking of gig workers, moment-to-moment, and even not paying them.
I'm not even sure you'd get away with calling them "independent contractors", under these conditions, when workers save copies of this blog post, to show to labor lawyers and state regulators.
(Incidentally, I wasn't aware that a company working in aviation gets skilled workers this way. The usual way I've seen is to hire someone, with all the respect, rights, and benefits that entails. Or to hire a consultant who is decidedly not treated like a gig worker in a techno-dystopian sweatshop.)
I don't want Internet mob justice here, but I want to ask who is advising these startups regarding how they think of their place in the world, relative to other humans?
I can understand getting as far as VC pitches while overwhelmed with fixating on other aspects of the business problems, and still passing the "does this person have a good enough chance to have a big exit" gut feel test of the VCs. But are there no ongoing checks and advising, so that people don't miss everything else?
If they are operating as described, it’s almost certainly illegal. They deserve to be hit with a nice, fat PAGA lawsuit. These workers would have to satisfy the “ABC test” to be exempt from minimum wage obligations, and it’s a difficult standard to meet: https://www.labor.ca.gov/employmentstatus/abctest/
> I want to ask who is advising these startups regarding how they think of their place in the world, relative to other humans?
To me, this has been one of the most dispiriting things to witness in the last few years: not just the normalization, but the outright glorification, of indecency. Shameful.
> But rather, under a punishing set of Kafkaesque rules, like someone was thinking only of computer programs, oops. "Gamified", with huge negative points penalties and everything. To be under threat of not getting paid at all.
I'm not defending these practices, but to share some context:
One of the problems with getting workers to review ML output is it's incredibly, unbelievably boring. When the task is to review model output you're going to hit the 'approve' button 99% of the time - and when you're being paid for speed, nothing's faster than hitting the approve button.
So understandably a decent number of folks will just zone out, maybe put youtube on in another window, and sit there hitting approve 100% of the time. That's just human nature when dealing with such an incredibly dull task - I know I don't pay attention when I have to do my annual refresher training on how to sit in a chair.
This sort of thing is a big problem for things like airport baggage scanner operators; pilots with their planes on autopilot; lifeguards; casino CCTV operators; and suchlike. There are loads of studies about this kind of stuff.
This makes getting good quality ML output reviews quite tricky. There are ways to do it, though, and you don't have to resort to negative income!
Stack Overflow does this by sometimes prompting you with known bad changes that you shouldn't approve. But then they're managing volunteers, not paying for bad reviews, so they have no money to waste.
I honestly still have trouble believing there are people willing to moderate SO content for free. Do a boring job to make a rich company richer, get paid nothing and occasionally get yelled at.
Seems like some of the techniques described here could be part of a larger "accuracy-based commission" form of compensation (as opposed to what is apparently presented).
The article is dated the 16th, before the HN outrage around the 25th, over the video skit about some YC factory worker surveillance startup.
If the writer had the benefit of seeing the few-post outrage on the 25th, they probably would've written the article differently, and maybe also reflected on the dynamic with the workers.
In a startup, when you have to do all the things, and you're constantly learning, it's easy to miss some things. Also, a lot of the funded tech startups are of founders of rich parents and sheltered upbringings, for various reasons. So (like all humans) they often have little understanding of the situations of people who are not them, and therefore little automatic empathy for the not understood. Often (this can also be normal human reaction), they will implicitly imagine themselves as deserving whatever privileges they have, and therefore having superior merit over others who don't have that. So, without reflection, one might accept the situation of one person calling themselves CEO at 20, and taking the lion's share of the entire effort's wealth, while another person is belittled and treated like shit, since (the implicit belief goes) they both must merit their lots in life.
Unless and until we stop and think about it. I think that most people here on HN, when we're distracted from empathy, by all the commotion of all things we have to pay attention to, will care once it's pointed out. We stop and reflect, and then we try to learn and do better.
> I'm not even sure you'd get away with calling them "independent contractors", under these conditions, when workers save copies of this blog post, to show to labor lawyers and state regulators.
An independent contractor is more likely to not be paid for meeting mutually agreed terms, not less likely.
At the same time, many organisations getting work done through platforms like Mechanical Turk set their piece rate to make sure all but the worst workers will make at least minimum wage.
The posts last week about the factory worker monitoring startup? There were at least 3 posts (and I think dang let the dupes through, since some mentioned YC, and HN moderates less in such cases):
Where I'm from, "in house" means employees. I see "contractors" and "negative earnings" in the same article.
They do say that reviewers have to have some kind of aviation experience. I'd be more curious reading an article about how they source the talent here.
Especially given that you lose 15 minutes' worth of work every time you get a few-second clip wrong.
Data labeling has been moving to onshore / higher paid work. There's still a lot offshore, but for LLMs in particular and various specialized models, there's a massive trend toward hiring highly educated, highly paid specialists in the US.
But as other commenters have warned: beware of labor laws, especially in CA/NY/MA.
I've had a front-row seat to this... our company hires and employs contract W2 and 1099 workers for the tech industry. Two years ago we started to get a ton of demand from data labeling companies, and more recently from foundation model companies doing DIY data labeling. Companies are converting 1099 workforces to W2 to avoid misclassification, or they're trying to button up their use of 1099s to avoid being offside.
> Failing a test will cost a user 600 points, or roughly the equivalent of 15 minutes of work on the platform. A correctly tuned penalty system removes the need for setting reviewer accuracy minimums; poor performers will simply not earn enough money to continue on the platform.
This still sets a reviewer accuracy minimum, but it is determined implicitly by the arbitrary test penalty instead of consciously chosen based on application requirements. I don't see how that's an improvement. If you absolutely want to have negative earnings, it would make more sense to choose a reviewer accuracy minimum to aim for, and then determine the penalty that would achieve that target, instead of the other way around.
Moreover, a reviewer earning nothing on expectation under this scheme (they work for 15 minutes, then fail a test, and have all their earnings wiped out) could team up with a second reviewer with the same problem, submitting their answer only when both agree, and as long as their errors aren't 100% correlated, they would end up with positive expected earnings they could split between them.
This clearly indicates that the incentive scheme as designed doesn't capture the full economic value of even lower-quality data when processed appropriately. Of course you can't expect random reviewers to spontaneously work together in this way, so it's up to the data consumer to combine the work of multiple reviewers as appropriate.
Trying to get reliable results from humans by exclusively hiring the most reliable ones can only get you so far; you can do much better by designing systems to use redundancy to correct errors when they inevitably do appear. Ironically, this is a case where treating humans as fallible cogs in a big machine would be more respectful.
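To make the teaming-up point concrete, here's a toy Monte Carlo sketch (all numbers invented, and the model deliberately simplistic: each task pays V points, a fraction T of tasks are hidden tests, a wrong answer on a test costs the full 600-point penalty, errors are independent, and two wrong reviewers never happen to agree):

    import random

    V, PENALTY, T = 40.0, 600.0, 1 / 6
    ACCURACY = 0.6        # tuned so the solo reviewer nets ~zero
    N = 1_000_000

    solo = pair = 0.0
    for _ in range(N):
        is_test = random.random() < T
        a_right = random.random() < ACCURACY
        b_right = random.random() < ACCURACY
        # Solo: submit everything, eat the penalty on failed tests.
        solo += V - (PENALTY if is_test and not a_right else 0.0)
        # Pair: submit only on agreement, i.e. only when both are right.
        if a_right and b_right:
            pair += V     # never fails a test; the pot is split two ways

    print(f"solo EV/task:       {solo / N:+.2f} pts")
    print(f"pair EV/task each:  {pair / (2 * N):+.2f} pts")

The solo reviewer's expectation is 40 - 0.4 * (1/6) * 600 = 0 points per task; each member of the pair nets about 0.36 * 40 / 2 = 7.2 points per task, even though they skip the ~64% of tasks where they disagree.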
The screenshot shows that 25,000 points is about $50, so 500 points is about $1. If 600 points is about 15 minutes' work, that means reviewers are getting paid less than $5 per hour?
All the numbers in the article are made up (I coded up a quick JSON so we could render the page without showing real user information). Our reviewers make ~10x that number. In hindsight, we shouldn't have made up numbers that ended up with this math. That's on me!
Ah, cool. But that also means each penalty is $10. Eep.
> All labellers are either licensed pilots or controllers (or VATSIM pilots/controllers).
I would think such people can make better money by actually working as a pilot or controller?
For sure they can! The vast majority of our reviewers make a few extra hundred dollars every week around their day jobs in aviation. Our target audience for this is folks who enjoy spending their free time watching YouTube channels like VASAviation or PilotDebrief (like me). We get a kick out of listening to air traffic control audio and hearing what's going on in the airspace system.
And doing the quick math based on the UI saying that 25,150 pts == $50.30, along with them saying 600 pts ~= 15 minutes of work... this comes out to ~$4.80/hr. No thanks.
EDIT: and that assumes perfect accuracy; the actual pay will be lower if you miss anything.
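Spelling the arithmetic out (using the article's mock numbers; the one-hidden-test-per-block assumption is mine, not theirs):

    pts_per_dollar = 25_150 / 50.30          # = 500 points per dollar
    block_pts = 600                          # ~15 minutes of work
    gross_hourly = 4 * block_pts / pts_per_dollar   # = $4.80/hr

    # A failed test claws back a full block's points, so with a failure
    # rate f per 15-minute block, the net rate drops linearly:
    def net_hourly(f: float) -> float:
        return 4 * block_pts * (1 - f) / pts_per_dollar

    print(gross_hourly, net_hourly(0.10))    # 4.80 -> 4.32 if you fail 10%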
Not all pilots have a commercial pilot's license, without which you can't get paid to fly at all.
Early career professional pilots make surprisingly little money flying.
And professional pilots of all sorts often find themselves in a hotel in a city away from home with time to kill.
They said that their reviewers mostly have day jobs.
"and assess financial penalties for failed tests"
That's an immediate nope for me. I don't care if I can file a dispute, unless I can resolve it then and there, I'm not going to be at the whim of some faceless escalation system, or an uninformed CS agent.
Since they have diffs, I imagine your mistake (or lack thereof) would be immediately obvious. I just think the penalty is too large.
> Still, expert reviewers will occasionally disagree in their labelling. To ensure quality, an audio clip [box characters], at which point [...]
Have they censored their own article?
I noticed the same thing, very strange.
Sure looks like it; didn't want to give away too much secret sauce?
Data is king. Even when a new better model comes along a high quality dataset is still just as valuable.
Paying top performers above market rates to do nothing but data labelling is a moat that just keeps getting deeper.
Good data and good evals are two legs of the 3-legged stool that a lot of AI teams are missing.
It also can't really be overstated how helpful it is as an ML engineer to simply spend the time going through thousands of examples yourself. If you abstract yourself away from the data and just "make metric go up" you'll be missing out on valuable insights about how and why your model might be failing.
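The low-tech version of this takes minutes to set up. A rough sketch (names are placeholders):

    import random

    # Dump a random sample of the model's disagreements for manual review.
    def sample_for_review(examples, predictions, labels, k=200):
        wrong = [(x, p, y) for x, p, y in zip(examples, predictions, labels)
                 if p != y]
        return random.sample(wrong, min(k, len(wrong)))

Patterns like label noise, ambiguous classes, and preprocessing bugs usually jump out within the first hundred examples you read.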
What would a product look like in this space?
It's not a product; it's a core business competency in the ML space.
There are several data labeling products on the market, such as Label Studio.
I’ve resorted to building my own annotation apps.
For my one foray into ML, in 2020, I also built my own labeling system. It was stupidly simple; IIRC, it was a Jupyter Notebook that presented you with text to label, and you’d do so by hitting 1-5, which were mapped to sentiments / emotions. If you got bored, or just wanted to see how it performed with X% training, you could save progress and quit. It worked well enough, and I think I labeled a couple of thousand entries using it.
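For anyone curious, a tool like that really is only a screenful of code. A rough reconstruction of what's described above (file names and label set made up):

    import json, os

    LABELS = {"1": "very negative", "2": "negative", "3": "neutral",
              "4": "positive", "5": "very positive"}
    STATE = "labels.json"

    texts = [line.strip() for line in open("texts.txt", encoding="utf-8")]
    done = json.load(open(STATE)) if os.path.exists(STATE) else {}

    for i, text in enumerate(texts):
        if str(i) in done:
            continue                      # already labelled; skip on resume
        key = input(f"[{i}] {text}\n1-5 (q to quit): ").strip()
        if key == "q":
            break
        if key in LABELS:
            done[str(i)] = LABELS[key]
            with open(STATE, "w") as f:   # save after every label
                json.dump(done, f, indent=2)

    print(f"{len(done)}/{len(texts)} labelled")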
So they are building a system which has all the hallmarks of an extremely addictive game. But that's ok because they pay the players a small amount of money?
They didn't even address the wellbeing of players: managing addiction, overwork, etc.
Some context: the average labeller puts in a single-digit number of hours on our platform, works whenever and however much they want, and earns significantly more than an Uber driver.
Thanks.
To play devil's advocate, the average gambler is not problematic either, it's the outliers that are the problem.
Fair point!
I mean, as opposed to companies that get people addicted to a game and have them pay for the privilege? I would say so.
I think this didn't age well, for HN, and it prompts some serious questions about our techbro startup culture.
> Obvious but necessary: to incentivize productive work, we tie compensation to the number of characters transcribed, and assess financial penalties for failed tests (more on tests below). Penalties are priced such that subpar performance will result in little to no earnings for the labeller.
So, these aren't employees? The writeup talks about not trusting gig workers, but it sounds like they have gig workers, and a particularly questionable kind.
Not like independent contractors with the usual freedoms. But rather, under a punishing set of Kafkaesque rules, like someone was thinking only of computer programs, oops. "Gamified", with huge negative points penalties and everything. To be under threat of not getting paid at all.
I see that this article is dated the 16th, so it's before the HN outrage last week, over the founders who demoed a system for monitoring factory worker performance, and were ripped a new one online for dehumanizing employees.
And this despite the factory system being less invasive, dehumanizing, and potentially labor-law-violating than what's described in this article: moment-to-moment whip-cracking of gig workers, and even not paying them at all.
I'm not even sure you'd get away with calling them "independent contractors", under these conditions, when workers save copies of this blog post, to show to labor lawyers and state regulators.
(Incidentally, I wasn't aware that a company working in aviation gets skilled workers this way. The usual way I've seen is to hire someone, with all the respect, rights, and benefits that entails. Or to hire a consultant who is decidedly not treated like a gig worker in a techno-dystopian sweatshop.)
I don't want Internet mob justice here, but I want to ask who is advising these startups regarding how they think of their place in the world, relative to other humans?
I can understand getting as far as VC pitches while overwhelmed and fixated on other aspects of the business problems, and still passing the VCs' "does this person have a good enough chance at a big exit" gut-feel test. But are there no ongoing checks and advising, so that people don't miss everything else?
> To be under threat of not getting paid at all.
If they are operating as described, it’s almost certainly illegal. They deserve to be hit with a nice, fat PAGA lawsuit. These workers would have to satisfy the “ABC test” to be exempt from minimum wage obligations, and it’s a difficult standard to meet: https://www.labor.ca.gov/employmentstatus/abctest/
> I want to ask who is advising these startups regarding how they think of their place in the world, relative to other humans?
To me, this has been one of the most dispiriting things to witness in the last few years: not just the normalization, but the outright glorification, of indecency. Shameful.
> But rather, under a punishing set of Kafkaesque rules, like someone was thinking only of computer programs, oops. "Gamified", with huge negative points penalties and everything. To be under threat of not getting paid at all.
I'm not defending these practices, but to share some context:
One of the problems with getting workers to review ML output is that it's incredibly, unbelievably boring. When the task is to review model output, you're going to hit the 'approve' button 99% of the time - and when you're being paid for speed, nothing's faster than hitting the approve button.
So understandably, a decent number of folks will just zone out, maybe put YouTube on in another window, and sit there hitting approve 100% of the time. That's just human nature when dealing with such an incredibly dull task - I know I don't pay attention when I have to do my annual refresher training on how to sit in a chair.
This sort of thing is a big problem for things like airport baggage scanner operators; pilots with their planes on autopilot; lifeguards; casino CCTV operators; and suchlike. There are loads of studies about this kind of stuff.
This makes getting good quality ML output reviews quite tricky. There are ways to do it, though, and you don't have to resort to negative income!
Stack Overflow does this by sometimes prompting you with known-bad changes that you shouldn't approve. But then, they're managing volunteers rather than paying for reviews, so there's no money to waste on bad ones.
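The mechanics are simple to sketch: mix known-bad "gold" items into the review stream, track how often a reviewer waves one through, and flag low scorers for retraining rather than docking pay. (Rates and names below are made up.)

    import random

    GOLD_RATE = 0.05    # roughly 1 in 20 items is a planted known-bad case

    def next_item(real_queue, gold_items):
        # Gold items should be indistinguishable from real ones.
        if gold_items and random.random() < GOLD_RATE:
            return random.choice(gold_items), True
        return real_queue.pop(0), False

    def record_review(stats, is_gold, approved):
        if is_gold:
            stats["gold_seen"] += 1
            if approved:                  # waved through a known-bad item
                stats["gold_missed"] += 1

    def vigilance(stats):
        return 1 - stats["gold_missed"] / max(stats["gold_seen"], 1)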
I honestly still have trouble believing there are people willing to moderate SO content for free. Do a boring job to make a rich company richer, get paid nothing and occasionally get yelled at.
It attracts a lot of petty tyrants, that’s for sure.
Seems like some of the techniques described here could be part of a larger "accuracy-based commission" form of compensation (as opposed to what is apparently presented).
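For instance (thresholds illustrative, not from the article): scale a base piece rate by measured accuracy on catch trials, with a hard floor at zero instead of negative earnings.

    def payout(chars: int, accuracy: float,
               rate_per_char: float = 0.001, floor_acc: float = 0.80) -> float:
        if accuracy < floor_acc:
            return 0.0                    # below the floor: no pay, but no debt
        bonus = (accuracy - floor_acc) / (1 - floor_acc)   # 0..1 above floor
        return chars * rate_per_char * (0.5 + 0.5 * bonus)

    print(payout(10_000, 0.95))           # $8.75 of a possible $10.00

This preserves the incentive gradient without ever putting a worker in the red.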
What do you mean, “didn’t age well”? It’s a brand-new article. It hasn’t aged at all.
The article is dated the 16th, before the HN outrage around the 25th, over the video skit about some YC factory worker surveillance startup.
If the writer had had the benefit of seeing the multi-post outrage on the 25th, they probably would've written the article differently, and maybe also reflected on the dynamic with the workers.
In a startup, when you have to do all the things and you're constantly learning, it's easy to miss some of them. Also, a lot of funded tech startups are founded by people with rich parents and sheltered upbringings, for various reasons. So, like all humans, they often have little understanding of the situations of people who are not them, and therefore little automatic empathy for those they don't understand. Often (and this too can be a normal human reaction), they will implicitly imagine themselves as deserving whatever privileges they have, and therefore as having superior merit over others who don't have them. So, without reflection, one might accept the situation of one person calling themselves CEO at 20 and taking the lion's share of the entire effort's wealth, while another person is belittled and treated like shit, since (the implicit belief goes) they both must merit their lots in life.
Unless and until we stop and think about it. I think that most people here on HN, even when we're distracted from empathy by the commotion of everything we have to pay attention to, will care once it's pointed out. We stop and reflect, and then we try to learn and do better.
I don’t think HN being discontented about the gig economy is anything new, at least not if you are counting new in weeks and not years.
> I'm not even sure you'd get away with calling them "independent contractors", under these conditions, when workers save copies of this blog post, to show to labor lawyers and state regulators.
An independent contractor is more likely, not less, to have payment made contingent on meeting mutually agreed terms.
Some of the big-dollar contracts $employer has involve financial penalties if performance metrics aren't up to standard.
At the same time, many organisations getting work done through platforms like Mechanical Turk set their piece rate to make sure all but the worst workers will make at least minimum wage.
Anyone have the link to that thread?
The posts last week about the factory worker monitoring startup? There were at least 3 posts (and I think dang let the dupes through, since some mentioned YC, and HN moderates less in such cases):
https://news.ycombinator.com/item?id=43175023
https://news.ycombinator.com/item?id=43170850
https://news.ycombinator.com/item?id=43180133
How else would you do this?
Pivot to sweatshop.