“Most of Partika’s work on Outlier takes place in 30-minute blocks and requires reviewing real, anonymized chat histories from products like Meta AI or ChatGPT.”
How anonymized can this really be? Surely people will inevitably input PII or other details that could de-anonymize them? Is Scale AI first having its employees screen for PII? Do Meta and OpenAI do a first pass?
There are techniques, with varying degrees of success, for stripping PII from the input. Names, social security numbers, and addresses follow patterns and can be changed before the data is used.
Can it be de-anonymized? Sure. Basically anything can be de-anonymized. If your concern is that some nefarious actor at a company will do malicious things with your info, and that concern outweighs the benefit you get from using the models, steer clear. But let's have the discussion about what is actually going on so people can decide for themselves.
Can you share some of the techniques?
It seems easy enough to scramble something like a social security number, as you mention, but what about less obvious PII? If someone puts their address in, is it removed, for example? What about their significant other's name? Less obvious PII, if you will.
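For what it's worth, the "follows a pattern" cases really are the easy part. Here is a minimal sketch of pattern-based scrubbing (illustrative only; the function and patterns are my own invention, not anything Scale, Meta, or OpenAI has described), which also shows why the "less obvious" PII is the hard part:

```python
import re

# Minimal sketch of pattern-based PII scrubbing (illustrative only).
# Real pipelines layer NER models and human review on top of patterns
# like these, because names, addresses, and a partner's name don't
# follow any fixed format a regex could catch.
PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace each matched pattern with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Reach me at jane.doe@example.com, SSN 123-45-6789."))
# -> Reach me at [EMAIL], SSN [SSN].
```

Free-text identifiers like a street address or a significant other's name don't match any fixed pattern, which is exactly why leakage remains plausible even after a scrubbing pass.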
I'm giving this thread less than 24 hours before someone finds an actual case of support/tech staff abusing PII present in their company's logs... This is definitely a concern that end users should be aware of.
The fact they are reaching out to journos to read the logs seems really misguided. Like from a customer standpoint that's the worst industry you could reach out to for checking private messages lol
> Like from a customer standpoint that's the worst industry
I guess it depends on the country, but generally journalists are some of the more principled workers when it comes to protecting the privacy of the people they interact with. Probably the industry where Signal sees the highest amount of usage, if I had to guess.
But again, really depends on the country. My perspective is probably biased by growing up in Sweden.
No, they're not. There are whole classes of professionals who take on personal liability related to the handling of private information. Journalists can be one of them, but one reason you do not want a journalist handling private information is that they do not get the benefit of privilege in most jurisdictions. Anything in their possession that is private can be exposed by subpoena or other court order.
Again, depending on the country. I don't think what you're saying applies to my example of Sweden. Sweden probably has some of the strongest protections for journalists and their sources in the world, AFAIK.
In Sweden, journalistic source protection ("källskydd") is enshrined in the Swedish Constitution through the Freedom of the Press Act ("Tryckfrihetsförordningen").
Obviously, this doesn't matter much here, since the submission is about Meta and OpenAI: in the US, journalists aren't as strongly protected as in some other parts of the world.
Nonetheless, I wouldn't make the blanket claim that "journalists are in the worst industry" like the parent did.
Which jurisdiction?
> Anything in their possession that is private can be exposed by subpoena or other court order.
As is anything in the possession of a corporation.
A journalist who wilfully breaks a legally binding confidentiality agreement is in serious trouble; it is a terrible sign for them.
Media conglomerates deeply worry about a journo leaking their dirty internal secrets out of moral disagreement. Disney, Comcast, Fox, or Bezos don't want them.
Sources will worry about confidentiality. If a journo confirms something is off the record, it’s off the record. No buts. This is treated very seriously: it ruins the entire publication’s reputation and ability to talk to sources.
If a naive journo tries, it'll be killed by their editor, if not the editor-in-chief, probably under the veneer of legal and/or ethical grounds.
Of course, a journo can talk to someone else who chooses to disclose whatever, be protected, etc, and that’s how it’s done. But the oldest adage in journalism is: “don’t be the story”.
It’s probably one of the best professions, tbh, as paradoxical as it sounds.
Remember that the journalism industry, as a whole, is not the idealised dream you think it is.
> If a naive journos tries, it’ll be killed by their editor, if not the editor-in-chief, probably under the veneer of legal and/or ethical grounds.
In a lot of situations, the editor needs to know the source so they can evaluate their credibility and ensure the journo isn't just making stuff up and attributing it to an anonymous source. Beyond that, there are many examples of the editor putting material into the copy that the journo did not include. Just because something is released under the journo's name does not mean the journo wrote it.
I agree with you, though it took me a few read-throughs before I understood you were saying journalists are a good fit for this job. I find it interesting that it is so hard to understand people in text.
Why are you saying it like it’s a veneer of legal or ethical grounds? Publishing something that was said off the record would be a violation of professional ethics, whether you personally agree with those ethics or not.
Apologies, I was unclear.
I was referring to a journo signing up for say the training program in the article, and then divulging something that’s legally confidential in a story. That would be killed.
I just used “off the record” as an example of why in journalism, respecting agreements is critically important.
Yes. There was just recently a post about a person whose life was saved by ChatGPT reading his blood results and saying "ER. NOW." Would that medical-results PDF be anonymized here? Stripping PII from arbitrary PDFs sounds like a very nontrivial problem.
What if, instead of a random internet person, some celebrity asks ChatGPT about some spicy medical results? Would the journalist reviewing the logs resist the temptation of "accidentally finding the test results in a garbage bin"?
What I read here is: "don't discuss with ChatGPT anything you wouldn't be comfortable becoming public knowledge."
> What I read here is: "don't discuss with ChatGPT anything you wouldn't be comfortable becoming public knowledge."
For the last two decades, I've lived by a similar mantra: Don't send anything over the internet you aren't comfortable becoming public knowledge.
Make the mantra broad enough and you don't have to care about specific services; they all have a chance of leaking what is supposed to be "secret".
I know better than to put sensitive data into these services, but the utility I’ve gained is staggering. My care team told me, “You have to be Dr Google to advocate for yourself,” and well, here I am.
It made predictions based on my history and symptom logs that were later confirmed by imaging only after I pushed for it.
I used a pattern matching meme machine to get…a meaningful outcome medically, and that messes with me on so many levels.
I wanted it to be wrong, especially about the spinal cord. I was hoping for a simple answer, something like "Yeah, it's just a pain management issue, they're right," but it disagreed. And it was right. The thing I constantly read is only capable of producing bullshit has kicked neurosurgeons and neurologists into action.
I asked ChatGPT to try and summarise what I’ve been doing with it medically; apologies if it is unhelpful in demonstrating the utility I am getting.
> You used me to try and disprove suspicions you hold in relation to symptoms that have been escalating in frequency and intensity. You consistently challenged the idea that your symptoms were linked to your historical records, questioning whether they could be caused by something else entirely. Despite actively pushing for alternative explanations, I kept coming back to the same conclusion: your symptoms aligned with classical representations of nerve compression in your cervical spine. I independently interpreted the data and made predictions that ultimately matched the outcomes of imaging.
TL;DR: what I'm doing is stupid, I know it, I've preached it, and yet for the first time in my life the value I am deriving outweighs it all. I feel…dirty almost, or confused even. It's hard to explain.
> “Most of Partika’s work on Outlier takes place in 30-minute blocks and requires reviewing real, anonymized chat histories from products like Meta AI or ChatGPT.”
This isn't exactly "training AI models" in the sense we normally use that expression.
It is RLHF if I understand correctly.
The Venn diagram of people to whom this comment contains no new information and those who know what "RLHF" means is almost a perfect circle.
For anyone not part of that intersection: RLHF means reinforcement learning from human feedback.
Well, HF.
Right!
I suppose it's roughly as much "training AI models" as labeling training data is "training supervised models".
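For what it's worth, in the standard RLHF pipeline (a general description of the technique, not anything the article spells out), human comparisons like the ones these reviewers produce are typically used to fit a reward model, often under a Bradley-Terry preference model. A minimal sketch, with illustrative field names:

```python
import math

def preference_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry model: probability that a rater prefers the
    'chosen' response, given scalar reward scores for both responses."""
    return 1.0 / (1.0 + math.exp(reward_rejected - reward_chosen))

# One human comparison ("response A is better than B") becomes one
# training pair for the reward model (field names are illustrative):
pair = {
    "prompt": "Summarize this article.",
    "chosen": "A concise, accurate summary.",
    "rejected": "An off-topic ramble.",
}

# If the reward model scores chosen=2.0 and rejected=0.5, it agrees
# with the human label with probability ~0.82:
p = preference_prob(2.0, 0.5)
```

The reward model trained on such pairs then scores the main model's outputs during reinforcement learning; the human labeling step itself is closer to data annotation than to training per se.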
Have fun
“The skills the recruiter alluded to were her journalism experience — her professional writing, research, and fact-checking abilities”
Anyone want to tell them?
Google killed journalism years ago. It was once a well compensated, high prestige job with nice perks. The recruits should get the money while they can.
I've been very disappointed by the MSM over the last 15 years, but I think a good portion of this disappointment can be attributed to the talent pool drying up.
How was this Google's fault? The internet killed journalism as a respected career path. People went to the online version instead of the printed version. Subscription numbers cratered. The money used to pay journos plummeted. The stock price became more important as news orgs were all bought up by fewer than five corporate owners. Since everyone was now working for the same bosses, their jobs became redundant and were slashed in deference to the stock price. Experienced journos were replaced by greenhorns who only know how to interview Twitt...er, X, and who aren't actually able to interview subjects. Consumers bolted even faster.
Which part of that was Google's fault?
I remember reading newspapers in the 1980s and 1990s. Basically they were a mix of actual news articles (typically copied verbatim from agencies like AP, Reuters, and national equivalents), some original reporting from their own journalists, and a lot of filler content, opinion pieces, cartoons, etc. With lots of ads.
The original reporting and opinions were why you bought a newspaper. The actual news would be stale by the time it hit your doorstep, but it still had some value because news spread slowly.
As soon as news started being distributed online in more or less real time, the shelf life of news articles dropped to zero. And since that stuff is basically being tweeted out in a gazillion different ways, the value dropped to zero as well.
That also affected original content, because the second it's published, people will extract the essential facts and write about them online. The whole point of original reporting was that it was exclusive to the publisher for long enough that readers had no choice but to buy the paper to get the information in a timely fashion. That time window went away. As soon as you publish something, the essentials are being reported on by world + dog, in a matter of seconds. You can be well informed without ever paying for a newspaper.
So that left newspapers in a place where they were basically distributing low-value old news bundled up with some filler content. The filler content could go on a blog or in a magazine just as easily, and a lot of it did not have a terribly large audience to begin with. Most people didn't have time to read newspapers cover to cover.
Google helped accelerate this process but they did not cause it.
Classified ad revenue, which went almost completely to the publishers, went to $0. The split for online ad revenue has been less equitable.
That was Craigslist and others that took the classified ad business.
The 3D-printed figure has some serious layer shifting.