Zuckerberg appeared to know Llama trained on Libgen

www.rollingstone.com

57 points by arresin 56212 years ago

lvl155 56212 years ago

Of course he does. Heck, most of us in the early stages of LLMs did the same thing. The data simply did not exist outside Google, which is why it’s crazy that Google completely dropped the ball on AI this decade. They had such a huge lead in terms of data access.

logifail 56212 years ago

Perhaps the more interesting question would be exactly how did they obtain their copy/copies of Libgen?
- perfmode 56212 years ago
  
  Are you asking for a way to obtain a copy?
  
  logifail 56212 years ago
  
  Nope, I have no need for any <whisper>further</whisper> copies.
  I'm more interested in how a for-profit corp decides to obtain a copy for development of a commercial product, and how they execute that ... whether they still have the data, and whether legal know about it :)
  It's not exactly the kind of thing you can say you "found on a USB stick lying around in the car park"...
  
  thfuran 56212 years ago
  
  You should've seen the size of it. More of a USB baton really.
  
  selimthegrim 56212 years ago
  
  If Ryobi and DeWalt can make Bluetooth speakers, ASP can get into USB drives.
  
  logifail 56212 years ago
  
  > You should've seen the size of it. More of a USB baton really.
  <glances at shelf with many, many external USB drives hooked up to a Pi 400>
  Oh, really? :)
  
  qingcharles 56212 years ago
  
  There's torrents of it. I remember one AI company saying somewhere they just grabbed the big 7z torrent of it for their training.
- janice1999 56212 years ago
  
  It's hinted at in the article. If they torrented one large dataset, it's likely they did the same for Libgen.
  > "I think torrenting from a corporate laptop doesn’t feel right,” wrote one engineer in April 2023, adding a smiley face emoji. (A later email acknowledged that the “SciMag” data had indeed been torrented.)
llm_trw 56212 years ago

And the only reason they had the data is because they scanned every book ever for Google books.
- pk-protect-ai 56212 years ago
  
  and every e-mail, and every document in google docs, and every video on youtube ...
tdb7893 56212 years ago

They dropped the ball on cloud and need to catch up and now it's AI. It's kinda interesting how being ahead with data center infrastructure and also AI research didn't lead to them being ahead on those products
- lvl155 56212 years ago
  
  To be fair, they did have the lead as late as 2018. It’s just they treated it like it was their PhD thesis. Didn’t protect their IP at all and let all their talent leave.
- sitkack 56212 years ago
  
  Google is a playground funded by Ads, and Ads make so much damn money that nothing can compete, even internally. If I were an activist investor, I'd make ads its own company. If I were the FTC, I'd make ads its own company.
  
  whiplash451 56212 years ago
  
  Ads fund Waymo.
  
  ripped_britches 56212 years ago
  
  And what are the other companies? Just GCP? Why separate those?
- xbmcuser 56212 years ago
  
  In my opinion, the AI and absorb-all-knowledge part of Google was Larry Page. After his health scare, his focus and priorities shifted to actually living his life, not Google. I think he had also realized what was happening with Google and so wanted Alphabet as an umbrella organisation, but in the end he gave it up and let it be run as a normal company.
whiplash451 56212 years ago

Google dropping the ball on AI… given their achievements on Waymo, Gemini and Gemma (just to name a few)… does not sound like a fair statement
- arresin 56212 years ago
  
  Those models are absolute garbage. Terrible code understanding. Ridiculous hallucinations.
  
  thebytefairy 56212 years ago
  
  Have you actually used them recently? Gemini is top of chatbot arena, and Gemma is one of the best open models at its size.
  
  arresin 56212 years ago
  
  And that makes me extremely suspicious of that ranking. I use it at least a few times a week when I have a problem that’s unusual for me (to see if it’s just terrible in my domain but not in others). It has a 9/10 fail rate.
  It is the best at OCR though. Not many people are talking about that. It’s a very nice thing to know.
yencabulator 56212 years ago

How was the data Google already had access to any less protected by copyright?
The data Google had was book scans, search engine indexing of arbitrary 3rd party content, and private email and documents they hosted.

ChrisArchitect 56212 years ago

[dupe] https://news.ycombinator.com/item?id=42651007

https://news.ycombinator.com/item?id=42673628

neuroelectron 56212 years ago

Which is exactly why they want to shut it down, preventing competition in the AI space.

savoyard 56212 years ago

Library Genesis is living up to its name.

gausswho 56212 years ago

"The note observed that including the LibGen material would help them reach certain performance benchmarks, and alluded to industry rumors that other AI companies, including OpenAI and Mistral AI, are “using the library for their models.”

A new chapter in "information wants to be free". Copyright was always an artificial restriction on human instinct. We now enter an age where piracy is keeping up with the Joneses, and those who respect intellectual property choose irrelevance. Prosecution becomes impossible as the laundering grows ever more sophisticated. Adaptation and acceptance are painful, but they are the only path forward.

bayindirh 56212 years ago

Tangential question: What's your take on code AIs using GPL and source-available code in their training sets, and breaching both licenses?
- gausswho 56212 years ago
  
  Eschewing the ethics of such an action, my take would be that it happens (and will happen) regardless of license and no one can stop it unless those utilizing such code leave evidence that they disregarded the license.
  
  bayindirh 56212 years ago
  
  The models can and will reproduce a lot of their training set given the correct prompt [0]. How can you know you're infringing a license if the model doesn't tell you which repository it got "inspired" by when generating that code?
  GitHub was working on a feature which supposedly tells you which repository the snippet you just got is copied from, or IOW, which repositories have similar code, sorted by date. Effectively pushing the blame further on you by making you spend the time you just saved by investigating which repository provided the code you just got from "AI".
  Supposedly, if the license is not friendly, you can delete the snippet and write your own version. :)
  [0]: https://x.com/docsparse/status/1581461734665367554
  
  gausswho 56212 years ago
  
  Github, and others, have strategies to mitigate culpability. One way, as you note, is to hand off such responsibility to the consumer. Another is to suppress the training set from passing through too close to verbatim.
  To be crystal, I am sympathetic with GPL and copyleft movements. I do think we've entered a time where it will be hard for license holders to enforce their rights. The laundering will only get more effective. And, per my original comment, competitive pressure will incentivize models to sail the most piratey tack.
- fenazego 56212 years ago
  
  It’s not obvious that training a model on GPL code would constitute a breach of the license.
  
  bayindirh 56212 years ago
  
  Given the correct prompt, you can get the training set almost or completely verbatim [0]. Getting a GPL function is enough for the GPL's virality, since you effectively lift the code from a GPL codebase and add it to your codebase.
  Plus, The Stack's latest version contains at least one GPL repository which their license tool failed to detect. So it's not something hypothetical in the first place.
  [0]: https://x.com/docsparse/status/1581461734665367554

ripped_britches 56212 years ago

Sometimes I just feel like these people overestimate how much they are actually owed from these training runs.

It’s trained on 15T tokens. So how many did you provide that were genuinely novel? And how much money do you want? Like $5 from OpenAI? And $0 from meta since it’s open source?

I personally hope we can all get on the same team with AI and treat its advancement as scientific research for the betterment of humanity.

blibble 56212 years ago

well US copyright law statutory damages are up to $30,000 per work infringed, and up to $150,000 if done deliberately
so I think $150,000 per copyrighted work ingested is fair
papercrane 56212 years ago

Assuming the works have been registered with the copyright office, they're eligible for statutory damages.
The range for that is huge though, it can be in the hundreds of dollars per work, or if the infringement is shown to be wilful then a judge can award up to $150,000 per work.
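For a rough sense of scale, here is a minimal sketch of that range. It assumes the per-work figures in 17 U.S.C. § 504(c) ($750 to $30,000 ordinarily, up to $150,000 if wilful) and a made-up count of 1,000 registered works; the function name and the count are purely illustrative, and a court picks an amount within the range rather than automatically awarding the maximum.

```python
# Per-work statutory damages bounds under 17 U.S.C. § 504(c).
MIN_PER_WORK = 750        # ordinary statutory minimum
MAX_PER_WORK = 30_000     # ordinary statutory maximum
MAX_WILLFUL = 150_000     # ceiling when infringement is found wilful


def statutory_damages_range(num_works: int, willful: bool = False) -> tuple[int, int]:
    """Return the (lowest, highest) total award for `num_works` registered works.

    Hypothetical helper for illustration only: it ignores the reduced
    minimum for innocent infringement and any actual-damages election.
    """
    if num_works < 0:
        raise ValueError("num_works must be non-negative")
    ceiling = MAX_WILLFUL if willful else MAX_PER_WORK
    return num_works * MIN_PER_WORK, num_works * ceiling


# Illustrative count only; not a figure from the article or the lawsuit.
low, high = statutory_damages_range(1_000, willful=True)
print(f"${low:,} to ${high:,}")
```

Even with a small hypothetical count, the spread between the floor and the wilful ceiling is a factor of 200, which is why the wilfulness finding matters so much.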
- ripped_britches 56212 years ago
  
  Fair point, seems willful here
logifail 56212 years ago

> It’s trained on 15T tokens. So how many did you provide that were genuinely novel?
Are we suggesting that we should ditch creators' rights and instead value intellectual property along the lines of "I should be able to copy all your stuff as long as I copy lots of other stuff too, and give it all away for free or almost free...?"
- ripped_britches 56212 years ago
  
  No not at all, just that their damages are going to be measurably fairly low
- llm_trw 56212 years ago
  
  The question of whether training an LLM is fair use is one that will have to be answered by the courts.
  
  yencabulator 56212 years ago
  
  The point (often) is to stop the practice, not to ask money for it.
  Like, you get fined for speeding, but if you keep speeding you'll get your license revoked, and if you keep driving after that you get jail time. The payment required is punitive, but the point is to stop you.
  
  llm_trw 56212 years ago
  
  Again, that's an opinion that will need to be tested in the courts.
  You will have to explain why Hunter Thompson copying every word of every Hemingway novel isn't copyright infringement, but a computer doing the same is.
  
  yencabulator 56212 years ago
  
  Three things:
  1. Humans are not machines; arguments saying that because a human can learn, LLMs must be allowed to copy, are not interesting.
  2. Did Thompson publish the work? It sounds like you're referring to an activity Thompson did in private, to improve his skills as an author. Meanwhile, lawsuits are alleging that LLM services reproduce copyrighted materials.
  3. What can be fair use at small scale is no longer fair use at large scale.
  
  llm_trw 56212 years ago
  
  If you think that pretraining is copying then your opinions are irrelevant.
  
  yencabulator 56212 years ago
  
  If you think courts follow your personal opinion and technical definition, you're silly. The reality is we don't yet know how the courts will decide.
  
  logifail 56212 years ago
  
  > Hunter Thompson copying every word of every Hemingway novel isn't copyright infringement
  I make a high-res photo of a banknote. I can print that out at home. The bad stuff starts at the step after that...
  
  llm_trw 56212 years ago
  
  Try importing it in photoshop and report your findings back.
  
  logifail 56212 years ago
  
  No-one would possibly think of using GIMP instead...
  https://www.reddit.com/r/graphic_design/comments/ah9s8n/trie...
- marssaxman 56212 years ago
  
  That sounds like a pretty good deal to me - but I've always believed that the entire concept of "intellectual property" does overall more harm than good.
  
  logifail 56212 years ago
  
  > I've always believed that the entire concept of "intellectual property" does overall more harm than good
  It's fairly broken, but on balance it seems the creators are the ones getting screwed.
  I did years of research in a scientific lab which resulted in <drum roll> 2 (yes two ... count them) peer-reviewed papers.
  My colleagues and I did the work, wrote up the damned papers, yet to get them published we had to sign over copyright to what I'd now suggest is essentially a rent-seeking scientific publishing mafia.
  All a long time ago, but I never had (and still don't have) the ability to either legally download or legally redistribute my own work...
- jncfhnb 56212 years ago
  
  Yes. As long as you don’t reproduce other people’s stuff specifically.
  
  latexr 56212 years ago
  
  Which they do. That’s what the New York Times lawsuit is about. And in Meta’s case, they went specifically out of their way to remove the copyright notices to hide their actions.
  
  jncfhnb 56212 years ago
  
  “They” might be doing that. But this is not intrinsic to LLM usage
  
  latexr 56212 years ago
  
  The conversation is specifically about OpenAI and Meta.
  
  jncfhnb 56212 years ago
  
  I disagree. This thread seems to be about LLMs on a fundamental level.
  
  latexr 56212 years ago
  
  The submission is about Meta, and the comment that started the thread specifically mentioned OpenAI and Meta and no other LLMs or providers.
  
  jncfhnb 56212 years ago
  
  And the thread is about how LLM training interacts with copyright. Not whether OpenAI or Meta coincidentally blatantly copied other works.
- scotty79 56212 years ago
  
  Of course. The alternative is that creators dictate the price for any of the infinite number of zero cost copies which is and always has been ridiculous.
- tomrod 56212 years ago
  
  I think the statistical arguments cover this.
Hammershaft 56212 years ago

Hard to treat LLMs training on your data at your expense as research for the betterment of humanity when it is specifically the private company imposing that cost on you that profits.
lm28469 56212 years ago

There is absolutely no logical pathway between the current flavor of hardcore free for all individualistic capitalism and what you describe here
fourside 56212 years ago

Does this go both ways? Can I infringe on Disney’s IP on the grounds that their stories are so derivative that they aren’t actually that new?
The betterment of humanity seems to involve some parties making a ton of money while the people who provided the data apparently just need to be grateful.
- jncfhnb 56212 years ago
  
  You can absolutely use the story frameworks that Disney has done.
  You cannot make a story featuring Simba.
  
  giantrobot 56212 years ago
  
  > You cannot make a story featuring Simba.
  Just use Kimba the White Lion.[0]
  [0] https://12tomatoes.com/kimba-similarity-lion-king/
skulk 56212 years ago

> I personally hope we can all get on the same team with AI and treat its advancement as scientific research for the betterment of humanity.
s/AI/capital/.
It's painfully obvious that this is going to make material conditions worse for most people who use their minds to work instead of their hands. To these people, the "betterment of humanity" is a cruel joke.
- idunnoman1222 56212 years ago
  
  Yeah, just like Google and stack overflow
latexr 56212 years ago

> Sometimes I just feel like these people overestimate how much they are actually owed from these training runs.
It’s not about being paid for including their work, it’s about being compensated for having done so without permission. For crying out loud, they went out of their way to remove copyright notices from the pirated work.
> It’s trained on 15T tokens. So how many did you provide that were genuinely novel?
Then they can just take it out. And go ahead and take out every thing you didn’t have permission to include. What’s that? The model is now significantly worse? Yeah, these things compound.
> And how much money do you want? Like $5 from OpenAI? And $0 from meta since it’s open source?
No, they would’ve wanted for the work to not have been included without permission in the first place. Do you understand the world you’re advocating for? You’re arguing it’s OK for rich people to do whatever they want if they throw some scraps on the floor for you. Not everything is about money. Unfortunately there’s no other reasonable way (legal and non violent) to punish these infringers.
> I personally hope we can all get on the same team with AI and treat its advancement as scientific research for the betterment of humanity.
What you’re expressing is “I hope everyone will stop arguing and agree with me”. These moguls care about themselves, it is incredibly naive to believe they give a rat’s ass about “the betterment of humanity”.
forgetfulness 56212 years ago

LLM companies aren't being funded by the hundreds of billions because investors expect science to be advanced by text and image generators.
I find it very unlikely that the commodification of knowledge work will be for the betterment of humanity. I don't know if people here are expecting that, just because the value of more people's labor becomes zero, we will do, what, do away with money? No, it will just mean that fewer people will have the chance to earn the right to use space and resources in a meaningful way.
- inetknght 56212 years ago
  
  > I find it very unlikely that the commodification of knowledge work will be for the betterment of humanity
  There's no law to force it. So of course it won't be.
  Even if there were a law to force it, how would you enforce it?
wat10000 56212 years ago

The normal way to figure this out is to negotiate. We’ll either come to a mutually agreeable amount, or they’ll decide it’s not worth the cost to use my stuff. If I think I deserve $5 from OpenAI, then I’d suggest that, and they’d accept or come back with a counteroffer or tell me I’m nuts and move on. Probably that last one.
But for some reason, these companies think they don’t need to bother, and can just use everyone’s stuff.
Wait, I phrased that wrong. For a very good reason based on long precedent, these companies know that IP law is a tool to be used by big companies against individuals and sometimes other big companies, but never by individuals against companies, so they know they don’t have to bother.
jsheard 56212 years ago

The fact that they were willing to risk significant legal exposure in order to use this dataset suggests it's worth considerably more than 5 dollars to them. Zuck isn't putting his ass on the line for a Big Mac.
TruffleLabs 56212 years ago

Stealing is still illegal
sensanaty 56212 years ago

So if the works they're stealing aren't worth anything, why do they need them so badly?