rockemsockem 7 hours ago

It seemed obvious to me, long before modern LLM training, that any sort of machine-intelligence training would have to rely on pirated content. There's just no other viable alternative for efficiently acquiring large quantities of text data. Buying millions of ebooks online would take a lot of effort; downloading data from publishers isn't a thing that can be done efficiently (even assuming tech companies negotiated and threw money at them); the only efficient way to access large volumes of media is piracy. The media ecosystem doesn't allow anything else.

  • brookst an hour ago

    I don’t follow the “millions of ebooks are hard” line of thinking.

    If Meta (or anyone) had approached publishers with a “we want to buy one copy of every book you publish” offer, that doesn’t seem technically or commercially difficult.

    Certainly Amazon would find that extremely easy.

    • spencerflem an hour ago

      Buying a book to read and incorporating its text into a product are two different things. Even if they bought the book, imo it would be illegal.

      • amanaplanacanal 15 minutes ago

        Maybe it is, maybe it isn't. The courts will decide.

        • WarOnPrivacy 3 minutes ago

          > Maybe it is, maybe it isn't. The courts will decide.

          This seems to offhandedly dismiss the cost of achieving legal clarity for using a book - a cost that will far eclipse the cost of the book itself.

          In that light, it seems like an underweighted statement.

  • ben_w 6 hours ago

    IMO, if the AI were more sample-efficient (a long-standing problem that predates LLMs), they would be able to learn from purely open-licensed content, which I think Wikipedia (CC-BY-SA) would be an example of? I think they'd even pass the share-alike requirements, given Meta are giving away the model weights?

    https://en.wikipedia.org/wiki/Wikipedia:Copyrights

    • visarga 29 minutes ago

      Alternatively, if they trained the model on synthetic data, filtered to avoid duplication, then the model would never see copyrighted material directly. For example, turn an article into QA pairs, or summarize across multiple sources of text.
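
      A minimal sketch of that duplication filter in Python (generated_pairs and article are hypothetical variables; the 8-word window is an arbitrary threshold):

        def is_near_copy(candidate: str, source: str, n: int = 8) -> bool:
            """True if candidate repeats any n consecutive words of source verbatim."""
            src = source.split()
            cand = candidate.split()
            src_ngrams = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
            return any(tuple(cand[i:i + n]) in src_ngrams
                       for i in range(len(cand) - n + 1))

        # 'generated_pairs' and 'article' are hypothetical inputs: keep only
        # synthetic QA pairs that don't copy long spans from the article.
        qa_pairs = [qa for qa in generated_pairs if not is_near_copy(qa, article)]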

    • wizzwizz4 5 hours ago

      Since this is Wikipedia, it could even satisfy the attribution requirements (though most CC-licensed corpora require attributing the individual authors).

  • diggan 6 hours ago

    > There's just no other viable alternative for efficiently acquiring large quantities of text data. [...] take a lot of effort [...] isn't a thing that can be done efficiently [...] only efficient way to access large volumes of media is piracy

    Hypothetical: If the only way we could build AGI would be to somehow read everyone's brain at least once, would it be worth just ignoring everyone's wish regarding privacy one time to suck up this data and have AGI moving forward?

    • brookst an hour ago

      It’s a fun hypothetical and not an obvious answer, to me at least.

      But it’s not at all a similar dilemma to “should we allow the IP empire-building of the 1900s to claim ownership over the concept of learning from copyrighted material”.

    • vkou an hour ago

      > would it be worth just ignoring everyone's wish regarding privacy one time to suck up this data and have AGI moving forward?

      Sure, if we all get a stake of ownership in it.

      If some private company is going to be the main beneficiary, no, and hell no.

      • visarga 26 minutes ago

        > Sure, if we all get a stake of ownership in it.

        But we do, in the sense that benefits flow to the prompter, not the AI developers. The person comes with a problem, the AI generates responses, and they stand to benefit because it was their problem; the AI provider makes cents per million tokens.

        AI benefits follow the owners of problems. That person might have started a project or taken a turn in their life as a result; the benefit is unquantifiable.

        LLMs are like Linux, they empower everyone, and benefits are tied to usage not development.

        • vkou 8 minutes ago

          We've seen this kind of system before. It was called sharecropping, and it was terrible.

          The price will be ratcheted up, such that the majority of the economic surplus will go to the owner of AGI.

    • impossiblefork 4 hours ago

      Wouldn't it be a bad thing, even if it didn't require any privacy invasion?

      If it matched human intellectual capacity, human intelligence would no longer earn you more money than it takes to run some GPUs, so it would presumably become economically optional.

    • BriggyDwiggs42 6 hours ago

      Could this agi cure cancer, and would it be in the hands of the public? Then sure, otherwise nah.

      • onemoresoop 5 hours ago

        > in the hands of the public

        Would you trust a businessman on that?

        • BriggyDwiggs42 2 hours ago

          Nope, they haven’t earned an ounce.

          • anonym29 an hour ago

            How about a politician?

    • scarecrowbob 6 hours ago

      Ah geeze, I come to this site to see the horrors of the sociopaths at the root of the terrible technologies that are destroying the planet I live on.

      The fact that this is an active question is depressing.

      The suspicion that, if it were possible, some tech bro would absolutely do it (and smugly justify it to themselves using Roko's Basilisk or something) makes me actually angry.

      I get that you're just asking a hypothetical. If I asked "Hypothetical: what if we just killed all the technologists" you'd rightly see me as a horrible person.

      Damn. This site and its people. What an experience.

      • jahsome 5 hours ago

        I read that as a (possibly sarcastic) rhetorical and cautionary hypothetical used to demonstrate the absurdity of ignoring copyright.

        You seem to have set aside any critical thinking and come to "this website" looking for a reason to seethe at complete strangers about whom you know very little and whose motives you belligerently misrepresent, all the while making exaggerated and extremist statements and no doubt embracing worse thoughts.

        You're the type of person destroying the planet _I_ live on.

        This isn't a defense of technologists, it's a plea to stop tripping over yourself to see the worst in everyone.

        • brookst an hour ago

          I agree that was the intent of the analogy, but it’s not a great one. The idea that Disney, who has perverted IP laws globally for almost a century, should have ownership of their over-extracted copyrighted works equivalent to the privacy I have in the thoughts in my own head? Really?

      • plsbenice34 5 hours ago

        Would the average person even be against it? I am the most passionately pro-privacy person that i know, but i think it is a good question because society at large seems to not value privacy in the slightest. I think your outrage is probably unusual on a population level

        • onemoresoop 5 hours ago

          They don’t value it because they think companies are not abusing this power too much. Little do they know…

          • plsbenice34 3 hours ago

            When i talk to people it seems like they know but they just don't care. They even think their phones are listening to their conversations to target ads.

    • ben_w 6 hours ago

      Given how much copyrighted content I can remember? To the extent that what AIs do is *inherently* piracy (and not just *also* piracy as an unforced error, as this case apparently is), a brain scan would also be piracy.

    • gunian 5 hours ago

      kind of closer to reality than anyone knows :)

      tbh human rights are all an illusion, especially if you are at the bottom of society like me. no way I will survive, so if a part of me survives as training data, I guess that's better than nothing?

      imo the only way this could happen is a global collaboration without telling anyone. the AGI would know everything about all humans, but its existence would have to be kept secret, at least for the first n generations. life would be gamified without anyone knowing; it would be eugenics, but on a global scale

      so many would be culled, but the AGI would know how to make it look normal to prevent resistance from forming - a war here, a war there, a law passed here, etc. so copyright being ignored kind of makes sense

      • __loam 4 hours ago

        Jesus Christ

        • gunian 4 hours ago

          sadly he supports the AGI, eugenics and human sacrifice lol my pastor told me he gave him 6 real estate holdings

  • nh2 3 hours ago

    > Buying millions of ebooks online would take a lot of effort

    I don't understand.

    Facebook and Google spend billions on training LLMs. Buying 1M ebooks at $50 each would only cost $50M.

    They also have >100k engineers. If they sharded the ebook buying across their workforce, everyone would have to buy 10 ebooks, which could be done in 10 minutes.

    • shakna 2 hours ago

      Google also operates a book store, like Amazon. Both could process a one-off payment to their authors, and then draw from their own backend.

  • maeil 3 hours ago

    > In the most recent fiscal year, Alphabet's net income amounted to 73.7 billion U.S. dollars

    Absolutely no way. Yup.

    > Buying millions of ebooks online would take a lot of effort, downloading data from publishers isn't a thing that can be done efficiently

    Oh no, it takes effort and can't be done efficiently, poor Google!

    How can this possibly be an excuse? This is such a detached SV Zuckerberg "move fast and break things"-like take.

    There's just no way for a lot of people to efficiently get out of poverty without kidnapping and ransoming someone; it would take a lot of effort.

    • thatcat 3 hours ago

      copyright infringement isn't theft; try proving damages for a better argument

      • maeil 2 hours ago

        Not my point, never said it is. Substitute that example with another criminal act.

        Edit: Changed it just for you

  • the-rc 7 hours ago

    Google has scans from Google Books, as well as all the ebooks it sells on the Play Store.

    • lemoncookiechip 6 hours ago

      Wouldn't that still be piracy? They own the rights of distribution, but do they (or Amazon) have the rights to use said books for LLM training? And what rights would those even be?

      • brookst an hour ago

        It’s a good question. Textbook companies especially would be pretty enthusiastic about a new “right to learn” monetization strategy. And imagine how lucrative it would be if you could prove some major artist didn’t copy your work, but learned from your work. The whole chain of scientific and artistic development could be monetized in perpetuity.

        I think this is a dangerous road with little upside for anyone outside of IP aggregators.

      • majormajor 5 hours ago

        It means they have existing relationships/contacts to reach out to for negotiating the rights for other usages of that content. I think it negates (for Google/Apple/Amazon, who all sell ebooks) the claim that efficiently acquiring the digital texts wouldn't be possible.

      • XorNot 5 hours ago

        Literally no rights agreement covers LLMs. They cover reproduction of the work, but LLMs don't obviously do this; that the model transiently runs an algorithm over the text is superficially no different from any other classifier or scoring system, like those already used by law firms looking to sue people for sharing torrents.

        • visarga 19 minutes ago

          > They cover reproduction of the work, but LLMs don't obviously do this

          LLMs are much smaller than their training sets; there is no space to memorize the training data. They might memorize small snippets, but never full books. They are the worst infringement tools ever made - why replicate Harry Potter with an LLM, which is slow, expensive and lossy, when you could just download the book? (See the back-of-envelope arithmetic below.)

          A second argument is that using the LLM blends a new intent into the process - that of the prompter. This can render the outputs transformative. And most LLM interactions are one-time use, like a scratch pad, not a finished work.
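
          Back-of-envelope arithmetic on the size claim, using public Llama-3-class figures (the 4 bytes of raw text per token is a rough assumption):

            tokens = 15e12                        # ~15T training tokens (public Llama 3 figure)
            raw_text_bytes = tokens * 4           # rough assumption: ~4 bytes of raw text per token => ~60 TB
            weight_bytes = 70e9 * 2               # 70B parameters at 2 bytes (fp16) => ~140 GB
            print(raw_text_bytes / weight_bytes)  # ~430x more training text than weight storage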

        • thatcat 3 hours ago

          do those classifiers read copyrighted material? i thought they simply joined the swarm and seeded (reproduction with permission)

          youtube etc. classifiers definitely do read others' material though.

    • pdpi 5 hours ago

      Leveraging their position in one market to get a leg up on another market? No idea if it would stick, but that would be one fun antitrust lawsuit right there.

      • brookst an hour ago

        Fun fact: it’s only illegal to leverage a monopoly in one market to advance another. It’s perfectly legal for Coke to leverage their large but not monopolistic soft drink empire to advance their bottled water entries.

  • gunian 5 hours ago

    what about something decentralized? each person trains a model on their own piece of data and somehow that gets aggregated into one giant model

    • techwizrd 5 hours ago

      This approach is used in Federated Learning where participants want to collaboratively train a model without sharing raw training data.
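
      The canonical aggregation step (FedAvg) is simple enough to sketch; here each client's locally trained weights are averaged, weighted by how much data that client holds:

        import numpy as np

        def fed_avg(client_weights, client_sizes):
            """Federated averaging: weight each client's parameters by its data share."""
            total = sum(client_sizes)
            return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

        # Three clients with different amounts of local data:
        merged = fed_avg([np.array([1.0, 2.0]), np.array([3.0, 4.0]),
                          np.array([5.0, 6.0])], [100, 50, 50])

      Real deployments layer secure aggregation and differential privacy on top, since averaging alone doesn't stop a model from memorizing rare strings.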

      • gunian 4 hours ago

        are there any companies working on it?

        i was thinking, if i train my model on my private docs (for instance, finance), how does one prevent the model from sharing that data verbatim?

  • aithrowawaycomm 6 hours ago

    I find it highly implausible that Meta doesn't have the resources to obtain these legally. They could have reached out to a publisher and asked to purchase ebooks in bulk - and if that publisher says no, tough shit. The media ecosystem doesn't exist for Big Tech to extract value from it!

    "It would take a lot of effort to do it legally" is a pathetic excuse for a company of Meta's size.

    • Marsymars 5 hours ago

      > I find it highly implausible that Meta doesn't have the resources to obtain these legally. They could have reached out to a publisher and asked to purchase ebooks in bulk - and if that publisher says no, tough shit

      They could also simply buy controlling stakes in publishers. For scale comparison, Meta is spending upwards of $30B per year on AI, and the recent sale of Simon & Schuster that didn't go through was for a mere $2.2B.

      • michaelt 4 hours ago

        I don't think it would actually be that simple.

        Surely the author only licenses the copyright to the publisher for hardback, paperback and ebook, with an agreed-upon royalty rate?

        And if someone wants the rights for some other purpose, like translation or making a film or producing merchandise, they have to go to the author and negotiate additional rights?

        Meta giving a few billion to authors would probably mend a lot of hearts, though.

    • nicoburns 6 hours ago

      > if that publisher says no, tough shit

      > "It would take a lot of effort to do it legally" is a pathetic excuse for a company of Meta's size.

      I totally agree. But since when has that stopped companies like Meta? These big companies are built on breaking/skirting the rules.

    • spaceguillotine 6 hours ago

      explain why release group tags get generated in some videos then

      • fzzzy 5 hours ago

        they are not saying meta didn't use pirated content, just that they have the resources not to if they choose.

    • gazchop 6 hours ago

      Perhaps they did and got told no and decided to take it anyway?

      Defending themselves with technicalities and expensive lawyers may be financially viable.

      Zero ethics but what would we expect from them?

      • XorNot 5 hours ago

        Who is "them"? Like, who in the Meta business reporting line made this decision, then how did they communicate it to the engineers who would've been necessary to implement it, particularly at scale?

        While it's plausible someone downloaded a bunch of torrents and tossed them in the training directory... again, under whose authority? If this happened, it would potentially be one overzealous data scientist. Hardly "them".

        People lean on collective pronouns to avoid actually thinking about the mechanics of human enterprise and you get extremely absurd conclusions.

        (it is not outside the bounds of the thinkable that an org could in fact have a very bad culture like this, but I know people who work for Meta who also have law degrees - they're well aware of the potential problems).

        • aithrowawaycomm 5 hours ago

          Come on... it's fine that you haven't followed the story, there's a lot going on, but the snotty condescension is very frustrating:

            These newly unredacted documents reveal exchanges between Meta employees unearthed in the discovery process, like a Meta engineer telling a colleague that they hesitated to access LibGen data because “torrenting from a [Meta-owned] corporate laptop doesn’t feel right”. They also allege that internal discussions about using LibGen data were escalated to Meta CEO Mark Zuckerberg (referred to as "MZ" in the memo handed over during discovery) and that Meta's AI team was "approved to use" the pirated material.
          
          https://www.wired.com/story/new-documents-unredacted-meta-co...
  • IncreasePosts 6 hours ago

    Why would machine intelligence need an entire humanity's worth of data to be machine intelligence? It seems like only a really poor training method would need that much data.

  • mvdtnz 5 hours ago

    AI mega corporations are not entitled to easy and cheap access to data they don't own. If it's a hard problem, too bad. If the stakes are as high as they're all claiming then it should be no problem for them to do this right.

    • visarga 10 minutes ago

      > not entitled to easy and cheap access to data they don't own

      This is not copyright as we know it. Copyright protects against copying, not against accessing data. You can still compile statistics off data you don't own. The models are like a compressed version of the originals - so compressed you can't retrieve more than a few snippets of original text. Newer models train on filtered synthetic text, which is one step removed from the protected expression in the copyrighted works. Should abstractions be protected by copyright?

crmd 3 hours ago

I am trying to imagine the legal contortions required for the US Supreme Court to relieve Meta of copyright infringement liability for participating in a BitTorrent swarm (and thereby facilitating "piracy" by others) in this case, while upholding liability for ordinary people using BitTorrent.

Would love if any lawyers here can speculate.

  • brookst an hour ago

    Not a lawyer, but I could see an argument that Meta’s use is transformative whereas just pirating to watch something is not. Not asserting that myself, just saying it seems a possible avenue.

    • wongarsu 28 minutes ago

      The issue with BitTorrent isn't so much that you are acquiring material, but that you are also distributing it. There are cases where downloading copyrighted material is legal. But distributing it without consent never is, and it is generally punished much more harshly.

kazinator 3 hours ago

The mind boggles. Are the plaintiffs jumping to the conclusion that Meta must have used BitTorrent, based on the idea that whenever someone pirates anything anywhere using the Internet, it's always done with BitTorrent? Or is there actual evidence for this?

glitchc 2 hours ago

I see a silver lining here: If Meta and/or Google's lawyers can successfully demonstrate in court that piracy does not cause harm, it would nullify copyright infringement laws, making piracy legal for everyone.

  • spencerflem an hour ago

    This would be poetic, but it's not gonna happen. It will be legal for big corps but not for you and me.

    • hresvelgr 29 minutes ago

      You know, I actually don't think so. Gabe Newell famously said piracy is a distribution problem, so a court would likely have to acknowledge that inadequate distribution methods were hampering AI development. That would set a great precedent for consumer piracy, especially for old media that isn't sold anymore. It may not be a criminal offence if best efforts aren't being made by the original copyright holders to distribute.

loeg 5 hours ago

> “By downloading through the bit torrent protocol, Meta knew it was facilitating further copyright infringement by acting as a distribution point for other users of pirated books,” the amended complaint notes.

> “Put another way, by opting to use a bit torrent system to download LibGen’s voluminous collection of pirated books, Meta ‘seeded’ pirated books to other users worldwide.”

It is possible to (ab)use the bittorrent ecosystem and download without sharing at all. I don't know if this is what Meta did, or not.
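
For illustration, a leech-only client takes a few lines with the python-libtorrent bindings (books.torrent is a made-up filename; note that in libtorrent a rate limit of 0 means unlimited, hence the token 1 B/s cap). Whether Meta configured anything like this is unknown:

    import time
    import libtorrent as lt

    # Cap uploads at 1 byte/s (0 would mean *unlimited* in libtorrent)
    # and keep finished torrents from transitioning into active seeding.
    ses = lt.session({'upload_rate_limit': 1, 'active_seeds': 0})
    h = ses.add_torrent({'ti': lt.torrent_info('books.torrent'),
                         'save_path': '.'})
    while not h.status().is_seeding:
        time.sleep(5)
    ses.remove_torrent(h)  # leave the swarm as soon as the download completes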

  • wongarsu 19 minutes ago

    However, since this is a civil case, they don't have to prove beyond reasonable doubt that Meta seeded torrents. If Meta did use torrents, the presumption would be that they used a regular bittorrent client with regular settings, and it would be on Meta to show they didn't.

    • loeg 10 minutes ago

      I am not commenting on any legal mechanics. Just technical details.

  • cactusplant7374 3 hours ago

    That is probably exactly what they did if they were smart about it.

bhouston 5 hours ago

I am not sure you have to use torrents to pirate books. Pdfdrive is likely much more effective than torrents. Torrents are best for large assets or those that are highly policed by copyright authorities; for smaller things torrents have little benefit.

  • crtasm 4 hours ago

    I think if you're downloading hundreds of thousands to millions of books you'll be dealing with some pretty large archives.

    edit: books3.tar.gz alone is 37GB and claimed to have 197,000 titles in plain text.

  • Marsymars 5 hours ago

    A publisher's entire library of books is a large asset.

casey2 4 hours ago

As long as they seed it's fine by me

hnburnsy 8 hours ago

Wonder if Meta is running a one-way Usenet host. Much better than torrents.

  • LtdJorge 7 hours ago

    The first rule of Usenet is: you do not talk about Usenet

    • spokaneplumb 6 hours ago

      People breaking the first rule wasn’t enough for me to crack into the scene. The weird two-paid-services thing required to use it effectively (a search service of some kind, plus your actual content provider) and the jankiness of the software and sites involved were enough to make me give up, after spending some money but making no meaningful progress toward pirating anything.

      I started my piracy journey on Napster. I’ve done all the other biggies. I’ve done off-the-beaten-path stuff like IRC piracy channels. Private trackers. I have a soft spot for Windowmaker and was dumb enough to run Gentoo so long that I got kinda good at the “scary” deep parts of Linux sysadmin. I can deal with fiddliness and allegedly-ugly UI.

      Usenet piracy defeated me.

      • luma 6 hours ago

        Working as intended! The arrs make everything a lot easier.

    • geor9e 6 hours ago

      if it was meant to be kept secret, it probably shouldn't have been put on the AOL home portal in 1994

alex1138 6 hours ago

How do other LLMs like Claude deal with this?

  • BonoboIO an hour ago

    You don’t talk about Fight Club…

    Everyone uses "pirated" content, but some are better at hiding it and/or not talking about it.

    There is no other way to do it.

FireBeyond 7 hours ago

Try training on the output of any of the big players' models and see how quickly they remember how much they value copyright.

  • WhatsName 7 hours ago

    You mean OpenAI's infamous "you shall not train on the output of our model" clause?

    • Terr_ 6 hours ago

      If that's contractually enforceable in their terms-of-service... then I have my own terms-of-service proposal that I've been kicking around here for several weeks, a kind of GPL-inspired poison pill:

      > If the Visitor uses copyrighted material from this site (Hereafter: Site-Content) to train a Generative AI System, in consideration the Visitor grants the Site Owner an irrevocable, royalty-free, worldwide license to use and re-license any output or derivative works created from that trained Generative AI System. (Hereafter: Generated Content.)

      > If the Visitor re-trains their Generative AI System to remove use of the Site-Content, the Visitor is responsible for notifying the Site Owner of which Generated Content is no longer subject to the above consideration. The Visitor shall indemnify the Site-Owner for any usage or re-licensing of Generated Content that occurs prior to the Site-Owner receiving adequate notice.

      _________

      IANAL, but in short: "If you exploit my work to generate stuff, then I get to use or give away what you made too. If you later stop exploiting my work and forget to tell me, then that's your problem."

      Yes, we haven't managed to eradicate a two-tiered justice system where the wealthy and powerful get to break the rules... But still, it would be cool to develop some IP-lawyer-vetted approach like this for anyone to use, some boilerplate ToS and agree-button implementation guidelines.

heroprotagonist 8 hours ago

What's the lesson, hire contractors?

  • ben_w 6 hours ago

    The lesson is "move fast and break things is much less fun when we have to pay for things we broke".

  • kevingadd 7 hours ago

    It's possible their friends in government will make this all go away if they ask nicely enough.

    • moshegramovsky 7 hours ago

      Yeah I had a Facebook account until today.

      This whole copyright thing reminds me of when Mark Zuckerberg was mad that someone posted photos of the interior of his house or something.

russellbeattie 3 hours ago

So here's a related thought...

Google is currently being sued by journalist Jill Leovy for illegally downloading and using her book "Ghettoside" to train Google's LLMs [1].

However, her book is currently stored, indexed, and available in snippet form on Google Books [2]. That use case has been established in the courts as fair use. Additionally, Google has made deals with publishers and the Authors Guild as well.

So many questions! Did Google use its own book database to train Gemini? Even if they got the book file in an illegal way, does the fact that they already have it legally negate the issue? Does resolving all the legal issues related to Google Books immunize them from these sorts of suits? Legally, is training an LLM the same as indexing and providing snippets? I wonder if OpenAI, Meta and the rest will be able to use Google Books as a precedent? Could Google license its model to other companies to immunize them?

Google's decade-long Books battle could produce major dividends in the AI space. But I'm not a lawyer.

1. https://www.bloomberglaw.com/public/desktop/document/LeovyvG...

2. https://books.google.com/books?id=bZXtAQAAQBAJ