mcbuilder 2 years ago

> Rather than conceal the licenses of the under­lying open-source code it relies on, it could in prin­ciple keep this infor­ma­tion attached to each chunk of code as it wends its way through the model.

I don't know if the author understands how these transformer models work, but this would be an impossible task of Byzantine complexity. The way these models work is by outputting a probability distribution over likely "token" embeddings given an input prompt. The output basically involves inverting a word2vec (probably beefed up with programming-language keywords and other bells and whistles, plus search techniques I don't have access to the details of).

This model was of course trained on real code, but you can't attach to the output any meaningful information from the gradient you get from a sample. Attaching a percentage describing how much one training example affected one particular output is a very messy computation even to write down, much less to come up with a simple interpretation of.
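For what it's worth, there is research in this direction (influence-function and TracIn-style methods), and a toy sketch shows both the idea and why it's messy: attribution reduces to comparing per-example loss gradients, which is a heuristic at best and brutally expensive at transformer scale. This is an illustrative toy with a linear model, not anything resembling a production system:

```python
import numpy as np

# Toy illustration of one proposed attribution method (TracIn-style):
# approximate the influence of a training example on a test prediction
# by the dot product of their loss gradients. Even for this tiny linear
# model the score is a heuristic, not a provenance record; for a large
# transformer you would need per-example gradients at many checkpoints.

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y_train = X_train @ w_true + 0.1 * rng.normal(size=100)

w = np.zeros(5)                      # "model parameters" at some checkpoint
for _ in range(50):                  # a few steps of gradient descent
    w -= 0.1 * (X_train.T @ (X_train @ w - y_train)) / len(y_train)

def loss_grad(x, y, w):
    """Gradient of the squared error of a single example w.r.t. w."""
    return 2 * (x @ w - y) * x

x_test, y_test = X_train[0], y_train[0]
g_test = loss_grad(x_test, y_test, w)

# One influence score per training example: grad(train) . grad(test).
influence = np.array([loss_grad(x, y, w) @ g_test
                      for x, y in zip(X_train, y_train)])
top = int(influence.argmax())        # heuristically the "most responsible" example
```

Even here, a "score" is all you get; turning it into the percentage described above, for billions of tokens, is the Byzantine part.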

  • gus_massa 2 years ago

    There is a problem because some licenses require attribution, but ignoring that...

    You can make a model that is trained only with BSD and MIT code, and IIRC/IIUC the result can be used with any license, including proprietary code.

    You can make a second model that is trained only with BSD, MIT and GPL2 code (and perhaps GPL2+), and IIRC/IIUC the result can only be used with GPL2 code.

    You can make a third model that is trained only with BSD, MIT and GPL2+ and GPL3 code, and IIRC/IIUC the result can only be used with GPL3 code.

    AGPL, Apache, WTFPL, ... Just add them to the correct model or create a new model for them.
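    A sketch of how that per-license partitioning might be routed; the compatibility groups below just mirror the three models described above and are illustrative, not legal advice:

```python
# Per-license training corpora, following the scheme above. The SPDX-style
# identifiers and group names are illustrative; real compatibility
# analysis is far more involved.

CORPORA = {
    "permissive": {"MIT", "BSD-2-Clause", "BSD-3-Clause"},        # output usable anywhere
    "gpl2":       {"MIT", "BSD-2-Clause", "BSD-3-Clause",
                   "GPL-2.0-only", "GPL-2.0-or-later"},           # output must be GPL2
    "gpl3":       {"MIT", "BSD-2-Clause", "BSD-3-Clause",
                   "GPL-2.0-or-later", "GPL-3.0-only"},           # output must be GPL3
}

def corpora_for(license_id: str) -> list[str]:
    """Return which model corpora a repo with this license may join."""
    return [name for name, allowed in CORPORA.items() if license_id in allowed]

corpora_for("MIT")            # joins all three corpora
corpora_for("GPL-2.0-only")   # joins only the GPL2 corpus
```

New licenses (AGPL, Apache, WTFPL, ...) would just be new entries or new groups.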

    • q-big 2 years ago

      > You can make a model that is trained only with BSD and MIT code, and IIRC/IIUC the result can be used with any license, including proprietary code.

      BSD and MIT license still require attribution of the used source code (there exists a MIT No Attribution License, though: https://en.wikipedia.org/w/index.php?title=MIT_License&oldid...).

      • mbreese 2 years ago

        > BSD and MIT license still require attribution of the used source code

        Which in this scenario could be accomplished by stating that “this software contains algorithmically generated code that was trained using MIT-derived code from X, Y, and Z.”

        This attribution list would be very long, but it could be done.

        I think the interesting thing here would be to compare the generated code with the original training data after the fact. If the generated code does not already exist (with some degree of similarity), then it would be okay to use. Otherwise, if it did already exist, you’d then need to attribute the authors of the chunk of code — which you’d now know.
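        A rough sketch of that after-the-fact comparison, using a generic string-similarity ratio. The repo names, snippets, and 0.8 threshold are all made up for illustration, and a real system would need token-level normalization to catch renamed variables:

```python
import difflib

# Hypothetical training index: repo name -> snippet. A real corpus would
# be indexed at scale; this only shows the shape of the check.
TRAINING = {
    "repo-x": "def add(a, b):\n    return a + b\n",
    "repo-y": "def mul(a, b):\n    return a * b\n",
}

def closest_match(generated: str, threshold: float = 0.8):
    """Return (repo, score) if some training snippet is suspiciously similar."""
    scored = [(difflib.SequenceMatcher(None, generated, code).ratio(), repo)
              for repo, code in TRAINING.items()]
    score, repo = max(scored)
    return (repo, score) if score >= threshold else (None, score)

# Renamed variables still score high enough here to demand attribution.
repo, score = closest_match("def add(x, y):\n    return x + y\n")
```

If the match clears the threshold, you now know whom to attribute; if nothing matches, the snippet is plausibly novel.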

        I’m not discounting the hurdles involved, but there might be ways to overcome them.

      • jimmySixDOF 2 years ago

        Whats to stop them from just making a huge attribution page with every possible instance and a vague statement like "the following may or may not have been incorporated in part or in whole" ?

        • jacobolus 2 years ago

          They’d be gambling that such a crummy attribution would hold up in court. Generally judges aren’t too amused by people’s efforts to find wacky loopholes that make a mockery of the law.

          On the other hand, open source licenses haven’t been tested thoroughly in court, so maybe the result would be to throw out such attribution requirements entirely. The main relevant case to date is https://en.wikipedia.org/wiki/Jacobsen_v._Katzer

      • wolpoli 2 years ago

        Great. Now developers will need to check license compatibility before picking the copilot model.

        In reality though, we'll all just end up using the MIT No Attribution License copilot.

        • q-big 2 years ago

          Developers already have to check license compatibility. If copilot implemented such a feature, this would just represent the status quo.

        • xigoi 2 years ago

          > In reality though, we'll all just end up using the MIT No Attribution License copilot.

          Great, so GPL code would be left alone. I don't see how that's a bad thing.

    • winety 2 years ago

      > There is a problem because some licenses require attribution, but ignoring that...

      Surely the solution would be to give credit to every author from the training corpus. I am looking forward to the 10 000 lines of copyrights in every header. :P

      If Microsoft had trained it on its own code, there would be no such problems. Surely a company as large as Microsoft has produced enough code over the years to create a large enough training dataset.

      • ShamelessC 2 years ago

        > If Microsoft had trained it on its own code, there would be no such problems.

        I keep seeing this sentiment from the GPL/"laundering" side of the debate.

        Believe me, Microsoft wouldn't have released this thing (after what, 6 months of beta testing?) if they thought they had any "problems" at all.

        I'm not saying I don't sort of agree with you, but is there no room for what's actually _likely_ to happen in this debate? Because as best as I can tell, they aren't going to see any real legal issues from this.

        (There's also an option to remove generations that result in a collision with actual GitHub code, just fyi)

        I feel like when the singularity happens HN is going to be flooded with programmers mad that they got automated away, despite that very much being one of the primary goals of computer science and software engineering. This stuff is just a fact of life now.

        Salesforce trained models (on GitHub) competitive with copilot without needing to own GitHub. I would spend less time worrying about how to lawyer up and more time figuring out how you're going to adapt to these new tools. That's the gig.

        • belorn 2 years ago

          Microsoft made a bet that releasing Copilot will mean more profits than the legal issues might cost them. That tells us nothing about whether there actually is a problem with it.

          The simplest way to test the legal theory behind Copilot would be to write an AI that writes music, trained on music scraped from YouTube or any other large music library. The idea that one can train on "publicly available material" and produce algorithms that output large chunks of copyrighted material is a bit untested in court, but go against the wrong target and we will quickly see a response. We have actually seen some traces of this with news bots that scrape news sites and produce "novel" interpretations of existing news, especially sports news.

          • ShamelessC 2 years ago

            This is what I'm talking about. Are we commenting on a news report of someone actually doing what you're describing - filing a suit or legal action of any kind against MS for this?

            No, we're not. Further, Amazon just announced a similar product and Salesforce has literally _released weights_ for their code models. You can't put the genie back in the bottle.

            Actually enforcing any action when the representations are learned rather than hard-coded just seems impossible to me. They have a checkbox that removes any predictions matching existing code - that basically makes it impossible to discern the source, since anything left will be based on some subjective "semantic closeness" BS.

        • Ygg2 2 years ago

          > Believe me, Microsoft wouldn't have released this thing (after what, 6 months of beta testing?) if they thought they had any "problems" at all.

          They would. Did you just forget Tay? Microsoft didn't consider that 4chan would train her to be the ultimate racist.

        • iostream24 2 years ago

          As you mentioned “when the singularity happens” as an article of religious faith, followed by a vast leap of faith in proposing no-code tools taking over programming, I'm afraid that adapting to the reality on the ground will be the difficult part for you, rather than any lack of adaptation by programmers writ large.

          Do you work in marketing? Do you program?

          • ShamelessC 2 years ago

            I'm only somewhat certain that the singularity is inevitable (and obviously my predictions aren't worth betting on anyways) - sorry for using poetic language.

            I'm a machine learning engineer, amateur researcher and open source contributor. Before that I was a software engineer for 8 years.

  • ninjin 2 years ago

    It does not matter whether Matthew understands how transformer models work. He does understand that legally this is looking messy – or “foggy”, in his words.

    As a person who teaches and conducts research with these models, I see no reason why a tool like Copilot has to be a purely parametric, generative transformer like GPT-3, where attribution as you describe it is nearly impossible.

    For example, if the model were to use a retriever component to obtain specific pieces of concrete code from a database (where the licensing for each piece is known), conditioned on the original source-code context, and then generate its output based on this retrieved code, the context in your source, and its pre-trained parameters, it could at least theoretically satisfy Butterick’s request, and would probably be more akin to how a human programmer operates.
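    A minimal sketch of that retriever idea, with a stand-in letter-frequency "embedding" in place of a learned code encoder; the snippets, licenses, and repo names are hypothetical:

```python
import math
from dataclasses import dataclass

@dataclass
class Snippet:
    code: str
    license: str
    source: str

# Hypothetical indexed database; every entry carries its license.
DB = [
    Snippet("fn dijkstra(...) { ... }", "GPL-2.0", "repo-a"),
    Snippet("def quicksort(xs): ...",   "MIT",     "repo-b"),
]

def embed(text: str) -> list[float]:
    # Stand-in embedding: letter-frequency vector. A real retriever
    # would use a learned code encoder instead.
    return [text.count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def retrieve(context: str, k: int = 1) -> list[Snippet]:
    """The k snippets most similar to the user's context, licenses attached."""
    q = embed(context)
    return sorted(DB, key=lambda s: cosine(q, embed(s.code)), reverse=True)[:k]

# The generator would condition on these hits; their licenses travel along.
hits = retrieve("def sort(xs): ...")
```

Because every retrieved piece keeps its metadata, the license information can "wend its way through" at least the retrieval half of such a system.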

    This still does not rule out possible legal issues entirely as the pre-trained parameters are opaque, but it certainly makes it less problematic.

    Alternatively, there is active research on attributing a given output to a transformer’s training data. But it is still very early days and frankly it is very “foggy” as to what degree this can be done.

    Lastly, I have seen a bunch of comments elsewhere alluding to it being somehow sufficient to reduce the level of verbatim copying. Sadly, it will not, as even if you for example replace the variable names it is still a copyright violation. Just like if you manipulate the RGB space slightly for an image. Determining fair use, etc without a license is something that today can only be done in court, regardless of how we of a heavier technical disposition may feel about it. After all, that is exactly why we have explicit licenses in the first place!

  • Brian_K_White 2 years ago

    As much as I hate the entire concept, I would have a hard time articulating a substantive difference between this description of the ai mixing together bits of stuff it saw and what I do myself when I'm writing something that I fully describe as mine.

    • alpaca128 2 years ago

      You probably don't sell code snippets that you collected from various sources you can't remember without knowing whether you're even allowed to sell some of them.

      That's what Copilot does with extra steps. I don't see why it should be different just because Microsoft mixed the code snippets in a blender to obscure what they're doing. It's code laundering.

      • disconcision 2 years ago

        just trying to make the comparison more precise:

        using copilot is like subcontracting to a company who you know has no institutional qualms about their employees liberally copy/pasting from a known corpus of github repos. you know these employees don't always literally copy/paste, but all they do every day is read the corpus, so their code is always going to be basically derivative. you can optionally ask that the company avoids matches to existing code, which you know means that before sending you the code, the boss will fuzzy search their code against the corpus, asking their employees to write it again if a match is found.

        so given that this is the way the company works, and that you know this is the way they work, your task is to decide what your ethical and legal liabilities are.

      • chii 2 years ago

        > You probably don't sell code snippets that you collected from various sources you can't remember without knowing whether you're even allowed to sell some of them.

        actually i'm sure you do, you just don't think of it this way. The education you received to learn to code would consist of such snippets and examples. Then you sell your skill as a programmer, which consists of recalling past experience and knowledge and transforming that into the final piece of code.

        Copilot is merely doing something like that, but way less sophisticated than a human brain.

        • Brian_K_White 2 years ago

          This. Sure I synthesize, but everything I synthesize is made out of stuff that came from somewhere. If the "snippets" are small enough, they become no different than me consulting the man page for a function.

          I think this will require looking at some actual examples to try to judge, and figure out how to articulate the basis for the judgement.

          Hm. That begs the question what good is it if it deals in such small bits? I guess that means if the snippets are large enough to be of any value at all, then they are automatically big enough to be damning.

    • amelius 2 years ago

      The difference is that the AI is doing it on a much larger scale.

      Analogy: the law has no problem with you memorizing the license plates of cars you see as they pass your street; however, building an automated system for recording and storing license plates in a database is a different story.

  • flohofwoe 2 years ago

    Why not simply split the input training data into separate "license sets", and then train on them separately? Instead of one Copilot you'd have a "GPL Copilot", an "MIT Copilot", a "BSD Copilot", and so on and so forth. At least this would simplify the burden that's currently put on the user.

    Attribution would still be a problem though. How to attribute generated code that's a random mishmash from thousands of inputs? The only solution can be to attribute all of them by doing some sort of reverse match. This "IP scanning" must clearly also be provided by Copilot and not be offloaded to the user.

    TBH I would have expected that the "evil geniuses" who came up with Copilot would have thought about such obvious pitfalls beforehand. It's not much of an achievement to innovate by ignoring the law.

  • mnd999 2 years ago

    The model itself is a derivative work of GPLed code and thus should itself be open sourced. Granted by the time it’s spat out it’s probably an unreadable binary blob but it would be nice to have access to it for GPLed projects.

    • mbreese 2 years ago

      I’m not sure about the virality of GPL here.

      Assuming the generated code does not exist and is not part of the training set (these are two giant caveats), how would this be different from me reading GPL code to learn how to do something, and then rewriting it in my own words?

      If I have not copied, but learned from GPL code, is my new code GPL? No, it’s not a derived work in terms of copyright. Am I standing on the shoulders of giants? Yes. But so long as that concept is expressed in a novel way (not just changing variable names), then it isn’t a derivative work.

      I haven’t worked with copilot at all to know how verbatim the results are, but it is theoretically possible to train a model with GPL code and not get GPL code generated out the other side. (Again, here be dragons).

      • jen20 2 years ago

        > how would this be different from me reading GPL code to learn how to do something, and then rewriting it in my own words?

        No different - that is also not allowed. The idea of clean room reverse engineering exists exactly for this reason.

        • nathanielarmer 2 years ago

          Do you have a source for that claim?

          My understanding is that the GPL is based on copyright - and that copyright only protects a specific expression, not a general idea or concept. To suggest that if something is copyrighted, humans cannot learn from it and generate their own material seems absurd.

  • omegalulw 2 years ago

    There are much simpler solutions than that. Simply invest in or reuse plagiarism-detection models and tag output snippets to source repositories. You can either exclude this code or let the users know it's licensed code.

  • withinboredom 2 years ago

    Unless the model is truly coming up with something novel, there is something in its training set that is truly similar, if not precisely the output. It should say so when that is the case, and should also say whether or not the output is novel. I'm sure GitHub could provide an API for searching code snippets, if they don't already.

    • mcbuilder 2 years ago

      Yes, but Co-Pilot is more like a system that has spent a lot of time learning from open source code, yet has the logical reasoning ability of the 12-year-old the author mentioned, who only learned about the problem yesterday.

      But will it be capable of generating a copyright infringement? Let's imagine I want to implement an algorithm to find the shortest path between nodes in a graph. Well I remember from my MSc that you probably need to implement it with Dijkstra's algorithm, and I've implemented this a few times over the years in kata and leetcode. Now let's say I'm programming Rust, which I don't really know the syntax for so I go read some GPL code while browsing the net for syntax.

      Do I need to state that my implementation of Dijkstra's algorithm is 10% GPL, because I spent a few minutes reading syntax of a GPL file? What if someone else's copyrighted code looks 90% similar to mine, because hell there's only so many ways to implement it, can I get sued because there is 90% overlap and I might have looked at this code?

      These systems aren't even capable yet of generating more than a few functions, much less a coherent library. Even then, as they generalize more and gain more expressive power, they will be even less likely than they are now to copy training data into the output, a possibility I consider quite remote even with a relatively primitive Co-Pilot.

      • belorn 2 years ago

        Describing an artificial intelligence algorithm as a 12-year-old is a bit like describing advanced technology as magic. It kind of works as a way for people to relate to it. It stops working, however, as soon as someone starts to unravel how these systems actually function, and at some level we understand that neither true AI nor magic exists. It is just an illusion.

        When you write your implementation of Dijkstra's algorithm, you are hopefully using human intelligence to write it, not an illusion of intelligence. If it's original, then it's 0% of whatever GPL file you happen to have read at a previous date. If you copy something verbatim without using human intelligence, then it's 100% GPL.

        True artificial intelligence == magic. if not magic then math.

      • withinboredom 2 years ago

        Yeah, you do have to obey the terms of the license if you base it on someone else’s code even if it “is the only way to implement it.” Even if it is only a single line.

        At least according to Google’s monetization policy when it comes to music.

        • zarzavat 2 years ago

          No, you do not. Copyright protects creative expression. If there is only one way to do something, or only one good way to do something, then there is no creativity and hence no copyright.

          In particular an algorithmic concept cannot be copyrighted but a description of an algorithm in code can be. I think that many programmers believe that the algorithm is the copyrightable bit because that is the “hard part” compared to say the variable names and comments which are the easy part. But copyright cares much more about the latter than loops and conditionals, which have much less creative value. Mathematical expressions cannot be copyrighted.

    • moyix 2 years ago

      They're currently doing this automatically for catching exact matches, and you can turn on a setting that will suppress any suggestion that was found verbatim in the training data.

      But of course this wouldn't catch copies where variable names have been changed, etc. One thing that I think would be really interesting (but hideously computationally expensive) is to compute an embedding of each chunk of the training data and then query for the k nearest neighbors when Copilot generates some code, so you can see what the closest snippets in the training data are and evaluate for yourself if they're too similar.
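      A sketch of that k-NN audit with stand-in random embeddings; the corpus size here is tiny compared to Copilot's training set, which is exactly where the computational expense would bite:

```python
import numpy as np

# Embed every training chunk once (random stand-in vectors here), then at
# generation time find the nearest chunks so a human can judge similarity.
rng = np.random.default_rng(42)
N_CHUNKS, DIM = 50_000, 128   # toy scale; the real corpus has vastly more chunks
chunk_embeddings = rng.normal(size=(N_CHUNKS, DIM)).astype(np.float32)
chunk_embeddings /= np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)

def nearest_chunks(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k training chunks most similar (cosine) to the query embedding."""
    sims = chunk_embeddings @ (query / np.linalg.norm(query))
    return np.argpartition(-sims, k)[:k]

# Embedding of some freshly generated code (random stand-in).
query = rng.normal(size=DIM).astype(np.float32)
ids = nearest_chunks(query)
```

At real corpus scale you would need an approximate-nearest-neighbor index rather than this brute-force matrix product, but the audit it enables is the same.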

  • amelius 2 years ago

    That's no excuse. Attach all licenses if necessary.

    • sodality2 2 years ago

      And if the license terms conflict?

      • amelius 2 years ago

        Then you should have built two (or more) different non-conflicting models.

        • Gigachad 2 years ago

          Do you attach the licenses for every project you have ever looked at and learned something from? Do you switch learning modals when you use a MIT project?

          • amelius 2 years ago

            Perhaps I should. And if I built a product around copying other people's code, then I definitely should!

  • charcircuit 2 years ago

    It's not impossible. Amazon's CodeWhisperer is already able to tell you the license of code if what it generates is close to existing code.

    • treesprite82 2 years ago

      What's described as impossible is "[keeping the license] attached to each chunk of code as it wends its way through the model".

      Checking generated code for similarity against training set is possible, and is now done by both Copilot and CodeWhisperer. But it'll include code that just happens to be similar, even if that code had no influence on what the model generated.
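      A toy version of that verbatim check; the 20-character window and the corpus are made up (shipping tools use a much longer threshold), and note that it flags a match whether or not the training snippet actually influenced the generation:

```python
# Two hypothetical corpus documents; real corpora hold billions of lines.
TRAINING_CORPUS = [
    "for (int i = 0; i < n; i++) { sum += a[i]; }",
    "if err != nil { return err }",
]

WINDOW = 20  # illustrative; shipping tools use a much longer threshold

def contains_verbatim(suggestion: str) -> bool:
    """True if any WINDOW-length run of the suggestion appears verbatim in the corpus."""
    for start in range(len(suggestion) - WINDOW + 1):
        window = suggestion[start:start + WINDOW]
        if any(window in doc for doc in TRAINING_CORPUS):
            return True
    return False

flagged = contains_verbatim("x = 0; if err != nil { return err }")   # contains a verbatim run
clean = contains_verbatim("let total = xs.reduce((a, b) => a + b)")  # no such run
```

The check says nothing about provenance: a coincidental 20-character overlap trips it just as surely as a memorized one.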

    • jamal-kumar 2 years ago

      I wasn't aware that Amazon was developing something similar, how does it compare in terms of usefulness?

      I was finding the Copilot trial was pretty good at reading something like an adjacent CSV file and building out a struct with proper data types from the data in it, but beyond writing CSV->DB migration stuff I didn't really use it for a lot.

throwaway675309 2 years ago

"Notably, Microsoft doesn’t claim that any of the code Copilot produces is correct. That’s still your problem. Thus, Copilot essen­tially tasks you with correcting a 12-year-old’s home­work, over and over. (I have no idea how this is prefer­able to just doing the home­work your­self.)"

Sidestepping the whole license violation issue, using GitHub copilot is the equivalent of using an IME to type Chinese - I may not remember exactly how to write the character but I'll quickly recognize it when I see it. It's the difference between being able to write traditional Chinese versus the ability to read the characters.

  Anytime that I don't have to context switch and alt-tab to look something up on MSDN or Stack Overflow is an absolute freakin win for me as a developer.

It kills me that people can't seem to comprehend this.

  • Barrin92 2 years ago

    >It kills me that people can't seem to comprehend this.

    because it's not a good comparison. Code is not like natural language: natural language need not be syntactically correct, and people can generally infer correct semantics even from badly mangled communication.

    These automated coding solutions generate syntactic and semantic errors that another machine will not understand, and, even more importantly, they generate kinds of errors that people are not accustomed to.

    Copilot is more like an automated car that goes off into random directions in ways that even human drivers would not, and you never know when. And when humans have to interact with black boxes whose behavior they cannot anticipate, small errors are not merely small errors, they create significant tension and uncertainty that requires almost constant attention.

  • Gigachad 2 years ago

    I found that the code Copilot generated was so close to correct that my skim reading saw it as correct, but it was usually wrong in the smallest details, with mistakes I’d never make myself.

    That’s more just a product usefulness complaint rather than any kind of ethics. If it’s working for others, that’s good.

  • workingon 2 years ago

    Yeah, maybe for 10x types it’s different but for normies like me checking code is 10x faster than writing it.

    • taneq 2 years ago

      Especially when 90% of code is either glue or boilerplate, and having something semi-intelligently suggest which API calls to use and how to structure a call to them will save you most of the time you'd otherwise have to spend dredging through documentation and/or stackoverflow.

      One of the hardest things to come to terms with as a budding young software dev is that most real world code is as wide as an ocean and as deep as a puddle, and the occasions where it's harmless (let alone actually helpful) to be clever are few and far between.

  • 2c2c2c 2 years ago

    it's basically just an example of dan luu's developer velocity article.

chrischen 2 years ago

I've been using Copilot during the beta and it's been absolutely amazing. That being said I mainly rely on it to autocomplete the rest of the line only, and it works great as a fancier auto complete. I can't imagine it being a copyright issue for this use case because the completions are what I would have written anyways. I'd probably never trust it to write a whole block of code implementing something and I think they definitely should add a feature to disable that because autocompleting only lines of code is just as useful.

It works even better in languages like Haskell or OCaml, where there are often only a couple of ways (often only one way) valid code could be written once you type part of it out, and if it does spit out invalid code you get an instant compiler warning.

  • throwntoday 2 years ago

    Personally I hate autocomplete. I've had peers that were shocked I prefer to type everything out manually, and my IDE is essentially a text editor with syntax highlighting. To each their own of course, but I find the increase in productivity very quickly becomes a crutch for just remembering things.

    • chrischen 2 years ago

      There are 3 ways autocomplete can be used:

      1) As automatic documentation in a well-typed application. Autocomplete shows you the public properties or methods available based on what data you have started typing, and it's not guessed but guaranteed to be valid assuming your types are correct.

      2) As dumb autocomplete, where it just tries to guess based on the symbols in the document, and save you some keystrokes/remembering.

      3) As the Copilot style autocomplete where it will finish the entire line, and not just the next token—with the downside that it's really just guessed.

      I can see where 2 and 3 can be avoided, but 1) is really invaluable and often almost required in some strictly typed languages. I remember the days of writing PHP where I had to google for the function signatures every time because of inconsistent naming, and lack of typing. Memorizing these things is honestly a waste of a programmer's brain space. But now that I write exclusively in typed languages, auto-complete in the form of 1) has been invaluable not for the purposes of remembering names of things but being able to see what functions/methods are valid and can be used. No need to look up documentation manually anymore.
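      Mode 1 really is mechanical rather than statistical: given the receiver's type, the set of valid completions can be enumerated exactly. A toy Python version (the Account class is a made-up example):

```python
# Type-driven completion, sketched: enumerate the receiver's attributes
# rather than guessing tokens.

class Account:
    def __init__(self):
        self.balance = 0

    def deposit(self, amount):
        self.balance += amount

    def withdraw(self, amount):
        self.balance -= amount

def completions(obj, prefix: str) -> list[str]:
    """All public attributes of obj matching prefix - guaranteed valid, not guessed."""
    return sorted(a for a in dir(obj)
                  if a.startswith(prefix) and not a.startswith("_"))

completions(Account(), "dep")   # only 'deposit' can legally follow
```

That guarantee is what separates mode 1 from the guessed modes 2 and 3.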

    • alpaca128 2 years ago

      I enjoy Vim's non-automatic completion where I only get suggestions and completions when I press the key combination. It's helpful to complete long words and still distraction free.

      Automatic completion or suggestion popups are super annoying, though. Even worse, some IDEs will dynamically reformat the code including the line I'm still typing. Feels like Clippy in MS Word but instead of showing options it automatically just does something random without asking. Efficient keyboard shortcuts make unreliable "helpers" redundant.

    • Barrin92 2 years ago

      Turning off autocomplete forced me to actually learn and memorize again, not just libraries, but even my own code. I didn't even notice how bad it had gotten when I went from IDE to plain text editing, felt like my brain had atrophied. Like when you hear people say they can't navigate any more without an app.

      • throwaway675309 2 years ago

        That's not really a good comparison. They've done actual studies on London cabbies and found measurable differences in their hippocampi due to the spatial reasoning they have to do.

        Whereas the difference between two developers one who remembers the exact parameters for some random arbitrary library and the other who relies on autocomplete is probably negligible. There's a huge difference between rote memorization and fluid intelligence.

        What's more important is your ability to solve problems and develop new algorithms and you can do that in pseudo code without memorizing a single pointless parameter list.

    • coredog64 2 years ago

      Autocomplete has actually turned into a negative for me, at least in VSCode. It keeps autocompleting things that don’t make any sense, it’s overly aggressive on doing it (like 3 characters), and is generally a PITA. I now have to undo the completion and then retype with a rapid escape key action.

      • chrischen 2 years ago

        I think this depends on how you configured autocomplete. In my Neovim setup autocomplete is opt in (you have to press tab to complete). It sounds like you have default autocomplete setup for some reason.

        • Gigachad 2 years ago

          Vscode has it on enter and it autocompletes insane things. You’ll type “end” to finish a block in ruby and it will auto complete to some weird function name.

    • omegalulw 2 years ago

      You are just wasting time. You should upskill yourself to the point where you can write code in bare vim. Beyond that, use an IDE with good features to save time; that's what IDEs are for.

_7bxa 2 years ago

IMO, all these concerns about licensing ignore a pretty important fact: Copilot is (in most cases) no different from a human.

If I read the code for, I don't know, some GPL-3 library and then write my own MIT-licensed version--that's totally fine.

A programmer can read strictly licensed code and then use that knowledge to write their own non-strictly licensed code.

Copilot is not different from a human. It has knowledge and it uses it. It isn't copy-and-pasting (there are one or two edge cases where it is, but for the most part it's new ideas).

It's the same thing as saying "Dall-E 2 is plagiarizing art".

When I write a quicksort algorithm, I don't give any attribution to the code I saw for the algorithm in some random library.

Fundamentally, there's no real difference between Copilot and a human. I've watched Copilot write crazy lodash one-liners that were clearly contextual to my code.

I think what is fundamentally happening here is that older people / people of the last generation are realizing that just as sys admin jobs / etc. are going away, soon many rote coding jobs will be taken away since Copilot will automate them. And that's fine, but it is producing backlash which comes in the form of licensing issues.

Basically, there's no difference between Copilot and Dall-E and it's pretty clear that Dall-E has no licensing issues, thus Copilot should also be in the clear.

  • BeefWellington 2 years ago

    > IMO, all these concerns about licensing ignore a pretty important fact: Copilot is (in most cases) no different from a human.

    This is incorrect. Humans are capable of creating. Copilot is merely capable of regurgitating.

    > If I read the code for, I don't know, some GPL-3 library and then write my own MIT-licensed version--that's totally fine.

    It's actually not totally fine and there have been many many many court cases over this sort of thing, both with non-commercial and commercial licenses. The whole concept of clean-room implementation exists as a defense to this.

    > When I write a quicksort algorithm, I don't give any attribution to the code I saw for the algorithm in some random library.

    If it's substantially similar to the library's, it's entirely possible you're violating their license terms and/or copyright.

    > Basically, there's no difference between Copilot and Dall-E and it's pretty clear that Dall-E has no licensing issues, thus Copilot should also be in the clear.

    I'm not sure this is correct. If DALL-E began outputting verbatim copies of other people's works, they could very well be sued over it. Similarly, if it produced trademarked symbols like the Nike swoosh, the Golden Arches, or the Starbucks logo, it's not like those aren't going to get you sued.

    Infringement is about the produced thing (code block or image or whatever else) and its use, not the method of generating it.

    • _7bxa 2 years ago

      Nothing you say is credible unless you tell me you have tried copilot. Without you having tried copilot and seen its capabilities we have no common ground. So get back to me once you have used it.

      I claim copilot is capable of creating. When copilot writes an amazing lodash one liner in my code --that isn't regurgitating an existing lodash snippet, it's creating something new to fit my use case. This is undeniable and there is nothing to argue here. Copilot regularly looks at my code and writes new code that is highly specific to my existing code. It's honestly better than me at languages I don't know well (like C).

      And yes, 0.1% of the time copilot is spitting out existing code verbatim; the other 99.9% of the time copilot is not overfitting and is synthesizing new code. Luckily, most companies don't need clean-room implementations.

      Copilot writes code that is adapted to my existing code base. This is quite obviously and undeniably not just regurgitation because my codebase is unique to me.

      Lastly copilot is a better programmer (locally) than many of my peers. It can write better lodash one liners, amongst other things, and while that's embarrassing it's true.

      • BeefWellington 2 years ago

        > Nothing you say is credible unless you tell me you have tried copilot. Without you having tried copilot and seen its capabilities we have no common ground. So get back to me once you have used it.

        I posted elsewhere in this thread talking about my experiences (both old and recent) using it. Dismissing a reply you don't like simply because of a (terrible, given the tool is very available) assumption is just a poor quality response.

        > I claim copilot is capable of creating. When copilot writes an amazing lodash one liner in my code --that isn't regurgitating an existing lodash snippet, it's creating something new to fit my use case. This is undeniable and there is nothing to argue here. Copilot regularly looks at my code and writes new code that is highly specific to my existing code. It's honestly better than me at languages I don't know well (like C).

        > And yes, 0.1% of the time copilot is spitting out existing code verbatim; the other 99.9% of the time copilot is not overfitting and is synthesizing new code. Luckily, most companies don't need clean-room implementations.

        Where do you get this figure of 0.1%? I'm not aware of anyone having studied it, and absent that the figure seems entirely fabricated and could be higher or lower. Indeed, the way it works via prompting suggests that an overall percentage is irrelevant if 100% of the time you ask it for specific things it generates copyrighted or strictly licensed code. If you have references though, I'm interested.

        > Copilot writes code that is adapted to my existing code base. This is quite obviously and undeniably not just regurgitation because my codebase is unique to me.

        Using your code to show you code you might likely write isn't "creative" and is exactly regurgitating what it's seen. Is a Markov chain "Creative"? That's essentially what you're describing here.

        > Lastly copilot is a better programmer (locally) than many of my peers. It can write better lodash one liners, amongst other things, and while that's embarrassing it's true.

        You've used lodash one-liners as an example a couple of times now. Why? Why is that a litmus test for a good programmer? What makes the copilot generated ones superior? Do you have examples?

        My experiences with copilot, as I've shared elsewhere, are that about 90% of the time it produces code that needs to be debugged, doesn't quite fit the coding conventions we use, and often doesn't do exactly what I'm looking for. It's fine if you're using it to stub in specific kinds of boilerplate, or to generate a function for some very standard math thing that exists or should exist in a library somewhere.

        • marmada 2 years ago

          The 0.1% figure comes from OpenAI: https://twitter.com/eevee/status/1410037309848752128 (the person in the thread disagrees with me; the image is what's relevant).

          I talk about Lodash one-liners because it makes it pretty obvious that Copilot is not just copy-pasting code (which would be copyright infringement). It's quite unlikely (even if we consider all variables to have the same name) that Copilot is copy-pasting an exact copy of some other snippet, given that the snippet is very specific to my problems. (I'm not asking it to write a generic math function, I'm asking it to use lodash on data structures in my codebase to accomplish a very specific outcome). By quite unlikely I mean about 1/(2^32), if we consider the one-liner to be composed of 32 different AST nodes and assume each node independently has a coin-flip chance of matching an existing snippet.
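          Spelling that back-of-envelope model out (toy assumptions, not a claim about how Copilot works: the one-liner decomposes into 32 AST nodes, and each node independently has a coin-flip chance of coinciding with an existing snippet):

```python
# Toy model: probability that a 32-node one-liner coincides node-for-node
# with an existing snippet, assuming each node is an independent coin flip.
# Both numbers here are assumptions from the comment, not measurements.
nodes = 32
p_node_match = 0.5
p_verbatim = p_node_match ** nodes
print(f"P(exact match) ~ {p_verbatim:.2e}")  # ~ 2.33e-10
```

          Even under far less generous per-node assumptions, the number stays tiny, which is the point being made.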

          > Is a Markov chain "Creative"? That's essentially what you're describing here.

          Have you read the paper that (eventually) inspired Copilot, "Attention Is All You Need"? It's not a Markov chain; there are many, many different steps and layers. I think if you know and understand the fundamental building block that is responsible for it working (the "transformer"), a lot of the worries around plagiarism go away.
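          For contrast, here is what an actual Markov chain looks like: a toy bigram model that can only replay token transitions it has literally seen in its training text, with no notion of wider context (a sketch for illustration only, nothing to do with Copilot's real architecture):

```python
import random
from collections import defaultdict

# Toy bigram Markov chain: the next token depends only on the current
# token, sampled from transitions observed verbatim in the training text.
def train_bigram(tokens):
    table = defaultdict(list)
    for cur, nxt in zip(tokens, tokens[1:]):
        table[cur].append(nxt)
    return table

def generate(table, start, n, seed=0):
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        choices = table.get(out[-1])
        if not choices:
            break
        out.append(rng.choice(choices))
    return out

corpus = "for i in range ( n ) : total += i".split()
table = train_bigram(corpus)
print(" ".join(generate(table, "for", 8)))  # for i in range ( n ) : total
```

          A transformer, by contrast, conditions every output token on the entire preceding context through many attention layers, which is why it can adapt a suggestion to a codebase it has never seen.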

          Also. What is coding if not a search problem over a very large space? I'm searching for the next N lines to write over the space of all possible lines. To guide my search I use things like my prior knowledge.

          In my head, this knowledge is encoded using neurons. In Copilot, the knowledge is encoded in parameters. Abstract concepts in my head are encoded in neurons. In Copilot's "head", abstract concepts are encoded using "embedding vectors" of maybe 512 dimensions (could be more or less, not sure). Yeah, maybe I use more than 512 dimensions to encode a concept, but still, I don't see a huge difference. If I'm not plagiarizing, then I can't imagine Copilot is (except in the 0.1% of cases where it's overfitting).

    • EMIRELADERO 2 years ago

      > It's actually not totally fine and there have been many many many court cases over this sort of thing, both with non-commercial and commercial licenses. The whole concept of clean-room implementation exists as a defense to this.

      I suggest you read the Sony v. Connectix appeal verdict.

      • BeefWellington 2 years ago

        This actually only bolsters my point.

        Is using copilot worth the risk that you'll be sued because some developer unintentionally shipped fully-reproduced code?

        • EMIRELADERO 2 years ago

          Stare decisis is supposed to account for that. The Connectix decision has created a safe space for both emulators and non-cleanroom reversing.

          • BeefWellington 2 years ago

            Well, the recent Supreme Court ruling would suggest that stare decisis is unreliable at best.

  • Tyr42 2 years ago

    > If I read the code for, I don't know, some GPL-3 library and then write my own MIT-licensed version--that's totally fine.

    There's a reason people sometimes clean-room document some code, then have a different set of people who never read the source re-implement it without ever having seen it: to avoid exactly these kinds of issues.

    I don't think that's always fine.

    • EMIRELADERO 2 years ago

      Which is done purely as a speculative precaution and isn't based on any case law. In fact, the little case law that exists on the idea/expression distinction as it relates to software copyright ended up in favor of the direct source code/disassembly reading approach (Sony v. Connectix).

    • _7bxa 2 years ago

      Yeah, using Copilot in a clean room implementation is bad. Luckily, many things don't have to be clean room implementations!

  • josephcsible 2 years ago

    > If I read the code for, I don't know, some GPL-3 library and then write my own MIT-licensed version--that's totally fine.

    > A programmer can read strictly licensed code and then use that knowledge to write their own non-strictly licensed code.

    If you've ever read the Windows source code, you're never allowed to contribute any code of your own to Wine or ReactOS.

    • EMIRELADERO 2 years ago

      And that is based purely on speculation and has no legal basis whatsoever. Whatever happened to the idea/expression distinction?

    • _7bxa 2 years ago

      Yeah, there are probably some edge projects where Copilot can't be used. For the vast majority? Seems fine.

    • google234123 2 years ago

      That’s funny, because a lot of ReactOS is clearly copied from the leaked Windows Research Kernel.

      • josephcsible 2 years ago

        Is there actually any independently verifiable evidence of this, or do you believe it solely because a Microsoft employee said it was true?

  • kixiQu 2 years ago

    Why do you think it is clear that DALL-E has no licensing issues? Determining what art is ripping off what (to a legally meaningful extent, not cutesy "great artists steal" bullshit) is not at all clear cut, even before automated transformations get involved.

    • Gigachad 2 years ago

      What we are discovering is that copyright is largely bullshit. Five-line code snippets shouldn’t have any copyright.

      Copyright should apply only to large and whole pieces of work. The whole of a painting should be copyrighted; the style and technique should not. Same for code: Windows as a whole should have copyright, but the snippet that handles a mouse click should not.

      • williamcotton 2 years ago

        If you take a look at existing case law this is basically the current interpretation. There is a notion of de minimis, for example.

  • trention 2 years ago

    >Fundamentally, there's no real difference between Copilot and a human

    Fundamentally, it's absolutely OK to allow humans to do the thing X and to prohibit AI from doing it. This is what I hope will happen here (though it probably won't).

    >soon many rote coding jobs will be taken away since Copilot will automate them. And that's fine

    Good luck finding enough non-rote jobs to re-employ those developers. It will be an interesting reflection though when "just learn to code" turns into "just learn to lay bricks for $10 an hour".

eduction 2 years ago

I find it extremely funny that programmers are upset that their own work is being used as fuel for the same kind of software exploitation that they have, as a group, gleefully and with little regard for the moral rights of creators inflicted on others.

When it was books, films, videos, news articles, journal articles, blog posts, photographs, extremely private personal data, music, art and other creative works being hoovered into the giant AI exploitation cloud in the sky, and non programmer creatives complained, they were generally labeled as luddites, trolls, whiners, maladapted (“learn to code”), and just generally ignorant worthless trash.

Now that it’s happening to you oh boy do you care.

  • analog31 2 years ago

    Indeed when it was recorded music, the musicians were the ones being blamed for having "the wrong business model" or an inconvenient distribution mechanism.

  • metalrain 2 years ago

    I don't really mind copyright at all. But I'm worried automation tools take away my (false) sense of control and the art of coding.

    Why care about variable naming since AI will generate most names anyway, why care about common libraries if you can always get good implementation for specific use case in a second or two?

    I know these small details don't matter in the big picture, code exists to do something and if it does, it doesn't matter how it was made.

    • BeefWellington 2 years ago

      The library point is especially instructive IMO.

      Copilot introducing "learned" security issues, for example, is a problem separate from the copyright risk.

      Currently, it's often a matter of updating a dependency to patch things away. How do you do that when you barely understand the code it helpfully generated for you?

      People will jump to say "well that's using it wrong" but the reality is if you allow this in your org people will use it this way.

  • langsoul-com 2 years ago

    Every group cares about their own interests first, whilst everything else is just progress.

eterevsky 2 years ago

I've seen estimates that verbatim copies of training code constitute around 0.1% of the code produced by Copilot. It would be relatively straightforward to implement a verification step that removes this code. I would be surprised if it hasn't been done already.
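One way such a verification step could work, purely as a sketch (hashed n-gram lookup against the training corpus; the names and the token-based normalization are my assumptions, not anything GitHub has documented):

```python
import hashlib

# Hypothetical verbatim-copy filter: index hashed n-grams of training
# code, then flag a suggestion if any of its n-grams appears verbatim.
def ngrams(code, n=10):
    toks = code.split()
    for i in range(len(toks) - n + 1):
        yield hashlib.sha1(" ".join(toks[i:i + n]).encode()).hexdigest()

def build_index(training_files, n=10):
    return {h for code in training_files for h in ngrams(code, n)}

def is_verbatim(suggestion, index, n=10):
    return any(h in index for h in ngrams(suggestion, n))

train = ["def add ( a , b ) : return a + b  # classic helper pasted everywhere"]
index = build_index(train, n=5)
print(is_verbatim("def add ( a , b ) : return a + b", index, n=5))  # True
print(is_verbatim("def mul ( x , y ) : return x * y", index, n=5))  # False
```

In practice you'd want smarter normalization (strip whitespace and comments, canonicalize identifiers), but the principle is the same: any verbatim span longer than n tokens gets caught by at least one n-gram.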

With verbatim code copies out of the way, I don't see any basis to consider code produced by Copilot copyrighted. So unless you have some problem with the fact that portions of your code will be public domain, I don't see any reason not to use it.

And one more thought. The author gives an example of using Copilot to list prime numbers. That's not a good use for it. Copilot and similar systems are primarily useful for writing boring boilerplate code, saving your time for more involved parts.

  • overthetop2 2 years ago

    > The author gives an example of using Copilot to list prime numbers. That's not a good use for it.

    That is literally the first kind of usage promoted on the main landing page for Copilot. It shows Copilot filling in the body of function definitions in three different languages.

  • keithnz 2 years ago

    I agree, it needs to ensure no verbatim code. I also noticed that there is an option not to use public code. But I actually like the idea of using open source code. In one breath we (software devs) have recommended reading open source software to learn coding; then, when someone tries to automate that using AI so that it contextually helps you write code, some people seem upset that we can automate extracting code knowledge / patterns because we didn't do it with our wetware. I think it should be highly encouraged for people to extract knowledge from open code, even if at the moment it is far from perfect. The boilerplate stuff at the moment is great, and from time to time it writes some good contextually aware stuff as well. So, in summary: if humans are allowed to learn from open code, then I think AI should be allowed to learn from open code too.

  • axg11 2 years ago

    I believe this is already a feature that you can activate for Github Copilot.

    • ShamelessC 2 years ago

      It is. One thing you'll notice about all the copilot detractors - they frequently haven't actually tried it themselves.

      • iostream24 2 years ago

        I can’t think of a use case for it. By the time I have decided upon an approach, I likely have the constituent bits already somewhat composed; Copilot would only get in the way.

        Assuming a use case, why on earth would I trust such a Trojan Horse from Microsoft, seeing as how it’s likely serving its master in ways I can only guess at?

        It’s not a useful tool, imo, but then I don’t use IDEs or autocomplete.

        Maybe I’m not the target developer.

        It’s a duck problem. Quacks like trouble, smells like trouble, looks like trouble, comes from an arch-enemy of FOSS that bought a major FOSS hub and is now seeking to do something with its purchase, which I’ll wager isn’t a good upright wholesome thing.

        • eterevsky 2 years ago

          Your comment sounds like "I hate it, but I can't find any concrete problems that I can criticize".

          • iostream24 2 years ago

            Or you can read my actual comment, which states that I don’t have a use case for it and that I de facto don’t trust its corporate owner, whose slogan was literally “embrace, extend, extinguish” a few years after the birth of the web.

            Why invent things when I was perfectly clear?

            • eterevsky 2 years ago

              I read your comment. The sentiment about MS being anti-open source is ridiculous. Microsoft contributes tons of code to open source. I am not aware of any anti-open source efforts in MS under the current CEO.

              You also write "Quacks like trouble, smells like trouble, looks like trouble" which is simply baseless. This particular passage triggered my observation that you simply hate it without any justification.

              I totally appreciate that you might not find Copilot useful, but your comment went farther than that.

      • xigoi 2 years ago

        Why would I pay $10 for something just to be allowed to criticize it?

  • withinboredom 2 years ago

    I vaguely remember that generated code can't be copyrighted either. So you end up with some code that you can't even own and some that you can? How does that work?

    • _gabe_ 2 years ago

      This doesn't sound like it would be possible. Depending on how strict your definition of generated code is, it would be impossible to copyright any compiled code.

      If you have a looser definition, I don't see why you wouldn't be able to copyright code that you generated by using Unreal Engine's Blueprint-to-C++ utility, or any other tool that assists in transpiling code. 99% of the web consists of transpiled JS these days.

      • withinboredom 2 years ago

        There's a difference between "generated" and "processed/transformed through a process". The fact that the chemical composition of paint changes as it dries does not change the art.

        • _gabe_ 2 years ago

          According to Webster's, to generate is:

          > : to define or originate (something, such as a mathematical or linguistic set or structure) by the application of one or more rules or operations[0]

          Creating something from the application of rules or operations sounds just like what a compiler does. Whereas to process something is:

          > : treated or made by a special process especially when involving synthesis or artificial modification

          So if anything it sounds like you're asking if processed code is copyrightable. But this is just quibbling over pedantry. My original point stands. AI is just following a predefined set of rules to transform your request into code. It's a program, just like a compiler is a program. So it would be really hard to say that AI generated code is non copyrightable, but compiler/transpiler/fuzzy generated code is copyrightable.

          [0]: https://www.merriam-webster.com/dictionary/generate

          [1]: https://www.merriam-webster.com/dictionary/process

          • withinboredom 2 years ago

            I think we can all agree that copilot “generates” code. In that the code did not exist until copilot suggests it. I also think we can agree that a compiler turns existing code into new code that does the exact same thing the code describes. If we can’t agree on that, we have some pretty fundamental definitions to figure out in the courts.

    • eterevsky 2 years ago

      You'd own the copyright for the whole program, but not for implementations of some functions.

justin-tm 2 years ago

> Why? Because as a matter of basic legal hygiene, I expect that orga­ni­za­tions that create software assets will have to forbid the use of Copilot and other AI-assisted tools

I feel like this understates the wild west nature of software in non-tech Fortune 500

  • Gigachad 2 years ago

    We have companies literally taking the whole Linux kernel and not following the license out there. There is no way the legal system gives a shit about AI generating the same textbook degrees_to_radians function.

    Copilot never generated anything for me that would justify any kind of copyright.
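    For context, the textbook function in question is essentially a single expression that any independent implementation converges on:

```python
import math

# The standard degrees-to-radians conversion; there is only one
# reasonable way to write it (the stdlib even ships it as math.radians).
def degrees_to_radians(degrees):
    return degrees * math.pi / 180.0

print(degrees_to_radians(90.0))
```

    It's hard to imagine a court treating a one-line formula like this as protectable expression rather than an unprotectable idea.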

    • BeefWellington 2 years ago

      Conversely, we have Oracle suing Google successfully for billions over function and property names. Eventually the Supreme Court overturned the result but how many businesses want a fight that expensive?

      It'll be highly dependent upon the type of output it generates.

      I've played with it again recently because of the updates around disabling open code and such, and it still strikes me that it's not worth the potential risk.

      Even setting aside the "could I get sued for this" question, the value just isn't there; half the time functions are just poorly written, inefficient, or buggy. For very simple actual math stuff (e.g.: a function to replicate any moderately complex spreadsheet function) it seems to work well. It seems to really struggle with common boilerplate stuff like simple HttpServer in Java, Flask blueprints, and so on. It may be bias due to the projects I've worked on, but I don't do a lot of my own implementations of calculating compounding interest rates, incredibly simple array slicing, or testing if a given number is prime.

      • iostream24 2 years ago

        You mentioned the Supreme Court, and I feel obligated to point out that one can literally no longer speculate on what odd judgements they may hand down.

        They may find that SCO owns Linux, for all we know… Vote. Yesterday would have been ideal, but moving forward, remember to vote, and remember which party hates common sense.

      • justin-tm 2 years ago

        Yeah, but that still feels like an edge case compared to the hundreds of small software assets produced by companies every day. In most cases those things don't even have the potential to get to an "Oh god we pissed off Oracle" size

  • KMag 2 years ago

    I worked for over a decade at a Fortune 500 investment bank. I suspect they'll be very wary of Copilot, as well. I wouldn't lump all non-tech companies together.

    • ShamelessC 2 years ago

      Not gonna stop disgruntled employee #0151 from using it!

      • sidlls 2 years ago

        I think you underestimate the ability of Fortune 500 IT departments to lock down laptops and workstations. It's difficult, but far from too difficult.

        • Gigachad 2 years ago

          I have heard these days a lot of programmers in these companies do all of their development inside of docker or VMs so they can actually get stuff done without filling out an approval form to update their linter.

          • KMag 2 years ago

            At least where I worked, the HTTPS proxy blocked most downloads. Most software these days can install fine for a local user; it's more a matter of getting the installer. But it was a pretty easy process for non-GPL3 open-source software: fill out a web form with the URL for the installer/source tarball and a URL for the license, wait a few hours, and the installer has been virus scanned and is available in the internal mirror repository of installers.

      • KMag 2 years ago

        We had internal repositories. You go to a website, give a URL for the library or executable/installer you want to use, and a URL for the license it's under. A few hours later, you get an email that it has been approved, downloaded, virus scanned, and is available from the internal repository.

        I think the HTTPS proxies necessary to reach the outside would block the communications necessary for Copilot to work.

withinboredom 2 years ago

I don't know why the title is editorialized here … but the actual title of the article, "THIS COPILOT IS STUPID AND WANTS TO KILL ME", reminded me of a Tesla holding the left lane of the Autobahn and going only 100 km/h. As I passed it on the right (a totally illegal maneuver on my part), honking my horn, I saw the driver reading a book and wondered how long that "driver" was going to live before being rear-ended by someone doing 150-200 km/h.

Maybe all AI secretly wants to kill us.

  • gus_massa 2 years ago

    > I don't know why the title is editorialized here …

    I saw the post some minutes ago and it was posted with the original title. I guess the mods changed it. The part about wanting to kill the writer is an exaggeration. It's usual to change a title to the subtitle or a relevant sentence, but without too much cherry-picking (the last part is not very clear).

    Someone made a tracker for these changes: https://hackernewstitles.netlify.app/ HN discussion: https://news.ycombinator.com/item?id=21617016 (366 points | Nov 23, 2019 | 94 comments). Most of the changes make sense (if you forget the infamous case of the asteroid/rocket part).

  • iostream24 2 years ago

    One need not even go so far as AI; my mobile phone tries to kill me on a daily basis. Both brands.

teaearlgraycold 2 years ago

Or, foolishly optimistically, we might reach a Kleenex-like state for all software published on GitHub. The destruction of software IP.

A man can dream.

vlthr 2 years ago

I have a really hard time understanding what future world this article (and other detractors) is arguing for and why that world is made better by taking their arguments seriously.

On the practical level I agree with the part advising caution to those that might end up embedding an identifiably licensed snippet in their codebase via copilot. I also agree that copilot users plagiarizing significant chunks of GPL code for profit is immoral. This needs to be prevented.

I also share the frustration stemming from big companies leveraging their disproportional access to data and resources for profit given that the greatest value of these models is precisely the open source code it is trained on.

Ultimately though, what I care about is the potential for building better tools. LLMs potentially offer paths towards genuinely new forms of human-machine interaction, and I don’t want that exploration to be suffocated by legalism.

  • overthetop2 2 years ago

    If one believes in the rule of law, surely it can't be true that Microsoft (and others) should get to enjoy all the benefits of intellectual-property laws (when convenient) and other times steamroll them (when convenient).

    Wealthy corporations are never going to be "suffocated by legalism" — they can afford to do their research privately. (And many still do.) The issue here is that Copilot is being foisted into the agora seemingly without sufficient (or maybe any) scrutiny of its legal consequences.

    More broadly, we are seeing a norm emerging where there is so much hype chasing AI that these wealthy corporations (see also: Tesla, of course) have a huge incentive to push their experiments into the public sphere prematurely, simply to assert their primacy.

    BTW this technique of front-running regulatory scrutiny can still backfire. If the initial public experience with an emerging technology is sufficiently bad, it can poison the acceptance forever. IOW, you can be suffocated by your own hubris faster than any external legalism.

    • vlthr 2 years ago

      I’m definitely not worried for Microsoft or the other big tech companies developing copilot-like products. To the extent that legal blowback focuses on issues that are both impactful and solvable (e.g. plagiarizing non-trivial snippets), they should be held to a high standard. Your point about the risk of poisoning the public’s acceptance of these technologies also resonates with me.

      What worries me the most is the effect the public backlash towards these big companies can have on smaller actors that could enter this space in the near future. In the past we’ve seen open source projects like GPT-J come together to fund and reproduce closed models, and if we’re not careful to be nuanced in our criticism of big-tech frontrunners we might end up poisoning the waters enough to deter small actors without a dedicated legal team.

      Copyright law is ultimately designed around humans as the only kind of actor. In an ideal world we would sit down and think about the way non-human learners should fit into this system and the balance of tradeoffs we want those laws to aim for. I hope that happens someday, but until then I hope we can cultivate a world where small actors are able to experiment with these technologies without fear of legal action.

      That’s why it bothers me to see people arguing that language models should be thought of like human programmers making derivative works, even suggesting that we should require attribution for all generated outputs (i.e. the entire training set, always). That helps nobody, except of course big companies with infinite manpower.

fxtentacle 2 years ago

I believe the solution is obvious: don't store all source code in a central place where it can be harvested automatically. I.e. don't use GitHub.

readthenotes1 2 years ago

Never fear. Copilot will hire a passel of lawyers not long after it's declared sentient (by some lawyers)

da39a3ee 2 years ago

Jesus christ, people here are boring about Copilot. It’s amazing! It works fantastically well. It’s good not to keep it on all the time, because that gets distracting, but in certain circumstances it’s genuinely useful, and it’s always impressive to see the way it is not regurgitating but has a real knack for producing uncannily apt code that takes in multiple details of the local context. It’s a radical advance; a qualitative leap forward. The tedious idiots who miss all that, who can’t appreciate the quality just because it comes from a mega corporation, and who have spent the last year blathering on about how it is “just copying public code”, are mean-spirited, ungracious, jealous luddites.

prohobo 2 years ago

Ethics and scrutiny of downstream effects of technological innovation - something we as a society continue to pretend is not important, even as it destabilizes the entire god damn planet.

throwoutway 2 years ago

> On the one hand, we can’t expect Microsoft to offer legal advice to its zillions of users or a blanket indem­ni­fi­ca­tion. On the other hand, Microsoft isn’t sharing any of the infor­ma­tion users would need to make these deter­mi­na­tions.

Should someone ask a senator's chief-of-staff to send a strongly worded letter (on behalf of the senator) to ask Microsoft to publicly answer these pressing questions? We're several months in and the silence is deafening

zacwest 2 years ago

> In the large, I don’t think the prob­lems open-source authors have with AI training are that different from the prob­lems everyone will have. We’re just encoun­tering them sooner.

Call me a pessimist, but it seems really unlikely that AI itself will cause problems so much as continually increasing automation will. It makes such terrible choices around qualitative decisions, and extending existing AI into a general solution pretty much always fails.

williamcotton 2 years ago

If this is the viewpoint of a lawyer, where is all of the discussion around the idea/expression dichotomy? How about abstraction-filtration-comparison? No references to case law? No breakdown of the different forms of IP?

fgsdfgsdfg 2 years ago

The singularity is this: it is about scale, not about radical technological changes. Those changes are occurring, but they are just the means to an end: scaling things up radically faster than any human-scale process can adapt, whether attention spans, legal timelines, or available resource quotas.

So yeah, most of the complaints are quite right, but how do you enforce them? You could go after maybe 10-50 infringements, but you'd still have the rest of the planet full of non-complainers (especially if your code was popular, good, and efficient at the moment Copilot learned it).

So the problem isn't what's right or wrong; it's that, short of shutting down Copilot, most of the alternative solutions have nowhere near enough impact, not even close to the change Copilot is creating right now.

How does the singularity begin? It could have already started.

twawaaay 2 years ago

Get a couple of lawsuits over individual pieces of code used without attribution, and pretty much every large company will instantly ban Copilot from their development process.

tpoacher 2 years ago

It means that sourcehut is suddenly an even more attractive option!

And that a lot of GitHub's autofills will read "please see my sourcehut repository for details".

captainmuon 2 years ago

Legal arguments aside, Copilot is a nice and useful tool and should be allowed to exist. Please let's not ruin it.

If it has issues in the current framework of Copyright law, then that is yet another reason to change that framework.

mrfusion 2 years ago

How about this: any company that's worried about its code being copied submits a copy for Copilot to check against. If Copilot generates any of that code, it just doesn't output it.

  • ntoskrnl 2 years ago

    "Enter your social security number on our site to check if it's been leaked!"