Lessons from Creating a VSCode Extension with GPT-4 bit.kevinslin.com 206 points by kevinslin 13 days ago
The author starts out with an excellent observation:
I have been working on this problem quite a bit lately. I put together a writeup describing the solution that's been working well for me:
The problem I am trying to solve is that it’s difficult to use GPT-4 to modify or extend a large, complex pre-existing codebase. To modify such code, GPT needs to understand the dependencies and APIs which interconnect its subsystems. Somehow we need to provide this “code context” to GPT when we ask it to accomplish a coding task. Specifically, we need to:
1. Help GPT understand the overall codebase, so that it can decipher the meaning of code with complex dependencies and generate new code that respects and utilizes existing abstractions.
2. Convey all of this “code context” to GPT in an efficient manner that fits within the 8k-token context window.
To address these issues, I send GPT a concise map of the whole codebase. The map includes all declared variables and functions with call signatures. This "repo map" is built automatically using ctags and enables GPT to better comprehend, navigate and edit code in larger repos.
The writeup linked above goes into more detail, and provides some examples of the actual map that I send to GPT as well as examples of how well it can work.
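To make the "repo map" idea concrete, here is a toy sketch of the formatting step (not aider's actual implementation; the tag entries below are hypothetical stand-ins for what `ctags --output-format=json -R` emits):

```python
import json
from collections import defaultdict

# Hypothetical sample lines in the shape produced by universal-ctags
# with `--output-format=json`. In practice you would run ctags via
# subprocess over the whole repo.
SAMPLE_TAGS = """\
{"_type": "tag", "name": "GitRepo", "path": "aider/repo.py", "kind": "class"}
{"_type": "tag", "name": "get_tracked_files", "path": "aider/repo.py", "kind": "member", "signature": "(self)"}
{"_type": "tag", "name": "main", "path": "aider/main.py", "kind": "function", "signature": "(argv)"}
"""

def build_repo_map(tag_lines: str) -> str:
    """Group tags by file and render a compact, token-frugal outline."""
    by_file = defaultdict(list)
    for line in tag_lines.splitlines():
        tag = json.loads(line)
        if tag.get("_type") != "tag":
            continue
        sig = tag.get("signature", "")
        by_file[tag["path"]].append(f"  {tag['kind']} {tag['name']}{sig}")
    out = []
    for path in sorted(by_file):
        out.append(path + ":")
        out.extend(by_file[path])
    return "\n".join(out)

print(build_repo_map(SAMPLE_TAGS))
```

The resulting outline lists each file once, followed by its declarations and call signatures, which is far cheaper in tokens than sending the code itself.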
I've been hearing executives and "tech leaders" recently saying that 80% of the new code is now written by chatgpt, and that it will "10x" a developer, but that sure mismatches with my experience. I suspect there will be a lot of managers with much higher expectations than is reasonable, which won't be good.
It's actually 1.2x productivity. Writing code does not take most of the day. If GPT can be just as good at debugging as it is at writing code, maybe the speedup would increase a bit. The ultimate AI speed == human reading + thinking speed, not AI generation speed.
I’ve found it’s very useful for understanding snippets of code, like a 200-line function. But systems are more than “lots of 200-line functions” - there’s a lot of context hidden in flags, conditional blocks, data, Git histories.
Maybe one day, we’ll be able to run it over all the Git histories, Jira tickets, Confluence documentation, Slack conversations, emails, meeting transcripts, presentations, etc. Until then, the humans will need to stitch it all together as best they can.
I can't think of a reason you couldn't specifically train an AI on your own large code base. After all, current LLMs are trained on effectively the entire internet.
Unless your documentation is 20~100x the size of your codebase and written in a conversational tone, you won't be able to ask the LLM questions about it in English.
If your only aim is to use it like Copilot, sure, it's useful.
You might be able to fine-tune a model on pull requests if they have really high-quality descriptions and high-quality commit messages, and the code is well documented and organized.
I'm really not sure whether this is irony or a serious comment.
So the first step is to let the LLM write the documentation.... :)
Sure. Because it understands the code so well.
I haven't yet seen anything that can scan an entire codebase and build, say, a data lineage to understand how a value in the UI was calculated. I'm sure it's coming, though.
It's coming, I'm sure.
Just right after we have invented AGI.
but not all human thinking is worthwhile. I had it do a simple chrome extension to let me reply inline on HN, and it coughed out a manifest.json that worked first try. I didn't have to poke around the Internet to find a reference and then debug that via stack overflows. Easily saved me half an hour and gave me more mental bandwidth for the futzing with the DOM that I did need to do. (to your point tho, I didn't try feeding the html to it to see if it could do that part for me.)
so it's somewhere between 1.2x and 10x for me, depending on what I'm doing that day. Maybe 3x on a good day?
I don't mean to pick on you specifically, but this kind of approach doesn't fit the way I like to work.
For example, just because the manifest.json worked doesn't mean it is correct - is it free of issues (security or otherwise)?
I would argue that every system in production today seemed to "just work" when it was built and initially tested, and yet how many serious issues are in the wild today (security or otherwise)?
I prefer to take a little more time solving problems, gaining an understanding of WHY things are done certain ways, and some insight into potential problems that may arise in the future even if something seems to work at first glance.
Now I get that you are just talking about a small chrome extension that maybe you are only using for yourself... but scaling that up to anything beyond that seems like a ticking time bomb to me.
I feel like you could get more benefit out of GPT. You could ask it whether it finds any vulnerabilities, common mistakes, or other inconsistencies: "Please provide comments on what each line does." "What are some other common ways to write this line of code?" Etc.
"What are some ways to handle this XYZ problem? I see you might have missed SQL injection attacks; would that apply here?"
Same goes for code you find on the internet.
"I got this output for this line of code; what do you think the problem is?"
Big misunderstanding about those chat-bot AIs.
Even OpenAI says clearly: You should not, by any means, ask the AI any questions you don't know the answer already!
> more benefit out of GPT. you could ask it if it finds any vulnerabilities, common mistakes, other inconsistencies. please provide comments on what each line does. what are some other common ways to write this line of code, etc.
And then it spits out some completely made-up bullshit…
How would you know if you don't understand what you're actually doing?
Every time I've tried chatgpt I've been shocked at the mistakes. It isn't a good tool to use if you care about correctness, and I care about correctness.
It may be able to regurgitate code for simple tasks, but that's all I've seen it get right.
You using 3.5 or 4?
Makes no difference. Both versions are mostly a bullshit generator.
But to see that you actually need to know in detail the things you're asking about.
After using it for some time, I'm by now quite surprised when this thing actually gets something right. But those cases are very rare.
Most managers don't care for people like you. Companies sell their product. Another successful fiscal year. Regardless of the absolute shit code base, wasteful architecture, and gaping security vulnerabilities.
Maybe, but I've always had lucrative jobs and my work has always been appreciated. Maybe you just have to find the right employer. I think longer-term, employers that value high quality work will have the upper-hand.
And to be honest, I don't care for managers like that, so the feeling is mutual.
What managers? The LLM will do their job first.
I think it is worthwhile considering the multiple of the alternatives.
That is, search engine results pages (SERPs) providing relevant discussion and exact or largely turnkey solutions.
On “easy” tasks in technical niches I’m not familiar with, I would take gpt over DDG + SO the majority of the time.
I've had situations where I wanted a tutorial or walkthrough on something and had to mix together various Substack and independent blog posts.
There is no consistency in quality or even correctness from those sources, while you must also deal with format and styling variation.
I block adtech within reason, but the problem of greyhat or SEO-focused filler also isn’t really a thing in gpt. You just get the fat of the land.
The biggest problem with GPT-4 is the cutoff date. The LLM needs to be updated regularly, the way Google initially seemed to crawl all the things and make them available to queries as they appeared.
Data recency and the ability to process excess tokens is going to show OP’s test as but a toy example of what the systems can do.
In the absence of constant updates to a massive model and all that entails, I foresee companies temporarily dominating attention by providing pretrained LoRA-like add-ons to LLMs at key events.
For example, Apple could release a new model trained on all of the updated Swift libraries coming to market at WWDC shortly after the keynote.
It can contain all the docs, example code and allow devs to really, really experiment and have warm and fuzzies on the newest stuff.
It could even include the details on product announcements and largely handle questions of compatibility.
If the companies hosted the topic focused chatgpt-like bots, they could also own all the unexpected questions, and both clarify and retrain on what the most enthusiastic users want to know.
This is going a bit of another direction, but I think all of this is very exciting and will hasten the delivery of software for brand new SDKs.
> The biggest problem with GPT-4 is the cutoff date. The LLM needs to be updated regularly, the way Google initially seemed to crawl all the things and make them available to queries as they appeared.
Have you tried out Phind? It's essentially GPT-4 + internet access. It hasn't been perfect, but it's been a very useful tool for me.
Using a search engine would have yielded https://github.com/plibither8/refined-hacker-news in a tiny fraction of the time wasted with the AI.
Also, chances are good that the AI just spit out some code from that extension… (Of course, without attribution, which would make it a copyright violation.)
It works great for quick tasks like that which build on open source, but not so great for developing large, existing closed systems.
1.2x is a good rough number. Some things it saves me hours on (regex, writing unfamiliar languages); some things I never even bother asking it about.
> Writing code does not take most of the day.
Definitely this. I spend about 10-15% of my time writing code so a 20% increase really doesn't save me a lot of time. Also AI generated code requires more reading of code, which is harder and more cognitively expensive than writing code.
Depends on the code. If AI can quickly write even mediocre quality automated tests, that's a tremendous speedup, and, if I'm being totally honest, morale boost.
If AI can write your tests, you're likely testing implementation details. This has "negative value"!
But an AI can't write higher level tests as it would need to understand large chunks of code (sometimes whole systems), which it can't.
I disagree about negative value. When it's this easy to throw away and rewrite tests, it's helpful to test the implementation details of your code.
Throwing away and rewriting tests is work (== negative value).
Testing implementation details is always counterproductive. Have you watched the video? (I don't often recommend videos, as I think writing has more value per unit of time, but this talk is a kind of classic on the topic.)
Sorry, I can't accept that a one hour video is the cost to participate in this conversation. I don't disagree that there is a cost associated with doing this, but the cost is so much smaller than it used to be that it can be economical now.
Implementation details are in the eye of the beholder IMO. I'm open to reasons why that's not the case here.
It can. I use aider/GPT-4 for this all the time. It’s super valuable and very low effort.
Most of the people I know don't use ChatGPT anymore. Many do use copilot as it is sometimes handy but far from life changing.
I use it pretty regularly for things like “write me a typed retry decorator in python, and write tests for it” or “parallelize this for loop using a ThreadPoolExecutor”
I work with people who have used it almost daily since it came out. None of them have 10x’d anything. Even their open source ChatGPT related projects are stalled and not going anywhere. Not to say it can’t help but I’ve not yet encountered this mythical 10x boost.
As others have observed, maybe most have stopped using it.
Maybe Ray Kurzweil is kind of on the money about computer brain interfaces, that’s when the fun really starts.
No worries. As soon as 80%+ of those managers are replaced by ChatGPT, such nonsense claims will stop; ChatGPT knows things better than most managers do. /s
I did not know what ctags was; here is the explanation from Exuberant Ctags:
>Ctags generates an index (or tag) file of language objects found in source files that allows these items to be quickly and easily located by a text editor or other utility. A tag signifies a language object for which an index entry is available (or, alternatively, the index entry created for that object).
Good point, ctags is a bit old school! Most IDEs use Language Server Protocol now for similar purposes.
I added a few sentences of explanation and background on ctags to my writeup.
I'm a big fan of ctags. My old Emacs setup utilized Ctags when all else failed. So if I wanted to find a reference for something it would use LSP, then ctags if LSP returned nothing.
I've been thinking that there needs to be a LangChain or similar in-prompt Tool that allows the model to automatically query a Language Server Protocol server. Maybe your tool does everything that could be done that way anyhow. aider looks interesting, I will try it out.
Absolutely, LSP is strictly more capable than ctags for this purpose. I mention it in the future work section of the writeup I linked above.
The main reason I started with ctags is because the `universal ctags` tool supports a ton of languages right out of the box. Each LSP server implementation tends to only support a single language. So it would be more work for users to find, install and stand up the LSP server(s) they need for their particular projects.
I plan to try some experiments with LSP in the future.
I keep reading that you can upload data to one of the GPT interfaces, I'm not sure if it's a ChatGPT plugin, or one of the OpenAI API, but I wonder if you can include some source code that you can then prompt based on...
I've been kind of tricking GPT, admittedly I'm currently limited to gpt-3.5-turbo, because with the API you can submit whole conversations and ask for the next reply from the assistant. I don't know how much of that whole conversation it re-reads, but I'll do things like:
At this point, I then ask it to generate a response.
I've found that if I try to combine all the above information into a single prompt, I often run into token limits.
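A rough sketch of that pattern: keep the system prompt and drop the oldest turns when the estimated token count exceeds the budget. The 4-characters-per-token estimate is a crude assumption of mine; a real implementation would count with tiktoken:

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    # Use a real tokenizer (e.g. tiktoken) for accurate counts.
    return max(1, len(text) // 4)

def trim_history(messages: list, budget: int) -> list:
    """Keep the system prompt plus as many of the most recent turns as fit the budget."""
    system, turns = messages[0], messages[1:]
    kept, used = [], estimate_tokens(system["content"])
    for msg in reversed(turns):  # walk backwards from the newest turn
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))

history = [
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "user", "content": "Here is module A..." + "x" * 4000},  # a big paste
    {"role": "assistant", "content": "Understood."},
    {"role": "user", "content": "Now add a retry to the fetch function."},
]
trimmed = trim_history(history, budget=200)
```

With a 200-token budget, the oversized paste gets dropped while the system prompt and the recent turns survive, which is essentially the trade-off you face when packing a conversation into the context window.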
How much time did it take to prepare all of that for ChatGPT? Won’t you have to redo all of that work every time you ask for more help since code bases are not static? Would it take less time and effort to just write the code on your own?
Ya, it would be tedious to do all of this manually. I guess it wasn't clear, but all of this is part of my open source GPT coding tool called "aider".
aider is a command-line chat tool that allows you to write and edit code with GPT-4. As you are chatting, aider automatically manages all of this context and provides it to GPT as part of the chat. It all happens transparently behind the scenes.
> needs to understand
LLMs don’t understand. They don’t understand because they have no model of reality.
Can you prove that? Understanding ultimately is the ability to map to abstractions, and GPT-4 and other models are demonstrably able to abstract.
The onus is on you to prove that they do understand.
One may not simply claim that a machine 'understands' because it appears to. Now we're actually in the territory of the much-misunderstood Occam's razor: the theory which introduces the fewest new assumptions is the most likely to be correct.
The assumption that a machine ‘understands’ — is capable of such a thing — is new, and requires extreme justification.
> Can you prove that?
Yes. One can easily construct a conversation where it is obvious that ChatGPT doesn't understand.
Ya, that's not how proof works. People can do the same with humans, and failing to understand something is not a sign that humans lack the capacity for understanding.
This is a game changer. I’ve been doing something similar, relatively manually. I’ll give this a spin and report back.
Glad to hear you're going to give it a try. Let me know how it goes.
> Convey all of this “code context” to GPT in an efficient manner that fits within the 8k-token context window.
> To address these issues, I send GPT a concise map of the whole codebase. The map includes all declared variables and functions with call signatures.
Even if you could encode a function signature in only one token (which isn't possible, of course), this would only work for small single-developer programs, not large projects.
Mid-sized frameworks/libs often have tens of millions of lines of code and hundreds of thousands of function signatures. With 8k tokens to spend, you get nowhere close to covering a large code base…
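The back-of-the-envelope arithmetic (with numbers that are purely illustrative assumptions) makes the point:

```python
context_window = 8_000          # GPT-4 base model token limit
tokens_per_signature = 10       # optimistic; real signatures often cost more
reserve_for_task = 2_000        # tokens kept back for the question and the answer

# How many signatures fit in what's left of the window?
signatures_that_fit = (context_window - reserve_for_task) // tokens_per_signature
print(signatures_that_fit)  # 600
```

Six hundred signatures against a codebase with hundreds of thousands is a coverage of well under 1%, so some selection or summarization step is unavoidable.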
This is awesome, thanks. I wonder if complementing it with a vector store of the full repo further assists.
Ya, vector search is certainly the most common hammer to reach for when you're trying to figure out what subset of your full dataset to share with an LLM. And you're right, it's probably a piece of the puzzle for coding with GPT against a large codebase.
But I think code has such a semantically useful structure that we should probably try and exploit that as much as possible before falling back to "just search for stuff that seems similar".
Check out the "future work" section near the end of the writeup I linked above. I have a few possible improvements to this basic "repo map" concept that I'm experimenting with now.
No I agree the distilled map is most useful in context. However I wonder if providing a vector store of the total code base amplifies the effect. You could also pull in vector stores of all dependencies as well. Regardless amazing work and looking forward to seeing your future work as outlined.
LLMs can put their embeddings into a vector database, extending their short-term memory to encompass an entire codebase, every support ticket, PR, etc...
I haven't seen any turnkey solutions yet that are good enough to use for development, but it's coming within months, not years.
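A toy version of that retrieval loop, with a trivial bag-of-words "embedding" standing in for a real embedding model and vector database:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a sparse bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A mixed corpus: code, a support ticket, a PR description.
documents = [
    "def get_tracked_files(repo): return repo.git.ls_files()",
    "TICKET-42: login button misaligned on mobile Safari",
    "PR #17: add retry logic to the network fetch helper",
]
index = [(doc, embed(doc)) for doc in documents]

query = embed("where is the retry logic for network fetches?")
best = max(index, key=lambda pair: cosine(query, pair[1]))[0]
print(best)
```

A real system would swap in a learned embedding model and an approximate-nearest-neighbor store, but the shape of the loop (embed everything, embed the query, take the closest matches into the prompt) is the same.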
Using ctags for this purpose is genius. Thanks for posting!
Author here. Happy to answer any questions or comments.
As a teaser for a followup: I've been able to create a version of this that can generate an extension in one shot without any issues. This is done by giving GPT more context on the expected project layout, as well as a checklist of constraints required to create a valid app. It's not hard to envision a future where all software projects can be scaffolded by an LLM.
I appreciate your write-up, it was well done. How long did you spend on this? I'd be most interested to know how long, not counting writing the blog post.
they started the project on the 13th per the git commits. so on the order of two weeks, it seems.
Ah, good eye, ty.
Tangential, but which model did you use to generate the picture at the start of the article?
dalle2 with the following prompt: Humanoid robots working on putting chips together in a factory, digital art
Random plug, but I made a front end for DALL-E 2 that is 6-10x cheaper than using their site: https://dall-e.sonnet.io/
nice! will check it out
Looking forward to the follow up. I assume you set up a more capable gpt agent that can execute commands?
same agent. just more constraints around the domain (of creating a vscode extension)
This is a great write-up, and I've had similar experiences in domain-specific scaffolding (using GPT-4 directly rather than smol-ai, and creating Odoo extensions rather than VSCode ones).
That said, I think some of your takeaways are more criticisms of smol-ai than of GPT-4:
> there are no instructions on how to develop, use, or publish the extension
... Because you didn't ask for instructions. All it would take is one more prompt to get these.
> there is no launch.json file that configures running the extension in development
Given the prompts it's fair to expect these should have been created. But again, since you already demonstrated some domain specific knowledge is required, when you see something like this is missing from the original prompt then you need to ask for it in a new prompt.
> there are no tests
I think we're seeing a pattern here. Did you ask for tests? In defense of the AI here, I think very few human coders would be writing tests in such an early stage of development.
> there is no code reuse
This one is fair. But I wonder if you fed all the original code in - I guess it would fit in one 32k context? - and asked it to find code reuse possibilities (and bugs while we're at it), what would you get?
It seems to me there are much better ways to get all of this stuff than chatting with an AI.
Unless you already know what you want to do, relying on the AI seems like trying to dress yourself in the dark.
If you do know what you want, use a more specific, better tool like (in this case) a project generator or starter template off of Github.
In my specific case, Odoo modules are pretty unusual, the default scaffolding is barely existent, and the documentation is not great. I haven't found any usable alternative starter templates either.
Besides that, it's also hard to use other Odoo modules as references, because there's this weird thing where Odoo decided to recreate an old version of React that uses XML rather than JSX. They call it OWL. But they've only partially migrated, so modules are mixed between that and, I think, at least two older styles. Oh, and at some point in the past they decided to invent their own version of JS modules too. It's a mess.
Oh yeah, and I'm using Odoo 16 while nearly everything else I can find is for Odoo 15, 14, or 13, and there have been a lot of technical changes between those versions.
I've been programming professionally for about 15 years and this is one of the most tangled pieces of spaghetti I've ever had to unravel. There's just too many threads to easily keep them in my head at once, and very little in the way of tutorials.
In this situation, for all its faults, GPT-4 does a great job as an assistant. It genuinely is better than the alternatives I have available, even though its knowledge of Odoo 16 is also limited, since it was released after the training cutoff.
> To test GPT-4's ability to generate a complex program, ...
I wonder how much the complexity of the various ecosystems we find ourselves in contribute to the lack of effectiveness of the language model. The task at hand really shouldn't be considered complex. Making and registering a command to do this in Emacs is essentially (defun inc-heading () (interactive) (save-excursion (search-backward-regexp "^\\# ") (insert "#"))). No project structure, no dependencies, no tooling: something a LLM should have no problem doing.
So basically a waste of time. People are getting it wrong. It shouldn't be used for any generation. It should be used for compression.
It works better when you use it to generate something sane. My take away from the article is that to program in typescript you have to jump through insane hoops.
I disagree. It is like 80% there.
"Prompt engineer" didn't exactly make sense to me until a coworker (graphic design) talked about the prompts that he'd see in Midjourney's discord server. Particularly when he mentioned specifications around camera lenses and perspective and other things I'm not familiar with. Very specific choices, continuing to be added and refined.
Then seeing what guidance is intended to do (LLM-prompt templating) it becomes obvious what people have in mind for this. It won't obviate understanding the lower-level code -- in fact, I expect the value of that skill to increase over time as fewer people bother to learn it -- but it will cut out the time it takes to scaffold an application.
If anything, Midjourney is an argument against prompt engineering. Each successive iteration of Midjourney (currently at v5) has made it progressively easier and dispensed with the need for Stable Diffusion-specific terminology.
I’ve heard similar things from the same individual about the progress of results. Indeed, they said the things I mentioned in my first comment a few months ago. Still, it sounds like a current problem is that it’s difficult to reliably get visually nearly-identical output with incremental changes. That’s one area where LLM prompt templating can be useful.
This lack of reliability is often mentioned about these generative technologies. It seems to me that “how do I get the same T-Rex with lasers shooting out of its eyes but different color lasers?” and “how do I get the same JSON structure in my output and always only the JSON with no other text?” are roughly similar problems in “prompt engineering”.
If an LLM can learn to create a picture it's likely it can also learn to create a prompt.
You could pull one of the sample extensions and it would actually compile and run without these typos. The actual logic of getting the current selection, calculating the header length and writing back the new header is only like 10 LOC.
Yeah, but you know what they say about the last 20%.
edit: just check the update... it managed to get it done one-shot
The author uses smol-developer for this. I've had mixed luck using smol-developer for actual development, but reading the code and prompts it's using, and just the general workflow is pretty fascinating and I've been playing around with adjusting it (Issue #34 https://github.com/smol-ai/developer/issues/34).
In short:
One could imagine even breaking it further down ("what functions need to be in file X"), which might get it closer to what the author of this post says about GPT being better at smaller chunks. But you also kind of want some shared context as well.
I broke out smol-developer's prompts and modified them to include "ask any clarifying questions you need to build the [list of files/dependencies/source]" and then was able to provide some more feedback during the development process.
I also started off with the "You are an AI prompt engineer; gather requirements and ask any follow-up questions that are necessary to generate the perfect prompt, optimized to feed into GPT, to achieve the user's goals." trick, did some back and forth during which it asked some fascinating questions, including at least one that wasn't even on my radar. I then used that generated prompt in the following steps with smol-developer.
> One could imagine even breaking it further down ("what functions need to be in file X")
Based on my manual usage, I usually guide it through a top-down process, something like:
- give an overview of how you would approach the problem
- describe the main steps
- do you see any problems with this approach? can you think of any alternatives?
- taking into account everything that we have discussed so far, please carefully review and check your approach, then propose a revised approach
- only then would I get it to generate a list of source files and implement them one by one
- now, given your implementation as a whole, review it and note any errors or improvements that could be made
- fix the code
...
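That staged approach is easy to script. A skeleton of the loop, where `ask` is a hypothetical callback wrapping whatever chat API you use:

```python
STEPS = [
    "Give an overview of how you would approach the problem.",
    "Describe the main steps.",
    "Do you see any problems with this approach? Any alternatives?",
    "Taking everything discussed so far into account, propose a revised approach.",
    "Generate a list of source files, then implement them one by one.",
    "Review the implementation as a whole and note any errors or improvements.",
    "Fix the code.",
]

def run_chain(ask, task: str) -> list:
    """Walk the model through each step, keeping the full conversation as shared context."""
    messages = [{"role": "user", "content": task}]
    for step in STEPS:
        messages.append({"role": "user", "content": step})
        reply = ask(messages)  # e.g. a call to a chat-completions endpoint
        messages.append({"role": "assistant", "content": reply})
    return messages

# Stub "model" so the skeleton can be exercised without an API key.
transcript = run_chain(lambda msgs: f"(reply to: {msgs[-1]['content'][:20]}...)", "Build a CLI todo app.")
```

The point is that every step sees the whole transcript, which is the "shared context between the steps" the comments below mention.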
Once I have code in hand I'll paste it into a new chat with a prompt like "You are our resident compiler guru. Please review the following code:..." It helps to be more specific about the reviewer's background and what you want. Rinse and repeat until it responds with high praise. Python works better than C++. GPT4 is much better than 3.5 of course.
very cool to see usage of smol-developer in the wild! have been traveling so havent committed code but have a bunch of plans coming to level it up while staying smol. (see issues)
big fan of smol-developer. it provides a great starting point for doing complicated things while being simple enough to reason about :)
I feel like GPT-4's training data cutoff is too old. I'm trying to get it to help me set up infrastructure in Azure. Blog posts describing my task are often from 2021, and the dropdown lists and names of things simply don't match what they are now.
The most important part of producing software, and the one that takes the most time, is NOT the act of mapping solutions to code/architecture and writing said code. It is coming up with the solutions in the first place.
Ask it first how to break down the problem and then ask each step
and keep enough shared context between the steps so that there is coherence
And to not reply in Klingon