This idea that you can get good results from a bad process as long as you have good quality control seems… dubious, to say the least. “Sure, it’ll produce endless broken nonsense, but as long as someone is checking, it’s fine.” This, generally, doesn’t really work. You see people _try_ it in industry a bit; have a process which produces a high rate of failures, catch them in QA, rework (the US car industry used to be notorious for this). I don’t know of any case where it has really worked out.
Imagine that your boss came to you, the tech lead of a small team, and said “okay, instead of having five competent people, your team will now have 25 complete idiots. We expect that their random flailing will sometimes produce stuff that kinda works, and it will be your job to review it all.” Now, you would, of course, think that your boss had gone crazy. No-one would expect this to produce good results. But somehow, stick ‘AI’ on this scenario, and a lot of people start to think “hey, maybe that could work.”
Reviewing code from less experienced or unmotivated people is also very taxing, both in a cognitive and emotional sense. It will never approach a really good level of quality because you just give up after 4 rounds of reviews on the same feature.
Except humans learn from your PR comments and in other interactions with more experienced people, and so inexperienced devs become experienced devs eventually. LLMs are not so trainable.
Some people say we're near the end of pre-training scaling, and RLHF etc is going to be more important in the future. I'm interested in trying out systems like https://github.com/OpenPipe/ART to be able to train agents to work on a particular codebase and learn from my development logs and previous interactions with agents.
If they're unmotivated enough to not get there after four review rounds for a junior-appropriate feature, they're not going to get better. It's a little impolite to say, but if you spend any significant amount of time coaching juniors you'll encounter exactly what I'm talking about.
Here’s the thing about AI though - you don’t need to worry about its confidence or impact on professional development if you’re overly critical, and it will do a turn within seconds. That gives a tremendous amount of flexibility and leverage to the code reviewer. Works better on some types of problems than others, but it’s worth exploring!
With human co-workers, you can generally assume things you can't with AI.
My human co-workers generally act in good faith. Even the developer who was clearly on the verge of getting a role elsewhere, without his heart in it-- he tried to solve the problems assigned to him, not chase some random delusion that the words happened to suggest. I don't have that level of trust with AI.
If there's a misunderstanding of the problem or the context, it's probably still the product of a recognizable logic flow that you can use to discuss what went wrong. I can ask Claude "Why are you converting this amount from Serbian Dinars to Poppyseed Bagels in line 476?" but will its answer be meaningful?
Human code review often involves a bit of a shared background. We've been working with the same codebases for several years, so we're going to use existing conventions. In this situation, the "AI knows all and sees all" becomes an anti-feature-- it may optimize for "this is how most people solve this task from a blank slate" rather than "it's less of a cognitive burden for the overall process if your single change is consistent with 500 other similar structures which have been in place since the Clinton administration."
There may be ways to try to force-feed AI this behaviour, but the more effort you devote to priming and pre-configuring the machine, the less you're actually saving over doing the actual work in the first place.
Right, this is the exact opposite of the best practices that W. Edwards Deming helped develop in Japan, then brought to the west.
Quality needs to come from the process, not the people.
Choosing to use a process known to be flawed, then hoping that people will catch the mistakes, doesn't seem like a great idea if the goal is quality.
The trouble is that LLMs can be used in many ways, but only some of those ways play to their strengths. Management have fantasies of using AI for everything, having either failed to understand what it is good for, or failed to learn the lessons of Japan/Deming.
> Choosing to use a process known to be flawed, then hoping that people will catch the mistakes, doesn't seem like a great idea if the goal is quality.
You're also describing the software development process prior to LLMs. Otherwise code reviews wouldn't exist.
People have built complex, working, mostly bug-free products without code reviews, so humans are not that flawed.
With humans and code reviews, two humans looked at it. With an LLM and a code review of the LLM output, only one human looked at it, so it's not the same. LLMs are still far from as reliable as humans, or you could just tell the LLM to do the code reviews and then it builds the entire complex product itself.
People have built complex bug-free software without _formal_ code review. It's very rare to write complex bug-free software without at least _informal_ code review, and when it happens it's luck, not skill.
Can't have a code review if you're coding solo[0], unless we are redefining the meaning of "code review" to the point of meaninglessness by including going over one's own code.
0. The dawn of video games had many titles with one person responsible for programming. This remains the case for many indie games and small software apps and services. It's a skill that requires expertise and/or dedication.
Sure - software development is complex, but there seems to be a general attempt over time to improve the process and develop languages, frameworks and practices that remove the sources of human error.
Use of AI seems to be a regression in this regard, at least as currently used - "look ma, no hands! I've just vibe coded an autopilot". The current focus seems to be on productivity - how many more lines of code or vibe-coded projects you can churn out - maybe because AI is still basically a novelty that people are still learning how to use.
If AI is to be used productively towards achieving business goals then the focus is going to need to mature and change to things like quality, safety, etc.
> Management have fantasies of using AI for everything, having either failed to understand what it is good for, or failed to learn the lessons of Japan/Deming.
Third option: they want to automate all jobs before the competition does. Think of it as AWS, but for labor.
Deming’s process was about how to operate a business in a capital-intensive industry when you don’t have a lot of capital (with market-acceptable speed and quality). That you could continue to push it and raise quality as you increased the amount of capital you had was a side-effect, and the various Japanese automakers demonstrated widely different commitments to it.
And I’m sure you know that he started formulating his ideas during the Great Depression and refined them while working on defense manufacturing in the US during WWII.
> Quality needs to come from the process, not the people.
Not sure which Japanese school of management you're following, but I think Toyota-style goes against that. The process gives more autonomy to workers than, say, Ford-style, where each tiny part of the process is pre-defined.
I got the impression that Toyota-style was considered to bring better quality to the product, even though it gives people more autonomy.
In an ideal world all employees would be top notch, on their game every day, never making mistakes, but the real world isn't like that. If you want repeatable quality then it needs to be baked into the process.
It's a bit like Warren Buffet saying he only wants to invest in companies that could be run by an idiot, because one day they will be.
W. Edwards Deming actually worked with both Toyota and Ford, perhaps more foundationally at Toyota, bringing his process-based-quality ideas to both. Toyota's management style is based around continuous process improvement, combined with the employee empowerment that you refer to.
Or more broadly, the existence of complex or any life.
Sure, it's not the way I would pick to do most things, but when your buzzword magical thinking runs so deep that all you have is a hammer, even if it doesn't look like a nail you will force your wage slaves to hammer it anyway until it works.
As to your other cases: injection-molded plastic parts for things like the spinning T-bar spray arm in some dishwashers. Crap molds, then pass the parts to low-wage or temp workers to fix up by hand with a razor blade and box up. I've personally worked such a temp job before, among others, so yes, that kind of bad-output, manual-QC-and-fix-up process still abounds.
And if we are talking high failure rates... see also chip binning and foundry yields in semiconductors.
You just have to look around to see that what seems dubious is more the norm.
What happens is that you develop a feeling of having acquired a meta skill. It's tempting to believe the scope of what you can solve has expanded when you self-assess as "good" with AI.
It's the same with any "general" tech. I've seen it since genetic algorithms were all the rage. Everyone reaches for the most general tool, then assumes everything that tool might be used for is now a problem or domain they are an expert in, with zero context into that domain. AI is this times 100x, plus one layer more meta, as you can optimize over approaches with zero context.
That's an oversimplification. AI can genuinely expand the scope of things you can do. How it does this is a bit particular though, and bears paying attention to.
Normally, if you want to achieve some goal, there is a whole pile of tasks you need to be able to complete to achieve it. If you don't have the ability to complete any one of those tasks, you will be unable to complete the goal, even if you're easily able to accomplish all the other tasks involved.
AI raises your capability floor. It isn't very effective at letting you accomplish things that are meaningfully outside your capability/comprehension, but if there are straightforward knowledge/process blockers that don't involve deeper intuition it smooths those right out.
Normally, one would learn the missing steps, with or without AI.
You're probably envisioning a more responsible use of it (floor raising, "meaningfully inside your comprehension"), but that is actually not what I'm referring to at all ("assumes everything that tool might be used for is now a problem or domain they are an expert in"). A meta tech can be used in many ways and yours is close to what I believe the right method is. But I'm asserting that the danger is massive over-reliance and over-confidence in the "transferability".
> If you don't have the ability to complete any one of those tasks, you will be unable to complete the goal
Nothing has changed. Few projects start with you knowing all the answers. In the same way AI can help you learn, you can learn from books, colleagues, and trial and error for tasks you do not know.
I can say from first hand experience that something has absolutely changed.
Before AI, if I had the knowledge/skill to do something on the large scale, but there were a bunch of minute/mundane details I had to figure out before solving the hard problems, I'd just lose steam from the boredom of it and go do something else. Now I delegate that stuff to AI. It isn't that I couldn't have learned how to do it, it's that I wouldn't have because it wouldn't be rewarding enough.
That’s great - you personally have found a tool that helps you overcome unknown problems. Other people have other methods for doing that. Maybe AI makes that more accessible in general.
Yep. All the process in the world won’t teach you to make a system that works.
The pattern I see over and over is a team aimlessly putting along through tickets in sprints until an engineer who knows how to solve the problem gets it on track personally.
What I took away from the article was that being good at code review makes the person better at guiding the agent to do the job, giving the right context and constraints at the right time… and not that the code reviewer has to fix whatever the agent generated… this is also pretty close to my personal experience… LLM models are a bull which can be guided and definitely not a complete idiot…
In a strange kind of analogy, flowing water can cause a lot of damage.. but a dam built to the right specification and turbines can harness that for something very useful… the art is to learn how to build that dam
I have a play project which hits these constraints a lot.
I have been messing around with getting AI to implement novel (to me) data structures from papers. They're not rocket science or anything but there's a lot of detail. Often I do not understand the complex edge cases in the algorithms myself so I can't even "review my way out of it". I'm also working in go which is usually not a very good fit for implementing these things because it doesn't have sum types; lack of sum types often adds so much interface{} bloat it would render the data structure pointless. Am working around with codegen for now.
What I've had to do is demote "human review" a bit; it's a critical control but it's expensive. Rather, think more holistically about "guard rails" to put where and what the acceptance criteria should be. This means that when I'm reviewing the code I am reasonably confident it's functionally correct, leaving me to focus on whether I like how that is being achieved. This won't work for every domain, but if it's possible to automate controls, it feels like this is the way to go wherever possible.
The "principled" way to do this would be to use provers etc, but being more of an engineer I have resorted to ruthless guard rails. Bench tests that automatically fail if the runtime doesn't meet requirements (e.g. is O(n) instead of O(log n)) or overall memory efficiency is too low - and enforcing 100% code coverage from both unit tests AND fuzzing. Sometimes the cli agent is running for hours chasing indexes or weird bugs; the two main tasks are preventing it from giving up, and stopping it from "punting" (wait, this isn't working, let me first create a 100% correct O(n) version...) or cheating. Also reminding it to check AGAIN for slice sharing bugs which crop up a surprising % of the time.
The other "interesting" part of my workflow right now is that I have to manually shuffle a lot between "deep research" (which goes and reads all the papers and blogs about the data structure) and the cli agent which finds the practical bugs etc but often doesn't have the "firepower" to recognise when it's stuck in a local maximum or going around in circles. Have been thinking about an MCP that lets the cli agent call out to "deep research" when it gets really stuck.
The issue with the hypothetical is if you give a team lead 25 competent people they'd also get bad results. Or at least, the "team lead" isn't really leading their team on technical matters apart from fighting off the odd attempt to migrate to MongoDB and hoping that their people are doing the right thing. The sweet spot for teams is 3-6 people and someone more interested in empire building than technical excellence can handle maybe around 9 people and still do a competent job. It doesn't depend much on the quality of the people.
The way team leads seem to get used is that people who are good at code get a little more productive as more people are told to report to them. What is happening now is that the senior-level engineers all automatically get the same option: a team of 1-2 mid-level engineers on the cheap thanks to AI, which is entirely manageable. And anyone less capable gets a small team, a rubber duck or a mentor depending on where they fall vs LLM use.
Of course, the real question is what will happen as the AIs get into the territory traditionally associated with 130+ IQ ranges and the engineers start to sort out how to give them a bit more object persistence.
Imagine a factory making injection molded plastic toys but instead of pumping out perfect parts 99.999% of the time, the machine gives you 50% and you have to pay people to pull out the bad ones from a full speed assembly line and hope no bad ones get through.
Chips can be and are tested automatically. It's not like there are people running beside a conveyor belt with a microscope. So no, it's not the same thing. One doesn't need people, while the other one does.
Also, a fab is intended to make fully functioning chips, which sometimes it fails to achieve. An LLM is NOT designed to give correct output, just plausible output. Again, not the same goal.
LLMs are an implementation of bogosort. But less efficient.
I predict many disastrous "AI" failures because the designers somehow believed that "some humans capable of constant vigilant attention to detail" was an easy thing they could have.
It also assumes that people who are "good" at the standard code review process (which is tuned for reviewing code written by humans with some level of domain experience and thus finding human-looking mistakes) will be able to translate their skills perfectly to reviewing code written by AI. There have been plenty of examples where this review process was shown to be woefully insufficient for things outside of this scope (for instance, malicious patches like the bad patches scandal with Linux a few years ago or the xz backdoor were only discovered after the fact).
I haven't had to review too much AI code yet, but from what I've seen it tends to be the kind of code review that really requires you to think hard and so seems likely to lead to mistakes even with decent code reviewers. (I wouldn't say that I'm a brilliant code reviewer, but I have been doing open source maintenance full-time for around a decade at this point so I would say I have some experience with code reviews.)
I'm not sure about the current state of the art, but microprocessor production is (was?) very bad. You make a lot of them on a single silicon wafer, and then test them thoroughly until you find the few that are good. You drop all the defective ones because they are very cheap pieces of sand, and charge a lot for the ones that work correctly to cover all the costs.
Design for test is still a major part of (high volume) chip design. Anything that can't be tested in seconds on wafer is basically worthless for mass production.
In that case, tho, no-one’s saying “let’s be sloppy with production and make up for it in the QA” (which really used to be a US car industry strategy until the Japanese wiped the floor with them); the process is as good as it reasonably can be, there are just physical limits. Chip manufacturers spend vast amounts on reducing the error rate.
I went from ";" to fully working C++ production grade code with good test coverage. To my estimation, 90% of the work was done in an agent prompt. It was a side project, now it will be my job. The process is like they described.
For the core parts you cannot let go of the reins. You have to keep steering it. You have to take short breaks and reload the code into the agent as it starts acting confused. But once you get the hang of it, things that would take you months of convincing yourself and picking yourself back up to continue becomes a day's work.
Once you have a decent amount of work done, you can have the agent read your code as documentation and use it to develop further.
2. Quality control is key to good processes as well. Code review is literally a best practice in the software industry. Especially in BigTech and high-performing organizations. That is, even for humans, including those that could be considered the cream of the industry, code review is a standard step of the delivery process.
3. People have posted their GitHub profiles and projects (including on this very forum) to show how AI is working out for them. Browse through some of them and see how much "endless broken nonsense" you find. And if that seems unscientific, well go back to point 1.
I picked one of the studies in the search (!) you linked. First of all, it's a bullshit debate tactic to try to overwhelm your opponents with vague studies -- a search is complete bullshit because it puts the onus on the other person to discredit the gargantuan amount of data you've flooded them with. Many of the studies in that search don't have anything to do with programming at all.
So right off the bat, I don't trust you. Anyway, I picked one study from the search to give you the benefit of the doubt. It compared leetcode in the browser to LLM generation. This tells us absolutely nothing about real world development.
What made the METR paper interesting was that they studied real projects, in the real world. We all know LLMs can solve well bounded problems in their data sets.
As for 3 I've seen a lot of broken nonsense. Let me know when someone vibe codes up a new mobile operating system or a competitor to KDE and Gnome lol
Even if the process weren’t technically bad, it would still be shit. Doing code review with a human has meaning in that the human will probably learn something, and it’s an investment in the future. Baby-sitting an LLM, however, is utterly meaningless.
Are you saying that supermarket vegetables/produce are good?
Quite a bit of it, like tomatoes and strawberries, is just crap. Form over substance. Nice color and zero flavor. Selected for delivery/shelf-life/appearance rather than actually being any good.
> I was also considering the way the US food standards allows a lot of insect parts in the products, but wasn't sure how to phrase it.
I don't know how the US compares to other countries in terms of "insects per pound" standards, but having some level of insects is going to be inevitable.
For example, how could you guarantee that your wheat, pre-milling, has zero insects in it, or that your honey has no bee parts in it (best you can do is strain it, then anything that gets through the straining process will be on your toast).
You can campaign for your government to set any legal minimum for quality that you want, but it's essentially nonsensical to expect people not to optimise for cheapest given whatever those constraints are.
Your list of winners are optimising for what the market cares about, exactly like the supermarkets (who are also mostly winners) are optimising for what the market cares about. For most people, for food specifically, that means "cheap". Unavoidably, because most people have less money than they'd like. Fancy food is rare treat for many.
Apple software currently has a reputation for buggy UI; Oracle has a reputation for being litigious; that just leaves Nvidia, who are printing money selling shovels in two successive gold rushes, which is fine for a business and means my investment is way up, but also means for high-end graphics cards consumer prices are WTF and availability is LOL.
> This idea that you can get good results from a bad process
This idea is called "evolution"...
> as long as you have good quality control
...and its QA is death on every single level of the system: cell, organism, species, and ecosystem. You must consider that those devs or companies with not-good-enough QA will end up dead (from a business perspective).
Evolution is extremely inefficient at producing good designs. Given enough time it'll explore more, because it's driven randomly, but most mutations either don't help, or downright hurt an organism's survival.
If the engineer doing the implementation is top-shelf, you can get very good results from a “flawed” process (in quotes, because it’s not actually “bad.” It’s just a process that depends on the engineer being that particular one).
Silicon Valley is obsessed with process over people, manifesting “magical thinking” that a “perfect” process eliminates the need for good people.
I have found the truth to be in-between. I worked for a company that had overwhelming Process, but that process depended on good people, so it hired top graduates, and invested huge amounts of money and time into training and retention.
Said a little more crass/simply: A people hire A people. B people hire C people.
The first is phenomenal until someone makes a mistake and brings in a manager or supervisor from the C category that talks the talk but doesn't walk the walk.
If you accidentally end up in one that turns out to be the latter, it's maddening trying to get anything accomplished if the task involves anyone else.
> Imagine that your boss came to you, the tech lead of a small team, and said “okay, instead of having five competent people, your team will now have 25 complete idiots. We expect that their random flailing will sometimes produce stuff that kinda works, and it will be your job to review it all.”
This is exactly the point of corporate Agile. Management believes that the locus of competence in an organization should reside within management. Depending on competent programmers is thus a risk, and what is sought is a process that can simulate a highly competent programmer's output with a gang of mediocre programmers. Kinda like the myth that you can build one good speaker out of many crappy ones, or the principle of RAID which is to use many cheap, failure-prone drives to provide the reliability guarantees of one expensive, reliable drive (which also kinda doesn't work if the drives came from the same lot and are prone to fail at about the same time). Every team could use some sort of process, but usually if you want to retain good people, this takes the form of "disciplines regarding branching, merging, code review/approval, testing, CI, etc." Something as stifling as Scrum risks scaring your good people away, or driving them nuts.
So yes, people do expect it to work, all the time. And with AI in the mix, it now gains very nice "labor is more fungible with capital" properties. We're going to see some very nice, spectacular failures in the next few years as a result, a veritable Perseid meteor shower of critical systems going boom; and those companies that wish to remain going concerns will call in human programmers to clean up the mess (but probably lowball on pay and/or try to get away with outsourcing to places with dirt-cheap COL). But it'll still be a rough few years for us while management in many orgs gets high off their own farts.
Bayesian reasoning would lead me to think that with a high rate of failures, even if QA is 99.9% amazing and the AI dev output is only 80% good, there will still be more poor features and bugs (99.9% * 80% = 79.92%) than if both are mediocre (90% * 90% = 81%).
Asking AI to stay true to my requested parameters is hard, THEY ALL DRIFT AWAY, RANDOMLY
When working on nftables syntax highlighters, I have 230 tokens, 2,500 states, and 50,000+ state transitions.
Some firm guidelines given to AI agents are:
1. Fully-deterministic LL(1) full syntax tree.
2. No use of Vim 'syntax keyword' statement
3. Use long group names in snake_case whose naming starts with 'nft_' prefix (avoids collision with other Vim namespaces)
4. For parts of the group names, use only nftables/src/parser_bison.y semantic action and token names as-is.
5. For each traversal down the syntax tree, append that non-terminal node name from parser_bison.y to its group names before using it.
With those 5 "simple" user-requested requirements, all AI agents drift away from each of the rules at seemingly random intervals.
At the moment, it is dubious to even trust the bit-length of each packet field.
Never mind their inability to construct a simple Vimscript.
I use AI agents mainly as documentation.
On the bright side, they are getting good at breaking down 'rule', 'chain_block stmt', and 'map_stmt_expr' (that '.' period we see when chaining header expressions together); just use the quoted words and paste in one of your nft rule statements.
I'm a dev (who knows nothing about nftables) and I don't understand your instructions. I think maybe you could improve your situation by formulating them as "when creating new group names, use the semantic actions and token names as defined in parser_bison.y", i.e. with if-conditions so that the correct rules apply to the correct situations. Because your rules are written as if to apply to every line of code, it might unnecessarily try to incorporate context even when it's not applicable.
AI-generated code can be useful in the early stages of a project, but it raises concerns in mature ones. Recently, a 280kloc+ Postgres parser was merged into Multigres (https://github.com/multigres/multigres/pull/109) with no public code review. In open source, this is worrying. Many people rely on these projects for learning and reference. Without proper review, AI-generated code weakens their value as teaching tools, and more importantly the trust in pulling as dependencies. Code review isn’t just about bugs, it’s how contributors learn, understand design choices, and build shared knowledge. The issue isn’t speed of building software (although corporations may seem to disagree), but how knowledge is passed on.
I oversaw this work, and I'm open to feedback on how things can be improved. There are some factors that make this particular situation different:
This was an LLM assisted translation of the C parser from Postgres, not something from the ground up.
For work of this magnitude, you cannot review line by line. The only thing we could do was to establish a process to ensure correctness.
We did control the process carefully. It was a daily toil. This is why it took two months.
We've ported most of the tests from Postgres. Enough to be confident that it works correctly.
Also, we are in the early stages for Multigres. We intend to do more bulk copies and bulk translations like this from other projects, especially Vitess. We'll incorporate any possible improvements here.
The author is working on a blog post explaining the entire process and its pitfalls. Please be on the lookout.
I was personally amazed at how much we could achieve using LLM. Of course, this wouldn't have been possible without a certain level of skill. This person exceeds all expectations listed here: https://github.com/multigres/multigres/discussions/78.
"We intend to do more bulk copies and bulk translations like this from other projects"
Supabase’s playbook is to replicate existing products and open source projects, release them under open source, and monetize the adoption. They’ve repeated this approach across multiple offerings. With AI, the replication process becomes even faster, though it risks producing low-quality imitations that alienate the broader community and people will resent the stealing of their work.
An alternative viewpoint which we are pretty open about in our docs:
> our technological choices are quite different; everything we use is open source; and wherever possible, we use and support existing tools rather than developing from scratch.
I understand that people get frustrated when there is any commercial interest associated to open source. But someone needs to fund open source efforts and we’re doing our best here. Some (perhaps non-obvious) examples
* we employ the maintainers of PostgREST, contributing directly to the project - not some private fork
* we employ maintainers of Postgres, contributing patches directly
* we have purchased private companies, like OrioleDB, open sourced the code and made the patents freely available to everyone
* we picked up unmaintained tools and maintained them at our own cost, like the Auth server, which we upstreamed until the previous owner/company stopped accepting contributions
* we worked with open source tools/standards like TUS to contribute missing functionality like Postgres support and advisory locks
* we have sponsored adjacent open source initiatives like adding types to Elixir
* we have given equity to framework creators, which I’m certain will be the largest donation that these creators have (and will) ever receive for their open source work
* and yes, we employ the maintainers of Vitess to create a similar offering for the Postgres ecosystem under the same Apache2 license
I am falling into a pattern of treating AI coding like a drunk mid-level dev: "I saw those few paragraphs of notes you wrote up on a napkin, and stayed up late Saturday night while drinking and spat out this implementation. you like?"
So I can say to myself, "No, do not like. But the overall gist at least started in the right direction, so I can revise it from here and still be faster than had I done it myself on Monday morning."
The most useful thing I've found is "I need to do X, show me 3 different popular libraries that do it". I've really limited my AI use to "Lady's Illustrated Primer" especially after some bad experiences with AI code from devs who should know better.
I don't even frame my requests conversationally. They usually read like brief demands, sometimes just comma delimited technologies followed by a goal. Works fine for me, but I also never prompt anything that I don't already understand how to do myself. Keeps the cart behind the horse.
I've started putting in my system prompt "keep answers brief and don't talk in the first/second person". Gets rid of all the annoying sycophancy and stops it from going on for ten paragraphs. I can ask for more details when I need it.
> It’s very time consuming and 80% of the time I end up wondering if it would’ve been quicker to just do it all by myself right from the start.
Yes, this. Every time I read these sort of step by step guides to getting the best results with coding agents it all just sounds like boatloads of work that erase the efficiency margins that AI is supposed to bring in the first place. And anecdotally, I've found that to be true in practice as well.
Not to say that AI isn't useful. But I think knowing when and where AI will be useful is a skill in and of itself.
At least for me, I can have five of these processes running at once. I can also use Deepresearch for generating the designs with a survey of literature. I can use NotebookLM to analyse the designs. And I use Sourcery, CodeRabbit, Codex and Codescene together to do code review.
It took me a long time to get there with custom cli tools and browser userscripts. The out of the box tooling is very limited unless you are willing to pay big £££s for Devin or Blitzy.
I think I’m working at lower levels, but usually my flow is:
- I start to build or refactor the code structure by myself creating the basic interfaces or skip to the next step when they already exist. I’ll use LLMs as autocomplete here.
- I write down the requirements and tell which files are the entry point for the changes.
- I do not tell the agent my final objective, only one step that gets me closer to it, and one at a time.
- I watch carefully and interrupt the agent as soon as I see something going wrong. At this point I either start over if my requirement assumptions were wrong or just correct the course of action of the agent if it was wrong.
Most of the issues I had in the past were from when I write down a broad objective that requires too many steps at the beginning. Agents cannot judge correctly when they finished something.
I have a similar, though not as detailed, process. I do the same as you up to the PRD, then give it the PRD and tell it the high level architecture, and ask it to implement components how I want them.
It's still time-consuming, and it probably would be faster for me to do it myself, but I can't be bothered manually writing lines of code any more. I maybe should switch to writing code with the LLM function by function, though.
Doesn't a head chef in a restaurant context delegate a lot to other people for the cooking? And of course they use many tools also. And pre-prepared parts also, often from external suppliers.
Yeah, sounds like it would have been far quicker to use the AI to give you a general overview of approaches/libraries/language features/etc, and then done the work yourself.
This. Having had the pleasure to review the work and fix the bugs of agent jockeys (generally capable developers that fell in love with Claude Code et al), I'm rather sceptical. The code often looks as if they were on mushrooms. They cannot reason about it whatsoever, like they weren't even involved, when I know they weren't completely hands off.
I really believe there are people out there that produce good code with these things, but all I've seen so far has been tragic.
Luckily, I've witnessed a few snap out of it and care again. Literally looks to me as if they had a substance abuse problem for a couple of months.
If you take a critical look at what comes out of contemporary agentic workflows, I think the conclusion must be that it's not there. So yeah, if you're a good reviewer, you would perhaps come to that conclusion much sooner.
I'm not even anti-LLM. Little things—research, "write TS types for this object", search my codebase, go figure out exactly what line in the Django rest framework is causing this weird behavior—are working great and saving me an hour here and 15m there.
It's really obvious when people lean on it, because they don't act like a beginner (trying things that might not work) or someone just being sloppy (where there's a logic to it but there's no attention to detail); it's like they copy-pasted from Stack Overflow search results at random, and there are pieces that might belong but the totality is incoherent.
I'm definitely not anti LLM, I use them all the time. Just not for generating code. I give it a go every couple of months, probably wasting more time on it than I should. I don't think I've felt any real advancements since last year around this time, and this agentic hype seems to be a bit ahead of its time, to put it mildly. But I absolutely get a lot of value out of them.
It's also nice for handling some little tasks that otherwise would have been just annoying enough for me to not do it. Small 5 or 6 line functions that I would have had to fiddle with for far longer to get right.
> I really believe there are people out there that produce good code with these things, but all I've seen so far has been tragic
I don't believe this at all, because all I've seen so far is tragic
I would need to see any evidence of good quality work coming from AI assisted devs before I start to entertain the idea myself. So far all I see is low effort low quality code that the dev themself is unable to reason about
Never tried psychedelic mushrooms, so that part is speculation. But no amount of weed or alcohol could get me even close to writing code that unhinged.
> If you’re a nitpicky code reviewer, I think you will struggle to use AI tooling effectively. [...] Likewise, if you’re a rubber-stamp code reviewer, you’re probably going to put too much trust in the AI tooling.
So in other words, if you are good at code review you are also good enough at writing code that you will be better off writing it yourself for projects you will be responsible for maintaining long term. This is true for almost all of them if you work at a sane place or actually care about your personal projects. Writing code for you is not a chore and you can write it as fluently and quickly as anything else.
Your time "using AI" is much better spent filling in the blanks when you're unfamiliar with a certain tool or need to discover a new one. In short, you just need a few google searches a day... just like it ever was.
I will admit that modern LLMs have made life easier here. AI summaries on search engines have indeed improved to the point where I almost always get my answer and I no longer get hung up meat-parsing poorly written docs or get nerd-sniped pondering irrelevant information.
Code review is part of the job, but one of the least enjoyable parts. Developers like _writing_ and that gives the most job satisfaction. AI tools are helpful, but they inherently increase the amount of code we have to review, and with more scrutiny than code from my colleagues, because of how unpredictable - yet convincing - they can be. Why did we create tools that do the fun part and increase the non-fun part? Where are the "code-review" agents at?
Maybe I'm weird but I don't actually enjoy the act of _writing_ code. I enjoy problem solving and creating something. I enjoy decomposing systems and putting them back together in a better state, but actually manually typing out code isn't something I enjoy.
When I use an LLM to code I feel like I can go from idea to something I can work with in much less time than I would have normally.
Our codebase is more type-safe, better documented, and it's much easier to refactor messy code into the intended architecture.
Maybe I just have lower expectations of what these things can do but I don't expect it to problem solve. I expect it to be decent at gathering relevant context for me, at taking existing patterns and re-applying them to a different situation, and at letting me talk shit to it while I figure out what actually needs to be done.
I especially expect it to allow me to be lazy and not have to manually type out all of that code across different files when it can just generate them it in a few seconds and I can review each change as it happens.
If natural language was an efficient way to write software, we would have done it already. Fact is that it's faster to write class X { etc }; than it is to write "create a class named X with behavior etc". If you want to think and solve problems yourself, it doesn't make sense to then increase your workload by putting your thoughts in natural language, which will be more verbose.
I therefore think it makes the most sense to just feed it requirements and issues, and telling it to provide a solution.
Also unless you're starting a new project or big feature with a lot of boiler plate, in my experience it's almost never necessary to make a lot of files with a lot of text in it at once.
Been doing it for ten years and still love the profession as much as, if not more than, when I started, but the joy of software development for me was always in seeing my idea come to life, in exploring all the clever ways people had solved so many problems, in trying to become as good at the craft as they were, and in sharing those solutions and ideas with like-minded peers.
I care deeply about the code quality that goes into the projects I work on because I end up having to maintain it, review it, or fix it when it goes south, and honestly it just feels wrong to me to see bad code.
But literally typing out the characters that make up the code? I couldn't care less. I've done that already. I can do it in my sleep, there's no challenge.
At this stage in my career I'm looking for ways to take the experience I have and upskill my teams using it.
I'd be crazy not to try and leverage LLMs as much as possible. That includes spending the time to write good CLAUDE.md files, set up custom agents that work with our codebase and patterns, it also includes taking the time to explain the why behind those choices to the team so they understand them, calling out bad PRs that "work" but are AI slop and teaching them how to get better results out of these things.
Idk man the profession is pretty big and creating software is still just as fun as when I was doing it character by character in notepad. I just don't care to type more than I need to when I can focus on problem solving and building.
While reading your comment it occurred to me that people code at different abstraction levels. I do systems programming in golang and rust and I - like you - enjoy seeing my ideas come to life, not so much the typing. The final result (how performant, how correct, how elegant and not complex) is in my control instead of an agent's; I enjoy having the creativity in the implementation. I can imagine other flavors of the profession working at higher abstraction layers and using more frameworks, where their result is dependent on how the framework executes. At that point, you might just want to connect all the frameworks/systems and get the feature out the door. And it is definitely a spectrum of languages, tools, frameworks that are more or less involved.
The creativity in implementing (e.g. an indexed array that, when it grows too large, gets reformatted into a less performant hashmap) is what I imagine being lost, and that is what brings people satisfaction. Pulling that off in a clean and not complex way... well, there is a certain reward in that. I don't have any long-term proof but I also hypothesize it helps with maintainability.
But I also see your point, sometimes I need a tool that does a function and I don't care to write it and giving the agent requirements and having it implemented is enough. But typically these tools are used and discarded.
Agreed 100% and I enjoy that part too, I just don't really see how that is being taken away.
The way I see it these tools allow me to use my actual brainpower mostly on those problems. Because all the rote work can now be workably augmented away, I can choose which problems to actually focus on "by hand" as it were. I'd never give those problems to an LLM to solve. I might however ask it to search the web for papers or articles or what have you that have solved similar problems and go from there.
If someone is giving that up then I'd question why they're doing that.. No one is forcing them to.
It's the problem solving itself that is fun, the "layer" that it's in doesn't really make a difference to me.
Yes, hence tests, linters, and actually verifying the changes it is making. You can't trust anything the LLM writes. It will hallucinate or misunderstand something at some point if your task gets long. But that's not the point, I'm not asking it to solve things for me.
I'm using it to get faster at building my own understanding of the problem, what needs to get done, and then just executing the rote steps I've already figured out.
Sometimes I get lucky and the feature is well defined enough just from the context gathering step that the implementation is literally just be hitting the enter key as I read the edits it wants to make.
Sometimes I have to interrupt it and guide it a bit more as it works.
Sometimes I realize I misunderstood something as it's thinking about what it needs to do.
One-shotting or asking the LLM to think for you is the worst way to use them.
You can take the output of an LLM and feed it into another LLM and ask it to fact-check. Not surprisingly, these LLMs have a high false negative rate, meaning that it won't always catch the error. (I think you agree with me so far.) However, the failures of these LLMs are independent of each other, so long as you don't share context. The converse is that the LLM has a less-than-we-would-like probability of detecting a hallucination, but if it does, then verification of that fact is reliable in future invocations.
Combine this together: you can ask an LLM to do X, for any X, then take the output and feed it into some number of validation instances to look for hallucinations, bad logic, poor understanding, whatever. What you get back on the first pass will look like a flip of the coin -- one agent claims it is hallucination, the other agent says it is correct; both give reasons. But feed those reasons into follow-up verifier prompts, and repeat. You will find that non-hallucination responses tend to persist, while hallucinations are weeded out. The stable point is the truth.
This works. I have workflows that make use of this, so I can attest to its effectiveness. The new-ish Claude Code sub-agent capabilities and slash commands are excellent for doing this, btw.
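As a rough illustration of what such a loop can look like (a sketch, not a definitive implementation), here it is in Go. `askLLM` is a hypothetical helper wrapping whatever model client is in use; the important properties are that every verifier call starts from a fresh context, and that objections are fed back into the next revision round.

```go
package verify

import (
	"fmt"
	"strings"
)

// askLLM is a hypothetical stand-in for a call to your model of choice,
// always started with a fresh context so verifier failures stay independent.
func askLLM(prompt string) string {
	// ... call the real model API here ...
	return "OK"
}

// CrossCheck has several independent verifier instances audit an answer.
// Objections are fed back into a revision prompt and the process repeats;
// answers that keep surviving fresh, independent checks are the ones to trust.
func CrossCheck(answer string, verifiers, rounds int) (string, bool) {
	for r := 0; r < rounds; r++ {
		var objections []string
		for v := 0; v < verifiers; v++ {
			verdict := askLLM("Audit the following for hallucinations, bad logic, or misunderstanding. Reply OK if sound, otherwise explain the problem:\n\n" + answer)
			if strings.TrimSpace(verdict) != "OK" {
				objections = append(objections, verdict)
			}
		}
		if len(objections) == 0 {
			return answer, true // survived every independent check this round
		}
		// Feed the objections back in and ask for a corrected answer.
		answer = askLLM(fmt.Sprintf(
			"Revise this answer to address the objections below.\n\nAnswer:\n%s\n\nObjections:\n%s",
			answer, strings.Join(objections, "\n---\n")))
	}
	return answer, false
}
```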
Fundamentally, unit tests are using the same system to write your invariants twice; it just so happens that they're different enough that failure in one tends to reveal a bug in the other.
You can't reasonably state this won't be the case with tools built for code review until the failure cases are examined.
Furthermore a simple way to help get around this is by writing code with one product while reviewing the code with another.
> unit tests are using the same system to write your invariants twice
For unit tests, the parts of the system that are the same are not under test, while the parts that are different are under test.
The problem with using AI to review AI is that what you're checking is the same as what you're checking it with. Checking the output of one LLM with another brand probably helps, but they may also have a lot of similarities, so it's not clear how much.
> The problem with using AI to review AI is that what you're checking is the same as what you're checking it with.
This isn't true. Every instantiation of the LLM is different. Oversimplifying a little, but hallucination emerges when low-probability next words are selected. True explanations, on the other hand, act as attractors in state-space. Once stumbled upon, they are consistently preserved.
So run a bunch of LLM instances in parallel with the same prompt. The built-in randomness & temperature settings will ensure you get many different answers, some quite crazy. Evaluate them in new LLM instances with fresh context. In just 1-2 iterations you will hone in on state-space attractors, which are chains of reasoning well supported by the training set.
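A sketch of that sample-and-judge step, under the same assumption of a hypothetical `askLLM` helper (any client with a temperature setting would do): generate several candidates in parallel, score each one in a fresh context, and keep the one with the most "sound" votes.

```go
package verify

import "sync"

// BestOfN generates n candidate answers in parallel (model temperature
// provides the diversity), then has fresh-context judges score each one and
// returns the candidate with the most "OK" votes. askLLM is the same
// hypothetical helper as in the sketch above.
func BestOfN(prompt string, n, judges int) string {
	candidates := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			candidates[i] = askLLM(prompt)
		}(i)
	}
	wg.Wait()

	best, bestVotes := candidates[0], -1
	for _, c := range candidates {
		votes := 0
		for j := 0; j < judges; j++ {
			if askLLM("Reply OK if the following answer is sound, otherwise explain why not:\n\n"+c) == "OK" {
				votes++
			}
		}
		if votes > bestVotes {
			best, bestVotes = c, votes
		}
	}
	return best
}
```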
What if you use a different AI model? Sometimes just a different seed generates a different result. I notice there is a benefit to seeing and contrasting the different answers. The improvement is gradual, it’s not a binary.
Weirdly, you can not only do this, it somehow does actually catch some of its own mistakes.
Not all of the mistakes, they generally still have a performance ceiling less than human experts (though even this disclaimer is still simplifying), but this kind of self-critique is basically what makes the early "reasoning" models one up over simple chat models: for the first-n :END: tokens, replace with "wait" and see it attempt other solutions and pick something usually better.
Turned out that for a lot of things (not all things, Transformers have a lot of weaknesses), using a neural network to score an output is, if not "fine", then at least "ok".
Generating 10 options with mediocre mean and some standard deviation, and then evaluating which is best, is much easier than deliberative reasoning to just get one thing right in the first place more often.
> Code review is part of the job, but one of the least enjoyable parts. Developers like _writing_ and that gives the most job satisfaction.
At least for me, what gives the most satisfaction (even though this kind of satisfaction happens very rarely) is when I discover some very elegant structure behind whatever has to be implemented that changes the whole way you thought about programming (or often even about life) for decades.
> what gives the most satisfaction (even though this kind of satisfaction happens very rarely) is when I discover some very elegant structure behind whatever has to be implemented that changes the whole way you thought about programming
A number of years ago, I wrote a caching/lookup library that is probably some of the favorite code I've ever created.
After the initial configuration, the use was elegant and there was really no reason not to use it if you needed to query anything that could be cached on the server side. Super easy to wrap just about any code with it as long as the response is serializable.
Under the hood, it would check the preferred caching solution (e.g., Redis/Memcache/etc), followed by less preferred options if the preferred wasn't available, followed by the expensive lookup if it wasn't found anywhere. Defaulted to in-memory if nothing else was available.
If the data was returned from cache, it would then compare the expiration to the specified duration... If it was getting close to various configurable tolerances, it would start a new lookup in the background and update the cache (some of our lookups could take several minutes*, others just a handful of seconds).
The hardest part was making sure that we didn't cause a thundering herd type problem by looking up stuff multiple times... in-memory cache flags indicating lookups in progress, so we could hold up other requests if it fell through and then let them know once it's available. While not the absolute worst case scenario, you might end up making the expensive lookups once from each of the servers that use it if the shared cache isn't available.
* most of these have a separate service running on a schedule to pre-cache the data, but things have a backup with this method.
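For flavor, here is a much-simplified sketch of the same idea (names and structure are mine, not the original library): a cache-aside lookup with a "refresh ahead of expiry" window and a per-key in-flight flag as the thundering-herd guard. A real version would try Redis/Memcached tiers before this in-memory map and handle the fallback chain.

```go
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value   any
	expires time.Time
}

// Cache is a single-tier simplification: a real implementation would check a
// preferred shared cache (Redis/Memcached) before this local map.
type Cache struct {
	mu       sync.Mutex
	local    map[string]entry
	inFlight map[string]bool
	ttl      time.Duration
	refresh  time.Duration // start a background refresh when this close to expiry
}

func New(ttl, refresh time.Duration) *Cache {
	return &Cache{
		local:    map[string]entry{},
		inFlight: map[string]bool{},
		ttl:      ttl,
		refresh:  refresh,
	}
}

// Get serves from cache when possible. Entries nearing expiry are returned
// immediately while a background goroutine refreshes them; only a true miss
// pays for the expensive lookup inline.
func (c *Cache) Get(key string, lookup func() (any, error)) (any, error) {
	c.mu.Lock()
	e, ok := c.local[key]
	c.mu.Unlock()

	if ok {
		if time.Until(e.expires) < c.refresh {
			go c.fill(key, lookup) // refresh behind the scenes, keep serving
		}
		return e.value, nil
	}

	v, err := lookup()
	if err != nil {
		return nil, err
	}
	c.store(key, v)
	return v, nil
}

// fill refreshes a key in the background; the inFlight flag ensures only one
// goroutine runs the expensive lookup for a given key at a time.
func (c *Cache) fill(key string, lookup func() (any, error)) {
	c.mu.Lock()
	if c.inFlight[key] {
		c.mu.Unlock()
		return
	}
	c.inFlight[key] = true
	c.mu.Unlock()

	if v, err := lookup(); err == nil {
		c.store(key, v)
	}

	c.mu.Lock()
	delete(c.inFlight, key)
	c.mu.Unlock()
}

func (c *Cache) store(key string, v any) {
	c.mu.Lock()
	c.local[key] = entry{value: v, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
}
```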
> Developers like _writing_ and that gives the most job satisfaction.
Is it possible that this is just the majority and there’s plenty of folks that dislike actually starting from nothing and the endless iteration to make something that works, as opposed to have some sort of a good/bad baseline to just improve upon?
I’ve seen plenty of people that are okay with picking up a codebase someone else wrote and working with the patterns and architecture in there BUT when it comes to them either needing to create new mechanisms in it or create an entirely new project/repo it’s like they hit a wall - part of it probably being friction, part not being familiar with it, as well as other reasons.
> Why did we create tools that do the fun part and increase the non-fun part? Where are the "code-review" agents at?
Presumably because that’s where the most perceived productivity gain is in. As for code review, there’s CodeRabbit, I think GitLab has their thing (Duo) and more options are popping up. Conceptually, there’s nothing preventing you from feeding a Git diff into RooCode and letting it review stuff, alongside reading whatever surrounding files it needs.
> I’ve seen plenty of people that are okay with picking up a codebase someone else wrote and working with the patterns and architecture in there BUT when it comes to them either needing to create new mechanisms in it or create an entirely new project/repo it’s like they hit a wall - part of it probably being friction, part not being familiar with it, as well as other reasons.
For me, it's exactly the opposite:
I love to build things from "nothing" (if I had the possibility, I would even like to write my own kernel that is written in a novel programming language developed by me :-) ).
On the other hand, when I pick up someone else's codebase, I nearly always (if it was not written by some insanely smart programmer) immediately find it badly written. In nearly all cases I tend to be right in my judgements (my boss agrees), but I am very sensitive to bad code, and often ask myself how the programmer who wrote the original code did not yet commit seppuku, considering how much of a shame the code is.
Thus: you can in my opinion only enjoy picking up a codebase someone else wrote if you are incredibly tolerant of bad code.
> Developers like _writing_ and that gives the most job satisfaction.
Not me. I enjoy figuring out the requirements, the high-level design, and the clever approach that will yield high performance, or reuse of existing libraries, or whatever it is that will make it an elegant solution.
Once I've figured all that out, the actual process of writing code is a total slog. Tracking variables, remembering syntax, trying to think through every edge case, avoiding off-by-one errors. I've gone from being an architect (fun) to slapping bricks together with mortar (boring).
I'm infinitely happier if all that can be done for me, everything is broken out into testable units, the code looks plausibly correct, and the unit tests for each function cover all cases and are demonstrably correct.
You don't really know if the system design you've architected in your mind is any good though, do you, until you've actually tried coding it. Discovering all the little edge cases at that point is hard work ("a total slog") because it's where you find out where the flaws in your thinking were, and how your beautifully imagined abstractions fall down.
Then after going back and forth between thinking about it and trying to build it a few times, after a while you discover the real solution.
Or at least that's how it's worked for me for a few decades, everyone might be different.
That's why you have short functions, so you don't have to track that many variables. And use symbol completion (a standard in many editors).
> trying to think through every edge case, avoiding off-by-one errors.
That is designing, not coding. Sometimes I think of an edge case, but I'm already on a task that I'd like to finish, so I just add a TODO comment. Then, at least before I submit the PR, I ripgrep the project for this keyword and others.
Sometimes the best design is done by doing. The tradeoffs become clearer when you have to actually code the solution (too much abstraction, too verbose, unwieldy,...) instead of relying on your mind (everything seems simpler)
You always have variables. Not just at the function level, but at the class level, object level, etc. And it's not about symbol completion, it's about remembering all the obscure differences in built-in function names and which does what.
And no, off-by-one errors and edge cases are firmly part of coding, once you're writing code inside of a function. Edge cases are not "todos", they're correctly handling all possible states.
> Sometimes the best design is done by doing.
I mean, sure go ahead and prototype, rewrite, etc. That doesn't change anything. You can have the AI do that for you too, and then you can re-evaluate and re-design. The point is, I want to be doing that evaluation and re-designing. Not typing all the code and keeping track of loop states and variable conditions and index variables and exit conditions. That stuff is boring as hell, and I've written more than enough to last a lifetime already.
> You always have variables. Not just at the function level, but at the class level, object level, etc.
Aka the scope. And the namespace of whatever you want to access. Which is a design problem.
> And it's not about symbol completion, it's about remembering all the obscure differences in built-in function names and which does what
That's what references are for. And some IDEs bring them up right alongside the editor. If not, you have online and offline references. You remember them through usage and semantics.
> And no, off-by-one errors and edge cases are firmly part of coding, once you're writing code inside of a function.
They're not. You define the happy path and the error cases as part of the specs. But specs are generally lacking in precision (full of ambiguities) and only cover the essential complexity. The accidental complexity comes with the platform and is also part of the design. Treating those kinds of errors as part of coding is shortsighted.
> Not typing all the code and keeping track of loop states and variable conditions and index variables and exit conditions. That stuff is boring as hell, and I've written more than enough to last a lifetime already
That is like saying "Not typing all the text and keeping track of words and punctuation and paragraphs and signatures. English is boring as hell and I've written more than enough..."
If you don't like formality, say so. I've never had anyone describe coding as you did. No one thinks about that stuff that closely. It's like a guitar player complaining about which strings to strike with a finger. Or a race driver complaining about the angle of the steering wheel and having to press the brake.
I don't know what to tell you. Sure, there are tools like IDEs to help, but they don't help with everything.
The simple fact is that I find there's very little creative satisfaction to be found in writing most functions. Once you've done it 10,000 times, it's not exactly fun anymore, I mean unless you're working on some cutting-edge algorithm which is not what we're doing 99.9% of the time.
The creative part comes in at the higher level of design, where it's no longer rote. This is the whole reason people move up into architecture roles, designing systems and libraries and APIs instead of writing lines of code.
The analogies with guitar players or race car drivers or writers are flawed, because nothing they do is rote. Every note matters, every turn, every phrase. They're about creativity and/or split-second decision making.
But when you're writing code, that's just not the case. For anything that's a 10- or 20-line function, there isn't usually much creativity there, 99.99% of the time. You're just translating an idea into code in a straightforward way.
So when you say, "Developers like _writing_ and that gives the most job satisfaction." That's just not true. Especially not for many experienced devs. Developers like thinking, in my experience. They like designing, the creative part. Not the writing part. The writing is just the means to the end.
Because the goal of "AI" is not to have fun, it's to solve problems and increase productivity. I have fun programming too, but you have to realize the world isn't optimizing to make things more fun.
I hear you, but without any enjoyment in the process, quality and productivity go down the drain real fast.
The Ironies of Automation paper is something I mention a lot; the core thesis is that making humans review / rubber-stamp automation reduces their work quality. People just aren't wired to do boring stuff well.
As a human, I do agree that it would be better and we should strive for that. However, I don't think humans are really driving all this progress/innovation. It is just evolution doing what it's always done; it is ruthless and does not care at all whether we are having fun or not.
I will second this. I believe code review agents and search summaries are the way forward for coding with LLMs.
The ability to ignore AI and focus on solving the problems has little to do with "fun". If anything it leaves a human-auditable trail to review later and hold accountable devs who have gone off the rails and routinely ignored the sometimes genuinely good advice that comes out of AI.
If humans don't have to helicopter over developers, that's a much bigger productivity boost than letting AI take the wheel. This is a nuance missed by almost everyone who doesn't write code or care about its quality.
Code review isn't the same as design review, nor are these the only type of things (coding and design) that someone may be trying to use AI for.
If you are going to use AI and catch its mistakes, then you need to have expertise in whatever it is you are using the AI for. Even if we limit the discussion just to coding, being a good code reviewer isn't enough - you'd need to have skill at whatever you are asking the AI to do. One of the valuable things AI can do is help you code using languages and frameworks you are not familiar with, which then of course means you are not going to be competent to review the output, other than in the most generic fashion.
A bit off topic, but it's weird to me to see the term "coding" make a comeback in this AI/LLM era. I guess it is useful as a way to describe what AI is good at - coding vs. more general software development - but how many companies nowadays hire coders as opposed to software developers (I know it used to be a thing with some big companies like IBM)? Rather than compartmentalized roles, the direction nowadays seems to be expecting developers to do everything from business analysis and helping develop requirements, to architecture/design and then full-stack development, and subsequent production support.
> Using AI agents correctly is a process of reviewing code. [...]
> Why is that? Large language models are good at producing a lot of code, but they don’t yet have the depth of judgement of a competent software engineer. Left unsupervised, they will spend a lot of time committing to bad design decisions.
Obviously you want to make course corrections sooner than later. Same as I would do with less experienced devs, talk through the high level operations, then the design/composition. Reviewing a large volume of unguided code is like waiting for 100k tokens to be written only to correct the premise in the first 100 and start over.
I love doing code review for colleagues since I know that it bolsters our shared knowledge, experience and standards. Code review for an external, stubborn, uncooperative AI? No thanks, that sounds like burnout.
No. The failure modes of "AI agents" are not even close to classical human mistakes (the only ones code review has anything more than an infinitesimal chance of catching). There is absolutely no skill transfer, and it is a poor excuse regardless, since review was never going to catch those failures anyway.
Code review can be almost as much effort as writing the code, especially when the code is not up to the expectations of the reviewer. This is fine, because you want two people (the original author and the reviewer) on the code.
When reviewing AI code, not only will the effort needed by the reviewer increase, you also lose the second person (the author) looking at the code, because AI can't do that. It can produce code but not reason about or reflect on it like humans can.
I think that I review code much differently than the author. When I'm reviewing code, my assumption is that the person writing it has already verified that it works. I am primarily looking for readability and code smells.
In an ideal world I'd probably be looking more at the actual logic of the code. However, everywhere I've worked it's a full-time job just desperately trying to fight ballooning complexity from people who prioritize quick turnaround over quality code.
I think I'm good at code review, but we've all seen parts of the codebase where it's written by one teammate with specific domain knowledge and your option is to approve something you don't fully understand or to learn the background necessary to understand it.
In my experience, not having to learn the background is the biggest time saver provided by LLM coding (e.g. not having to read through API docs or confirm details of a file format or understand some algorithm). So in a way I feel like there is a fundamental tension.
This isn't some triviality you can throw aside as unimportant, it is the shape that the code has today, and limits and controls what it will have tomorrow.
It's how you make things intuitive, and it is equally how you ensure people follow a correct flow and don't trap themselves into a security bug.
I really disagree with this too, especially given the article's next line:
> ...You’ll be forever tweaking individual lines of code, asking for a .reduce instead of a .map.filter, bikeshedding function names, and so on. At the same time, you’ll miss the opportunity to guide the AI away from architectural dead ends.
I think a good review will often do both, and understand that code happens at the line level and also the structural level. It implies a philosophy of coding that I have seen be incredibly destructive firsthand — committing a bunch of shit that no one on a team understands and no one knows how to reuse.
This is distinctly not the API, but an implementation detail.
Personally, I can ask colleagues to change function names, rework hierarchy, etc. But I'd leave this exact example be, as it does not make any material difference, regardless of my personal preference.
Agreed. A program is made of names, these names are of the utmost importance. For understanding, and also for searchability.
I do a lot of code reviews, and one of the main things I ask for, after bug fixes, is renaming things so that readers understand them unambiguously at first read and so that they match the various conventions we use throughout the codebase.
Ex: new dev wrote "updateFoo()" for a method converting a domain thing "foo" from its type in layer "a" to its type in layer "b", so I asked him to use "convertFoo_aToB()" instead.
This blog gets posted often but the content is usually lousy. Lots of specious assertions about the nature of software development that really give off a "I totally have this figured out" vibe. I can't help but feel that anyone who feels so about this young industry that changes so rapidly and is so badly performed at so many places, is yet to summit Mt. Stupid.
I think I'd actually have a use for an AI that could receive my empty public APIs (such as a C++ header file) as an input and produce a first rough implementation. Maybe this exists already, I don't know because I haven't done any serious vibe coding.
As long as you're reinventing the wheel (implementing some common pattern because you don't want to pull in an entire dependency), that kind of AI generation works quite well. Especially if you also have the AI generate tests for its code, so you can force it to iterate on itself while it gets things wrong the first couple of tries. It's slow and resource intensive, but it'll generate something mostly complete most of the time.
I'm not sure if you're saving any time there, though. Perhaps if you give an LLM a task before ending the work day so it can churn away for a while unattended, it may generate a decent implementation. There's a good chance you need to throw out the work too; you can't rely on it, but it can be a nice bonus if you're lucky.
I've found that this only works on expensive models with large context windows and limited API calls, though. The amount of energy wasted on shit code that gets reverted must be tremendous.
I hope the AI industry makes good on its promise that it'll solve the whole inefficiency problem, because the way things are going now, the industry isn't sustainable.
The leading models have been very good at this for over a year now. Try copying one of your existing C++ header files into GPT-5 or Claude 4 or Gemini 2.5 as an experiment and see how they do.
You can do this already; the most useful things to help with this are either writing tests yourself, or having it write tests and telling it how to compile and see the error messages so you can let it loop.
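That loop is easy to sketch outside any particular agent product. A rough Python outline, where `ask_agent` is a hypothetical stand-in for whatever agent or API you use, and the build/test commands are illustrative:

```python
# Sketch of the "let it loop" workflow: build, run tests, and hand any
# failure output back to the agent until everything is green.
import subprocess

def ask_agent(prompt: str) -> None:
    # Hypothetical hook: wire this up to your coding agent or LLM API,
    # which is expected to edit the source files in response.
    raise NotImplementedError("connect this to your agent of choice")

def build_and_test() -> tuple[bool, str]:
    # Illustrative commands; use your project's real build/test entry points.
    for cmd in (["make"], ["ctest", "--output-on-failure"]):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return False, result.stdout + result.stderr
    return True, ""

def iterate(max_rounds: int = 5) -> bool:
    for _ in range(max_rounds):
        ok, log = build_and_test()
        if ok:
            return True
        ask_agent(f"The build or tests failed:\n{log}\nFix the implementation.")
    return False
```

The important part is that the agent sees the compiler and test output verbatim, not a summary of it.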
I am good at code review, sure, but I don't like doing it. It's about as strong an engineering technique as coding at a whiteboard. I know I'm at a tiny fraction of my potential without debugging tools, and for that reason code review on GitHub is usually a waste of my time. I'll just write code, thanks, and I'll move the needle on quality by developing. As a reviewer I'll scan for smells, but I assume that you, too, would be most effective if I left you to make and clean up your own messes, so long as they aren't egregious.
> In my view, the best code review is structural. It brings in context from parts of the codebase that the diff didn’t mention.
That may be true for AI code.
But it would be pretty terrible for human-written code to bring this up after the code is written, wasting hours or days of effort for lack of a little up-front communication on design.
AI makes routine code generation cheap -- only seconds/minutes and cents are being wasted -- but you essentially still need that design session.
What does this mean for juniors? A few companies are now introducing expectations that all engineers will use coding agents, including juniors and grads. If they haven't yet learnt what good looks like through experience, how are they going to review code produced by AI agents?
I have received a few LLM-produced PRs from peers on adjacent teams, offered in good faith but without familiarity with the project, and they increasingly infuriate me. They were all garbage, but there's a great asymmetry: it costs my peers nothing to generate them, and it costs me precious time to refute them. And what can I do, really? I could say "it's irreparable garbage; the syntax might be right but it's conceptually nonsense", but that's not the most constructive take.
You could use an LLM to give you advice on how to present that take in a more constructive manner.
Partially sarcastic but I do personally use LLMs to guide my communication in very limited cases:
1. It's purely business related, and
2. I'm feeling too emotionally invested (or more likely, royally pissed off) and don't trust myself to write in a professional manner, and
3. I genuinely want the message to sound cold, corporate, and unemotional
Number 3 would fit you here. These people are not being respectful of your time with the code they're presenting for review. Why should you take the time to write back personally?
It should be noted that this accounts for maybe 5% of my business communications, and I'm careful not to let that number grow.
> Why should you take the time to write back personally?
Because it's 3 sentences, if you want to be way more polite and verbose than necessary.
"I will close PRs if they appear to be largely LLM-generated. I am always happy to review something with care and attention if it shows the same qualities. Thanks!"
So the idea is to get your coworkers to stop sending you AI slop by sending them AI slop in retaliation?
They're either lying about using AI, or they're incompetent enough to produce AI-quality (read: garbage) code; either way, the company should let them go.
That would be the nuclear option, but if you have any rapport at all with the person or team in question, you could also just pull them aside, ask if they are under unusual pressure to show progress, and make it clear that you get it, and you want to help, but that you can't if you're drowning in AI slop code review. I imagine it's a junior doing this, in which case it's in their career interest to stop and start acting like a professional. I've had seniors tell me more or less the same thing, in the pre-llm era: "slow down and get it right." Sometimes you just need to hear that.
This feels like a culture problem. I have seen higher-quality PRs as people use AI to review their work before pushing it. This means less silly typos and obvious small bugs.
Nothing makes me hate AI more than getting a slop PR written by one of the agent-wielding coworkers, with comments describing what the next line does for every single line. More often than not it looks plausible but turns out to be completely unusable upon closer inspection. Incredibly disrespectful to do this to your coworkers imo; it's proper that you call it out.
If you had a colleague who was consistently writing complete shit, you would raise it with your manager. This situation isn't all that different - the only complicating factor is that they're not on your team.
If it's only happened a few times you might first try setting some ground rules for contributions. Really common for innersource repos to have a CONTRIBUTING.md file or similar. Add a checkbox to your PR template that the dev has to check to indicate they've read it, then wait and see.
As someone that basically does code reviews for a living, the last thing I want to do is code-review agents. I want to reduce how much review I'm doing, not hand-hold some AI agents.
Getting AI to produce a bunch of code and then you having to filter through it all is a massive waste of time. The focus should be on getting AI to produce better code in the first place (e.g., using detailed plans), rather than on the volume of code you can produce...
I have only had real advantages with AI for helping me plan changes, and for it helping me to review my code. Getting it to write code for me has been somewhat helpful, but only for simple tedious changes or first drafts. But it is definitely not something I want to leverage by getting AI to produce more and more code that I then have to filter through and review. No thank you. I feel like this is really the wrong focus for implementing AI into your workflows.
TypeScript with NextJS. I've also used AI tools with C and Zig, and AI is much better at writing TS. But even though TS works much better, it's still not that great. This is largely because the quality of the code that AI writes is not good enough, so then I have to spend a decent chunk of time fixing it.
Everyone I know trying to use AI in large codebases has had similar experiences. AI is not good enough at following the rules of your codebase yet (i.e., following structure, code style, library usage, re-using code, refactoring, etc...). This makes it far less useful for writing code changes and additions. It can still be useful for small changes, or for writing first drafts of functions/classes/interfaces, but for more meaningful changes it often fails.
That is why I believe that right now, if you want to maintain a large codebase, and maintain a high bar for quality, AI tools are just not good enough at writing most code for you yet. The solution to this is not to get AI to write even more code for you to review and throw out and iterate upon in a frustrating cycle. Instead, I believe it is to notice where AI is helpful and focus on those use-cases, and avoid it when it is not.
That said, AI labs seem to be focusing a lot of effort on improving AI for coding right now, so I expect a lot of progress will be made on these issues in the next few years.
- If I had to iterate as much with a Jr dev as CC on not highly difficult stuff ("of course, I'll just do X!" then X doesn't work, then "of course, the answer is Y!" then Y doesn't work, etc.) I probably would have fired them by now or just say "never mind, I'll do it myself" .
- On the other hand a Jr dev will (hopefully) learn as they go, get better each time, so a month from now they're not making the same mistakes. An LLM can't learn so until there's a new model they keep making the same mistakes (yes, within a session they can learn -- if the session doesn't get too long -- but not across sessions). Also, the Jr dev can test their solution (which may require more than just running unit tests) and iterate on it so that they only come to me when it works and/or they're stuck. Just yesterday, on a rather simple matter, I wasted so much time telling the LLM "that didn't work, try again".
This idea that you can get good results from a bad process as long as you have good quality control seems… dubious, to say the least. “Sure, it’ll produce endless broken nonsense, but as long as someone is checking, it’s fine.” This, generally, doesn’t really work. You see people _try_ it in industry a bit; have a process which produces a high rate of failures, catch them in QA, rework (the US car industry used to be notorious for this). I don’t know of any case where it has really worked out.
Imagine that your boss came to you, the tech lead of a small team, and said “okay, instead of having five competent people, your team will now have 25 complete idiots. We expect that their random flailing will sometimes produce stuff that kinda works, and it will be your job to review it all.” Now, you would, of course, think that your boss had gone crazy. No-one would expect this to produce good results. But somehow, stick ‘AI’ on this scenario, and a lot of people start to think “hey, maybe that could work.”
Reviewing code from less experienced or unmotivated people is also very taxing, both in a cognitive and emotional sense. It will never approach a really good level of quality because you just give up after 4 rounds of reviews on the same feature.
Except humans learn from your PR comments and in other interactions with more experienced people, and so inexperienced devs become experienced devs eventually. LLMs are not so trainable.
Some people say we're near the end of pre-training scaling, and RLHF etc is going to be more important in the future. I'm interested in trying out systems like https://github.com/OpenPipe/ART to be able to train agents to work on a particular codebase and learn from my development logs and previous interactions with agents.
LLMs can learn if you provide them rules in your repo, and update those rules as you identify the common mistakes the LLM makes.
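As a rough illustration of what such a rules file can look like (the file name and how the agent picks it up vary by tool, and the paths and conventions below are hypothetical):

```
# Agent rules for this repo
- Follow the existing module layout; do not introduce new top-level packages.
- Reuse the helpers in internal/util before writing new ones.
- Every exported function needs a doc comment and a table-driven test.
- Never edit generated files (*_gen.go); change the generator instead.
# Append a rule whenever a review catches the agent repeating a mistake.
```

The last line is the important habit: every recurring review finding becomes a rule, which is the closest current agents get to being "trained" on a codebase.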
If they're unmotivated enough to not get there after four review rounds for a junior-appropriate feature, they're not going to get better. It's a little impolite to say, but if you spend any significant amount of time coaching juniors you'll encounter exactly what I'm talking about.
retarded take
Can you elaborate, or do you call it a day after insulting?
Here’s the thing about AI though - you don’t need to worry about its confidence or impact on professional development if you’re overly critical, and it will do a turn within seconds. That gives a tremendous amount of flexibility and leverage to the code reviewer. Works better on some types of problems than others, but it’s worth exploring!
With human co-workers, you can generally assume things you can't with AI.
My human co-workers generally have good faith. Even the developer who was clearly on the verge of getting a role elsewhere without his heart in it-- he tried to solve the problems assigned to him, not some random delusion that the words happened to echo. I don't have that level of trust with AI.
If there's a misunderstanding the problem or the context, it's probably still the product of a recognizable logic flow that you can use to discuss what went wrong. I can ask Claude "Why are you converting this amount from Serbian Dinars to Poppyseed Bagels in line 476?" but will its answer be meaningful?
Human code review often involves a bit of a shared background. We've been working with the same codebases for several years, so we're going to use existing conventions. In this situation, the "AI knows all and sees all" becomes an anti-feature-- it may optimize for "this is how most people solve this task from a blank slate" rather than "it's less of a cognitive burden for the overall process if your single change is consistent with 500 other similar structures which have been in place since the Clinton administration."
There may be ways to try to force-feed AI this behaviour, but the more effort you devote to priming and pre-configuring the machine, the less you're actually saving over doing the actual work in the first place.
Right, this is the exact opposite of the best practices that W. Edwards Deming helped develop in Japan, then brought to the West.
Quality needs to come from the process, not the people.
Choosing to use a process known to be flawed, then hoping that people will catch the mistakes, doesn't seem like a great idea if the goal is quality.
The trouble is that LLMs can be used in many ways, but only some of those ways play to their strengths. Management have fantasies of using AI for everything, having either failed to understand what it is good for, or failed to learn the lessons of Japan/Deming.
> Choosing to use a process known to be flawed, then hoping that people will catch the mistakes, doesn't seem like a great idea if the goal is quality.
You're also describing the software development process prior to LLMs. Otherwise code reviews wouldn't exist.
People have built complex, working, mostly bug-free products without code reviews, so humans are not that flawed.
With humans and code reviews, two humans have looked at it. With an LLM and code review of the LLM output, only one human has looked at it, so it's not the same. LLMs are still far from as reliable as humans, or you could just tell the LLM to do the code reviews and have it build the entire complex product itself.
People have built complex, bug-free software without __formal__ code review. It's very rare to write complex, bug-free software without at least __informal__ code review, and when it happens it's luck, not skill.
Can't have a code review if you're coding solo[0], unless we are redefining the meaning of "code review" to the point of meaninglessness by including going over one's own code.
0. The dawn of video games had many titles with one person responsible for programming. This remains the case for many indie games and small software apps and services. It's a skill that requires expertise and/or dedication.
Sure - software development is complex, but there seems to be a general attempt over time to improve the process and develop languages, frameworks and practices that remove the sources of human error.
Use of AI seems to be a regression in this regard, at least as currently used - "look ma, no hands! I've just vibe coded an autopilot". The current focus seems to be on productivity - how many more lines of code or vibe-coded projects can you churn out - maybe because AI is still basically a novelty that people are still learning how to use.
If AI is to be used productively towards achieving business goals then the focus is going to need to mature and change to things like quality, safety, etc.
Code reviews are useful, but I think everyone would admit that they are not _perfect_.
> Management have fantasies of using AI for everything, having either failed to understand what it is good for, or failed to learn the lessons of Japan/Deming.
Third option: they want to automate all jobs before the competition does. Think of it as AWS, but for labor.
Oh man, that's what I've been smelling with all this. It's the Red Bead Experiment, all over again. https://www.youtube.com/watch?v=ckBfbvOXDvU
> Deming helped develop in Japan
Deming’s process was about how to operate a business in a capital-intensive industry when you don’t have a lot of capital (with market-acceptable speed and quality). That you could continue to push it and raise quality as you increased the amount of capital you had was a side-effect, and the various Japanese automakers demonstrated widely different commitments to it.
And I’m sure you know that he started formulating his ideas during the Great Depression and refined them while working on defense manufacturing in the US during WWII.
> Quality needs to come from the process, not the people.
Not sure which Japanese school of management you're following, but I think Toyota-style goes against that. The process gives more autonomy to workers than, say, Ford-style, where each tiny part of the process is pre-defined.
I got the impression that Toyota-style was considered to bring better quality to the product, even though it gives people more autonomy.
In an ideal world all employees would be top notch, on their game every day, never making mistakes, but the real world isn't like that. If you want repeatable quality then it needs to be baked into the process.
It's a bit like Warren Buffett saying he only wants to invest in companies that could be run by an idiot, because one day they will be.
W. Edwards Deming actually worked with both Toyota and Ford, perhaps more foundationally at Toyota, bringing his process-based-quality ideas to both. Toyota's management style is based around continuous process improvement, combined with the employee empowerment that you refer to.
Evolution via random mutation and selection.
Or more broadly, the existence of complex or any life.
Sure, it's not the way I would pick to do most things, but when your buzzword magical thinking runs so deep that all you have is a hammer, you will force your wage slaves to hammer it anyway until it works, even if it doesn't look like a nail.
As to your other cases: injection-molded plastic parts for things like the spinning T-bar spray arm in some dishwashers. Crap molds, then pass the parts to low-wage or temp workers to fix up by hand with a razor blade and box up. I've personally worked such a temp job before, among others, so yes, that kind of bad output with manual QC and fix-up still abounds.
And if we are talking high failure rates... see also chip binning and foundry yields in semiconductors.
You just have to look around to see that the dubious-seeming approach is more the norm.
What happens is a kind of feeling of developing a meta-skill. It's tempting to believe the scope of what you can solve has expanded when you assess yourself as "good" with AI.
It's the same with any "general" tech. I've seen it since genetic algorithms were all the rage. Everyone reaches for the most general tool, then assumes everything that tool might be used for is now a problem or domain they are an expert in, with zero context into that domain. AI is this times 100x, plus one layer more meta, as you can optimize over approaches with zero context.
That's an oversimplification. AI can genuinely expand the scope of things you can do. How it does this is a bit particular though, and bears paying attention to.
Normally, if you want to achieve some goal, there is a whole pile of tasks you need to be able to complete to achieve it. If you don't have the ability to complete any one of those tasks, you will be unable to complete the goal, even if you're easily able to accomplish all the other tasks involved.
AI raises your capability floor. It isn't very effective at letting you accomplish things that are meaningfully outside your capability/comprehension, but if there are straightforward knowledge/process blockers that don't involve deeper intuition it smooths those right out.
Normally, one would learn the missing steps, with or without AI.
You're probably envisioning a more responsible use of it (floor raising, "meaningfully inside your comprehension"), which is actually not what I'm referring to at all ("assumes everything that tool might be used for is now a problem or domain they are an expert in"). A meta tech can be used in many ways, and yours is close to what I believe the right method is. But I'm asserting that the danger is massive over-reliance and over-confidence in the "transferability".
> If you don't have the ability complete any one of those tasks, you will be unable to complete the goal
Nothing has changed. Few projects start with you knowing all the answers. In the same way AI can help you learn, you can learn from books, colleagues, and trial and error for tasks you do not know.
I can say from first hand experience that something has absolutely changed.
Before AI, if I had the knowledge/skill to do something on the large scale, but there were a bunch of minute/mundane details I had to figure out before solving the hard problems, I'd just lose steam from the boredom of it and go do something else. Now I delegate that stuff to AI. It isn't that I couldn't have learned how to do it, it's that I wouldn't have because it wouldn't be rewarding enough.
That’s great - you personally have found a tool that helps you overcome unknown problems. Other people have other methods for doing that. Maybe AI makes that more accessible in general.
Yep. All the process in the world won’t teach you to make a system that works.
The pattern I see over and over is a team aimlessly puttering along through tickets in sprints until an engineer who knows how to solve the problem gets it on track personally.
What I took away from the article was that being good at code review makes the person better at guiding the agent to do the job, giving the right context and constraints at the right time… and not that the code reviewer has to fix whatever the agent generated… this is also pretty close to my personal experience… LLMs are a bull which can be guided, and definitely not a complete idiot…
In a strange kind of analogy, flowing water can cause a lot of damage… but a dam built to the right specification, with turbines, can harness that for something very useful… the art is to learn how to build that dam.
I have a play project which hits these constraints a lot.
I have been messing around with getting AI to implement novel (to me) data structures from papers. They're not rocket science or anything, but there's a lot of detail. Often I do not understand the complex edge cases in the algorithms myself, so I can't even "review my way out of it". I'm also working in Go, which is usually not a very good fit for implementing these things because it doesn't have sum types; lack of sum types often adds so much interface{} bloat it would render the data structure pointless. Am working around that with codegen for now.
What I've had to do is demote "human review" a bit; it's a critical control but it's expensive. Rather, think more holistically about "guard rails" to put where and what the acceptance criteria should be. This means that when I'm reviewing the code I am reasonably confident it's functionally correct, leaving me to focus on whether I like how that is being achieved. This won't work for every domain, but if it's possible to automate controls, it feels like this is the way to go wherever possible.
The "principled" way to do this would be to use provers etc, but being more of an engineer I have resorted to ruthless guard rails. Bench tests that automatically fail if the runtime doesn't meet requirements (e.g. is O(n) instead of O(log n)) or overall memory efficiency is too low - and enforcing 100% code coverage from both unit tests AND fuzzing. Sometimes the cli agent is running for hours chasing indexes or weird bugs; the two main tasks are preventing it from giving up, and stopping it from "punting" (wait, this isn't working, let me first create a 100% correct O(n) version...) or cheating. Also reminding it to check AGAIN for slice sharing bugs which crop up a surprising % of the time.
The other "interesting" part of my workflow right now is that I have to manually shuffle a lot between "deep research" (which goes and reads all the papers and blogs about the data structure) and the cli agent which finds the practical bugs etc but often doesn't have the "firepower" to recognise when it's stuck in a local maximum or going around in circles. Have been thinking about an MCP that lets the cli agent call out to "deep research" when it gets really stuck.
The issue with the hypothetical is that if you give a team lead 25 competent people, they'd also get bad results. Or at least, the "team lead" isn't really leading their team on technical matters, apart from fighting off the odd attempt to migrate to MongoDB and hoping that their people are doing the right thing. The sweet spot for teams is 3-6 people, and someone more interested in empire building than technical excellence can handle maybe around 9 people and still do a competent job. It doesn't depend much on the quality of the people.
The way team leads seem to get used is that people who are good at code get a little more productive as more people are told to report to them. What is happening now is that senior-level engineers all automatically get the same option: a team of 1-2 mid-level engineers on the cheap thanks to AI, which is entirely manageable. And anyone less capable gets a small team, a rubber duck, or a mentor, depending on where they fall vs LLM use.
Of course, the real question is what will happen as the AIs get into the territory traditionally associated with 130+ IQ ranges and the engineers start to sort out how to give them a bit more object persistence.
Imagine a factory making injection molded plastic toys but instead of pumping out perfect parts 99.999% of the time, the machine gives you 50% and you have to pay people to pull out the bad ones from a full speed assembly line and hope no bad ones get through.
Is this not how microchips are made?
Chips can be and are tested automatically. It's not like there are people running beside a conveyor belt with a microscope. So no, it's not the same thing. One doesn't need people, while the other one does.
Also, a fab is intended to make fully functioning chips, which sometimes it fails to achieve. An LLM is NOT designed to give correct output, just plausible output. Again, not the same goal.
LLMs are an implementation of bogosort. But less efficient.
> but as long as someone is checking
I predict many disastrous "AI" failures because the designers somehow believed that "some humans capable of constant vigilant attention to detail" was an easy thing they could have.
It also assumes that people who are "good" at the standard code review process (which is tuned for reviewing code written by humans with some level of domain experience and thus finding human-looking mistakes) will be able to translate their skills perfectly to reviewing code written by AI. There have been plenty of examples where this review process was shown to be woefully insufficient for things outside of this scope (for instance, malicious patches like the bad patches scandal with Linux a few years ago or the xz backdoor were only discovered after the fact).
I haven't had to review too much AI code yet, but from what I've seen it tends to be the kind of code review that really requires you to think hard and so seems likely to lead to mistakes even with decent code reviewers. (I wouldn't say that I'm a brilliant code reviewer, but I have been doing open source maintenance full-time for around a decade at this point so I would say I have some experience with code reviews.)
I'm not sure about the current state of the art, but microprocessor production yields are (were?) very bad. You make a lot of them on a single silicon wafer, and then test them thoroughly until you find the few that are good. You drop all the defective ones, because they are very cheap pieces of sand, and charge a lot for the ones that work correctly to cover all the costs.
I'm not sure how this translates to programming; code review is too expensive. But for short code you can try https://en.wikipedia.org/wiki/Superoptimization
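For the curious, superoptimization is roughly "bad process plus exhaustive checking" taken literally: enumerate tiny programs and keep the first one that matches a reference on every test input. A toy sketch with a made-up stack machine, nothing like a production superoptimizer:

```python
# Toy superoptimizer: brute-force search over tiny stack-machine programs
# for the shortest one matching a reference function on sample inputs.
# The op set and encoding are invented for illustration.
from itertools import product

BINARY = {"add", "sub", "max"}
OPS = {
    "dup": lambda s: s + [s[-1]],
    "neg": lambda s: s[:-1] + [-s[-1]],
    "add": lambda s: s[:-2] + [s[-2] + s[-1]],
    "sub": lambda s: s[:-2] + [s[-2] - s[-1]],
    "max": lambda s: s[:-2] + [max(s[-2], s[-1])],
}

def run(program, x):
    stack = [x]
    for op in program:
        if op in BINARY and len(stack) < 2:
            return None  # invalid program: stack underflow
        stack = OPS[op](stack)
    return stack[0] if len(stack) == 1 else None

def superoptimize(reference, tests, max_len=3):
    # Search shortest programs first; accept the first one that passes every test.
    for length in range(1, max_len + 1):
        for program in product(OPS, repeat=length):
            if all(run(program, x) == reference(x) for x in tests):
                return program
    return None

# Finds ('dup', 'neg', 'max'), i.e. abs(x) = max(x, -x), without branches.
print(superoptimize(abs, tests=[-7, -3, -1, 0, 2, 5]))
```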
Design for test is still a major part of (high volume) chip design. Anything that can't be tested in seconds on wafer is basically worthless for mass production.
In that case, tho, no-one’s saying “let’s be sloppy with production and make up for it in the QA” (which really used to be a US car industry strategy until the Japanese wiped the floor with them); the process is as good as it reasonably can be, there are just physical limits. Chip manufacturers spend vast amounts on reducing the error rate.
I went from ";" to fully working C++ production grade code with good test coverage. To my estimation, 90% of the work was done in an agent prompt. It was a side project, now it will be my job. The process is like they described.
For the core parts you cannot let go of the reins. You have to keep steering it. You have to take short breaks and reload the code into the agent as it starts acting confused. But once you get the hang of it, things that would take you months of convincing yourself and picking yourself back up to continue become a day's work.
Once you have a decent amount of work done, you can have the agent read your code as documentation and use it to develop further.
1. The flaw in this premise is the assumption that the process is bad. Aside from the countless anecdotal reports about how AI and agents are improving productivity, there are actual studies showing 25-55% boosts. Yes, RCTs at larger sizes than the METR one that keeps getting bandied about: https://news.ycombinator.com/item?id=44860577 and many more on Google Scholar: https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&as_ylo...
2. Quality control is key to good processes as well. Code review is literally a best practice in the software industry. Especially in BigTech and high-performing organizations. That is, even for humans, including those that could be considered the cream of the industry, code review is a standard step of the delivery process.
3. People have posted their GitHub profiles and projects (including on this very forum) to show how AI is working out for them. Browse through some of them and see how much "endless broken nonsense" you find. And if that seems unscientific, well go back to point 1.
I picked one of the studies in the search (!) you linked. First of all, it's a bullshit debate tactic to try to overwhelm your opponents with vague studies -- a search is complete bullshit because it puts the onus on the other person to discredit the gargantuan amount of data you've flooded them with. Many of the studies in that search don't have anything to do with programming at all.
So right off the bat, I don't trust you. Anyway, I picked one study from the search to give you the benefit of the doubt. It compared leetcode in the browser to LLM generation. This tells us absolutely nothing about real world development.
What made the METR paper interesting was that they studied real projects, in the real world. We all know LLMs can solve well bounded problems in their data sets.
As for 3 I've seen a lot of broken nonsense. Let me know when someone vibe codes up a new mobile operating system or a competitor to KDE and Gnome lol
> Sure, it’ll produce endless broken nonsense, but as long as someone is checking, it’s fine
Well you've just described an EKF on a noisy sensor.
I do not think anybody is going to get that reference. https://xkcd.com/2501/
Ha yes, I tend to hang around the robotics crowd a bit too much.
I mean the average person surely knows at least what a Jacobian is, right? Right? /s
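For anyone who didn't follow the reference: a Kalman filter (the EKF is the nonlinear version) takes a stream of noisy sensor readings plus an estimate of how noisy they are, and checks each reading against its current belief to produce a good estimate anyway. A minimal 1-D sketch with made-up noise values:

```python
# Minimal 1-D Kalman filter: noisy readings in, good estimate out.
# Constant-signal model and noise variances are made up for illustration.
import random

def kalman_1d(measurements, meas_var=4.0, process_var=1e-4):
    estimate, error = measurements[0], meas_var
    for z in measurements[1:]:
        error += process_var               # predict: uncertainty grows slightly
        gain = error / (error + meas_var)  # how much to trust the new reading
        estimate += gain * (z - estimate)  # update the belief
        error *= 1 - gain                  # uncertainty shrinks after the update
    return estimate

true_value = 10.0
readings = [true_value + random.gauss(0, 2) for _ in range(500)]
print(kalman_1d(readings))  # lands very close to 10 despite the noisy sensor
```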
> … good results from a bad process …
Even if the process weren’t technically bad, it would still be shit. Doing code review with a human has meaning in that the human will probably learn something, and it’s an investment in the future. Baby-sitting an LLM, however, is utterly meaningless.
> I don’t know of any case where it has really worked out.
Supermarket vegetables.
Are you saying that supermarket vegetables/produce are good?
Quite a bit of it, like tomatoes and strawberries, is just crap. Form over substance. Nice color and zero flavor. Selected for delivery/shelf-life/appearance rather than actually being any good.
> Form over substance. Nice color and zero flavor. Selected for delivery/shelf-life/appearance rather actually being any good.
From an economics POV, that's the correct test.
I was also considering the way the US food standards allows a lot of insect parts in the products, but wasn't sure how to phrase it.
> I was also considering the way the US food standards allows a lot of insect parts in the products, but wasn't sure how to phrase it.
I don't know how the US compares to other countries in terms of "insects per pound" standards, but having some level of insects is going to be inevitable.
For example, how could you guarantee that your wheat, pre-milling, has zero insects in it, or that your honey has no bee parts in it (best you can do is strain it, then anything that gets through the straining process will be on your toast).
> From an economics POV, that's the correct test
Maybe we could stop filtering everything through this bullshit economics race to the bottom then
You can campaign for your government to set any legal minimum for quality that you want, but it's essentially nonsensical to expect people not to optimise for cheapest given whatever those constraints are.
Apple, Oracle or Nvidia did not get there following your way of thinking
A race to the bottom leaves you like Boeing or Intel.
Late stage capitalism is not a must.
Counterpoint: Juicero.
Your list of winners are optimising for what the market cares about, exactly like the supermarkets (who are also mostly winners) are optimising for what the market cares about. For most people, for food specifically, that means "cheap". Unavoidably, because most people have less money than they'd like. Fancy food is rare treat for many.
Apple software currently has a reputation for buggy UI; Oracle has a reputation for being litigious; that just leaves Nvidia, who are printing money selling shovels in two successive gold rushes, which is fine for a business and means my investment is way up, but also means for high-end graphics cards consumer prices are WTF and availability is LOL.
As you said, Oracle makes money by hiring more lawyers every day, not because each day it hires cheaper employees.
See, the only way is not a race to the bottom, like all the late-stage capitalists claim.
What is your point?
> This idea that you can get good results from a bad process
This idea is called "evolution"...
> as long as you have good quality control
...and its QA is death at every single level of the system: cell, organism, species, and ecosystem. You must consider that those devs or companies with not-good-enough QA will end up dead (from a business perspective).
I look forward to software which takes several million years to produce and tends to die of Software Cancer.
Like, evolution is not _good_ at ‘designing’ things.
Evolution is extremely inefficient at producing good designs. Given enough time it'll explore more, because it's driven randomly, but most mutations either don't help, or downright hurt an organism's survival.
So we're software evolvers now, not engineers?
Sounds like a stupid path forward to me
We’ve always been software evolvers. Ideas that have been around for decades, such as “a codebase is a garden”, are even more relevant now.
Iterating on a design is not the same as throwing out crap until it works.
That depends.
If the engineer doing the implementation is top-shelf, you can get very good results from a “flawed” process (in quotes, because it’s not actually “bad.” It’s just a process that depends on the engineer being that particular one).
Silicon Valley is obsessed with process over people, manifesting “magical thinking” that a “perfect” process eliminates the need for good people.
I have found the truth to be in-between. I worked for a company that had overwhelming Process, but that process depended on good people, so it hired top graduates, and invested huge amounts of money and time into training and retention.
Said a little more crass/simply: A people hire A people. B people hire C people.
The first is phenomenal until someone makes a mistake and brings in a manager or supervisor from the C category that talks the talk but doesn't walk the walk.
If you accidentally end up in one that turns out to be the latter, it's maddening trying to get anything accomplished if the task involves anyone else.
Hire slow, fire fast.
Steve Jobs said this decades ago.
It's the content that matters, not the process.
> Imagine that your boss came to you, the tech lead of a small team, and said “okay, instead of having five competent people, your team will now have 25 complete idiots. We expect that their random flailing will sometimes produce stuff that kinda works, and it will be your job to review it all.”
This is exactly the point of corporate Agile. Management believes that the locus of competence in an organization should reside within management. Depending on competent programmers is thus a risk, and what is sought is a process that can simulate a highly competent programmer's output with a gang of mediocre programmers. Kinda like the myth that you can build one good speaker out of many crappy ones, or the principle of RAID which is to use many cheap, failure-prone drives to provide the reliability guarantees of one expensive, reliable drive (which also kinda doesn't work if the drives came from the same lot and are prone to fail at about the same time). Every team could use some sort of process, but usually if you want to retain good people, this takes the form of "disciplines regarding branching, merging, code review/approval, testing, CI, etc." Something as stifling as Scrum risks scaring your good people away, or driving them nuts.
So yes, people do expect it to work, all the time. And with AI in the mix, it now gains very nice "labor is more fungible with capital" properties. We're going to see some very nice, spectacular failures in the next few years as a result, a veritable Perseid meteor shower of critical systems going boom; and those companies that wish to remain going concerns will call in human programmers to clean up the mess (but probably lowball on pay and/or try to get away with outsourcing to places with dirt-cheap COL). But it'll still be a rough few years for us while management in many orgs gets high off their own farts.
Bayesian reasoning would lead me to think that with a high failure rate, even if QA is 99.9% amazing and the AI dev work is only 80% good, you still end up with more poor features and bugs getting through (99.9% * 80% = 79.92% good output) than if both stages are merely mediocre (90% * 90% = 81% good output).
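The same point as a two-line calculation, treating final quality as the product of the stage yields (a deliberate simplification):

```python
# Final yield modeled as the product of stage yields (a simplification).
sloppy_dev_great_qa = 0.80 * 0.999      # 0.7992 good output
mediocre_dev_mediocre_qa = 0.90 * 0.90  # 0.81 good output
print(sloppy_dev_great_qa < mediocre_dev_mediocre_qa)  # True
```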
Asking AI to stay true to my requested parameters is hard: THEY ALL DRIFT AWAY, RANDOMLY.
When working on nftables syntax highlighters, I have 230 tokens, 2,500 states, and 50,000+ state transitions.
Some firm guidelines given to AI agents are:
1. Fully-deterministic LL(1) full syntax tree.
2. No use of Vim 'syntax keyword' statement
3. Use long group names in snake_case whose naming starts with 'nft_' prefix (avoids collision with other Vim namespaces)
4. For parts of the group names, use only nftables/src/parser_bison.y semantic action and token names as-is.
5. For each traversal down the syntax tree, append that non-terminal node name from parser_bison.y to its group names before using it.
With those 5 "simple" user-requested requirements, all AI agents drift away from at least each of the rules at seemingly random interval.
At the moment, it is dubious to even trust the bit-length of each packet field.
Never mind their inability to construct a simple Vimscript.
I use AI agents mainly as documentation.
On the bright side, they are getting good at breaking down 'rule', 'chain_block stmt', and 'map_stmt_expr' (that '.' period we see when chaining header expressions together); just use the quoted words and paste in one of your nft rule statements.
I'm a dev (who knows nothing about nftables) and I don't understand your instructions. I think maybe you could improve your situation by formulating them as "when creating new group names, use the semantic actions and token names as defined in parser_bison.y", i.e. with conditions so that the correct rules apply to the correct situations. Because your rules are written as if they apply to every line of code, it might unnecessarily try to incorporate context even when it's not applicable.
AI-generated code can be useful in the early stages of a project, but it raises concerns in mature ones. Recently, a 280kloc+ Postgres parser was merged into Multigres (https://github.com/multigres/multigres/pull/109) with no public code review. In open source, this is worrying. Many people rely on these projects for learning and reference. Without proper review, AI-generated code weakens their value as teaching tools and, more importantly, the trust in pulling them in as dependencies. Code review isn’t just about bugs; it’s how contributors learn, understand design choices, and build shared knowledge. The issue isn’t speed of building software (although corporations may seem to disagree), but how knowledge is passed on.
Edit: Reference to the time it took to open the PR: https://www.linkedin.com/posts/sougou_the-largest-multigres-...
I oversaw this work, and I'm open to feedback on how things can be improved. There are some factors that make this particular situation different:
This was an LLM assisted translation of the C parser from Postgres, not something from the ground up.
For work of this magnitude, you cannot review line by line. The only thing we could do was to establish a process to ensure correctness.
We did control the process carefully. It was a daily toil. This is why it took two months.
We've ported most of the tests from Postgres. Enough to be confident that it works correctly.
Also, we are in the early stages for Multigres. We intend to do more bulk copies and bulk translations like this from other projects, especially Vitess. We'll incorporate any possible improvements here.
The author is working on a blog post explaining the entire process and its pitfalls. Please be on the lookout.
I was personally amazed at how much we could achieve using LLM. Of course, this wouldn't have been possible without a certain level of skill. This person exceeds all expectations listed here: https://github.com/multigres/multigres/discussions/78.
"We intend to do more bulk copies and bulk translations like this from other projects"
Supabase’s playbook is to replicate existing products and open source projects, release them under open source, and monetize the adoption. They’ve repeated this approach across multiple offerings. With AI, the replication process becomes even faster, though it risks producing low-quality imitations that alienate the broader community and people will resent the stealing of their work.
An alternative viewpoint which we are pretty open about in our docs:
> our technological choices are quite different; everything we use is open source; and wherever possible, we use and support existing tools rather than developing from scratch.
I understand that people get frustrated when there is any commercial interest associated to open source. But someone needs to fund open source efforts and we’re doing our best here. Some (perhaps non-obvious) examples
* we employ the maintainers of PostgREST, contributing directly to the project - not some private fork
* we employ maintainers of Postgres, contributing patches directly
* we have purchased and open sourced private companies, like OrioleDB, open sourced the code and made the patents freely available to everyone
* we picked up unmaintained tools and maintained them at our own cost, like the Auth server, which we upstreamed until the previous owner/company stopped accepting contributions
* we worked with open source tools/standards like TUS to contribute missing functionality like Postgres support and advisory locks
* we have sponsored adjacent open source initiatives like adding types to Elixir
* we have given equity to framework creators, which I’m certain will be the largest donation that these creators have (and will) ever receive for their open source work
* and yes, we employ the maintainers of Vitess to create a similar offering for the Postgres ecosystem under the same Apache2 license
And I'm not sure about their ability to release said code under a different license either.
Postgres has a pretty permissive license, but that doesn't mean you can just ignore it.
My process is basically
1. Give it requirements
2. Tell it to ask me clarifying questions
3. When no more questions, ask it to explain the requirements back to me in a formal PRD
4. I criticize it
5. Tell it to come up with 2 alternative high level designs
6. I pick one and criticize it
7. Tell it to come up with 2 alternative detailed TODO lists
8. I pick one and criticize it
9. Tell it to come up with 2 alternative implementations of one of the TODOs
10. I pick one and criticize it
11. Back to 9
I usually “snapshot” outputs along the way and return to them to reduce useless context.
This is what produces the most decent results for me, which aren’t spectacular but at the very least can be a baseline for my own implementation.
It’s very time consuming and 80% of the time I end up wondering if it would’ve been quicker to just do it all by myself right from the start.
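If you ever want to script the first few steps of this instead of driving them by hand, here's a rough TypeScript sketch. askModel is a hypothetical wrapper around whatever chat API you use, the prompts are simplified, and the criticize callback stands in for the human review turns:

  // Hypothetical chat wrapper: takes the running message history, returns the reply.
  type Msg = { role: "user" | "assistant"; content: string };

  async function askModel(history: Msg[]): Promise<string> {
    throw new Error("plug in your own LLM client here");
  }

  async function planFeature(requirements: string, criticize: (text: string) => Promise<string>) {
    const history: Msg[] = [];
    const ask = async (prompt: string) => {
      history.push({ role: "user", content: prompt });
      const reply = await askModel(history);
      history.push({ role: "assistant", content: reply });
      return reply;
    };

    // Steps 1-3: requirements -> clarifying questions -> formal PRD.
    await ask(`Requirements:\n${requirements}\nAsk me clarifying questions before proposing anything.`);
    // (in a real run you'd answer its questions here, via more ask() turns)
    const prd = await ask("No more questions. Restate the requirements back to me as a formal PRD.");
    await ask(await criticize(prd)); // step 4: my criticism goes back in as the next user turn

    // Steps 5-6: two alternative high-level designs, pick one and criticize it.
    const designs = await ask("Propose 2 alternative high-level designs for this PRD.");
    await ask(await criticize(designs));

    // "Snapshot" the context here so later TODO/implementation rounds can start clean.
    return { prd, designs, snapshot: [...history] };
  }

Not claiming this beats driving it in a chat window, but it does make the "snapshot and return to it" step explicit.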
Definitely sounds slower than doing it yourself.
I am falling into a pattern of treating AI coding like a drunk mid-level dev: "I saw those few paragraphs of notes you wrote up on a napkin, and stayed up late Saturday night while drinking and spat out this implementation. you like?"
So I can say to myself, "No, do not like. But the overall gist at least started in the right direction, so I can revise it from here and still be faster than had I done it myself on Monday morning."
The most useful thing I've found is "I need to do X, show me 3 different popular libraries that do it". I've really limited my AI use to "Lady's Illustrated Primer" especially after some bad experiences with AI code from devs who should know better.
I don't even frame my requests conversationally. They usually read like brief demands, sometimes just comma delimited technologies followed by a goal. Works fine for me, but I also never prompt anything that I don't already understand how to do myself. Keeps the cart behind the horse.
I've started putting in my system prompt "keep answers brief and don't talk in the first/second person". Gets rid of all the annoying sycophancy and stops it from going on for ten paragraphs. I can ask for more details when I need it.
> It’s very time consuming and 80% of the time I end up wondering if it would’ve been quicker to just do it all by myself right from the start.
Yes, this. Every time I read these sorts of step-by-step guides to getting the best results with coding agents, it all just sounds like boatloads of work that erase the efficiency margins AI is supposed to bring in the first place. And anecdotally, I've found that to be true in practice as well.
Not to say that AI isn't useful. But I think knowing when and where AI will be useful is a skill in and of itself.
At least for me, I can have five of these processes running at once. I can also use Deepresearch for generating the designs with a survey of literature. I can use NotebookLM to analyse the designs. And I use Sourcery, CodeRabbit, Codex and Codescene together to do code review.
It took me a long time to get there with custom cli tools and browser userscripts. The out of the box tooling is very limited unless you are willing to pay big £££s for Devin or Blitzy.
paid big bucks for devin... still was limited and not very good
I think I’m working at lower levels, but usually my flow is:
- I start to build or refactor the code structure by myself creating the basic interfaces or skip to the next step when they already exist. I’ll use LLMs as autocomplete here.
- I write down the requirements and tell which files are the entry point for the changes.
- I do not tell the agent my final objective, only one step that gets me closer to it, and one at a time.
- I watch carefully and interrupt the agent as soon as I see something going wrong. At this point I either start over if my requirement assumptions were wrong or just correct the course of action of the agent if it was wrong.
Most of the issues I had in the past came from writing down a broad objective that requires too many steps at the beginning. Agents cannot judge correctly when they have finished something.
I have a similar, though not as detailed, process. I do the same as you up to the PRD, then give it the PRD and tell it the high level architecture, and ask it to implement components how I want them.
It's still time-consuming, and it probably would be faster for me to do it myself, but I can't be bothered manually writing lines of code any more. I maybe should switch to writing code with the LLM function by function, though.
That's like a chef saying they can't be bothered to cook...
Doesn't a head chef in a restaurant context delegate a lot to other people for the cooking? And of course they use many tools also. And pre-prepared parts also, often from external suppliers.
If the final dish is excellent, does it matter if the chef made it themselves, or if they instructed the sous-chef how to make it?
> but I can't be bothered manually writing lines of code any more. I maybe should switch to writing code with the LLM function by function, though.
Maybe you should consider a change of career :/
Why?
Yeah, sounds like it would have been far quicker to use the AI to give you a general overview of approaches/libraries/language features/etc, and then done the work yourself.
If you are good at code review, you will also be good at not using AI agents.
This. Having had the pleasure to review the work and fix the bugs of agent jockeys (generally capable developers that fell in love with Claude Code et al), I'm rather sceptical. The code often looks as if they were on mushrooms. They cannot reason about it whatsoever, like they weren't even involved, when I know they weren't completely hands off.
I really believe there are people out there that produce good code with these things, but all I've seen so far has been tragic.
Luckily, I've witnessed a few snap out of it and care again. Literally looks to me as if they had a substance abuse problem for a couple of months.
If you take a critical look at what comes out of contemporary agentic workflows, I think the conclusion must be that it's not there. So yeah, if you're a good reviewer, you would perhaps come to that conclusion much sooner.
Yeah.
I'm not even anti-LLM. Little things—research, "write TS types for this object", search my codebase, go figure out exactly what line in the Django rest framework is causing this weird behavior—are working great and saving me an hour here and 15m there.
It's really obvious when people lean on it, because they don't act like a beginner (trying things that might not work) or like someone just being sloppy (where there's a logic to it but no attention to detail); it's like they copy-pasted from Stack Overflow search results at random, and there are pieces that might belong but the totality is incoherent.
Yeah, it gets so wild, downright psychedelic.
I'm definitely not anti LLM, I use them all the time. Just not for generating code. I give it a go every couple of months, probably wasting more time on it than I should. I don't think I've felt any real advancements since last year around this time, and this agentic hype seems to be a bit ahead of its time, to put it mildly. But I absolutely get a lot of value out of them.
It's also nice for handling some little tasks that otherwise would have been just annoying enough for me to not do it. Small 5 or 6 line functions that I would have had to fiddle with for far longer to get right.
Totally. I have to watch it like a hawk though. I've had it get the logic backwards on trivial "copy paste this from the docs" functions.
Oh yeah. Just this week I had it hallucinate random non-existent API calls and completely bungle fairly straightforward arguments.
> I really believe there are people out there that produce good code with these things, but all I've seen so far has been tragic
I don't believe this at all, because all I've seen so far is tragic
I would need to see any evidence of good quality work coming from AI assisted devs before I start to entertain the idea myself. So far all I see is low effort low quality code that the dev themself is unable to reason about
>The code often looks as if they were on mushrooms. They cannot reason about it whatsoever
Interesting comparison, why not weed or alcohol?
Never tried psychedelic mushrooms, so that part is speculation. But no amount of weed or alcohol could get me even close to writing code that unhinged.
If I'm good at code review, I want to get better at it.
> If you’re a nitpicky code reviewer, I think you will struggle to use AI tooling effectively. [...] Likewise, if you’re a rubber-stamp code reviewer, you’re probably going to put too much trust in the AI tooling.
So in other words, if you are good at code review you are also good enough at writing code that you will be better off writing it yourself for projects you will be responsible for maintaining long term. That covers almost all of them if you work at a sane place or actually care about your personal projects. Writing code is not a chore for you, and you can write it as fluently and quickly as anything else.
Your time "using AI" is much better spent filling in the blanks when you're unfamiliar with a certain tool or need to discover a new one. In short, you just need a few google searches a day... just like it ever was.
I will admit that modern LLMs have made life easier here. AI summaries on search engines have indeed improved to the point where I almost always get my answer and I no longer get hung up meat-parsing poorly written docs or get nerd-sniped pondering irrelevant information.
Code review is part of the job, but one of the least enjoyable parts. Developers like _writing_, and that gives the most job satisfaction. AI tools are helpful, but they inherently increase the amount of code we have to review, and with more scrutiny than code from my colleagues, because of how unpredictable - yet convincing - they can be. Why did we create tools that do the fun part and increase the non-fun part? Where are the "code-review" agents at?
Maybe I'm weird but I don't actually enjoy the act of _writing_ code. I enjoy problem solving and creating something. I enjoy decomposing systems and putting them back together in a better state, but actually manually typing out code isn't something I enjoy.
When I use an LLM to code I feel like I can go from idea to something I can work with in much less time than I would have normally.
Our codebase is more type-safe, better documented, and it's much easier to refactor messy code into the intended architecture.
Maybe I just have lower expectations of what these things can do but I don't expect it to problem solve. I expect it to be decent at gathering relevant context for me, at taking existing patterns and re-applying them to a different situation, and at letting me talk shit to it while I figure out what actually needs to be done.
I especially expect it to allow me to be lazy and not have to manually type out all of that code across different files when it can just generate it all in a few seconds and I can review each change as it happens.
If natural language was an efficient way to write software we would have done it already. Fact is that it's faster to write class X { etc }; than it is to write "create a class named X with behavior etc". If you want to think and solve problems yourself, it doesn't make sense to then increase your workload by putting your thoughts in natural language, which will be more verbose.
I therefore think it makes the most sense to just feed it requirements and issues, and telling it to provide a solution.
Also, unless you're starting a new project or a big feature with a lot of boilerplate, in my experience it's almost never necessary to create a lot of files with a lot of text in them at once.
the time spent literally typing code into an editor is never the bottleneck in any competently-run project
if the act of writing code is something you consider a burden rather than a joy then my friend you are in the wrong profession
Been doing it for ten years and I still love the profession as much as, if not more than, when I started, but the joy of software development for me was always in seeing my idea come to life, in exploring all the clever ways people had solved so many problems, in trying to become as good at the craft as they were, and in sharing those solutions and ideas with like-minded peers.
I care deeply about the code quality that goes into the projects I work on because I end up having to maintain it, review it, or fix it when it goes south, and honestly it just feels wrong to me to see bad code.
But literally typing out the characters that make up the code? I couldn't care less. I've done that already. I can do it in my sleep; there's no challenge.
At this stage in my career I'm looking for ways to take the experience I have and upskill my teams using it.
I'd be crazy not to try and leverage LLMs as much as possible. That includes spending the time to write good CLAUDE.md files, set up custom agents that work with our codebase and patterns, it also includes taking the time to explain the why behind those choices to the team so they understand them, calling out bad PRs that "work" but are AI slop and teaching them how to get better results out of these things.
Idk man the profession is pretty big and creating software is still just as fun as when I was doing it character by character in notepad. I just don't care to type more than I need to when I can focus on problem solving and building.
While reading your comment it occurred to me that people code at different abstraction levels. I do systems programming in golang and rust, and I - like you - enjoy seeing my ideas come to life, not so much the typing. The final result (how performant, how correct, how elegant and not complex it is) is in my control instead of an agent's; I enjoy having the creativity in the implementation. I can imagine other flavors of the profession working at higher abstraction layers and using more frameworks, where the result is dependent on how the framework executes. At that point, you might just want to connect all the frameworks/systems and get the feature out the door. And it is definitely a spectrum of languages, tools, and frameworks that are more or less involved.
The creativity in implementation (e.g. an indexed array that, when it grows too large, gets reformatted into a less performant hashmap) is what I imagine being lost, and it's what brings people satisfaction. Pulling that off in a clean and not complex way... well, there is a certain reward in that. I don't have any long term proof but I also hypothesize it helps with maintainability.
But I also see your point, sometimes I need a tool that does a function and I don't care to write it and giving the agent requirements and having it implemented is enough. But typically these tools are used and discarded.
Agreed 100% and I enjoy that part too, I just don't really see how that is being taken away.
The way I see it these tools allow me to use my actual brainpower mostly on those problems. Because all the rote work can now be workably augmented away, I can choose which problems to actually focus on "by hand" as it were. I'd never give those problems to an LLM to solve. I might however ask it to search the web for papers or articles or what have you that have solved similar problems and go from there.
If someone is giving that up then I'd question why they're doing that. No one is forcing them to.
It's the problem solving itself that is fun, the "layer" that it's in doesn't really make a difference to me.
But it's not exactly rewarding to add one more CRUD endpoint. It's a shit-ton of typing in multiple layers.
An LLM can do it in two minutes while I fetch coffee, then I can proceed to add the complex bits (if there are any)
Code is the ultimate fact checker, where what you write is what gets done. Specs are well written wishes.
Yes, hence tests, linters, and actually verifying the changes it is making. You can't trust anything the LLM writes. It will hallucinate or misunderstand something at some point if your task gets long. But that's not the point, I'm not asking it to solve things for me.
I'm using it to get faster at building my own understanding of the problem, what needs to get done, and then just executing the rote steps I've already figured out.
Sometimes I get lucky and the feature is well defined enough just from the context gathering step that the implementation is literally just me hitting the enter key as I read the edits it wants to make.
Sometimes I have to interrupt it and guide it a bit more as it works.
Sometimes I realize I misunderstood something as it's thinking about what it needs to do.
One-shotting or asking the LLM to think for you is the worst way to use them.
> Where are the "code-review" agents at?
OpenAI's Codex Cloud just added a new feature for code review, and their new GPT-5-Codex model has been specifically trained for code review: https://openai.com/index/introducing-upgrades-to-codex/
Gemini and Claude both have code review features that work via GitHub Actions: https://developers.google.com/gemini-code-assist/docs/review... and https://docs.claude.com/en/docs/claude-code/github-actions
GitHub have their own version of this pattern too: https://github.blog/changelog/2025-04-04-copilot-code-review...
There are also a whole lot of dedicated code review startups like https://coderabbit.ai/ and https://www.greptile.com/ and https://www.qodo.ai/products/qodo-merge/
you can't use a system with the exact same hallucination problem to check the work of another one just like it. Snake oil
Yes you can, and this shouldn't be surprising.
You can take the output of an LLM and feed it into another LLM and ask it to fact-check. Not surprisingly, these LLMs have a high false negative rate, meaning they won't always catch the error. (I think you agree with me so far.) However, the probabilities of these LLM failures are independent of each other, so long as you don't share context. The converse is that the LLM has a less-than-we-would-like probability of detecting a hallucination, but if it does, then verification of that fact is reliable in future invocations.
Combine this together: you can ask an LLM to do X, for any X, then take the output and feed it into some number of validation instances to look for hallucinations, bad logic, poor understanding, whatever. What you get back on the first pass will look like a flip of the coin -- one agent claims it is hallucination, the other agent says it is correct; both give reasons. But feed those reasons into follow-up verifier prompts, and repeat. You will find that non-hallucination responses tend to persist, while hallucinations are weeded out. The stable point is the truth.
This works. I have workflows that make use of this, so I can attest to its effectiveness. The new-ish Claude Code sub-agent capabilities and slash commands are excellent for doing this, btw.
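For what it's worth, the fan-out-and-verify part is easy to sketch. A rough TypeScript version, assuming a hypothetical askModel() that sends a single prompt in a fresh context (no shared history between verifiers):

  // Hypothetical single-shot LLM call: fresh context every time.
  async function askModel(prompt: string): Promise<string> {
    throw new Error("plug in your own LLM client here");
  }

  // Send one answer to several independent verifiers, then feed their critiques
  // back into the next round. Claims that survive every round tend to be real;
  // hallucinations tend to get weeded out.
  async function crossCheck(task: string, answer: string, verifiers = 3, rounds = 2): Promise<string[]> {
    let critiques: string[] = [];
    for (let round = 0; round < rounds; round++) {
      const prompt = [
        `Task: ${task}`,
        `Proposed answer: ${answer}`,
        critiques.length ? `Critiques from the previous round:\n${critiques.join("\n---\n")}` : "",
        "Independently check the answer for hallucinations, bad logic, or misunderstandings. Explain your reasoning.",
      ].join("\n\n");
      // Each verifier runs in its own fresh context, so failures stay independent.
      critiques = await Promise.all(Array.from({ length: verifiers }, () => askModel(prompt)));
    }
    return critiques;
  }

Sub-agents, as mentioned, essentially give you the fresh contexts for free.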
I don't think it's that simple.
Fundamentally, unit tests are using the same system to write your invariants twice, it just so happens that they're different enough that failure in one tends to reveal a bug in another.
You can't reasonably state this won't be the case with tools built for code review until the failure cases are examined.
Furthermore a simple way to help get around this is by writing code with one product while reviewing the code with another.
> unit tests are using the same system to write your invariants twice
For unit tests, the parts of the system that are the same are not under test, while the parts that are different are under test.
The problem with using AI to review AI is that what you're checking is the same as what you're checking it with. Checking the output of one LLM with another brand probably helps, but they may also have a lot of similarities, so it's not clear how much.
> The problem with using AI to review AI is that what you're checking is the same as what you're checking it with.
This isn't true. Every instantiation of the LLM is different. Oversimplifying a little, but hallucination emerges when low-probability next words are selected. True explanations, on the other hand, act as attractors in state-space. Once stumbled upon, they are consistently preserved.
So run a bunch of LLM instances in parallel with the same prompt. The built-in randomness & temperature settings will ensure you get many different answers, some quite crazy. Evaluate them in new LLM instances with fresh context. In just 1-2 iterations you will hone in on state-space attractors, which are chains of reasoning well supported by the training set.
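Concretely, that's just best-of-n sampling plus a fresh-context judge. A tiny sketch (TypeScript; sample() and judge() are hypothetical wrappers, with sample() run at non-zero temperature):

  // Hypothetical wrappers around your LLM client.
  // sample(): one completion at non-zero temperature; judge(): a fresh-context evaluation call.
  declare function sample(prompt: string): Promise<string>;
  declare function judge(prompt: string): Promise<string>;

  async function bestOfN(prompt: string, n = 10): Promise<string> {
    // Same prompt, independent instances: temperature gives you a spread of answers, some crazy.
    const candidates = await Promise.all(Array.from({ length: n }, () => sample(prompt)));

    // A fresh context evaluates the spread; well-supported reasoning tends to win out.
    const verdict = await judge(
      `Question:\n${prompt}\n\nCandidate answers:\n` +
        candidates.map((c, i) => `#${i + 1}: ${c}`).join("\n---\n") +
        "\n\nReply with only the number of the best-supported answer."
    );
    const idx = parseInt(verdict, 10) - 1;
    return candidates[idx] ?? candidates[0];
  }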
What if you use a different AI model? Sometimes just a different seed generates a different result. I notice there is a benefit to seeing and contrasting the different answers. The improvement is gradual, it’s not a binary.
You don't need to use a different model, generally. In my experience a fresh context window is all you need, the vast majority of the time.
The system is the human writing the code.
It's snake oil that works surprisingly well.
Weirdly, you can not only do this, it somehow does actually catch some of its own mistakes.
Not all of the mistakes (they generally still have a performance ceiling below human experts, though even this disclaimer is still simplifying), but this kind of self-critique is basically what makes the early "reasoning" models one up over simple chat models: for the first n :END: tokens, replace them with "wait" and watch it attempt other solutions and usually pick something better.
the "pick something usually better" sounds a lot like "and then draw the rest of the f*** owl"
Turned out that for a lot of things (not all things, Transformers have a lot of weaknesses), using a neural network to score an output is, if not "fine", then at least "ok".
Generating 10 options with mediocre mean and some standard deviation, and then evaluating which is best, is much easier than deliberative reasoning to just get one thing right in the first place more often.
> Code review is part of the job, but one of the least enjoyable parts. Developers like _writing_ and that gives the most job satisfaction.
At least for me, what gives the most satisfaction (even though this kind of satisfaction happens very rarely) is when I discover some very elegant structure behind whatever has to be implemented, one that changes the whole way you've thought about programming (or often even about life) for decades.
> what gives the most satisfaction (even though this kind of satisfaction happens very rarely) is when I discover some very elegant structure behind whatever has to be implemented, one that changes the whole way you've thought about programming
A number of years ago, I wrote a caching/lookup library that is probably some of the favorite code I've ever created.
After the initial configuration, the use was elegant and there was really no reason not to use it if you needed to query anything that could be cached on the server side. Super easy to wrap just about any code with it as long as the response is serializable.
CachingCore.Instance.Get(key, cacheDuration, () => { /* expensive lookup code here */ });
Under the hood, it would check the preferred caching solution (e.g., Redis/Memcache/etc), followed by less preferred options if the preferred wasn't available, followed by the expensive lookup if it wasn't found anywhere. Defaulted to in-memory if nothing else was available.
If the data was returned from cache, it would then compare the expiration to the specified duration... If it was getting close to various configurable tolerances, it would start a new lookup in the background and update the cache (some of our lookups could take several minutes*, others just a handful of seconds).
The hardest part was making sure that we didn't cause a thundering-herd type problem by looking up the same thing multiple times... in-memory cache flags indicating lookups in progress, so we could hold up other requests if it fell through and then let them know once it's available. While not the absolute worst case scenario, you might end up making the expensive lookups once from each of the servers that use it if the shared cache isn't available.
* most of these have a separate service running on a schedule to pre-cache the data, but things have a backup with this method.
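The original is clearly .NET, but the core idea is compact enough to sketch. A loose, in-memory-only TypeScript re-imagining (not the actual library) showing the get-or-compute call, the refresh-before-expiry behavior, and the in-flight map that guards against the thundering herd; the 10% refresh window stands in for the "configurable tolerances":

  type Entry<T> = { value: T; expiresAt: number };

  class CachingCore {
    private store = new Map<string, Entry<unknown>>();
    private inFlight = new Map<string, Promise<unknown>>(); // lookups already in progress

    async get<T>(key: string, ttlMs: number, lookup: () => Promise<T>): Promise<T> {
      const now = Date.now();
      const hit = this.store.get(key) as Entry<T> | undefined;

      if (hit && hit.expiresAt > now) {
        // Close to expiry: serve the cached value but refresh it in the background.
        if (hit.expiresAt - now < ttlMs * 0.1) void this.refresh(key, ttlMs, lookup);
        return hit.value;
      }
      return this.refresh(key, ttlMs, lookup);
    }

    private refresh<T>(key: string, ttlMs: number, lookup: () => Promise<T>): Promise<T> {
      // Thundering-herd guard: if a lookup for this key is already running, join it.
      const pending = this.inFlight.get(key) as Promise<T> | undefined;
      if (pending) return pending;

      const p = lookup()
        .then((value) => {
          this.store.set(key, { value, expiresAt: Date.now() + ttlMs });
          return value;
        })
        .finally(() => this.inFlight.delete(key));
      this.inFlight.set(key, p);
      return p;
    }
  }

A real version would also need error handling on the background refresh and the fallback chain across Redis/Memcache/in-memory backends described above.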
Junior developers love writing code.
Senior developers love removing code.
Code review is probably my favorite part of the job, when there isn’t a deadline bearing down on me for my own tasks.
So I don’t really agree with your framing. Code reviews are very fun.
> Developers like _writing_ and that gives the most job satisfaction.
Is it possible that this is just the majority, and there are plenty of folks who dislike actually starting from nothing and the endless iteration to make something that works, as opposed to having some sort of good/bad baseline to just improve upon?
I’ve seen plenty of people that are okay with picking up a codebase someone else wrote and working with the patterns and architecture in there BUT when it comes to them either needing to create new mechanisms in it or create an entirely new project/repo it’s like they hit a wall - part of it probably being friction, part not being familiar with it, as well as other reasons.
> Why did we create tools that do the fun part and increase the non-fun part? Where are the "code-review" agents at?
Presumably because that’s where the most perceived productivity gain is in. As for code review, there’s CodeRabbit, I think GitLab has their thing (Duo) and more options are popping up. Conceptually, there’s nothing preventing you from feeding a Git diff into RooCode and letting it review stuff, alongside reading whatever surrounding files it needs.
> I’ve seen plenty of people that are okay with picking up a codebase someone else wrote and working with the patterns and architecture in there BUT when it comes to them either needing to create new mechanisms in it or create an entirely new project/repo it’s like they hit a wall - part of it probably being friction, part not being familiar with it, as well as other reasons.
For me, it's exactly the opposite:
I love to build things from "nothing" (if I had the possibility, I would even like to write my own kernel that is written in a novel programming language developed by me :-) ).
On the other hand, when I pick up someone else's codebase, I nearly always (if it was not written by some insanely smart programmer) immediately find it badly written. In nearly all cases I tend to be right in my judgements (my boss agrees), but I am very sensitive to bad code, and often ask myself how the programmer who wrote the original code has not yet committed seppuku, considering how much of a shame the code is.
Thus: you can in my opinion only enjoy picking up a codebase someone else wrote if you are incredibly tolerant of bad code.
> Developers like _writing_ and that gives the most job satisfaction.
Not me. I enjoy figuring out the requirements, the high-level design, and the clever approach that will yield high performance, or reuse of existing libraries, or whatever it is that will make it an elegant solution.
Once I've figured all that out, the actual process of writing code is a total slog. Tracking variables, remembering syntax, trying to think through every edge case, avoiding off-by-one errors. I've gone from being an architect (fun) to slapping bricks together with mortar (boring).
I'm infinitely happier if all that can be done for me, everything is broken out into testable units, the code looks plausibly correct, and the unit tests for each function cover all cases and are demonstrably correct.
You don't really know if the system design you've architected in your mind is any good though, do you, until you've actually tried coding it. Discovering all the little edge cases at that point is hard work ("a total slog") because it's where you find out where the flaws in your thinking were, and how your beautifully imagined abstractions fall down.
Then after going back and forth between thinking about it and trying to build it a few times, after a while you discover the real solution.
Or at least that's how it's worked for me for a few decades, everyone might be different.
He did not say he does not iterate! And it is much easier and faster to do when an LLM is involved.
> Tracking variables, remembering syntax,
That's why you have short functions, so you don't have to track that many variables. And use symbol completion (a standard feature in many editors).
> trying to think through every edge case, avoiding off-by-one errors.
That is designing, not coding. Sometimes I think of an edge case, but I'm already on a task that I'd like to finish, so I just add a TODO comment. Then, at least before I submit the PR, I ripgrep the project for this keyword and others.
Sometimes the best design is done by doing. The tradeoffs become clearer when you have to actually code the solution (too much abstraction, too verbose, unwieldy,...) instead of relying on your mind (everything seems simpler)
You always have variables. Not just at the function level, but at the class level, object level, etc. And it's not about symbol completion, it's about remembering all the obscure differences in built-in function names and which does what.
And no, off-by-one errors and edge cases are firmly part of coding, once you're writing code inside of a function. Edge cases are not "todos", they're correctly handling all possible states.
> Sometimes the best design is done by doing.
I mean, sure go ahead and prototype, rewrite, etc. That doesn't change anything. You can have the AI do that for you too, and then you can re-evaluate and re-design. The point is, I want to be doing that evaluation and re-designing. Not typing all the code and keeping track of loop states and variable conditions and index variables and exit conditions. That stuff is boring as hell, and I've written more than enough to last a lifetime already.
> You always have variables. Not just at the function level, but at the class level, object level, etc.
Aka the scope. And the namespace of whatever you want to access. Which is a design problem.
> And it's not about symbol completion, it's about remembering all the obscure differences in built-in function names and which does what
That's what references are for. And some IDEs bring it right alongside the editor. If not, you have online and offline references. You remember them through usage and semantics.
> And no, off-by-one errors and edge cases are firmly part of coding, once you're writing code inside of a function.
It's not. You define the happy path and error cases as part of the specs. But specs are generally lacking in precision (full of ambiguities) and only care about the essential complexity. The accidental complexity comes as part of the platform and is also part of the design. Pushing those kinds of errors into coding is shortsightedness.
> Not typing all the code and keeping track of loop states and variable conditions and index variables and exit conditions. That stuff is boring as hell, and I've written more than enough to last a lifetime already
That is like saying "Not typing all the text and keeping track of words and punctuation and paragraphs and signatures. English is boring as hell and I've written more than enough..."
If you don't like formality, say so. I've never had anyone describe coding as you did. No one thinks about that stuff that closely. It's like a guitar player complaining about which strings to strike with a finger. Or a race driver complaining about the angle of the steering wheel and having to press the brake.
I don't know what to tell you. Sure there are tools like IDE's to help, but it doesn't help with everything.
The simple fact is that I find there's very little creative satisfaction to be found in writing most functions. Once you've done it 10,000 times, it's not exactly fun anymore, I mean unless you're working on some cutting-edge algorithm which is not what we're doing 99.9% of the time.
The creative part comes in at the higher level of design, where it's no longer rote. This is the whole reason why people move up into architecture roles, designing systems and libraries and APIs instead of writing lines of code.
The analogies with guitar players or race car drivers or writers are flawed, because nothing they do is rote. Every note matters, every turn, every phrase. They're about creativity and/or split-second decision making.
But when you're writing code, that's just not the case. For anything that's a 10- or 20- line function, there isn't usually much creativity there, 99.99% of the time. You're just translating an idea into code in a straightforward way.
So when you say, "Developers like _writing_ and that gives the most job satisfaction." That's just not true. Especially not for many experienced devs. Developers like thinking, in my experience. They like designing, the creative part. Not the writing part. The writing is just the means to the end.
Because the goal of "AI" is not to have fun, it's to solve problems and increase productivity. I have fun programming too, but you have to realize the world isn't optimizing to make things more fun.
I hear you, but without any enjoyment in the process, quality and productivity go down the drain real fast.
The Ironies of Automation paper is something I mention a lot, the core thesis is that making humans review / rubber stamp automation reduces their work quality. People just aren't wired to do boring stuff well.
Enjoyment and rewards are the drivers for motivation.
Yeah, though in my experience, reward alone is not enough.
> you have to realize the world isn't optimizing to make things more fun.
Serious question: why not?
IMO it should be.
If "progress" is making us all more miserable, then what's the point? Shouldn't progress make us happier?
It feels like the endgame of AI is that the masses slave away for the profit of a few tech overlords.
As a human, I do agree that it would be better and we should strive for that. However I don't think humans are really driving all this progress/innovation. It is just evolution keeping doing what it's always done, it is ruthless and does not care at all whether we are having fun or not.
If you have a paid Copilot membership and a Github project you can request a code review from Copilot. And it doesn't do a terrible job, actually.
I will second this. I believe code review agents and search summaries are the way forward for coding with LLMs.
The ability to ignore AI and focus on solving the problems has little to do with "fun". If anything it leaves a human-auditable trail to review later and hold accountable devs who have gone off the rails and routinely ignored the sometimes genuinely good advice that comes out of AI.
If humans don't have to helicopter over developers, that's a much bigger productivity boost than letting AI take the wheel. This is a nuance missed by almost everyone who doesn't write code or care about its quality.
The title of this article seems way too glib.
Code review isn't the same as design review, nor are these the only type of things (coding and design) that someone may be trying to use AI for.
If you are going to use AI, and catch its mistakes, then you need to have expertise in whatever it is you are using the AI for. Even if we limit the discussion just to coding, then being a good code reviewer isn't enough - you'd need to have skill at whatever you are asking the AI to do. One of the valuable things AI can do is help you code using languages and frameworks you are not familiar with, which then of course means you are not going to be competent to review the output, other than in the most generic fashion.
A bit off topic, but it's weird to me to see the term "coding" make a comeback in this AI/LLM era. I guess it is useful as a way to describe what AI is good at - coding vs more general software development - but how many companies nowadays hire coders as opposed to software developers (I know it used to be a thing with some big companies like IBM)? Rather than compartmentalized roles, the direction nowadays seems to be expecting developers to be able to do everything from business analysis and helping develop requirements, to architecture/design and then full-stack development, and subsequent production support.
My official title is "Software Engineer", in the last five years I have..
1. Stood up and managed my own Kubernetes clusters for my team
2. Docker, just so so much Docker
3. Developed CI/CD pipelines
4. Done more integration and integration testing than I care to think about
5. Written god knows how many requirements and produced an endless stream of diagrams and graphs for systems engineering teams
6. Done a bunch of random IT crap because our infrastructure team can't be bothered
7. Wrote some code once in a while
Seems so.
> Using AI agents correctly is a process of reviewing code. [...]
> Why is that? Large language models are good at producing a lot of code, but they don’t yet have the depth of judgement of a competent software engineer. Left unsupervised, they will spend a lot of time committing to bad design decisions.
Obviously you want to make course corrections sooner than later. Same as I would do with less experienced devs, talk through the high level operations, then the design/composition. Reviewing a large volume of unguided code is like waiting for 100k tokens to be written only to correct the premise in the first 100 and start over.
I love doing code review for colleagues since I know that it bolsters our shared knowledge, experience and standards. Code review for an external, stubborn, uncooperative AI? No thanks, that sounds like burnout.
No. The failure conditions of "AI agents" are not even close to classical human mistakes (the only ones where code review has anything more than an infinitesimal chance of catching something). There is absolutely no skill transfer, and it is a poor excuse anyway, since review was never going to catch anything in the first place.
If I am good at the most boring part of my job, I get to do that and only that from here on out?
No thank you.
Also, the article is wrong, it's always better for a bug not to be in there in the first place, than to be there and possibly be missed.
code review can be almost as much effort as writing the code, especially when the code is not up to the expectations of the reviewer. this is fine, because you want two people (the original author, and the reviewer) on the code.
when reviewing AI code, not only will the effort needed by the reviewer increase, you also lose the second person (the author) looking at the code, because AI can't do that. it can produce code but not reason about or reflect on it like humans can.
I think that I review code much differently than the author. When I'm reviewing code, my assumption is that the person writing it has already verified that it works. I am primarily looking for readability and code smells.
In an ideal world I'd probably be looking more at the actual logic of the code. However, everywhere I've worked, it's a full time job just desperately trying to fight ballooning complexity from people who prioritize quick turnaround over quality code.
I think I'm good at code review, but we've all seen parts of the codebase where it's written by one teammate with specific domain knowledge and your option is to approve something you don't fully understand or to learn the background necessary to understand it.
In my experience, not having to learn the background is the biggest time saver provided by LLM coding (e.g. not having to read through API docs or confirm details of a file format or understand some algorithm). So in a way I feel like there is a fundamental tension.
> bikeshedding function names
... Function names compose much of the API.
The API is the structure of the codebase.
This isn't some triviality you can throw aside as unimportant, it is the shape that the code has today, and limits and controls what it will have tomorrow.
It's how you make things intuitive, and it is equally how you ensure people follow a correct flow and don't trap themselves into a security bug.
I really disagree with this too, especially given the article's next line:
> ...You’ll be forever tweaking individual lines of code, asking for a .reduce instead of a .map.filter, bikeshedding function names, and so on. At the same time, you’ll miss the opportunity to guide the AI away from architectural dead ends.
I think a good review will often do both, and understand that code happens at the line level and also the structural level. It implies a philosophy of coding that I have seen be incredibly destructive firsthand — committing a bunch of shit that no one on a team understands and no one knows how to reuse.
> for a .reduce instead of a .map.filter...
This is distinctly not the API, but an implementation detail.
Personally, I can ask colleagues to change function names, rework hierarchy, etc. But I'd leave this exact example be, as it does not make any material difference - regardless of my personal preference.
Agreed. A program is made of names, these names are of the utmost importance. For understanding, and also for searchability.
I do a lot of code reviews, and one of the main things I ask for, after bug fixes, is renaming things for readers to understand at first read unambiguously and to match the various conventions we use throughout the codebase.
Ex: new dev wrote "updateFoo()" for a method converting a domain thing "foo" from its type in layer "a" to its type in layer "b", so I asked him to use "convertFoo_aToB()" instead.
This blog gets posted often but the content is usually lousy. Lots of specious assertions about the nature of software development that really give off a "I totally have this figured out" vibe. I can't help but feel that anyone who feels so about this young industry that changes so rapidly and is so badly performed at so many places, is yet to summit Mt. Stupid.
I think I'd actually have a use for an AI that could receive my empty public APIs (such as a C++ header file) as an input and produce a first rough implementation. Maybe this exists already, I don't know because I haven't done any serious vibe coding.
As long as you're reinventing the wheel (implementing some common pattern because you don't want to pull in an entire dependency), that kind of AI generation works quite well. Especially if you also have the AI generate tests for its code, so you can force it to iterate on itself while it gets things wrong the first couple of tries. It's slow and resource intensive, but it'll generate something mostly complete most of the time.
I'm not sure if you're saving any time there, though. Perhaps if you give an LLM task before ending the work day so it can churn away for a while unattended, it may generate a decent implementation. There's a good chance you need to throw out the work too; you can't rely on it, but it can be a nice bonus if you're lucky.
I've found that this only works on expensive models with large context windows and limited API calls, though. The amount of energy wasted on shit code that gets reverted must be tremendous.
I hope the AI industry makes good on its promise that it'll solve the whole inefficiency problem, because the way things are going now, the industry isn't sustainable.
Yeah it can, though rough is definitely the word.
And sometimes the LLM just won't go in the direction you want, but that's OK - you just have to go write those bits of code.
It can be surprising where it works and where it doesn't.
Just go with those first suggestions though and the code will end up rough.
The leading models have been very good at this for over a year now. Try copying one of your existing C++ header files into GPT-5 or Claude 4 or Gemini 2.5 as an experiment and see how they do.
They certainly invent new functions whenever I try.
You can do this already, the most useful things to help with this are either writing tests or having it write tests and telling it how to compile and see error messages so you can let it loop.
I am good at code review, sure, but I don't like doing it. It's about as strong an engineering technique as coding at a whiteboard. I know I'm at a tiny fraction of my potential without debugging tools, and for that reason code review on GitHub is usually a waste of my time. I'll just write code, thanks, and I'll move the needle on quality by developing. As a reviewer I'll scan for smells, but I assume that you too would be most effective if I left you to make and clean up your own messes, so long as they aren't egregious.
You review the code and find it broken. Then what?
- Rewrite it yourself?
- Tell AI to generate it again? — will lead to worse code than the first.
- Make the already long prompt (like 6 pages) even longer and hope it works this time?
In my experience you tell the AI how to fix it and get better code based on your instructions.
> In my view, the best code review is structural. It brings in context from parts of the codebase that the diff didn’t mention.
That may be true for AI code.
But it would be pretty terrible for human-written code to bring this up after the code is written, wasting hours/days effort for lack of a little up-front communication on design.
AI makes routine code generation cheap -- only seconds/minutes and cents are being wasted -- but you essentially still need that design session.
What does this mean for juniors? A few companies are now introducing expectations that all engineers will use coding agents including juniors and grads. If they haven't yet learnt what good looks like through experience how are they going to review code produced by AI agents?
I have received a few LLM produced PRs from peers from adjacent teams, in good faith but not familiar with the project, and they increasingly infuriate me. They were all garbage, but there’s a great asymmetry: it costs my peers nothing to generate them, it costs me precious time to refute them. And what can I do really? Saying “it’s irreparable garbage because the syntax might be right but it’s conceptually nonsense” but that’s not the most constructive take.
You could use an LLM to give you advice on how to present that take in a more constructive manner.
Partially sarcastic but I do personally use LLMs to guide my communication in very limited cases:
1. It's purely business related, and
2. I'm feeling too emotionally invested (or more likely, royally pissed off) and don't trust myself to write in a professional manner, and
3. I genuinely want the message to sound cold, corporate, and unemotional
Number 3 would fit you here. These people are not being respectful to you in presenting code for review that respects your time. Why should you take the time to write back personally?
It should be noted that this accounts for maybe 5% of my business communications, and I'm careful not to let that number grow.
> Why should you take the time to write back personally?
Because it's 3 sentences, if you want to be way more polite and verbose than necessary.
"I will close PRs if they appear to be largely LLM-generated. I am always happy to review something with care and attention if it shows the same qualities. Thanks!"
The idea is to get your coworkers to stop sending you AI slop, send them AI slop in retaliation?
> if they appear to be largely LLM-generated
And then what if the person denies it?
Run it up the chain
They're either lying about using AI, or they're incompetent enough to produce AI quality (read: Garbage) code, either way the company should let them go
That would be the nuclear option, but if you have any rapport at all with the person or team in question, you could also just pull them aside, ask if they are under unusual pressure to show progress, and make it clear that you get it, and you want to help, but that you can't if you're drowning in AI slop code review. I imagine it's a junior doing this, in which case it's in their career interest to stop and start acting like a professional. I've had seniors tell me more or less the same thing, in the pre-llm era: "slow down and get it right." Sometimes you just need to hear that.
This feels like a culture problem. I have seen higher-quality PRs as people use AI to review their work before pushing it. This means less silly typos and obvious small bugs.
Nothing makes me hate AI more than getting a slop PR written by one of the agent-wielding coworkers, with comments describing what the next line does for every line. More often than not it looks plausible but turns out to be completely unusable upon closer inspection. Incredibly disrespectful to do this to your coworkers imo; it's proper that you call it out.
If you had a colleague who was consistently writing complete shit you would raise it with your manager. This situation isn't all that different - the only complicating factor is that they're not on your team.
If it's only happened a few times you might first try setting some ground rules for contributions. Really common for innersource repos to have a CONTRIBUTING.md file or similar. Add a checkbox to your PR template that the dev has to check to indicate they've read it, then wait and see.
Or: As long as you have a good editor with endless time, a thousand monkeys with typewriters will reproduce Shakespeare.
Username checks out.
As someone that basically does code reviews for a living, last thing I want to do is code review agents. I want to reduce how much review I’m doing, not hand hold some ai agents.
Getting AI to produce a bunch of code and then you having to filter through it all is a massive waste of time. The focus should be on getting AI to produce better code in the first place (e.g., using detailed plans), rather than on the volume of code you can produce...
I have only had real advantages with AI for helping me plan changes, and for it helping me to review my code. Getting it to write code for me has been somewhat helpful, but only for simple tedious changes or first drafts. But it is definitely not something I want to leverage by getting AI to produce more and more code that I then have to filter through and review. No thank you. I feel like this is really the wrong focus for implementing AI into your workflows.
Can I ask what language you are using AI for? There is also a difference in AI performance across languages.
TypeScript with NextJS. I've also used AI tools with C and Zig, and AI is much better at writing TS. But even though TS works much better, it's still not that great. This is largely because the quality of the code that AI writes is not good enough, so then I have to spend a decent chunk of time fixing it.
Everyone I know trying to use AI in large codebases has had similar experiences. AI is not good enough at following the rules of your codebase yet (i.e., following structure, code style, library usage, re-using code, refactoring, etc...). This makes it far less useful for writing code changes and additions. It can still be useful for small changes, or for writing first drafts of functions/classes/interfaces, but for more meaningful changes it often fails.
That is why I believe that right now, if you want to maintain a large codebase, and maintain a high bar for quality, AI tools are just not good enough at writing most code for you yet. The solution to this is not to get AI to write even more code for you to review and throw out and iterate upon in a frustrating cycle. Instead, I believe it is to notice where AI is helpful and focus on those use-cases, and avoid it when it is not.
That said, AI labs seem to be focusing a lot of effort on improving AI for coding right now, so I expect a lot of progress will be made on these issues in the next few years.
I’ve been thinking about this a lot lately.
What’s the best way to review AI code?
I wish I had a local, GitHub PR review-like experience where I can leave comments for the agent.
Sorry, this is not the profession of programming, and people in the near future will be looking back at this era and laughing their asses off.
But not me, because I will never touch an agentic tool. And believe me, I have a big smile on my face. Life is good! =D
Two observations:
- If I had to iterate as much with a Jr dev as I do with CC on not-highly-difficult stuff ("of course, I'll just do X!" then X doesn't work, then "of course, the answer is Y!" then Y doesn't work, etc.), I probably would have fired them by now or just said "never mind, I'll do it myself".
- On the other hand a Jr dev will (hopefully) learn as they go, get better each time, so a month from now they're not making the same mistakes. An LLM can't learn so until there's a new model they keep making the same mistakes (yes, within a session they can learn -- if the session doesn't get too long -- but not across sessions). Also, the Jr dev can test their solution (which may require more than just running unit tests) and iterate on it so that they only come to me when it works and/or they're stuck. Just yesterday, on a rather simple matter, I wasted so much time telling the LLM "that didn't work, try again".
Unfortunately code review is like the least fun part of software engineering.