The quality of AI-assisted software depends on unit of work management

blog.nilenso.com

165 points by mogambo1 2 days ago

And the value of AI as pushed to us by these companies is in doing larger units of work.

But... reviewing code is harder than writing code. Expressing how I want something to be done in natural language is incredibly hard.

So over time I'm spending a lot of energy in those things, and only getting it 80% right.

Not to mention I'm constantly in this highly suspicious mode, trying to pierce through the veil of my own prompt and the code generated, because it's the edge cases that make work hard.

The end result is exhaustion. There is no recharge. Plans are front-loaded, and then you switch to auditing mode.

Whereas with code you front-load a good amount of design, but you can make changes as you go, and since you know your own code the effort to make those is much lower.

strogonoff 2 days ago

I somewhat dread code reviews. In order to properly evaluate the solution, you must first know what is the right solution to begin with. You must analyse the problem and arrive at it yourself. This is the brunt of the work, and yet you are not allowed the pleasant part of it: knowing that your work is shipped and feeling pride in it.
Working with LLM-generated code is mostly the same. The more sophisticated the autocomplete, the more mental overhead spent on understanding its output. There is an advantage: you are spared having to argue with a possibly defensive peer about what you believe is best. There is also a disadvantage: you do not feel like you are helping someone grow, and instead you are an unpaid (you are not paid for that in particular) contributor to a product by Microsoft (or similar) intended generally in longer term to push you and/or your peers out of the job. Additionally, there is no single mind that you can build rapport with and learn to understand the approaches and vibes of after a while.
nicce 2 days ago

> Expressing how I want something to be done in natural language is incredibly hard
Surprise, surprise… that is why programming languages were created.
- dragonwriter 2 days ago
  
  Programming languages don’t solve that problem, since someone still has to explain what needs to be done in natural language unless the end customer is also the programmer.
  Programming languages were created because of the different problem of “it’s very hard to get computers to understand natural language even if you know how to express what you want in it”.
  
  derefr 2 days ago
  
  > someone still has to explain what needs to be done in natural language unless the end customer is also the programmer
  You're conflating requirement analysis with design. The customer only needs to describe the problem — a set of constraints on what comprises a valid solution. The software engineer is then free to design and develop a particular valid solution (and show it to the customer, which will result in more feedback, which will feed back into design, and so on.)
  Formalizing this split is the premise behind Domain Driven Design (DDD): you can sit with the customer and pin down a problem description (= set of design requirements) together with them, expressed in exactly the natural language the customer-as-problem-domain-expert uses, without any reference to any particular potential design's solution-space domain. You can then turn around and reuse that set of natural-language statements as the skeleton of a test suite, that "enforces" the customer's expectations upon any potential design you create.
  It's a lot like an artist sitting with a customer who's commissioning them, with the artist sketching something the customer is describing; and then the artist going away to actually illustrate/paint/craft/design/etc the thing, constrained by that sketch.
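One way to picture reusing customer-language statements as the skeleton of a test suite is the sketch below. The invoice domain, the `Invoice` class, and the three requirement sentences are all hypothetical and chosen purely for illustration; none of them come from the comment above.

```python
from dataclasses import dataclass, field


@dataclass
class Invoice:
    """Hypothetical design under test; any other valid design could replace it."""
    lines: list = field(default_factory=list)
    paid: bool = False

    def add_line(self, description: str, amount: float) -> None:
        if self.paid:
            raise ValueError("a paid invoice cannot be changed")
        self.lines.append((description, amount))

    def total(self) -> float:
        return sum(amount for _, amount in self.lines)


def check_total_is_sum_of_lines() -> bool:
    invoice = Invoice()
    invoice.add_line("design", 100.0)
    invoice.add_line("build", 250.0)
    return invoice.total() == 350.0


def check_empty_invoice_totals_zero() -> bool:
    return Invoice().total() == 0


def check_paid_invoice_is_frozen() -> bool:
    invoice = Invoice(paid=True)
    try:
        invoice.add_line("extra", 1.0)
    except ValueError:
        return True
    return False


# Each key is a sentence in the customer's own words; each value enforces it.
REQUIREMENTS = {
    "The total of an invoice is the sum of its lines": check_total_is_sum_of_lines,
    "An invoice with no lines totals zero": check_empty_invoice_totals_zero,
    "A paid invoice cannot be changed": check_paid_invoice_is_frozen,
}


def run_requirements() -> dict:
    """Return {requirement sentence: passed?} for the current design."""
    return {sentence: check() for sentence, check in REQUIREMENTS.items()}


if __name__ == "__main__":
    for sentence, passed in run_requirements().items():
        print("PASS" if passed else "FAIL", "-", sentence)
```

The requirement sentences never mention lists, tuples, or exceptions, so the `Invoice` implementation could be swapped for a different design without rewording them.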
  
  nicce 2 days ago
  
  I don't see the difference? Natural language simply was lacking the level of precision. We see natural language words and symbols everywhere in programming languages. Natural language was fine-tuned with improved accuracy. And optimised to reduce the amount of needed words. The difference in the precision between natural languages and programming languages was simply just too big, so you needed "an interpreter" to translate the level of precision from customer to computer.
  
  dragonwriter 2 days ago
  
  Programming languages solve the problem that machine code is hard for humans to work with for large problems and natural language (even when the meaning is perfectly clear) is very difficult (to the point of complete intractability with the level of knowledge and available hardware at the time programming languages were originally invented) to parse mechanically, a necessary first step in translation into machine code a computer could run.
  Any problem with the difficulty of clearly expressing things in natural language (a real thing for which there have long been solutions between humans that are different than programming languages) was a problem that was technically unreachable at the user->machine interface because of that more fundamental problem for most of the history of computing (it's arguable that LLMs are at the level where now it is potentially an issue with computers, but it took decades of having programming languages to be able to get the technical capacity to even experience the problem, it is not the problem programming languages address.)
  
  soraminazuki 2 days ago
  
  I don't buy that. In the same spirit, are you telling me that you solve math problems without using any mathematical notation because it doesn't offer any improvement over natural language?
  
  dragonwriter a day ago
  
  > In the same spirit, are you telling me that you solve math problems without using any mathematical notation because it doesn't offer any improvement over natural language?
  That’s not in the same spirit, and I didn’t say either that programming languages don’t solve any problem, or that there aren’t tools that do solve the problem it was suggested that programming language solve where it exists, which isn't (except maybe since the advent of LLMs) between human and computer.
  
  pxc 2 days ago
  
  I had a logic textbook in college where all the proofs were just prose with very little notation. It was okay. But it felt like it took a lot more effort to process than denser, more specialized notation!
  
  SkyBelow 2 days ago
  
  >Programming languages don’t solve that problem
  The idea is that a programmer will work with natural language dynamically to create enough of an understanding to create a formalized specification within the programming language (eh, we technically need to include the entire ecosystem needed to run the code, but rarely is that an issue). Often it won't be perfect, but then a demo can occur where natural language can then comment on what was missed, which can then better be captured in the formal language. This continues until we get close enough that further effort to refine isn't justified.
  This isn't the only reason programming languages were created, much as rarely anything has a single reason it was created. They were created as a way to get computers to do what we wanted without needing to modify the hardware, and then higher level languages were created so we can write more specification faster while losing a bit of the lowest level exactness.
  Also why we have many different programming languages, as they try to fit into different levels of trade off between the different reasons for a programming language to exist. Well, one of the reasons we have many different programming languages....
  
  dragonwriter a day ago
  
  > The idea is that a programmer will work with natural language dynamically to create enough of an understanding to create a formalized specification within the programming language
  I don't disagree that that is some people’s idea of the idealized process (though there are others, too.)
  I disagree, though, that it is the purpose for which programming languages, as a category of tools, were invented. In fact, there were a whole bunch of tools for solving the problem of clarity of communicating expectations of system behavior between humans in unconstrained natural language invented, and that it was until fairly recently (well into the period of dominance of Agile, or at least “Agile”, methods) the usual expectation in non-trivial programming projects that some combination of those would be used to solve that problem, and then the programmers, understanding the intent through those tools, would use programming languages to solve the problem that languages that are ideal for communicating between humans are not ideal for source languages for practical compilers or interpreters in computers.
- bdangubic 2 days ago
  
  COBOL would like a word :)
palmotea 2 days ago

That makes very clear these tools are not meant to serve you, you are meant to serve these tools.
- fragmede 3 hours ago
  
  What is this, that episode of The Twilight Zone called "To Serve Man"?
- ryoshu 2 days ago
  
  Reverse centaur.
falcor84 a day ago

> But... reviewing code is harder than writing code.
For me it's very clearly the opposite. I wonder if it's a professional background, or personality or neurotype issue or something, but when I'm faced with a problem I often get somewhat paralyzed, spending a long time thinking about a good approach, but when I delegate to someone or ask an AI to tackle it, even if I get back something half-shitty, it removes my paralysis and then reviewing and improving what they did is significantly easier (or at least more motivating) for me than doing it from scratch. And even if they ended up giving me something that's entirely in the wrong direction, and I need to throw out all of it, I still usually feel that it removes that paralysis and gives me a better understanding of the problem space.
I wonder if this difference between people accounts for a significant difference between those who benefit from AI and those who don't.
rjh29 a day ago

Don't spend energy on clever prompting and reviewing. Spend your energy on knowing when to use the LLM at all.
If you know what to write but it's tedious, LLMs are great, they'll just fill all that in for you. Anything more complex or open that needs checking could be quicker to just think through and write yourself. You can still use LLMs at the edges, e.g. what API methods should I use for this?
- RealityVoid a day ago
  
  I like using LLMs for exploration of a new codebase, seeing what other options already exist for solving a problem that I might not be aware of, and planning a bit how to change stuff, since it sometimes sees things I did not think of.
  Another thing it's good at is writing tests - a lot of times I won't bother, but with AI I can do it cheaply. And it's very good at keeping documentation and a codebase consistent, believe it or not. If I change a part that is mentioned somewhere else and it has it in the context window, it will update both parts, whereas I might omit it.

danparsonson 2 days ago

I seem to be in a minority but I find user stories or features to be really awkward and unnatural units of work for building software. Sure these things help to define the expected result but they shouldn't directly drive the development process. Imagine building a house that way - you don't build the living room, then the kitchen, then the bathroom etc.; you build floors, walls, the roof... The 'features' or use cases for the building arise out of the combination of different elements that were put into it, and usually right near the end of the build. The same is true for basically anything else that we build or create - if you're making a sculpture, do you finish working on one leg first before you move onto some other part?

Features are vertical slices through the software cake, but the cake is actually made out of horizontal layers. Creating a bunch of servings of cake and then trying to stick them together just results in a fragile mess that's difficult to work with and easy to break.

galbar 2 days ago

My take on this is that, from a SW development POV, user stories are not the right unit of work. Instead, I treat user stories as "Epics". Stakeholders can track that Epic for progress, as the unit of work from their POV.
Internally, the team splits Epics into "Spikes" (figure out what to do) and "Tasks" (executing on the things we need to do).
- Spikes are scoped to up to 3 days and their outcome is usually a doc and either a follow-up Spike or Tasks to execute.
- Tasks must be as small and unambiguous as possible (within reason).
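As a rough illustration, the Epic / Spike / Task split above could be modelled as plain data. This is only a sketch: the class names mirror the terms in the comment, but the fields, the `progress` method, and the example story are assumptions made for illustration.

```python
from dataclasses import dataclass, field

MAX_SPIKE_DAYS = 3  # "Spikes are scoped to up to 3 days"


@dataclass
class Task:
    title: str
    done: bool = False


@dataclass
class Spike:
    question: str
    days: int
    outcome_doc: str = ""  # a Spike usually ends in a doc

    def __post_init__(self) -> None:
        if not 1 <= self.days <= MAX_SPIKE_DAYS:
            raise ValueError(f"a Spike must fit in 1 to {MAX_SPIKE_DAYS} days")


@dataclass
class Epic:
    user_story: str
    spikes: list = field(default_factory=list)
    tasks: list = field(default_factory=list)

    def progress(self) -> float:
        """Fraction of Tasks done; an assumed stand-in for what stakeholders track."""
        if not self.tasks:
            return 0.0
        return sum(task.done for task in self.tasks) / len(self.tasks)


if __name__ == "__main__":
    epic = Epic("As a customer I can download my invoice as a PDF")
    epic.spikes.append(Spike("Which PDF library fits our stack?", days=2))
    epic.tasks += [Task("Add PDF renderer", done=True), Task("Add download endpoint")]
    print(epic.progress())  # 0.5
```

Stakeholders only ever look at the `Epic`; the Spikes and Tasks underneath it are free to follow whatever technical seams the team finds.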
- danparsonson a day ago
  
  Well OK, but that's just the same thing with extra steps.
  The point I'm making is that there are large cross-cutting concerns that shouldn't be sliced up by feature, but rather that the features should arise out of the composition of the cross-cutting concerns.
  A single user story commonly requires the holy trinity of UI, 'business logic' and data storage, and my contention is that it's more efficient and robust to build those three layers out holistically rather than try to assemble them from the fragments required for all the user stories.
  
  galbar a day ago
  
  Our job as SWEs is to convert the vertical slice of functionality into something that fits well and robustly in the various technical layers that need to be touched.
  The process that I outlined above explicitly creates the space for SWEs to consider the wider implications of the required changes in the architecture and make them robust.
  Part of that is understanding what the roadmap is and what is the product vision in the mid term, so that the tech layer can be built, step by step, towards what fits that vision.
rdw 2 days ago

In the area I live, houses are often built one complete room at a time, over many years. They start out as a single-room shack, then the owner builds extensions as they have children or money. Often, they build a porch, and then decades later wall up the porch and turn it into a room of some kind.
I kind of like this analogy because it does help us reason about the situation. The one-room shack is basically an MVP; a hacky result that just does one thing and probably poorly, but it is useful enough to justify its own existence. The giant mansion built from detailed architectural plans seems like a waterfall process for an enterprise application, doesn't it?
There are many advantages to building a house one room at a time. You get something to house you quickly and cheaply. When you build each extension, you have a very good idea of how it will be most useful because you know your needs well. You are more capable of taking advantage of sales (my neighbor collects construction overstock for free/cheap and starts building something once he has enough quantity to do so). It's more "agile". The resulting houses are beautiful in their own bespoke ways. They last a long time, too.
The downsides are that the services and structure are a hodgepodge of eras and necessity. If you're competent, you can avoid problems in your own work, but you may have to build on shoddy "legacy" work. You spend more of your time in a state of construction, and it may be infeasible to undertake a whole-house project like running ethernet to every room.
It's all tradeoffs. I think it does in many cases make sense to build a house in this way, and it likewise makes sense to build software this way. It depends on the situation.
- danparsonson a day ago
  
  That's interesting, thanks; as you point out, an important aspect of software, as with building architecture, is that it tends to evolve over time, and that's where the waterfall approach falls down. However, in software at least, it's not actually necessary to exchange one extreme for another - waterfall or agile; one can take benefits from both approaches, blending foresight and forward planning with modular construction.
  > There are many advantages to building a house one room at a time.... It's more "agile"... The downsides are that the services and structure are a hodgepodge of eras and necessity... it may be infeasible to undertake a whole-house project like running ethernet to every room.
  The thing is that that end result is actually the opposite of agile, being as it is more difficult to change, and this speaks more broadly to a perennial problem in software development - requirements change regularly, even deep into project development. Planning a design up front does not mean just fixing a specific set of requirements in stone, but also anticipating the things that may change, even without knowing the specifics of what those changes will be, and designing in a flexible way that can accommodate a broad spectrum of possible futures. A car manufacturer might conceivably branch into making other types of vehicles, plant equipment, and similar things like that, whereas they are unlikely to ever get into catering (and if they did, that would likely be a separate business and a new piece of software). Responding only to the requirements in front of you right now tends to make the design more rigid rather than less, and almost inevitably leads to big balls of mud and big-bang rewrite projects that fail as often as they succeed. Keep in mind also that most software spends most of its life in maintenance mode, so optimising for the delivery stage is short-sighted at best.
  Designing software in the way I'm describing is not easy, but it's definitely possible, and in my opinion offers a lot more value than it might first appear.
  
  igouy 2 hours ago
  
  > A car manufacturer might conceivably branch into making other types of vehicles, plant equipment …
  Do you consider mattresses "similar things like that" ?
magicalist 2 days ago

I don't think the analogies are that helpful. You are absolutely building the kitchen, the living room, the bathroom as you put in the foundation, framing, the plumbing, the electrical work, etc, or you will get either an unusable house or a very expensive gutting and remodel. User stories are figured out for the house long before any construction begins. House building is well on the waterfall side of development anyways, but at this point, what insight is this analogy yielding?
Call user stories a grouping of work, sure, but I guess I don't see why the distinction matters. Most possible "units of work" will have many cases worth breaking down further regardless of choice of unit.
- danparsonson a day ago
  
  > You are absolutely building the kitchen, the living room, the bathroom as you put in the foundation, framing, the plumbing, the electrical work, etc
  OK, the kitchen and the bathroom are special cases due to the plumbing and so on, so my analogy breaks down a bit there, but the rest of the rooms? They don't crystallise their function until the occupants move in. Maybe I as a builder might assume a certain room will be the living room, and designate another as the master bedroom, but until the owner puts in furniture, they're more or less just empty boxes with power and windows. Most of the 'features' or user stories of the house arise at the end out of the combination of built elements and final decoration. Software is actually a lot like this - take a trivial example user story of creating an invoice. What do you need for that? UI, data storage, comms maybe, some domain logic. Each of those things can (and in my opinion should) be expressed independently, but if you're developing that user story as a single deliverable, then you need to create bits of all of them. And that's what I'm saying - we're building things that naturally decompose into 'horizontal' layers (units of infrastructure), but doing it in 'vertical' slices (user stories), which, to torture my analogy even further, results in uneven flooring, mismatched walls, and untidy structures that get more and more difficult to change over time as requirements change and more builders try to add other slices of building that were not anticipated.
  If you want to sleep in the lounge from now on instead of the bedroom, and entertain your guests in the back bedroom, you can just move the furniture. That's a lot more agile in my opinion than the software we commonly build.
nemomarx 2 days ago

They make more sense if you think about adding to an existing house. "I want to open up this wall to serve X function" kinda works
they're very well adapted to legacy enterprise work
- gizmo686 2 days ago
  
  Even there, "open up this wall" is not a unit of work. You need to:
  * Evaluate what, if any, structural implications removing the wall has
  * Tear down the existing wall
  * Redo any plumbing, ductwork, wiring, etc that was hiding in the wall
  * Remediate structural concerns from removing the wall
  * Redo the flooring
  * Repair and repaint any damage done to remaining drywall
  If this is part of a larger renovation, you will likely schedule work so the above tasks happen at the same time as other similar tasks.
  E.g. A meaningful unit of work might be "electrical roughing", which would include both moving wires that were previously in the wall, and running a new circuit to the garage for a car charger. No user story covers those two tasks, but the nature of renovating a house means that it makes sense to do them together.
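The scheduling idea in this comment, batching tasks by trade instead of by user story, can be sketched in a few lines. The task list and trade labels below are made up for illustration and only loosely follow the wall and car-charger example.

```python
from collections import defaultdict

# (user story, task, trade) triples; all hypothetical.
TASKS = [
    ("open up the kitchen wall", "check structural implications", "structural"),
    ("open up the kitchen wall", "tear down the wall", "demolition"),
    ("open up the kitchen wall", "move wires that were in the wall", "electrical"),
    ("open up the kitchen wall", "redo the flooring", "flooring"),
    ("charge the car in the garage", "run a new circuit to the garage", "electrical"),
]


def schedule_by_trade(tasks):
    """Group tasks by the trade that performs them, ignoring story boundaries."""
    batches = defaultdict(list)
    for story, task, trade in tasks:
        batches[trade].append((story, task))
    return dict(batches)


if __name__ == "__main__":
    for trade, batch in schedule_by_trade(TASKS).items():
        print(trade)
        for story, task in batch:
            print(f"  {task}  (from: {story})")
```

Here the "electrical" batch ends up containing work from two different user stories, which is the unit the electrician would actually want to be scheduled for.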
esperent 2 days ago

> Imagine building a house that way - you don't build the living room, then the kitchen, then the bathroom etc.; you build floors, walls, the roof...
This isn't a good analogy. When building a house, you are physically realising a blueprint that describes everything in great detail. You know exactly where every wire and pipe should go ahead of time. When there are changes, they must be minor.
This isn't how writing code works. Maybe some management level people would like to believe it can work that way, but it doesn't in practice.
- danparsonson a day ago
  
  OK, it's just an analogy, to make a broad point about the approach to design and construction. But you absolutely can design software in that way - I do it, and it works very well. It's not easy, but as I said in another comment, it delivers robust solutions that are nonetheless adaptable to change and easy to maintain. By contrast the 'agile' projects I've worked on were either a tangled mess, or else failed spectacularly.
  > This isn't how writing code works
  Maybe that's not how you write code, but there are many different ways to paint that particular fence, no? I've been coding for a long time and for me, this is the approach that I've landed on that's the best I've found so far. To me, the idea of feature-led development is, to put it mildly, nonsensical.
igouy 2 days ago

Software is not like "building a house" and is not like a sculpture and is not like a cake because software is (mostly) notional not physical.
- danparsonson a day ago
  
  Yeah the point is about design philosophy. The physicality or not is irrelevant.
  
  igouy a day ago
  
  Demonstrate that is so. Provide examples that are not physical.
  
  danparsonson 15 hours ago
  
  No. You just don't understand what I'm talking about.
  
  igouy 2 hours ago
  
  Perhaps if you talked about software instead of house-building, sculpture, cakes …
- hackable_sand 2 days ago
  
  I don't see the difference. Could you explain how the physical attributes change the analogies?
  
  igouy 2 days ago
  
  The physical constraints govern the development processes described in the analogies.
  The process for software is not constrained in that way.
  
  fragmede 2 days ago
  
  I can use simple find and replace to change a variable name. If I've mixed salt and sugar up, there's no undo button.
igouy 21 hours ago

> do you finish working on one leg first before you move onto some other part?
Depends on the physical process. Are you carving, casting, bolting or welding or using 3d modelling and printing … ?
> trying to stick them together just results in a fragile mess
If it's a physical cake.
If it's software we seamlessly add functionality to each layer as needed.
ath_ray a day ago

"Unit of work" here is the unit for software delivery, and it can be decoupled from how any individual developer plans and executes whatever software they are delivering.
Product requirements are a hypothesis for creating business value, and the only way to test that hypothesis is to actually demonstrate a slice of that value in a way that's legible to all stakeholders involved.
This post is a nice articulation of this: https://blog.nilenso.com/blog/2025/09/17/the-common-sense-un...
- discreteevent a day ago
  
  That post seems to say that the unit of work must be something customer facing. It even quotes Kent Beck talking about "Weekly delivery of customer-appreciated value".
  There is so much great software in the world that wasn't delivered like that and couldn't be delivered like that: Unix, Microsoft Word, Postgres, AutoCAD, The JVM, Google search, Windows, AWS, Robotics, Calculators ....
  The software industry seems to have been captured by contractors who used to deliver CRUD apps and now want to make the whole world in their image and likeness.
- danparsonson a day ago
  
  That's the point though, thinking of delivery in terms of slices of business value naturally leads one to break application development along those lines. It's very convenient for the stakeholders to see progress mapped out like that, but it tends to lead to fragile and poorly-architected systems that are difficult to change in the future (and therefore not lower-case A agile).
  
  sriharis a day ago
  
  Slicing a cake across layers is about prioritising value and mitigating the risk of building the wrong thing. Most product and feature requirements are hypotheses for creating value, unless that hypothesis has already been validated.
  > It's very convenient for the stakeholders to see progress mapped out like that
  It's important for the business to validate product value. This is not just progress anxiety.
  Crafting software to perfection is ultimately a waste if it doesn't provide value to the business or customer. If we are sure we're building the right thing, we can risk more, and spend more of our time building the thing better. Build scrappy first, build confidence in value, and then craft to perfection.
  The slices of cake aren't built in isolation. Every time a slice is being worked on, it is integrated back. The cake analogy falls apart here, because cakes (and houses) aren't nearly as malleable as software. We have opportunities to refactor it every step along the way, and change its shape. Yes, sometimes we refactor independent of business value, and I think that's essential too. I don't think the idea that's presented is to have absolutely every slice be vertical, and business / customer facing.
citizenpaul 2 days ago

I've been downvoted before for saying my take on this but...
It's because SE is a low class low power field. It's not respected by the people in charge at the overwhelming majority of companies. It has resisted standardizing like lawyers, doctors or even real estate agents. So there is little leverage a person in the field can push back with. It's mostly just seen as an annoyance to gaining/consolidating power for the power brokers on their way up the ladder.
That really is what computers/software are. Huge engines for orchestrating power that kings of old couldn't dream of.
- Jensson 2 days ago
  
  > It has resisted standardizing like lawyers, doctors or even real estate agents.
  You can't standardize a field that changes so fast, it takes decades to standardize a field and there has never been a point in time of software where two decades didn't completely change the job.
  
  citizenpaul 2 days ago
  
  There is almost nothing new in CS since the 1970's. Even LLM's were invented at least theoretically back then.
- prmph a day ago
  
  This is the absolute truth.
  And the worse news is: it will never change. There are several things fundamental to SWE, at least the corporate, open source, and/or indie flavors, that ensure it will not be standardized.
- jnwatson 2 days ago
  
  > "SE is a low class low power field"
  This is the difference with FAANGs. Software engineering is king. The inmates are running the asylum.
  Google is at least 4x as efficient as other large companies I've worked for. Nearly every internal process that can possibly be automated is.
  
  fragmede a day ago
  
  I was there for three years and you're totally right that everything's been automated, but also there are a large number of product level decisions that just don't make sense. They make financial sense, sure, but then that means the engineer has drunk the MBA Kool-Aid (or not enough of it), things get killed off, and they are no longer to be trusted around things that need proper love and care put into them. Promo packets though, sure.
  https://therussofirm.com/man-dies-after-following-google-map...
  It's hard to read that as a human, though, and not want to build a system that lets people update bad map data? Which there used to be, but then yeah.
  So yeah, the inmates (engineers) used to run the asylum (Google), but then a group of fucking psychopaths (DoubleClick) got added to the asylum, got given meth (ad money) and shit's fucking unhinged.

datadrivenangel 2 days ago

Keep your scope as small as necessary, but no smaller. This has been fundamentally true for project management work breakdown structures for decades.

BinaryIgor 2 days ago

Interesting that it turns out to be true for code generation as well!
igouy 2 days ago

But not actionable?

tedggh 2 days ago

I found out that summarizing a completed task and feeding it to a new context works better than staying on the same context for multiple tasks. So let’s say I have a sprint with tasks 1, 2 and 3. I start by creating a project with general information including the spec, git issues, code base, folder trees, etc then work on Task 1. When done I ask for a summary using a template, which gives me a txt file describing what the original goal was, what we changed and what the next steps are. Then I repeat the process for Task 2 and I feed the summary from Task 1. At least in ChatGPT keeping the same context for multiple tasks has lots of issues like speed, increased hallucinations, and ChatGPT referencing content from old files.
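A minimal sketch of that hand-off, assuming the summary template has exactly the three parts mentioned above (original goal, what changed, next steps). The names `TaskSummary` and `build_fresh_context` are hypothetical, and no particular chat API is called: the function only assembles the text you would paste into a new conversation.

```python
from dataclasses import dataclass

SUMMARY_TEMPLATE = """\
Task: {name}
Original goal: {goal}
What we changed: {changes}
Next steps: {next_steps}
"""


@dataclass
class TaskSummary:
    name: str
    goal: str
    changes: str
    next_steps: str

    def render(self) -> str:
        return SUMMARY_TEMPLATE.format(
            name=self.name,
            goal=self.goal,
            changes=self.changes,
            next_steps=self.next_steps,
        )


def build_fresh_context(project_info: str, previous: list, next_task: str) -> str:
    """Assemble the opening prompt for a new conversation.

    Only the general project information and the short summaries of earlier
    tasks are carried over, not the earlier transcripts.
    """
    parts = [project_info.strip()]
    parts += [summary.render().strip() for summary in previous]
    parts.append(f"Current task: {next_task}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    task1 = TaskSummary(
        name="Task 1",
        goal="Add CSV export",
        changes="New export module and two tests",
        next_steps="Wire the export into the settings page",
    )
    print(build_fresh_context("Spec, issues, folder tree go here", [task1], "Task 2"))
```

Because the summary is plain text, it can be saved to a file and edited by hand before it is fed into the next context.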

smw 2 days ago

Hell, Claude even makes that part of the standard workflow, with /compact; cleverly using the LLM itself to summarize the previous context
- furyofantares 2 days ago
  
  /compact is really poor imo. Quality just falls off a ledge after it.
  I much prefer to choose tasks that can be done with 25%+ context left and then just start the next task with fresh context.
  If I'm getting low on context I have it summarize the plan and progress in a text file rather than use /compact and then start a fresh context and reference that file, which I can then edit and try again if I'm not getting good results.
- bryanlarsen 2 days ago
  
  Once you see that message it's time to finish the task without AI because Claude will start crapping over your codebase if you let it continue.
  
  ewoodrich 2 days ago
  
  I just got burned by that, it compacted and then immediately dropped what it was in the middle of working on to redo something it had already finished half an hour earlier. Which, predictably, sent it into "systematically destroy the entire working codebase" mode because the code it was now reading didn't match expectations of the original instructions. So it started extrapolating like "huh, it looks like the function already exists, therefore, user must have meant [bizarre completely out of left field guess] instead" in an escalating loop of confusion and code mangling.

rglover 2 days ago

What I've found works best:

1. Assume that any model will start to lose focus beyond 50K-100K tokens (even with a huge context window).

2. Be gluttonous with chats. At the first sign of confusion or mistakes, tell it to generate a new prompt and move to a new chat.

3. Write detailed prompts with clear expectations (from how the code should be written to the specific implementation that's required). Combine these with context like docs to get a fairly consistent hit rate.

4. Use tools like Cline that let you switch between an "Act" and "Plan" mode. This saves a ton of tokens but also avoids the LLM getting stuck on a loop when it's debugging.
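
Points 1 and 2 can be sketched as a rough budget check. This is an illustration only: the 4-characters-per-token ratio is a crude assumption (real tokenizers vary by model and content), and `estimate_tokens` and `should_start_new_chat` are hypothetical helper names.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; the chars-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token)


def should_start_new_chat(messages: list[str], budget: int = 50_000) -> bool:
    """Return True once the estimated conversation size reaches the budget."""
    total = sum(estimate_tokens(message) for message in messages)
    return total >= budget
```

The default budget uses the lower end of the 50K-100K range from point 1, so the check errs on the side of moving to a new chat early.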

I recently wrote this short blog post related to this: https://ryanglover.net/blog/treat-the-ai-like-it-s-yourself

The above approach helped me to implement a full-blown database wrapper around LMDB for Node.js in ~2 weeks of slow back-and-forth (link to code in post for those who are curious).

jonstewart 2 days ago

I first tried getting specific with Claude Code. I made the Claude.md, I detailed how to do TDD, what steps it should take, the commands it should run. It was imperfect. Then I had it plan (think hard) and write the plan to a file. I’d clear context, have it read the plan, ask me questions, and then have it decompose the plan into a detailed plan of discrete tasks. Have it work its way through that. It would inevitably go sideways halfway through, even clearing context between each task. It wouldn’t run tests, it would commit breakage, it would flip flop between two different broken approaches, it was just awful. Now I’ve just been vibing, writing as little as possible and seeing what happens. That sucks, too.

It’s amazing at reviewing code. It will identify what you fear, the horrors that lie within the codebase, and it’ll bring them out into the sunlight and give you a 7 step plan for fixing them. And the coding model is good, it can write a function. But it can’t follow a plan worth shit. And if I have to be extremely detailed at the function by function level, then I should be in the editor coding. Claude code is an amazing niche tool for code reviews and dialogue and debugging and coping with new technologies and tools, but it is not a productivity enhancement for daily coding.

liszper 2 days ago

With all due respect, you sound like someone who is just getting familiar with these tools. 100 more hours spent with AI coding and you will be much more productive. Coding with AI is a slightly different skill from coding, similar to how managing software engineers is different from writing software.
- abtinf 2 days ago
  
  liszper:
  > most SWE folks still have no idea how big the difference is between the coding agents they tried a year ago and declared as useless and chatgpt 5 paired with Codex or Cursor today
  Also liszper: oh, you tried the current approach and don’t agree with me? Well you just don’t know what you are doing.
  
  bubblyworld 2 days ago
  
  Lol, what is up with everyone assuming there's no learning curve to these things? If you applied this argument to literally any other tool you would be laughed at, for good reason.
  
  bluefirebrand 2 days ago
  
  Probably because "there's no learning curve they are just magic tools" is how they are marketed and how our managers are expecting them to work
  
  bubblyworld 2 days ago
  
  Sure, but people are allowed to have their own opinions too.
  
  liszper 2 days ago
  
  Yes, exactly. Learning new things is hard. Personally it took me about 200 hours to get started, and since then ~2500 hours to get familiar with the advanced techniques, and now I'm very happy with the results, managing extremely large codebases with LLM in production.
  For context before that I had ~15 years of experience coding the traditional way.
  
  chownie 2 days ago
  
  Has anyone else noticed the extreme dichotomy of developers using AI agents? Either AI agents essentially don't work, or they are apparently running legions of agents to produce some nebulous gigantic estate.
  I think the crucial difference is that I do actually see evidence (ie the codebase) posted sometimes for the former, the latter could well be entirely mythos -- a 24 day old account evangelizing for the legion of agents story does kind of fit the theme.
  
  azinman2 2 days ago
  
  You should write a blog post about your learnings. If you could even give some high level highlights here that’d be really helpful.
  
  sarchertech 2 days ago
  
  How many users is production and how large is extremely large.
  
  liszper 2 days ago
  
  200k DAU, 7 million registered, ~50 microservices, large monorepo
  
  sarchertech 2 days ago
  
  You have 50 microservices for 200k daily users?
  Let me guess this has something to do with AI?
  
  liszper 2 days ago
  
  No, it has something to do with experience. The system is highly integrated with other platforms and has to stay afloat during burst loads.
  
  pjc50 2 days ago
  
  .. what is this thing and can we see it?
  
  liszper 2 days ago
  
  you can OSINT me pretty easily, not going to post it here for the sake of anonymity against crawlers who train models on our conversations. today's HN comments are tomorrow's coding LLMs
  
  pjc50 2 days ago
  
  Funnily enough the same kind of approach you get from Lisp advocates and the more annoying faction of Linux advocacy (which isn't as prevalent these days, it seems)
  
  klibertp 2 days ago
  
  > the same kind of approach you get from Lisp
  In what way? Lisp (Common Lisp) is the most stable and unchanging language out there. If you learned it anytime after the late 80s, you still know it, and will know it until the end of time. Meanwhile, here, we hear that "a year ago" is so much time that everything changed (for the better, of course).
  Or is it about needing some serious time investment to get comfortable with Lisp? Even then, once you do spend enough time that s-exprs stop being a problem, that's it; there's nothing else to be getting comfortable with, and certainly, you won't need to relearn any of that a year into the future.
  I don't think AI coding and Lisp are comparable, even considering just the tone of messages on the topic (as far as I can see, "smug lisp weenies" are a thing of the ancient past).
  
  liszper 2 days ago
  
  I'm also a lisper, yes.
- ryandrake 2 days ago
  
  I'm starting to kind of dig C.C. but you're right, it definitely feels like a very confident, very ambitious high schooler level developer with infinite energy. You really have to give it very small tasks and be constantly steering its direction. At the end of the day, I'm not sure I'm really saving that much time coaching Claude to do the job right vs. just writing the code myself, but it's definitely a neat trick.
  The difference from an actual junior developer, of course, is that the human junior developer learns from his mistakes and gets better, but Claude seems to be stuck at the level of expertise of its model, and you have to wait for the model to improve before Claude improves.
  
  jonstewart 2 days ago
  
  The thing I am calling BS on is that there's much productivity gain in giving it very small tasks and constantly steering its direction. For 80% of code, I'm faster than it if that's what I have to do. For debugging? For telling it to integrate a new tool? Port my legacy build system to something better? It's great at that, removes some important barriers to entry.
  
  rmunn 2 days ago
  
  Bingo. All my experience is on Linux, and I've never written anything for Windows. So recently when I needed to port a small C program to Windows, I told ChatGPT "Hey, port this to Windows for me". I wouldn't trust the result, I'd rewrite it myself, but it let me find out which Win32 API functions I'd be calling, and why I'd be calling them, faster than searching MSDN would have done.
- jonstewart 2 days ago
  
  I think it has more to do with the kind of software I write and its requirements than it has to do with spending more time with this current tool. For some things it's great, but it's been a net productivity loss for me on my core coding responsibilities.
- zmmmmm 2 days ago
  
  ah, they are holding it wrong.
  I am always so skeptical of this style of response. Because if it takes hundreds of hours to learn to use something, how can it really be the silver bullet everyone was claiming earlier? Surely they were all in the midst of the 100 hours. And what else could we do if we spent 100 hours learning something? It's a lot of time, a huge investment, all on faith that things will get better.
  
  jmatthews 4 hours ago
  
  How many hours do you have mastering git or your IDE or your library of choice for UX?
- TheRoque 2 days ago
  
  Then, it's the job of someone else to use these tools, not developers
  
  liszper 2 days ago
  
  I agree with your point. I think this is the reason why most developers still don't get it, because AI coding ultimately requires a "higher level" methodology.
  
  dgfitz 2 days ago
  
  "Hacker culture never took root in the 'AI' gold rush because the LLM 'coders' saw themselves not as hackers and explorers, but as temporarily understaffed middle-managers." [0]
  This, this is you. This is the entire charade. It seems poetic somehow.
  [0]https://news.ycombinator.com/item?id=45123094
  
  liszper 2 days ago
  
  I see myself as a hacker.
  
  dgfitz 2 days ago
  
  By your own exposition, you aren’t a hacker.
  
  liszper 2 days ago
  
  hackers can also cook and not become a chef
cadamsdotcom 2 days ago

Don’t give up on TDD.
I’ve invested hundreds of hours in process and tooling, and can now ship major features with tests in record time with Claude Code.
You have to coach it in TDD - no matter how much you explain in CLAUDE.md. That’s partly because “a test that fails because the code isn’t written yet” is conceptually very similar to “a test that passes without the code we’re about to write” and is also similar to “a test that asserts the code we’re about to write is not there”. You have to watch closely to make sure it produces the first thing.
Why does it keep getting confused? You can’t blame it really. When two things are conceptually similar, models need lots of examples to distinguish between them. If the set of samples is sparse the model is likely to jump the small distance from a concept to similar ones.
So, you have to accept this as how Claude 4 works, keep it on a short leash, keep reminding it that it must watch the test fail, ask it if the test failed for the right reason (not some setup issue), and THEN give it permission to write the code.
The result is two mirror copies of your feature or fix: code and tests.
Reviewing code and tests together is pleasant because they mirror one another. The tests forever ensure your feature works as described, no manual testing needed, no regressions. And the model knows all the tricks to make your tests really beautiful.
TDD is the check and balance missing from most people’s agentic software dev process.
- jonstewart 18 hours ago
  
  Oh, I will never give up on TDD. And the assistants are great at helping to write tests, and especially analyzing the tests you have and suggesting others for edge cases.
  But I have repeatedly seen claude get hung up on TDD itself and I've tried lots of different prompts/directions. It runs into a problem and inevitably runs ever more complicated shell commands and creates weird temp input files rather than sticking to "cargo test" and addressing the failing test.
  Since I need to review the agent's code, I'd much prefer it to use a workflow like a human, with a progression of small commits following TDD--much easier to review the code then. If it's just splatting up big diffs, then it makes review harder, and that offsets any productivity gains.

zmmmmm 2 days ago

I prefer small units of work. It really surprises me how fast people have leapt from the 2x speedup (mind blowing level of productivity increase) to full agentic coding without really questioning if it's a good idea.

When I have let Claude loose and vibe coded up hundreds of lines at a time that I have no familiarity with, I viscerally feel how I no longer understand or can maintain the app I've built. If I can't get Claude to do the next change I need, I'm screwed.

I'm very satisfied at the moment to be wielding LLMs as a tool at the individual function / microfeature level and getting a very satisfying productivity improvement.

elpakal 2 days ago

I agree with the premise of the article, and have felt that we're probably seeing AI code gen tools being limited by the constraints being put on them by traditional source code management tools like git and GitHub. Those tools were designed for incremental changes (patches), and have worked well for humans to organize changes so they could be more easily reviewed, maintained and reasoned about. Units of work in the form of features, patches etc rely on "pull requests" which are a function of the above.

liszper 2 days ago

most SWE folks still have no idea how big the difference is between the coding agents they tried a year ago and declared as useless and chatgpt 5 paired with Codex or Cursor today

thanks for the article, it's a good one

blibble 2 days ago

> most SWE folks still have no idea how big the difference is between the coding agents they tried a year ago and declared as useless and chatgpt 5 paired with Codex or Cursor today
yes, just as was said each and every previous time OpenAI/anthropic shit out a new model
"now it doesn't suck!"
- Filligree 2 days ago
  
  Each and every new model expands the scope of what you can do. You notice that, get elated when things that didn’t work start working, then three weeks later the honeymoon period is over and you notice the remaining limits.
  The hedonic treadmill ensures it feels the same way each time.
  But that doesn’t mean the models aren’t improving, nor that the scope isn’t expanding. If you compare today’s tools to those a year ago, the difference is stark.
- thrawa8387336 2 days ago
  
  She is choosing GPT5 as the good example? Maybe Claude, maybe..
angusturner 2 days ago

I think most SWEs do have a good idea where I work.
They know that it's a significant, but not revolutionary improvement.
If you supervise and manage your agents closely on well scoped (small) tasks they are pretty handy.
If you need a prototype and don't care about code quality or maintenance, they are great.
Anyone claiming 2x, 5x, 10x etc is absolutely kidding themselves for any non-trivial software.
- jmcodes 2 days ago
  
  I've found a pretty good speed up just letting Claude Code run with a custom prompt to gather the context (relevant files, types, etc..) for the task then having it put together a document with that context.
  It takes all of five minutes to have it run and at the end I can review it, if it's small ask it to execute, and if it actually requires me to work it myself well now I have a reference with line numbers, some comments on how the system appears to work, what the intent is, areas of interest, etc..
  I also rely heavily on the sequential thinking MCP server to give it more structure.
  Edit:
  I will say because I think it's important I've been a senior dev for a while now, a lot of my job _is_ reviewing other people's pull requests. I don't find it hard or tedious at all.
  Honestly it's a lot easier to review a few small "PRs" as the agent works than some of the giant PRs I'd get from team members before.
  
  zmmmmm 2 days ago
  
  > I've been a senior dev for a while now, a lot of my job _is_ reviewing other people's pull requests
  I kind of hate that I'm saying this, but I'm sort of similar and one thing I really like is having zero guilt about trashing the LLM's code. So often people are submitting something and the code is OK but just pervasively not quite how I like it. Some staff will engage in micro arguments about things rather than just doing them how I want and it's just tiring. Then LLMs are really good at explaining why they did stuff (or simulating that) as well. LLMs will enthusiastically redo something and then help adjust their own AGENTS.md file to align better in the future.
- bluefirebrand 2 days ago
  
  > If you supervise and manage your agents closely on well scoped (small) tasks they are pretty handy
  Compared to just doing it yourself though?
  Imagine having to micromanage a junior developer like this to get good results
  Ridiculous tbh
- dingnuts 2 days ago
  
  if the benefit is less than 2x then we're talking about AI assisted coding as being a very, very expensive IntelliSense. 1.x improvement just isn't much. My mind goes back to that study showing engineers claimed a 20% improvement and measured 20% reduction in productivity -- this is all encouraging me to just keep using traditional tools.
  
  rmunn 2 days ago
  
  The only AI-assisted software work I've seen actually have a benefit is the way my coworker uses Supermaven, where it's basically Intellisense but suggesting filling in the function parameters for you as well. He'll type `MergeEx` and it will not just suggest `MergeExample(` as Intellisense would have done, but also suggest `MergeExample(oldExample, newExample, mergeOptions)` based on the variable names in scope at the moment and which ones line up with the types. Then he presses Tab and moves on, saving 10-15 seconds of typing. Repeat that multiple times through the day and it might be a 10% improvement, with no time lost on fiddling with prompts to get the AI to correct its mistakes. (Here, if the suggestion is wrong, he just ignores it and keeps typing, and the second he types a character that wasn't the next one in the suggestion it goes away and a new suggestion might be calculated, but the cognitive load in ignoring the incorrect suggestion is minimal).
- gnarcoregrizz 2 days ago
  
  I've found it to be insanely productive when doing framework-based web development (currently working with Django), I would say it's an easy 5-10x improvement in productivity there, but I still need to keep a somewhat close eye on it. It's not nearly as productive in my home grown stuff, it can be kind of annoying actually.
- liszper 2 days ago
  
  I'd argue this just proves my point.
TheRoque 2 days ago

It's true that I haven't been a hardcore agent-army vibe coder, I just try the popular ones once in a while in a naive way (isn't it the point of these tools, to have little friction?), claude code for example. And it's cool! But imperfect, and as this article attests, there's a lot of mental overhead to even have a shot at getting a decent output. And even if it's decent, it still needs to be reviewed and could include logical flaws.
I'd rather use it the other way, I'm the one in charge, and the AI reviews any logical flaw or things that I would have missed. I don't even have to think about context window since it'll only look at my new code logic.
So yeah, 3 years after the first ChatGPT and Copilot, I don't feel huge changes regarding "automated" AI programming, and I don't have any AI tool in my IDE, I prefer to have a chat using their website, to brainstorm, or occasionally find a solution to something I'm stuck on.
zeroonetwothree 2 days ago

I use agents for coding small stuff at work almost every day. I would say there has been some improvement compared to a year ago but it’s not any sort of step change. They still are only able to complete simple “intern-level” tasks around 50% of the time. Which is helpful but not revolutionary.
rco8786 2 days ago

I still use Claude Code and Cursor and tbh still run into a lot of the same issues. Hallucinating code, hallucinating requirements, even when scoped to a very simple "make this small change".
It's good enough that it helps, particularly in areas or languages that I'm unfamiliar with. But I'm constantly fighting with it.
kibwen 2 days ago

Last week I wanted to generate some test data for some unit tests for a certain function in a C codebase. It's an audio codec library, so I could have modified the function to dump its inputs to disk and then run the library on any audio file and then hardcoded the input into the unit tests. Instead, I decided I wanted to save a few bytes and wanted to look at generating dummy data dynamically. I wanted to try out Claude for generating the code that would generate the data, so to keep the context manageable I extracted the function and all its dependencies into a self-contained C program (less than 200 lines altogether) and asked it to write a function that would generate dummy data, in C.
Impressively, it recognized the structure of the code and correctly identified it as a component of an audio codec library, and provided a reasonably complete description of many minute details specific to this codec and the work that the function was doing.
Rather less impressively, it decided to ignore my request and write a function that used C++ features throughout, such as type inference and lambdas, or should I say "lambdas" because it was actually just a function-defined-within-a-function that tried to access and mutate variables outside of its own function scope, like we were writing Javascript or something. Even apart from that, the code was rife with the sorts of warnings that even a default invocation of gcc would flag.
I can see why people would be wowed by this on its face. I wouldn't expect any average developer to have such a depth of knowledge and breadth of pattern-matching ability to be able to identify the specific task that this specific function in this specific audio codec was performing.
At the same time, this is clearly not a tool that's suitable for letting loose on a codebase without EXTREME supervision. This was a fresh session (no prior context to confuse it) using a tightly crafted prompt (a small, self-contained C program doing one thing) with a clear goal, and it still required constant handholding.
At the end of the day, I got the code working by editing it manually, but in an honest retrospective I would have to admit that the overall process actually didn't save me any time at all.
Ironically, despite how they're sold, these tools are infinitely better at going from code to English than going the other way around.
- angusturner 2 days ago
  
  I feel this. I've had a few tasks now where in honest retrospect I find myself asking "did that really speed me up". It's a bit demoralising cause not only do you waste time, you have a worse mental model of the resulting code and feel less sense of ownership over the result.
  Brainstorming, ideation and small, well defined tasks where I can quickly vet the solution : these feel like the sweet spot for current frontier model capabilities.
  (Unless you are pumping out some sloppy React SPA that you don't care about anything except get it working as fast as possible - fine, get Claude code to one shot it)
- Filligree 2 days ago
  
  There’s been a lot of noise about Claude performance degradation, and the current best option is probably Codex, but this still surprises me. It sounds like it succeeded on the hard part, then stumbled on the easy bit.
  Just two questions, if you don’t mind satisfying my curiosity.
  - Did you tell it to write C? Or better yet, what was the prompt? You can use Claude --resume to easily find that.
  - Which model? (Sooner or Opus)? Though I’d have expected either one to work.
  
  chrisweekly 2 days ago
  
  Sooner -> Sonnet
- walleeee 2 days ago
  
  > Ironically, despite how they're sold, these tools are infinitely better at going from code to English than going the other way around.
  Yes. Decently useful (and reasonably safe) to red team yourself with. But extremely easy to red queen yourself otherwise.
realusername 2 days ago

I tried again recently and I see absolutely no difference. If there's been some improvement, it's very subtle.
There's a big difference between their benchmarks and real-world coding.

xmpir 2 days ago

Same as for human software engineers... We'll see Conway's law all over again with agentic coding!

bryanrasmussen 2 days ago

maybe it just works that way for Agents because they see in the data it works that way for humans.

trash_cat a day ago

>> "Turns out the major bottleneck is not intelligence, but rather providing the correct context."

But this has more or less always been the case for LLMs. The challenge becomes context capture. Which in my opinion is the real challenge with LLM adoption. Without the right context, some tasks just cannot be reliably completed.

chrisrickard 2 days ago

We are tackling this at https://userdoc.fyi - we help you build your specs (epics, stories, acceptance criteria, tech notes, test cases, etc) - then you can generate what we call Dev Plans, one or more requirement layers for implementation.

e.g. maybe a dev plan is all your authentication feature requirements, or in the house analogy – all the requirements for the rooms, but with instructions to actually just first build the floor, and the walls.

Dev plans then slice the reqs into meaningful units of work, as mentioned in the article – a feature/story is often too large of a checkpoint, or often needs to be implemented in collaboration with other features/stories, so it understands the correct architectural context.

You can then implement Dev plans over MCP, or copy to .md for tools like Lovable or V0.

jaaron 2 days ago

Agreed that picking the right size of work is critical.

I didn't know about Kiro specs. I've been playing around with my own org-mode based approach with mixed success in keeping dev agent work tracked:

https://github.com/farra/dev-agent-work

marstall 2 days ago

doing things in small chunks is good. so is doing things in large chunks sometimes. In AI, like in life, there are no hard and fast rules and we're all figuring it out as we go. Like with "vibe coding" - sometimes it's ok to not even look at the code AI has generated, sometimes you need to understand every line.

It feels like part of my journey to being an "AI developer" is being present for those tradeoffs, metabolizing each one into my craft.

AI is a fickle, but powerful horse. I'm finding it a privilege to learn how to be a rider.

stpedgwdgfhgdd 2 days ago

Test driven development, there is nothing more to say for the coming year