Hallucinations in code are the least dangerous form of LLM mistakes

371 points by ulrischa 7 months ago

Terr_ 7 months ago

[Recycled from an older dupe submission]

As much as I've agreed with the author's other posts/takes, I find myself resisting this one:

> I'll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”

> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people.

No, that does not follow.

1. Reviewing depends on what you know about the expertise (and trust) of the person writing it. Spending most of your day reviewing code written by familiar human co-workers is very different from the same time reviewing anonymous contributions.

2. Reviews are not just about the code's potential mechanics, but inferring and comparing the intent and approach of the writer. For LLMs, that ranges between non-existent and schizoid, and writing it yourself skips that cost.

3. Motivation is important, for some developers that means learning, understanding and creating. Not wanting to do code reviews all day doesn't mean you're bad at them. Also, reviewing an LLM's code has no social aspect.

However you do it, somebody else should still be reviewing the change afterwards.

Eridrus 7 months ago

Yeah, I strongly disagree with this too.
I've spent a lot of time reviewing code and doing code audits for security (far more than the average engineer) and reading code still takes longer than writing it, particularly when it is dense and you cannot actually trust the comments and variable names to be true.
AI is completely untrustable in that sense. The English and code have no particular reason to align so you really need to read the code itself.
These models may also use unfamiliar idioms where you don't know the edge cases where you either have to fight the model to do it a different way, or go investigate the idiom and think through the edge cases if you really want to understand it.
I think most people just don't read the code these models produce at all and just click accept and then just see if tests pass or just look at the output manually.
I am still trying to give it a go, and sometimes it really does make things easier on simpler tasks and I am blown away, and it has been getting better, but I feel like I need to set myself a hard timeout with these tools where if they haven't done basically what I wanted quickly, I should just start from scratch since the task is beyond them and I'll spend more time on the back and forth.
They are useful for giving me the motivation to do things that I'm avoiding because they're too boring though because after fighting with them for 20 minutes I'm ready to go write the code.
lsy 7 months ago

I'm also put off by the author's condescension towards people who aren't convinced after using the technology. It's not the user's job to find a product useful, it's a product's job to be useful for the user. If a programmer puts a high value on being able to trust a program's output to be minimally conformant to libraries and syntax that are literally available to the program, and places a high value on not having to babysit every line of code that you review and write, that's the programmer's prerogative in their profession, not some kind of moral failing.
elcritch 7 months ago

> 2. Reviews are not just about the code's potential mechanics, but inferring and comparing the intent and approach of the writer. For LLMs, that ranges between non-existent and schizoid, and writing it yourself skips that cost.
With humans you can be reasonably sure they've followed through with a mostly consistent level of care and thouhht. LLMs will just outright lie to make their jobs easier in one section while in another area generate high quality code.
I've had to do a 'git reset --hard' after trying out the Claude code and spending $20 bucks. It always seems great at first, but it just becomes non-sense on larger changes. Maybe chain of thought models do better though.
- aaronbaugher 7 months ago
  
  It's like cutting and pasting from Stack Overflow, if SO didn't have a voting system to give you some hope that the top answer at least works and wasn't hallucinated by someone who didn't understand the question.
  I asked Gemini for the lyrics of a song that I knew was on all the main lyrics sites. It gave me the lyrics to a different song with the same title. On the second try, it hallucinated a batch of lyrics. Third time, I gave it a link to the correct lyrics, and it "lied" and said it had consulted that page to get it right but gave me another wrong set.
  It did manage to find me a decent recipe for chicken salad, but I certainly didn't make it without checking to make sure the ingredients and ratios looked reasonable. I wouldn't use code from one of these things without closely inspecting every line, which makes it a pointless exercise.
  
  simonw 7 months ago
  
  I'm pretty sure Gemini (and likely other models too) have been deliberately engineered to avoid outputting exact lyrics, because the LLM labs know that the music industry is extremely litigious.
  I'm surprised it didn't outright reject your request to be honest.
  
  aaronbaugher 7 months ago
  
  I wondered if it'd been banned from looking at those sites. If that's commonly known (I've only started dabbling in this stuff, so I wasn't aware of that), it's interesting that it didn't just tell me it couldn't do that, instead of lying and giving false info.
  
  krupan 7 months ago
  
  "it's interesting that it didn't just tell me it couldn't do that, instead of lying and giving false info."
  Interesting is a very kind word to use there
- boesboes 7 months ago
  
  I did the exact same today! It started out reasonable, but as you iterate on the commits/PR it become complete crap. And expensive too for very little value.
- Terr_ 7 months ago
  
  > With humans you can be reasonably sure they've followed through with a mostly consistent level of care and thouhht.
  And even if they fail, other humans are more likely to fail in ways we are familiar-with and can internally model and anticipate ourselves.
saghm 7 months ago

The crux of this seems to be that "reviewing code written by other people" isn't the same as "reviewing code written by LLMs". The "human" element of human-written code allows you to utilize social knowledge as well as technical, and that can even be built up over time when reviewing the same persons' code. Maybe there's some equivalent of this that people can develop when dealing with LLM code, but I don't think many people have it now (if it even does exist), and I don't even know what it would look like.
mcpar-land 7 months ago

the part of their claim that does the heavy lifting is "code written by other people" - LLM-produced code does not fall into that category. LLM code is not written by anyone. There was no model in a brain I can empathize with and think about why they might have done this decision or that decision, or a person I can potentially contact and do code review with.
theshrike79 7 months ago

You can see the patterns a.k.a. "code smells"[0] in code 20x faster than you can write code yourself.
I can browse through any Java/C#/Go code and without actually reading every keyword see how it flows and if there's something "off" about how it's structured. And if I smell something I can dig down further and see what's cooking.
If your chosen language is difficult/slow to read, then it's on you.
And stuff should have unit tests with decent coverage anyway, those should be even easier for a human to check, even if the LLM wrote them too.
[0] https://en.wikipedia.org/wiki/Code_smell
- skywhopper 7 months ago
  
  Wow, what a wildly simplistic view you have of programming. “Code smells” (god, I hate that term) are not the only thing that can be wrong. Unit tests only cover what they cover. Reviewing the code is only one piece of the overall cost here.
- theshrike79 7 months ago
  
  -4 points and one reply, what is this, Reddit? The downvote button isn't for "I disagree".
  
  throwuxiytayq 7 months ago
  
  You’re catching some downvotes, but I agree with your perspective. I’m feeling very productive with LLMs and C# specifically. There’s definitely some LLM outputs that I don’t even bother checking, but very often the code is visibly correct and ready for use. Ensuring that the LLM output conforms to your preferred style (e.g. functional-like with static functions) helps a lot. I usually do a quick formatting/refactoring pass with the double purpose of also understanding and checking the code. In case there’s doubts about correctness (usually in just one or two spots), they can be cleared up very quickly. I’m sure this workflow isn’t a great fit for every language, program type and skill level (there’s experts out there that make me embarrassed!), but reading some people I feel like a lot of my peers are missing out.
  
  SR2Z 7 months ago
  
  I think the reason for this gap are the differences in scope and novelty between codebases. When you need an LLM to write a piece of code that's been written a million times before (e.g. "find the normal vector to this plane", "find the highest scoring user") it generally produces decent code.
  But on the flip side, this type of code is intrinsically less valuable than novel stuff ("convert this signed distance field to a mesh") which an LLM will choke on.
  
  throwuxiytayq 7 months ago
  
  Not sure that vector normalization and “MaxBy” count as a piece of code. It’s a building block, less than a line of code. Usually more compact to just type it out than to describe it in natural language.
  
  SR2Z 7 months ago
  
  It really depends - a recent example I had was trying to implement the DPMO paper (a signed distance field to mesh algorithm), and one of the steps is "compute the plane of best fit with these points and project this other point onto it." Not a particularly long piece of code, but long enough that my local DeepSeek model was able to meaningfully save time for me.
  http://www.sccg.sk/~chalmo/GM/SM02ob.pdf

notepad0x90 7 months ago

My fear is that LLM generated code will look great to me, I won't understand it fully but it will work. But since I didn't author it, I wouldn't be great at finding bugs in it or logical flaws. Especially if you consider coding as piecing together things instead of implementing a well designed plan. Lots of pieces making up the whole picture but a lot of those pieces are now put there by an algorithm making educated guesses.

Perhaps I'm just not that great of a coder, but I do have lots of code where if someone took a look it, it might look crazy but it really is the best solution I could find. I'm concerned LLMs won't do that, they won't take risks a human would or understand the implications of a block of code beyond its application in that specific context.

Other times, I feel like I'm pretty good at figuring out things and struggling in a time-efficient manner before arriving at a solution. LLM generated code is neat but I still have to spend similar amounts of time, except now I'm doing more QA and clean up work instead of debugging and figuring out new solutions, which isn't fun at all.

noisy_boy 7 months ago

I do these things for this:
- keep the outline in my head: I don't give up the architect's seat. I decide which module does what and how it fits in the whole system, it's contract with other modules etc.
- review the code: this can be construed as negating the point of LLMs as this is time consuming but I think it is important to go through line by line and understand every line. You will absorb some of the LLM generated code in the process which will form an imperfect map in your head. That's essential for beginning troubleshooting next time things go wrong.
- last mile connectivity: several times the LLM takes you there but can't complete the last mile connectivity; instead of wasting time chasing it, do the final wiring yourself. This is a great shortcut to achieve the previous point.
- FiberBundle 7 months ago
  
  In my experience you just don't keep as good a map of the codebase in your head when you have LLMs write a large part of your codebase as when you write everything yourself. Having a really good map of the codebase in your head is what brings you large productivity boosts when maintaining the code. So while LLMs do give me a 20-30% productivity boost for the initial implementation, they bring huge disadvantages after that, and that's why I still mostly write code myself and use LLMs only as a stackoverflow alternative.
  
  simonw 7 months ago
  
  I have enough projects that I'm responsible for now (over 200 packages on PyPI, over 800 GitHub repositories) that I gave up on keeping a map of my codebases in my head a long time ago - occasionally I'll stumble across projects I released that I don't even remember existing!
  My solution for this is documentation, automated tests and sticking to the same conventions and libraries (like using Click for command line argument parsing) across as many projects as possible. It's essential that I can revisit a project and ramp up my mental model of how it works as quickly as possible.
  I talked a bit more about this approach here: https://simonwillison.net/2022/Nov/26/productivity/
  
  FiberBundle 7 months ago
  
  You're an extreme outlier. Most programmers work with 1-3 codebases probably. Obviously you can't keep 800 codebases in your head, and you have to settle for your approach out of necessity. I find it hard to believe you get anywhere close to the productivity benefits of having a good mental map of a codebase with just good documentation and extensive test coverage. I don't have any data on this, but from experience I'd say that people who really know a codebase can be 10-50x as fast at fixing bugs than those with only a mediocre familiarity.
  
  MrMcCall 7 months ago
  
  The evolution of a codebase is an essential missing piece of our development processes. Barring detailed design docs that no one has time to write and then update, understanding that evolution is the key to understanding the design intent (the "why") of the codebase. Without that why, there will be no consistency, and less chance of success.
  "Short cuts make long delays." --Tolkien
- happymellon 7 months ago
  
  > This is a great shortcut to achieve the previous point.
  How does doing the hard part provide a shortcut for reviewing all the LLVM code?
  If anything it's a long cut, because now you have to understand the code and write it yourself. This isn't great, it's terrible.
  
  noisy_boy 7 months ago
  
  Sure whatever works for you; my approach works for me
  
  happymellon 7 months ago
  
  But you don't explain how doing the hard part shortcuts needing to understand the LLVM code.
- zahlman 7 months ago
  
  The way you've written this comes across like the AI is influencing your writing style....
  
  noisy_boy 7 months ago
  
  thatistrue I us ed to write lik this b4 ai it has change my life
  
  matthberg 7 months ago
  
  As someone pretty firmly in the anti-AI camp, I'm genuinely glad that you've personally found AI a useful tool to polish text and help you communicate.
  I think that just because someone might be more or less eloquent than someone else, the value of their thoughts and contributions shouldn't be weighed any differently. In a way, AI formatting and grammar assistance could be a step towards a more equitable future, one where ideas are judged on inherent merits rather than superficial junk like spel;ng or idk typos n shi.t
  However, I think what the parent commenter (and I) might be saying is that it seems you're relying on AI for more than just help expressing yourself—it seems you're relying on it to do the thinking too. I'd urge you to consider if that's what you really want from a tool you use. That said, I'm just some random preachy-judgy stranger on the internet, you don't owe me shit, lol
  (Side notes I couldn't help but include: I think talking about AI and language is way more complicated (and fascinating) than just that aspect, including things I'm absolutely unqualified to comment on—discrimination against AAVE use, classism, and racism can't and shouldn't be addressed by a magic-wand spell-checker that "fixes" everyone's speech to be "correct" (as if a sole cultural hegemony or way of speech is somehow better than any other))
  
  noisy_boy 7 months ago
  
  > As someone pretty firmly in the anti-AI camp, I'm genuinely glad that you've personally found AI a useful tool to polish text and help you communicate.
  > I think that just because someone might be more or less eloquent than someone else, the value of their thoughts and contributions shouldn't be weighed any differently. In a way, AI formatting and grammar assistance could be a step towards a more equitable future, one where ideas are judged on inherent merits rather than superficial junk like spel;ng or idk typos n shi.t
  I guess I must come clean that my reply was sarcasm which obviously fell flat and caused you to come to the defense of those who can't spell - I swear I don't have anything against them.
  > However, I think what the parent commenter (and I) might be saying is that it seems you're relying on AI for more than just help expressing yourself—it seems you're relying on it to do the thinking too. I'd urge you to consider if that's what you really want from a tool you use. That said, I'm just some random preachy-judgy stranger on the internet, you don't owe me shit, lol
  You and presumably the parent commenter have missed the main point of the retort - you are assuming I am relying on AI for my content or its style. It is neither - I like writing point-wise in a systematic manner, always have, always will - AI or no-AI be damned. It is the all-knowing veil-piercing eagle-eyed deduction of random preachy-judgy strangers on the internet about something being AI-generated/aided just because it follows structure, that is annoying.
  
  danielmarkbruce 7 months ago
  
  It's funny that some folks seem to assume AI writing style just arrive out of thin air....
  
  noisy_boy 7 months ago
  
  Maybe LLM generated text was their first-contact with structured and systematic writing.
  
  plxxyzs 7 months ago
  
  Three bullet points, each with three sentences (ok last one has a semicolon instead) is a dead giveaway
  
  Jensson 7 months ago
  
  Lots of people wrote like that before AI, AI writes like people its made to copy how people write. It wouldn't write like that if people didn't.
  
  johnisgood 7 months ago
  
  Yes, I prefer using lists myself, too, does not mean my writing is being influenced by AI. I have always liked bullet points long before AI was even a thing, it is for better organization and visual clarity.
  
  KronisLV 7 months ago
  
  I feel like “looks like it’s written by AI” might become a critique of writing that’s very template-like, neutral, corporate. I don’t usually dislike it though, as long as the information is there.
  
  noisy_boy 7 months ago
  
  Three bullet points AND three sentences?!! Get outta here...
intended 7 months ago

I think this is a great line: > My fear is that LLM generated code will look great to me, I won't understand it fully but it will work
This is a degree of humility that made the scenario we are in much clearer.
Our information environment got polluted by the lack of such humility. Rhetoric that sounded ‘right’ is used everywhere. If it looks like an Oxford Don, sounds like an Oxford Don, then it must be an academic. Thus it is believable, even if they are saying the Titanic isn’t sinking.
Verification is the heart of everything humanity does, our governance structures, our judicial systems, economic systems, academia, news, media - everything.
It’s a massive computation effort to figure out what the best ways to allocate resources given current information, allowing humans to create surplus and survive.
This is why we dislike monopolies, or manipulations of these markets - they create bad goods, and screw up our ability to verify what is real.
unclebucknasty 7 months ago

All of this. Could have saved me a comment [0] if I'd seen this earlier.
When people talk about 30% or 50% coding productivity gains with LLMs, I really want to know exactly what they're measuring.
[0] https://news.ycombinator.com/item?id=43236792
fuzztester 7 months ago

>My fear is that LLM generated code will look great to me, I won't understand it fully but it will work.
puzzled. if you don't understand it fully, how can you say that it will look great to you, and that it will work?
- raincole 7 months ago
  
  It happens all the time. Way before LLM. There were countless times I implemented an algorithm from a paper or a book while not fully understanding it (in other words, I can't prove the correctness or time complexity without referencing the original paper).
  
  fuzztester 7 months ago
  
  imo, your last phrase, excerpted below:
  >(in other words, I can't prove the correctness ... without referencing the original paper).
  agrees with what I said in my previous comment:
  >if you don't understand it fully, how can you say .... that it will work?
  (irrelevant parts from our original comments above, replaced with ... , without loss of meaning to my argument.)
  both those quoted fragments, yours and mine, mean basically the same thing, i.e. that both you and the GP don't know whether it will work.
  it's not that one cannot use some piece of code without knowing whether it works; everybody does that all the time, from algorithm books for example, as you said.
- Nevermark 7 months ago
  
  > if you don't understand it fully, how can you say that it will look great to you, and that it will work?
  Presumably, that simply reflects that a primary developer always has an advantage of having a more reliable understanding of a large code base - and the insights into the problem that come about during development challenges - than a reviewer of such code.
  A lot of important bug subtle insights, many sub-verbal, into a problem come from going through the large and small challenges of creating something that solves it. Reviewers just don't get those insights as reliably.
  Reviewers can't see all the subtle or non-obvious alternate paths or choices. They are less likely to independently identify subtle traps.
- rsynnott 7 months ago
  
  I mean, depends what you mean by ‘work’. For instance, something which produces the correct output, and leaks memory, is that working? Something which produces the correct output, but takes a thousand times longer than it should; is that working? Something which produces output which looks superficially correct and passes basic tests, is that working?
  ‘Works for me’ isn’t actually _that_ useful a signal without serious qualification.
  
  fuzztester 7 months ago
  
  >‘Works for me’ isn’t actually _that_ useful a signal without serious qualification.
  yes, and it sounds a bit like "works on my machine", a common cop-out which I am sure many of us have heard of.
  google: works on my machine meme
  
  fuzztester 7 months ago
  
  exactly.
  what you said just strengthens my argument.
- upcoming-sesame 7 months ago
  
  You write a decent amount of tests
  
  fuzztester 7 months ago
  
  Famous quote, read many years ago:
  testing can prove the presence of errors, but not their absence.
  https://www.google.com/search?q=quote+testing+can+prove+the+...
  - said by Steve McConnell (author of Code Complete), Edsger Dijkstra, etc. ...
ajmurmann 7 months ago

To fight this I mostly do ping-pong pairing with llms. After e discuss the general goal and approach I usually write the first test. The llm the makes it pass and writes the next test which I'll make pass and so on. It forces me to stay 100% in the loop and understand everything. Maybe it's not as fast as having the llm write as much as possible but I think it's a worthwhile tradeoff.
hakaneskici 7 months ago

When it comes to relying on code that you didn't write yourself, like an npm package, do you care if it's AI code or human code? Do you think your trust toward AI code may change over time?
- sfink 7 months ago
  
  Of course I care. Human-written code was written for a purpose, with a set of constraints in mind, and other related code will have been written for the same or a complementary purpose and set of constraints. There is intention in the code. It is predictable in a certain way, and divergences from the expected are either because I don't fully understand something about the context or requirements, or because there's a damn good reason. It is worthwhile to dig further until I do understand, since it will very probably have repercussions elsewhere and elsewhen.
  For AI code, that's a waste of time. The generated code will be based on an arbitrary patchwork of purposes and constraints, glued together well enough to function. I'm not saying it lacks purpose or constraints, it's just that those are inherited from random sources. The parts flow together with robotic but not human concern for consistency. It may incorporate brilliant solutions, but trying to infer intent or style or design philosophy is about as useful as doing handwriting analysis on a ransom note made from pasted-together newspaper clippings.
  Both sorts of code have value. AI code may be well-commented. It may use features effectively that a human might have missed. Just don't try to anthropomorphize an AI coder or a lawnmower, you'll end up inventing an intent that doesn't exist.
  
  gunian 7 months ago
  
  what if you
  - generate - lint - format - fuzz - test - update
  infintely?
  
  sfink 7 months ago
  
  Then you'll get code that passes the tests you generate, where "tests" includes whatever you feed the fuzzer to detect problems. (Just crashes? Timeouts? Comparison with a gold standard?)
  Sorry, I'm failing to see your point.
  Are you implying that the above is good enough, for a useful definition of good enough? I'm not disagreeing, and in fact that was my starting assumption in the message you're replying to.
  Crap code can pass tests. Slow code can pass tests. Weird code can pass tests. Sometimes it's fine for code to be crap, slow, and/or weird. If that's your situation, then go ahead and use the code.
  To expand on why someone might not want such code, think of your overall codebase as having a time budget, a complexity budget, a debuggability budget, an incoherence budget, and a maintenance budget. Yes, those overlap a bunch. A pile of AI-written code has a higher chance of exceeding some of those budgets than a human-written codebase would. Yes, there will be counterexamples. But humans will at least attempt to optimize for such things. AIs mostly won't. The AI-and-AI-using-human system will optimize for making it through your lint-fuzz-test cycle successfully and little else.
  Different constraints, different outputs. Only you can decide whether the difference matters to you.
  
  pixelfarmer 7 months ago
  
  > Then you'll get code that passes the tests you generate
  Just recently I think here on HN there was a discussion about how neural networks optimize towards the goal they are given, which in this case means exactly what you wrote, including that the code will do stuff in wrong ways just to pass the given tests.
  Where do the tests come from? Initially from a specification of what "that thing" is supposed to do and also not supposed to do. Everyone who had to deal with specifications in a serious way knows how insanely difficult it is to get these right, because there are often things unsaid, there are corner cases not covered and so on. So the problem of correctness is just shifted, and the assumption that this may require less time than actually coding ... I wouldn't bet on it.
  Conceptually the idea should work, though.
  
  gunian 7 months ago
  
  what if you thought of your codebase as something similar to human DNA and the LLM as nature and the entire process as some sort of evolutionary process? the fitness function would be no panics exceptions and latency instead of some random KPI or OKR pr who likes working with who or who made who laugh
  it's what our lord and savior jesus christ uses for us humans if it is good for him its good enough for me. and surely google is not laying off 25k people because it believes humans are better than their LLMs :)
  
  intended 7 months ago
  
  Who has that much time and money when your boss is breathing down your neck?
  
  whymeogod 7 months ago
  
  [flagged]
- PessimalDecimal 7 months ago
  
  Publicly available code with lots of prior usage seems less likely to be buggy than LLM-generated code produced on-demand and for use only by me.
madeofpalk 7 months ago

Do you not review code from your peers? Do you not search online and try to grok code from StackOverflow or documentation examples?
All of these can vary wildly in quality. Maybe its because I mostly use coding LLMs as either a research tool, or to write reasonably small and easy to follow chunks of code, but I find it no different than all of the other types of reading and understanding other people's code I already have to do.
eru 7 months ago

> But since I didn't author it, I wouldn't be great at finding bugs in it or logical flaws.
Alas, I don't share your optimism about code I wrote myself. In fact, it's often harder to find flaws in my own code, then when reading someone else's code.
Especially if 'this is too complicated for me to review, please simplify' is allowed as a valid outcome of my review.
otabdeveloper4 7 months ago

> ...but it will work
You don't know that though. There's no "it must work" criteria in the LLM training.
tokioyoyo 7 months ago

The big argument against it is, at some point, there’s a chance, that you won’t really need to understand what the code does. LLMs writes code, LLMs write tests, you find bugs, LLM fixes code, LLM adds test cases for the found bug. Rinse and repeat.
- SamPatt 7 months ago
  
  For fairly simple projects built from scratch, we're already there.
  Claude Code has been doing all of this for me on my latest project. It's remarkable.
  It seems inevitable it'll get there for larger and more complex code bases, but who knows how far away that is.
- saagarjha 7 months ago
  
  What do you do when the LLM doesn't fix the code?
  
  amarcheschi 7 months ago
  
  You tell it there's an error, and to fix the code (/s)
JimDabell 7 months ago

> My fear is that LLM generated code will look great to me, I won't understand it fully but it will work.
If you don’t understand it, ask the LLM to explain it. If you fail to get an explanation that clarifies things, write the code yourself. Don’t blindly accept code you don’t understand.
This is part of what the author was getting at when they said that it’s surfacing existing problems not introducing new ones. Have you been approving PRs from human developers without understanding them? You shouldn’t be doing that. If an LLM subsequently comes along and you accept its code without understanding it too, that’s not a new problem the LLM introduced.
- np- 7 months ago
  
  Code reviews with a human are a two way street. When I find code that is ambiguous I can ask the developer to clarify and either explain their justification or ask them to fix it before the code is approved. I don’t have to write it myself, and if the developer is simply talking in circles then I’d be able to escalate or reject—and this is a far less likely failure case to happen with a real trusted human than an LLM. “Write the code yourself” at that point is not viable for any non-trivial team project, as people have their own contexts to maintain and commitments/projects to deliver. It’s not the typing of the code that is the hard part which is the only real benefit of LLMs that they can type super fast, it’s fully understanding the problem space. Working with another trusted human is far far different from working with an LLM.
- sarchertech 7 months ago
  
  No one takes the time to fully understand all the PRs they approve. And even when you do take the time to “fully understand” the code, it’s very easy for your brain to trick you into believing you understand it.
  At least when a human wrote it, someone understood the reasoning.
  
  sgarland 7 months ago
  
  > No one takes the time to fully understand all the PRs they approve.
  I was appalled when I was being effusively thanked for catching some bugs in PRs. “No one really reads these,” is what I was told. Then why the hell do we have a required review?!
  
  sarchertech 7 months ago
  
  Cargo culting.
kadushka 7 months ago

I wouldn't be great at finding bugs in it or logical flaws
This is what tests are for.
- nradov 7 months ago
  
  You can't test quality into a product.
- notepad0x90 7 months ago
  
  The tests are probably LLM generated as well lol
sunami-ai 7 months ago

Worst part is that the patterns of implementation won't be consistent across the pieces. So debug a whole codebase that was authored with LLM generated code is like having to debug a codebase where ever function was written by a different developer and no one followed any standards. I guess you can specify the coding standards in the prompt and ask it to use FP-style programming only, but I'm not sure how well it can follow.
- QuiDortDine 7 months ago
  
  Not well, at least for ChatGPT. It can't follow my custom instructions which can be summed up as "follow PEP-8 and don't leave trailing whitespace".
  
  jampekka 7 months ago
  
  In don't think they meant formatting details.
  
  6r17 7 months ago
  
  Formatting is like a dot on the i; there is 200 other small details that are just completely off putting to me : - naming conventions (ias are lazy and tent to use generic names with no meaning) such as "Glass" instead of "GlassProduct" ; - error management convention
  But the most troublesome to me is that it is just "pissing" out code and has no after-tough about the problem it is solving or the person it is talking to.
  The number of times I have to repeat myself just to get a stubborn answer with no discussion is alarming. It does not benefit my well-being and is annoying to work with except for a bunch of exploratory cases.
  I believe LLM are actually the biggest data heist organized. We believe that those models will get better at solving their jobs but the reality is that we are just giving away code, knowledge, ideas at scale, correcting the model for free, and paying to be allowed to do so. And when we watch the 37% minimum hallucination rate, we can more easily understand that the actual tough comes from the human using it.
  I'm not comfortable having to argue with a machine and have to explain to it what I'm doing, how, and why - just to get it to spam me with things I have to correct afterwards anyway.
  The worst is, all that data is the best insight on everything. How many people ask for X ? How much time did they spend trying to do X ? What were they trying to achieve ? Who are their customers ? etc...
  
  johnisgood 7 months ago
  
  It is supposed to follow that instruction though. When it generates code, I can tell is to use tabs, 2 spaces, etc. and the generated code will use that. It works well with Claude, at least.

layer8 7 months ago

> Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing. No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!

I would have stated this a bit differently: No amount of running or testing can prove the code correct. You actually have to reason through it. Running/testing is merely a sanity/spot check of your reasoning.

johnrob 7 months ago

I’m not sure it’s possible to have the full reasoning in your head without authoring the code yourself - or, spending a comparable amount of effort to mentally rewrite it.
- layer8 7 months ago
  
  I tend to agree, which is why I’m skeptical about large-scale LLM code generation, until AIs exhibit reliable diligence and more general attention and awareness, and probably also long-term memory about a code base and its application domain.
- theshrike79 7 months ago
  
  Spoken by someone who hasn't had to maintain Somene Else's Code on a budget.
  You can't just rewrite everything to match your style. You take what's in there and adapt to the style, your personal preference doesn't matter.
  
  np- 7 months ago
  
  Someone Else’s Code was understood by at least one human at some point in time before it was committed. That means that another equally skilled human is likely to be able to get the gist of it, if not understand it perfectly.
  
  horsawlarway 7 months ago
  
  It's a giant misdirection to assume the complaint is "style".
  Writing is a very solid choice as an approach to understanding a novel problem. There's a quip in academia - "The best way to know if you understand something is to try teaching it to someone else". This happens to hold true for teaching it to the compiler with code you've written.
  You can't skip details or gloss over things, and you have to hold "all the parts" of the problem together in your head. It builds a very strong intuitive understanding.
  Once you have an intuitive understanding of the problem, it's very easy to drop into several different implementations of the solution (regardless of the style) and reason about them.
  On the other hand, if you don't understand the problem, it's nearly impossible to have a good feel for why any given solution does what it does, or where it might be getting things wrong.
  ---
  The problem with using an AI to generate the code for you is that unless you're already familiar with the problem you risk being completely out of your depth "code reviewing" the output.
  The difficulty in the review isn't just literally reading the lines of code - it's in understanding the problem well enough to make a judgement call about them.
  
  layer8 7 months ago
  
  They said “mentally rewrite”, not actually rewrite.
- skydhash 7 months ago
  
  Which is why everyone is so keen on standards (Convention, formatting, architecture,...), because it is less a burden when you're just comparing expected to actual, than learning unknowns.
- tuyiown 7 months ago
  
  > spending a comparable amount of effort to mentally rewrite it.
  I'm pretty sure mentally rewrite it requires _more_ effort than writing it in the first place. (maybe less time though)
Snuggly73 7 months ago

Agree - case in point - dealing with race conditions. You have to reason thru the code.
- wfn 7 months ago
  
  > case in point - dealing with race conditions.
  100%. Case in point for case in point - I was just scratching my head over some Claude-produced lines for me, thinking if I should ask what this kind entity had in mind when using specific compiler builtins (vs. <stdatomic.h>), like, "is there logic to your madness..." :D
  size_t unique_ips = __atomic_load_n(&((ip_database_t*)arg)->unique_ip_count, __ATOMIC_SEQ_CST);
  I think it just likes compiler builtins because I mentioned GCC at some point...
nnnnico 7 months ago

not sure that human reasoning actually beats testing when checking for correctness
- ljm 7 months ago
  
  The production of such tests presumably requires an element of human reasoning.
  The requirements have to come from somewhere, after all.
  
  MrMcCall 7 months ago
  
  I would argue that designing and implementing a working project requires human reasoning, too, but that line of thinking seems to be falling out of fashion in favor of "best next token" guessing engines.
  I know what Spock would say about this approach, and I'm with him.
- Gupie 7 months ago
  
  "Beware of bugs in the above code; I have only proved it correct, not tried it."
  Donald E. Knuth
- fragmede 7 months ago
  
  Human reason is fine, the problem is that human attention spans aren't great at checking for correctness. I want every corner case regression tested automatically because there's always going to be some weird configuration that a human's going to forget to regression test.
  
  sarchertech 7 months ago
  
  With any non trivial system you can’t actually test every corner case. You depend on human reason to identify the ones most likely to cause problems.
- layer8 7 months ago
  
  Both are necessary, they complement each other.
dmos62 7 months ago

Well, what if you run a complete test suite?
- layer8 7 months ago
  
  There is no complete test suite, unless your code is purely functional and has a small-ish finite input domain.
  
  suzzer99 7 months ago
  
  And even then, your code could pass all tests but be a spaghetti mess that will be impossible to maintain and add features to.
  
  MattSayar 7 months ago
  
  Seems to be a bit of a catch 22. No LLM can write perfect code, and no test suite can catch all bugs. Obviously, no human can write perfect code either.
  If LLM-generated code has been "reasoned-through," tested, and it does the job, I think that's a net-benefit compared to human-only generated code.
  
  unclebucknasty 7 months ago
  
  >I think that's a net-benefit compared to human-only generated code.
  Net-benefit in what terms though? More productive WRT raw code output? Lower error rate?
  Because, something about the idea of generating tons of code via LLMs, which humans have to then verify, seems less productive to me and more error-prone.
  I mean, when verifying code that you didn't write, you generally have to fully reason through it, just as you would to write it (if you really want to verify it). But, reasoning through someone else's code requires an extra step to latch on to the author's line of reasoning.
  OTOH, if you just breeze through it because it looks correct, you're likely to miss errors.
  The latter reminds me of the whole "Full self-driving, but keep your hands on the steering wheel, just in case" setup. It's going to lull you into overconfidence and passivity.
  
  rapind 7 months ago
  
  > "Full self-driving, but keep your hands on the steering wheel, just in case" setup
  This is actually a trick though. No one working on self driving actually expects people to actually babysit it for long at all. Babysitting actually feels worse than driving. I just saw a video on self-driving trucks and how the human driver had his hands hovering on the wheel. The goal of the video is to make you think about how amazing self-driving rigs will be, but all I could think about was what an absolutely horrible job it will be to babysit these things.
  Working full-time on AI code reviews sounds even worse. Maybe if it's more of a conversation and you're collaboratively iterating on small chunks of code then it wouldn't be so bad. In reality though, we'll just end up trusting the AI because it'll save us a ton of money and we'll find a way to externalize the screw ups.
  
  jmb99 7 months ago
  
  > reasoning through someone else's code requires an extra step to latch on to the author's line of reasoning.
  And, in my experience, it’s a lot easier to latch on to a real person’s real line of reasoning rather than a chatbot’s “line of reasoning”
  
  unclebucknasty 7 months ago
  
  Exactly. And, if correction is required, then you either re-write it or you're stuck maintaining whatever odd way the LLM approached the problem, whether it's as optimal (or readable) as a human's or not.
  
  Ekaros 7 months ago
  
  Also after reasonable period if you are stuck you can actually ask them what were they thinking and why was it written that way and what are the constrains they thought of.
  And you can discuss these, with both of you hopefully having experience in the domain.
- shakna 7 months ago
  
  If the complete test suite were enough, then SQLite, who famously has one of the largest and most comprehensive, would not encounter bugs. However, they still do.
  If you employ AI, you're adding a remarkable amount of speed, to a processing domain that is undecidable because most inputs are not finite. Eventually, you will end up reconsidering the Gambler's Fallacy, because of the chances of things going wrong.
- e12e 7 months ago
  
  You mean, for example test that your sieve finds all primes, and only primes that fit in 4096 bits?
- bandrami 7 months ago
  
  Paging Dr. Turing. Dr. Turing, please report to the HN comment section.
  
  dmos62 7 months ago
  
  Gave me a chuckle!

atomic128 7 months ago

Last week, The Primeagen and Casey Muratori carefully review the output of a state-of-the-art LLM code generator.

They provide a task well-represented in the LLM's training data, so development should be easy. The task is presented as a cumulative series of modifications to a codebase:

https://www.youtube.com/watch?v=NW6PhVdq9R8

This is the actual reality of LLM code generators in practice: iterative development converging on useless code, with the LLM increasingly unable to make progress.

mercer 7 months ago

In my own experience, I have all sorts of ways that I try to 'drag' the llm out of some line of 'thinking' by editing the conversation as a whole, or just restarting the whole prompt, and I've been kind of just doing this over time since GPT3.
While I still think all this code generation is super cool, I've found that the 'density' of the code makes it even more noticeable - and often annoying - to see the model latch on, say, some part of the conversation that should essentially be pruned from the whole thinking process, or pursue some part of earlier code that makes no sense to me, and then 'coaxing' it again.

bigstrat2003 7 months ago

> Hallucinated methods are such a tiny roadblock that when people complain about them I assume they’ve spent minimal time learning how to effectively use these systems—they dropped them at the first hurdle.

This seems like a very flawed assumption to me. My take is that people look at hallucinations and say "wow, if it can't even get the easiest things consistently right, no way am I going to trust it with harder things".

JusticeJuice 7 months ago

You'd be surprised. I know a few people who couldn't really code before LLMs, but now with LLMs they can just brute-force through problems. They seem pretty undetered about 'trusting' the solution, if they ran it and it worked for them, it gets shipped.
- tcoff91 7 months ago
  
  Well I hope this isn’t backend code because the amount of vulnerabilities that are going to come from these practices will be staggering
  
  namaria 7 months ago
  
  The backlash will be enormous. In the near future, there will be less competent coders and a tsunami of bad code to fix. If 2020 was annoying to hiring managers they have no idea how bad it will become.
  
  Eisenstein 7 months ago
  
  Of course this will be the case, but probably not for the reasons you are concerned about. It is because a lot of people have been enabled by these tools to realize they are able to do things they thought were beyond them.
  The opaque wall that separates the solution from the problem in technology often comes from the very steep initial learning curve. The reason most people who are developers now learned to code is because they had free time when they were young, had access to the technology, and were motivated to do it.
  But as an adult, very few people are able to get past the first obstacles which keep them from eventually becoming proficient, but now they have a cheat code. So you will see a lot more capable programmers in the future who will be able to help you fix this backlog of bad code -- we just have to wait for them to gain the experience and knowledge needed before that happens and deal with the mistakes along the way.
  This is no different from any other enabling technology. The people who feel like they had to struggle through it and pay their dues when it 'wasn't easy' are going to be resentful and try and gatekeep; it is only human nature.
  
  MrMcCall 7 months ago
  
  > This is no different from any other enabling technology.
  Coding is unique. One can't replace considered, forward-thinking data flow design reasoning with fancy guesswork and triage.
  Should anyone build a complex brick wall by just iterating over the possible solutions? Hell no. That's what expertise is for, and that is only attained via hard graft, and predicting the next word is not going to be a viable substitute.
  It's all a circle jerk of people hoping for a magic box.
  
  Eisenstein 7 months ago
  
  When did you learn to code? What access did you have to technology when you started? How much free time did you have? What kind of education did you have?
  Are you really unique because you are one of only a few special people who can code because of some innate ability? Or is it that you have above average intelligence, have a rather uncommon but certainly not rare ability to think a certain way, and had an opportunity and interest which honed those talents to do something most people can't?
  How would you feel if you never had access to a computer with a dev environment until you were an adult, and then someone told you not to bother learning how to code because you aren't special like they are?
  The 'magic box' is a way to get past the wall that requires people to spend 3 hours trying to figure out what python environments are before they can even write a program that does anything useful.
  
  skydhash 7 months ago
  
  I read a lot of books. While I had some fundamental in high school, I really started in college and the tricks was to read books for the theorical parts, read articles for advices and specific walkthroughs, read code for examples of implementation, and then solve problems to intenalize all of that reading.
  But it all compounds. Going from reading to doing takes little time and I’m able to use much denser information repositories.
  If you have to spend three hours reading about python environments, that’s just a signal that your foundation is lacking (you don’t know how your tools work). Using LLM is flying blind and hoping you will land instead of crashing.
  
  MrMcCall 7 months ago
  
  Well said.
  One quibble, however, is that python environments are a mess (as is any 3rd party software use in any environment, in my limited experience), and I refuse to use any such thing, when at all possible. If I couldn't directly integrate that code into my codebase, I won't waste my time, because every dependency is another point of failure, either the author's or (more likely) that I might muck up my use of it. Then, there are issues such as versioning, security, and even the entire 3rd party management software itself. It does not look like it will actually save me any time, and might end up being a huge drag on my progression.
  That said, using an LLM for ANYTHING is super risky IMO. Like you said, a person should read about what the think they want to utilize, and then incrementally build up the needed skills and experience by using it.
  There are many paths in life that have ZERO shortcuts, but there are also many folks who refuse to acknowledge that difficult work is sometimes absolutely unavoidable.
  
  MrMcCall 7 months ago
  
  You must have replied to the wrong post, because I never said anything about myself being "unique" or otherwise differently talented than others, though that is possible. I don't measure myself against others; if they have a better insight into something I will gladly learn from them, without ego.
  I'm talking about the fact that programming is a unique human endeavor, and a damned difficult one at that.
  > How would you feel if you never had access to a computer with a dev environment until you were an adult, and then someone told you not to bother learning how to code because you aren't special like they are?
  I would never say some stupid shit like that, to anyone, ever. If they want to do it, I would encourage them and give them basic advice to help them on their way. And I IN NO WAY believe that I am more talented at programming than ANYONE else on Earth. The experience I have earned from raw, hard graft across various programming environments and projects is my only advantage in a conversion about software development. But I firmly believe that a basic linux install and python, C, and bash will be enough to allow anyone to reach a level of basic professional proficiency.
  You are WAY out of pocket here, my friend, or perhaps you just don't understand English very well.
  > When did you learn to code? What access did you have to technology when you started? How much free time did you have? What kind of education did you have?
  Getting to learn BASIC on an Apple (2e?) in 6th grade was fantastic for me; it was love at first goto. But having a C64 in 9th Grade was pivotal to the development of my fundamental skills and mindset, and I was very lucky to be in a nice house with the time to write programs for fun, and an 11th grade AP CS course with a very good teacher and TRS80s. But we were very much lower middle class, which factored into my choice of college and how well I did there. But, absolutely, I am a very, very lucky human being, yet tenacity via passion is the key to my success, and is not beyond ANYONE else.
  > The 'magic box' is a way to get past the wall that requires people to spend 3 hours trying to figure out what python environments are before they can even write a program that does anything useful.
  If you say so, but no one should be learning to program in a specific python env or doing anything "useful" except for personal exploration or rudimentary classwork.
  Educating ourselves about how to logically program -- types, vars, fcts, files -- is our first "useful" programming any of us will be able to do for some years, which is no different than how an auto mechanic will ramp up to professional levels of proficiency, from changing oil to beyond.
  With the internet in 2025, however, I'm sure people can learn more quickly, but if and only if they have the drive to do so.
  
  naasking 7 months ago
  
  > The backlash will be enormous. In the near future, there will be less competent coders and a tsunami of bad code to fix
  These code AIs are just going to get better and better. Fixing this "tsunami of bad code" will consist of just passing it through the better AIs that will easily just fix most of the problems. I can't help but feel like this will be mostly a non-problem in the end.
  
  dns_snek 7 months ago
  
  > Fixing this "tsunami of bad code" will consist of just passing it through the better AIs that will easily just fix most of the problems.
  At this point in time there's no obvious path to that reality, it's just unfounded optimism and I don't think it's particularly healthy. What happens 5, 10, or 20 years down the line when this magical solution doesn't arrive?
  
  naasking 7 months ago
  
  I don't know where you're getting your data that there's no obvious path, or that it's unfounded optimism. When the chatbots first came out they were unusable for code, now they're borderline good for many tasks and excellent at others, and it's only been a couple of years. Every tool has its limitations at any given time, and I think your pessimism is entirely speculative.
  
  dns_snek 7 months ago
  
  What we have now are LLMs that some consider to be good at tasks that are incremental, limited in scope, and require constant human oversight with many iterations.
  What you want is an LLM that is exceptionally good at completely rewriting a poorly written codebase spanning tens or hundreds of thousands of lines of code, which works reliably with minimal oversight and without introducing hundreds of critical and hard to diagnose bugs.
  Not realizing that these tasks are many orders of magnitudes apart in complexity is where the "unfounded optimism" comment comes from.
  
  krupan 7 months ago
  
  Nobody has to prove a negative, my friend
  
  naasking 7 months ago
  
  Anybody making a claim should be able to justify it or admit it's conjecture.
  
  namaria 7 months ago
  
  Goes both ways. Your extending the line in some particular way from the past couple of years isn't much more than an article of faith.
  
  naasking 7 months ago
  
  It's more than the past couple of years, steady improvements in machine learning stretch back decades at this point. There is no indication this is stopping or slowing down, quite the contrary. We also already know that better is possible because the human brain is still better in many ways, and it exists.
  You can claim that continued progression is speculative, and some aspects are, but it's hardly "an article of faith", unlike "we've suddenly hit a surprising wall we can't surmount".
  
  Izkata 7 months ago
  
  > steady improvements in machine learning stretch back decades at this point
  Except that's not how it's actually gone. It's more like, improvements happen in erratic jumps as new methods are discovered, then improvements slow or stall out when the limits of those methods are reached.
  
  naasking 7 months ago
  
  No, that's just how it looked from the outside if you weren't tracking closely. Even emergent abilities are a mirage when you look at the actual data:
  https://hai.stanford.edu/news/ais-ostensible-emergent-abilit...
  
  Izkata 7 months ago
  
  I'm not talking "past 3 years", I'm talking "past 50 years": https://en.m.wikipedia.org/wiki/AI_winter
  And really, there was a version of what I'm talking about in the shorter timespan with LLMs - OpenAI's GPT models existed for several years before someone got the idea to put it behind a chat interface and the popularity / apparent capability exploded a few years ago.
  
  naasking 7 months ago
  
  > OpenAI's GPT models existed for several years before someone got the idea to put it behind a chat interface and the popularity / apparent capability exploded a few years ago.
  That's exactly what I said in the post you responded to: there weren't erratic jumps, there was steady progress over decades.
  
  Izkata 7 months ago
  
  You keep switching back and forth between short and long time periods, as if the rapid steady growth of the past couple years is how it's gone for decades. This is not the case - we're currently in a short* period of rapid growth after a decade or so of stagnation. That's what "erratic" means, it has not been steady - over the past several decades there have been several times where we've seen rapid growth for a short period, then it hits a wall and we see very little or no growth until the next breakthrough.
  * Granted we don't know for sure it'll be short this time, but hints are that we're starting to hit that wall with improvements slowing down.

t_mann 7 months ago

Hallucinations themselves are not even the greatest risk posed by LLMs. A much greater risk (in simple terms of probability times severity) I'd say is that chat bots can talk humans into harming themselves or others. Both of which have already happened, btw [0,1]. Still not sure if I'd call that the greatest overall risk, but my ideas for what could be even more dangerous I don't even want to share here.

[0] https://www.qut.edu.au/news/realfocus/deaths-linked-to-chatb...

[1] https://www.theguardian.com/uk-news/2023/jul/06/ai-chatbot-e...

tombert 7 months ago

I don't know if the model changed in the last six months, or maybe the wow factor has worn off a bit, but it also feels like ChatGPT has become a lot more "people-pleasy" than it was before.
I'll ask it opinionated questions, and it will just do stuff to reaffirm what I said, even when I give contrary opinions in the same chat.
I personally find it annoying (I don't really get along with human people pleasers either), but I could see someone using it as a tool to justify doing bad stuff, including self-harm; it doesn't really ever push back on what I say.
- taneq 7 months ago
  
  I haven't played with it too much, and maybe it's changed recently or the paid version is different, but last week I found it irritatingly obtuse.
  > Me: Hi! Could you please help me find the problem with some code?
  > ChatGPT: Of course! Show me the code and I'll take a look!
  > Me: [bunch o' code]
  > ChatGPT: OK, it looks like you're trying to [do thing]. What did you want help with?
  > Me: I'm trying to find a problem with this code.
  > ChatGPT: Sure, just show me the code and I'll try to help!
  > Me: I just pasted it.
  > ChatGPT: I can't see it.
  
  MrMcCall 7 months ago
  
  Maybe they taught it that a) it doesn't work, and b) not to tell anyone.
  Lying will goad a person into trying again; the brutally honest truth will stop them like a brick wall.
- unclebucknasty 7 months ago
  
  Yeah, I think it's coded to be super-conciliatory as some sort of apology for its hallucinations, but I find it annoying as well. Part of it is just like all automated prompts that try to be too human. When you know it's not human, it's almost patronizing and just annoying.
  But, it's actually worse, because it's generally apologizing for something completely wrong that it told you just moments before with extreme confidence.
- renewiltord 7 months ago
  
  It's obvious, isn't it? The average Hacker News user, who has converged to the average Internet user, wants exactly that experience. LLMs are pretty good tools but perhaps they shouldn't be made available to others. People like me can use them but others seem to be killed when making contact. I think it's fine to restrict access to the elite. We don't let just anyone fly a fighter jet. Perhaps the average HN user should be protected from LLM interactions.
  
  tombert 7 months ago
  
  Is that really what you got from what I wrote? I wasn't suggesting that we restrict access to anyone, and I wasn't trying to imply that I'm somehow immune to the problems that were highlighted.
  I mentioned that I don't like people-pleasers and I find it a bit obnoxious when ChatGPT does it. I'm sure that there might be other bits of subtle encouragement it gives me that I don't notice, but I can't elaborate on those parts because, you know, I didn't notice them.
  I genuinely do not know how you got "we should restrict access" from my comment or the parent, you just extrapolated to make a pretty stupid joke.
  
  renewiltord 7 months ago
  
  Haha, I'm not claiming you're wanting that. I want that. So I'm saying it. What makes you think I was attempting to restate what you wrote?
  
  tombert 7 months ago
  
  It looked like you were being sarcastic, implying I was trying to suggest that I thought I was better than the average person in regards to handling AI. Particularly this line:
  > People like me can use them but others seem to be killed when making contact.
  If I misread that, fair enough.
  
  renewiltord 7 months ago
  
  Yeah, no, 100% sincere personal view. That guy who killed himself after using it is obviously not ready for this. Imagine killing yourself after typing in `print("Kill yourself")` at the Python REPL. The guy's an imbecile. We don't let just anyone drive a truck. I'm fine with nearly everyone being on the outside and unable to use these tools so long as I'm allowed to with as little trouble as possible.
  I recognize that the view that others should not be permitted things that I should be allowed to use is generally a sarcastically expressed view, but I genuinely think it has merit. Everyone who believes these things are dangerous and everyone to whom this is obviously dangerous, like the aforementioned mentally deficient individual, shouldn't be permitted use.
hexaga 7 months ago

More generally - AI that is good at convincing people is very powerful, and powerful things are dangerous.
I'm increasingly coming around to the notion that AI tooling should have safety features concerned with not directly exposing humans to asymptotically increasing levels of 'convincingness' in generated output. Something like a weaker model used as a buffer.
Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.
Like most safety regulations, it'll take blood for the inking. Exposing mass numbers of people to these models strikes me as wildly negligent if we expect continued improvement along this axis.
- kjs3 7 months ago
  
  Yeah...this. I'm not so concerned that AI is going to put me out of a job or become Skynet. I'm concerned that people are offloading decision making and critical thinking to the AI, accepting it's response at face value and responding to concerns with "the AI said so...must be right". Companies are already maliciously exploiting this (e.g. the AI has denied your medical claim, and we can't tell you how it decided that because our models are trade secrets), but it will soon become de rigueur and people will think you're weird for questioning the wisdom of the AI.
  
  Nevermark 7 months ago
  
  The combination of blind faith in AI, and good faith highly informed understanding and agreement, achieved with help of AI, covers the full spectrum of the problem.
- southernplaces7 7 months ago
  
  >Projecting out to 5-10 years: what happens when LLMs are still producing hallucinatory semi-sense, but merely comprehending it makes the machine temporarily own you? A bit like getting hair caught in an angle grinder, that.
  Seriously? Do you suppose that it will pull this trick off through some sort of hypnotizing magic perhaps? I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.
  The kinds of people who would be convinced by such "dangers" are likely to be mentally unstable or suggestible enough about it to in any case be convinced by any number of human beings anyhow.
  Aside from demonstrating the persistent AI woo that permeats many comments on this site, the logic above reminds me of the harping nonsense around the supposed dangers of video games or certain violent movies "making kinds do bad things", in years past. The prohibitionist nanny tendencies behind such fears are more dangerous than any silly chatbot AI..
  
  aaronbaugher 7 months ago
  
  I've seen people talk about using ChatGPT as a free therapist, so yes, I do think there's a good chance that they could be talked into self-destructive behavior by a chat bot that latched onto something they said and is "trying" to tell them what they want to hear. Maybe not killing themselves, but blowing up good relationships or quitting good jobs, absolutely.
  These are people who have jobs and apartments and are able to post online about their problems in complete sentences. If they're not "of sound mind," we have a lot more mentally unstable people running around than we like to think we do.
  
  southernplaces7 7 months ago
  
  >we have a lot more mentally unstable people running around than we like to think we do.
  So what do you believe should be the case? That AI in any flexible communicative form be limited to a select number of people who can prove they're of sound enough mind to use it unfiltered?
  You see how similar this is to historical nonsense about restricting the loaning or sale of books on certain subjects only to people of a certain supposed caliber or authority? Or banning the production and distribution of movies that were claimed to be capable of corrupting minds into committing harmful and immoral acts. How stupid do these historical restrictions look today in any modern society? That's how stupid this harping about the dangers of AI chatbots will look down the road.
  The limitation of AI because it may or may not cause some people to do irrational things not only smacks of a persistent AI woo on this site, which drastically overstates the power of these stochastic parrot systems, but also seems to forget that we live in a world in which all kinds of information triggers could maybe make someone make stupid choices. These include books, movies, and all kinds of other content produced far more effectively and with greater emotional impact by completely human authors.
  By claiming a need for regulating the supposed information and discourse dangers of AI chat systems, you're not only serving the cynically fear-mongering arguments of major AI companies who would love such a regulatory moat around their overvalued pet projects, you're also tacitly claiming that literature, speech and other forms of written, spoken or digitally produced expression should be restricted unless they stick to the banally harmless, by some very vague definitions of what exactly harmful content even is.
  In sum, fuck that and the entire chain of implicit long-used censorship, moralizing nannyism, potential for speech restriction and legal over-reach that it so bloody obviously entails.
  
  hexaga 7 months ago
  
  If you believe current models exist at the limit of possible persuasiveness, there obviously isn't any cause for concern.
  For various reasons, I don't believe that, which is why my argument is predicated on them improving over time. Obviously current models aren't overly hazardous in the sense I posit - it's a concern for future models that are stronger, or explicitly trained to be more engaging and/or convincing.
  The load bearing element is the answer to: "are models becoming more convincing over time?" not "are they very convincing now?"
  > [..] I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot [..]
  Then you're not engaging with the premise at all, and are attacking a point I haven't made. The tautological assurance that non-convincing AI is not convincing is not relevant to a concern predicated on the eventual existence of highly convincing AI: that sufficiently convincing AI is hazardous due to induced loss of control, and that as capabilities increase the loss of control becomes more difficult to resist.
  
  OkayPhysicist 7 months ago
  
  You're describing a phase change in persuasiveness which we have no evidence for. If humans were capable of being immediately compelled to do something based on reading some text, advertisers would have taken advantage of that a looooong time ago.
  Persuasion is mostly about establishing that doing or believing what you're telling them is in their best interest. If all my friends start telling me a piece of information, belief in that information has a real interest to me, as it would help strengthen social bonds. If I have a consciously weakly held belief in something, then a compelling argument would consist of providing enough evidence for a viewpoint that I could confidently hold that view and not worry I'll appear misinformed when speaking on it.
  Convincing me to do something involves establishing that either I'll face negative consequences for not doing it, or positive rewards for doing it. AI has an extremely difficult time establishing that kind of credibility.
  To argue that an AI could become persuasive to the point of mind control is to assert that one can compell a belief in another without the ability to take real-world action.
  The absolute worst case scenario for a rogue AI is it leveraging people's belief in it to compel actions in others by way of a combination of blackmail, rewards, and threats of compelling others to commit violence on its behalf by a combination of the same.
  We already live in a world with such artificial intelligences: we call them governments and corporations.
  
  hexaga 7 months ago
  
  > You're describing a phase change in persuasiveness which we have no evidence for.
  That's reasonable, and I really do hope this keeps on being the case. However, I would nit that I see this as a continuum rather than a phase change. That is, I think hazard smoothly increases with persuasiveness. I can point to some far off region and say: "oh, that seems quite concerning" but it doesn't start being so there.
  Persuasiveness below the threshold of 'instant mind control' is still a hazard. Hanging out with salesmen on the job is like to loosen your wallet, even if it isn't guaranteed.
  > If humans were capable of being immediately compelled to do something based on reading some text, advertisers would have taken advantage of that a looooong time ago.
  I'd base my counter on the notion that the problem of persuasion is harder when you have less information about whom you're trying to convince.
  To expand on the intuition behind that: advertisement-persuasion is hard in a way that conversational-persuasion is not. Shilling in conversational contexts (word of mouth) is more effective than generic advertisement.
  A message that will convince one specific person is easier to generate than a message that will convince any random 10 people.
  This proceeds to the idea that information about a person-under-persuasion is akin to power over them. Knowing not only what you believe but why you believe it and what else you believe adjacent to it and what you want is a force multiplier in this regard.
  And so we get to AI models, which gather specific information about the mind of each person they interact with. The message is tailored to you and you alone, it is not a wide spectrum net cast to catch the largest possible number. Advertisements are qualitatively different; they do not 'pick your brain' nearly so much as the model does.
  > Convincing me to do something involves establishing that either I'll face negative consequences for not doing it, or positive rewards for doing it. AI has an extremely difficult time establishing that kind of credibility.
  > To argue that an AI could become persuasive to the point of mind control is to assert that one can compell a belief in another without the ability to take real-world action.
  I don't agree with this because I don't agree with the premise that you must use a 'principled' approach to convince someone as you've described. People use heuristics to decide what to believe.
  By dint of the bitter lesson, I think superhuman persuasion will involve stupid tricks of no particular principled basis that take advantage of 'invisible' vulnerabilities in human cognition.
  That is, I don't think those 'reasons to believe the belief' matter. A child will believe the voice of their parents; it doesn't necessarily register that it's in their best interest or it will be bad for them if they don't. Bootstrapping children involves exploiting vulnerabilities in their psyche via implicit trust. Will the AI speak in the voice of my father, as I might hear it in prelingual childhood? Are all such mechanisms gone by adulthood? Is there anything like a generalized follow-the-leader-with-leader-detection pattern?
  How hard is it for gradient descent to fit a solution to the boundaries of such heuristics?
  This is however, getting into the weeds of exact mechanisms which I'm not too concerned with. I believe (but can't prove) that exploits of that nature exist (or that similarly effective means exist), and that they can be found via brute force search. I think the dominant methodology of continuously training chat models on conversational data those same models participate in is among the likeliest of ways to get to that point.
  Ultimately, so long as there's no directed pressure to force people into contact with very convincing model output (see your rogue AI scenario), it doesn't seem that hard to make it safe: limit direct contact and/or require that tooling limits contact by default. Avoid multi-turn refinement and conversational history (amplification of persuasive power via mechanism described above). Treat it like a spinning blade and be it on your own head if you want to break yourself.
  However, as I mentioned in my original comment, it will take blood for the inking. The incentives don't align to guard against this class of hazard from the get-go or even admit it is possible (merely to produce appearances of caring about 'safety' (read: our model won't do scary politically incorrect things!)), so we're going to see what happens when you mindlessly expose millions of people to it.
  
  southernplaces7 7 months ago
  
  You completely misunderstand my argument with your nitpicking on a specific sarcastic description I made about the current communicative state of most AI chat systems.
  In reality, even if they improve to be completely indistinguishable from the sharpest and most persuasive of human minds our society has ever known, i'd still make exactly the same arguments as above. I'd make these for the same reason that I'd argue for how no regulatory body or self-appointed filter of moral arbiters should be able to restrict the specific arguments and formas of expression currently available for persuasive human beings, or people of any kind.
  Just as we shouldn't prohibit literature, film, internet blog posts, opinion pieces in media and any other sources by which people communicate their opinions and information to others under the argument that such opinions might be "harmful" , I wouldn't regulate AI sources of information and chatbots.
  One can make an easy case for regulating and punishing the acts people try to perform based on information they obtain from AI, in terms of the measurable harm these acts would cause to others, but banning a source of information based on a hypothetical, ambiguous danger of its potential for corrupting minds is little different from the idiocy of restricting free expression because it might morally corrupt supposedly fragile minds.
  
  hexaga 7 months ago
  
  If your argument must rest on a caricature of weak persuasiveness attempting to persuade someone of something extremely disadvantageous to show how impossible hazardous persuasion is, there is something wrong. Nevertheless:
  First, you argued the implausibility of strong persuasion. Your rhetoric was effectively "look how silly this whole notion of a machine persuading someone of something is, because how dumb would you need to be for this silly thing to convince you to do this very bad thing?"
  That is then used to fuel an argument that I am merely propagating AI woo, consumed by magical thinking, and clearly am just afraid of something equivalent to violent video games and/or movies. The level of inferential contortion is difficult to wrap my head around.
  Now, you seem to be arguing along an entirely different track: that AI models should have the inalienable right to self expression, for the same reason humans should have that right (I find it deeply ironic that this is the direction you'd choose after accusations of AI woo, but I digress). Or, equivalently, that humans should have the inalienable right to access and use such models.
  This is no longer an argument about the plausibility of AI being persuasive, or that persuasion can be hazardous, but that we should permit it in spite of any of that because freedom of expression is generally a good thing.
  (This is strange to me, because I never argued that the models should be banned or prohibited, merely that tooling should try to avoid direct human-to-model-output contact, as such contact (when model output is sufficiently persuasive) is hazardous. Much like how angle grinders or power tools are generally not banned, but have safety features preventing egregious bodily harms.)
  > In reality, even if they improve to be completely indistinguishable from the sharpest and most persuasive of human minds our society has ever known, I'd still make exactly the same arguments as above.
  While my true concern is systems of higher persuasiveness than humans have ever been exposed to, let's see:
  > I have a hard time imagining [the most persuasive of human minds our society has ever known] convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.
  This is immediately falsified by the myriad examples of exactly this occurring, via a much lower bar than 'most persuasive person ever'. Hmm. Strange wonder that it requires a sarcastic caricature to not immediately seem like a nonsense argument.
  Considering my entire position is simply that exposure to persuasion can be hazardous, I don't see what you're trying to prove now. It's certainly not in opposition to something I've said.
  As it does seem you have shifted perspectives to the moral rather than the mechanistic, and that you've conceded that persuasion carries with it nontrivial hazard (even if we should entertain that hazard for the sake of our freedoms), are we now determining how much risk is acceptable to maintain freedoms? I'm not interested in having that discussion, as I don't purport to restrict said freedoms in any case.
  Going back to the power tool analogy, you are of course free to disable safety precautions on your own personal angle grinder. At work, some sort of regulatory agency (OSHA, etc) will toil to stop your employer from doing so. I, personally, want a future of AI tooling akin to this. If AI are persuasive enough to be hazardous, I don't want to be forced by my employer to directly consume ultra-high-valence generated output. I want such high-valence content to be treated as the light of an arc-welder, something you're required to wear protection to witness or risk intervention by independent agencies that everybody grumbles about but enjoys the fruit of (namely, a distinct lack of exotic skin cancers and blindness in welders).
  My point was originally and remains the bare observation that any of this will cost in blood, and whatever regulations are made will be inked in it.
  I do understand the deeper motivations of your arguments, the desire to avoid (and/or fear of) gleeful overreach by the hands of AI labs who want nothing more than to wholly control all use of such models. That lies orthogonal to my basis of reasoning. It does not adequately contend with the realities of what to do when persuasiveness approaches sufficient levels. Is the truth now something to be avoided because it would serve the agenda of somebody in particular? Should we distort our understanding to not encroach on ideas that will be misappropriated by those with something to gain?
  Ignoring any exposition on whether it is plausible or whether it caps out at human or subhuman or superhuman levels or any of the chaff about freedom of expression or misappropriation by motivated actors: if we do manage to build such a thing as I describe (and the hazard inherent is plainly obvious if the construction is not weakened, but resident still even if weakened), what do we do? How many millions will be exposed to these systems? How can it be made into something that retains utility yet is not a horror beyond reckoning?
  There is a great deal more to say on the subject, I unfortunately don't have the time to explore it in any real depth here.
southernplaces7 7 months ago

In both of your linked examples, the people in question very likely had at least some sort of mental instability working in their minds.
I have a hard time imagining any sort of overly verbose, clause and condition-ridden chatbot convincing anyone of sound mind to seriously harm themselves or do some egregiously stupid/violent thing.
The kinds of people who would be convinced by such "harm dangers" are likely to be mentally unstable or suggestible enough about it to in any case be convinced by any number of human beings, or by books, or movies or any other sort of excuse for a mind that had problems well before seeing X or Y.
By the logic of regulating AI for these supposed dangers, you could argue that literature, movie content, comic books, YouTube videos and that much loved boogeyman in previous years of violent video games should all be banned or regulated for the content they express.
Such notions have a strongly nannyish, prohibitionist streak that's much more dangerous than some algorithm and the bullshit it spews to a few suggestible individuals.
The media of course loves such narratives, because their breathless hysteria and contrived fear-mongering plays right into more eyeballs. Seeing people again take seriously such nonsense after idiocies like the media frenzy around video games in the early 2000s and prior to that, similar media fits about violent movies and even literature, is sort of sad.
We don't need our tools for expression, and sources of information "regulated for harm" because a small minority of others can't get an easy grip on their psychological state.
- skywhopper 7 months ago
  
  Pretty much everyone has “some sort of mental instability working in their minds”.
  
  southernplaces7 7 months ago
  
  Don't be obtuse. There are degrees of mental instability and no, some random person having a touch of it in very specific ways isn't the same as someone deciding to try killing the Queen of England because a chatbot said so. Most people wouldn't be quite that deluded in that context.
  I'd love to see evidence of mental instability in "everyone" and its presence in many people is in any case no justification for what are in effect controls on freedom of speech and expression, just couched in a new boogeyman.
zahlman 7 months ago

Is this somehow worse than humans talking each other into it?
- skywhopper 7 months ago
  
  Yes.
  
  zahlman 7 months ago
  
  How?
  
  krupan 7 months ago
  
  Does this really have to be spelled out?? Because a single human can only intimately converse with and convince a small number of people, while an LLM can do that with thousands (what is the upper limit even?) of people at a time.
  Also, because AI is being relentlessly marketed as being better than humans, thereby encouraging people to trust it even more than they might a fellow human.

AndyKelley 7 months ago

> Chose boring technology. I genuinely find myself picking libraries that have been around for a while partly because that way it’s much more likely that LLMs will be able to use them.

This is an appeal against innovation.

> I’ll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”

> Those people are loudly declaring that they have under-invested in the crucial skills of reading, understanding and reviewing code written by other people. I suggest getting some more practice in. Reviewing code written for you by LLMs is a great way to do that.

As someone who has spent [an incredible amount of time reviewing other people's code](https://github.com/ziglang/zig/pulls?q=is%3Apr+is%3Aclosed), my perspective is that reviewing code is fundamentally slower than writing it oneself. The purpose of reviewing code is mentorship, investing in the community, and building trust, so that those reviewees can become autonomous and eventually help out with reviewing.

You get none of that from reviewing code generated by an LLM.

xboxnolifes 7 months ago

> This is an appeal against innovation.
No it is not. It is arguing for using more stable and better documented tooling.
- em-bee 7 months ago
  
  so it's an appeal to not innovate on tooling and languages?
  
  xboxnolifes 7 months ago
  
  It's not appealing to anything.

verbify 7 months ago

An anecdote: I was working for a medical centre, and had some code that was supposed to find the 'main' clinic of a patient.

The specification was to only look at clinical appointments, and find the most recent appointment. However if the patient didn't have a clinical appointment, it was supposed to find the most recent appointment of any sort.

I wrote the code by sorting the data (first by clinical-non-clinical and then by date). I asked chatgpt to document it. It misunderstood the code and got the sorting backwards.

I was pretty surprised, and after testing with foo-bar examples eventually realised that I had called the clinical-non-clinical column "Clinical", which confused the LLM.

This is the kind of mistake that is a lot worse than "code doesn't run" - being seemingly right but wrong is much worse than being obviously wrong.

zahlman 7 months ago

To be clear, by "clinical-non-clinical", you mean a boolean flag for whether the appointment is clinical?
- verbify 7 months ago
  
  Yes, although we weren't using a boolean.
  (There was a reason for this - the field was used elsewhere within a PowerBI model, and the clinicians couldn't get their heads around True/False, PowerBI doesn't have an easy way to map True/False values to strings, so we used 'Clinical/Non-Clinical' as string values).
  I am reluctant to share the code example, because I'm preciously guarding an example of an LLM making an error in the hope that I'll be able to benchmark models using this, however here's the powerquery code (which you can put into excel) - ask an LLM to explain the code/predict what the output will look like, and compare it with what you get in excel.
  let
  MyTable = #table( {"Foo"}, { {"ABC"}, {"BCD"}, {"CDE"} } ), AddedCustom = Table.AddColumn( MyTable, "B", each if Text.StartsWith([Foo], "LIAS") or Text.StartsWith([Foo], "B") then "B" else "NotB" ), SortedRows = Table.Sort( AddedCustom, {{"B", Order.Descending}} )
  in
  SortedRows
  I believe the issue arises because the column that sorts B/NotB is also called 'B' (i.e. the Clinical/Non-Clinical column was simply called 'Clinical', which is not an amazing naming convention).

tombert 7 months ago

I use ChatGPT to generate code a lot, and it's certainly useful, but it has given me issues that are not obvious.

For example, I had it generate some C code to be used with ZeroMQ a few months ago. The code looked absolutely fine, and it mostly worked fine, but it made a mistake with its memory allocation stuff that caused it to segfault sometimes, and corrupt memory other times.

Fortunately, this was such a small project and I already know how to write code, so it wasn't too hard for me to find and fix, though I am slightly concerned that some people are copypasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.

zahlman 7 months ago

>though I am slightly concerned that some people are copypasting large swaths of code from ChatGPT that looks mostly fine but hides subtle bugs.
They used to do the same with Stack Overflow. But now it's more dangerous, because the code can be "subtly wrong in ways the user can't fathom" to order.
- tombert 7 months ago
  
  Yeah, there's effectively no limit to how much code that you can have.
  We're all guilty of copypasting from Stack Overflow, but as you said, that's not made to order. In order to use the code copied from there, you will likely have to edit it, at least a bit to fit your application, meaning that it does require a bit of understanding of what you're doing.
  Since ChatGPT can be completely tuned to what you want without writing code, it's far more tempting to just copy and paste from it without auditing it.
- krupan 7 months ago
  
  The beauty of stack overflow is that the code you are copying and pasting has been reviewed and voted on by a decent number of other programmers
KoolKat23 7 months ago

And subtle bugs existed pre-2022, how often my apps are updated for "minor bug fixes" would mean this is par for the course.
- tombert 7 months ago
  
  Sure, it's possible that the code it gave me was based on some incorrectly written code it scraped from Gitlab or something.
  I'm not a luddite, I'm perfectly fine with people using AI for writing code. The only thing that really concerns me is that it has the potential to generate a ton of shitty code that doesn't look shitty, creating a lot of surface area for debugging.
  Prior to AI, the quantity of crappy code that could be generated was basically limited by the speed in which a human could write it, but now there's really no limit.
  Again, just to reiterate, this isn't "old man yells at cloud". I think AI is pretty cool, I use it all the time, I don't even have a problem with people generating large quantities of code, it's just something we have to be a bit more weary of.
  
  KoolKat23 7 months ago
  
  Agree, just means less time developing and more time on quality control.

not2b 7 months ago

If the hallucinated code doesn't compile (or in an interpreted language, immediately throws exceptions), then yes, that isn't risky because that code won't be used. I'm more concerned about code that appears to work for some test cases but solves the wrong problem or inadequately solves the problem, and whether we have anyone on the team who can maintain that code long-term or document it well enough so others can.

wavemode 7 months ago

I once submitted some code for review, in which the AI had inserted a recursive call to the same function being defined. The recursive call was completely unnecessary and completely nonsensical, but also not wrong per se - it just caused the function to repeat what it was doing. The code typechecked, the tests passed, and the line of code was easy to miss while doing a cursory read through the logic. I missed it, the code reviewer missed it, and eventually it merged to production.
Unfortunately there was one particular edge case which caused that recursive call to become an infinite loop, and I was extremely embarrassed seeing that "stack overflow" server error alert come through Slack afterward.
t14n 7 months ago

fwiw this problem already exists with my more junior co-workers. and also my own code that I write when exhausted!
if you have trusted processes for review and aren't always rushing out changes without triple checking your work (plus a review from another set of eyes), then I think you catch a lot of the subtler bugs that are emitted from an LLM.
- not2b 7 months ago
  
  Yes, code review can catch these things. But code review for more complex issues works better when the submitter can walk the reviewers through the design and explain the details (sometimes the reviewers will catch a flaw in the submitter's reasoning before they spot the issue in the code: it can become clearer that the developer didn't adequately understand the spec or the problem to be solved). If an LLM produced it, a rigorous process will take longer, which reduces the value of using the LLM in the first place.

henning 7 months ago

If I have to spend lots of time learning how to use something, fix its errors, review its output, etc., it may just be faster and easier to just write it myself from scratch.

The burden of proof is not on me to justify why I choose not to use something. It's on the vendor to explain why I should turn the software development process into perpetually reviewing a junior engineer's hit-or-miss code.

It is nice that the author uses the word "assume" -- there is mixed data on actual productivity outcomes of LLMs. That is all you are doing -- making assumptions without conclusive data.

This is not nearly as strong an argument as the author thinks it is.

> As a Python and JavaScript programmer my favorite models right now are Claude 3.7 Sonnet with thinking turned on, OpenAI’s o3-mini-high and GPT-4o with Code Interpreter (for Python).

This is similar to Neovim users who talk about "productivity" while ignoring all the time spent tweaking dofiles that could be spent doing your actual job. Every second I spend toying with models is me doing something that does not directly accomplish my goals.

You have no idea how much code I read, so how can you make such claims? Anyone who reads plenty of code knows that it often feels like reading other people's code is often harder than just writing it yourself.

The level of hostility towards just sitting down and thinking through something without having an LLM insert text into your editor is unwarranted and unreasonable. A better policy is: if you like using coding assistants, great. If you don't and you still get plenty of work done, great.

skydhash 7 months ago

Also the thing that people miss is compounded experience. Just starting with any language, you have to read a lot of documentation, books, and articles. After a year or so, you have enough skeleton projects, code samples, and knowledge, that you could build a mini framework if the projects were repetitive. Even then, you could just copy paste features that you've already implemented, like that test harness or the Rabbitmq integration an be very productive that way.

sevensor 7 months ago

> you have to put a lot of work in to learn how to get good results out of these systems

That certainly punctures the hype. What are LLMs good for, if the best you can hope for is to spend years learning to prompt it for unreliable results?

jjevanoorschot 7 months ago

Many tools that increase your productivity as a developer take a while to master. For example, it takes a while to become proficient with a debugger, but I'd still wager that it's worth it to learn to use a debugger over just relying on print debugging.
- MrMcCall 7 months ago
  
  40+ years of successful coding with only print debugging FTW!
  A tool that helps you by iteratively guessing the next token is not a "developer tool" any more than a slot machine is a wealth buidling tool.
  Even when I was using Visual Studio Ultimate (that has a fantastic step-through debugging environment), the debugger was only useful for the very initial tests, in order to correct dumb mistakes.
  Finding dumb mistakes is a different order of magnitude of the dev process than building a complex edifice of working code.
  
  UncleEntity 7 months ago
  
  I would say printf debugging is the functional equivalent of "guessing the next token". I only reach for it when my deductive reasoning (and gdb) skills fail and I'm just shining a flashlight in the dark hoping to see the bugs scurrying around.
  Ironically, I used it to help the robots find a pretty deep bug in some code they authored in which the whole "this code isn't working, fix it" prompt didn't gain any traction. Giving them the code with the debug statements and the output set them on the right path. Easy peasy...true, they were responsible for the bug in the first place so I guess the humans who write bug free code have the advantage.
  
  MrMcCall 7 months ago
  
  > I would say printf debugging is the functional equivalent of "guessing the next token".
  The output of the code print statments, as the code is iteratively built up from skeleton to ever greater levels of functionality, is analyzed to ensure that things are working properly, in a stepwise fashion. There is no guessing in this whatsoever. It is a logical design progression from minimal functionality to complete implementation.
  Standard commercial computers never guess, so that puts constraints on my adding to their intrinsic logical data flows, i.e. I should never be guessing either.
  > I guess the humans who write bug free code have the advantage.
  We fanatical perfectionists are the only ones who write successful software, though perfection in function is the only perfection that can be attained. Other metrics about, for example, code structure, or implementation environment, or UI design, and the like, are merely ancillary to the functioning of the data flows.
  And I need not guess to know this fundamental truth, which is common for all engineering endeavors, though software is the only engineering pursuit (not discipline, yet) where there is only a binary result: either it works perfectly as designed or it doesn't. We don't get to be "off by 0.1mm", unless our design specs say we have some grey area, and I've never seen that in all my years of developing/modifying various n-tiered RDBMS topologies, desktop apps, and even a materials science equipment test data capture system.
  I saw the term "fuzzy logic" crop up a few decades ago, but have never had the occasion to use anything like that, though even that is a specific kind of algorithm that will either be implemented precisely or not.
- krupan 7 months ago
  
  You missed the part about unreliable results. Never in software engineering have we had to put a lot of effort into a tool that gives unpredictable, unreliable results like LLMs.
rsynnott 7 months ago

You’re holding it wrong, magic robot edition.
Like, at a certain point, doing it yourself is probably less hassle.
naasking 7 months ago

Because LLMs are not a stationary target, they're only getting better. They're already much better than they were only 2 years ago.

fumeux_fume 7 months ago

Least dangerous only within the limited context you defined of compilation errors. If I hired a programmer and I found whole libraries they invented to save themselves the effort of finding a real solution, I would be much more upset than if I found subtle logical errors in their code. If you take the cynical view that hallucinations are just speed bumps that can be iterated away then I would argue you are under-valuing the actual work I want the LLM to do for me. One time I was trying to get help with the AWS CLI or boto3 and no matter how many times I pasted the traceback to Claude or ChatGPT, it would apologize and then hallucinate the non-existent method or command. At least with logical errors I can fix those! But all in all, I still agree with a lot in this post.

nojs 7 months ago

If you’re writing code in Python against well documented APIs, sure. But it’s an issue for less popular languages and frameworks, when you can’t immediately tell if the missing method is your fault due to a missing dependency, version issue, etc.

zahlman 7 months ago

IMX, quite a few Python users - including ones who think they know what they're doing - run into that same confusion, because they haven't properly understood fundamentals e.g. about how virtual environments work, or how to read documentation effectively. Or sometimes just because they've been careless and don't have good habits for ensuring (or troubleshooting) the environment.

jccalhoun 7 months ago

I am not a programmer and i don't use Linux. I've been working on a python script for a raspberry pi for a few months. Chatgpt has been really helpful in showing me how to do things or debug errors.

Now I am at the point that I am cleaning up the code and making it pretty. My script is less than 300 lines and Chatgpt regularly just leaves out whole chunks of the script when it suggests improvements. The first couple times this led to tons of head scratching over why some small change to make one thing more resilient would make something totally unrelated break.

Now I've learned to take Chatgpt's changes and diff it with the working version before I try to run it.

Tostino 7 months ago

Version control inside an IDE helps with noticing these types of changes, even if you aren't a programmer
genewitch 7 months ago

Chatgpt can output a straight diff, too, that you can use with patch.
That's how aider commands the models to reply, for example.
- IanCal 7 months ago
  
  That's not quite right. The models are pretty bad at generating a proper diff, so there are two common formats used. The main one is a search and replace, and the search is then done in quite a fuzzy manner.
  
  IanCal 7 months ago
  
  To be clear the diff they generate is something you or I could apply manually and wouldn't notice an issue. It's things like very minor whitespace issues, or more commonly the count saying how large the sections are - nothing that affects the meat of the diff, they're fine with the hard part but then there's small counting errors.
  
  genewitch 7 months ago
  
  thanks, i didn't know how to respond to this as i never diff or use patch, but i know what they look like (@22,8 -/+ sort or whatever), and aider was outputting the green and red lines inverse video the same way github looks. It's a reasonable facsimile of "diff output", but i shouldn't have asserted it was diff output.
sidpatil 7 months ago

You can try asking ChatGPT to rewrite the original script to include the improvements.
pks016 7 months ago

Same. I used Claude to write script for my lab experiment. I had to review and edit some stuff but it worked mostly.
apwell23 7 months ago

yea its great at toy projects
- consumer451 7 months ago
  
  In my experience, a tool like Windsurf or Cursor (w/ Sonnet) is great at building a real project, as long the guardrails are clearly defined.
  For example, starting a SaaS project from something like Refine.dev + Ant Design, instead of just a blank slate.
  Of course, none of what I build is even close to novel code, which helps.

burningion 7 months ago

I think another category of error that Simon skips over that breaks this argument entirely: the hallucination where the model forgets a feature.

Rather than the positive (code compiles), the negative (forgets about a core feature), can be extremely difficult to tell. Worse still, the feature can slightly drift, based upon code that's expected to be outside of the dialogue / context window.

I've had multiple times where the model completely forgot about features in my original piece of code, after it makes a modification. I didn't notice these missing / subtle changes until much later.

simonw 7 months ago

That doesn't fit the definition of "hallucination" I was using here, which is a model inventing something that doesn't exist. Definitely a problem to watch out for though - I've had to remind the models to use existing functions before. I see that as an inevitable part of the back and forth with the model while iterating on code.

fzeroracer 7 months ago

> I’ll finish this rant with a related observation: I keep seeing people say “if I have to review every line of code an LLM writes, it would have been faster to write it myself!”

Not only is this a massive bundle of assumptions but it's also just wrong on multiple angles. Maybe if you're only doing basic CRUDware you can spend five seconds and give a thumbs up but in any complex system you should be spending time deeply reading code. Which is naturally going to take longer than using what knowledge you already have to throw out a solution.

greybox 7 months ago

I've not yet managed to successfully write any meaningful contribution to a codebase with an llm, faster than I could have written it myself.

Ok sure it writes test code boiler plate for me.

Honestly the kind of work im doing requires that I understand the code im reading, more than have the ability to quickly churn out more of it.

I think probably an llm is going to greatly speed up Web development, or anything else where the impetus is on adding to a codebase quickly, as for maintaining older code, performing precise upgrades, and fixing bugs, so far ive seen zero benefits. And trust me, I would like my job to be easier! Its not like I've not tried to use these

AStrangeMorrow 7 months ago

For me it has helped with: - boilerplate/quick prototyping - unit tests - quick start with a new language/library that I don’t really know
But yes, once the codebase starts to grow ever so slightly the only use I found is a glorified autocomplete.
Overall it does save me time in writing code. But not in debugging.

cratermoon 7 months ago

Increasingly I see apologists for LLMs sounding like people justifying fortune tellers and astrologists. The confidence games are in force, where the trick involves surreptitiously eliciting all the information the con artist needs from the mark, then playing it back to them as if it involves some deep and subtle insights.

chad1n 7 months ago

The idea is correct, a lot of people (including myself sometimes) just let an "agent" run and do some stuff and then check later if it finished. This is obviously more dangerous than just the LLM hallucinating functions, since at least you can catch the latter, but the first one depends on the tests of the project or your reviewer skills.

The real problem with hallucination is that we started using LLMs as search engines, so when it invents a function, you have to go and actually search the API on a real search engine.

zahlman 7 months ago

>The real problem with hallucination is that we started using LLMs as search engines, so when it invents a function, you have to go and actually search the API on a real search engine.
That still seems useful when you don't already know enough to come up with good search terms.

jchw 7 months ago

> The moment you run LLM generated code, any hallucinated methods will be instantly obvious: you’ll get an error. You can fix that yourself or you can feed the error back into the LLM and watch it correct itself.

Interestingly though, this only works if there is an error. There are cases where you will not get an error; consider a loosely typed programming language like JS or Python, or simply any programming language when some of the API interface is unstructured, like using stringly-typed information (e.g. Go struct tags.) In some cases, this will just silently do nothing. In other cases, it might blow up at runtime, but that does still require you to hit the code path to trigger it, and maybe you don't have 100% test coverage.

So I'd argue hallucinations are not always safe, either. The scariest thing about LLMs in my mind is just the fact that they have completely different failure modes from humans, making it much harder to reason about exactly how "competent" they are: even humans are extremely difficult to compare with regards to competency, but when you throw in the alien behavior of LLMs, there's just no sense of it.

And btw, it is not true that feeding an error into an LLM will always result in it correcting the error. I've been using LLMs experimentally and even trying to guide it towards solving problems I know how to solve, sometimes it simply can't, and will just make a bigger and bigger mess. Due to the way LLMs confidently pretend to know the exact answer ahead of time, presumably due to the way they're trained, they will confidently do things that would make more sense to try and then undo when they don't work, like trying to mess with the linker order or add dependencies to a target to fix undefined reference errors (which are actually caused by e.g. ABI issues.) I still think LLMs are a useful programming tool, but we could use a bit more reality. If LLMs were as good as people sometimes imply, I'd expect an explosion in quality software to show up. (There are exceptions of course. I believe the first versions of Stirling PDF were GPT-generated so long ago.) I mean, machine-generated illustrations have flooded the Internet despite their shortcomings, but programming with AI assistance remains tricky and not yet the force multiplier it is often made out to be. I do not believe AI-assisted coding has hit its Stable Diffusion moment, if you will.

Now whether it will or not, is another story. Seems like the odds aren't that bad, but I do question if the architectures we have today are really the ones that'll take us there. Either way, if it happens, I'll see you all at the unemployment line.

alexashka 7 months ago

> My less cynical side assumes that nobody ever warned them that you have to put a lot of work in to learn how to get good results out of these systems

Why am I reminded of people who say you first have to become a biblical scholar before you can criticize the bible?

loxs 7 months ago

The worst for me so far has been the following:

1. I know that a problem requires a small amount of code, but I also know it's difficult to write (as I am not an expert in this particular subfield) and it will take me a long time, like maybe a day. Maybe it's not worth doing at all, as the effort is not worth the result.

2. So why not ask the LLM, right?

3. It gives me some code that doesn't do exactly what is needed, and I still don't understand the specifics, but now I have a false hope that it will work out relatively easily.

4. I spend a day until I finally manage to make it work the way it's supposed to work. Now I am also an expert in the subfield and I understand all the specifics.

5. After all I was correct in my initial assessment of the problem, the LLM didn't really help at all. I could have taken the initial version from Stack Overflow and it would have been the same experience and would have taken the same amount of time. I still wasted a whole day on a feature of questionable value.

gojomo 7 months ago

Such "hallucinations" can also be plausible & useful APIs that oughtta exist – de facto feature requests.

dullcrisp 7 months ago

That's right, sometimes it's the children who are wrong.
smohare 7 months ago

[dead]

objectified 7 months ago

But that's for methods. For libraries, the scenario is different, and possibly a lot more dangerous. For example, the LLM generates code that imports a library that does not exist. An attacker notices this too while running tests against the LLM. The attacker decides to create these libraries on the public package registry and injects malware. A developer may think: "oh, this newly generated code relies on an external library, I will just install it," and gets owned, possibly without even knowing for a long time (as is the case with many supply chain attacks).

And no, I'm not looking for a way to dismiss the technology, I use LLMs all the time myself. But what I do think is that we might need something like a layer in between the code generation and the user that will catch things like this (or something like Copilot might integrate safety measures against this sort of thing).

namaria 7 months ago

Prompt injection means that unless people using LLMs to generate code are willing to hunt down and inspect all dependencies, it will become extremely easy to spread malware.

999900000999 7 months ago

I've probably spent about 25$ on Claude code so far.

I'm tempted to pay someone in Poland or whatever another 500$ to just finish the project. Claude code is like a temp that has a code quota to reach. After they reach it, they're done. You've reached the context limit.

A lot of stuff is just weird. For example I'm basically building a website with Supabase. Claude does not understand the concept of shared style sheets, instead it will just re-implement the same style sheets over and over again on like every single page and subcomponent.

Multiple incorrect implementations of relatively basic concepts. Over engineering all over the place.

A part of this might be on Supabase though. I really want to create a FOSS project, so firebase, while probably being a better fit, is out.

Not wanting to burn out, I took a break after a 4 hour Claude session. It's like reviewing code for a living.

However, I'm optimistic soon a competitor will emerge with better pricing. I would absolutely love to run three coding agents at once, maybe it even a fourth that can run integration tests against the first three.

simonw 7 months ago

"Not wanting to burn out, I took a break after a 4 hour Claude session. It's like reviewing code for a living."
That's a great encapsulation of what sometimes feel like after a (highly productive) session with these tools. I get a lot done with them but wow it's exhausting!
- 999900000999 7 months ago
  
  What's really disappointing is that Claude code doesn't really appear to "learn".
  Maybe a local SQL lite DB for notes would be ideal. Add a one sentence summary of what you did for each change, and then read it back before writing more code.
  
  simonw 7 months ago
  
  I've talked to people who get Claude Code to constantly update a TODOs.md with notes on what it's going to do next and what it's just done.

dzaima 7 months ago

Even if one is very good at code review, I'd assume the vast majority of people would still end up with pretty different kinds of bugs they are better at finding while writing vs reviewing. Writing code and having it reviewed by a human gets both classes, whereas reviewing LLM code gets just one half of that. (maybe this can be compensated-ish by LLM code review, maybe not)

And I'd be wary of equating reviewing human vs LLM code; sure, the explicit goal of LLMs is to produce human-like text, but they also have prompting to request being "correct" over being "average human" so they shouldn't actually "intentionally" reproduce human-like bugs from training data, resulting in the main source of bugs being model limitations, thus likely producing a bug type distribution potentially very different to that of humans.

krupan 7 months ago

Reading this article and then through the comments here, the overall argument I'm hearing here is that we should let the AI write the code and we should focus on reviewing it and testing it. We should work towards becoming good at specify a problem, and then validating the solution

Should we even be asking AI to write code? Shouldn't we just be building and training AI to solve these problems without writing any code at all? Replace every app with some focused, trained, and validated AI. Want to find the cheapest flights? Who cares what algorithm the AI uses to find them, just let it do that. Want to track your calorie intake, process payroll every two weeks, do your taxes, drive your car, keep airplanes from crashing into each other, encrypt your communications, predict the weather? Don't ask AI to clumsily write code to do these things. Just tell it to do them!

Isn't that the real promise of AI?

simonw 7 months ago

I think that is a promise that is doomed to failure.
Something we have learned as a civilization over the past ~70 years is that deterministic algorithms are an incredibly powerful thing. Designing processes that have a guaranteed, reliable result for a known input is a phenomenal way to scale up solutions to all kinds of problems.
If we want AI to help us with that, the best way to do that is to have it write code.
- throwuxiytayq 7 months ago
  
  AI is automating cognitive work of a human brain. There is barely anything deterministic, guaranteed, reliable or scalable about human brains. (To be honest, this should be apparent if you hired or worked with people.) If anything, being able to process these workloads without the meatware-specific deficiencies has terrifying scalability. The current wave of “““reasoning””” models demonstrate this: the LLM instantly emits a soup of tokens that could take you hours to analyze, greatly boosting the accuracy of the final answer. Expect a lot more of that, quantitatively and qualitatively.

xlii 7 months ago

> With code you get a powerful form of fact checking for free. Run the code, see if it works.

Um. No.

This is oversimplification that falls apart in any at minimum level system.

Over my career I’ve encountered plenty of reliability caused consequences. Code that would run but side effects of not processing something, processing it too slow or processing it twice would have serious consequences - financial and personal ones.

And those weren’t „nuclear power plant management” kind of critical. I often reminisce about educational game that was used at school and cost of losing a single save progress meant couple thousand dollars of reimbursement.

https://xlii.space/blog/network-scenarios/

This a cheatsheet I made for my colleagues. This is the thing we need to keep in mind when designing system I’m working on. Rarely any LLM thinks about it. It’s not a popular engineering by any sort, but it it’s here.

As for today I’ve yet to name single instance where any of ChatGPT produced code actually would save me time. I’ve seen macro generation code recommendation for Go (Go doesnt have macros), object mutations for Elixir (Elixir doesn’t have objects but immutable structs), list splicing in Fennel (Fennel doesn’t have splicing), language feature pragma ported from another or pure byte representation of memory in Rust and the code used UTF-8 string parsing to do it. My trust toward any non-ephemeral generated code is sub zero.

It’s exhausting and annoying. It feels like interacting with Calvin’s (of Calvin and Hobbes) dad but with all the humor taken away.

nottorp 7 months ago

> I asked Claude 3.7 Sonnet "extended thinking mode" to review an earlier draft of this post [snip] It was quite helpful, especially in providing tips to make that first draft a little less confrontational!

So he's also using LLMs to steer his writing style towards the lowest common denominator :)

dhbradshaw 7 months ago

The more leverage a piece of code has, the more good or damage it can do.

The more constraints we can place on its behavior, the harder it is to mess up.

If it's riskier code, constrain it more with better typing, testing, design, and analysis.

Constraints are to errors (including hallucinations) as water is to fire.

noodletheworld 7 months ago

If you want to use LLMs for code, use them.

If you don't, don't.

However, this 'lets move past hallucinations' discourse is just disingenuous.

The OP is conflating hallucinations, which are a fact, and undisputed failure mode of LLMs that no one has any solution for.

...and people not spending enough time and effort learning to use the tools.

I don't like it. It feels bad. It feels like a rage bait piece, cast out of frustration that the OP doesn't have an answer for hallucinations, because there isn't one.

People aren't stupid.

If they use a tool and it sucks, they'll stop using it and say "this sucks".

If people are saying "this sucks" about AI, it's because the LLM tool they're using sucks, not because they're idiots, or there's a grand 'anti-AI' conspiracy.

People are lazy; if the tool is good (eg. cursor), people will use it.

If they use it, and the first thing it does is hallucinate some BS (eg. intellij full line completion), then you'll get people uninstalling it and leaving reviews like "blah blah hallucination blah blah. This sucks".

Which is literally what is happening. Right. Now.

To be fair 'blah blah hallucinations suck' is a common 'anti-AI' trope that gets rolled out.

...but that's because it is a real problem

Pretending 'hallucinations are fine, people are the problem' is... it's just disingenuous and embarrassing from someone of this caliber.

tippytippytango 7 months ago

Yep. LLMs can get all the unit tests to pass. But not the acceptance tests. The discouraging thing is you might have all green checks on the unit tests, but you can’t get the acceptance tests to pass without starting over.

tanepiper 7 months ago

One thing I've found is that while I work with a LLM and it can do things way faster than me, the other side of it is I'm quickly loosing understand of the deeper code.

If someone asks me a question about something I've worked on, I might be able to give an answer about some deep functionality.

At the moment I'm working with a LLM on a 3D game and while it works, I would need to rebuild it to understand all the elements of it.

For me this is my biggest fear - not that LLMs can code, but that they do so at such a volume that in a generation or two no one will understand how the code works.

throwaway314155 7 months ago

> The real risk from using LLMs for code is that they’ll make mistakes that aren’t instantly caught by the language compiler or interpreter. And these happen all the time!

Are these not considered hallucinations still?

dzaima 7 months ago

Humans can hallucinate up some API they want to call in the same way that LLMs can, but you don't call all human mistakes hallucinations; classifying everything LLMs do wrong as hallucinations would seem rather pointless to me.
- thylacine222 7 months ago
  
  Analogizing this to human hallucination is silly. In the instance you're talking about, the human isn't hallucinating, they're lying.
  
  dzaima 7 months ago
  
  I definitely wouldn't say I'm lying (...to.. myself? what? or perhaps others for a quick untested response in a chatroom or something) whenever I write some code and it turns out that I misremembered the name of an API. "Hallucination" for that might be over-dramatic but at least it it's a somewhat sensible description.
- ForTheKidz 7 months ago
  
  Maybe we should stop referring to undesired output (confabulation? Bullshit? Making stuff up? Creativity?) as some kind of input delusion. Hallucination is already a meaningful word and this is just gibberish in that context.
  As best I can tell, the only reason this term stuck is because early image generation looked super trippy.
simonw 7 months ago

I think of hallucinations as instance where an LLM invents something that is entirely untrue - like a class or method that doesn't exist, or a fact about the world that's unnoticed true.
I guess you could call bugs in LLM code "hallucinations", but they feel like a slightly different thing to me.
- throwaway314155 7 months ago
  
  That's a great distinction actually. Thanks
fweimer 7 months ago

I don't think it's necessarily a hallucination if models accurately reproduce the code quality of their training data.

simonw 7 months ago

I really like this theory from Kellan Elliott McCrae: https://fiasco.social/@kellan/114092761910766291

> I think a simpler explanation is that hallucinating a non-existent library is a such an inhuman error it throws people. A human making such an error would be almost unforgivably careless.

This might explain why so many people see hallucinations in generated code as an inexcusable red flag.

internet_points 7 months ago

Even with boring tech that's been in the training set for ages (rails), you can get some pretty funny hallucinations: https://bengarcia.dev/making-o1-o3-and-sonnet-3-7-hallucinat... (fortunately this one was the very non-dangerous kind, making it very obvious; though I wonder how many non-obvious hallucinations entered the training set by the same process)

marcofloriano 7 months ago

"Proving to yourself that the code works is your job. This is one of the many reasons I don’t think LLMs are going to put software professionals out of work."

Good point

intrasight 7 months ago

> You can fix that yourself or you can feed the error back into the LLM and watch it correct itself.

Well, those types of errors won't be happening next year will they?

> No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!

What rot. The test is the problem definition. If properly expressed, the code passing the test means the code is good.

why-el 7 months ago

I am not so sure. Code by one LLM can be reviewed by another. Puppeteer like solutions will exist pretty soon. "Given this change, can you confirm this spec".

Even better, this can carry on for a few iterations. And both LLMs can be:

1. Budgeted ("don't exceed X amount")

2. Improved (another LLM can improve their prompts)

and so on. I think we are fixating on how _we_ do things, not how this new world will do their _own_ thing. That to me is the real danger.

sfink 7 months ago

That's ok. Writing such a spec is writing the code, declaratively.
The only difference between that and writing SQL (as opposed to writing imperative code to query the database) is that the translation mechanism is much more sophisticated, much less energy efficient, much slower, and most significantly much more error-prone than a SQL interpreter.
But declarative coding is good! It has its issues, and LLMs in particular compound the problems, but it's a powerful technique when it works.
tylerchurch 7 months ago

> Code by one LLM can be reviewed by another
Reviewed against what? Who is writing the specs?
- why-el 7 months ago
  
  the user who wants it? and a premature retort: if the feedback is "the user / PM / stakeholder could be wrong", then... that's where we are. A "refiner" LLM can be fronted (Replit is playing with this for instance).
  To be clear: this is not something I do currently, but my point is that one needs to detach from how _we_ engineers do this for a more accurate evaluation of whether these things truly do not work.

01100011 7 months ago

Timely article. I really, really want AI to be better at writing code, and hundreds of reports suggest it works great if you're a web dev or a python dev. Great! But I'm a C/C++ systems guy(working at a company making money off AI!) and the times I've tried to get AI to write the simplest of test applications against a popular API it mostly failed. The code was incorrect, both using the API incorrectly and writing invalid C++. Attempts to reason with the LLMs(grokv3, deepseek-r1) led further and further away from valid code. Eventually both systems stopped responding.

I've also tried Cursor with similar mixed results.

But I'll say that we are getting tremendous pressure at work to use AI to write code. I've discussed it with fellow engineers and we're of the opinion that the managerial desire is so great that we are better off keeping our heads down and reporting success vs saying the emperor wears no clothes.

It really feels like the billionaire class has fully drunk the kool-aid and needs AI to live up to the hype.

JTyQZSnP3cQGa8B 7 months ago

They have also found a way to force every developer and company to get a $20/month subscription forever.

svaha1728 7 months ago

If X, AWS, Meta, and Google would just dump their code into a ML training set we could really get on with disrupting things.

zeroCalories 7 months ago

I've definitely had these types of issues while writing code with LLMs. When relying on an LLM to write something I don't fully understand I will basically default to a form of TDD, making sure that the code behaves according to some spec. If I can't write a spec, then that's an issue.

sublinear 7 months ago

> Compare this to hallucinations in regular prose, where you need a critical eye, strong intuitions and well developed fact checking skills to avoid sharing information that’s incorrect and directly harmful to your reputation

Ah so you mean... actually doing work. Yeah writing code has the same difficulty, you know. It's not enough to merely get something to compile and run without errors.

> With code you get a powerful form of fact checking for free. Run the code, see if it works.

No, this would be coding by coincidence. Even the most atrociously bad prose writers don't exactly go around just saying random words from a dictionary or vaguely (mis)quoting Shakespeare hoping to be understood.

Velorivox 7 months ago

Not just that, “it works” is a very, very low bar to have for your code. To illustrate, the other day I tested an LLM by having it create a REST API. I asked for an end point where I could update a particular field of the record (think liking a post).
Then I decided to add on more functionality and asked for the ability to update all the other fields…
As you can guess, it gave me one endpoint per field for that entity. Sure, “it works”…
- trollbridge 7 months ago
  
  There are human developers who do the same thing…
  
  skydhash 7 months ago
  
  There are humans that do extreme sports just for the thrill. I still don't want my car to have a feature that can got it to throw itself off a cliff.
drekipus 7 months ago

> Even the most atrociously bad prose writers don't exactly go around just saying random words from a dictionary or vaguely (mis)quoting Shakespeare hoping to be understood.
I actually do this (and I'm not proud of it)

myaccountonhn 7 months ago

Another danger is spotted in the later paragraphs:

> I genuinely find myself picking libraries that have been around for a while partly because that way it’s much more likely that LLMs will be able to use them.

People will pick solutions that have a lot of training data, rather than the best solution.

duck2 7 months ago

That's exactly why I stopped using Svelte. Claude is much more sensible when generating React. Looks like a bleak future where only the most popular library survives.

Ozzie_osman 7 months ago

I'm excited to see LLMs get much better at testing. They are already good at writing unit tests (as always, you have to review them carefully). But imagine an LLM that can see your code changes _and_ can generate and execute automated and manual tests based on the change.

AdieuToLogic 7 months ago

Software is the manifestation of a solution to a problem.

Any entity, human or otherwise, lacking understanding of the problem being solved will, by definition, produce systems which contain some combination of defects, logic errors, and inapplicable functionality for the problem at hand.

antfarm 7 months ago

LLM generated code is legacy code.

tigerlily 7 months ago

When you go from the adze to the chainsaw, be mindful that you still need to sharpen the chainsaw, top up the chain bar oil, and wear chaps.

Edit: oh and steel capped boots.

Edit 2: and a face shield and ear defenders. I'm all tuckered out like Grover in his own alphabet.

Telemakhos 7 months ago

I'm not remotely convinced that LLMs are a chainsaw, unless they've been very thoroughly trained on the problem domain. LLMs are good for vibe coding, and some of them (Grok 3 is actually good at this) can speak passable Latin, but try getting them to compose Sotadean verse in Latin or put a penthemimeral caesura in an iambic trimeter in ancient Greek. They can define a penthemimeral caesura and an iambic trimeter, but they don't understand the concepts and can't apply one to the other. All they can do is spit out the next probable token. Worse, LLMs have lied to me on the definition of Sotadean verse, not even regurgitating what Wikipedia should have taught them.
Image-generating AIs are really good at producing passable human forms, but they'll fail at generating anything realistic for dice, even though dice are just cubes with marks on them. Ask them to illustrate the Platonic solids, which you can find well-illustrated with a Google image search, and you'll get a bunch of lumps, some of which might resemble shapes. They don't understand the concepts: they just work off probability. But, they look fairly good at those probabilities in domains like human forms, because they've been specially trained on them.
LLMs seem amazing in a relatively small number of problem domains over which they've been extensively trained, and they seem amazing because they have been well trained in them. When you ask for something outside those domains, their failure to work from inductions about reality (like "dice are a species of cubes, but differentiated from other cubes by having dots on them") or to be able to apply concepts become patent, and the chainsaw looks a lot like an adze that you spend more time correcting than getting correct results from.
- aaronbaugher 7 months ago
  
  When I was tutoring algebra, I sometimes ran into students who could solve the problems in the book, but if I wrote a problem that looked a little different or that combined two of the concepts they'd supposedly learned, they would be lost. I gradually realized that they didn't understand the concepts at all, but had learned to follow patterns. ("When it's one fraction divided by another fraction, flip the second fraction over and multiply. Why? No idea, but I get an A.")
  This feels like that: a "student" who can produce the right answers as long as you stick to a certain set of questions that he's already been trained on through repetition, but anything outside that set is hopeless, even if someone who understood that set could easily reason from it to the new question.
namaria 7 months ago

Chainsaws are deterministic. Using LLMs is more akin to trying to do topiary by juggling axes.

mediumsmart 7 months ago

As a non programmer I only get little programs or scripts that do something from the LLM. If they do the thing it means the code is tested, flawless and done. I would never let them have to deal with other humans Input of course.

Ozzie_osman 7 months ago

Great article, but doesn't talk about the potentially _most_ dangerous form of mistakes: an adversarial LLM trying to inject vulnerabilities. I expect this to become a vector soon as people figure out ways to accomplish this

davesque 7 months ago

I thought he was going to say the really danger is hallucination of facts, but no.

amelius 7 months ago

I don't agree. What if the LLM takes a two-step approach, where it first determines a global architecture, and then it fills in the code? (Where it hallucinates in the first step).

DeathArrow 7 months ago

I agree with the author. But can't the risk be minimized somehow by asking LLM A to generate code and LLM B to write integration tests?

al2o3cr 7 months ago

    My cynical side suspects they may have been looking for
    a reason to dismiss the technology and jumped at the first
    one they found.

MY cynical side suggests the author is an LLM fanboi who prefers not to think that hallucinating easy stuff strongly implies hallucinating harder stuff, and therefore jumps at the first reason to dismiss the criticism.

williamcotton 7 months ago

What do you mean by "harder stuff"? What about an experimental DSL written in C with a recursive descent parser and a web server runtime that includes Lua, jq, a Postgres connection pool, mustache templates, request-based memory arena, database migrations and much more? 11,000+ lines of code with ~90% written by Claude in Cursor Composer.
https://github.com/williamcotton/webdsl
Frankly us "fanbois" are just a little sick and tired of being told that we must be terrible developers working on simple toys if we find any value from these tools!
- elanora96 7 months ago
  
  I'm a strong believer that LLMs are tools and when wielded by talented and experienced developers they are somewhere in the danger category of Stack Overflow and transitive dependencies. This is not a critique of your project, or really the quality of LLMs, but when I see 90% of a 11,000+ loc project written in Claude, it just feels sort of depressing in a way I haven't processed yet.
  I love foss, I love browsing projects of all quality levels and vintages and seeing how things were built. I love learning new patterns and sometimes even bickering over their strengths and weaknesses. An LLM generated code base hardly makes me even want to engage with it...
  Perhaps these feelings are somewhat analogous to hardcopies vs ebooks? My opinions have changed over time and I read and collect both. Have you had similar thoughts and gotten over them? Do you see tools like Claude in a way where this isn't an issue?
  
  goosejuice 7 months ago
  
  You're romanticizing software. To place more value in the code than the outcome. There's nothing wrong with that, but most people that use software don't think about it that way.
  
  williamcotton 7 months ago
  
  I mean, when I'm working on something that I don't expect to be more than a throw-away experiment I'm not too worried about the code itself.
  The grammar itself still seems a bit clunky and the next time I head down this path I imagine I'll go with a more hand-crafted approach.
  I learned a lot about integrating Lua and jq into a project along the way (and how to make it performant), something I had no prior experience with.
- dzaima 7 months ago
  
  Some free code review of the first file I clicked into - https://github.com/williamcotton/webdsl/blob/92762fb724a9035... among other places should probably be doing the conditional "lexer->line++"; thing. Quite a weird decision to force all code paths to manually do that whenever a newline char is encountered. Could've at least made a "advance_maybe_newline(lexer);" or so. But I guess LLMs give you copy-paste garbage.
  Even the article of this thread says:
  > Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing.
  
  williamcotton 7 months ago
  
  Thanks for taking a look! The lexer and parser is probably close to 100% Claude and I definitely didn't review it completely. I spent most of the time trying out different grammars (normally something you want to do before you start writing code) and runtime features! "Build the web server runtime and framework into the language" was an idea kicking around in my head for a few years but until Cursor I didn't have the energy to play around with the idea.
  
  ianbutler 7 months ago
  
  Okay so this is a personal opinion right? Like where is the objectivity in your review?
  What are the hardline performance characteristics being violated? Or functional incorrectness. Is this just "it's against my sensibilities" because at the end of the day frankly no one agrees on how to develop anything.
  The thing I see a lot of developers struggle with is just because it doesn't fit your mental model doesn't make it objectively bad.
  So unless it's objectively wrong or worse in a measurable characteristic I don't know that it matters.
  For the record I'm not asserting it is right, I'm just saying I've seen a lot of critiques of LLM code boil down to "it's not how I'd write it" and I wager that holds for every developer you'll ever interact with.
  
  dzaima 7 months ago
  
  OP didn't put much effort into writing the code so I'm certainly not putting in much effort into a proper review of it, for no benefit to me no less. I just wanted to see what quality AI gets you, and made a comment about it.
  I'm pretty sure the code not having the "if (…) lexer->line++" in places is just a plain simple repeated bug that'll result in wrong line numbers for certain inputs.
  And human-wise I'd say the simple way to not have made that bug would've been to make/change abstractions upon the second or so time writing "if (…) lexer->line++" such that it takes effort to do it incorrectly, whereas the linked code allows getting it wrong by default with no indication that there's a thing to be gotten wrong. Point being that bad abstractions are not just a maintenance nightmare, but also makes doing code review (which is extra important with LLM code) harder.
  
  KoolKat23 7 months ago
  
  I agree, it seems a lot of the complaints boil down to academic reasons.
  Fine it's not the best and perhaps may run into some longer term issues but most importantly it works at this point in time.
  A snobby/academic equivalent would be someone using an obscure language such as COBOL.
  The world continues to turn.
- rhubarbtree 7 months ago
  
  I’m always really sceptical of any “proof by example” that is essentially anecdotal.
  If this is going to be your argument, you need a solid scientific approach. A study where N developers are given access to a tool vs N that are not, controls are in place etc.
  Because the overwhelming majority of coders I speak to are saying exactly the same thing, which is LLMs are a small productivity boost. And the majority of cursor users, which is admittedly a much smaller number, are saying it just gets stuck playing whack a mole. And common sense says these are the expected outcomes, so we are going to need really rigorous work to convince people that LLMs can build 90% of most deeply technical projects. Exceptional results require exceptional evidence.
  And when we do see anecdotal incidents that seem so divergent from the norm, well that then makes you wonder how that can be, is this really objective or are we in some kind of ideological debate?
- semi-extrinsic 7 months ago
  
  Honest question: this looks like a library others can use to build websites. It contains features related to authentication and security. If it's 90% LLM generated, how do you sleep at night? I'd be dead scared someone would use this, hit a bug that leaks PII (or worse) and then sue me into oblivion.
  
  williamcotton 7 months ago
  
  "WebDSL is an experimental domain-specific language and server implementation for building web applications."
  And it's MIT:
  THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
- namaria 7 months ago
  
  Protip: when you block a user in github it let's you add a note as to why that will show in their profile. It will also alert you when you see a repository to which that user has contributed.
- Snuggly73 7 months ago
  
  ..."request-based memory arena"...
  there are some very questionable things going on with the memory handing in this code. just saying.
  
  williamcotton 7 months ago
  
  Request-based memory arenas are pretty standard for web servers!
  
  Snuggly73 7 months ago
  
  Maybe be, after all - I dont write web servers (btw, the PQ and JQ libraries doesnt seem to use the arena allocator, which makes the whole proposition a bit dubious, but lets say that its me being picky).
  What I meant was, that IMO the code is not very robust when dealing with memory allocations:
  1. The "string builder" for example silently ignores allocation failures and just happily returns - https://github.com/williamcotton/webdsl/blob/92762fb724a9035...
  2. In what seems most of the places, the code simply doesnt check for allocation failures, which leads to overruns (just couple of examples):
  https://github.com/williamcotton/webdsl/blob/92762fb724a9035...
  https://github.com/williamcotton/webdsl/blob/92762fb724a9035...
  
  williamcotton 7 months ago
  
  Thanks for digging in. Yup, those two libs don’t support custom allocators. I raised an issue in the jq repo to ask if they thought about adding it.
  Great points about happy path allocations. If I ever touch the project again I’ll check each location.
  Note to self: free code reviews of projects if you mention LLMs!
  
  namaria 7 months ago
  
  "People took a cursory look at a codebase I published and found glaring mistakes they discussed publicly as examples of how bad it is" is not the flex you think it is.
  
  williamcotton 7 months ago
  
  "Cursory", get it? I did indeed make it with Cursor! ;)
  I hope you find yourself having a better day today than yesterday.
  
  namaria 7 months ago
  
  I hope you stop peddling AI slop
simonw 7 months ago

I find it a bit surprising that I'm being called an "LLM fanboy" for writing an article with the title "Hallucinations in code are the least dangerous form of LLM mistakes" where the bulk of the article is about how you can't trust LLMs not to make far more serious and hard-to-spot logic errors.

devmor 7 months ago

I don’t really understand what the point or tone of this article is.

It says that Hallucinations are not a big deal, that there’s great dangers that are hard to spot in LLM-generated code… and then presents tips on fixing hallucinations with the general theme of positivity towards using LLMs to generate code, with no more time dedicated to the other dangers.

It sure gives the impression that the article itself was written by an LLM and barely edited by a human.

TheRealPomax 7 months ago

> No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!

Absolutely not. If your testing requires a human to do testing, your testing has already failed. Your tests do need to include both positive and negative tests, though. If your tests don't include "things should crash and burn given ..." your tests are incomplete.

> If you’re using an LLM to write code without even running it yourself, what are you doing?

Running code through tests is literally running the code. Have code coverage turned on, so that you get yelled at for LLM code that you don't have tests for, and CI/CD that refuses to accept code that has no tests. By all means push to master on your own projects, but for production code, you better have checks in place that don't allow not-fully-tested code (coverage, unit, integration, and ideally, docs) to land.

The real problem comes from LLMs happily not just giving you code but also test cases. The same prudence applies as with test cases someone added to a PR/MR: just because there are tests doesn't mean they're good tests, or enough tests, review them in the assumption that they're testing the wrong thing entirely.

ggm 7 months ago

I'm just here to whine, almost endlessly, that the word "hallucination" is a term of art chosen deliberately because it helps promote a sense AGI exists, by using language which implies reasoning and consciousness. I personally dislike this. I think we were mistaken allowing AI proponents to repurpose language in that way.

It's not hallucinating Jim, it's statistical coding errors. It's floating point rounding mistakes. It's the wrong cell in the excel table.

rhubarbtree 7 months ago

“Errors”?
- namaria 7 months ago
  
  Errors are a category of well understood and explicit failures.
  Slop is the best description. LLMs are sloppy tools and some people are not discerning enough to know that blindly running this slop is endangering themselves and others.
  
  rhubarbtree 7 months ago
  
  I'm not sure errors are really understood that well.
  I ask for 2+5, you give me 10. Is that an error?
  But then it turns out the user for this program wanted + to be a multiply operator, so the result is "correct".
  But then it turns out that another user in the same company wanted it to mean "divide".
  It seems to me to be _very_ rare when we can say for sure software contains errors or is error-free, because even at the extreme level of the spec there are just no absolutes.
  The generality of "correctness" achieved by a human programmer is caused by generality of intent - they are trying to make the software work as well as possible for its users in all cases.
  An LLM has no such intent. It just wants to model language well.
  
  namaria 7 months ago
  
  LLM output isn't 'error' or 'hallucination' because it can only resemble human language. There is no intent. There is nothing being communicated.
  If LLMs output text, that is always the correct output, because it is programmed to extend a given piece of text by outputting tokens that translate to human readable text.
  LLMs are only coincidentally correct sometimes when it is given a bit of text to extend and by some clever stopping and waiting for bits of text from a person it can render something that looks like a conversation and it reads like a cogent conversation. That is what they are programmed to do and they do it well.
  The text being coherent but failing to conform to reality some way or another is just part of how they work. They are not failing, they are working as intended. They don't hallucinate or produce errors, they are merely sometimes coincidentally correct.
  That's what I meant by my comment. Saying that the LLMs 'hallucinate' or 'are wrong about something' is incorrect. They are not producing errors. They are successfully doing what they were programmed to do. LLMs produce sloppy text that is sometimes coincidentally informative.

cenriqueortiz 7 months ago

Code testing is “human in the loop” for LLM generated code.

marcofloriano 7 months ago

"If you’re using an LLM to write code without even running it yourself, what are you doing?"

Hallucinating

0dayz 7 months ago

Personally I believe the worst with llm is it's abysmal ability to architect code, it's why I use llms more like a Google than a so called coding buddy, because there was so many times I had to rewrite the entire file because the llm had added in so much extra unmanageable functions,even deciding to solve problems I hadn't asked it to do.

tiberriver256 7 months ago

Wait until he hears about yolo mode and 'vibe' coding.

Then the biggest mistake it could make is running `gh repo delete`

cryptoegorophy 7 months ago

Just ask another LLM to proof read?

namaria 7 months ago

Do you realize that giving LLMs 'instructions' is merely trying to blindly twist knobs by random amounts?

hackburg 7 months ago

[dead]

sunami-ai 7 months ago

I asked o3-mini-high (investor paying for Pro, I personally would not) to critique the Developer UX of D3's "join" concept (how when you select an empty set then when you update you enter/exit lol) and it literally said "I'm sorry. I can't help you with that." The only thing missing was calling me Dave.

homelessgolden 7 months ago

[dead]

sfink 7 months ago

I don't care about the programmers who can't write FizzBuzz. Why should I? If I employed them, they were costing me money. If I worked with them, they were costing me time and hair follicles. I need them about as much as I need a buggy whip.
The linked article makes the claim that the majority of comp sci majors cannot write FizzBuzz. That's a bold assertion; how did the author sample such people? I suspect the sample pool was people applying for a position. There is a major selection bias there. First, people who fail many interviews will do more interviews than those who do not fail, so you'll start with a built-in bias towards the less competent (or more nervous).
Second, there is a large pile of money being given to people who make it over a somewhat arbitrary bar. As a random person, why would I not try to jump over the bar, even if I'm not particularly good at jumping? There are a lot of such bars with a lot of such large piles of money behind them. If getting a chance at jumping over those bars requires me to get a particular piece of paper with a particular title printed at the top of it, I'll be motivated to get that piece of paper too.
- zahlman 7 months ago
  
  > Second, there is a large pile of money being given to people who make it over a somewhat arbitrary bar. As a random person, why would I not try to jump over the bar, even if I'm not particularly good at jumping? There are a lot of such bars with a lot of such large piles of money behind them.
  Why don't we see job positions for doctors and lawyers similarly flooded, then?
  
  sfink 7 months ago
  
  Because there is a high barrier to entry. In the US, at least, there was also an explicit policy and set of mechanisms to limit supply of doctors: https://www.advisory.com/daily-briefing/2022/02/16/physician...
  For lawyers, there is an oversupply of the most lucrative segments, and an undersupply everywhere else: https://www.ajs.org/is-there-a-shortage-of-lawyers/
  But in both cases, there just isn't some low bar that you can finagle your way over and get to the promised riches. Lawyers have a literal Bar, and it isn't low. Doctors have a ton of required training. Both have serious certification requirements that computer science professionals do not. Both professions support my point.
  Furthermore, incompetent lawyers face real-world tests. If they lose their cases or otherwise screw things up, they are not going to be raking in the money. And people are trying their best to flood the doctor market, by inventing certifications that avoid the requirements to be a physician and setting themselves up as alternative medicine specialists or naturalists or generic "healers" or whatever. (I'm not saying they're all crap, but I am saying that unqualified people are flooding those positions.)
  
  zahlman 7 months ago
  
  >I'm not saying they're all crap, but I am saying that unqualified people are flooding those positions.
  That's the part I'm wondering about. Because it seems like I don't hear reports from people who would hire doctors and lawyers, of having to deal with that.
zahlman 7 months ago

>the majority of programmers can’t write FizzBuzz
How did they get through the Leetcode-style interviews before LLMs and remote interviewing?

AIFounder 7 months ago

[dead]