A few days ago I was trying to unsubscribe from a service (notably an AI 3D modeling tool that I was curious about).
I spent 5 minutes trying to find a way to unsubscribe and couldn't. Finally, I found it buried in the plan page as one of those low-contrast ellipses on the plan card.
Instead of unsubscribing me or taking me to a form, it opened a conversation with an AI chatbot with a preconfigured "unsubscribe" prompt. I have never felt more angry with a UI: I had to waste more time talking to a robot before it would render the unsubscribe button in the chat.
Why would we bring the most hated feature of automated phone calls to apps? As a frontend engineer I am horrified by these trends.
There might be some confusion about the transition to what some call the post-literate era: an era where text is no longer the primary medium. That's not necessarily bad, because you gain the advantages of other mediums - oral and visual - but it is something to keep in mind.
I'm a bit skeptical that a post-literate era is happening. I gather it appears in some sci-fi, but I don't see much sign of it in reality. I mean, here we are on a text-only site. If anything we seem to be heading for a 100% literate society. Literacy graphs here: https://ourworldindata.org/grapher/cross-country-literacy-ra...
I don’t think the post-literate era means that text will disappear. I think it's just not going to be dominant anymore, but I also have my reservations, since I do prefer the text medium.
I think one of the things missing from this post is an attempt to answer: what are the highest-priority AI-related problems that the industry should seek to tackle?
Karpathy hints at one major capability unlock being UI generation, so instead of interacting with text the AI can present different interfaces depending on the kind of problem. That seems like a severely underexplored problem domain so far. Who are the key figures innovating in this space so far?
In the most recent Demis interview, he suggests that one of the key problems that must be solved is online / continuous learning.
Aside from that, another major issue is probably reducing hallucinations and increasing reliability. Ideally you should be able to deploy an LLM to work on a problem domain, and if it encounters an unexpected scenario it reaches out to you to figure out what to do. But for standard problems it should function reliably 100% of the time.
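A rough sketch of that escalation pattern (everything here is a placeholder - the LLM call, the threshold, and the notify hook):

    from dataclasses import dataclass

    @dataclass
    class Decision:
        answer: str
        confidence: float  # 0.0-1.0, self-reported or scored by a separate check

    CONFIDENCE_FLOOR = 0.8  # arbitrary threshold; tune it against your own eval set

    def handle(request: str, run_llm, notify_human) -> str:
        decision = run_llm(request)            # placeholder for your actual LLM call
        if decision.confidence < CONFIDENCE_FLOOR:
            notify_human(request, decision)    # placeholder escalation hook
            return "Escalated to a human operator."
        return decision.answer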
> I like this version of the meme for pointing out that human intelligence is also jagged in its own different way.
The idea of jaggedness seems useful for advancing epistemology. If we could identify the domains that have useful data we fail to extract, we could fill those holes and eventually become a general intelligence ourselves. The task may be as hard as making a list of your own blind spots. But now we have an alien intelligence with an outside perspective. While we make AI less jagged, it might return the favor.
If we keep inventing different kinds of intelligence the sum of the splats may eventually become well rounded.
I don't think it will become well rounded, because being well rounded is not cost-efficient. Intelligence is sensitive to cost; cost is the core constraint shaping it. Any action has a cost - energy, materials, time, opportunity, or social. Intelligence is solving the cost equation; if we can't solve it, we die. Cost is also why we specialize: in a group we can offload some intelligence to others. LLMs have their own costs too, and are shaped by them into their own kind of jagged intelligence; they are no spherical cows either.
What's interesting about Nano Banana (and even more so video models like Veo 3) is that they act as a weird kind of world model when you consider that they accept images as input and return images as output.
Give it an image of a maze, it can output that same image with the maze completed (maybe).
> We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more.
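For the image-in / image-out loop specifically, a rough sketch using the google-genai SDK from memory - the model name and the response plumbing are assumptions, so double-check against the current docs:

    from google import genai
    from PIL import Image

    client = genai.Client()  # reads the API key from the environment
    maze = Image.open("maze.png")

    response = client.models.generate_content(
        model="gemini-2.5-flash-image",  # assumed "Nano Banana" model id
        contents=[maze, "Draw the solution path through this maze; change nothing else."],
    )

    for part in response.candidates[0].content.parts:
        if getattr(part, "inline_data", None):   # image bytes come back as inline data
            with open("maze_solved.png", "wb") as f:
                f.write(part.inline_data.data)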
I think he is referring to capability, not architecture, and saying that NB is at the point where it is suggestive of the near-future capability of using GenAI models to create their own UI as needed.
NB (Gemini 2.5 Flash Image) isn't the first major-vendor LLM-based image gen model, after all; GPT Image 1 was first.
The bit about o3 being the turning point is very interesting. I heard someone say that o3 (or perhaps the cheaper o4-mini) should have been called gpt-5, and that people would have been mind blown. Instead it kind of went under the radar as far as the mainstream goes.
Whereas we just got the incremental progress with gpt-5 instead and it was very underwhelming. (Plus like 5 other issues at launch, but that's a separate story ;)
I'm not sure if o4-mini would have made a good default gpt though. (Most use is conversational and its language is very awkward.) So they could have just called it gpt-5 pro or something, and put it on the $20 tier. I don't know.
I agree with this FWIW; for many months I talked to people who had never used o3 and didn't know what it was because it sounded weird. Maybe it wasn't obvious at the time, but that would have been a good major point release to make then.
Beyond graduating students, I see model labs as “accelerators/incubators” bundling, launching, and productizing observed ideas that gain traction. The sheer strength of their platforms, the number of eyes watching them, near-zero marginal costs, and seemingly unlimited budgets mean that only slow decision-making can prevent them from becoming the next Amazons of everything.
Something I’ve been thinking about is how, as end-stage users (e.g. building our own “thing” on top of an LLM), we can broadly verify it’s doing what we need without benchmarks. Does a set of custom evals built out over time solve this? Is there more we can do?
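Concretely, I picture something like this growing over time (just a sketch; call_model is whatever wrapper you already have around the LLM):

    from typing import Callable

    # Each entry: (prompt, cheap check on the raw model output). Grows over time.
    EVALS: list[tuple[str, Callable[[str], bool]]] = [
        ("Extract the total from: 'Total due: $41.50'", lambda out: "41.50" in out),
        ("Answer yes or no: is 17 prime?", lambda out: out.strip().lower().startswith("yes")),
    ]

    def run_evals(call_model: Callable[[str], str]) -> float:
        passed = sum(check(call_model(prompt)) for prompt, check in EVALS)
        print(f"{passed}/{len(EVALS)} passed")
        return passed / len(EVALS)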
LLMs still need to bring clear added value to enterprise and corporate work; otherwise, they remain a geek’s toy.
Big media agencies that claim to use AI rely on strong creative teams who fine-tune prompts and spend weeks doing so. Even then, they don’t fully trust AI to slice long videos into shorter clips for social media.
Heavy administrative functions like HR or Finance still don’t get approval to expose any of their data to LLMs.
What I’m trying to say is that we are still in the early stages of LLM development, and as promising as this looks, it’s still far from delivering the real value that is often claimed.
I think their non-deterministic nature is what’s making it difficult to adopt. It’s hard to train somebody in the old way of “if you see this, do this” because when you call the LLM twice you most likely get different results.
It took a long time to computerize businesses and it might take some time to adopt/adapt to LLMs.
Vibe coding is sufficient for job hoppers who never finish anything and leave when the last 20% have to be figured out. Much easier to promote oneself as an expert and leave the hard parts to other people.
I’ve found incredible productivity gains writing (vibe coding) tools for myself that will never need to be “productionised” or even used by another person. Heck, even I will probably never use the latest log retrieval tool, which exists purely for Claude Code to invoke. There is a ton of useful software yet to be written for which there _is_ no “last 20%”.
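To make it concrete, that log retrieval tool is roughly this shape (the file name and log path are invented for illustration):

    # tail_errors.py - hypothetical one-off helper written purely for an agent to
    # invoke: print the last N error lines from an application log.
    import sys
    from pathlib import Path

    LOG_PATH = Path("logs/app.log")  # invented path, for illustration only
    N = int(sys.argv[1]) if len(sys.argv) > 1 else 50

    lines = LOG_PATH.read_text(encoding="utf-8", errors="replace").splitlines()
    errors = [line for line in lines if "ERROR" in line]
    print("\n".join(errors[-N:]))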
These tools are so useful and make you so much more "productive" that you don't think anyone else would want to pay anything for them huh? Did your boss at least give you a big raise for your "productivity" increase, or maybe lay off some of your underperforming coworkers bc you are just so much better now?
Do you mean vibe coding as-in producing unreviewed code with LLMs and prompting at it until it appears to work, or vibe coding as a catch-all for any time someone uses AI-assistance to help them write code?
Friendly reminder: There is no ghost in the machine. It is a system executing code, not a being having thoughts. Let’s admire the tool without projecting a personality onto it.
For me, that’s kind of the point. It’s similar to how the characters in a novel don’t really exist, and yet you can’t really discuss what happens in a novel without pretending that they do. It doesn’t really make sense to treat the author’s motivations and each character’s motivations as the same.
Similarly, we’re all talking to ghosts now, which aren’t real, and yet there is something there that we can talk about. There are obvious behavioral differences depending on what persona the LLM is generating text for.
I also like the hint of danger in “talking to ghosts.” It’s difficult to see how a rational adult could be in any danger from just talking, but I believe the news reports that some people who get too deep into it get “possessed.”
Consciousness is weird and nobody understands it. There is no good reason to assume that these systems have it. But there is also no good reason to rule it out.
For me, Claude Code was the most impressive innovation this year. Cursor was a good proof of concept but Claude Code is the tool that actually got me to use LLMs for coding.
The kind of code that Claude produces looks almost exactly like the code I would write myself. It's like it's reading my mind. This is a game changer because I can maintain the code that Claude produces.
With Claude Code, there are no surprises. I can pretty much guess what its code will look like 90% to 95% of the time but it writes it a lot faster than I could. This is an amazing innovation.
Gemini is quite impressive as well. Nano banana in particular is very useful for graphic design.
I haven't tried Gemini with coding yet, but TBH Claude Code does such a great job that if I could code any faster, I would get decision fatigue. I don't like rushing into architecture or UX decisions. I like to sit on certain decisions for a day or two before starting implementation. Once you start in a particular direction, it's hard to undo, and you may double down on the mistake due to the sunk cost fallacy. I try hard to avoid that.
I don't even see much reason to use Cursor. I am used to IntelliJ IDEA, so I just downloaded the Claude Code plugin, and now I basically use the IDE only for navigating the code, finding references, and reviewing the code. I can't even remember the last time I wrote more than 2 lines of code. Claude Code has catapulted my performance at least 5x, if not more. And now that the cost of writing tests is so minimal, I am also able to achieve much better (and meaningful!) test coverage too. AI agents are where the most productivity is. I just create a plan with Claude, iterate on it, ask questions, then let it implement the plan, review, and ask for some adjustments. No manual writing of code at all. Zero.
Maybe I'm holding it wrong, but it still messes up the finer aspects of a codebase. If I ask it to implement some weird thing off the beaten path, it gets lost. But I completely agree on the test part. I actually test much more now since it's so easy!
IntelliJ has its own Claude integration too, but it does not use your Claude subscription: https://blog.jetbrains.com/ai/2025/09/introducing-claude-age...
Do you guys all work 100% on open source? Or are you uploading bits of your copyrighted code for future training to Anthropic? I hate patents so copyright is the only IP protection I have.
We use AWS Bedrock, so everything stays within our AWS account. It's not like we aren't already uploading our code to GitHub for version control, AWS for deployment, and JetBrains for development, and all of our logs to Datadog, Sentry, Snowflake, and more.
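For anyone curious what that looks like in practice, a minimal sketch using boto3's Converse API (the model ID is a placeholder; check your Bedrock console for the real one):

    # Minimal sketch: calling a Claude model through Bedrock with boto3, so prompts
    # and completions stay inside your own AWS account.
    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

    response = bedrock.converse(
        modelId="anthropic.claude-example-model-id",  # placeholder model ID
        messages=[{"role": "user", "content": [{"text": "Summarize this diff: ..."}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0},
    )

    print(response["output"]["message"]["content"][0]["text"])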
Yeah, my source code is on my computers, in self-hosted version control and self-hosted CI runners
Nano Banana Pro is legitimately an insane tool if you know how to use it. I still can’t believe they released it in the wild
What is there to using it more than asking it to generate an image of something?
For one: modifying existing images in interesting ways ... adding characters, removing elements, altering or enhancing certain features, creating layers, and so on. Things that would take a while on Photoshop, done almost instantly. Really unlocks the imagination.
I gave it an image of my crappy art and asked what steps I could take to make it look better. It gave me specific advice, like varying the line widths, and how to use this on specific parts of the character. It also pointed out that the shading in my piece was inconsistent and did not reflect the 3D form I was representing, and again gave me specific fixes I could implement. I asked it to give me an updated version of the piece with all of its advice implemented, and it did so. I was pretty shocked at all of this.
For me: I've only tried using it seriously a few times but my experience is that you have to juggle carefully when to start a fresh session. It can get really anchored on earlier versions of images. It was interesting balancing iteration and from-scratch prompt refinement.
It's decent for things that would take a long time in Photoshop. Like most AI, sometimes it works great and sometimes it goes off the rails completely. Most recently, I used it to process some drone photos that were taken during late fall for the purpose of marketing a commercial property. All of the trees/grass/plants were brown, so I told it to make it look like the photos were taken during the summer but not to change anything else. It did a very good job, not just changing the color, but actually adding leaves to the plants and trees in a way that looked very realistic. It did in seconds what would have taken one of my team members hours, leaving them to work on other more pressing projects.
I first got into agentic coding properly with the GLM coding plan (it's like $2/month), but I found myself very consistently asking Claude to make the code more elegant and readable. At that point I realized I was being silly and just switched to Claude Code.
(GLM etc. get surprisingly close with good prompting but... $0.60/day to not worry about that is a no brainer.)
I've used all of these tools and for me Cursor works just as well but has tabs, easy ways to abort or edit prompts, great visual diff, etc...
Someone sell me on Claude Code; I just don't get it.
I’m with you, I’ve used CC but I strongly prefer Cursor.
Fundamentally, I don’t like having my agent and my IDE be split. Yes, I know there are CC plugins for IDEs, but you don’t get the same level of tight integration.
I don’t have much time to evaluate tools every month, so I have settled on Cursor. I’m curious what I’m missing when using the same models?
You're not missing much. You can generally use Cursor like Claude Code for normal day to day use. I prefer Cursor because I like reviewing changes in an IDE, and I like being able to switch to the current SOTA model.
Though for more automated work, one thing you miss with Cursor is sub agents. And then to a lesser extent skills (these are pretty easy to emulate in other tools). I'm sure it's only a matter of time though.
Claude Code's VS Code integration is very easy to set up and pretty helpful if you want to see/review changes in an IDE.
The big limitation is that you have to approve/disapprove at every step. With Cursor you can iterate on changes and it updates the diffs until you approve the whole batch.
There is an auto accept diffs mode
You are missing an entire agentic experience. And I wouldn't call it vibe coding for an engineer; you're more or less empowered to truly orchestrate the development of your system.
Cursor has an agent, but that's like everyone else trying to copy the Model T while Ford was developing it.
This hasn’t been my experience at all. I’m finding Cursor with Opus 4.5 and plan mode to be just as capable as CC. And I prefer the UI/UX.
I have only compared Claude Code with Crush and a tool of my own design. In my experience, Claude Code is optimized for giant codebases and long tasks. It loves launching dozens of agents in parallel. So it's a bit heavy for smaller, surgical stuff, though it works decently for that too.
If you mostly have small codebases that fit in context, or make many small changes interactively, it's not really great for that (though it can handle it). It'll just spend most of its time poking around the codebase when the whole thing should have just been loaded... (Too bad there's no small-repo mode. I made a startup hook that just cats the directory into context, but yeah, it should be a toggle.)
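For what it's worth, the dump script behind such a hook can be tiny - something like this (the filtering and size cap are arbitrary; wire it into whatever hook mechanism your tool exposes):

    # dump_repo.py - hypothetical helper a session-start hook could run to print a
    # whole small repo into the agent's context. Skips binaries and oversized files.
    from pathlib import Path

    MAX_BYTES = 200_000  # arbitrary cap so a stray artifact doesn't blow up the context

    for path in sorted(Path(".").rglob("*")):
        if path.is_dir() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries / unreadable files
        if len(text) > MAX_BYTES:
            continue
        print(f"\n===== {path} =====\n{text}")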
If you switch to Codex you will get a lot of tokens for $200, enough to more consistently use high reasoning as well. Cursor is simply far more expensive so you end up using less or using dumber models.
Claude Code is overrated: it uses many of its features and modalities to compensate for model shortcomings, and those are not as necessary for steering state-of-the-art models like GPT 5.2.
I think this is a total misunderstanding of Anthropic’s place in the AI race. Opus 4.5 is absolutely a state of the art model. I won’t knock anyone for preferring Codex, but I think you’re ignoring official and unofficial benchmarks.
See: https://artificialanalysis.ai
> Opus 4.5 is absolutely a state of the art model.
> See: https://artificialanalysis.ai
The field moves fast. Per artificialanalysis, Opus 4.5 is currently behind GPT-5.2 (x-high) and Gemini 3 Pro. Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
Totally; however, OP's point was that Claude had to compensate for deficiencies versus a state-of-the-art model like GPT 5.2. I don't think that's correct. Whether or not Opus 4.5 is actually #1 on these benchmarks, it is clearly very competitive with the other top-tier models. I didn't take "state of the art" here to narrowly mean #1 on a given benchmark, but rather near or at the frontier of current capabilities.
One thing to remember when comparing ML models of any kind is that single value metrics obscure a lot of nuance and you really have to go through the model results one by one to see how it performs. This is true for vision, NLP, and other modalities.
https://lmarena.ai/leaderboard/webdev
LM Arena shows Claude Opus 4.5 on top
I wonder how model competence and/or user preference on web development (that leaderboard) carries over to more complex and larger projects, or more generally to anything other than web development?
In addition to whatever they are exposed to as part of pre-training, it'd be interesting to know what kinds of coding tasks these models are being RL-trained on. Are things like web development and maybe Python/ML coding overemphasized, or are they also being trained on things like Linux/Windows/embedded development, etc., in different languages?
https://x.com/giansegato/status/2002203155262812529/photo/1
https://x.com/METR_Evals/status/2002203627377574113
> Even Google's cheaper Gemini 3 Flash model seems to be slightly ahead of Opus 4.5.
What an insane take for anybody who uses these models daily.
Yes, I personally feel that the "official" benchmarks are increasingly diverging from the everyday reality of using these models. My theory is that we are reaching a point where all the models are intelligent enough for day-to-day queries, so points like style/personality and proper use of web queries and other capabilities are better differentiators than intelligence alone.
is x-high fast enough to use as a coding agent?
Yes, if you parallelize your work, which you must learn to do if you want the best quality
What am I missing? As suspicious as benchmarks are, your link shows GPT 5.2 to be superior.
It is also out of date as it does not include 5.2 Codex.
Per my point about steerability being compensated for by modalities and other harness features: Opus 4.5 scores 58% while GPT 5.2 scores 75% on the instruction-following benchmark in your link! Thanks for the hard evidence - GPT 5.2 is roughly 30% ahead of Opus 4.5 there. No wonder Claude Code needs those harness features for the user to manually rein in control over its instruction-following capability.
I disagree. The Claude models seem the best at tool calling, Opus 4.5 seems the smartest, and Claude Code (+ a Claude model) seems to make good use of subagents and planning in a way that Codex doesn't.
Opus 4.5 is so bad at instruction following (30% worse per benchmark shared above) that it requires a manual toggle for plan mode.
GPT 5.2 simply obeys instruction to assemble a plan and avoids the need to compensate for poor steerability that would require the user to manually manage modalities.
Opus has improved, though, so plan mode is less necessary than it was before, but it is still far behind state-of-the-art steerability.
I noticed that despite really liking Karpathy and the blog, I am kind of wincing/involuntarily reacting to the LLM-like "It's not X, it's Y" phrases:
> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer
> it's not just about the image generation itself, it's about the joint capability coming from text generation
There would have been no reaction from me to this 3 years ago, but now this sentence structure is ruined for me.
I used to use a lot of em dashes normally in my writing - they were my go-to replacements for commas and semicolons
But I had to change how I write because people started calling my writing “AI generated”
2026 will be the year of the ;
Please no that's my go to
so you switched to using hyphens instead?
En dashes!
Very broadly, AI sentence structure and word choice are recursing back into society, changing how humans use language. The Economist recently had a piece on the word usage of British Parliament members; they are adopting words and phrases commonly seen in AI output.
We're embarking on a ginormous planetary experiment here.
You’re absolutely right!
Jk jk, now that you pointed it out I can’t unsee it.
I hated these sentences way before LLMs, at least in the context of an explanation.
> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer
This type of sentence I call rhetorical fat. Get rid of the fat and you obtain a boring sentence that repeats what was said in the previous one.
Not all rhetorical fat is equal, and I must admit I find myself eyerolling at the "little spirit" part more than at the fatness.
I understand the author wants to decorate things and emphasize key elements, and the hate I feel is caused only by the incompatible projection of my ideals onto a text that doesn't belong to me.
> it's not just about the image generation itself, it's about the joint capability coming from text generation.
That's unjustified conceptual stress.
That could be a legitimate answer to a question ("No, no, it's not just about that, it's more about this"), but this is a text. Maybe the text wants you to be focused, maybe the text wants to hype you; this is the shape of hype without the hype.
"I find image generation is cooler when paired with text generation."
It is not a decoration. Karpathy juxtaposes ChatGPT (which feels like a "better google" to most people) to Claude Code, which, apparently, feels different to him. It's a comparison between the two.
You might find this statement non-informative, but without the two parts there's no comparison. That's really the semantics of the statement Karpathy is trying to express.
ChatGPT-ish "it's not just" is annoying because the first part is usually a strawman, something reader considers trite. But it's not the case here.
Indeed, I was probably grumpy at the time I wrote the comment. I do find some truth in it still.
You're right! The strawman theory is based.
But I think there's more to it: I find the structure of these sentences dislikable (a bit sensationalist for nothing, I don't know, maybe I am still grumpy).
Well, language is subject to a 'fashion' one-upmanship game: people want to demonstrate their sophistication, often by copying some "cool" patterns, but then over-used patterns become "uncool" cliches.
So it might be just a natural reaction to over-use of a particular pattern. This kind of thing has been driving language evolution for millennia. Besides that, pompous style is often used in 'copy' (slogans and ads), which is something most people don't like.
Karpathy should go back to what he does best: educating people about AI on a deep level. Running experiments and sharing how they work, that sort of stuff. It seems lately he is closer to an influencer who reviews AI-based products. Hopefully it is not too late to go back.
I feel this review stuff is more of a side thing / pastime for him. Look at nanochat, for example. My impression is that those are the things he still spends most of his energy on.
After all, he's been an "influencer" for a long time, starting from the "Software 2.0" essay.
Yeah, came to read Karpathy's thoughts, but might as well ask an LLM myself..
I cannot unsee this anymore and it ruins the whole internet experience for me
Same here; I had to configure ChatGPT to stop making these statements. I also had to configure a bunch of other stuff to make it bland when answering questions.
The way to make AI not sound like ChatGPT is to use Claude.
I realized that's what bothered me. It's not "oh my god, they used ChatGPT." But "oh my god, they couldn't even be bothered to use Claude."
It'll still sound like AI, but 90% of the cringe is gone.
If you're going to use AI for writing, it's just basic decency to use the one that isn't going to make your audience fly into a fit of rage every ten seconds.
That being said, I feel very self-conscious using em dashes this decade ;)
If a reader gets angry simply because the author used ChatGPT instead of Claude, then the reader is an idiot.
I don't use LLMs for writing, just factual research stuff. And this would happen even in those questions.
I don't think I've ever noticed someone use an em dash until ChatGPT appeared
https://xkcd.com/3126/
I mostly use them in Telegram because it auto-converts -- into an em dash. They are a pain to type everywhere else though!
I love em dashes—they basically indicate a more deliberate pause than a … without the tight vibes of a semicolon.
Same, I cringe when I read this structure.
It's not text - it's clickbait distilled to grammar.
I appreciate Andrej’s optimistic spirit, and I am grateful that he dedicates so much of his time to educating the wider public about AI/LLMs. That said, it would be great to hear his perspective on how 2025 changed the concentration of power in the industry, what’s happening with open-source, local inference, hardware constraints, etc. For example, he characterizes Claude Code as “running on your computer”, but no, it’s just the TUI that runs locally, with inference in the cloud. The reader is left to wonder how that might evolve in 2026 and beyond.
The CC point is more about the data, environment, and general configuration context, not compute and where it happens to run today. The cloud setups are clunky because of context and UI/UX user-in-the-loop considerations, not because of compute considerations.
Agree with the GP, though -- you ought to make that clearer. It really reads like you're saying that CC runs locally, which is confusing since you obviously know better.
I think we need to shift our mindset on what an agent is. The LLM is a brain in a vat connected far away. The agent sits on your device, as a mech suit for that brain, and can pretty much do damn near anything on that machine. It's there, with you. The same way any desktop software is.
Yeah, I made some edits to clarify.
From what I can gather, llama.cpp supports Anthropic's message format now[1], so you can use it with Claude Code[2].
[1]: https://github.com/ggml-org/llama.cpp/pull/17570
[2]: https://news.ycombinator.com/item?id=44654145
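If it works the way the PR reads, the local server accepts Anthropic-style Messages requests; a rough sketch of poking it by hand (the port and the /v1/messages path are assumptions):

    # Rough sketch: sending an Anthropic Messages-style request to a local
    # llama.cpp server. Adjust the URL to however your build exposes it.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/messages",
        json={
            "model": "local-model",  # typically ignored or aliased by the server
            "max_tokens": 256,
            "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        },
        timeout=60,
    )
    print(resp.json())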
One of the most interesting coding agents to run locally is actually OpenAI Codex, since it has the ability to run against their gpt-oss models hosted by Ollama.
Or 120b if you can fit the larger model. What do you find interesting about it, and how does it compare to commercial offerings?
It's rare to find a local model that's capable of running tools in a loop well enough to power a coding agent.
I don't think gpt-oss:20b is strong enough to be honest, but 120b can do an OK job.
Nowhere NEAR as good as the big hosted models though.
Think of it as the early years of UNIX & the PC. Running inference and tools locally and offline opens doors to new industries. We might not even need the client/server paradigm locally. An LLM is just a probabilistic library we can call.
Thanks.
What he meant was, agents will probably not be these web abstractions that run in deployed services (LangChain, Crew); agents here meaning the harnesses (software wrappers) specifically that call the LLM API.
It runs on your computer because of its tooling. It can call Bash. It can literally do anything on the operating system and file system. That's what makes it different. You should think of it like a mech suit. The model is just the brain in a vat connected far away.
The section on Claude Code is very ambiguously and confusingly written, I think he meant that the agent runs on your computer (not inference) and that this is in contrast to agents running "on a website" or in the cloud:
> I think OpenAI got this wrong because I think they focused their codex / agent efforts on cloud deployments in containers orchestrated from ChatGPT instead of localhost. [...] CC got this order of precedence correct and packaged it into a beautiful, minimal, compelling CLI form factor that changed what AI looks like - it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer. This is a new, distinct paradigm of interaction with an AI.
However, if so, this is definitely a distinction that needs to be made far more clearly.
Well Microsoft had thier "localhost" AI before CC but that was a ghost without a clear purpose or skill.
> In the same way, LLMs should speak to us in our favored format - in images, infographics, slides, whiteboards, animations/videos, web apps, etc.
You think every Electron app out there re-inventing application UX from scratch is bad, wait until LLMs are generating their own custom UX for every single action for every user for every device. What does command-W do in this app? It's literally impossible to predict, try it and see!
On the other side of the spectrum, I see some of the latest agents, like Codex, take care to get accessibility right -- something not even many humans bother to do.
It's an extension of how I've noticed that AIs will generally write very buttoned-down, cross-the-ts-and-dot-the-is code. Everything gets commented, every method has a try-catch with a log statement, every return type is checked, etc. I think it's a consequence of them not feeling fatigue. These things (accessibility included) are all things humans generally know they 'should' do, but there never seems to be enough time in the day; we'll get to it later when we're less tired. But the ghost in the machine doesn't care. It operates at the same level all the time
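For illustration, the kind of belt-and-suspenders style I mean (a made-up snippet, not from any real agent transcript):

    import logging

    logger = logging.getLogger(__name__)

    def load_user_count(path: str) -> int:
        """Read a user count from a file, returning 0 on any failure."""
        try:
            with open(path, encoding="utf-8") as handle:
                value = int(handle.read().strip())
        except (OSError, ValueError) as exc:
            logger.warning("Could not read user count from %s: %s", path, exc)
            return 0
        if value < 0:
            logger.warning("Negative user count %d in %s, clamping to 0", value, path)
            return 0
        return value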
>our favored format - in images, infographics, slides, whiteboards, animations/videos, web apps, etc
If you look at how humans actually communicate I'd guess #1 is text/speech, #2 pictures
But that's exactly what an LLM solves.
It's the best UI ever.
It understands a lot of languages and abstract concepts.
It will not be necessary at all to let LLMs generate random UIs.
I'm not a native English speaker. I sometimes just throw in a German word and it just works.
The distinction Karpathy draws between "growing animals" and "summoning ghosts" via RLVR is the mental model I didn't know I needed to explain the current state of jagged intelligence. It perfectly articulates why trust in benchmarks is collapsing; we aren't creating generally adaptive survivors, but rather over-optimizing specific pockets of the embedding space against verifiable rewards.
I’m also sold on his take on "vibe coding" leading to ephemeral software; the idea of spinning up a custom, one-off tokenizer or app just to debug a single issue, and then deleting it, feels like a real shift.
> The distinction Karpathy draws between "growing animals" and "summoning ghosts" via RLVR
I don't see these descriptions as very insightful.
The difference between general/animal intelligence and jagged/LLM intelligence is simply that humans/animals really ARE intelligent (the word was created to describe this human capability), while LLMs are just echoing narrow portions of the intelligent output of humans (those portions that are amenable to RLVR capture).
For an artificial intelligence to be intelligent in its own right, and therefore be generally intelligent, it would need - like an animal - to be embodied (even if only virtually), autonomous, predicting the outcomes of its own actions (not auto-regressively trained), learning incrementally and continually, built with innate traits like curiosity and boredom to put and keep itself in learning situations, etc.
Of course not all animals are generally intelligent - many (insects, fish, reptiles, many birds) just have narrow "hard coded" instinctual behaviors, but others like humans are generalists whom evolution has therefore honed for adaptive lifetime learning and general intelligence.
> I’m also sold on his take on "vibe coding" leading to ephemeral software; the idea of spinning up a custom, one-off tokenizer or app just to debug a single issue, and then deleting it, feels like a real shift.
We should keep in mind that our LLM use is currently subsidized. When the money dries up and we have to pay the real prices, I’ll be interested to see whether we can still consider whipping up one-time apps as basically free.
I've been doing it for months, it's lovely
https://tech.lgbt/@graeme/115749759729642908
It's a stack based on finishing the job Jupyter started. Fences as functions, callable and composable.
Same shape as an MCP. No training required, just walk them through the patterns.
Literally, it's spatially organized. Turns out a woman named Mrs Curwen and I share some thoughts on pedagogy.
There does in fact exist a functor that maps 18th century piano instruction to context engineering. We play with it.
Notable omission: 2025 is also when the ghosts started haunting the training data. Half of X replies are now LLMs responding to LLMs. The call is coming from inside the dataset.
Any tips to spot this? I want to avoid arguing with an X bot.
Really easy: don't argue on the internet. The approach has many benefits.
Also, don't use X.
also, please just do not use X
Ok, fine, but do you have a better way to build a bot following and expose oneself to trending MAGA memes?
“truth” social :)
What is the current state-of-the-art workflow when working with legacy code across multiple languages?
This would be a 100 kLOC legacy project written in C++, Python, and jQuery-era JavaScript circa 2010. The original devs have long left. I would rather avoid C++ as much as possible.
I've been a GitHub Copilot (in VS Code) user since June 2021 and still use it heavily, but the "more powerful IntelliSense" approach is limiting me on legacy projects.
Presumably I need to provide more context on larger projects.
I can get pretty far with just ChatGPT Plus and feeding it bits and pieces of the project. However, that seems like using the wrong tool.
Codex seems better for building things but not sure about grokking existing things.
Would Cursor be more suitable for just dumping in the whole project (all languages, basically 4 different sub-projects) and then selectively activating what to include in queries?
I don't understand; the agent mode of Copilot will search for and be pretty good at filling its own context, AFAIK. I never really feed any of our 100k+ line legacy codebase explicitly to the LLM.
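One low-tech complement to agent mode on a sprawling multi-language codebase is to generate a compact overview yourself and paste it into the conversation as orientation. A minimal sketch, with the root path and extension map as placeholders for your project:

    # Summarize a legacy repo by language and by largest files, so the overview can be
    # pasted into a prompt. The root path and extension map are placeholders.
    import os
    from collections import Counter

    ROOT = "path/to/legacy/project"
    EXTS = {".cpp": "C++", ".h": "C++", ".py": "Python", ".js": "JavaScript"}

    lines_per_lang, files = Counter(), []
    for dirpath, _, names in os.walk(ROOT):
        for name in names:
            lang = EXTS.get(os.path.splitext(name)[1])
            if lang is None:
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors="ignore") as f:
                n = sum(1 for _ in f)
            lines_per_lang[lang] += n
            files.append((n, path))

    print("Lines per language:", dict(lines_per_lang))
    print("Top 20 largest files:")
    for n, path in sorted(files, reverse=True)[:20]:
        print(f"  {n:>7}  {path}")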
Excellent, more grounded review. A few questions:
> LLMs are emerging as a new kind of intelligence, simultaneously a lot smarter than I expected and a lot dumber than I expected
Isn't this concerning? How can we know which one we get? In the realm of code it's easier to tell when mistakes are being made.
> regular people benefit a lot more from LLMs compared to professionals, corporations and governments
We thought this would happen with things like AppleScript, VB, visual programming. But instead, AI is currently used as a smarter search engine. The issue is that's also the area where it hallucinates the most. What do you think is the solution?
I would love Andrej's take on the fast models we got this year. Gemini 3 flash and Grok 4 fast have no business being as good + cheap + fast as they are. For Andrej's prediction about LLMs communicating with us via a visual interface we're going to need fast models, but I feel like AI twitter/HN has mostly ignored these.
Just guessing here, but these small models may well be essentially distillations of larger ones, with this being where their power comes from: e.g., use a large model to generate synthetic reasoning traces, then train a small model on those.
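A minimal sketch of that two-step recipe, purely to illustrate the idea; the teacher call and the JSONL format are hypothetical placeholders, not any vendor's actual pipeline:

    # Step 1 of a distillation pipeline: use a large "teacher" model to produce reasoning
    # traces for a set of prompts and write them out as supervised fine-tuning data.
    # Step 2 (not shown) is ordinary fine-tuning of the small "student" model on this file.
    import json

    def teacher_generate(prompt: str) -> str:
        """Placeholder for a call to the large model (swap in your vendor's SDK)."""
        return "...reasoning trace and final answer from the teacher model..."

    prompts = ["Why is the sky blue?", "What is 17 * 24? Show your working."]

    with open("distill_sft.jsonl", "w") as f:
        for p in prompts:
            trace = teacher_generate(p)  # reasoning trace + final answer
            f.write(json.dumps({"prompt": p, "completion": trace}) + "\n")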
check out Sasha Luccioni
Do you have a link to anything they wrote about this?
It’s funny how every podcaster/public AI figure is so certain that text as a UI will go away, and yet it’s not going anywhere.
A few days ago I was trying to unsubscribe from a service (notably an AI 3D modeling tool that I was curious about).
I spent 5 minutes trying to find a way to unsubscribe and couldn't. Finally, I found it buried in the plan page as one of those low-contrast ellipses on the plan card.
Instead of unsubscribing me or taking me to a form, it opened a conversation with an AI chatbot with a preconfigured "unsubscribe" prompt. I have never felt more angry with a UI: I had to waste more time talking to a robot before it would render the unsubscribe button in the chat.
Why would we bring the most hated feature of automated phone calls to apps? As a frontend engineer I am horrified by these trends.
It's probably increased during my lifetime. People used to talk, now they sit and text into smartphones.
There might be some confusion about the transition to what some call the post-literate era: an era where text is not the primary medium. That’s not necessarily bad, because you get the advantages of other mediums - oral and visual - but it is something to keep in mind.
I'm a bit skeptical that a post-literate era is happening. I gather it appears in some sci-fi, but I don't see much sign of it in reality. I mean, here we are on a text-only site. If anything we seem to be heading for a 100% literate society. Literacy graphs here: https://ourworldindata.org/grapher/cross-country-literacy-ra...
I don’t think the post-literate era means that text will disappear. I think it’s just not going to be dominant anymore, but I also have my reservations, since I do prefer the text medium.
I think one of the things that is missing from this post is engaging a bit in trying to answer: what are the highest priority AI-related problems that the industry should seek to tackle?
Karpathy hints at one major capability unlock being UI generation, so instead of interacting with text the AI can present different interfaces depending on the kind of problem. That seems like a severely underexplored problem domain so far. Who are the key figures innovating in this space so far?
In the most recent Demis interview, he suggests that one of the key problems that must be solved is online / continuous learning.
Aside from that, another major issue is probably reducing hallucinations and increasing reliability. Ideally you should be able to deploy an LLM to work on a problem domain, and if it encounters an unexpected scenario it reaches out to you in order to figure out what to do. But for standard problems it should function reliably 100% of the time.
Google is doing that with A2UI. The LLM will be able to decide how to present info to the user.
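As a toy illustration of that pattern - this is not A2UI's actual schema; the spec format and renderer below are made up for the example - the model returns a declarative UI description instead of prose, and the client renders it:

    # Made-up declarative UI spec, as if returned by the model, plus a trivial "renderer".
    # Not the real A2UI format; just the general shape of model-chosen presentation.
    import json

    model_output = json.loads("""
    {
      "type": "card",
      "title": "Flights BER -> LIS, 14 Mar",
      "children": [
        {"type": "text", "value": "Cheapest nonstop: 09:40, 120 EUR"},
        {"type": "button", "label": "Book this flight", "action": "book_cheapest"}
      ]
    }
    """)

    def render(node, indent=0):
        pad = "  " * indent
        if node["type"] == "card":
            print(f"{pad}[{node['title']}]")
            for child in node["children"]:
                render(child, indent + 1)
        elif node["type"] == "text":
            print(pad + node["value"])
        elif node["type"] == "button":
            print(f"{pad}({node['label']}) -> {node['action']}")

    render(model_output)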
> I like this version of the meme for pointing out that human intelligence is also jagged in its own different way.
The idea of jaggedness seems useful for advancing epistemology. If we could identify the domains that have useful data we fail to extract, we could fill those holes and eventually become a general intelligence ourselves. The task may be as hard as making a list of your own blind spots. But now we have an alien intelligence with an outside perspective. While we make AI less jagged, it might return the favor.
If we keep inventing different kinds of intelligence the sum of the splats may eventually become well rounded.
I don't think it will become well rounded, because well-roundedness isn't what cost pressure selects for. Intelligence is sensitive to cost; cost is the core constraint shaping it. Any action has a cost - energy, materials, time, opportunity, or social. Intelligence is solving the cost equation; if we can't solve it, we die. Cost is also why we specialize: in a group we can offload some intelligence to others. LLMs also have their own costs and are shaped by them into some kind of jagged intelligence; they are no spherical cows either.
> In this world view, nano banana is a first early hint of what that might look like.
What is he referring to here? Is nano banana not just an image gen model? Is it because it's an LLM-based one, and not diffusion?
What's interesting about Nano Banana (and even more so video models like Veo 3) is that they act as a weird kind of world model when you consider that they accept images as input and return images as output.
Give it an image of a maze, it can output that same image with the maze completed (maybe).
There's a fantastic article about that for image-to-video models here: https://video-zero-shot.github.io/
> We demonstrate that Veo 3 can zero-shot solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and much more.
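A rough sketch of that image-in/image-out loop using the google-genai SDK; the model id, prompt, and file names are assumptions, and the response parsing follows the SDK's documented image examples, so check it against current docs:

    # Send a maze image to an image-generation model and save whatever image it returns.
    # Assumes the google-genai SDK and a GEMINI_API_KEY in the environment; the model id
    # shown here is an assumption and may differ.
    from io import BytesIO
    from google import genai
    from PIL import Image

    client = genai.Client()
    maze = Image.open("maze.png")

    resp = client.models.generate_content(
        model="gemini-2.5-flash-image",
        contents=[maze, "Return this exact maze with the solution path drawn in red."],
    )

    for part in resp.candidates[0].content.parts:
        if part.inline_data is not None:  # image parts come back as inline bytes
            Image.open(BytesIO(part.inline_data.data)).save("maze_solved.png")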
I think he is referring to capability, not architecture, and saying that NB is at the point where it is suggestive of the near-future capability of using GenAI models to create their own UI as needed.
NB (Gemini 2.5 Flash Image) isn't the first major-vendor LLM-based image gen model, after all; GPT Image 1 was first.
The bit about o3 being the turning point is very interesting. I heard someone say that o3 (or perhaps the cheaper o4-mini) should have been called gpt-5, and that people would have been mind blown. Instead it kind of went under the radar as far as the mainstream goes.
Whereas we just got the incremental progress with gpt-5 instead and it was very underwhelming. (Plus like 5 other issues at launch, but that's a separate story ;)
I'm not sure if o4-mini would have made a good default gpt though. (Most use is conversational and its language is very awkward.) So they could have just called it gpt-5 pro or something, and put it on the $20 tier. I don't know.
I agree with this FWIW; for many months I talked to people who never used o3 and didn’t know what it was because it sounded weird. Maybe it wasn’t obvious at the time, but that was a good major point release to make then.
Here's the source for the jagged spiky intelligence diagram:
https://x.com/colin_fraser/status/1994235521812328695
https://karpathy.bearblog.dev/the-space-of-minds/
Beyond graduating students, I see model labs as “accelerators/incubators” bundling, launching, and productizing observed ideas that gain traction. The sheer strength of their platforms, the number of eyes watching them, near-zero marginal costs, and seemingly unlimited budgets mean that only slow decision-making can prevent them from becoming the next Amazons of everything.
Something I’ve been thinking about is how, as end-stage users (e.g. building our own “thing” on top of an LLM), we can broadly verify it’s doing what we need without benchmarks. Does a set of custom evals built out over time solve this? Is there more we can do?
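One common answer is exactly that: a small, growing suite of domain-specific checks run on every prompt or model change. A minimal sketch, where ask_my_app is a placeholder for whatever sits on top of the LLM in your product:

    # Tiny custom-eval harness: each case pairs an input with a check on the app's output.
    # Track the pass rate over time as you change prompts or models.
    def ask_my_app(question: str) -> str:
        raise NotImplementedError  # call your LLM-backed product code here

    CASES = [
        ("What's your refund window?", lambda out: "30 days" in out),
        ("Translate 'Guten Morgen' to English", lambda out: "good morning" in out.lower()),
    ]

    passed = 0
    for question, check in CASES:
        try:
            ok = bool(check(ask_my_app(question)))
        except Exception:
            ok = False
        passed += ok
        print("PASS" if ok else "FAIL", "-", question)

    print(f"{passed}/{len(CASES)} cases passed")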
xposted to https://x.com/karpathy/status/2002118205729562949
And also accessible sans login via https://xcancel.com/karpathy/status/2002118205729562949.
LLMs still need to bring clear added value to enterprise and corporate work; otherwise, they remain a geek’s toy.
Big media agencies that claim to use AI rely on strong creative teams who fine-tune prompts and spend weeks doing so. Even then, they don’t fully trust AI to slice long videos into shorter clips for social media.
Heavy administrative functions like HR or Finance still don’t get approval to expose any of their data to LLMs.
What I’m trying to say is that we are still in the early stages of LLM development, and as promising as this looks, it’s still far from delivering the real value that is often claimed.
I think their non-deterministic nature is what’s making it difficult to adopt. It’s hard to train somebody in the old way of “if you see this, do this” because when you call the LLM twice you most likely get different results.
It took a long time to computerize businesses and it might take some time to adopt/adapt to LLMs.
Vibe coding is sufficient for job hoppers who never finish anything and leave when the last 20% have to be figured out. Much easier to promote oneself as an expert and leave the hard parts to other people.
I’ve found incredible productivity gains writing (vibe coding) tools for myself that will never need to be “productionised” or even used by another person. Heck even I will probably never use the latest log retrieval tool, which exists purely for Claude code to invoke it. There is a ton of useful software yet to be written for which there _is_ no “last 20%”.
These tools are so useful and make you so much more "productive" that you don't think anyone else would want to pay anything for them huh? Did your boss at least give you a big raise for your "productivity" increase, or maybe lay off some of your underperforming coworkers bc you are just so much better now?
Not all software is meant to be open-source, in production, and working on 100 platforms.
Sometimes the point of the software is to make an app with 2 buttons for your mom to make her grocery shopping easier.
Do you mean vibe coding as-in producing unreviewed code with LLMs and prompting at it until it appears to work, or vibe coding as a catch-all for any time someone uses AI-assistance to help them write code?
Karpathy uses the term for all of this in the exuberant paragraph 5 of his blog post.
Friendly reminder: There is no ghost in the machine. It is a system executing code, not a being having thoughts. Let’s admire the tool without projecting a personality onto it.
For me, that’s kind of the point. It’s similar to how the characters in a novel don’t really exist, and yet you can’t really discuss what happens in a novel without pretending that they do. It doesn’t really make sense to treat the author’s motivations and each character’s motivations as the same.
Similarly, we’re all talking to ghosts now, which aren’t real, and yet there is something there that we can talk about. There are obvious behavioral differences depending on what persona the LLM is generating text for.
I also like the hint of danger in “talking to ghosts.” It’s difficult to see how a rational adult could be in any danger from just talking, but I believe the news reports that some people who get too deep into it get “possessed.”
Consciousness is weird and nobody understands it. There is no good reason to assume that these systems have it. But there is also no good reason to rule it out.
That’s the old way of thinking about it. There is a new way.
You sound as if you have grounds for certainty about this. What are they?
find on page:slop=0
tl;dr: seems like LLMs are maturing on the product side and for day-to-day usage.