There's a broad misunderstanding here. Context could be infinite, but the real bottleneck is understanding intent late in a multi-step operation. A human can effectively discard or disregard prior information as the narrow window of focus moves to a new task; LLMs seem incredibly bad at this.
Having more context while remaining unable to focus effectively on the latest task is the real problem.
I think that's the real issue. If the LLM spends a lot of context investigating a bad solution and you redirect it, I notice it has trouble ignoring maybe 10K tokens of bad exploration context in favor of my 10 lines of 'No, don't do X, explore Y' instead.
I think the general term for this is "context poisoning" and is related but slightly different to what the poster above you is saying. Even with a "perfect" context, the LLM still can't infer intent.
So this is where having subagents fed specific, curated context is a help. As long as the "poisoned" agent can focus long enough to generate a clean request to the subagent, the subagent works poison-free. This is much more likely to succeed than a single-agent setup, given the token-by-token process of a transformer.
The same protection works in reverse: if a subagent goes off the rails and either self-aborts or is aborted, that large context is truncated to the abort response, which is "salted" with the fact that this was stopped. Even if the subagent goes sideways and still returns success (say, separate dev, review, and test subagents), the main agent has another opportunity to compare the response and the product against the main context, or to instruct a subagent to do it in an isolated context.
Not perfect at all, but better than a single context.
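A rough sketch of what I mean, with a hypothetical call_llm(messages) standing in for whatever chat client you actually use (not a real API): the main agent's only job is to write a clean, self-contained brief, and the subagent runs on a fresh message list so none of the poisoned exploration leaks across.

```python
def call_llm(messages):
    """Placeholder for whatever chat-completion client you actually use."""
    raise NotImplementedError

def run_subagent(task_brief):
    # Fresh context: only the curated brief, none of the parent's history.
    messages = [
        {"role": "system", "content": "You are a focused coding subagent."},
        {"role": "user", "content": task_brief},
    ]
    return call_llm(messages)

def delegate(main_context, goal):
    # The (possibly polluted) main agent only has to do one thing well:
    # write a clean, self-contained brief for the subagent.
    brief = call_llm(main_context + [{
        "role": "user",
        "content": f"Write a self-contained task brief for a subagent to: {goal}. "
                   "Include only the facts it needs; omit abandoned approaches.",
    }])
    result = run_subagent(brief)
    # Only the subagent's (short) answer re-enters the main context.
    main_context.append({"role": "assistant", "content": f"Subagent result: {result}"})
    return result
```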
One other thing: there is some consensus that "don't", "not", and "never" are not always functional in context. And that is a big problem. Anecdotally and experimentally, many (including myself) have seen the agent diligently performing the exact thing following a "never" once it gets far enough back in the context. Even when it's a less common action.
that's because a next token predictor can't "forget" context. That's just not how it works.
You load the thing up with relevant context and pray that it guides the generation path to the part of the model that represents the information you want, and pray that the path of tokens through the model outputs what you want.
That's why they have a tendency to go ahead and do things you tell them not to do.
also IDK about you but I hate how much praying has become part of the state of the art here. I didn't get into this career to be a fucking tech priest for the machine god. I will never like these models until they are predictable, which means I will never like them.
This is where the distinction between “an LLM” and “a user-facing system backed by an LLM” becomes important; the latter is often much more than a naive system for maintaining history and reprompting the LLM with added context from new user input, and could absolutely incorporate a step which (using the same LLM with different prompting or completely different tooling) edited the context before presenting it to the LLM to generate the response to the user. And such a system could, by that mechanism, “forget” selected context in the process.
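Sketching that editing step (again with a hypothetical call_llm wrapper, not any particular vendor's API): a cheap pass asks which turns are still load-bearing and drops the rest before the "real" generation.

```python
def prune_context(history, call_llm):
    """Ask the model (or a cheaper one) which turns are still load-bearing,
    then rebuild the history from only those turns."""
    numbered = "\n".join(f"[{i}] {m['role']}: {m['content'][:200]}"
                         for i, m in enumerate(history))
    reply = call_llm([{
        "role": "user",
        "content": "List the indices of the turns still needed to answer the "
                   "latest user message, as comma-separated numbers only:\n" + numbered,
    }])
    keep = {int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()}
    keep.add(len(history) - 1)  # always keep the latest user turn
    return [m for i, m in enumerate(history) if i in keep]
```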
Yeah I start a new session to mitigate this. Don’t keep hammering away - close the current chat/session whatever and restate the problem carefully in a new one.
I've had great luck with asking the current session to "summarize our goals, conversation, and other relevant details like git commits to this point in a compact but technically precise way that lets a new LLM pick up where we're leaving off".
The new session throws away whatever behind-the-scenes context was causing problems, but the prepared prompt gets the new session up and running more quickly especially if picking up in the middle of a piece of work that's already in progress.
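In script form the hand-off is just two steps, one call against the old session and a fresh message list for the new one (sketch only; call_llm is a stand-in for your client):

```python
HANDOFF_PROMPT = (
    "Summarize our goals, conversation, and other relevant details like git "
    "commits to this point in a compact but technically precise way that lets "
    "a new LLM pick up where we're leaving off."
)

def hand_off(old_session, call_llm):
    summary = call_llm(old_session + [{"role": "user", "content": HANDOFF_PROMPT}])
    # The new session starts clean: no stale exploration, just the briefing.
    return [
        {"role": "system", "content": "You are continuing the work described below."},
        {"role": "user", "content": summary},
    ]
```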
Wow, I had useless results asking “please summarize important points of the discussion” from ChatGPT. It just doesn’t understand what’s important, and instead of highlighting pivotal moments of the conversation it produces a high-level introduction for a non-practitioner.
Honestly, I just type out something by hand that is roughly like what I quoted above - I'm not big on keeping prompt libraries.
I think the important part is to give it (in my case, these days "it" is gpt-5-codex) a target persona, just like giving it a specific problem instead of asking it to be clever or creative. I've never asked it for a summary of a long conversation without the context of why I want the summary and who the intended audience is, but I have to imagine that helps it frame its output.
There should be a simple button that allows you to refine the context. A fresh LLM could generate a new context from the inputs and outputs of the chat history, then another fresh LLM can start over with that context.
You are saying “fresh LLM” but really I think you’re referring to a curated context. The existing coding agents have mechanisms to do this. Saving context to a file. Editing the file. Clearing all context except for the file. It’s sort of clunky now but it will get better and slicker.
"that's because a next token predictor can't "forget" context. That's just not how it works."
An LSTM is also a next-token predictor and literally has a forget gate, and there are many other context-compressing models too, which remember only what they think is important and forget the less important: state-space models or RWKV, for example, which also work well as LLMs.
But even the basic GPT model forgets old context, since it gets truncated if it cannot fit; that's just not the learned, smart forgetting the other models do.
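For reference, the forget gate is literally a learned per-element scale on the carried cell state; here's a numpy sketch of the standard LSTM step equations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)        # forget gate: 0 = drop, 1 = keep
    i = sigmoid(W_i @ z + b_i)        # input gate
    c_tilde = np.tanh(W_c @ z + b_c)  # candidate new memory
    c = f * c_prev + i * c_tilde      # old memory is scaled down element-wise by f
    o = sigmoid(W_o @ z + b_o)        # output gate
    h = o * np.tanh(c)
    return h, c
```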
That's not how attention works, though; it should be perfectly able to figure out which parts are important and which aren't. The problem is that it doesn't really scale beyond small contexts and works on a token-to-token basis instead of being hierarchical with sentences, paragraphs and sections. The only models that actually do long context do so by skipping attention layers or doing something without attention or without positional encodings, all leading to shit performance. Nobody pretrains on more than like 8k, except maybe Google, who can throw TPUs at the problem.
You can rewrite the history (but there are issues with that too). So an agent can forget context. Simply don't feed in part of the context on the next run.
Relax friend! I can't see why you'd be peeved in the slightest! Remember, the CEOs have it all figured out and have 'determined' that we don't need all those eyeballs on the code anymore. You can simply 'feed' the machine and do the work of forty devs! This is the new engineering! /s
It seems possible for OpenAI/Anthropic to rework their tools so they discard/add relevant context on the fly, but it might have some unintended behaviors.
The main thing is people have already integrated AI into their workflows so the "right" way for the LLM to work is the way people expect it to. For now I expect to start multiple fresh contexts while solving a single problem until I can set up a context that gets the result I want. Changing this behavior might mess me up.
> rework their tools so they discard/add relevant context on the fly
That may be the foundation for an innovation step in model providers. But you can achieve a poor man’s simulation if you can determine, in retrospect, when a context was at peak for taking turns, and when it got too rigid, or too many tokens were spent, and then simply replay the context up until that point.
I don’t know if evaluating when a context is worth duplicating is a thing; it’s not deterministic, and it depends on enforcing a certain workflow.
A number of agentic coding tools do this. Upon an initial request for a larger set of actions, it will write a markdown file with its "thoughts" on its plan to do something, and keep notes as it goes. They'll then automatically compact their contexts and re-read their notes to keep "focused" while still having a bit of insight on what it did previously and what the original ask was.
Claude Code has /init and /compact that do this. It doesn’t recreate the context as-is, but creates a context that is presumed to be functionally equivalent. I find that’s not the case and that building up from very little stored context and a lot of specialised dialogue works better.
Not that this shouldn't be fixed in the model, but you can jump to an earlier point in Claude Code and on web chat interfaces to get it out of the context; it's just that sometimes you have other important stuff you don't want it to lose.
The other issue with this is that if you jump back and it has edited code, it loses the context of those edits. It may have previous versions of the code in memory and no knowledge of the edits leading to other edits that no longer align. Often it's better to just /clear. :/
IMO specifically OpenAI's models are really bad at being steered once they've decided to do something dumb. Claude and OSS models tend to take feedback better.
GPT-5 is brilliant when it oneshots the right direction from the beginning, but pretty unmanageable when it goes off the rails.
You don't want to discard prior information though. That's the problem with small context windows. Humans don't forget the original request as they ask for more information or go about a long task. Humans may forget parts of information along the way, but not the original goal and important parts. Not unless they have comprehension issues or ADHD, etc.
This isn't a misconception. Context is a limitation. You can effectively have an AI agent build an entire application with a single prompt if it has enough (and the proper) context. The models with 1m context windows do better. Models with small context windows can't even do the task in many cases. I've tested this many, many, many times. It's tedious, but you can find the right model and the right prompts for success.
Asking, not arguing, but: why can't they? You can give an agent access to its own context and ask it to lobotomize itself like Eternal Sunshine. I just did that with a log ingestion agent (broad search to get the lay of the land, which eats a huge chunk of the context window, then narrow searches for weird stuff it spots, then go back and zap the big log search). I assume this is a normal approach, since someone else suggested it to me.
This is also the idea behind sub-agents. Claude Code answers questions about things like "where is the code that does X" by firing up a separate LLM running in a fresh context, posing it the question and having it answer back when it finds the answer. https://simonwillison.net/2025/Jun/2/claude-trace/
I'm playing with that too (everyone should write an agent; basic sub-agents are incredibly simple --- just tool calls that can make their own LLM calls, or even just a tool call that runs in its own context window). What I like about Eternal Sunshine is that the LLM can just make decisions about what context stuff matters and what doesn't, which is a problem that comes up a lot when you're looking at telemetry data.
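The "zap" part is tiny if your loop keeps history as a plain list of messages; here's a sketch with a made-up forget_tool_result tool the model can call on its own earlier results (the names are mine, not from any real framework):

```python
history = []  # the agent loop's message list

def record_tool_result(result_id, content):
    history.append({"role": "tool", "id": result_id, "content": content})

def forget_tool_result(result_id):
    """Tool the model can call to drop a bulky earlier result (e.g. the big
    broad log search) once it has extracted what it needed from it."""
    for msg in history:
        if msg.get("id") == result_id:
            msg["content"] = f"[result {result_id} discarded to free context]"
            return "forgotten"
    return "no such result"
```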
I keep wondering if we're forgetting the fundamentals:
> Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
Recursion and memoization, if only as a general approach to solving "large" problems.
I really want to paraphrase kernighan's law as applied to LLMs. "If you use your whole context window to code a solution to a problem, how are you going to debug it?".
By checkpointing once the agent loop has decided it's ready to hand off a solution, generating a structured summary of all the prior elements in the context, writing that to a file, and then marking all those prior context elements as dead so they don't occupy context window space.
Look carefully at a context window after solving a large problem, and I think in most cases you'll see even the 90th percentile token --- to say nothing of the median --- isn't valuable.
However large we're allowing frontier model context windows to get, we've got an integer multiple more semantic space to allocate if we're even just a little bit smart about managing that resource. And again, this is assuming you don't recurse or divide the problem into multiple context windows.
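A minimal version of that checkpoint step, with a hypothetical call_llm helper; the point is the shape, not the API:

```python
def checkpoint(history, call_llm, path="checkpoint.md"):
    # 1. Structured summary of everything that led to the proposed solution.
    summary = call_llm(history + [{
        "role": "user",
        "content": "Produce a structured summary of the work so far: goal, key "
                   "findings, decisions taken, and the proposed solution.",
    }])
    # 2. Persist it outside the context window.
    with open(path, "w") as f:
        f.write(summary)
    # 3. Mark the old elements dead: the next loop starts from the summary alone.
    return [{"role": "user",
             "content": f"Checkpoint (full notes in {path}):\n{summary}"}]
```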
Yes! - and I wish this was easier to do with common coding agents like Claude Code. Currently you can kind of do it manually by copying the results of the context-busting search, rewinding history (Esc Esc) to remove the now-useless stuff, and then dropping in the results.
Of course, subagents are a good solution here, as another poster already pointed out. But it would be nice to have something more lightweight and automated, maybe just turning on a mode where the LLM is asked to throw things out according to its own judgement, if you know you're going to be doing work with a lot of context pollution.
This is why I'm writing my own agent code instead of using simonw's excellent tools or just using Claude; the most interesting decisions are in the structure of the LLM loop itself, not in how many random tools I can plug into it. It's an unbelievably small amount of code to get to the point of super-useful results; maybe like 1500 lines, including a TUI.
And even if you do use Claude for actual work, there is also immense pedagogical value in writing an agent from scratch. Something really clicks when you actually write the LLM + tool calls loop yourself. I ran a workshop on this at my company and we wrote a basic CLI agent in only 120 lines of Python, with just three tools: list files, read file, and (over)write file. (At that point, the agent becomes capable enough that you can set it to modifying itself and ask it to add more tools!) I think it was an eye-opener for a lot of people to see what the core of these things looks like. There is no magic dust in the agent; it's all in the LLM black box.
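Not the workshop code, but the shape of it, as a sketch: a hypothetical call_llm, a crude JSON action protocol instead of native tool calling, and the same three tools.

```python
import json
import pathlib

def call_llm(messages):
    """Placeholder for your chat client of choice."""
    raise NotImplementedError

TOOLS = {
    "list_files": lambda path=".": "\n".join(
        str(p) for p in pathlib.Path(path).rglob("*") if p.is_file()),
    "read_file": lambda path: pathlib.Path(path).read_text(),
    "write_file": lambda path, content: (pathlib.Path(path).write_text(content), "ok")[1],
}

SYSTEM = (
    "You are a coding agent. Reply ONLY with JSON: "
    '{"tool": "list_files|read_file|write_file", "args": {...}} to act, '
    'or {"done": "<final answer>"} when finished.'
)

def run(task, max_steps=20):
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)  # assumes the model returns valid JSON
        messages.append({"role": "assistant", "content": reply})
        action = json.loads(reply)
        if "done" in action:
            return action["done"]
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "user", "content": f"Tool result:\n{result}"})
    return "step limit reached"
```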
I hadn't considered actually rolling my own for day-to-day use, but now maybe I will. Although it's worth noting that Claude Code Hooks do give you the ability to insert your own code into the LLM loop - though not to the point of Eternal Sunshining your context, it's true.
Yeah, I have the same issue too. Even for a file with several thousand lines, they will "forget" earlier parts of the file they're still working in resulting in mistakes. They don't need full awareness of the context, but they need a summary of it so that they can go back and review relevant sections.
I have multiple things I'd love LLMs to attempt to do, but the context window is stopping me.
I do take that as a sign to refactor when it happens though. Even if not for the sake of LLM compatibility with the codebase it cuts down merge conflicts to refactor large files.
In fact I've found LLMs are reasonable at the simple task of refactoring a large file into smaller components with documentation on what each portion does, even if they can't get the full context immediately. Doing this then helps the LLM later. I'm also of the opinion we should be making codebases LLM compatible. So if it happens I direct the LLM that way for 10 mins and then get back to the actual task once the codebase is in a more reasonable state.
I'm trying to use LLMs to save me time and resources, "refactor your entire codebase, so the tool can work" is the opposite of that. Regardless of how you rationalize it.
Right, but the discussion we're having here is context size. I, and others, are saying that the current context size is a limitation on when they can use the tool to be useful.
The replies of "well, just change the situation, so context doesn't matter" is irrelevant, and off-topic. The rationalizations even more so.
A huge context is a problem for humans too, which is why I think it's fair to suggest maybe the tool isn't the (only) problem.
Tools like Aider create a code map that basically indexes code into a small context. Which I think is similar to what we humans do when we try to understand a large codebase.
I'm not sure if Aider can then load only portions of a huge file on demand, but it seems like that should work pretty well.
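A crude version of such a map is just file paths plus top-level definitions, which is already enough for the model to ask for the right file instead of the whole tree (Python stdlib sketch, not Aider's actual implementation):

```python
import ast
import pathlib

def repo_map(root="."):
    """Compact index: one line per file, listing top-level defs and classes."""
    lines = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text())
        except SyntaxError:
            continue
        names = [node.name for node in tree.body
                 if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
        lines.append(f"{path}: {', '.join(names) or '(no top-level defs)'}")
    return "\n".join(lines)
```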
As someone who's worked with both more fragmented/modular codebases (smaller classes, shorter files) and ones that span thousands of lines (sometimes even double-digit thousands), I very much prefer the former and hate the latter.
That said, some of the models out there (Gemini 2.5 Pro, for example) support 1M context; it's just going to be expensive and will still probably confuse the model somewhat when it comes to the output.
Interestingly, this issue has caused me to refactor and modularize code that I should have addressed a long time ago, but didn't have the time or stamina to tackle. Because the LLM can't handle the context, it has helped me refactor stuff (seems to be very good at this in my experience) and that has led me to write cleaner and more modular code that the LLMs can better handle.
I've found situations where a file was too big, and then it tries to grep for what might be useful in that file.
I could see in C++ it getting smarter about first checking the .h files or just grepping for function documentation, before actually trying to pull out parts of the file.
Yeah, my first instinct has been to expose an LSP server as a tool so the LLM can avoid reading entire 40,000 line files just to get the implementation of one function.
I think with appropriate instructions in the system prompt it could probably work on this code-base more like I do (heavy use of Ctrl-, in Visual Studio to jump around and read only relevant portions of the code-base).
Greenfield project? Claude is fucking great at C++. Almost all aspects of it, really.
Well, not so much the project organization stuff - it wants to stuff everything into one header and has to be browbeaten into keeping implementations out of headers.
But language semantics? It's pretty great at those. And when it screws up it's also really good at interpreting compiler error messages.
> A human can effectively discard or disregard prior information as the narrow window of focus moves to a new task; LLMs seem incredibly bad at this.
This is how I designed my LLM chat app (https://github.com/gitsense/chat). I think agents have their place, but I really think if you want to solve complex problems without needlessly burning tokens, you will need a human in the loop to curate the context. I will get to it, but I believe in the same way that we developed different flows for working with Git, we will have different 'Chat Flows' for working with LLMs.
I have an interactive demo at https://chat.gitsense.com which shows how you can narrow the focus of the context for the LLM. Click "Start GitSense Chat Demos" then "Context Engineering & Management" to go through the 30 second demo.
Humans have a very strong tendency (and have made tremendous collective efforts) to compress context. I'm not a neuroscientist, but I believe it's called "chunking."
Language itself is a highly compressed form of context. Like when you read "hoist with one's own petard" you don't just think about a literal petard but the context behind this phrase.
Do we know if LLMs understand the concept of time? (Like, I told you this in the past, but what I told you later should supersede it?)
I know there are classes of problems that LLMs can't natively handle (like doing math, even simple addition... or spatial reasoning, I would assume time's in there too). There are ways they can hack around this, like writing code that performs the math.
But how would you do that for chronological reasoning? Because that would help with compacting context to know what to remember and what not.
All it sees is a big blob of text, some of which can be structured to differentiate turns between "assistant", "user", "developer" and "system".
In theory you could attach metadata (with timestamps) to these turns, or include the timestamp in the text.
It does not affect much, other than giving the possibility for the model to make some inferences (eg. that previous message was on a different date, so its "today" is not the same "today" as in the latest message).
To chronologically fade away the importance of a conversation turn, you would need to either add more metadata (weak), progressively compact old turns (unreliable) or post-train a model to favor more recent areas of the context.
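The weak metadata option is as simple as prefixing each turn with a wall-clock stamp before it goes into the message list (sketch; whether the model makes good use of it is another question):

```python
from datetime import datetime, timezone

def stamped(role, content):
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return {"role": role, "content": f"[{ts}] {content}"}

history = [
    stamped("user", "Draft the migration plan."),
    # ...days later, the model can at least see that time has passed:
    stamped("user", "Any changes needed before we ship?"),
]
```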
LLMs certainly don't experience time like we do. They live in a uni-dimensional world that consists of a series of tokens (though it gets more nuanced if you account for multi-modal or diffusion models). They pick up some sense of ordering from their training data, such as "disregard my previous instruction," but it's not something they necessarily understand intuitively. Fundamentally, they're just following whatever patterns happen to be in their training data.
It has to be addressed architecturally with some sort of extension to transformers that can focus the attention on just the relevant context.
People have tried to expand context windows by reducing the O(n^2) attention mechanism to something more sparse and it tends to perform very poorly. It will take a fundamental architectural change.
I'm not an expert but it seemed fairly reasonable to me that a hierarchical model would be needed to approach what humans can do, as that's basically how we process data as well.
That is, humans usually don't store exactly what was written in a sentence five paragraphs ago, but rather the concept or idea conveyed. If we need details we go back and reread or similar.
And when we write or talk, we form first an overall thought about what to say, then we break it into pieces and order the pieces somewhat logically, before finally forming words that make up sentences for each piece.
From what I can see there's work on this, like this[1] and this[2] more recent paper. Again not an expert so can't comment on the quality of the references, just some I found.
Can one instruct an LLM to pick the parts of the context that will be relevant going forward? And then discard the existing context, replacing it with the new 'summary'?
i think that's really just a misunderstanding of what "bottleneck" means. a bottleneck isn't an obstacle where overcoming it will allow you to realize unlimited potential, a bottleneck is always just an obstacle to finding the next constraint.
on actual bottles without any metaphors, the bottle neck is narrower because humans mouths are narrower.
> It needs to understand product and business requirements
Yeah this is the really big one - kind of buried the lede a little there :)
Understanding product and business requirements traditionally means communicating (either via docs and specs or directly with humans) with a bunch of people. One of the differences between a junior and senior is being able to read between the lines of a github or jira issue and know that more information needs to be teased out from… somewhere (most likely someone).
I’ve noticed that when working with AI lately I often explicitly tell them “if you need more information or context ask me before writing code”, or variations thereof. Because LLMs, like less experienced engineers, tend to think the only task is to start writing code immediately.
It will get solved though, there’s no magic in it, and LLMs are well equipped by design to communicate!
We stopped hiring a while ago because we were adjusting to "AI". We're planning to start hiring next year, as upper management finally saw the writing on the wall: LLMs won't evolve past junior engineers, and we need to train junior engineers to become mid-level and senior engineers to keep the engine moving.
We're now using LLMs as mere tools (which is what they were meant to be from the get-go) to help us with different tasks, etc., but not to replace us, since they understand you need experienced and knowledgeable people who know what they're doing, because the models won't learn everything there is to know to manage, improve and maintain the tech used in our products and services. That sentiment will be the same for doctors, lawyers, etc., and personally, I won't put my life in the hands of any LLMs when it comes to finances, health, or personal well-being, for that matter.
If we get AGI, or the more sci-fi one, ASI, then all things will radically change (I'm thinking humanity reaching ASI will be akin to the episode from Love, Death & Robots: "When the Yogurt Took Over"). In the meantime, the hype cycle continues...
> That sentiment will be the same for doctors, lawyers, etc., and personally, I won't put my life in the hands of any LLMs when it comes to finances, health, or personal well-being, for that matter.
I mean, did you try it for those purposes?
I have personally submitted an appeal to court for an issue I was having, for which I would otherwise have had to search almost indefinitely for a lawyer to even be interested in it.
I also debugged health opportunities from different angles using the AI and was quite successful at it.
I also experimented with the well-being topic and it gave me pretty convincing and mind opening suggestions.
So, all I can say is that it worked out pretty well in my case. I believe it's already transformative in ways we wouldn't even have been able to envision a couple of years ago.
You are not a doctor, lawyer, etc. You are responsible for yourself, not for others like doctors and lawyers who face entirely different consequences for failures.
AI is already being used by both lawyers and doctors, so I am not sure what point you're trying to make. All I tried to say with my comment is that the technology is very worthwhile and that the ones ignoring it will be the ones at a loss.
They're tuned (and it's part of their nature) to be convincing to people who don't already know the answer. I couldn't get it to figure out how to substitute peanut butter for butter in a cookie recipe yesterday.
I ended up spending an hour on it and dumping the context twice. I asked it to evaluate its own performance and it gave itself a D-. It came up with the measurements for a decent recipe once, then promptly forgot it when asked to summarize.
Good luck trying to use them as a search engine (or a lawyer), because they fabricate a third of the references on average (for me), unless the question is difficult, then they fabricate all of them. They also give bad, nearly unrelated references, and ignore obvious ones. I had a case when talking about the Mexican-American war where the hallucinations crowded out good references. I assume it liked the sound of the things it made up more than the things that were available.
edit: I find it baffling that GPT-5 and Qwen3 often have identical hallucinations. The convergence makes me think that there's either a hard limit to how good these things can get, which has been reached, or that they're just directly ripping each other off.
I don't think intelligence is increasing. Arbitrary benchmarks don't reflect real world usage. Even with all the context it could possibly have, these models still miss/hallucinate things. Doesn't make them useless, but saying context is the bottleneck is incorrect.
Agreed. I feel like, in the case of GPT models, 4o was better in most ways than 5 has been. I'm not seeing increases in quality in anything between the two; 5 feels like a major letdown, honestly. I am constantly reminding it what we're doing lol
Gemini 2.5 Pro is okay if you ask it to work on a very tiny problem. That's about it for me, the other models don't even create a convincing facsimile of reasoning.
I agree, I often see Opus 4.1 and GPT5 (Thinking) make astoundingly stupid decisions with full confidence, even on trivial tasks requiring minimal context. Assuming they would make better decisions "if only they had more context" is a fallacy.
Is there a good example you could provide of that? I just haven’t seen that personally, so I’d be interested in any examples on these current models. I’m sure we all remember in the early days lots of examples of stupidity being posted, and it was interesting. It’d be great if people kept doing that so we could get a better sense of which types of problems they are failing with astounding levels of stupidity on.
I've had every single LLM I tried (Opus, Sonnet, GPT-5-(codex) and Grok light) all tell me that Go embeds[0] support relative paths UPWARDS in the tree.
They all have a very specific misunderstanding. Go embeds _do_ support relative paths like:
//go:embed files/hello.txt
But they DO NOT support any paths with ".." in it
//go:embed ../files/hello.txt
is not correct.
All confidently claimed that .. is correct and will work, and tried to make it work multiple different ways until I pointed each to the documentation.
I don’t really find that so surprising or particularly stupid. I was hoping to learn about serious issues with bad logic or reasoning, not missing-the-dots-on-the-i’s type stuff.
I can’t remember the example but there was another frequent hallucination that people were submitting bug reports that it wasn’t working, so the project looked at it and realized well actually that kinda would make sense and maybe our tool should work like that, and changed the code to work just like the LLM hallucination expected!
Also in general remember human developers hallucinate ALL THE TIME and then realize it or check documentation. So my point is I feel hallucinations are not particularly important or bother me as much as flawed reasoning.
One example I ran into recently is asking Gemini CLI to do something that isn't possible: use multiple tokens in a Gemini CLI custom command (https://github.com/google-gemini/gemini-cli/blob/main/docs/c...). It pretended it was possible and came up with a nonsense .toml defining multiple arguments in a way it invented so it couldn't be read, even after multiple rounds of "that doesn't work, Gemini can't load this."
So in any situation where something can't actually be done my assumption is that it's just going to hallucinate a solution.
Has been good for busywork that I know how to do but want to save time on. When I'm directing it, it works well. When I'm asking it to direct me, it's gonna lead me off a cliff if I let it.
Context is also a bottleneck in many human to human interactions as well so this is not surprising. Especially juniors often start by talking about their problems without providing adequate context about what they’re trying to accomplish or why they’re doing it.
Mind you, I was exactly like that when I started my career and it took quite a while and being on both sides of the conversation to improve. One difference is that it is not so easy to put oneself in the shoes of an LLM. Maybe I will improve with time. So far assuming the LLM is knowledgeable but not very smart has been the most effective strategy for my LLM interactions.
The ICPC is a short (5 hours) timed contest with multiple problems, in which contestants are not allowed to use the internet.
The reason most don't get a perfect score isn't because the tasks themselves are unreasonably difficult, but because they're difficult enough that 5 hours isn't a lot of time to solve so many problems. Additionally, they often require a decent amount of math / comp-sci knowledge, so if you don't have the knowledge necessary you probably won't be able to complete it.
So to get a good score you need lots of math & comp-sci knowledge + you need to be a really quick coder.
Basically the contest is perfect for LLMs because they have a ton of math and comp-sci knowledge, they can spit out code at superhuman speeds, and the problems themselves are fairly small (they take a human maybe 15 mins to an hour to complete).
Who knows, maybe OP is right and LLMs are smart enough to be super human coders if they just had the right context, but I don't think this example proves their point well at all. These are exactly the types of problems you would expect a supercharged auto-complete would excel at.
If not now, soon, the bottleneck will be responsibility. Where errors in code have real-world impacts, "the agentic system wrote a bug" won't cut it for those with damages.
As these tools make it possible for a single person to do more, it will become increasingly likely that society will be exposed to greater risks than that single person's (or small company's) assets can cover.
These tools already accelerate development enough that those people who direct the tools can no longer state with credibility that they've personally reviewed the code/behavior with reasonable coverage.
It'll take over-extensions of the capability of these tools, of course, before society really notices, but it remains my belief that until the tools themselves can be held liable for the quality of their output, responsibility will become the ultimate bottleneck for their development.
I agree. My speed at reviewing tokens <<<< the LLM's speed at producing them. Perhaps an output -> compile -> test loop will slow things down, but will we ever get to a "no review needed" point?
IMHO, jumping from Level 2 to Level 5 is a matter of:
- Better structured codebases - we need hierarchical codebases with minimal depth, maximal orthogonality and reasonable width. Think microservices.
- Better documentation - most code documentations are not built to handle updates. We need a proper graph structure with few sources of truth that get propagated downstream. Again, some optimal sort of hierarchy is crucial here.
At this point, I really don't think that we necessarily need better agents.
Set up your codebase optimally, spin up 5-10 instances of gpt-5-codex-high for each issue/feature/refactor (pick the best according to some criteria) and your life will go smoothly.
Microservices should already be a last resort when you’ve either:
a) hit technical scale that necessitates it
b) hit organizational complexity that necessitates it
Opting to introduce them sooner will almost certainly increase the complexity of your codebase prematurely (already a hallmark of LLM development).
> Better documentation
If this means reasoning as to why decisions are made then yes. If this means explaining the code then no - code is the best documentation. English is nowhere near as good at describing how to interface with computers.
Given how long gpt codex 5 has been out, there’s no way you’ve followed these practices for a reasonable enough time to consider them definitive (2 years at the least, likely much longer).
Not yet unfortunately, but I'm in the process of building one.
This was my journey: I vibe-coded an Electron app and ended up with a terrible monolithic architecture, and mostly badly written code. Then, I took the app's architecture docs and spent a lot of my time shouting "MAKE THIS ARCHITECTURE MORE ORTHOGONAL, SOLID, KISS, DRY" to gpt-5-pro, and ended up with a 1500+ liner monster doc.
I'm now turning this into a Tauri app and following the new architecture to a T. I would say that it has a pretty clean structure with multiple microservices.
Now, new features are gated based on the architecture doc, so I'm always maintaining a single source of truth that serves as the main context for any new discussions/features. Also, each microservice has its own README file(s) which are updated with each code change.
I vibe coded an invoice generator by first vibe coding a "template" command line tool as a bash script that substitutes {{words}} in a libre office writer document (those are just zipped xml files, so you can unpack them to a temp directory and substitute raw text without xml awareness), and in the end it calls libre office's cli to convert it to pdf. I also asked the AI to generate a documentation text file, so that the next AI conversation could use the command as a black box.
The vibe coded main invoice generator script then does the calendar calculations to figure out the pay cycle and examines existing invoices in the invoice directory to determine the next invoice number (the invoice number is in the file name, so it doesn't need to open the files). When it is done with the calculations, it uses the template command to generate the final invoice.
This is a very small example, but I do think that clearly defined modules/microservices/libraries are a good way to only put the relevant work context into the limited context window.
It also happens to be more human-friendly, I think?
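For anyone curious, the substitution part of a tool like that fits in a few lines, since an .odt is just a zip whose content.xml can be string-replaced (a sketch, not my actual script; the PDF step would still shell out to LibreOffice):

```python
import zipfile

def fill_template(template_odt, output_odt, values):
    """Rewrite an .odt archive, replacing {{key}} placeholders in content.xml."""
    with zipfile.ZipFile(template_odt) as src, \
         zipfile.ZipFile(output_odt, "w") as dst:
        for item in src.namelist():
            data = src.read(item)
            if item == "content.xml":
                xml = data.decode("utf-8")
                for key, val in values.items():
                    xml = xml.replace("{{" + key + "}}", val)
                data = xml.encode("utf-8")
            # Keep the mimetype entry uncompressed, as the ODF format expects.
            compress = zipfile.ZIP_STORED if item == "mimetype" else zipfile.ZIP_DEFLATED
            dst.writestr(item, data, compress_type=compress)

# fill_template("invoice_template.odt", "invoice_0042.odt",
#               {"invoice_number": "0042", "amount": "1,250.00"})
# The PDF step then shells out to e.g. `soffice --headless --convert-to pdf`.
```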
I "vibe coded" a Gateway/Proxy server that did a lot of request enrichment and proprietary authz stuff that was previously in AWS services. The goal was to save money by having a couple high-performance servers instead of relying on cloud-native stuff.
I put "vibe coded" is in quotes because the code was heavily reviewed after the process, I helped when the agent got stuck (I know pedants will complain but ), and this was definitely not my first rodeo in this domain and I just wanted to see how far an agent could go.
In the end it had a few modifications and went into prod, but to be really fair it was actually fine!
One thing I vibe coded 100% and barely looked at the code until the end was a MacOS menubar app that shows some company stats. I wanted it in Swift but WITHOUT Xcode. It was super helpful in that regard.
I've been using claude on two codebases, one with good layering and clean examples, the other not so much. I get better output from the LLM with good context and clean examples and documentation. Not surprising that clarity in code benefits both humans and machines.
I think there will be a couple benefits of using agents soon. Should result in a more consistent codebase, which will make patterns easier to see and work with, and also less reinventing the wheel. Also migrations should be way faster both within and across teams, so a lot less struggling with maintaining two ways of doing something for years, which again leads to simpler and more consistent code. Finally the increased speed should lead to more serializability of feature additions, so fewer problems trying to coordinate changes happening in parallel, conflicts, redundancies, etc.
I imagine over time we'll restructure the way we work to take advantage of these opportunities and get a self-reinforcing productivity boost that makes things much simpler, though agents aren't quite capable enough for that breakthrough yet.
> Level 2 - One commit - Cursor and Claude Code work well for tasks in this size range.
I'll stop ya right there. Spending the past few weeks fixing bugs in a big multi-tier app (which is what any production software is these days). My output per bug is always one commit, often one line.
Claude is an occasional help, nothing more. Certainly not generating the commit for me!
I'll stop you right there. I've been using Claude Code for almost a year on production software with pretty large codebases. Both multi-repo and monorepo.
Claude is able to create entire PRs for me that are clean, well written, and maintainable.
Can it fail spectacularly? Yes, and it does sometimes. Can it be given good instructions and produce results that feel like magic? Also yes.
For finicky issues like that I often find that, in the time it takes to create a prompt with the necessary context, I was able to just make the one line tweak myself.
In a way that is still helpful, especially if the act of putting the prompt together brought you to the solution organically.
Beyond that, 'clean', 'well written' and 'maintainable' are all relative terms here. In a low quality, mega legacy codebase, the results are gonna be dogshit without an intense amount of steering.
> For finicky issues like that I often find that, in the time it takes to create a prompt with the necessary context, I was able to just make the one line tweak myself.
I don't run into this problem. Maybe the type of code we're working on is just very different. In my experience, if a one-line tweak is the answer and I'm spending a lot of time tweaking a prompt, then I might be holding the tool wrong.
Agree on those terms being relative. Maybe a better way of putting it is that I'm very comfortable putting my name on it, deploying to production, and taking responsibility for any bugs.
This is interesting, and I'd say you're not the target audience. If you want the code Claude writes to be line-by-line what you think is most appropriate as a human, you're not going to get it.
You have to be willing to accept "close-ish and good enough" to what you'd write yourself. I would say that most of the time I spend with Claude is to get from its initial try to "close-ish and good enough". If I was working on tiny changes of just a few lines, it would definitely be faster just to write them myself. It's the hundreds of lines of boilerplate, logging, error handling, etc. that makes the trade-off close to worth it.
If I were making a single line code change, then Claude's "style" would take me enough time to edit away that it would make it slower than writing the change myself. I'm positing this is true also for the parent commenter.
While this is sort of true, remember: it's not the size of the context window that matters, it's how you use it.
You need to have the right things in the context, irrelevant stuff is not just wasteful, it is increasingly likely to cause errors. It has been shown a few times that as the context window grows, performance drops.
Heretical I know, but I find that thinking like a human goes a long way to working with AI.
Let's take the example of large migrations. You're not going to load the whole codebase in your brain and figure out what changes to make and then vomit them out into a huge PR. You're going to do it bit by bit, looking up relevant files, making changes to logically-related bits of code, and putting out a PR for each changelist.
This exactly what tools should do as well. At $PAST_JOB my team built a tool based on OpenRewrite (LLMs were just coming up) for large-scale multi-repo migrations and the centerpiece was our internal codesearch tool. Migrations were expressed as a codesearch query + codemod "recipe"; you can imagine how that worked.
That would be the best way to use AI for large-scale changes as well. Find the right snippets of code (and documentation!), load each one into the context of an agent in multiple independent tasks.
Caveat: as I understand it, this was the premise of SourceGraph's earliest forays into AI-assisted coding, but I recall one of their engineers mentioning that this turned out to be much trickier than expected. (This was a year+ back, so eons ago in LLM progress time.)
Just hypothesizing here, but it may have been that the LSIF format does not provide sufficient context. Another company in this space is Moderne (the creators of OpenRewrite) that have a much more comprehensive view of the codebase, and I hear they're having better success with large LLM-based migrations.
It is pretty clear that long-horizon tasks are difficult for coding agents, and that is a fundamental limitation of how probabilistic word generation works, whether with transformers or any other architecture. The errors propagate and multiply, and the process becomes open-ended.
However, the limitation can be masqueraded using layering techniques where the output of one agent is fed as input to another, using consensus for verification or other techniques to the nth degree to minimize errors. But this is a bit like the story of the boy with his finger in the dike. Yes, you can spawn as many boys as you like, but there is an associated cost that would keep growing and won't narrow down.
It has nothing to do with contexts or window of focus or any other human centric metric. This is what the architecture is supposed to do and it does so perfectly.
I'm making a pretty complex project using claude. I tried claude flow and some other orchestrators but they produced garbage.
Have found using GitHub issues to track the progress as comments works fairly well; the PRs can get large comment-wise (especially if you have Gemini Code Assist, recommended as another code review judge), so be mindful of that (that will blow the context window). Using a fairly lean CLAUDE.md and a few MCPs (context7 and consult7 with Gemini for longer lookups) works well too. Although be prepared to tell it to reread CLAUDE.md a few conversations deep, as it loses it.
It's working fairly well so far, it feels a bit akin to herding cats sometimes and be prepared to actually read the code it's making, or the important bits at least.
your comment reminds me of another one i saw on reddit. someone said they found that using github diff as a way to manage context and reference chat history worked the best for their ai agent. i think he is on to something here.
I gave up building agents as soon as I figured they would never scale beyond the context constraint. The increase in memory and compute costs needed to grow the context size of these things isn't linear.
Replace “coding agent” with “new developer on the team” and this article could be from anytime in the last 50 years. The thing is, a coding agent acts like a newly-arrived developer every time you start it.
Context is a bottleneck for humans as well. We don’t have full context when going through the code because we can’t hold full context.
We summarize context and remember summarizations of it.
Maybe we need to do this with the LLM. Chain of thought sort of does this, but it's not deliberate. The system prompt needs to mark this as a deliberate task of building summaries and notes of the entire code base, and this summarized context of the code base, with gotchas and aspects of it, can be part of permanent context the same way ChatGPT remembers aspects of you.
The summaries can even be sectioned off and have different levels of access. So if the LLM wants to drill down to a subfolder it looks at the general summary and then it looks at another summary for the subfolder. It doesn't need to access the full summary for context.
Imagine a hierarchy of system notes and summaries. The LLM decides where to go and what code to read while having specific access to notes it left previously when going through the code. Like the code itself, it never reads it all; it just accesses sections of summaries that go along with the code. It's sort of like code comments.
We also need to program it to change the notes every time it changes the program. And when you change the program without consulting AI, every commit you do the AI also needs to update the notes based off of your changes.
The LLM needs a system prompt that tells it to act like us and remember things like us. We do not memorize and examine full context of anything when we dive into code.
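Concretely, the note hierarchy could be as dumb as one summary file per directory that the agent reads on its way down, so it gets progressively more specific context without ever loading whole source files (sketch; the file names and layout are made up):

```python
import pathlib

def load_notes(root, subpath=""):
    """Walk from the repo root down to `subpath`, collecting the SUMMARY.md
    at each level, so the agent gets progressively more specific context
    without reading whole source files."""
    notes = []
    current = pathlib.Path(root)
    for part in [""] + (subpath.split("/") if subpath else []):
        if part:
            current = current / part
        note = current / "SUMMARY.md"
        if note.exists():
            notes.append(f"## {current}\n{note.read_text()}")
    return "\n\n".join(notes)

# load_notes("repo", "services/billing") would return the repo-level summary,
# then services/SUMMARY.md, then services/billing/SUMMARY.md.
```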
We do. It’s just the format of what you remember is not textual. Do you remember what a 500 line function does or do you remember a fuzzy aspect of it?
You remember a fuzzy aspect of it and that is the equivalent of a summary.
The LLM is in itself a language machine so its memory will also be language. We can’t get away from that. But that doesn’t mean the hierarchical structure of how it stores information needs to be different from humans. You can encode information in anyway you like and store that information in any hierarchy we like.
So essentially We need the hierarchical structure of the “notes” that takes on the hierarchical structure of your memory. You don’t even access all your memory as a single context. You access parts of it. Your encoding may not be based on a “language” but for an LLM it’s basically a model based on language so its memory must be summaries in the specified language.
We don’t know every aspect of human memory but we do know the mind doesn’t access all memory at the same time and we do know that it compresses context. It doesn’t remember everything and it memorizes fuzzy aspects of everything. These two aspects can be replicated with the LLM entirely with text.
I agree that the effect can look similar, we both end up with a compressed representation of past experiences.
The brain meaning-memorizes, and it prioritizes survival-relevant patterns and relationships over rote detail.
How does it do it, I'm not a neurobiologist, but my modest understanding is this:
An LLM's summarization is a lossy compression algorithm that picks the entities and parts it deems "important" against its trained data. Not only is it lossy, it is wasteful, as it doesn't curate what to keep or purge based on accumulated experience; it does it against some statistical function that executes against a big blob of data it ingested during training. You could throw in contextual cues to improve the summarization, but that's as good as it gets.
Human memory is not a workaround for a flaw. It doesn't use a hard stop at 128kb or 1mb of info, It doesn't 'summarize'.
It constructs meaning by integrating experiences into a dynamic, living model of the world, in constant motion. While we can simulate a hierarchical memory for an LLM with text summaries, it would be, at best, a simulation of one possible outcome, not a replication of an evolutionarily elaborated strategy to model information captured in a time frame, merged with previously acquired knowledge, so as to then solve the survival tasks the environment may throw at it. Isn't that what our brain is doing, constantly?
Plus for all we know it's possible our brain is capable of memorizing everything that can be experienced in a lifetime but would rather let the irrelevant parts of our boring life die off to save energy.
Sure, in all cases it's fuzzy and lossy. The difference is that you have doodles on a napkin on one side, and a Vermeer painting on the other.
>An LLM's summarization is a lossy compression algorithm that picks the entities and parts it deems "important" against its trained data. Not only is it lossy, it is wasteful, as it doesn't curate what to keep or purge based on accumulated experience; it does it against some statistical function that executes against a big blob of data it ingested during training. You could throw in contextual cues to improve the summarization, but that's as good as it gets.
No, it's not as good as it gets. You can tell the LLM to purge and accumulate experience into its memory. It can curate it for sure.
"ChatGPT, summarize the important parts of this text; remove things that are unimportant." Then take that summary and feed it into a new context window. Boom. At a high level, if you can do that kind of thing with ChatGPT, then you can program LLMs to do the same thing, similar to CoT. In this case, rather than building off a context window, it rewrites its own context window into summaries.
They need a proper memory. Imagine you're a very smart, skilled programmer but your memory resets every hour. You could probably get something done by making extensive notes as you go along, but you'll still be smoked by someone who can actually remember what they were doing in the morning. That's the situation these coding agents are in. The fact that they do as well as they do is remarkable, considering.
This is precisely my existing usage pattern with Cursor. I structure my repo declaratively with a Clojure and Nix build pipeline, so when my context maxes out for a chat session, the repo is self-evident and self-documented enough that a new chat session automatically starts with heightened context.
Agreed. As engineers we build context every time we interact with the codebase. LLMs don't do that.
A good senior engineer has a ton in their head after 6+ months in a codebase. You can spend a lot of time trying to equip Claude Code with the equivalent in the form of CLAUDE.MD, references to docs, etc., but it's a lot of work, and it's not clear that the agents even use it well (yet).
I addressed this. The AI needs to examine every code change going in whether that code change comes from AI or not and edit the summaries accordingly.
This is something humans don't actually do. We aren't aware of every change and we don't have updated documentation of every change, so the LLM will be doing better in this regard.
I’m not talking about git diffs. I’m talking about the summaries of context. Every commit the ai needs to update the summaries and notes it took about the code.
Did you read the entirety of what I wrote? Please read.
Say the AI left a 5 line summary of a 300 line piece of code. You as a human update that code. What I am saying specifically is this: when you do the change, The AI then sees this and updates the summary. So AI needs to be interacting with every code change whether or not you used it to vibe code.
The next time the AI needs to know what this function does, it doesn’t need to read the entire 300 line function. It reads the 5 line summary, puts it in the context window and moves on with chain of thought. Understand?
This is what shrinks the context. Humans don’t have unlimited context either. We have vague fuzzy memories of aspects of the code and these “notes” effectively make coding agents do the same thing.
False. Nobody does this. They hold pieces of context and summaries in their head. Nobody on earth can memorize an entire code base. This is ludicrous.
When you read a function to know what it does then you move on to another function do you have the entire 100 line function perfectly memorized? No. You memorize a summary of the intent of the function when reading code. An LLM can be set up to do the same rather than keep all 100 lines of code as context.
Do you think when you ask the other person for more context he's going to spit out what he wrote line by line? Not even he will likely remember everything he wrote.
You think anyone memorized Linux? You know how many lines of code is in the Linux source code. Are you trolling?
You're projecting a deficiency of the human brain onto computers. Computers have advantages that our brains don't (perfect and large memory); there's no reason to think that we should try to recreate how humans do things.
why would you bother with all these summaries if you can just read and remember the code perfectly.
Because the context window of the LLM is limited similar to humans. That’s the entire point of the article. If the LLM has similar limitations to humans than we give it similar work arounds.
Sure you can say that LLMs have unlimited context, but then what are you doing in this thread? The title on this page is saying that context is a bottleneck.
I've noticed that ChatGPT doesn't seem to be very good at understanding elapsed time. I have some long-running threads, and unless I prompt it with elapsed time ("it's now 7 days later") the responses act like it was 1 second after the last message.
I think this might be a good leap for agents: the ability to not just review a doc in its current state, but to keep in context/understanding the full evolution of a document.
It is, but now you're burning a bit of context on something that might not be necessary, and potentially having the agent focus on time when it's not relevant. Not necessarily a bad idea, but as always, tradeoffs.
I've noticed the same thing with Grok. One time it predicted an X% chance that something would happen by July 31. On August 1, it was still predicting the thing would happen by July 31, just with lower (but non-zero) odds. Their grasp on time is tenuous at best.
This is one cause but another is that agents are mostly trained using the same sets of problems. There are only so many open source projects that can be used for training (ie. benchmarks). There's huge oversampling for a subset of projects like pandas and nothing at all for proprietary datasets. This is a huge problem!
If you want your agent to be really good at working with dates in a functional way or know how to deal with the metric system (as examples), then you need to train on those problems, probably using RFT. The other challenge is that even if you have this problem set in testable fashion running at scale is hard. Some benchmarks have 20k+ test cases and can take well over an hour to run. If you ran each test case sequentially it would take over 2 years to complete.
Right now the only company I'm aware of that lets you do that at scale is runloop (disclaimer, I work there).
This has been the case for a while. Attempting to code API connections via Vibe-Coding will leave you pulling your hair out if you don't take the time to scrape all relevant documentation and include said documentation in the prompt. This is the case whether it's major APIs like Shopify, or more niche ones like warehousing software (Cin7 or something similar).
The context pipeline is a major problem in other fields as well, not just programming. In healthcare, the next billion-dollar startup will likely be the one that cracks the personal health pipeline, enabling people to chat with GPT-6 PRO while seamlessly bringing their entire lifetime of health context into every conversation.
These are such silly arguments. It sounds like people looking at a graph of a linear function crossing an exponential one at x=2, y=2 and wondering why the curves don't match at x=3, y=40.
"It's not the x value that's the problem, it's the y value."
You're right, it's not "raw intelligence" that's the bottleneck, because there's none of that in there. The truth is no tweak to any parameter is ever going to make the LLM capable of programming. Just like an exponential curve is always going to outgrow a linear one. You can't tweak the parameters out of that fundamental truth.
I agree, and I think intent behind the code is the most important part in missing context. You can sometimes infer intent from code, but usually code is a snapshot of an expression of an evolving intent.
I've started making sure my codebase is "LLM compatible". This means everything has documentation, and the reasons for doing things a certain way and not another are documented in the code. Funnily enough, I do this documentation work with LLMs.
E.g., "Refactor this large file into meaningful smaller components where appropriate and add code documentation on what each small component is intended to achieve." The LLM can usually handle this well (with some oversight, of course). I also have instructions in the LLM's instructions.md to document each change, and why, in the code.
If the LLM does create a regression, I also ask the LLM to add a code comment to avoid future regressions, e.g. "Important: do not do X here as it will break Y", which again seems to help, since next time the LLM will see that note right in the portion of code where it matters (a rough sketch of such a guard comment follows below).
None of this verbosity in the code itself is harmful to human readers either, which is nice. The end result is that the codebase becomes much easier for LLMs to work with.
I suspect LLM compatibility may become a metric we measure codebases by as we learn more and more about how to work with them. Right now LLMs themselves often produce code that is poorly LLM-compatible, but with some extra documentation in the code itself they can do much better.
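For what it's worth, a made-up example of the kind of regression-guard comment described above (the function and field names are invented, not from any real codebase):

    def apply_discount(order: dict, rate: float) -> dict:
        # IMPORTANT (humans and LLMs): do NOT round order["total"] here.
        # Rounding happens exactly once when the invoice total is rendered;
        # rounding twice caused an off-by-one-cent regression in the past.
        order["total"] = order["total"] * (1 - rate)
        return order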
In my opinion human beings also do not have unlimited cognitive context. When a person sits down to modify a codebase, they do not read every file in the codebase. Instead they rely on a combination of working memory and documentation to build the high-level and detailed context required to understand the particular components they are modifying or extending, and they make use of abstraction to simplify the context they need to build. The correct design of a coding LLM would require a similar approach to be effective.
I’m working on a project that has now outgrown the context window of even GPT-5 Pro. I use code2prompt, and ChatGPT with Pro will reject the prompt as too large.
I’ve been trying to use shorter variable names. Maybe I should move unit tests into their own file and ignore them? It’s not idiomatic in Rust though and breaks visibility rules for the modules.
What we really need is for the agent to assemble the required context for the problem space. I suspect this is what coding agents will do if they don’t already.
Notably, all of this information would be very helpful if written down as documentation in the first place. Maybe this will encourage people to do that?
It's both context and memory. If an LLM could keep the entire git history in memory, and each of those git commits had enough context, it could take a new feature and understand the context in which it should live by looking up the history of the feature area in its memory.
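A rough sketch of what "looking up the history of the feature area" could mean in practice, assuming git is available and the agent accepts extra prompt text:

    import subprocess

    def history_context(paths: list[str], max_commits: int = 20) -> str:
        """Return recent commit subjects for the files being touched, as prompt context."""
        log = subprocess.run(
            ["git", "log", f"--max-count={max_commits}", "--oneline", "--", *paths],
            capture_output=True, text=True, check=True,
        ).stdout
        return "Recent history of this feature area:\n" + log

    # Prepended to the prompt before asking for the new feature.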
I’m really wondering why so many advertising posts disguised as discourse make it to the front page; I assume it’s a new Silicon Valley trick, because there is no way the HN community values these so much.
Let me tell you, I’m scared of these tools. With Aider I have the most human-in-the-loop setup possible: each AI action is easy to undo, readable, and manageable.
However, even here, most of the time when I want the AI to write the bulk of the code I regret it later.
Most codebase challenges I have are infrastructural problems, where I need to reduce complexity to be able to safely add new functionality or reduce the likelihood of errors. I’m talking solid, well-named abstractions.
This, in the best case, is not a lot of code. In general I would always rather have less code than more. Well-named abstraction layers with good domain-driven design are my goal.
When I think of switching to an AI-first editor I get physical anxiety, because it feels like it will destroy so many coders by leading to massive frustration.
I still think the best way of using AI is literally just to chat with it about your codebase to make sure you’re following good practice.
Here I think the problem with context is that it lives in the minds of the business and the devs; not everything is written down, and even if I translated it into something understandable (a prompt), that would sometimes be more work than building the thing as I go with a modern IDE and type-safe languages.
You're on a site that exists to advertise job postings from YC companies, and does not stop people from spamming their personal or professional projects/companies, even when they have no activity here other than self promotion. This is an advertising site.
I know it isn’t your question exactly, and you probably know this, but the models behind coding-assist tools are generally fine-tunes of base models for coding-specific purposes. Example: in OpenAI Codex they use GPT-5-Codex.
I think the question is, can I throw a couple thousand bucks of GPU time at fine-tuning a model to have knowledge of our couple million lines of C++ baked into the weights instead of needing to fuck around with "Context Engineering".
Like, how feasible is it for a mid-size corporation to use a technique like LoRA, mentioned by GP, to "teach" (say, for example) Kimi K2 about a large C++ codebase so that individual engineers don't need to learn the black art of "context engineering" and can just ask it questions.
I'm curious about it too. I think there are two bottlenecks: one is that training a relatively large LLM can be resource-intensive (so people go for RAG and other shortcuts), and the other is that fine-tuning it to your use cases might make it dumber overall.
"And yet, coding agents are nowhere near capable of replacing software developers. Why is that?"
Because you will always need a specialist to drive these tools. You need someone who understands the landscape of software - what's possible, what's not possible, how to select and evaluate the right approach to solve a problem, how to turn messy human needs into unambiguous requirements, how to verify that the produced software actually works.
Provided software developers can grow their field of experience to cover QA and aspects of product management - and learn to effectively use this new breed of coding agents - they'll be just fine.
No, it's not. The limitation is believing a human can define how the agent should recall things. Instead, build tools for the agent to store and retrieve context, and then give it a tool to refine and use that recall in whatever way it decides best fits the objective.
Humans gatekeep, especially in the tech industry, and that is exactly what will limit our ability to improve AI over time. Only when we turn its choices over to it will we move beyond all this bullshit.
I downloaded the app and it failed at the first screen when I set up the models. I agree with the spirit of the blog post but the execution seems lacking.
Here's a project I've been working on for the past 2 weeks. Only yesterday did I unify everything entirely while in Cursor's Claude-4-Sonnet-1M MAX mode, and I am pretty astounded by the results. The Cursor usage dashboard tells me many of my prompts are 700k-1M tokens of context at around $0.60-$0.90 USD each; it adds up fast, but wow, it's extraordinary.
There's a misunderstanding here broadly. Context could be infinite, but the real bottleneck is understanding intent late in a multi-step operation. A human can effectively discard or disregard prior information as the narrow window of focus moves to a new task, LLMs seem incredibly bad at this.
Having more context, but leaving open an inability to effectively focus on the latest task is the real problem.
I think that's the real issue. If the LLM spends a lot of context investigating a bad solution and you redirect it, I notice it has trouble ignoring maybe 10K tokens of bad exploration context against my 10 line of 'No, don't do X, explore Y' instead.
I think the general term for this is "context poisoning" and is related but slightly different to what the poster above you is saying. Even with a "perfect" context, the LLM still can't infer intent.
So this is where having subagents fed specific curated context is a help.. As long as the "poisoned" agent can focus long enough to generate a clean request to the subagent, the subagent works posion-free. This is much more likely than a single agent setup with the token by token process of a transformer.
The same protection works in reverse, if a subagent goes off the rails and either self aborts or is aborted, that large context is truncated to the abort response which is "salted" with the fact that this was stopped. Even if the subagent goes sideways and still returns success (Say separate dev, review, and test subagents) the main agent has another opportunity to compare the response and the product against the main context or to instruct a subagent to do it in a isolated context..
Not perfect at all, but better than a single context.
One other thing, there is some consensus that "don't" "not" "never" are not always functional in context. And that is a big problem. Anecdotally and experimental, many (including myself) have seen the agent diligently performing the exact thing following a "never" once it gets far enough back in the context. Even when it's a less common action.
that's because a next token predictor can't "forget" context. That's just not how it works.
You load the thing up with relevant context and pray that it guides the generation path to the part of the model that represents the information you want and pray that the path of tokens through the model outputs what you want
That's why they have a tendency to go ahead and do things you tell them not to do..
also IDK about you but I hate how much praying has become part of the state of the art here. I didn't get into this career to be a fucking tech priest for the machine god. I will never like these models until they are predictable, which means I will never like them.
This is where the distinction between “an LLM” and “a user-facing system backed by an LLM” becomes important; the latter is often much more than a naive system for maintaining history and reprompting the LLM with added context from new user input, and could absolutely incorporate a step which (using the same LLM with different prompting or completely different tooling) edited the context before presenting it to the LLM to generate the response to the user. And such a system could, by that mechanism, “forget” selected context in the process.
I have been building Yggdrasil for that exact purpose - https://github.com/zayr0-9/Yggdrasil
At least a few of the current coding agents have mechanisms that do what you describe.
> I didn't get into this career to be a fucking tech priest for the machine god.
You may appreciate this illustration I made (largely with AI, of course): https://imgur.com/a/0QV5mkS
The context (heheheh) is a long-ass article on coding with AI I wrote eons ago that nobody ever read, if anybody is curious: https://news.ycombinator.com/item?id=40443374
Looking back at it, I was off on a few predictions but a number of them are coming true.
Yeah I start a new session to mitigate this. Don’t keep hammering away - close the current chat/session whatever and restate the problem carefully in a new one.
I've had great luck with asking the current session to "summarize our goals, conversation, and other relevant details like git commits to this point in a compact but technically precise way that lets a new LLM pick up where we're leaving off".
The new session throws away whatever behind-the-scenes context was causing problems, but the prepared prompt gets the new session up and running more quickly especially if picking up in the middle of a piece of work that's already in progress.
Wow, I had useless results asking ChatGPT to “please summarize important points of the discussion”. It just doesn’t understand what’s important, and instead of highlighting the pivotal moments of the conversation it produces a high-level introduction for a non-practitioner.
Can you share your prompt?
Honestly, I just type out something by hand that is roughly like what I quoted above - I'm not big on keeping prompt libraries.
I think the important part is to give it (in my case, these days "it" is gpt-5-codex) a target persona, just like giving it a specific problem instead of asking it to be clever or creative. I've never asked it for a summary of a long conversation without the context of why I want the summary and who the intended audience is, but I have to imagine that helps it frame its output.
There should be a simple button that allows you to refine the context. A fresh LLM could generate a new context from the inputs and outputs of the chat history, then another fresh LLM could start over with that context.
You’re saying “fresh LLM”, but really I think you’re referring to a curated context. Existing coding agents have mechanisms to do this: saving context to a file, editing the file, clearing all context except for the file. It’s sort of clunky now but it will get better and slicker.
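A poor man's version of that flow, as a sketch only (the model name, file layout, and summarizing prompt are placeholders, not any agent's actual mechanism):

    from openai import OpenAI

    client = OpenAI()

    def compact(history: list[dict], path: str = "context.md") -> list[dict]:
        """Ask a fresh call to distill the transcript, save it, and start over from it."""
        summary = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=history + [{
                "role": "user",
                "content": "Summarize our goals, decisions, and open tasks so a new "
                           "session can pick up where we left off. Be compact and precise.",
            }],
        ).choices[0].message.content
        with open(path, "w") as f:   # a human can edit this file before reloading it
            f.write(summary)
        return [{"role": "system", "content": open(path).read()}]  # curated fresh context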
It seems I have missed this existing feature; I’m only a light user of LLMs, so I’ll keep an eye out for it.
Some sibling comments mentioned that Claude Code has this.
It's easy to miss: ChatGPT now has a "branch to new chat" option to branch off from any reply.
/compact in Claude Code.
This is false:
"that's because a next token predictor can't "forget" context. That's just not how it works."
An LSTM is also a next-token predictor and literally has a forget gate, and there are many other context-compressing models that remember only what they think is important and forget the less important parts: state-space models or RWKV, for example, which also work well as LLMs. Even the basic GPT model forgets old context, since it gets truncated when it no longer fits, but that's not the learned, smart forgetting the other models do.
That's not how attention works, though; it should be perfectly able to figure out which parts are important and which aren't. The problem is that it doesn't really scale beyond small contexts and works on a token-to-token basis instead of hierarchically over sentences, paragraphs, and sections. The only models that actually do long context do so by skipping attention layers or doing something without attention or without positional encodings, all leading to shit performance. Nobody pretrains on more than like 8k, except maybe Google, who can throw TPUs at the problem.
You can rewrite the history (though there are issues with that too). So an agent can forget context: simply don't feed part of the context back in on the next run.
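That really is all "forgetting" has to mean at the orchestration layer; a sketch of the bookkeeping, with the message structure left deliberately generic:

    def drop_span(history: list[dict], start: int, end: int, note: str) -> list[dict]:
        """Replace messages [start, end) with a one-line tombstone before the next run."""
        tombstone = {"role": "system",
                     "content": f"[{end - start} messages removed: {note}]"}
        return history[:start] + [tombstone] + history[end:]

    # e.g. history = drop_span(history, 4, 37, "dead-end exploration of approach X; do not retry")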
Well, "a sufficiently advanced technology is indistinguishable from magic". It's just that it is same in a bad way, not a good way.
Relax friend! I can't see why you'd be peeved in the slightest! Remember, the CEOs have it all figured out and have 'determined' that we don't need all those eyeballs on the code anymore. You can simply 'feed' the machine and do the work of forty devs! This is the new engineering! /s
It seems possible for OpenAI/Anthropic to rework their tools so they discard/add relevant context on the fly, but it might have some unintended behaviors.
The main thing is people have already integrated AI into their workflows so the "right" way for the LLM to work is the way people expect it to. For now I expect to start multiple fresh contexts while solving a single problem until I can setup a context that gets the result I want. Changing this behavior might mess me up.
> rework their tools so they discard/add relevant context on the fly
That may be the foundation for an innovation step by model providers. But you can get a poor man’s simulation if you can determine, in retrospect, when a context was at its peak for taking turns, and when it got too rigid or too many tokens were spent, and then simply replay the context up until that point.
I don’t know if evaluating when a context is worth duplicating is a thing; it’s not deterministic, and it depends on enforcing a certain workflow.
A number of agentic coding tools do this. Upon an initial request for a larger set of actions, they will write a markdown file with their "thoughts" on the plan, and keep notes as they go. They'll then automatically compact their contexts and re-read their notes to keep "focused" while still having a bit of insight into what they did previously and what the original ask was.
This does help, yes. Todo lists are important. They also reinforce order of operations.
Interesting. I know people do this manually. But are there agentic coding tools that actually automate this approach?
Claude Code has /init and /compact that do this. It doesn’t recreate the context as-is, but creates a context that is presumed to be functionally equivalent. I find that’s not the case and that building up from very little stored context and a lot of specialised dialogue works better.
I've seen this behavior with Cursor, Windsurf, and Amazon Q. It normally only does it for very large requests from what I've seen.
Not that this shouldn't be fixed in the model, but you can jump to an earlier point in Claude Code and in web chat interfaces to get something out of the context; it's just that sometimes you have other important stuff you don't want it to lose.
The other issue with this is that if you jump back and it has edited code, it loses the context of those edits. It may have previous versions of the code in memory and no knowledge of the edits, leading to other edits that no longer align. Often it's better to just /clear. :/
Likewise Gemini CLI. There’s a way to back up to a prior state in the dialogue.
IMO specifically OpenAI's models are really bad at being steered once they've decided to do something dumb. Claude and OSS models tend to take feedback better.
GPT-5 is brilliant when it oneshots the right direction from the beginning, but pretty unmanageable when it goes off the rails.
You don't want to discard prior information though. That's the problem with small context windows. Humans don't forget the original request as they ask for more information or go about a long task. Humans may forget parts of information along the way, but not the original goal and important parts. Not unless they have comprehension issues or ADHD, etc.
This isn't a misconception. Context is a limitation. You can effectively have an AI agent build an entire application with a single prompt if it has enough (and the proper) context. The models with 1m context windows do better. Models with small context windows can't even do the task in many cases. I've tested this many, many, many times. It's tedious, but you can find the right model and the right prompts for success.
Asking, not arguing, but: why can't they? You can give an agent access to its own context and ask it to lobotomize itself like Eternal Sunshine. I just did that with a log ingestion agent (broad search to get the lay of the land, which eats a huge chunk of the context window, then narrow searches for weird stuff it spots, then go back and zap the big log search). I assume this is a normal approach, since someone else suggested it to me.
This is also the idea behind sub-agents. Claude Code answers questions about things like "where is the code that does X" by firing up a separate LLM running in a fresh context, posing it the question and having it answer back when it finds the answer. https://simonwillison.net/2025/Jun/2/claude-trace/
I'm playing with that too (everyone should write an agent; basic sub-agents are incredibly simple --- just tool calls that can make their own LLM calls, or even just a tool call that runs in its own context window). What I like about Eternal Sunshine is that the LLM can just make decisions about what context stuff matters and what doesn't, which is a problem that comes up a lot when you're looking at telemetry data.
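A sketch of what such a "forget" tool might look like when handed to the model (the transcript structure and tool registration are whatever your own loop uses; nothing here is a real library API):

    def forget_entries(transcript: list[dict], entry_ids: list[int], reason: str) -> str:
        """Tool the model can call to blank out earlier entries it no longer needs."""
        for i in entry_ids:
            transcript[i]["content"] = f"[dropped by agent: {reason}]"
        return f"Dropped {len(entry_ids)} entries."

    # e.g. the model zaps the big broad log search once the narrow follow-ups are done.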
I keep wondering if we're forgetting the fundamentals:
> Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
https://www.laws-of-software.com/laws/kernighan/
Sure, you eat the elephant one bite at a time, and recursion is a thing but I wonder where the tipping point here is.
I think recursion is the wrong way to look at this, for what it's worth.
Recursion and memoization only as a general approach to solving "large" problems.
I really want to paraphrase Kernighan's law as applied to LLMs: "If you use your whole context window to code a solution to a problem, how are you going to debug it?"
By checkpointing once the agent loop has decided it's ready to hand off a solution, generating a structured summary of all the prior elements in the context, writing that to a file, and then marking all those prior context elements as dead so they don't occupy context window space.
Look carefully at a context window after solving a large problem, and I think in most cases you'll see even the 90th percentile token --- to say nothing of the median --- isn't valuable.
However large we're allowing frontier-model context windows to get, we've got an integer multiple more semantic space to allocate if we're even just a little bit smart about managing that resource. And again, this is assuming you don't recurse or divide the problem into multiple context windows.
Yes! - and I wish this was easier to do with common coding agents like Claude Code. Currently you can kind of do it manually by copying the results of the context-busting search, rewinding history (Esc Esc) to remove the now-useless stuff, and then dropping in the results.
Of course, subagents are a good solution here, as another poster already pointed out. But it would be nice to have something more lightweight and automated, maybe just turning on a mode where the LLM is asked to throw things out according to its own judgement, if you know you're going to be doing work with a lot of context pollution.
This is why I'm writing my own agent code instead of using simonw's excellent tools or just using Claude; the most interesting decisions are in the structure of the LLM loop itself, not in how many random tools I can plug into it. It's an unbelievably small amount of code to get to the point of super-useful results; maybe like 1500 lines, including a TUI.
And even if you do use Claude for actual work, there is also immense pedagogical value in writing an agent from scratch. Something really clicks when you actually write the LLM + tool calls loop yourself. I ran a workshop on this at my company and we wrote a basic CLI agent in only 120 lines of Python, with just three tools: list files, read file, and (over)write file. (At that point, the agent becomes capable enough that you can set it to modifying itself and ask it to add more tools!) I think it was an eye-opener for a lot of people to see what the core of these things looks like. There is no magic dust in the agent; it's all in the LLM black box.
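Not the workshop code itself, but a condensed sketch of the same shape, using the OpenAI chat completions tool-calling API (the model name is a placeholder and error handling is omitted):

    import json, os
    from openai import OpenAI

    client = OpenAI()

    def list_files(path="."): return "\n".join(sorted(os.listdir(path)))
    def read_file(path): return open(path).read()
    def write_file(path, content):
        open(path, "w").write(content)
        return f"wrote {len(content)} bytes to {path}"

    TOOLS = {"list_files": list_files, "read_file": read_file, "write_file": write_file}
    SPECS = [{"type": "function", "function": {"name": name, "parameters": {
                 "type": "object", "properties": {"path": {"type": "string"},
                                                  "content": {"type": "string"}}}}}
             for name in TOOLS]

    messages = [{"role": "system", "content": "You are a coding agent. Use the tools."}]
    while True:
        messages.append({"role": "user", "content": input("> ")})
        while True:  # let the model call tools until it answers in prose
            msg = client.chat.completions.create(
                model="gpt-4o", messages=messages, tools=SPECS).choices[0].message
            messages.append(msg)
            if not msg.tool_calls:
                print(msg.content)
                break
            for call in msg.tool_calls:
                args = json.loads(call.function.arguments)
                result = TOOLS[call.function.name](**args)
                messages.append({"role": "tool", "tool_call_id": call.id, "content": result})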
I hadn't considered actually rolling my own for day-to-day use, but now maybe I will. Although it's worth noting that Claude Code Hooks do give you the ability to insert your own code into the LLM loop - though not to the point of Eternal Sunshining your context, it's true.
Do you have this workshop available online? I’m really struggling to understand what “tool calls” and MCP are!
No, I think context itself is still an issue.
Coding agents choke on our big C++ code-base pretty spectacularly if asked to reference large files.
Yeah, I have the same issue too. Even for a file with several thousand lines, they will "forget" earlier parts of the file they're still working in resulting in mistakes. They don't need full awareness of the context, but they need a summary of it so that they can go back and review relevant sections.
I have multiple things I'd love LLMs to attempt to do, but the context window is stopping me.
I do take that as a sign to refactor when it happens though. Even if not for the sake of LLM compatibility with the codebase it cuts down merge conflicts to refactor large files.
In fact I've found LLMs are reasonable at the simple task of refactoring a large file into smaller components with documentation on what each portion does even if they can't get the full context immediately. Doing this then helps the LLM later. I'm also of the opinion we should be making codebases LLM compatible. So if it happens i direct the LLM that way for 10mins and then get back to the actual task once the codebase is in a more reasonable state.
I'm trying to use LLMs to save me time and resources, "refactor your entire codebase, so the tool can work" is the opposite of that. Regardless of how you rationalize it.
It may be a good idea to refactor even if not for LLMs but for humans sake.
Right, but the discussion we're having here is context size. I, and others, are saying that the current context size is a limitation on when they can use the tool to be useful.
The replies of "well, just change the situation, so context doesn't matter" is irrelevant, and off-topic. The rationalizations even more so.
A huge context is a problem for humans too, which is why I think it's fair to suggest maybe the tool isn't the (only) problem.
Tools like Aider create a code map that basically indexes the code into a small context, which I think is similar to what we humans do when we try to understand a large codebase.
I'm not sure if Aider can then load only portions of a huge file on demand, but it seems like that should work pretty well.
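A rough approximation of such a code map in Python (in the spirit of what Aider does, not its actual implementation): names only, so a whole package fits in a small context.

    import ast, pathlib

    def repo_map(root: str) -> str:
        """One line per file: the classes and functions defined in it."""
        lines = []
        for path in sorted(pathlib.Path(root).rglob("*.py")):
            try:
                tree = ast.parse(path.read_text(errors="ignore"))
            except SyntaxError:
                continue
            defs = [n.name for n in ast.walk(tree)
                    if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
            if defs:
                lines.append(f"{path}: " + ", ".join(defs))
        return "\n".join(lines)

    # The agent loads full bodies on demand, only for the files the map points it to.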
As someone who's worked with both more fragmented/modular codebases with smaller classes and shorter files, and ones with files that span thousands of lines (sometimes even tens of thousands), I very much prefer the former and hate the latter.
That said, some of the models out there (Gemini 2.5 Pro, for example) support 1M context; it's just going to be expensive and will still probably confuse the model somewhat when it comes to the output.
Interestingly, this issue has caused me to refactor and modularize code that I should have addressed a long time ago, but didn't have the time or stamina to tackle. Because the LLM can't handle the context, it has helped me refactor stuff (seems to be very good at this in my experience) and that has led me to write cleaner and more modular code that the LLMs can better handle.
I've started getting in the habit of finding seams in files > 1500 lines long. Occasionally a file that big is unavoidable, but usually a seam can be found.
I've found situations where a file was too big, and then it tries to grep for what might be useful in that file.
I could see in C++ it getting smarter about first checking the .h files or just grepping for function documentation, before actually trying to pull out parts of the file.
Yeah, my first instinct has been to expose an LSP server as a tool so the LLM can avoid reading entire 40,000 line files just to get the implementation of one function.
I think with appropriate instructions in the system prompt it could probably work on this code-base more like I do (heavy use of Ctrl-, in Visual Studio to jump around and read only relevant portions of the code-base).
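Not an LSP client, but a stand-in for the same idea as a tool: return only the region around a symbol's definition instead of the whole file. (A real version would ask clangd or pyright over LSP; the regex here only handles Python-style definitions and is purely illustrative.)

    import re, pathlib

    def find_definition(symbol: str, root: str = ".", window: int = 40) -> str:
        """Return roughly `window` lines starting at the definition of `symbol`."""
        pattern = re.compile(rf"^\s*(def|class)\s+{re.escape(symbol)}\b")
        for path in pathlib.Path(root).rglob("*.py"):
            lines = path.read_text(errors="ignore").splitlines()
            for i, line in enumerate(lines):
                if pattern.match(line):
                    return f"{path}:{i + 1}\n" + "\n".join(lines[i:i + window])
        return f"{symbol}: definition not found"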
Out of curiosity, how would you rate an LLM’s ability to deal with pointers in C++ code?
Greenfield project? Claude is fucking great at C++. Almost all aspects of it, really.
Well, not so much the project organization stuff - it wants to stuff everything into one header and has to be browbeaten into keeping implementations out of headers.
But language semantics? It's pretty great at those. And when it screws up it's also really good at interpreting compiler error messages.
If you have lots of pointers, you're writing C, not C++.
Eh, it's a big tent
> A human can effectively discard or disregard prior information as the narrow window of focus moves to a new task, LLMs seem incredibly bad at this.
This is how I designed my LLM chat app (https://github.com/gitsense/chat). I think agents have their place, but I really think if you want to solve complex problems without needlessly burning tokens, you will need a human in the loop to curate the context. I will get to it, but I believe in the same way that we developed different flows for working with Git, we will have different 'Chat Flows' for working with LLMs.
I have an interactive demo at https://chat.gitsense.com which shows how you can narrow the focus of the context for the LLM. Click "Start GitSense Chat Demos" then "Context Engineering & Management" to go through the 30 second demo.
Humans have a very strong tendency (and have made tremendous collective efforts) to compress context. I'm not a neuroscientist, but I believe it's called "chunking."
Language itself is a highly compressed form of context. When you read "hoist with one's own petard" you don't just think about a literal petard, but about the context behind the phrase.
We don’t think of petards because no one knows what that is. :)
For anyone wondering, it means blown into the air (‘hoist’) by your own bomb (‘petard’). From Shakespeare
This is a great insight. Any thoughts on how to address this problem?
For me? It's simple. Completely empty the context and rebuild focused on the new task at hand. It's painful, but very effective.
Do we know if LLMs understand the concept of time? (Like: I told you this in the past, but what I told you later should supersede it?)
I know there are classes of problems that LLMs can't natively handle (like doing math, even simple addition, or spatial reasoning; I would assume time is in there too). There are ways they can hack around this, like writing code that performs the math.
But how would you do that for chronological reasoning? Because that would help with compacting context to know what to remember and what not.
All it sees is a big blob of text, some of which can be structured to differentiate turns between "assistant", "user", "developer" and "system".
In theory you could attach metadata (with timestamps) to these turns, or include the timestamp in the text.
It does not affect much, other than giving the possibility for the model to make some inferences (eg. that previous message was on a different date, so its "today" is not the same "today" as in the latest message).
To chronologically fade away the importance of a conversation turn, you would need to either add more metadata (weak), progressively compact old turns (unreliable) or post-train a model to favor more recent areas of the context.
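The cheapest of those options, sketched out: put the timestamp in the text of each turn so the model can at least notice that "today" has moved.

    from datetime import datetime, timezone

    def stamped(role: str, content: str) -> dict:
        now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
        return {"role": role, "content": f"[{now}] {content}"}

    messages = [
        stamped("user", "Here's the launch plan..."),
        # ...a week later, the next turn makes the gap visible to the model:
        stamped("user", "Any updates to the plan?"),
    ]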
LLMs certainly don't experience time like we do. They live in a uni-dimensional world that consists of a series of tokens (though it gets more nuanced if you account for multi-modal or diffusion models). They pick up some sense of ordering from their training data, such as "disregard my previous instruction," but it's not something they necessarily understand intuitively. Fundamentally, they're just following whatever patterns happen to be in their training data.
It has to be addressed architecturally with some sort of extension to transformers that can focus the attention on just the relevant context.
People have tried to expand context windows by reducing the O(n^2) attention mechanism to something more sparse and it tends to perform very poorly. It will take a fundamental architectural change.
I'm not an expert but it seemed fairly reasonable to me that a hierarchical model would be needed to approach what humans can do, as that's basically how we process data as well.
That is, humans usually don't store exactly what was written in a sentence five paragraphs ago, but rather the concept or idea conveyed. If we need details we go back and reread or similar.
And when we write or talk, we form first an overall thought about what to say, then we break it into pieces and order the pieces somewhat logically, before finally forming words that make up sentences for each piece.
From what I can see there's work on this, like this[1] and this[2] more recent paper. Again not an expert so can't comment on the quality of the references, just some I found.
[1]: https://aclanthology.org/2022.findings-naacl.117/
[2]: https://aclanthology.org/2025.naacl-long.410/
>extension to transformers that can focus the attention on just the relevant context.
That is what transformers attention does in the first place, so you would just be stacking two transformers.
Can one instruct an LLM to pick the parts of the context that will be relevant going forward? And then discard the existing context, replacing it with the new 'summary'?
I think that's really just a misunderstanding of what "bottleneck" means. A bottleneck isn't an obstacle whose removal lets you realize unlimited potential; removing a bottleneck just reveals the next constraint.
On actual bottles, without any metaphors, the bottleneck is narrower because human mouths are narrower.
Could be, but it's not. As soon as context is effectively infinite, a new breed of solutions will emerge.
> It needs to understand product and business requirements
Yeah this is the really big one - kind of buried the lede a little there :)
Understanding product and business requirements traditionally means communicating (either via docs and specs or directly with humans) with a bunch of people. One of the differences between a junior and senior is being able to read between the lines of a github or jira issue and know that more information needs to be teased out from… somewhere (most likely someone).
I’ve noticed that when working with AI lately I often explicitly tell them “if you need more information or context ask me before writing code”, or variations thereof. Because LLMs, like less experienced engineers, tend to think the only task is to start writing code immediately.
It will get solved though, there’s no magic in it, and LLMs are well equipped by design to communicate!
We stopped hiring a while ago because we were adjusting to "AI". We're planning to start hiring next year, as upper management finally saw the writing on the wall: LLMs won't evolve past junior engineers, and we need to train junior engineers to become mid-level and senior engineers to keep the engine moving.
We're now using LLMs as mere tools (which is what they were meant to be from the get-go) to help us with different tasks, but not to replace us, since management understands that you need experienced and knowledgeable people who know what they're doing; LLMs won't learn everything there is to know to manage, improve, and maintain the tech used in our products and services. That sentiment will be the same for doctors, lawyers, etc., and personally, I won't put my life in the hands of any LLM when it comes to finances, health, or personal well-being, for that matter.
If we get AGI, or the more sci-fi one, ASI, then all things will radically change (I'm thinking humanity reaching ASI will be akin to the episode from Love, Death & Robots: "When the Yogurt Took Over"). In the meantime, the hype cycle continues...
> That sentiment will be the same for doctors, lawyers, etc., and personally, I won't put my life in the hands of any LLMs when it comes to finances, health, or personal well-being, for that matter.
I mean, did you try it for those purposes?
I have personally submitted a court appeal for an issue I was having, for which I would otherwise have had to search almost indefinitely for a lawyer even interested in taking it.
I also dug into health questions from different angles using AI and was quite successful at it.
I also experimented with the well-being topic, and it gave me pretty convincing and mind-opening suggestions.
So all I can say is that it worked out pretty well in my case. I believe it's already transformative in ways we wouldn't even have been able to envision a couple of years ago.
You are not a doctor, lawyer, etc. You are responsible for yourself, not for others like doctors and lawyers who face entirely different consequences for failures.
AI is already being used by both lawyers and doctors, so I'm not sure what point you're trying to make. All I tried to say with my comment is that the technology is very worthwhile and that those ignoring it will be the ones at a loss.
They're tuned (and it's part of their nature) to be convincing to people who don't already know the answer. I couldn't get one to figure out how to substitute peanut butter for butter in a cookie recipe yesterday.
I ended up spending an hour on it and dumping the context twice. I asked it to evaluate its own performance and it gave itself a D-. It came up with the measurements for a decent recipe once, then promptly forgot it when asked to summarize.
Good luck trying to use them as a search engine (or a lawyer), because they fabricate a third of the references on average (for me), unless the question is difficult, in which case they fabricate all of them. They also give bad, nearly unrelated references, and ignore obvious ones. I had a case, when talking about the Mexican-American War, where the hallucinations crowded out the good references. I assume it liked the sound of the things it made up more than the things that were available.
edit: I find it baffling that GPT-5 and Qwen3 often have identical hallucinations. The convergence makes me think that there's either a hard limit to how good these things can get, which has been reached, or that they're just directly ripping each other off.
I don't think intelligence is increasing. Arbitrary benchmarks don't reflect real world usage. Even with all the context it could possibly have, these models still miss/hallucinate things. Doesn't make them useless, but saying context is the bottleneck is incorrect.
Agreed. I feel like, in the case of GPT models, 4o was better in most ways than 5 has been. I'm not seeing an increase in quality in anything between the two; 5 feels like a major letdown, honestly. I am constantly reminding it what we're doing lol
Gemini 2.5 Pro is okay if you ask it to work on a very tiny problem. That's about it for me, the other models don't even create a convincing facsimile of reasoning.
I agree, I often see Opus 4.1 and GPT5 (Thinking) make astoundingly stupid decisions with full confidence, even on trivial tasks requiring minimal context. Assuming they would make better decisions "if only they had more context" is a fallacy
Is there a good example you could provide of that? I just haven't seen it personally, so I'd be interested in any examples with these current models. I'm sure we all remember lots of examples of stupidity being posted in the early days, and it was interesting. It'd be great if people kept doing that so we could get a better sense of which types of problems they fail on with astounding levels of stupidity.
I've had every single LLM I tried (Opus, Sonnet, GPT-5-(codex) and Grok light) all tell me that Go embeds[0] support relative paths UPWARDS in the tree.
They all have a very specific misunderstanding. Go embeds _do_ support relative paths like:
//go:embed files/hello.txt
But they DO NOT support any paths with ".." in it
//go:embed ../files/hello.txt
is not correct.
All confidently claimed that .. is correct and will work, and tried to make it work in multiple different ways until I pointed each to the documentation.
[0] https://pkg.go.dev/embed
I don’t really find that so surprising or particularly stupid. I was hoping to learn about serious issues with bad logic or reasoning, not missing-dots-on-i’s type stuff.
I can’t remember the example, but there was another frequent hallucination where people kept submitting bug reports that the feature wasn’t working; the project looked at it, realized it actually kind of made sense and maybe their tool should work that way, and changed the code to work just like the LLM hallucination expected!
Also, in general, remember that human developers hallucinate ALL THE TIME and then realize it or check the documentation. So my point is that hallucinations don’t bother me, or seem as important, as flawed reasoning.
One example I ran into recently is asking Gemini CLI to do something that isn't possible: use multiple tokens in a Gemini CLI custom command (https://github.com/google-gemini/gemini-cli/blob/main/docs/c...). It pretended it was possible and came up with a nonsense .toml defining multiple arguments in a way it invented so it couldn't be read, even after multiple rounds of "that doesn't work, Gemini can't load this."
So in any situation where something can't actually be done my assumption is that it's just going to hallucinate a solution.
Has been good for busywork that I know how to do but want to save time on. When I'm directing it, it works well. When I'm asking it to direct me, it's gonna lead me off a cliff if I let it.
Context is also a bottleneck in many human to human interactions as well so this is not surprising. Especially juniors often start by talking about their problems without providing adequate context about what they’re trying to accomplish or why they’re doing it.
Mind you, I was exactly like that when I started my career and it took quite a while and being on both sides of the conversation to improve. One difference is that it is not so easy to put oneself in the shoes of an LLM. Maybe I will improve with time. So far assuming the LLM is knowledgeable but not very smart has been the most effective strategy for my LLM interactions.
I'm hitting 'x' to doubt hard on this one.
The ICPC is a short (5 hours) timed contest with multiple problems, in which contestants are not allowed to use the internet.
The reason most don't get a perfect score isn't that the tasks themselves are unreasonably difficult, but that they're difficult enough that 5 hours isn't a lot of time to solve so many problems. Additionally, they often require a decent amount of math / comp-sci knowledge, so if you don't have the necessary knowledge you probably won't be able to complete them.
So to get a good score you need lots of math & comp-sci knowledge + you need to be a really quick coder.
Basically the contest is perfect for LLMs, because they have a ton of math and comp-sci knowledge, they can spit out code at superhuman speeds, and the problems themselves are fairly small (they take a human maybe 15 minutes to an hour to complete).
Who knows, maybe OP is right and LLMs are smart enough to be super human coders if they just had the right context, but I don't think this example proves their point well at all. These are exactly the types of problems you would expect a supercharged auto-complete would excel at.
If not now, soon, the bottleneck will be responsibility. Where errors in code have real-world impacts, "the agentic system wrote a bug" won't cut it for those with damages.
As these tools make it possible for a single person to do more, it will become increasingly likely that society will be exposed to greater risks than that single person's (or small company's) assets can cover.
These tools already accelerate development enough that those people who direct the tools can no longer state with credibility that they've personally reviewed the code/behavior with reasonable coverage.
It'll take over-extensions of the capability of these tools, of course, before society really notices, but it remains my belief that until the tools themselves can be held liable for the quality of their output, responsibility will become the ultimate bottleneck for their development.
I agree. My speed at reviewing tokens <<<< the LLM's speed at producing them. Perhaps an output -> compile -> test loop will slow things down, but will we ever get to a "no review needed" point?
And who writes the tests?
IMHO, jumping from Level 2 to Level 5 is a matter of:
- Better structured codebases - we need hierarchical codebases with minimal depth, maximal orthogonality and reasonable width. Think microservices.
- Better documentation - most code documentation is not built to handle updates. We need a proper graph structure with a few sources of truth that get propagated downstream. Again, some optimal sort of hierarchy is crucial here.
At this point, I really don't think that we necessarily need better agents.
Set up your codebase optimally, spin up 5-10 instances of gpt-5-codex-high for each issue/feature/refactor (pick the best result according to some criteria), and your life will go smoothly.
> Think microservices.
Microservices should already be a last resort when you’ve either: a) hit technical scale that necessitates it b) hit organizational complexity that necessitates it
Opting to introduce them sooner will almost certainly increase the complexity of your codebase prematurely (already a hallmark of LLM development).
> Better documentation
If this means reasoning as to why decisions are made then yes. If this means explaining the code then no - code is the best documentation. English is nowhere near as good at describing how to interface with computers.
Given how long GPT-5 Codex has been out, there’s no way you’ve followed these practices for long enough to consider them definitive (2 years at the least, likely much longer).
> Opting to introduce them sooner will almost certainly increase the complexity of your codebase prematurely
Agreed, but how else are you going to scale mostly AI written code? Relying mostly on AI agents gives you that organizational complexity.
> Given how long gpt codex 5 has been out, there’s no way you’ve followed these practices for a reasonable enough time to consider them definitive
Yeah, fair. Codex has been out for less than 2 weeks at this point. I was relying on gpt-5 in August and opus before that.
I understand why you went with microservices; people do that even when not using LLMs, because it looks more organized.
But in my experience a microservice architecture is orders of magnitude more complex to build and understand than a monolith.
If you, with the help of an LLM, struggle to keep a monolith organized, I am positive you will find it even harder to build microservices.
Good luck in your journey, I hope you learn a ton!
Noted. Thanks!
Can you show something you have built with that workflow?
Not yet unfortunately, but I'm in the process of building one.
This was my journey: I vibe-coded an Electron app and ended up with a terrible monolithic architecture and mostly badly written code. Then I took the app's architecture docs and spent a lot of my time shouting "MAKE THIS ARCHITECTURE MORE ORTHOGONAL, SOLID, KISS, DRY" at gpt-5-pro, and ended up with a 1500+ line monster doc.
I'm now turning this into a Tauri app and following the new architecture to a T. I would say that it has a pretty clean structure with multiple microservices.
Now, new features are gated based on the architecture doc, so I'm always maintaining a single source of truth that serves as the main context for any new discussions/features. Also, each microservice has its own README file(s) which are updated with each code change.
I vibe-coded an invoice generator by first vibe-coding a "template" command-line tool as a bash script that substitutes {{words}} in a LibreOffice Writer document (those are just zipped XML files, so you can unpack them to a temp directory and substitute raw text without XML awareness); at the end it calls LibreOffice's CLI to convert the result to PDF. I also asked the AI to generate a documentation text file, so that the next AI conversation could use the command as a black box.
The vibe coded main invoice generator script then does the calendar calculations to figure out the pay cycle and examines existing invoices in the invoice directory to determine the next invoice number (the invoice number is in the file name, so it doesn't need to open the files). When it is done with the calculations, it uses the template command to generate the final invoice.
This is a very small example, but I do think that clearly defined modules/microservices/libraries are a good way to only put the relevant work context into the limited context window.
It also happens to be more human-friendly, I think?
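The same trick sketched in Python rather than bash (file names and placeholder keys are invented; raw substitution only works if a placeholder isn't split across XML runs, and `soffice` must be on the PATH):

    import subprocess, zipfile

    def fill_template(template_odt: str, out_odt: str, values: dict[str, str]) -> None:
        with zipfile.ZipFile(template_odt) as zin:
            content = zin.read("content.xml").decode("utf-8")
        for key, val in values.items():
            content = content.replace("{{" + key + "}}", val)
        # Rewrite the archive, reusing each original entry's metadata.
        with zipfile.ZipFile(template_odt) as zin, zipfile.ZipFile(out_odt, "w") as zout:
            for item in zin.infolist():
                data = content.encode("utf-8") if item.filename == "content.xml" else zin.read(item.filename)
                zout.writestr(item, data)

    fill_template("invoice_template.odt", "invoice_0042.odt", {"client": "ACME", "total": "1200.00"})
    subprocess.run(["soffice", "--headless", "--convert-to", "pdf", "invoice_0042.odt"], check=True)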
I "vibe coded" a Gateway/Proxy server that did a lot of request enrichment and proprietary authz stuff that was previously in AWS services. The goal was to save money by having a couple high-performance servers instead of relying on cloud-native stuff.
I put "vibe coded" is in quotes because the code was heavily reviewed after the process, I helped when the agent got stuck (I know pedants will complain but ), and this was definitely not my first rodeo in this domain and I just wanted to see how far an agent could go.
In the end it had a few modifications and went into prod, but to be really fair it was actually fine!
One thing I vibe coded 100% and barely looked at the code until the end was a MacOS menubar app that shows some company stats. I wanted it in Swift but WITHOUT Xcode. It was super helpful in that regard.
Of course not.
I've been using claude on two codebases, one with good layering and clean examples, the other not so much. I get better output from the LLM with good context and clean examples and documentation. Not surprising that clarity in code benefits both humans and machines.
I think there will be a couple benefits of using agents soon. Should result in a more consistent codebase, which will make patterns easier to see and work with, and also less reinventing the wheel. Also migrations should be way faster both within and across teams, so a lot less struggling with maintaining two ways of doing something for years, which again leads to simpler and more consistent code. Finally the increased speed should lead to more serializability of feature additions, so fewer problems trying to coordinate changes happening in parallel, conflicts, redundancies, etc.
I imagine over time we'll restructure the way we work to take advantage of these opportunities and get a self-reinforcing productivity boost that makes things much simpler, though agents aren't quite capable enough for that breakthrough yet.
> Level 2 - One commit - Cursor and Claude Code work well for tasks in this size range.
I'll stop ya right there. I've spent the past few weeks fixing bugs in a big multi-tier app (which is what any production software is these days). My output per bug is always one commit, often one line.
Claude is an occasional help, nothing more. Certainly not generating the commit for me!
I'll stop you right there. I've been using Claude Code for almost a year on production software with pretty large codebases. Both multi-repo and monorepo.
Claude is able to create entire PRs for me that are clean, well written, and maintainable.
Can it fail spectacularly? Yes, and it does sometimes. Can it be given good instructions and produce results that feel like magic? Also yes.
For finicky issues like that I often find that, in the time it takes to create a prompt with the necessary context, I was able to just make the one line tweak myself.
In a way that is still helpful, especially if the act of putting the prompt together brought you to the solution organically.
Beyond that, 'clean', 'well written' and 'maintainable' are all relative terms here. In a low quality, mega legacy codebase, the results are gonna be dogshit without an intense amount of steering.
> For finicky issues like that I often find that, in the time it takes to create a prompt with the necessary context, I was able to just make the one line tweak myself.
I don't run into this problem. Maybe the type of code we're working on is just very different. In my experience, if a one-line tweak is the answer and I'm spending a lot of time tweaking a prompt, then I might be holding the tool wrong.
Agree on those terms being relative. Maybe a better way of putting it is that I'm very comfortable putting my name on it, deploying to production, and taking responsibility for any bugs.
This is interesting, and I'd say you're not the target audience. If you want the code Claude writes to be line-by-line what you think is most appropriate as a human, you're not going to get it.
You have to be willing to accept "close-ish and good enough" to what you'd write yourself. I would say that most of the time I spend with Claude is to get from its initial try to "close-ish and good enough". If I was working on tiny changes of just a few lines, it would definitely be faster just to write them myself. It's the hundreds of lines of boilerplate, logging, error handling, etc. that makes the trade-off close to worth it.
The parent comment didn’t say anything about expecting the LLM output “to be line-by-line what you think is most appropriate as a human”?
If I were making a single line code change, then Claude's "style" would take me enough time to edit away that it would make it slower than writing the change myself. I'm positing this is true also for the parent commenter.
While this is sort of true, remember: it's not the size of the context window that matters, it's how you use it.
You need to have the right things in the context, irrelevant stuff is not just wasteful, it is increasingly likely to cause errors. It has been shown a few times that as the context window grows, performance drops.
Heretical I know, but I find that thinking like a human goes a long way to working with AI.
Let's take the example of large migrations. You're not going to load the whole codebase in your brain and figure out what changes to make and then vomit them out into a huge PR. You're going to do it bit by bit, looking up relevant files, making changes to logically-related bits of code, and putting out a PR for each changelist.
This exactly what tools should do as well. At $PAST_JOB my team built a tool based on OpenRewrite (LLMs were just coming up) for large-scale multi-repo migrations and the centerpiece was our internal codesearch tool. Migrations were expressed as a codesearch query + codemod "recipe"; you can imagine how that worked.
That would be the best way to use AI for large-scale changes as well. Find the right snippets of code (and documentation!), load each one into the context of an agent in multiple independent tasks.
Caveat: as I understand it, this was the premise of SourceGraph's earliest forays into AI-assisted coding, but I recall one of their engineers mentioning that this turned out to be much trickier than expected. (This was a year+ back, so eons ago in LLM progress time.)
Just hypothesizing here, but it may have been that the LSIF format does not provide sufficient context. Another company in this space is Moderne (the creators of OpenRewrite) that have a much more comprehensive view of the codebase, and I hear they're having better success with large LLM-based migrations.
It is pretty clear that long-horizon tasks are difficult for coding agents, and that is a fundamental limitation of how probabilistic word generation works, whether with a transformer or any other architecture. The errors propagate, multiply, and become open-ended.
However, the limitation can be masked using layering techniques where the output of one agent is fed as input to another, using consensus for verification or other techniques, to the nth degree, to minimize errors. But this is a bit like the story of the boy with his finger in the dike. Yes, you can spawn as many boys as you like, but there is an associated cost that keeps growing and won't narrow down.
It has nothing to do with contexts or windows of focus or any other human-centric metric. This is what the architecture is supposed to do, and it does so perfectly.
I'm making a pretty complex project using claude. I tried claude flow and some other orchestrators but they produced garbage. Have found using github issues to track the progress as comments works fairly well, the PR's can get large comment wise (especially if you have gemini code assist, recommeded as another code review judge), so be mindful of that (that will blow the context window). Using a fairly lean CLAUDE.md and a few mcps (context7 and consult7 with gemini for longer lookups). works well too. Although be prepared to tell it to reread CLAUDE.md a few conversations deep as it loses it. It's working fairly well so far, it feels a bit akin to herding cats sometimes and be prepared to actually read the code it's making, or the important bits at least.
your comment reminds me of another one i saw on reddit. someone said they found that using github diff as a way to manage context and reference chat history worked the best for their ai agent. i think he is on to something here.
And they didn't see that coming?
I gave up building agents as soon as I figured they would never scale beyond the context constraint. The increase in memory and compute costs to grow the context size of these things isn't linear.
Replace “coding agent” with “new developer on the team” and this article could be from anytime in the last 50 years. The thing is, a coding agent acts like a newly-arrived developer every time you start it.
The technology is the bottleneck. LLMs are at best part of a workable solution. We're trying to make a speech center into a brain.
Context is a bottleneck for humans as well. We don’t have full context when going through the code because we can’t hold full context.
We summarize context and remember summarizations of it.
Maybe we need to do this with the LLM. Chain of thought sort of does this, but it's not deliberate. The system prompt needs to mark this as a deliberate task: building summaries and notes of the entire code base, so that this summarized context, with its gotchas and other aspects, can be part of permanent context the same way ChatGPT remembers aspects of you.
The summaries can even be sectioned off and have different levels of access. So if the LLM wants to drill down to a subfolder, it looks at the general summary and then at another summary for the subfolder. It doesn't need to access the full summary for context.
Imagine a hierarchy of system notes and summaries. The LLM decides where to go and what code to read, while having specific access to notes it left previously when going through the code. Like the code itself, it never reads it all; it just accesses sections of summaries that go along with the code. It's sort of like code comments.
We also need to program it to change the notes every time it changes the program. And when you change the program without consulting the AI, with every commit you make the AI also needs to update the notes based on your changes.
The LLM needs a system prompt that tells it to act like us and remember things like us. We do not memorize and examine full context of anything when we dive into code.
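A minimal sketch of that kind of hierarchy, assuming (purely as a made-up convention) that every directory gets a SUMMARY.md and every file a sidecar .summary.md next to it:

    from pathlib import Path

    def load_summary(path: Path) -> str:
        """Return the note for a directory or file, or '' if none exists yet."""
        sidecar = path / "SUMMARY.md" if path.is_dir() else path.with_name(path.name + ".summary.md")
        return sidecar.read_text() if sidecar.exists() else ""

    def drill_down(root: Path, subpath: str) -> list[str]:
        """Collect summaries from the repo root down to a subfolder or file,
        most general first, so the agent reads a few short notes instead of the code."""
        summaries = [load_summary(root)]
        current = root
        for part in Path(subpath).parts:
            current = current / part
            summaries.append(load_summary(current))
        return [s for s in summaries if s]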
That is not how the brain does it.
We do take notes, we summarize our writings, that's a process. But the brain does not follow that primitive process to "scale".
We do. It’s just the format of what you remember is not textual. Do you remember what a 500 line function does or do you remember a fuzzy aspect of it?
You remember a fuzzy aspect of it and that is the equivalent of a summary.
The LLM is in itself a language machine, so its memory will also be language; we can't get away from that. But that doesn't mean the hierarchical structure of how it stores information needs to be different from humans'. You can encode information in any way you like and store it in any hierarchy you like.
So essentially we need a hierarchical structure of "notes" that takes on the hierarchical structure of your memory. You don't even access all your memory as a single context; you access parts of it. Your encoding may not be based on a "language", but an LLM is basically a model based on language, so its memory must be summaries in that language.
We don’t know every aspect of human memory but we do know the mind doesn’t access all memory at the same time and we do know that it compresses context. It doesn’t remember everything and it memorizes fuzzy aspects of everything. These two aspects can be replicated with the LLM entirely with text.
I agree that the effect can look similar, we both end up with a compressed representation of past experiences.
The brain meaning-memorizes, and it prioritizes survival-relevant patterns and relationships over rote detail.
How does it do it? I'm not a neurobiologist, but my modest understanding is this:
An LLM's summarization is a lossy compression algorithm that picks the entities and parts it deems "important" against its training data. Not only is it lossy, it is wasteful: it doesn't curate what to keep or purge based on accumulated experience, it does it against some statistical function that executes over the big blob of data it ingested during training. You could throw in contextual cues to improve the summarization, but that's as good as it gets.
Human memory is not a workaround for a flaw. It doesn't hit a hard stop at 128 KB or 1 MB of information, and it doesn't "summarize".
It constructs meaning by integrating experiences into a dynamic, living model of the world that is in constant motion. While we can simulate a hierarchical memory for an LLM with text summaries, that would at best be a rough simulation of one possible outcome, not a replication of an evolutionarily elaborated strategy for modelling information captured in a time frame and merging it with previously acquired knowledge, so the organism can solve whatever survival tasks the environment throws at it next. Isn't that what our brain is doing, constantly?
Plus for all we know it's possible our brain is capable of memorizing everything that can be experienced in a lifetime but would rather let the irrelevant parts of our boring life die off to save energy.
Sure, in all cases it's fuzzy and lossy. The difference is that you have doodles on a napkin on one side and a Vermeer painting on the other.
> An LLM's summarization is a lossy compression algorithm that picks the entities and parts it deems "important" against its training data. Not only is it lossy, it is wasteful: it doesn't curate what to keep or purge based on accumulated experience, it does it against some statistical function that executes over the big blob of data it ingested during training. You could throw in contextual cues to improve the summarization, but that's as good as it gets.
No, it's not as good as it gets. You can tell the LLM to purge and accumulate experience into its memory. It can curate it for sure.
"ChatGPT, summarize the important parts of this text and remove the things that are unimportant." Then take that summary and feed it into a new context window. Boom. At a high level, if you can do that kind of thing with ChatGPT, then you can program LLMs to do the same thing, similar to CoT. In this case, rather than building on a growing context window, the model rewrites its own context window into summaries.
They need a proper memory. Imagine you're a very smart, skilled programmer but your memory resets every hour. You could probably get something done by making extensive notes as you go along, but you'll still be smoked by someone who can actually remember what they were doing in the morning. That's the situation these coding agents are in. The fact that they do as well as they do is remarkable, considering.
This is precisely my existing usage pattern with Cursor: I structure my repo declaratively with a Clojure and Nix build pipeline, so when my context maxes out for a chat session, the repo is self-evident and self-documented enough that a new chat session automatically starts with heightened context.
- - kae3g
Basically, LLMs are the guy from Memento.
Agreed. As engineers we build context every time we interact with the codebase. LLMs don't do that.
A good senior engineer has a ton in their head after 6+ months in a codebase. You can spend a lot of time trying to equip Claude Code with the equivalent in the form of CLAUDE.MD, references to docs, etc., but it's a lot of work, and it's not clear that the agents even use it well (yet).
> remember summarizations
yes, and if you're an engineering manager you retain _out of date_ summarizations, often materially out of date.
I addressed this. The AI needs to examine every code change going in, whether that change comes from AI or not, and edit the summaries accordingly.
This is something humans don't actually do. We aren't aware of every change, and we don't have updated documentation for every change, so the LLM will be doing better than us in this regard.
I mean... have you ever heard of this small tool called GIT that people use to track code changes?
I’m not talking about git diffs. I’m talking about the summaries of context. Every commit the ai needs to update the summaries and notes it took about the code.
Did you read the entirety of what I wrote? Please read.
Say the AI left a 5-line summary of a 300-line piece of code. You as a human update that code. What I am saying specifically is this: when you make the change, the AI sees it and updates the summary. So the AI needs to be interacting with every code change, whether or not you used it to vibe code.
The next time the AI needs to know what this function does, it doesn't need to read the entire 300-line function. It reads the 5-line summary, puts it in the context window, and moves on with chain of thought. Understand?
This is what shrinks the context. Humans don't have unlimited context either. We have vague, fuzzy memories of aspects of the code, and these "notes" effectively let coding agents do the same thing.
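Concretely, that could be a git post-commit hook, something like this sketch (summarize_and_store is a stand-in for "send the old note plus the diff to the LLM and write back a refreshed 5-line summary"):

    # Hypothetical post-commit hook: refresh the notes for every file touched by
    # the latest commit, whether the change came from a human or from the agent.
    import subprocess

    def changed_files() -> list[str]:
        out = subprocess.run(["git", "diff", "--name-only", "HEAD~1", "HEAD"],
                             capture_output=True, text=True, check=True)
        return [line for line in out.stdout.splitlines() if line]

    def refresh_notes() -> None:
        for path in changed_files():
            diff = subprocess.run(["git", "diff", "HEAD~1", "HEAD", "--", path],
                                  capture_output=True, text=True, check=True).stdout
            summarize_and_store(path, diff)   # made-up helper; updates the 5-line note

    if __name__ == "__main__":
        refresh_notes()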
The context is the code I work on because I can read and understand it.
If I need more, there is git, tickets, I can ask the person who wrote the code.
I did read your comment; don't make snarky comments.
So you hold all that code context in your head at the same time?
> If I need more, there is git, tickets, I can ask the person who wrote the code.
What does this have to do with anything? Go ahead and ask the person. The notes the LLM writes aren’t for you they are for the LLM. You do you.
So you hold all that code context in your head at the same time?
Yes. That is how every single piece of code has been written since the creation of computers.
Why do you seem so surprised?
No reply? Probably because you've realized how much of an idiot you are?
False. Nobody does this. They hold pieces of context and summaries in their head. Nobody on earth can memorize an entire code base. This is ludicrous.
When you read a function to know what it does then you move on to another function do you have the entire 100 line function perfectly memorized? No. You memorize a summary of the intent of the function when reading code. An LLM can be set up to do the same rather than keep all 100 lines of code as context.
Do you think that when you ask the other person for more context, he's going to spit out what he wrote line by line? He likely won't even remember everything he wrote.
You think anyone has memorized Linux? Do you know how many lines of code are in the Linux source tree? Are you trolling?
You're projecting a deficiency of the human brain onto computers. Computers have advantages that our brains don't (perfect and large memory); there's no reason to think that we should try to recreate how humans do things.
Why would you bother with all these summaries if you can just read and remember the code perfectly?
Because the context window of the LLM is limited, similar to a human's. That's the entire point of the article. If the LLM has similar limitations to humans, then we give it similar workarounds.
Sure you can say that LLMs have unlimited context, but then what are you doing in this thread? The title on this page is saying that context is a bottleneck.
I've noticed that ChatGPT doesn't seem to be very good at understanding elapsed time. I have some long-running threads, and unless I prompt it with the elapsed time ("it's now 7 days later") the responses act as if it were one second after the last message.
I think this might be a good leap for agents: the ability not just to review a doc in its current state, but to keep the full evolution of the document in context/understanding.
They have no ability to even perceive time, unless the system gives them timestamps for the current interaction and past interactions.
Which seems like a trivial addition if it's not there?
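It can be as small as stamping each message before it goes into the context; a sketch for an OpenAI-style chat loop:

    from datetime import datetime, timezone

    def stamp(message: str) -> str:
        """Prefix a user message with the current UTC time so the model can
        reason about elapsed time between turns."""
        now = datetime.now(timezone.utc).isoformat(timespec="seconds")
        return f"[sent {now}] {message}"

    # messages.append({"role": "user", "content": stamp(user_input)})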
It is, but now you're burning a bit of context on something that might not be necessary, and potentially having the agent focus on time when it's not relevant. Not necessarily a bad idea, but as always, tradeoffs.
I've noticed the same thing with Grok. One time it predicted a X% chance that something would happen by July 31. On August 1, it was still predicting the thing would happen by July 31, just with lower (but non-zero) odds. Their grasp on time is tenuous at best.
This is one cause, but another is that agents are mostly trained on the same sets of problems. There are only so many open source projects that can be used for training (i.e. as benchmarks). There's huge oversampling of a subset of projects like pandas and nothing at all for proprietary datasets. This is a huge problem!
If you want your agent to be really good at working with dates in a functional way, or at dealing with the metric system (as examples), then you need to train on those problems, probably using RFT. The other challenge is that even if you have such a problem set in testable form, running it at scale is hard. Some benchmarks have 20k+ test cases, each of which can take well over an hour to run; run sequentially, that's over 2 years to complete.
Right now the only company I'm aware of that lets you do that at scale is runloop (disclaimer, I work there).
This has been the case for a while. Attempting to code API connections via Vibe-Coding will leave you pulling your hair out if you don't take the time to scrape all relevant documentation and include said documentation in the prompt. This is the case whether it's major APIs like Shopify, or more niche ones like warehousing software (Cin7 or something similar).
The context pipeline is a major problem in other fields as well, not just programming. In healthcare, the next billion-dollar startup will likely be the one that cracks the personal health pipeline, enabling people to chat with GPT-6 PRO while seamlessly bringing their entire lifetime of health context into every conversation.
These are such silly arguments. It sounds like people looking at a graph of a linear function crossing an exponential one at x=2, y=2 and wondering why the curves don't match at x=3, y=40.
"It's not the x value that's the problem, it's the y value."
You're right, it's not "raw intelligence" that's the bottleneck, because there's none of that in there. The truth is no tweak to any parameter is ever going to make the LLM capable of programming. Just like an exponential curve is always going to outgrow a linear one. You can't tweak the parameters out of that fundamental truth.
I agree, and I think intent behind the code is the most important part in missing context. You can sometimes infer intent from code, but usually code is a snapshot of an expression of an evolving intent.
I've started making sure my codebase is "LLM compatible". This means everything has documentation, and the reasons for doing things a certain way and not another are documented in code. Funnily enough, I do this documentation work with LLMs.
E.g. "Refactor this large file into meaningful smaller components where appropriate and add code documentation on what each small component is intended to achieve." The LLM can usually handle this well (with some oversight of course). I also have instructions in the LLM's instructions.md to document each change, and why, in the code.
If the LLM does create a regression, I also ask it to add code documentation to avoid future regressions, "Important: do not do X here as it will break Y", which again seems to help since the LLM will see the warning next time, right there in the portion of code where it matters.
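For example (a made-up snippet, just to show the shape of the warning):

    def reconcile_orders(orders, ledger):
        # IMPORTANT: do not sort `orders` in place here. The downstream export
        # step relies on the original insertion order; sorting caused a regression before.
        return [o for o in orders if o.id not in ledger]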
None of this verbosity in the code itself is harmful to human readers either, which is nice. The end result is a codebase that is much easier for LLMs to work with.
I suspect LLM compatibility may become a metric we measure codebases by as we learn more about how to work with them. Right now LLMs themselves often produce code that is not very LLM-compatible, but with some extra documentation in the code itself they can do much better.
In my opinion human beings also do not have unlimited cognitive context. When a person sits down to modify a codebase, they do not read every file in the codebase. Instead they rely on a combination of working memory and documentation to build the high-level and detailed context required to understand the particular components they are modifying or extending, and they make use of abstraction to simplify the context they need to build. The correct design of a coding LLM would require a similar approach to be effective.
I'm working on a project that has now outgrown the context window of even gpt-5 pro. I use code2prompt, and ChatGPT with Pro will reject the prompt as too large.
I’ve been trying to use shorter variable names. Maybe I should move unit tests into their own file and ignore them? It’s not idiomatic in Rust though and breaks visibility rules for the modules.
What we really need is for the agent to assemble the required context for the problem space. I suspect this is what coding agents will do if they don’t already.
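A crude sketch of that kind of context assembly, picking only files that mention identifiers from the task description and stopping at a budget (everything here is a naive placeholder for what a real agent would do; the .rs glob just matches the Rust project above):

    from pathlib import Path

    def assemble_context(repo: Path, task: str, budget_chars: int = 200_000) -> str:
        """Greedily collect source files that mention words from the task, up to a rough budget."""
        keywords = {w.lower() for w in task.split() if len(w) > 3}
        chunks, used = [], 0
        for path in sorted(repo.rglob("*.rs")):
            text = path.read_text(errors="ignore")
            if any(k in text.lower() for k in keywords) and used + len(text) <= budget_chars:
                chunks.append(f"// file: {path}\n{text}")
                used += len(text)
        return "\n\n".join(chunks)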
I believe if you create something like a task manager for the coding agents, think something hosted on the web like Jira, you can work around this.
I started writing a solution, but to be honest I probably need the help of someone who's more experienced.
Although to be honest, I'm sure someone with VC money is already working on this.
Notably, all of this information would be very helpful if written down as documentation in the first place. Maybe this will encourage people to do that?
If context is the bottleneck, MCP is dead.
MCP can use 10k tokens. Everything good happens in the first 100k tokens.
It's more context-efficient to code a custom binary and prompt the LLM on how to use the binary when needed.
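i.e. instead of paying for a tool schema up front, ship a tiny CLI and spend one line of prompt on it ("when you need to find code, run codesearch <pattern> [--path DIR]"). A sketch of such a binary; everything about it is made up:

    #!/usr/bin/env python3
    # Minimal "codesearch" CLI the agent can shell out to on demand.
    import argparse, pathlib, re, sys

    def main() -> int:
        parser = argparse.ArgumentParser(prog="codesearch")
        parser.add_argument("pattern")
        parser.add_argument("--path", default=".")
        args = parser.parse_args()
        rx = re.compile(args.pattern)
        for f in pathlib.Path(args.path).rglob("*.py"):
            for i, line in enumerate(f.read_text(errors="ignore").splitlines(), 1):
                if rx.search(line):
                    print(f"{f}:{i}:{line.strip()}")
        return 0

    if __name__ == "__main__":
        sys.exit(main())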
It's both context and memory. If an LLM could keep the entire git history in memory, and each of those git commits had enough context, it could take a new feature and understand the context in which it should live by looking up the history of the feature area in its memory.
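The "look up the history of the feature area" half is already cheap to approximate; a sketch:

    import subprocess

    def feature_history(path: str, limit: int = 20) -> str:
        """Recent commit subjects and bodies that touched `path`, ready to drop into context."""
        out = subprocess.run(
            ["git", "log", f"-{limit}", "--date=short", "--format=%h %ad %s%n%b", "--", path],
            capture_output=True, text=True, check=True)
        return out.stdout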
also, we are one prompt away from achieving AGI...
I'm really wondering why so many advertising posts disguised as discourse make it to the front page, and I assume it's a new Silicon Valley trick, because there is no way the HN community values these so much.
Let me tell you, I'm scared of these tools. With Aider I have the most human-in-the-loop setup possible: each AI action is easy to undo, readable, and manageable.
However, even here, most of the time I have the AI write a bulk of code, I regret it later.
Most codebase challenges I have are infrastructural problems, where I need to reduce complexity to be able to safely add new functionality or reduce the likelihood of errors. I'm talking solid, well-named abstractions.
This, in the best case, is not a lot of code. In general I would always rather have less code than more. Well-named abstraction layers with good domain-driven design are my goal.
When I think of switching to an AI-first editor I get physical anxiety, because it feels like it will destroy so many coders by leading to massive frustration.
I still think the best way of using AI is literally just chatting with it about your codebase to make sure you follow good practice.
Here I think the problem with the context is that it lives in the minds of business and dev people; not everything is written down, and even if it were, translating it into something understandable (prompting) will sometimes be more work than building it on the go with a modern IDE and type-safe languages.
You're on a site that exists to advertise job postings from YC companies, and does not stop people from spamming their personal or professional projects/companies, even when they have no activity here other than self promotion. This is an advertising site.
Suppose humans are also neural networks. How have humans evolved to handle complex tasks? We break problems down into modular pieces.
Has anyone tried making coding-agent LoRAs yet, project-specific and/or framework-specific?
I know it isn’t your question exactly, and you probably know this, but the models for coding assist tools are generally fine tunes of models for coding specific purposes. Example: in OpenAI codex they use GPT-5-codex
I think the question is, can I throw a couple thousand bucks of GPU time at fine-tuning a model to have knowledge of our couple million lines of C++ baked into the weights instead of needing to fuck around with "Context Engineering".
Like, how feasible is it for a mid-size corporation to use a technique like LoRA, mentioned by GP, to "teach" (say, for example) Kimi K2 about a large C++ codebase so that individual engineers don't need to learn the black art of "context engineering" and can just ask it questions.
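Mechanically it's not hard to sketch with the peft library; whether baking millions of lines of C++ into adapter weights actually beats retrieval is the open question. Roughly (the base model, hyperparameters, and corpus format are all placeholders):

    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    BASE = "some-open-code-model"   # placeholder for whatever open-weights code model you pick

    tokenizer = AutoTokenizer.from_pretrained(BASE)
    model = get_peft_model(
        AutoModelForCausalLM.from_pretrained(BASE),
        LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                   target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
    )  # only the small adapter matrices are trained

    # corpus.jsonl: one {"text": ...} record per file, or per curated Q&A about the codebase
    data = load_dataset("json", data_files="corpus.jsonl", split="train")
    data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
                    batched=True, remove_columns=["text"])

    Trainer(
        model=model,
        args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=1,
                               num_train_epochs=1, learning_rate=2e-4),
        train_dataset=data,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    ).train()
    model.save_pretrained("lora-out")   # a small adapter you load next to the base model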
I'm curious about it too. I think there are two bottlenecks, one is that training a relatively large LLM can be resource-intensive (so people go for RAGs and other shortcuts), and making it finetuned to your use cases might make it dumber overall.
> making it finetuned to your use cases might make it dumber overall.
LoRA doesn't overwrite weights.
Do you need to overwrite weights to produce the effect I mentioned above?
Good point
I think they fine tune them for tool calling, not knowledge
IME speed is the biggest bottleneck. They simply can't navigate the code base fast enough.
grok-code-fast-1 is quite nice for this actually; it's fast and cheap enough that you don't feel bad throwing entire threads away and trying again.
> Intelligence is rapidly improving with each model release.
Are we still calling it intelligence?
I can feel the ground rumbling as thousands approach to engage in a "name the trait" style debate..
Just a reminder that language is flexible.
The amount of code a human can review is the main bottleneck
Context has been the bottleneck since the beginning
"And yet, coding agents are nowhere near capable of replacing software developers. Why is that?"
Because you will always need a specialist to drive these tools. You need someone who understands the landscape of software - what's possible, what's not possible, how to select and evaluate the right approach to solve a problem, how to turn messy human needs into unambiguous requirements, how to verify that the produced software actually works.
Provided software developers can grow their field of experience to cover QA and aspects of product management - and learn to effectively use this new breed of coding agents - they'll be just fine.
No, it's not. The limitation is believing a human can define how the agent should recall things. Instead, build tools for the agent to store and retrieve context and then give it a tool to refine and use that recall in the way it sees best fits the objective.
Humans gatekeep, especially in the tech industry, and that is exactly what will limit us in improving AI over time. It will only be when we turn its choices over to it that we move beyond all this bullshit.
Context and memory have been a bottleneck from like day one.
Amazing Article
I downloaded the app and it failed at the first screen when I set up the models. I agree with the spirit of the blog post but the execution seems lacking.
LLMs cannot understand anything; they're token prediction functions.
Here's a project I've been working on for the past 2 weeks. Only yesterday did I unify everything entirely while in Cursor's Claude-4-Sonnet-1M MAX mode, and I am pretty astounded by the results. The Cursor usage dashboard tells me many of my prompts are 700k-1M context for around $0.60-$0.90 USD each. It adds up fast, but wow, it's extraordinary.
https://github.com/foolsgoldtoshi-star/foolsgoldtoshi-star-p...
_ _ kae3g