stuxf 2 months ago

> Some coding agents (Shelley included!) refuse to return a large tool output back to the agent after some threshold. This is a mistake: it's going to read the whole file, and it may as well do it in one call rather than five.

Disagree with this: IMO the primary reason that these limits still need to exist is for when the agent messes up (e.g. reads a file that is too large, like a bundle file), or when you run a grep command in a large codebase and end up hitting way too many files, overloading context.

Otherwise lots of interesting stuff in this article! Having a precise calculator was very useful for the idea of how many things we should be putting into an agent loop to get a cost optimum (and not just a performance optimum) for our tasks, which is something that's been pretty underserved.

  • tekacs 2 months ago

    I think that's reasonable, but then they should have the ability for the agent to, on the next call, override it. Even if it requires the agent to have read the file once or something.

    In the absence of that you end up with what several of the harnesses ended up doing, where an agent will use a million tool calls to very slowly read a file in like 200 line chunks. I think they _might_ have fixed it now (or agent-fixes, my agent harness might be fixing it), but Codex used to do this and it made it unbelievably slow.

    • reactordev 2 months ago

      You’re describing peek.

      An agent needs to be able to peek before determining “Can I one shot this or does it need paging?”

      • tekacs 2 months ago

        Yep, I previously implemented it under that name in my own harness. That being said, there is value in actually performing a normal read, because you do often complete it on that first glance.

        • reactordev 2 months ago

          Confession: I too implemented a “smart” read. It's a plain read unless the file is over a size threshold, in which case it's paged, or in a specific format, in which case it returns a summary. However, I also supply `cat`.
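
A minimal sketch of the "smart read" idea discussed in this subthread, assuming a line-count threshold and an explicit override flag. The names `smart_read`, `ReadResult`, `MAX_LINES`, and `force` are hypothetical and not taken from any harness mentioned here; the summary-by-format branch is left out.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

MAX_LINES = 2000  # hypothetical page size; a real harness would tune this


@dataclass
class ReadResult:
    text: str                   # the lines actually returned
    total_lines: int            # size of the whole file, so the agent can plan paging
    next_offset: Optional[int]  # line offset of the next page, or None when done


def smart_read(path: str, offset: int = 0, limit: int = MAX_LINES,
               force: bool = False) -> ReadResult:
    """Return the whole file if it fits in one page, otherwise a single page.

    `force=True` is the override: the agent has seen the size and
    still wants everything in one call.
    """
    lines = Path(path).read_text(encoding="utf-8", errors="replace").splitlines()
    total = len(lines)
    if force:
        return ReadResult("\n".join(lines), total, None)
    page = lines[offset:offset + limit]
    end = offset + len(page)
    return ReadResult("\n".join(page), total, end if end < total else None)
```

Because `total_lines` comes back on the first call, the first read doubles as the peek: a small file is complete immediately, and a large one tells the agent how much paging it is signing up for.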

  • inetknght 2 months ago

    > when you run a grep command in a large codebase and end up hitting way too many files, overloading context.

    On the other hand, I despise that it automatically pipes things through output-limiting things like `grep` with a filter, `head`, `tail`, etc. I would much rather it try to read a full grep and then decide to filter-down from there if the output is too large -- that's exactly what I do when I do the same workflow I told it to do.

    Why? Because piping through output-limiting things can hide the scope of the "problem" I'm looking at. I'd rather see the scope of that first so I can decide if I need to change from a tactical view/approach to a strategic view/approach. It would be handy if the agents could do the same thing -- and I suppose they could if I'm a little more explicit about it in my tool/prompt.
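
One way to get both properties is a search tool that caps what it prints but still counts everything. The sketch below is an illustration only: `search_with_scope` and `SearchResult` are hypothetical names, and it searches an in-memory `{filename: text}` mapping rather than shelling out to `grep`.

```python
import re
from dataclasses import dataclass


@dataclass
class SearchResult:
    total_matches: int  # full scope, counted before any truncation
    files_matched: int  # number of distinct files with at least one hit
    shown: list         # first `cap` matches as (file, line_no, line)
    truncated: bool     # True when more matches exist than were shown


def search_with_scope(files: dict, pattern: str, cap: int = 50) -> SearchResult:
    """Search every file, cap the returned lines, but report the true totals."""
    rx = re.compile(pattern)
    shown, total, hit_files = [], 0, set()
    for name, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                total += 1
                hit_files.add(name)
                if len(shown) < cap:
                    shown.append((name, line_no, line))
    return SearchResult(total, len(hit_files), shown, total > cap)
```

The agent sees "3,000 matches across 400 files, showing 50" instead of a silent `head`, so it can decide whether to narrow the pattern or change approach.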

    • kaibee 2 months ago

      In my experience this is what Claude 4.5 (and 4.6) basically does, depending on why it's grepping in the first place. It'll sample the header, do a line count, etc. This is because the agent can't backtrack mid-'try to read full file'. If you put the 50,000 lines into the context, they are now in the context.

      • jtbayly 2 months ago

        Why can't the LLM/agent edit the context and dump that file if it decides it was dumb to have the whole thing in the context?

        • cyanydeez 2 months ago

          Base model is content. If it reads too much it becomes the content.

          What you want is a harness that continually inserts file portions until a sufficiently bright light bulb goes off.

          When they say agentic AI, IT'S BASICALLY:

          <command><content-chunk-1/></command>

          It's the ugliest, string-mashing, nondeterministic garbage; the bearded masters would facepalm.

      • inetknght 2 months ago

        > If you put the 50,000 lines into the context, they are now in the context.

        And you can't revert back to a previous context, and then add in new context summarizing to something like "the file is too large" with how to filter "there are too many unrelated lines matching '...', so use grep"?

        Using output-limiting stuff first won't tell you if you've limited too much. You should search again after changing something; and if you do search again then you need to remember which page you're on and how many there are. That's a bit more complex in my opinion, and agents don't handle that kind of complexity very well afaik.
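
The "revert and summarize" idea asked about above could look roughly like this on the harness side. It is a sketch under assumptions: the history is a plain list of `{"role": ..., "content": ...}` dicts, and `rewind_tool_result` is a hypothetical helper, not something any of the agents named in this thread is known to expose.

```python
def rewind_tool_result(messages: list, index: int, note: str) -> list:
    """Return a new history that ends at `index`, with that tool result
    replaced by a short note. The original list is left untouched."""
    if messages[index].get("role") != "tool":
        raise ValueError("index must point at a tool result")
    kept = [dict(m) for m in messages[:index]]          # unchanged prefix
    kept.append({**messages[index], "content": note})   # stub instead of 50,000 lines
    return kept
```

Everything before `index` is byte-for-byte the same, so a provider's prefix cache should still be usable up to that point; only the oversized result and whatever followed it are discarded.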

cs702 2 months ago

> By 50,000 tokens, your conversation’s costs are probably being dominated by cache reads.

Yeah, it's a well-known problem. Every AI company is working on ways to deal with it, one way or another, with clever data center design, and/or clever hardware and software engineering, and/or with clever algorithmic improvements, and/or with clever "agentic recursive LLM" workflows. Anything that actually works is treated like a priceless trade secret. Nothing that can put competitors at a disadvantage will get published any time soon.

There are academics who have been working on it too, most notably Tri Dao and Albert Gu, the key people behind FlashAttention and SSMs like Mamba. There are also lots of ideas out there for compressing the KV cache. No idea if any of them work. I also saw this recently on HN: https://news.ycombinator.com/item?id=46886265 . No idea if it works but the authors are credible. Agentic recursive LLMs look most promising to me right now. See https://arxiv.org/abs/2512.24601 for an intro to them.

  • yowlingcat 2 months ago

    What do you think about RLMs? At first blush it looks like sub agents with some sprinkles on top, but people who have become more adept with it seem to show its ability to handle sublinear context scaling behavior very effectively.

    • cs702 2 months ago

      By "agentic recursive LLMs," I mean all the approaches that involve agents recursively calling LLMs, including RLMs. My post in fact links to an RLM paper.

TZubiri 2 months ago

I'm not sure, but I don't think cached reads are priced very accurately. If you define your costs as what you pay when consuming an API endpoint, then the answer will be 50k tokens, sure. But if you consider how much it costs the provider, cached tokens probably carry a far higher margin than input and output inference tokens (whose margin is probably negative).

Most caching is done without hints from the application at this point, but I think some APIs are starting to take hints or explicit controls for keeping the state associated with specific input tokens in memory, so these costs will go down. In essence, you really don't reprocess the input tokens at inference. If you own the hardware, it's quite trivial to infer one output token at a time at no additional cost: if you have 50k input tokens and you generate 1 output token, it's not like you have to "reinfer" the 50k input tokens before you output the second token.

To put it in simple terms, the time it takes to generate the Millionth output token is the same as the first output token.

This is relevant in an application I'm working on where I check the logprobs and don't always choose the most likely token (for example, by implementing a custom logit_bias mechanism client-side), so I have to infer 1 output token at a time. This is not quite possible with most APIs, but if you control the hardware and use (virtually) 0-cost cached tokens, you can do it.

So, bottom line: cached input tokens are naturally almost free (unless you hold them for a long period of time), and the price of cached input in APIs is probably due to the lack of API negotiation over which inputs you want to cache. As APIs and self-hosted solutions evolve, we will likely see the cost of cached inputs drop massively, to almost 0. With efficient application programming, the only accounting should be for output tokens and system prompts. Your output tokens shouldn't be charged again as inputs, at least not more than once.

  • 2001zhaozhao 2 months ago

    Caching might be free, but I think making caching cost nothing at the API level is not a great idea either considering that LLM attention is currently more expensive with more tokens in context.

    Making caching free would price "100000 token cache, 1000 read, 1000 write" the same as "0 token cache, 1000 read, 1000 write", whereas the first one might cost more compute to run. I might be wrong at the scale of the effect here though.
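
To make the comparison concrete, here is a small calculator. The per-million-token prices are invented for illustration and are not any provider's real rates, and I am reading "1000 read, 1000 write" as 1000 fresh input tokens and 1000 output tokens, which is an interpretation.

```python
# Hypothetical prices in dollars per million tokens (assumptions, not real rates).
PRICE_INPUT = 3.00
PRICE_CACHE_READ = 0.30
PRICE_OUTPUT = 15.00


def call_cost(cached: int, fresh_input: int, output: int,
              cache_read_price: float = PRICE_CACHE_READ) -> float:
    """Dollar cost of one API call under a flat per-token price list."""
    total = (cached * cache_read_price
             + fresh_input * PRICE_INPUT
             + output * PRICE_OUTPUT)
    return total / 1_000_000


big_context = call_cost(100_000, 1_000, 1_000)
no_context = call_cost(0, 1_000, 1_000)
free_cache = call_cost(100_000, 1_000, 1_000, cache_read_price=0.0)
```

With these made-up prices the 100,000-token call costs more than twice the empty-context call, while with a zero cache price (`free_cache`) the two are billed identically even though the first has far more context to attend over.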

  • NitpickLawyer 2 months ago

    While some efficiencies could be gained from better client-server negotiation, the cost will never be 0. It isn't 0 even in "lab conditions", so it can't be 0 at scale. There are a few misconceptions in your post.

    > the time it takes to generate the Millionth output token is the same as the first output token.

    This is not true, even if you have the kv cache "hot" in vram. That's just not how transformers work.

    > cached input tokens are almost virtually free naturally

    No, they are not in practice. There are pure engineering considerations here. How do you route, when you evict kv cache, where you evict it to (RAM/nvme), how long you keep it, etc. At the scale of oAI/goog/anthropic these are not easy tasks, and the cost is definitely not 0.

    Think about a normal session. A user might prompt something, wait for the result, re-prompt (you hit "hot" cache) and then go for a coffee. They come back 5 minutes later. You can't keep that in "hot" cache. Now you have to route the next message in that thread to a) a place where you have free "slots"; b) a place that can load the kv cache from "cold" storage and c) a place that has enough "room" to handle a possible max ctx request. These are not easy things to do in practice, at scale.

    Now consider 100k users doing basically this, all day long. This is not free and can't become free.

    • TZubiri 2 months ago

      >This is not true, even if you have the kv cache "hot" in vram. That's just not how transformers work.

      I'm not strong on how transformers work, but this is something that is verifiable empirically, and has nothing to do with how transformers work.

      Use any LLM through an API. Send 1 input token and ask for 10k output tokens. Then send 1 input token (a different one, to avoid the cache) and ask for 20k output tokens. If the cost and time to compute are exactly double, then my theory holds.

      >No, they are not in practice. There are pure engineering considerations here. How do you route, when you evict kv cache, where you evict it to (RAM/nvme), how long you keep it, etc. At the scale of oAI/goog/anthropic these are not easy tasks, and the cost is definitely not 0.

      I was a bit loose in my definition of "virtually free"; here is a more formal statement. GPU compute is orders of magnitude more expensive than RAM, and the costs of caching inputs are tied to RAM, not GPU. To give an example using the most expensive cost component, capital: an H100 costs $25K, while 1GB of RAM costs $10. Therefore the cost component of cached inputs is negligible.

      >Think about a normal session. A user might prompt something, wait for the result, re-prompt (you hit "hot" cache) and then go for a coffee. They come back 5 minutes later. You can't keep that in "hot" cache. Now you have to route the next message in that thread to a) a place where you have free "slots"; b) a place that can load the kv cache from "cold" storage and c) a place that has enough "room" to handle a possible max ctx request. These are not easy things to do in practice, at scale.

      As I said, sure it's not free, but you are talking about negligible costs when compared to the GPU capex. It's interesting to note that the API provider would charge the same no matter if the inference state is cached for 5 minutes, 1ms or 1 hour. So clearly the thing is not optimally priced yet.

      If cached inputs from API calls become your primary cost, then it makes sense to move to an API that charges less for cached inputs (if you haven't already done that), then look into APIs where you can control when and when not to cache and for how long to hold it, and finally, into renting a GPU and self-hosting an open-weights model.

      To give a concrete example, suppose we are building a feature where we want to stop upon hitting an ambiguous output token. Our technical approach is to generate one output token at a time, check the logprobs, and continue if the probability of the top token is >90%; otherwise, halt. If we generate 1M output tokens with an API, we will pay for roughly 1M^2/2 cached input tokens, while if we self-host, the compute time will be almost identical to that of just generating 1M output tokens. Obviously, if we do that with an API it will be almost entirely profit for the API provider; it's just not a use case that has been optimized for. We are in the early days of any type of deeply technical parametrization: everyone is either prompting all the way down or hacking with models directly, and there doesn't seem to be a lot in between.
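
The "roughly 1M^2/2" figure follows from summing what each call re-sends. A short check, assuming every API call generates exactly one token and the whole prior context is billed as cached input (`cached_tokens_billed` is a name made up for this sketch):

```python
def cached_tokens_billed(n_output: int, prompt_tokens: int = 0) -> int:
    """Total cached-input tokens billed when each call generates one token.

    Call k (1-based) re-sends the prompt plus the k-1 tokens generated so far,
    so the total is n*prompt + (0 + 1 + ... + (n-1)) = n*prompt + n*(n-1)/2.
    """
    return n_output * prompt_tokens + n_output * (n_output - 1) // 2


def cached_tokens_billed_slow(n_output: int, prompt_tokens: int = 0) -> int:
    """Same quantity, summed call by call, as a cross-check of the closed form."""
    return sum(prompt_tokens + k - 1 for k in range(1, n_output + 1))


total = cached_tokens_billed(1_000_000)  # 499_999_500_000, about 1M^2 / 2
```

So generating a million tokens this way bills about half a trillion cached-input tokens, which is where the quadratic term comes from regardless of how cheap each cached token is.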

  • mike_hearn 2 months ago

    GPU VRAM has an opportunity cost, so caching is never free. If that RAM is being used to hold KV caches in the hope that they'll be useful in future, but you lose that bet and you never hit that cache, you lost money that could have been used for other purposes.

    • TZubiri 2 months ago

      that cost is proportional to how long the cache is held. Currently the cache is not application controlled, it's like CPU caches.

      If you hit the cache 1ns after it's been held, you get charged the same as if it's held for 5 minutes or 1 hour.

      Also, in terms of LLM APIs, I'm almost certain that the state is offloaded onto RAM and then reloaded onto the GPU memory. If you are renting a GPU, you could keep the inferred state in GPU memory. If you are just holding it for very short periods of time, like my example of generating 1 output token at a time and doing some programmatic logic, then it's currently prohibitively expensive to use an API and you must self-host.
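
For reference, the one-token-at-a-time loop described in this subthread might be sketched as follows. The `step` callable stands in for whatever returns next-token logprobs (a self-hosted model, say); it and `generate_until_ambiguous` are hypothetical, and `fake_step` is a stub for demonstration, not a real model.

```python
import math
from typing import Callable, Dict, List


def generate_until_ambiguous(
    step: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    threshold: float = 0.9,
    max_tokens: int = 100,
) -> List[str]:
    """Append the top token while its probability stays above `threshold`."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        logprobs = step(tokens)  # {token: logprob} for the next position
        top, lp = max(logprobs.items(), key=lambda kv: kv[1])
        if math.exp(lp) <= threshold:
            break                # ambiguous: hand control back to the caller
        tokens.append(top)
    return tokens[len(prompt):]


def fake_step(tokens: List[str]) -> Dict[str, float]:
    """Stub model: confident until the context holds 3 tokens, then a coin flip."""
    if len(tokens) < 3:
        return {"a": math.log(0.95), "b": math.log(0.05)}
    return {"a": math.log(0.5), "b": math.log(0.5)}
```

Against an API, every iteration of that loop is a separate request that re-sends the context; against local hardware with the state kept in GPU memory, it is just the normal decode loop with a check in the middle.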

  • lostmsu 2 months ago

    > To put it in simple terms, the time it takes to generate the Millionth output token is the same as the first output token.

    This is wrong. Current models still use some full attention layers AFAIK, and their computational cost grows linearly (per token) with the token number.
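
A toy cost model makes the disagreement easier to see. It assumes each decode step pays a fixed cost plus an attention cost proportional to the tokens already in context; both constants are invented, and real models mix full-attention layers with other components, so treat the numbers as illustrative only.

```python
def decode_cost(n_tokens: int, fixed: float = 1.0,
                per_context_token: float = 0.001) -> float:
    """Toy cost of generating n_tokens with a warm KV cache.

    Step t costs fixed + per_context_token * (t - 1); summing t = 1..n gives
    n * fixed + per_context_token * n * (n - 1) / 2.
    """
    return n_tokens * fixed + per_context_token * n_tokens * (n_tokens - 1) / 2


# Doubling the output length:
ratio_with_attention = decode_cost(20_000) / decode_cost(10_000)
ratio_without = (decode_cost(20_000, per_context_token=0.0)
                 / decode_cost(10_000, per_context_token=0.0))
```

If per-token cost were flat, doubling the output would exactly double the total (`ratio_without`). With any per-context-token term the ratio lands between 2 and 4, drifting toward 4 as context grows, which is the linear-per-token, quadratic-in-total behaviour described above.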

    • TZubiri 2 months ago

      I have seen exactly one model that charges more for longer contexts:

      https://ai.google.dev/gemini-api/docs/pricing

      Gemini 1M context window

      That said the cost increase isn't very significant, approximately 2x at the longer end of the context window.

      This is in stark contrast with the quadratic phenomenon claimed by the article.

      • lostmsu 2 months ago

        They just do averaging. Imagine a quadratic pricing structure. Who'd want to deal with it?

        • TZubiri 2 months ago

          I guess 1.0001 ^2 is quadratic too, but note how it really only charges you 1.5x for more output tokens. Even if cost were quadratic with output length here, we are talking about a very small difference, nothing like the quadratic cost structure proposed by OP:

          >Pop quiz: at what point in the context length of a coding agent are cached reads costing you half of the next API call? By 50,000 tokens, your conversation’s costs are probably being dominated by cache reads.

          These are two different cost components, and the one you bring up is minor. OP is talking about a cost that, at 1M output tokens, would make the cost 20x per token. You are talking about a cost that, at 1M output tokens, would be 1.5x. Different things.

          The first is an imperfection of the API encapsulation; the latter may be a natural cost phenomenon related to the internals of the state-of-the-art algorithms.

          • lostmsu 2 months ago

            What are you talking about? The cost is quadratic in total conversation length in tokens.

alexhans 2 months ago

Nice article. I think a key part of the conversation is getting people to start thinking in terms of evals [1] and observability but it's been quite tough to combat the hype of "but X magic product just solves what you mentioned as a concern for you".

You'd think cost is an easy talking point to help people care but the starting points for people are so heterogeneous that it's tough to show them they can take control of this measurement themselves.

I say the latter because the article is a point in time and if they didn't have a recurrent observation around this, some aspects may radically change depending on the black box implementations of the integrations they depend on (or even the pricing strategies).

[1] https://ai-evals.io/

vatsachak 2 months ago

The brain trims its context through forgetting details that do not matter

LLMs will have to eventually cross this hurdle before they become our replacements

seyz 2 months ago

128k tokens sounds great until you see the bill

collinwilkins 2 months ago

what i've learned running multi-agent workflows...

>use the expensive models for planning/design and the cheaper models for implementation

>stick with small/tightly scoped requests

>clear the context window often and let the AGENTS.md files control the basics

  • rubicon33 2 months ago

    there’s something of a paradox there. Reduce the context window and work on smaller/tightly scoped requests? Isn’t the whole value proposition that I can work much faster? To do that, I naturally try to describe what I want at a higher, vaguer level.

    • readyforbrunch 2 months ago

      That's where something like openspec and beads come in. You work high level, create a spec and break it down into beads (small tasks). Your main agent then spawns workers that perform a task with limited scope.

0-_-0 2 months ago

The cache gets read at every token generated, not at every turn of the conversation.

  • mzl 2 months ago

    Depends on which cache you mean. The KV Cache gets read on every token generated, but the prompt cache (which is what incurs the cache read cost) is read on conversation starts.

    • 0-_-0 2 months ago

      What's in the prompt cache?

      • bsenftner 2 months ago

        Way too much. This has got to be the most expensive and most lacking in common sense way to make software ever devised.

      • mzl 2 months ago

        The prompt cache caches KV Cache states based on prefixes of previous prompts and conversations. Now, for a particular coding agent conversation, it might be more involved in how caching works (with cache handles and so on), I'm talking about the general case here. This is a way to avoid repeating the same quadratic cost computing over the prompt. Typically, LLM providers have much lower pricing for reading from this cache than computing again.

        Since the prompt cache is (by necessity, this is how LLMs work) keyed on a prefix of a prompt, if you have repeated API calls in some service, there are a lot of savings possible by organizing queries to have less commonly varying things first, and more varying things later. For example, if you included the current date and time as the first data point in your call, then that would force a recomputation every time.
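
The date-and-time example can be illustrated with a few lines that measure how much of two consecutive prompts is shared from the start. Whitespace-split words stand in for tokens here, and `shared_prefix_len`, `build_prompt`, `SYSTEM`, and `QUESTION` are made-up names for the sketch.

```python
def shared_prefix_len(a: list, b: list) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(system: list, timestamp: str, question: list,
                 time_first: bool) -> list:
    """Put the volatile timestamp either first or after the stable system text."""
    if time_first:
        return [timestamp] + system + question
    return system + [timestamp] + question


SYSTEM = "You are a code review assistant . Follow the style guide .".split()
QUESTION = "Review this diff".split()

# Two calls one second apart, with the timestamp first versus later.
first = shared_prefix_len(build_prompt(SYSTEM, "12:00:01", QUESTION, True),
                          build_prompt(SYSTEM, "12:00:02", QUESTION, True))
later = shared_prefix_len(build_prompt(SYSTEM, "12:00:01", QUESTION, False),
                          build_prompt(SYSTEM, "12:00:02", QUESTION, False))
```

With the timestamp first, `first` is 0 and nothing can be reused; with it after the system text, `later` equals `len(SYSTEM)`, so the whole stable part stays cacheable.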

        • lostmsu 2 months ago

          > The prompt cache caches KV Cache states

          Yes. The cache that caches KV cache states is called the KV cache. "Prompt cache" is just an index from string prefixes into the KV cache. It's tiny and has no computational impact. The parent was correct to question you.

          The cost of using it comes from the blend of the fact that you need more compute to calculate later tokens and the fact that you have to keep KV cache entries between requests of the same user somewhere while the system processes requests of other users.

          • mzl 2 months ago

            Saying that it is just an index from string prefixes into the KV Cache misses all the fun, interesting, and complicated parts of it. While technically the size of the prompt-pointers is tiny compared with the data it points into, the massive scale of managing this over all users and requests and routing inside the compute cluster makes it an expensive thing to implement and tune. Also, keeping the prompt cache sufficiently responsive and storing the large KV Caches somewhere costs a lot as well in resources.

            I think that the OpenAI docs are pretty useful for the API-level understanding of how it can work (https://developers.openai.com/api/docs/guides/prompt-caching...). The vLLM docs (https://docs.vllm.ai/en/stable/design/prefix_caching/) and SGLang radix hashing (https://lmsys.org/blog/2024-01-17-sglang/) are useful for insights into how to implement it locally for one compute node.
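
For a feel of what the single-node version involves, here is a toy prefix index that hashes fixed-size token blocks chained to their parent block, loosely in the spirit of the block hashing the vLLM design doc describes as I understand it. It is not vLLM's or SGLang's code: `PrefixIndex`, `block_hashes`, the block size, and the string handles are all assumptions, and eviction, routing, and the KV tensors themselves are omitted.

```python
import hashlib

BLOCK = 4  # tokens per block; hypothetical, real systems typically use larger blocks


def block_hashes(tokens: list) -> list:
    """Hash each full block together with its parent's hash, so that a
    block hash identifies the entire prefix up to and including that block."""
    hashes, parent = [], ""
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        block = ",".join(str(t) for t in tokens[i:i + BLOCK])
        parent = hashlib.sha256((parent + "|" + block).encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixIndex:
    def __init__(self) -> None:
        self._blocks = {}  # block hash -> opaque handle to stored KV data

    def insert(self, tokens: list) -> None:
        """Record every full block of a processed prompt."""
        for h in block_hashes(tokens):
            self._blocks.setdefault(h, "kv-%d" % len(self._blocks))

    def cached_tokens(self, tokens: list) -> int:
        """How many leading tokens of a new prompt already have cached blocks."""
        hit = 0
        for h in block_hashes(tokens):
            if h not in self._blocks:
                break
            hit += BLOCK
        return hit
```

The index itself really is small, as noted upthread; what this sketch leaves out (where the KV data lives, when it is evicted, and which machine a request is routed to) is the part the two sides here disagree about the cost of.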

            • lostmsu 2 months ago

              The implementation details are irrelevant to the discussion of the true cost of running the models.

              • mzl 2 months ago

                The cost of running things like prompt caching is defined by the implementation as that gives the infrastructure costs.

jauntywundrkind 2 months ago

Very awesome to see these numbers, to see this explored so. Nice job exe.dev.

intellirim 2 months ago

In my experience building agent pipelines, the real cost explosion happens at tool call chains: each hop multiplies tokens in ways that are hard to anticipate. Adding structured logging per step helped me identify and cut the worst offenders.
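
Per-step structured logging along these lines can be quite small. A sketch, assuming the harness can report token counts for each step; `StepLog`, `to_jsonl`, and `worst_offenders` are hypothetical names, not part of any particular framework.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StepLog:
    step: int           # position in the tool-call chain
    tool: str           # which tool the agent invoked
    input_tokens: int   # tokens sent to the model for this step
    output_tokens: int  # tokens the model produced for this step


def to_jsonl(logs: list) -> str:
    """Serialize one JSON object per line, easy to grep or load later."""
    return "\n".join(json.dumps(asdict(entry)) for entry in logs)


def worst_offenders(logs: list, top: int = 3) -> list:
    """Total tokens per tool, largest first."""
    totals = {}
    for entry in logs:
        used = entry.input_tokens + entry.output_tokens
        totals[entry.tool] = totals.get(entry.tool, 0) + used
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Sorting by total tokens per tool is the crudest possible view, but it is usually enough to show which hop in a chain is inflating the context.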