>This Monday, I moderated a panel in San Francisco with engineers and ML leads from Uber, WisdomAI, EvenUp, and Datastrato. The event, Beyond the Prompt, drew 600+ registrants, mostly founders, engineers, and early AI product builders.
>We weren’t there to rehash prompt engineering tips.
>We talked about context engineering, inference stack design, and what it takes to scale agentic systems inside enterprise environments. If “prompting” is the tip of the iceberg, this panel dove into the cold, complex mass underneath: context selection, semantic layers, memory orchestration, governance, and multi-model routing.
I bet those four people love that the moderator took a couple notes and then asked ChatGPT to write a blog post.
As always, the number one tell of LLM output, besides the tone, is that by default it will never include links in the body of the post.
Yeah, “here’s the reality check:”, “not because they’re flashy, but because they’re blah blah”.
Why can’t anyone be bothered anymore to write actual content, especially when writing about AI, where your whole audience is probably already exposed to these patterns in content day in, day out?
It comes off as so cheap.
Yeah, it bugs me. We've got enough examples in this article to make a Cards Against Humanity: ChatGPT edition.
> One panelist shared a personal story that crystallized the challenge: his wife refuses to let him use Tesla’s autopilot. Why? Not because it doesn’t work, but because she doesn’t trust it.
> Trust isn’t about raw capability, it’s about consistent, explainable, auditable behavior.
> One panelist described asking ChatGPT for family movie recommendations, only to have it respond with suggestions tailored to his children by name, Claire and Brandon. His reaction? “I don’t like this answer. Why do you know my son and my girl so much? Don’t touch my privacy.”
It comes off as someone who lives their life according to quantity, not quality.
The real insight: have some fucking pride in what you make, be it a blog post, or a piece of software.
> The real insight: have some fucking pride in what you make, be it a blog post, or a piece of software.
The businessmen's job will be complete when they've totally eliminated all pride from work.
> Why can’t anyone be bothered anymore to write actual content
The way I see it is that the majority of people never bothered to write actual content. Now there’s a tool the non-writers can use to write dubious content.
I would wager this tool is being used very differently by actual writers focused on producing quality. There are just far fewer of them, the same way there are fewer of any kind of specialist.
The real question with AI, to me, is whether the output will remain consistently better when the tool is wielded by a specialist who has invested their time in whatever it is they are producing. If that ever changes, then we are doomed. When it's no longer slop…
> the number one tell of LLM output, besides the tone, is that by default it will never include links in the body of the post.
This isn't true. I've been using Gemini 2.5 a lot recently and I can't get it to stop adding links!
I added custom instructions: Do not include links in your output. At the start of every reply say "I have not added any links as requested".
It works for the first couple of responses but then it's back to loads of links again.
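For reference, the same constraint expressed as a system instruction through the API looks roughly like this; a sketch assuming the google-generativeai Python SDK, with the API key and model name as placeholders:

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")  # placeholder

    # The system instruction plays the role of the custom instructions above.
    model = genai.GenerativeModel(
        model_name="gemini-2.5-pro",  # placeholder; whichever Gemini model you use
        system_instruction=(
            "Do not include links in your output. "
            'At the start of every reply say "I have not added any links as requested".'
        ),
    )

    chat = model.start_chat()
    reply = chat.send_message("Give me a reading list on context engineering.")
    print(reply.text)  # per the experience above: holds for a couple of turns, then the links come back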
I started to suspect a few paragraphs in that this post was written with a lot of AI assistance, but I continued to read to the end because the content was interesting to me. Here's one point that resonated in particular:
"There’s a missing primitive here: a secure, portable memory layer that works across apps, usable by the user, not locked inside the provider. No one’s nailed it yet. One panelist said if he weren’t building his current startup, this would be his next one."
Isn’t that markdown files?
It's funny that what the author identifies as "the reality check":

> Here’s the reality check: One panelist mentioned that 95% of AI agent deployments fail in production. Not because the models aren’t smart enough, but because the scaffolding around them, context engineering, security, memory design, isn’t there yet.

Could be a reasonable definition of "understanding the problem to solve."

In other words, everything identified as what "the scaffolding" needs is what qualified people provide when delivering solutions to problems people want solved.
They fail because the “scaffolding” amounts to building the complicated expert system that AI promised one would not have to build.
If I implement a strict parser and an output post-processor myself to guard against hallucinations, I have done 100% of the business-related logic. I can skip the LLM in the middle altogether.
> If I implement a strict parser and an output post-processor myself to guard against hallucinations, I have done 100% of the business-related logic. I can skip the LLM in the middle altogether.
Well said and I could not agree more.
It is really just BS. This is just basic DSA stuff. We deployed a real-world solution by doing all of that on our side. It's not magic. It's engineering.
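To make that concrete, here is roughly what such "scaffolding" looks like, with made-up action names and fields: a strict parser plus a post-processor that only lets allowlisted, schema-valid actions through.

    import json

    # Hypothetical allowlist of actions the business logic actually supports.
    ALLOWED_ACTIONS = {"refund_order", "resend_invoice", "escalate_to_human"}

    def parse_llm_action(raw: str) -> dict:
        """Strict parser: reject anything that isn't exactly the JSON we expect."""
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ValueError(f"not valid JSON: {exc}") from exc
        if set(data) != {"action", "order_id", "reason"}:
            raise ValueError(f"unexpected fields: {sorted(data)}")
        if data["action"] not in ALLOWED_ACTIONS:
            raise ValueError(f"hallucinated action: {data['action']!r}")
        if not isinstance(data["order_id"], str) or not data["order_id"].startswith("ORD-"):
            raise ValueError(f"suspicious order id: {data['order_id']!r}")
        return data

    def post_process(action: dict, known_order_ids: set) -> dict:
        """Post-processor: only act on orders we can verify actually exist."""
        if action["order_id"] not in known_order_ids:
            raise ValueError(f"unknown order: {action['order_id']}")
        return action

Which is the point: once the allowlist, the schema, and the verification exist, the branching logic the LLM was supposed to provide has largely been written by hand.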
> One team suggested that instead of text-to-SQL, we should build semantic business logic layers, “show me Q4 revenue” should map to a verified calculation, not raw SQL generation.
Okay, how would that work though? Verified by who and calculated by what?
I need deets.
A simple way is perhaps to implement a text-to-metrics system where the metrics are defined as SQL functions.
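A rough sketch of that, with made-up metric names: the model's only job is to pick a metric and its parameters from a registry, while the SQL behind each metric is written and reviewed up front by whoever owns the semantic layer.

    from string import Template

    # Hypothetical registry of verified metrics; each entry is reviewed SQL, not generated SQL.
    METRICS = {
        "revenue": Template(
            "SELECT SUM(amount) FROM invoices "
            "WHERE fiscal_quarter = $quarter AND fiscal_year = $year"
        ),
        "active_users": Template(
            "SELECT COUNT(DISTINCT user_id) FROM sessions "
            "WHERE fiscal_quarter = $quarter AND fiscal_year = $year"
        ),
    }

    def resolve(metric: str, quarter: int, year: int) -> str:
        """The LLM emits (metric, quarter, year); everything after that is deterministic."""
        if metric not in METRICS:
            raise KeyError(f"unknown metric: {metric!r}")
        if quarter not in (1, 2, 3, 4):
            raise ValueError(f"invalid quarter: {quarter}")
        return METRICS[metric].substitute(quarter=quarter, year=year)

    # "Show me Q4 revenue" -> the model outputs {"metric": "revenue", "quarter": 4, "year": 2024}
    print(resolve("revenue", quarter=4, year=2024))

On this reading, "verified by who" has a boring answer: by whoever reviewed the registry entry, the same way any other piece of reporting SQL gets reviewed.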
Wow, half of this article deeply resonates with what I am working on.
Text-to-SQL is the funniest example. It seems to be the "hello world" of agentic use in enterprise environments. It looks so easy, so clear, so straight-forward. But just because the concept is easy to grasp (LLMs are great at generating markup or code, so let's have them translate natural language to SQL) doesn't mean it is easy to get right.
I have spent the past 3 months building a solution that actually bridges the stochastic nature of AI agents and the need for deterministic queries. And boy oh boy is that rabbit hole deep.
> building a solution that actually bridges the stochastic nature of AI agents and the need for deterministic queries
Wait but this just sounds unhinged, why oh why
Text2SQL was 75% on bird-bench 6 months ago. Now it's 80%. Humans are still at 90+%. We're not quite there yet. I suspect text-to-sql needs a lot of intermediate state and composition of abstractions, which vanilla attention is not great at.
https://bird-bench.github.io/
Text-to-SQL is solved by having good UX and a reasonable team that's in touch with the customers' needs.
A user having to come up with novel queries all the time to warrant text-to-SQL is a failure of product design.
This is exactly it. AI is sniffing out the good data models from the bad. Easy to understand? AI can understand it too! Complex business mess with endless technical debt? Not so much.
But this is precisely why we're seeing startups build insane things fast while well established companies are still questioning if it's even worth it or not.
At some point, say 5 years from now, someone will revisit their AI-powered production workloads and ask, "how can we optimize this by falling back to a non-AI workload?" Where does that leave AI companies when the obvious choice is to do away with their services once their customers reach a threshold?
"Huh, turns out we could replace it all with a 4 line Perl script doing linear regression."
“How I used ancient programming techniques to save the company $100k/year in token costs”
So I have read the MIT paper, and the methodology as well as the conclusions are just something else.
For example, the number comes from perceived successes and failures, not actual measurements. The customer conclusions are along the lines of "it doesn't improve" or "it doesn't remember": literally buying into the hype of recursive self-improvement, and completely oblivious to the fact that API users don't control model weights and so can't do much self-improvement beyond writing more CRUD layers. The other complaints are about integrations, which are totally valid, but some of these industries still run Windows-whatever with no API platforms, so that isn't going away in those cases.
Point being, if the paper itself is not good discourse, just well-marketed punditry, why should we argue over the 5% number? It makes no sense.
> Here’s the reality check: One panelist mentioned that 95% of AI agent deployments fail in production. Not because the models aren’t smart enough, but because the scaffolding around them, context engineering, security, memory design, isn’t there yet.
It's a big pet peeve of mine when an author states an opinion, with no evidence, as some kind of axiom. I think there is plenty of evidence that "the models aren't smart enough". Or to put it more accurately, it's an incredibly difficult problem to get a big productivity gain out of an automated system that is blatantly wrong ~1% of the time, when those wrong answers are designed to look as much like right answers as possible.
A non-open-ended path collapses into a decision tree. It's very hard to think of customer support use cases that do not collapse into decision trees. Most prompt engineering on the SaaS side results in very long prompts that re-invent decision trees and protect against edge cases. Ultimately the AI makes a “decision function call” which hits a decision tree. An LLM is a very poor replacement for a decision tree.
I use LLM every day of my life to make myself highly productive. But I do not use LLM tools to replace my decision trees.
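Roughly what that ends up looking like, with hypothetical names: the model is only asked to classify the request, and everything after the "decision function call" is an ordinary hand-written branch.

    # Hypothetical support flow: the LLM classifies the message, a plain decision tree decides.
    def classify_with_llm(message: str) -> str:
        # Stand-in for the LLM call; imagine an API request that returns exactly one label.
        return "refund_request"

    def handle(message: str, order_shipped: bool, days_since_purchase: int) -> str:
        intent = classify_with_llm(message)
        if intent == "refund_request":          # the tree the long prompt was re-inventing
            if order_shipped:
                return "start_return_flow"
            if days_since_purchase <= 30:
                return "issue_refund"
            return "escalate_to_human"
        if intent == "shipping_question":
            return "send_tracking_link"
        return "escalate_to_human"

    print(handle("Where is my refund?", order_shipped=False, days_since_purchase=12))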
It just occurred to me that with those massive system files people use we’re basically reinventing expert systems of the past. Time is a flat circle, I suppose.
A decision tree is simply a model where you follow branches and make a decision at each point. Like...
If we had tech support for a toaster, you might see:

    if toaster toasts the bread:
      if no: has turning it off and on again worked?
        if yes: great! you found a solution
        if no: hmm, try ...
      if yes:
        is the bread burnt after?
          if no: sounds like your toaster is fine!
          if yes: have you tried adjusting the darkness knob?
            if no: ship it in for repair
            if yes: try replacing the timer. does that help?
              if no: ship it in for repair
              if yes: yay, your toaster is fixed
Any chance you can ELI5 this to me?
Just search "expert system"
It's interesting because my management philosophy when delegating work has been to always start by telling people what my intent is, so that they don't get too caught up in a specific approach. Many problems require out-of-the-box thinking. This is really about providing context. Context engineering is basically a management skill.
Without context, even the brightest people will not be able to fill in the gaps in your requirements. Context is not just nice-to-have, it's a necessity when dealing with both humans and machines.
I suspect that people who are good engineering managers will also be good at 'vibe coding'.
It's weird that this makes the front page and Meta's code world model never did.
First I've heard of it:
https://ai.meta.com/research/publications/cwm-an-open-weight...
HN front page dynamics are heavily driven by having readers of /new who are stans for your content.
Base models are the seed, fine tuning is the genetically modified seed. Context is the fertiliser.
Agents are the oxen pulling the plow through the seasons... turning over ground, following furrows, adapting to terrain. RAG is the irrigation system. Prompts are the farmer's instructions. And the harvest? That depends on how well you understood what you were trying to grow.