Well now they could use GPT-OSS, but it wasn't out when they began the study.
I've recently been taking a look at another paper, from 2023, and subsequent research. It reaches a morally similar finding, though it's not focused on "reasoning traces", and it's based on GPT-4:
It's interesting that there's still such a market for this sort of take.
> In a recent pre-print paper, researchers from the University of Arizona summarize this existing work as "suggest[ing] that LLMs are not principled reasoners but rather sophisticated simulators of reasoning-like text."
What does this even mean? Let's veto the word "reasoning" here and reflect.
The LLM produces a series of outputs. Each output changes the likelihood of the next output. So it's transitioning in a very large state space.
Assume there exists some states that the activations could be in that would cause the correct output to be generated. Assume also that there is some possible path of text connecting the original input to such a success state.
The reinforcement learning objective reinforces pathways that were successful during training. If there's some intermediate calculation to do or 'inference' that could be drawn, writing out a new text that makes that explicit might be a useful step. The reinforcement learning objective is supposed to encourage the model to learn such patterns.
So what does "sophisticated simulators of reasoning-like text" even mean here? The mechanism that the model uses to transition towards the answer is to generate intermediate text. What's the complaint here?
It makes the same sort of sense to talk about the model "reasoning" as it does to talk about AlphaZero "valuing material" or "fighting for the center". These are shorthands for describing patterns of behaviour, but of course the model doesn't "value" anything in a strictly human way. The chess engine usually doesn't see a full line to victory, but in the games it's played, paths which transition through states with material advantage are often good -- although it depends on other factors.
So of course the chain-of-thought transition process is brittle, and it's brittle in ways that don't match human mistakes. What does it prove that there are counter-examples with irrelevant text interposed that cause the model to produce the wrong output? It shows nothing --- it's a probabilistic process. Of course some different inputs lead to different paths being taken, which may be less successful.
> The mechanism that the model uses to transition towards the answer is to generate intermediate text.
Yes, which makes sense, because if there's a landscape of states that the model is traversing, and there are probabilistically likely pathways between an initial state and the desired output, but there isn't a direct pathway, then training the model to generate intermediate text in order to move across that landscape so it can reach the desired output state is a good idea.
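Concretely, the decoding loop looks something like this (a toy sketch of mine with a stand-in for the model, not any real API): every intermediate token gets appended to the context and shifts the distribution over everything that follows, which is exactly why emitting "reasoning" text can move the model toward states it couldn't reach in one hop:

```python
import numpy as np

VOCAB = ["so", "therefore", "x", "=", "2", "4", "+", "<eos>"]

def next_token_logits(context):
    # Stand-in for a transformer forward pass: the logits are a function of the
    # *entire* context, so every intermediate token the model has emitted so far
    # changes the distribution over the next one.
    rng = np.random.default_rng(abs(hash(tuple(context))) % (2**32))
    return rng.normal(size=len(VOCAB))

def generate(prompt, max_tokens=20, seed=0):
    rng = np.random.default_rng(seed)
    context = list(prompt)
    for _ in range(max_tokens):
        logits = next_token_logits(context)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        token = VOCAB[rng.choice(len(VOCAB), p=probs)]
        context.append(token)  # the "thought" becomes part of the state
        if token == "<eos>":
            break
    return context

print(generate(["what", "is", "2", "+", "2", "?"]))
```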
Presumably LLM companies are aware that there is (in general) no relationship between the generated intermediate text and the output, and the point of the article is that by calling it a "chain of thought" rather than "essentially-meaningless intermediate text which increases the number of potential states the model can reach" users are misled into thinking that the model is reasoning, and may then make unwarranted assumptions, such as that the model could in general apply the same reasoning to similar problems, which is in general not true.
So, you agree with the point that they’re making and you’re mad about it? It’s important to state that the models aren’t doing real reasoning because they are being marketed and sold as if they are.
As for your question: ‘So what does "sophisticated simulators of reasoning-like text" even mean here?’
It means CoT interstitial “reasoning” steps produce text that looks like reasoning, but is just a rough approximation, given that the reasoning often doesn’t line up with the conclusion, or the priors, or reality.
For example - at minimum, reasoning should match what actually happened. This is not even a complete set of criteria for reasoning, but at least a minimal baseline. Currently LLM programs are generating BS in the "reasoning" part of the output. For example, ask the LLM program to "reason" about how it produces a sum of two numbers and you will see that it doesn't match at all with what the LLM program did in the background. The "reasoning" it outputs is simply an extract of the reasoning which humans did in the LLM dataset. Even Anthropic officially admits this. If you ask a program how to do maintenance on a gearbox and it replies with a very well articulated and correct (important!) guide to harvesting wheat, then we can't call it reasoning of any kind, even though the wheat-farming guide was correct and logical.
The reality is obvious. The only way not to see it when looking at research like this is to not want to see it. The idea that this critique is somehow more confusing than the use of the word "reasoning" itself is farcical.
LLMs are cool and some of the things they can do now are useful, even surprising. But when it comes to AI, business leaders are talking their books and many people are swept up by that breathless talk and their own misleading intuitions, frequently parroted by the media.
The "but human reasoning is also flawed, so I can't possibly understand what you mean!" objection cannot be sustained in good faith short of delusion.
Total AI capex in the past 6 months contributed more to US GDP growth than consumer spending did
Or
AGI is coming
Or
AI Agents will be able to do most white collar work
——
The paper is addressing parts of the conversation and expectations of AI that are in the HYPE quadrant. There’s money riding on the idea that AI is going to begin to reason reliably. That it will work as a ghost in the machine.
This is why research like this is important and needs to keep being published.
What we have seen the last few years is a conscious marketing effort to rebrand everything ML as AI and to use terms like "Reasoning", "Extended Thinking" and others that for many non technical people give the impression that it is doing far more than it is actually doing.
Many of us here can see this research and be like... well yeah, we already knew this. But there is a very well funded effort to oversell what these systems can actually do, and that is reaching the people that ultimately make the decisions at companies.
So the question is no longer whether AI Agents will be able to do most white collar work. They can probably fake it well enough to accomplish a few tasks, and management will see that. But will the output actually be valuable long term, versus just short-term gains?
Most people weren’t happy when the 2008 crash happened, and bank bailouts were needed, and a global recession ensued.
Most people here are going to use a coding agent, be happy about it (like you), and go on their merry way.
Most people here are not making near trillion dollar bets on the world changing power of AI.
EVERYONE here will be affected by those bets. It's one thing if those bets pay off as long as future subscription growth matches targets. It's an entirely different thing if those bets require "reasoning" to pan out.
The scary thing about ML isn’t that it’s poised to eat a lot of lower-reasoning tasks, it’s that we’re going to find ourselves in a landscape of “that’s just what the AI said to do” kind of excuses for all kinds of bad behavior, and we’re completely unwilling to explore what biases are encoded in the models we’re producing. It’s like how Facebook abdicates responsibility for how users feel because it’s just the product of an algorithm. And if I were a betting person I’d bet all this stuff is going to be used for making rental determinations and for deciding who gets exceptions to overdraft fees well before it’s used for anything else. It’s an enabling technology for all kinds of inhumanity.
Are you sure you are not comparing to human unreason?
Most of what humans think of as reason is actually "will to power". The capability to use our faculties in a way that produces logical conclusions seems like an evolutionary accident, an off-label use of the brain's machinery for complex social interaction. Most people never learn to catch themselves doing the former when they intended to engage in the latter, some don't know the difference. Fortunately, the latter provides a means of self-correction, and the research here hopes to elucidate whether an LLM based reasoning system has the same property.
In other words, given consistent application of reason I would expect a human to eventually draw logically correct conclusions, decline to answer, rephrase the question, etc. But with an LLM, should I expect a non-deterministic infinite walk through plausible nonsense? I expect reasoning to converge.
It's not clear what LLMs are good at, and there's great interest in finding out. This is made harder by the frenetic pace of development (GPT 2 came out in 2019). Not surprising at all that there's research into how LLMs fail and why.
Even for someone who kinda understands how the models are trained, it's surprising to me that they struggle when the symbols change. One thing computers are traditionally very good at is symbolic logic. Graph bijection. Stuff like that. So it's worrisome when they fail at it. Even in this research model which is much, much smaller than current or even older models.
Not sure why everyone is downvoting you as I think you raise a good point - these anthropomorphic words like "reasoning" are useful as shorthands for describing patterns of behaviour, and are generally not meant to be direct comparisons to human cognition. But it goes both ways. You can still criticise the model on the grounds that what we call "reasoning" in the context of LLMs doesn't match the patterns we associate with human "reasoning" very well (such as ability to generalise to novel situations), which is what I think the authors are doing.
Sure, two things can be true. Personally I completely ignore anything Sam Altman (or other AI company CEOs/marketing teams for that matter) says about LLMs.
If you read the comments of AI articles on Arstechnica, you will find that they seem to have become the tech bastion of anti-AI. I'm not sure how it happened, but it seems they found or fell into a strong anti-AI niche, and now feed it.
You cannot even see the comments of people who pointed out the flaws in the study, since they are so heavily downvoted.
> ... that these "reasoning" models can often produce incoherent, logically unsound answers when questions include irrelevant clauses or deviate even slightly from common templates found in their training data.
I have encountered this problem numerous times, now. It really makes me believe that the models do not really understand the topic, even the basics but just try to predict the text.
One recent example was me asking the model to fix my docker-compose file. In it, there's the `network: host` for the `build` part. The model kept assuming that the container would be running with the host network and kept asking me to remove it as a way to fix my issue, even though it wouldn't do anything for the container that is running, because the container runs on the `custom_net` network only. The model was obsessed with it and kept telling me to remove it until I explicitly told it that this is not, and cannot be, the issue.
> It really makes me believe that the models do not really understand the topic, even the basics but just try to predict the text.
This is correct. There is no understanding, there aren't even concepts. It's just math, it's what we've been doing with words in computers for decades, just faster and faster. They're super useful in some areas, but they're not smart, they don't think.
I’ve never seen so much misinformation trotted out by the laity as I have with LLMs. It’s like I’m in a 19th century forum with people earnestly arguing that cameras can steal your soul. These people haven’t a clue of the mechanism.
This assessment fits with my anecdotal evidence. LLMs just cannot reason in any basic way.
LLMs have a large knowledge base that can be spit out at a moment's notice. But they have zero insight into its contents, even when the information has just been asked about a few lines before.
Most of the "intelligence" that LLMs show is just the ability to ask, in the correct way, the correct questions mirrored back to the user. That is why there is so much advice on how to do "proper prompting".
That and the fact that most questions have already been asked before, as anyone that spent some time on StackOverflow back in the day realized. And memory, not reasoning, is what is needed to answer them.
Please don't tell me you were one of those marking every SO question as a duplicate, more often than not missing the entire nuance in the question that made it not a duplicate at all, and leaving the answers to the so-called previously asked question utterly unusable?
This was one of those infuriating things that drove so many away from SO, and made them jump ship the second there was an alternative.
I'm not sure why duplicates were ever considered an issue. For certain subjects (like JS) things evolved so quickly during the height of SO that even a year old answer was outdated.
That and search engines seemed to promote more recent content.. so an old answer sank under the ocean of blog spam
But the answer has not become incorrect. It is still correct for that question in that specific context. More likely, the 'canonicalization process' was overly coarse (for SEO?), inconsistent and confused.
Sounds like they optimised for a select 1% class of self appointed gatekeepers rather than the broad user base. Classic mistake of nearly every defunct social site.
I was "playing" the gamification part of StackOverflow. I wanted to ask a good question for points. But it was very difficult because any meaningful question had already been asked. It was way easier to find questions to answer.
Every time I ask people for an example of this, and get one, I agree with the duplicate determination. Sometimes it requires a little skimming of the canonical answers past just the #1 accepted one; sometimes there's a heavily upvoted clarification in a top comment, but it's usually pretty reasonable.
>This assessment fits with my anecdotal evidence. LLMs just cannot reason in any basic way.
Agreed completely, and the sentiment seems to be spreading at an ever-increasing rate. I wonder how long it will be before the bubble collapses. I was thinking maybe as long as a few years, but it might be far sooner at this rate. All it will take is one of the large AI companies coming out and publicly stating that they're no longer making meaningful gains or some other way that shows the public what's really going on behind the curtain.
I'm certain the AI hype bubble will be studied for generations as the greatest mass delusion in history (so far).
I've used LLMs to generate code for a custom serverless framework which I wrote from scratch that it had never seen before. The framework follows some industry conventions but applied in a distinct way with some distinct features which I have not yet encountered in any other framework...
I'm willing to accept that maybe LLMs cannot invent entirely new concepts but I know for a fact that they can synthesize and merge different unfamiliar concepts in complex logical ways to deliver new capabilities. This is valuable on its own.
I think that's the point, really: It's a reliable and reproducible weakness, but also one where the model can be trained to elicit impressive-looking "reasoning" about what the problem is and how it "plans" to overcome it.
Then when it fails to apply the "reasoning", that's evidence the artificial expertise we humans perceived or inferred is actually some kind of illusion.
Kind of like a Chinese Room scenario: If the other end appears to talk about algebra perfectly well, but just can't do it, that's evidence you might be talking to a language-lookup machine instead of one that can reason.
Right, but if you're saying that something is 'incapable of reasoning' because of a failure mode also found in humans, then either humans are 'incapable of reasoning' or you concede that failure mode isn't a justification for that gross assertion. You can't have it both ways.
> Then when it fails to apply the "reasoning", that's evidence the artificial expertise we humans perceived or inferred is actually some kind of illusion.
That doesn't follow if the weakness of the model manifests on a different level, one we wouldn't count as rationality in a human.
For example, a human might have dyslexia, a disorder on the perceptive level. A dyslexic can understand and explain his own limitation, but that doesn't help him overcome it.
I think you're conflating two separate issues: One is the original known impairment that we don't actually care much about, and the other is bullshitting about how the first problem is under-control.
Suppose a real person outlines a viable plan to work-around their dyslexia, and we watch them not do any of it during the test, and they turn in wrong results while describing the workaround they (didn't) follow. This keeps happening over and over.
In that case, we'd probably conclude they have another problem that isn't dyslexia, such as "parroting something they read somewhere and don't really understand."
Typically when a human has a disorder or limitation they adapt to it by developing coping strategies or making use of tools and environmental changes to compensate. Maybe they expect a true reasoning model to be able to do the same thing?
The argument is that letter-level information is something LLMs don't have a chance to see.
It's a bit like asking human to read text and guess gender or emotional state of the author who wrote it. You just don't have this information.
Similarly you could ask why ":) is smiling and :D is happy" where the question will be seen as "[50372, 382, 62529, 326, 712, 35, 382, 7150]" - the encoding loses this information; it's only visible in an image rendering of this text.
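You can check this yourself with a BPE tokenizer (a quick sketch using the tiktoken library; the exact IDs depend on which encoding you load, so they won't necessarily match the ones above):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # other encodings give different IDs

text = ":) is smiling and :D is happy"
ids = enc.encode(text)
print(ids)                              # a short list of integers, not characters
print([enc.decode([i]) for i in ids])   # each ID covers a chunk of characters

# The model only ever sees the integer IDs, so "how many Rs in strawberry" style
# questions ask about structure inside tokens that it never observes directly.
print(enc.encode("strawberry"))         # a handful of tokens, not 10 separate letters
```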
The point is that if the model were really "reasoning", it would fail differently. Instead, what happens is consistent with it BSing on a textual level.
I have a real world problem I gave o1 when it came out and it got it quite wrong. It's a scheduling problem with 4 different constraints that vary each day, and success criteria that need to be fulfilled over the whole week.
GPT-5 Thinking (Think Longer) and Opus 4.1 Extended Thinking both get it right.
Maybe this unique problem is somehow a part of synthetic training data? Or maybe it's not and the paper is wrong? Either way, we have models that are much more capable at solving unique problems today.
Models today also have access to certain tooling, or have been reinforced to use that tooling in complicated situations, i.e. questions about counting letters in a word are answered by running Python code in the background.
“the researchers created a carefully controlled LLM environment in an attempt to measure just how well chain-of-thought reasoning works when presented with "out of domain" logical problems that don't match the specific logical patterns found in their training data.”
To see if LLMs adhere to logic, or whether the observed "logical" responses are rather a reproduction of patterns.
I personally enjoy this idea of isolating "logic" from "pattern" and seeing if "logic" will manifest in LLM "thinking" in a "non-patternized" domain.
--
Also, it's never bad to give proof to the public that "thinking" (like "intelligence") in the AI context isn't the same thing we think of intuitively.
--
> If it’s out of domain we know it’ll fail.
Below is a question which is out of domain. Yet LLMs handle the reply in what appears to be a logical way.
```
Kookers are blight. And shmakers are sin. If peker is blight and sin who is he?
```
It is out of domain and it does not fail (I've put it through thinking Gemini 2.5). Now back to the article. Is the observed logic intrinsic to LLMs, or is it an elaborate form of a pattern? According to the article, it's a pattern.
I don't think we know that it'll fail, or at least that is not universally accepted as true. Rather, there are claims that given a large enough model / context window, such capabilities emerge. I think skepticism of that claim is warranted. This research validates that skepticism, at least for certain parameters (model family/size, context size, etc).
There's a question which was rhetorically asked by Yaser S. Abu-Mostafa: "How do we know if we're learning from data?" and his answer was: "We are learning from data if we can generalize from our training set to our problem set."
To me, it feels a lot like Deming's "what gets measured gets done" (with the quiet part "...oftentimes at the expense of everything else."). Of course, the quiet part is different in this case.
What is this "domain" of which you speak? Because LLMs are supposedly good for flying airplanes, mental health, snakebites, and mushroom poisoning.
We're rapidly reaching the trough of disillusionment with LLMs, and other generative transformer models for that matter. I am happy because it will help a lot of misinformed people understand what is and isn't possible (100+% productivity gains are not).
It is possible to say the same about low-code solutions, e.g. a perfect UI could be used instead of writing a single line of code. The problem is that creating such a system is too resource intensive and counterproductive, and such a system does not exist. Similarly, coding always has some problem that cannot be generalised because the pattern doesn't exist in the training data, and creating such a pattern defeats the goal of having such a system.
You're talking about net gains in "coding tasks" productivity, I'm talking in productivity gain across the board.
My company deals with an insane amount of customers who use chatgpt to pre-debug their problems before coming to our support. Once they contact our support they regurgitate llm generated BS to our support engineers thinking they're going to speed up the process, but the only thing they're doing is generating noise that slows everyone down because chatgpt has absolutely no clue about our product and keeps sending them on wild goose chases. Sometimes they even lie pretending "a colleague" steered them in this or that direction while it's 100% obvious the whole thing was hallucinated, and even written, by an llm.
I can't tell you how frustrating it is to read a 10 min long customer email just to realise it's just an llm hallucinating probable causes for a bug that takes 2 sentences to describe.
I agree with that idea. For more business development areas, AI slop can slow things down.
I do think that these kinks will eventually work themselves out and actually increase productivity in these areas. People also need to learn that it is not acceptable to just generate some BS and send it to your boss or colleague. That just transfers the real work of understanding the generated content to someone else.
I would disagree. I would argue that if you aren't seeing gains in your productivity, you're either using the tools incorrectly, or you are in some ultra specific niche area of coding that AI isn't helpful on yet.
The math in the original paper is questionable. By leaving free the choice of divergence in Eq 3, Eq 4 has no practical value except when said divergence is zero exactly.
You're completely missing the point of OP's comment, and strangely, ironically lending credence to your interpretation of that comment lol (self-inflicted harm).
We don't have a good scientific or philosophical handle on what it actually means to "think" (let alone consciousness).
Humanity has so far been really bad at even using relative heuristics based on our own experiences to recognize, classify, and reason about entities that "think."
So it's really amusing when authors just arbitrarily side-step this whole issue and describe these systems as categorically not being real but imitating the real thing... all the while not realizing such characterizations apply to humanity as well.
> Without specification, we employ a decoder-only language model GPT2 (Radford et al., 2019) with a configuration of 4 layers, 32 hidden dimensions, and 4 attention heads.
Yeah, ok. The research is interesting and warranted, but writing an article about it, leading with conclusions gathered from toy models, and implying this generalises to production LLMs is useless.
We've been here before with small models. Training on LLM outputs leads to catastrophic collapse. Every outlet led with this. But no-one read the fine-print: they were testing on small toy models, and were using everything that came out to re-train. Of course it's gonna fail. L3 / phi / gpt-oss models showed that you can absolutely train on synthetic datasets and have great results.
Research in this area is good, and needed. Mainly to understand limitations, discover if there are any scale levels where "emergent" stuff appears and so on. But writing articles based on incipient research, based on tiny models is not worth the effort.
Doing analysis on small models or small data is perfectly valid if the results extrapolate to large models. Which is why right now we're looking at new research papers that are still listing the same small datasets and comparing to the same small models that papers five years ago did.
I have nothing against researching this, I think it's important. My main issue is with articles choosing to grab a "conclusion" and imply it extrapolates to larger models, without any support for that. They are going for the catchy title first, fine-print be damned.
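For a sense of just how small the quoted setup is, here's roughly what a 4-layer, 32-dim, 4-head GPT-2 comes out to if you instantiate it with the HuggingFace transformers library (my sketch, using the default GPT-2 vocabulary and context length, which may not match the paper exactly):

```python
from transformers import GPT2Config, GPT2LMHeadModel  # pip install transformers torch

# Configuration quoted from the paper: 4 layers, 32 hidden dims, 4 attention heads.
config = GPT2Config(n_layer=4, n_embd=32, n_head=4)
model = GPT2LMHeadModel(config)

total = sum(p.numel() for p in model.parameters())
embed = model.transformer.wte.weight.numel() + model.transformer.wpe.weight.numel()
print(f"total parameters:     {total:,}")  # on the order of a million or two
print(f"embedding parameters: {embed:,}")  # the bulk of them
# Production LLMs sit at tens to hundreds of billions of parameters.
```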
I was just at the KDD conference and the general consensus agreed with this paper. There was only one keynoter who just made the assumption that LLMs are associated with reasoning, which was jarring as the previous keynoter had just explained at length why we need a neuro-symbolic approach instead.
The thing is, I think the current companies making LLMs are _not_ trying to be correct or right. They are just trying to hide it better. In the business future for AI the coding stuff that we focus on on HN - how AI can help/impact us - is just a sideline.
The huge-money business future of LLMs is to end consumers not creators and it is product and opinion placement and their path to that is to friendship. They want their assistant to be your friend, then your best friend, then your only friend, then your lover. If the last 15 years of social media has been about discord and polarisation to get engagement, the next 15 will be about friendship and love even though that leads to isolation.
None of this needs the model to grow strong reasoning skills. That's not where the real money is. And CoT - whilst super great - is just as effective if it's hiding better that it's giving you the wrong answer (by being more internally consistent) than if it's giving you a better answer?
"as the previous keynoter had just explained at length why we need a neuro-symbolic approach instead"
Do you have a link to the video for that talk ?
I don't think they were recorded. In fact, I don't think any of KDD gets recorded.
I think it was Dan Roth who talked about the challenges of reasoning from just adding more layers and it was Chris Manning who just quickly mentioned at the beginning of his talk that LLMs were well known for reasoning.
https://kdd2025.kdd.org/keynote-speakers/
> None of this needs the model to grow strong reasoning skills. That's not where the real money is
"And the world is more and more complex, and the administrations are less and less prepared"
(~~ Henry Kissinger)
> None of this needs the model to grow strong reasoning skills. That's not where the real money is.
I never thought about it like that, but it sounds plausible.
However, I feel like getting to this stage is even harder to get right compared to reasoning?
Aside from the <0.1% of severely mentally unwell people who already imagine themselves to be in relationships with AIs, I don't think a lot of normal people will form lingering attachments to them without solving the issue of permanence and memory.
They're currently essentially stateless; while that's surely enough for short-term attachment, I'm not seeing this becoming a bigger issue, because of that glaring shortfall.
It'd be like being in a relationship with a person with dementia; that's not a happy state of being.
Honestly, I think this trend is severely overstated until LLMs can sufficiently emulate memories and shared experiences. And that's still fundamentally impossible, just like "real" reasoning with understanding.
So I disagree after thinking about it more - emulated reasoning will likely have a bigger revenue stream via B2E applications compared to emotional attachment in B2C...
(the top post on HN right now is announcing Claude lets you buy a 1M token context. Extrapolate a few years.
Generally, there is a push towards 'context engineering' and there is a lot of bleeding edge research in snapshotting large contexts in ways to get the next back-forth turn in the conversation to be fast etc. So optimisations are already being made.)
As to general consensus, Hinton gave a recent talk, and he seemed adamant that neural networks (which LLMs are) really are doing reasoning. He gives his reasons for it. Is Hinton considered an outlier or?
A) Hinton is quite vocal about desiring to be an outsider/outlier as he says it is what lets him innovate.
B) He is also famous for his Doomerism, which often depends on machines doing "reasoning".
So...it's complicated, and we all suffer from confirmation bias.
This is sloppy, I was asking about scientific consensus from the perspective of the prior commenter as a conference-goer. I am not asking for opinions bordering on ad hominems of Hinton or any other scientist, please refrain from that style of misinformation.
I think Hinton uses terms like reasoning and creativity and consciousness in a way that are different from my own embeddings.
I recently had fun asking Gemini to compare how Wittgenstein and Chomsky would view calling a large transformer that was trained entirely on a synthetic 'language' (in my case symbols that encode user behaviour in an app) a 'language' or not. And then, for the killer blow, whether an LLM that is trained on Perl is a language model.
My point being that whilst Hinton is great and all, I don't think I can quite pin down his definitions of the precise words like reasoning etc. It's possible for people to have opposite meanings for the same words (Wittgenstein famously had two contradictory approaches in his lifetime). In the case of Hinton, I can't quite pin down how loosely or precisely he is using the terms.
A forward-only transformer like GPT can only do symbolic arithmetic to the depth of its layers, for example. And I don't think the solution is to add more layers.
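To illustrate what I mean by depth bounding sequential composition, here's a very loose toy analogy of mine (nothing to do with real transformer internals): each "pass" can resolve only one level of nesting, so deeper expressions need more passes than a fixed-depth budget allows, and writing out the partial result (chain-of-thought style) is one way to buy extra passes:

```python
import re

def reduce_once(expr: str) -> str:
    # One "pass": resolve a single innermost (a+b) pair, if any.
    return re.sub(r"\((\d+)\+(\d+)\)",
                  lambda m: str(int(m.group(1)) + int(m.group(2))),
                  expr, count=1)

def passes_needed(expr: str) -> int:
    n = 0
    while "(" in expr:
        expr = reduce_once(expr)  # emitting the partial result buys another pass
        n += 1
    return n

print(passes_needed("((((1+1)+1)+1)+1)"))  # 4 passes for nesting depth 4
```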
Of course humans are entirely neuro and we somehow manage to 'reason'. So YMMV.
Link to the talk?
It was a Royal Institution public lecture, "Will AI outsmart human intelligence? - with 'Godfather of AI' Geoffrey Hinton", https://www.youtube.com/watch?v=IkdziSLYzHw
Ultimately I somewhat disagreed with some of Hinton's points in this talk, and after some thought I came up with specific reasons/doubts, and yet at the same time, his intuitive explanations helped shift my views somewhat as well.
Not sure what all this is about; I somewhat regret taking a break from coding with LLMs to have it explained to me that it's all a mirage and a secret and sloppy plan for getting me an automagic egirl or something. ;)
The point being made doesn’t impact people who can find utility from LLM output.
It’s only when you need to apply it to domains outside of code, or a domain where it needs to actually reason, that it becomes an issue.
What does actually reason mean? It's doing this complex anesthesiologist x crna x resident surgery scheduling thingy for ~60 surgeries a day for this one client. Looked a lot like LSAT logic games stuff scaled up to me, took me almost 20-30m to hand check. Is that reasoning?
Right? Oh this fairly novel solution to the problem I was having that works and is well tested. Oh throw it away.. sorry the model can't think of stuff..
Back to square one!!
Can you please share a few sessions? I want to get a better sense of what people have achieved with generic LLMs that is novel. (Emphasis on "generic", I think I can more readily imagine how specialized models for protein folding can lead to innovation)
Because model size is a trivial parameter, and not a new paradigm.
What you're saying is like, you can't extrapolate that long division works on 100 digit numbers because you only worked through it using 7 digit numbers and a few small polynomials.
Scale changes the performance of LLMs.
Sometimes, we go so far as to say there is "emergence" of qualitative differences. But really, this is not necessary (and not proven to actually occur).
What is true is that the performance of LLMs at OOD tasks changes with scale.
So no, it's not the same as solving a math problem.
> What is true is that the performance of LLMs at OOD tasks changes with scale.
If scaling alone guaranteed strong OOD generalization, we’d expect the largest models to consistently top OOD benchmarks, but this isn’t the case. In practice, scaling primarily increases a model’s capacity to represent and exploit statistical relationships present in the training distribution. This reliably boosts in-distribution performance but yields limited gains on tasks that are distributionally distant from the training data, especially if the underlying dataset is unchanged. That’s why trillion-parameter models trained on the same corpus may excel at tasks similar to those seen in training, but won’t necessarily show proportional improvements on genuinely novel OOD tasks.
If you scale the LLM, you have to scale the tasks.
Of course performance improves on the same tasks.
The researchers behind the submitted work chose a certain size and certain size problems, controlling everything. There is no reason to believe that their results won't generalize to larger or smaller models.
Of course, not for the input problems being held constant! That is a strawman.
Alas, not true. It would be easier to predict progress if so.
This is 100% how it doesn't work with LLMs.
The extrapolation doesn't work if the transformer is too shallow (too few layers) relative to sequence length, because of https://arxiv.org/abs/2503.03961 . A bunch of tasks become unfeasible when the layer count is too low, and 4 layers is way too low. I.e. linearly increasing the number of layers in a model can result in a superlinear increase in performance on tasks like reasoning.
Aren't most major LLMs moving to an architecture where the model is made up of tons of smaller models?
There's a mountain of reasons why this makes sense from a cost perspective, and seemingly it does also for quality, too, as the newer models train substantially more cheaply and still outperform the older models.
Naively, this seems like it would be relevant.
> conclusions gathered from toy models and implying this generalises to production LLMs is useless
You are just trotting out the tired argument that model size magically fixes the issues, rather than just improves the mirage, and so nothing can be known about models with M parameters by studying models with N < M parameters.
Given enough parameters, a miraculous threshold is reached whereby LLMs switch from interpolating to extrapolating.
Sure!
That’s what has been seen in practice though. SOTA LLMs have been shown again and again to solve problems unseen in their data set; and despite their shortcomings they have become extremely useful for a wide variety of tasks.
Even a tiny model for, say, classifying hand-written digits, will correctly classify digits that didn't appear in its training data. (Otherwise it wouldn't be very useful.) That classification is interpolative; the hand-written digit lands in the space of the training data.
Every result is explainable as having come from training data. That's the null hypothesis.
The alternative hypothesis is that it's not explainable as having come from training data. That's a hard-to-believe, hard-to-prove negative.
You don't get anything out of any computational process that you didn't put in.
You actually do not classify digits that didn't appear, you classify different pictures of digits that DID appear.
Similarly, LLMs do not invent a new way of reasoning about problems or language. They do, however, apply these to unseen problems.
LLMs are one level of abstraction up, but it's a very interesting level of abstraction.
>you classify different pictures of digits that DID appear.
Are you implying models that classify hand-written digits don’t generalize and only work on training data?
No, that is false; a neural net trained on a decent set of handwritten digits will recognize a newly handwritten digit.
I'm saying that this is a strawman version of "not in the training data". The newly handwritten digit is squarely the same sort of stuff that is in the training data: an interpolation.
We are not surprised when we fit a curve to a bunch of points and then find points on the curve that are not exactly any of those points, but are located among the points.
Go too far outside of the cluster of points though and the curve is a hallucination.
This is the intuition behind interpolate vs extrapolate.
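The curve-fitting version of that intuition, as a quick numpy sketch (toy data I made up, nothing to do with the paper): a flexible fit looks great between the training points and degrades fast once you leave them:

```python
import numpy as np

rng = np.random.default_rng(42)

# "Training data": noisy samples of sin(x), but only on [0, 2*pi].
x_train = np.linspace(0, 2 * np.pi, 30)
y_train = np.sin(x_train) + rng.normal(scale=0.05, size=x_train.size)

coeffs = np.polyfit(x_train, y_train, deg=9)  # flexible model fit to the cluster

def err(x):
    return float(abs(np.polyval(coeffs, x) - np.sin(x)))

print("inside the training range, x=3.1: ", err(3.1))   # small, about the noise level
print("just outside the range,   x=8.0: ", err(8.0))    # typically much worse
print("far outside the range,    x=12.0:", err(12.0))   # wildly wrong
```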
Mind linking any examples (or categories) of problems that are definitively not in pre training data but can still be solved by LLMs? Preferably something factual rather than creative, genuinely curious.
Dumb question but anything like this that’s written about on the internet will ultimately end up as training fodder, no?
How about the International Math Olympiad?
https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...
You're saying they don't use math textbooks and math forums to train LLMs, then?
The problems are not in textbooks. I’m curious what would count as an out of distribution problem for you. Only problems no one knows how to solve?
You can apply this same argument to humans, 99.999% of people will not be able to escape it.
In the case of the Math Olympiad, the students who take it grind hours a day for months on practice problems and past Olympiad problems.
> SOTA LLMs have been shown again and again to solve problems unseen in their data set
We have no idea what the training data is though, so you can't say that.
> and despite their shortcomings they have become extremely useful for a wide variety of tasks.
That seems like a separate question.
I have applied O3 pro to unpublished, abandoned research of mine that lives in an intersection that is as entirely novel as it is uninteresting.
O3 pro (but not O3) was successfully able to apply reasoning and math to this domain in interesting ways, much like an expert researcher in these areas would.
Again, the field and the problem is with 100% certainty OOD of the data.
However, the techniques and reasoning methods are of course learned from data. But that's the point, right?
The paper is evaluating how well an LLM can handle novelty, and on the paper's terms you need to calculate or otherwise somehow deduce the degree or type of novelty rather than simply describing your never published research as novel.
I don't even know that this is possible without seeing the training data. Hence the difficulty in describing how good at "reasoning" O3 Pro is.
The most novel problem would presumably be something only a martian could understand, written in an alien language, the least novel problem would be a basic question taught in preschool like what color is the sky.
Your research falls somewhere between those extremes.
LLMs don't learn reasoning. At all. They are statistical language models. Nothing else. If they get math right it's because correct math is more statistically probable given the training data, it can't actually do math. This should be pretty clear from all the "how many Rs are there in strawberry" type examples.
I think it is worth writing about simply because it might get the (cost constrained) researcher’s work in front of someone who has the near-unlimited research budgets at one of the big AI companies.
The results from a smaller model are still viable if the paradigm is identical. Unless you believe that larger volumes of data lead to more (unexplained) emergent properties of the AI. I.e., if you think that a larger volume of training data somehow means the model develops actual reasoning skills, beyond the normal next-token prediction.
I do think that larger models will perform better, but not because they fundamentally work differently than the smaller models, and thus the idea behind TFA still stands (in my opinion).
>Training on LLM outputs leads to catastrophic collapse. Every outlet led with this. But no-one red the fine-print, they were testing on small toy models, and were using everything that came out to re-train. Of course it's gonna fail. L3 / phi / gpt-oss models showed that you can absolutely train on synthetic datasets and have great results
You're conflating two very different things. Training on synthetic data one time is very different than cyclically training models on their own data. It has nothing to do with model size.
Perhaps I worded it poorly. My main point was that articles focus on the wrong thing. Most coverage of that paper was "Using LLM generated data leads to CATASTROPHIC collapse", without reading the fine print.
> [...] cyclically training models on their own data. It has nothing to do with model size.
Of course it does. GRPO is basically "training models on their own data". You sample, you check for a known truth, you adapt the weights. Repeat. And before GRPO there was RLAIF which showed improving scores at 3 "stages" of generate - select - re-train. With diminishing returns after 3 stages, but no catastrophic collapse.
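For what it's worth, here's a deliberately tiny sketch of that sample / check-against-a-known-truth / update loop. It's a one-step REINFORCE bandit, not GRPO's actual grouped-advantage objective, and every name and number in it is made up:

```
# Toy "sample your own output, verify against a known truth, adapt the weights" loop.
import math, random

candidates = ["4", "5", "22"]   # possible completions for "2 + 2 = ?"
logits = [0.0, 0.0, 0.0]        # the "model": one logit per candidate
truth = "4"
lr = 0.5

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    z = sum(es)
    return [e / z for e in es]

for _ in range(300):
    probs = softmax(logits)
    i = random.choices(range(len(candidates)), weights=probs)[0]   # sample own output
    reward = 1.0 if candidates[i] == truth else -1.0               # check a known truth
    for j in range(len(logits)):                                   # adapt the weights
        indicator = 1.0 if j == i else 0.0
        logits[j] += lr * reward * (indicator - probs[j])          # REINFORCE-style update

print(softmax(logits))   # probability mass ends up concentrated on "4"
```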
My main point was about articles and cherrypicking catchy phrases, not criticising research. We need the research. But we also need good articles that aren't written just because negativity sells.
cheeky edit: see this thread [1]. I know slashdot has fallen a lot in recent years, but I skimmed the root comments. Not one addresses the "toy" model problem. Everyone reads the title and reinforces their own biases. That's the main problem I was trying to address.
1 - https://slashdot.org/story/25/08/11/2253229/llms-simulated-r...
If you have a ground truth that you're comparing to, that's not training on your own data.
"Training on synthetic data one time is very different than cyclically training models on their own data.", but every one with even a modicum of understanding of feedback knows that cyclic training on its own output will end in tears; it's bordering on a tautologic inverse.
Is there an actual general principle or theorem or anything that you can link on this? I’m skeptical because these “model collapse” ideas sound vaguely technical and intuitive, but mostly seem to be based on observations about things that happened to happen with current LLMs. It gets bandied about like it is the most obvious thing, but the support mostly seems to be… pseudo-technical vibes.
Almost every mention I've seen of gpt-oss was a complaint that the training on synthetic datasets produced a model that's mostly good at benchmarks. Are benchmarks the great results you're referring to or are there a lot of satisfied users out there that just don't post here on HN? Genuinely curious.
I can see how performing well on benchmarks at the expense of everything else counts as great results if that's the point of the model.
Well now they could use GPT-OSS, but it wasn't out when they began the study.
I've recently been taking a look at another paper, from 2023, and subsequent research. It has a morally similar finding, though it's not focused on "reasoning traces", and it's based on GPT-4:
https://proceedings.neurips.cc/paper_files/paper/2023/hash/d...
It's interesting that there's still such a market for this sort of take.
> In a recent pre-print paper, researchers from the University of Arizona summarize this existing work as "suggest[ing] that LLMs are not principled reasoners but rather sophisticated simulators of reasoning-like text."
What does this even mean? Let's veto the word "reasoning" here and reflect.
The LLM produces a series of outputs. Each output changes the likelihood of the next output. So it's transitioning in a very large state space.
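As a toy picture of that (the tokens and probabilities below are invented), each emitted token conditions the distribution over the next one:

```
# Toy autoregressive walk: each output changes the likelihood of the next output.
import random

next_token = {                                   # invented conditional probabilities
    "the":      {"answer": 0.6, "question": 0.4},
    "answer":   {"is": 1.0},
    "question": {"is": 1.0},
    "is":       {"42": 0.7, "unknown": 0.3},
}

state = "the"
path = [state]
while state in next_token:
    options = next_token[state]
    state = random.choices(list(options), weights=list(options.values()))[0]
    path.append(state)

print(" ".join(path))   # e.g. "the answer is 42" -- one path through the state space
```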
Assume there exist some states that the activations could be in that would cause the correct output to be generated. Assume also that there is some possible path of text connecting the original input to such a success state.
The reinforcement learning objective reinforces pathways that were successful during training. If there's some intermediate calculation to do or 'inference' that could be drawn, writing out a new text that makes that explicit might be a useful step. The reinforcement learning objective is supposed to encourage the model to learn such patterns.
So what does "sophisticated simulators of reasoning-like text" even mean here? The mechanism that the model uses to transition towards the answer is to generate intermediate text. What's the complaint here?
It makes the same sort of sense to talk about the model "reasoning" as it does to talk about AlphaZero "valuing material" or "fighting for the center". These are shorthands for describing patterns of behaviour, but of course the model doesn't "value" anything in a strictly human way. The chess engine usually doesn't see a full line to victory, but in the games it's played, paths which transition through states with material advantage are often good -- although it depends on other factors.
So of course the chain-of-thought transition process is brittle, and it's brittle in ways that don't match human mistakes. What does it prove that there are counter-examples with irrelevant text interposed that cause the model to produce the wrong output? It shows nothing --- it's a probabilistic process. Of course some different inputs lead to different paths being taken, which may be less successful.
> The mechanism that the model uses to transition towards the answer is to generate intermediate text.
Yes, which makes sense, because if there's a landscape of states that the model is traversing, and there are probabilistically likely pathways between an initial state and the desired output, but there isn't a direct pathway, then training the model to generate intermediate text in order to move across that landscape so it can reach the desired output state is a good idea.
Presumably LLM companies are aware that there is (in general) no relationship between the generated intermediate text and the output, and the point of the article is that by calling it a "chain of thought" rather than "essentially-meaningless intermediate text which increases the number of potential states the model can reach" users are misled into thinking that the model is reasoning, and may then make unwarranted assumptions, such as that the model could in general apply the same reasoning to similar problems, which is in general not true.
Meaningless? The participation in a usefully predicting path is meaning. A different meaning.
And Gemini has a note at the bottom about mistakes, and many people discuss this. Caveat emptor, as usual.
So, you agree with the point that they’re making and you’re mad about it? It’s important to state that the models aren’t doing real reasoning because they are being marketed and sold as if they are.
As for your question: ‘So what does "sophisticated simulators of reasoning-like text" even mean here?’
It means CoT interstitial “reasoning” steps produce text that looks like reasoning, but is just a rough approximation, given that the reasoning often doesn’t line up with the conclusion, or the priors, or reality.
What is "real reasoning"? The mechanism that the models use is well described. They do what they do. What is this article's complaint?
For example - at minimum, the reasoning should match what actually happened. This is not even a complete set of criteria for reasoning, but at least a minimal baseline. Currently LLMs are generating BS in the "reasoning" part of the output. For example, ask an LLM to "reason" about how it produces the sum of two numbers and you will see that the explanation doesn't match at all what the model actually did in the background. The "reasoning" it outputs is simply an extract of the reasoning that humans did in its training data. Even Anthropic officially admits this. If you ask a program how to do maintenance on a gearbox and it replies with a very well articulated and correct (important!) guide to harvesting wheat, then we can't call it reasoning of any kind, even though the wheat-farming guide was correct and logical.
As soon as you introduce multiple constraints on what is and isn't reasoning people get confused and disengage.
I like this approach of setting a minimum constraint. But I feel adding more will just make people ignore the point entirely.
The reality is obvious. The only way not to see it when looking at research like this is to not want to see it. The idea that this critique is somehow more confusing than the use of the word "reasoning" itself is farcical.
LLMs are cool and some of the things they can do now are useful, even surprising. But when it comes to AI, business leaders are talking their books and many people are swept up by that breathless talk and their own misleading intuitions, frequently parroted by the media.
The "but human reasoning is also flawed, so I can't possibly understand what you mean!" objection cannot be sustained in good faith short of delusion.
“the mechanism the models use is well described”
Vs
AI capex in the past 6 months contributed more to GDP growth than US consumer spending
Or
AGI is coming
Or
AI Agents will be able to do most white collar work
——
The paper is addressing parts of the conversation and expectations of AI that are in the HYPE quadrant. There’s money riding on the idea that AI is going to begin to reason reliably. That it will work as a ghost in the machine.
This is why research like this is important and needs to keep being published.
What we have seen the last few years is a conscious marketing effort to rebrand everything ML as AI and to use terms like "Reasoning", "Extended Thinking" and others that, for many non-technical people, give the impression that it is doing far more than it is actually doing.
Many of us here can see this research and be like... well yeah, we already knew this. But there is a very well funded effort to oversell what these systems can actually do, and that is reaching the people that ultimately make the decisions at companies.
So the question is no longer whether AI Agents will be able to do most white collar work. They can probably fake it well enough to accomplish a few tasks, and management will see that. But will the output actually be valuable long term, versus the short-term gains?
I'm happy enough if I'm better off for having used a tool than having not.
Most people weren’t happy when the 2008 crash happened, and bank bailouts were needed, and a global recession ensued.
Most people here are going to use a coding agent, be happy about it (like you), and go on their merry way.
Most people here are not making near trillion dollar bets on the world changing power of AI.
EVERYONE here will be affected by those bets. It’s one thing if those bets pay off if future subscription growth matches targets. It’s an entirely different thing if those bets require “reasoning” to pan out.
The scary thing about ML isn’t that it’s poised to eat a lot of lower-reasoning tasks, it’s that we’re going to find ourselves in a landscape of “that’s just what the AI said to do” kind of excuses for all kinds of bad behavior, and we’re completely unwilling to explore what biases are encoded in the models we’re producing. It’s like how Facebook abdicates responsibility for how users feel because it’s just the product of an algorithm. And if I were a betting person I’d bet all this stuff is going to be used for making rental determinations and for deciding who gets exceptions to overdraft fees well before it’s used for anything else. It’s an enabling technology for all kinds of inhumanity.
"the reasoning often doesn’t line up with the conclusion, or the priors, or reality."
My dude, have you ever interacted with human reasoning?
Are you sure you are not comparing to human unreason?
Most of what humans think of as reason is actually "will to power". The capability to use our faculties in a way that produces logical conclusions seems like an evolutionary accident, an off-label use of the brain's machinery for complex social interaction. Most people never learn to catch themselves doing the former when they intended to engage in the latter; some don't know the difference. Fortunately, the latter provides a means of self-correction, and the research here hopes to elucidate whether an LLM-based reasoning system has the same property.
In other words, given consistent application of reason I would expect a human to eventually draw logically correct conclusions, decline to answer, rephrase the question, etc. But with an LLM, should I expect a non-deterministic infinite walk through plausible nonsense? I expect reasoning to converge.
It's not clear what LLMs are good at, and there's great interest in finding out. This is made harder by the frenetic pace of development (GPT 2 came out in 2019). Not surprising at all that there's research into how LLMs fail and why.
Even for someone who kinda understands how the models are trained, it's surprising to me that they struggle when the symbols change. One thing computers are traditionally very good at is symbolic logic. Graph bijection. Stuff like that. So it's worrisome when they fail at it. Even in this research model which is much, much smaller than current or even older models.
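For contrast, this is exactly the kind of relabeling a classical symbolic tool shrugs off. A quick sketch using networkx (assumed installed), checking that a triangle labeled a/b/c is the "same" graph as one labeled x/y/z:

```
# Classical tooling handles "same structure, different symbols" directly.
import networkx as nx

g1 = nx.Graph([("a", "b"), ("b", "c"), ("c", "a")])   # triangle with labels a, b, c
g2 = nx.Graph([("x", "y"), ("y", "z"), ("z", "x")])   # same triangle, new symbols

print(nx.is_isomorphic(g1, g2))   # True: relabeling doesn't change the structure
```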
> It's interesting that there's still such a market for this sort of take.
What do you think the explanation might be for there being "such a market"?
Not sure why everyone is downvoting you as I think you raise a good point - these anthropomorphic words like "reasoning" are useful as shorthands for describing patterns of behaviour, and are generally not meant to be direct comparisons to human cognition. But it goes both ways. You can still criticise the model on the grounds that what we call "reasoning" in the context of LLMs doesn't match the patterns we associate with human "reasoning" very well (such as ability to generalise to novel situations), which is what I think the authors are doing.
""Sam Altman says the perfect AI is “a very tiny model with superhuman reasoning".""
It is being marketed as directly related to human reasoning.
Sure, two things can be true. Personally I completely ignore anything Sam Altman (or other AI company CEOs/marketing teams for that matter) says about LLMs.
If you read the comments of AI articles on Arstechnica, you will find that they seem to have become the tech bastion of anti-AI sentiment. I'm not sure how it happened, but it seems they found or fell into a strong anti-AI niche, and now feed it.
You cannot even see the comments of people who pointed out the flaws in the study, since they are so heavily downvoted.
> ... that these "reasoning" models can often produce incoherent, logically unsound answers when questions include irrelevant clauses or deviate even slightly from common templates found in their training data.
I have encountered this problem numerous times now. It really makes me believe that the models do not really understand the topic, not even the basics, but just try to predict the text.
One recent example was me asking the model to fix my docker-compose file. In it, there's `network: host` in the `build` section. The model kept assuming that the container would be running with the host network and kept asking me to remove it as a way to fix my issue, even though removing it wouldn't do anything for the running container, because the container runs only on the `custom_net` network. The model was obsessed with it and kept telling me to remove it until I explicitly told it that this is not, and cannot be, the issue.
```
services:
  app:               # service name illustrative; the original snippet was truncated
    build:
      network: host  # only affects the build step
    networks:
      - custom_net
networks:
  custom_net:
```
> It really makes me believe that the models do not really understand the topic, not even the basics, but just try to predict the text.
This is correct. There is no understanding, there aren't even concepts. It's just math, it's what we've been doing with words in computers for decades, just faster and faster. They're super useful in some areas, but they're not smart, they don't think.
I’ve never seen so much misinformation trotted out by the laity as I have with LLMs. It’s like I’m in a 19th century forum with people earnestly arguing that cameras can steal your soul. These people haven’t a clue of the mechanism.
https://www.experimental-history.com/p/bag-of-words-have-mer... Here is an explanation.
This assessment fits with my anecdotal evidence. LLMs just cannot reason in any basic way.
LLMs have a large knowledge base that can be spit out at a moment's notice. But they have zero insight into its contents, even when the information has just been asked about a few lines before.
Most of the "intelligence" that LLMs show is just the ability to ask in the correct way the correct questions mirrored back to the user. That is why there is so many advice on how to do "proper prompting".
That, and the fact that most questions have already been asked before, as anyone who spent some time on StackOverflow back in the day realized. And memory, not reasoning, is what is needed to answer them.
Please don't tell me you were one of those people marking every SO question as a duplicate, more often than not missing the nuance that made it not a duplicate at all, with the answers to the so-called previously asked question utterly unusable.
This was one of those infuriating things that drove so many away from SO and made them jump ship the second there was an alternative.
I'm not sure why duplicates were ever considered an issue. For certain subjects (like JS) things evolved so quickly during the height of SO that even a year old answer was outdated.
That, and search engines seemed to promote more recent content... so an old answer sank under the ocean of blog spam.
SO wanted to avoid being a raw Q&A site in favor of something more like a wiki.
If a year-old answer on a canonical question is now incorrect, you edit it.
But the answer has not become incorrect. It is still correct for that question in that specific context. More likely, the 'canonicalization process' was overly coarse (for SEO?), inconsistent and confused.
That's a valid goal, but they should have adapted the software to the community instead of trying to adapt the community to the software.
SO's biggest asset was its community and while they treated it with some respect in the beginning they took it for granted and trashed it later.
I think this policy was, in large part, intended to respect the user base, who get exhausted answering the same question over and over.
I do agree they later trashed that relationship with the Monica incident and AI policies.
Sounds like they optimised for a select 1% class of self-appointed gatekeepers rather than the broad user base. Classic mistake of nearly every defunct social site.
To a certain extent, you have to. Same reasons Wikipedia has a core clique of editors; they do a lot of the work.
It worked beautifully for quite a while. I don't think anyone anticipated ChatGPT when planning it all out.
Then they should have made a wiki instead of a Q&A site
They did, really. That's why I can edit anyone else's questions and answers.
I was "playing" the gamification part of StackOverflow. I wanted to ask a good question for points. But it was very difficult because any meaningful question had already been asked. It was way easier to find questions to answer.
Every time I ask people for an example of this, and get one, I agree with the duplicate determination. Sometimes it requires a little skimming of the canonical answers past just the #1 accepted one; sometimes there's a heavily upvoted clarification in a top comment, but it's usually pretty reasonable.
>This assessment fits with my anecdotal evidence. LLMs just cannot reason in any basic way.
Agreed completely, and the sentiment seems to be spreading at an ever-increasing rate. I wonder how long it will be before the bubble collapses. I was thinking maybe as long as a few years, but it might be far sooner at this rate. All it will take is one of the large AI companies coming out and publicly stating that they're no longer making meaningful gains, or some other event that shows the public what's really going on behind the curtain.
I'm certain the AI hype bubble will be studied for generations as the greatest mass delusion in history (so far).
I've used LLMs to generate code for a custom serverless framework which I wrote from scratch and which they had never seen before. The framework follows some industry conventions but applied in a distinct way, with some distinct features which I have not yet encountered in any other framework...
I'm willing to accept that maybe LLMs cannot invent entirely new concepts but I know for a fact that they can synthesize and merge different unfamiliar concepts in complex logical ways to deliver new capabilities. This is valuable on its own.
Hold on, their evaluation tasks are based on rotating letters in text? Isn't this a known weak area for token-based models?
I think that's the point, really: It's a reliable and reproducible weakness, but also one where the model can be trained to elicit impressive-looking "reasoning" about what the problem is and how it "plans" to overcome it.
Then when it fails to apply the "reasoning", that's evidence the artificial expertise we humans perceived or inferred is actually some kind of illusion.
Kind of like a Chinese Room scenario: If the other end appears to talk about algebra perfectly well, but just can't do it, that's evidence you might be talking to a language-lookup machine instead of one that can reason.
Reminds me of a number of grad students I knew who could “talk circles” around all sorts of subjects but failed to ever be able to apply anything.
Heh, but just because a human can fail at something doesn't mean everything that fails at it is human. :p
Right, but if you're saying that something is 'incapable of reasoning' because of a failure mode also found in humans, then either humans are 'incapable of reasoning' or you concede that failure mode isn't a justification for that gross assertion. You can't have it both ways.
> Then when it fails to apply the "reasoning", that's evidence the artificial expertise we humans perceived or inferred is actually some kind of illusion.
That doesn't follow if the weakness of the model manifests on a different level, one we wouldn't count against rationality in a human.
For example, a human might have dyslexia, a disorder on the perceptive level. A dyslexic can understand and explain his own limitation, but that doesn't help him overcome it.
I think you're conflating two separate issues: One is the original known impairment that we don't actually care much about, and the other is bullshitting about how the first problem is under control.
Suppose a real person outlines a viable plan to work around their dyslexia, and we watch them not do any of it during the test, and they turn in wrong results while describing the workaround they (didn't) follow. This keeps happening over and over.
In that case, we'd probably conclude they have another problem that isn't dyslexia, such as "parroting something they read somewhere and don't really understand."
Typically when a human has a disorder or limitation they adapt to it by developing coping strategies or making use of tools and environmental changes to compensate. Maybe they expect a true reasoning model to be able to do the same thing?
The argument is that letter-level information is something LLMs don't have a chance to see.
It's a bit like asking a human to read a text and guess the gender or emotional state of the author who wrote it. You just don't have that information.
Similarly you could ask why ":) is smiling and :D is happy", where the question will be seen as "[50372, 382, 62529, 326, 712, 35, 382, 7150]" - the encoding loses this information; it's only visible in a visual rendering of the text.
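You can see the chunking yourself with a tokenizer library. A small sketch using tiktoken (assumed installed; exact IDs depend on the encoding):

```
# Peek at what the model actually "sees": token IDs, not letters or glyphs.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in [":) is smiling and :D is happy", "how many Rs are in strawberry?"]:
    ids = enc.encode(text)
    print(ids)                              # integer IDs, nothing letter-shaped about them
    print([enc.decode([i]) for i in ids])   # the chunks the model sees; letters aren't first-class units
```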
The point isn't that they fail at the task.
The point is that if the model were really "reasoning", it would fail differently. Instead, what happens is consistent with it BSing on a textual level.
I have a real world problem I gave o1 when it came out and it got it quite wrong. It's a scheduling problem with 4 different constraints that vary each day, and success criteria that need to be fulfilled over the whole week.
GPT-5 Thinking (Think Longer) and Opus 4.1 Extended Thinking both get it right.
Maybe this unique problem is somehow a part of synthetic training data? Or maybe it's not and the paper is wrong? Either way, we have models that are much more capable at solving unique problems today.
Models today also have access to certain tooling, or have been reinforced to use that tooling in complicated situations. E.g., questions about counting letters in a word are answered by running Python code in the background.
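I.e. something like this trivial snippet running in the tool sandbox, rather than the model "reasoning" over its own tokens:

```
# The kind of one-liner a tool-using model can run instead of counting tokens.
word = "strawberry"
print(word.count("r"))   # 3
```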
“the researchers created a carefully controlled LLM environment in an attempt to measure just how well chain-of-thought reasoning works when presented with "out of domain" logical problems that don't match the specific logical patterns found in their training data.”
Why? If it’s out of domain we know it’ll fail.
> Why? If it’s out of domain we know it’ll fail.
To see whether LLMs adhere to logic, or whether the observed "logical" responses are rather a reproduction of patterns.
I personally enjoy this idea of isolating "logic" from "pattern" and seeing if "logic" will manifest in LLM "thinking" in a "non-patternized" domain.
--
Also, it's never a bad thing to give the public proof that "thinking" (like "intelligence") in an AI context isn't the same thing we think about intuitively.
--
> If it’s out of domain we know it’ll fail.
Below is a question which is out of domain. Yet LLMs handle it in what appears to be a logical way.
``` Kookers are blight. And shmakers are sin. If peker is blight and sin who is he? ```
It is out of domain and it does not fail (I've put it through thinking Gemini 2.5). Now back to the article: is the observed logic intrinsic to LLMs, or is it an elaborate form of pattern? According to the article, it's a pattern.
Out of domain means that the type of logic hasn’t been in the training set.
“All A are B, All C are D, X is A and B, what is X?” is not outside this domain.
I don't think we know that it'll fail, or at least that is not universally accepted as true. Rather, there are claims that given a large enough model / context window, such capabilities emerge. I think skepticism of that claim is warranted. This research validates that skepticism, at least for certain parameters (model family/size, context size, etc.).
There's a question which was rhetorically asked by Yaser S. Abu-Mostafa: "How do we know if we're learning from data?" and his answer was: "We are learning from data if we can generalize from our training set to our problem set."
To me, it feels a lot like Deming's "what gets measured gets done" (with the quiet part "...oftentimes at the expense of everything else."). Of course, the quiet part is different in this case.
What is this "domain" of which you speak? Because LLMs are supposedly good for flying airplanes, mental health, snakebites, and mushroom poisoning.
It's getting to the nub of whether models can extrapolate instead of just interpolate.
If they had _succeeded_, we'd all be taking it as proof that LLMs can reason, right?
We're rapidly reaching the trough of disillusionment with LLMs, and with other generative transformer models for that matter. I am happy because it will help a lot of misinformed people understand what is and isn't possible (100+% productivity gains are not).
100% productivity gains on coding tasks are absolutely within the realm of possibility
It is possible to say the same about low-code solutions, e.g. a perfect UI could be used instead of writing a single line of code. The problem is that creating such a system is too resource-intensive and counterproductive, and such a system does not exist. Similarly, coding always has some problem that cannot be generalised because the pattern does not exist in the training data, and creating such a pattern defeats the goal of having such a system.
And how much productivity is lost due to the insane amount of noise being generated? (filler-ridden reports, emails, videos, podcasts, &c.)
I'm talking about 100% net gain in productivity.
You're talking about net gains in "coding tasks" productivity; I'm talking about productivity gains across the board.
My company deals with an insane amount of customers who use chatgpt to pre-debug their problems before coming to our support. Once they contact our support they regurgitate LLM-generated BS to our support engineers thinking they're going to speed up the process; the only thing they're doing is generating noise that slows everyone down, because chatgpt has absolutely no clue about our product and keeps sending them on wild goose chases. Sometimes they even lie, pretending "a colleague" steered them in this or that direction, while it's 100% obvious the whole thing was hallucinated and even written by an LLM.
I can't tell you how frustrating it is to read a 10-minute-long customer email only to realise it's an LLM hallucinating probable causes for a bug that takes 2 sentences to describe.
I agree with that idea. For more business development areas, AI slop can slow things down.
I do think that these kinks will eventually work themselves out and actually increase productivity in these areas. People also need to learn that it is not acceptable to just generate some BS and send it to your boss or colleague. That just transfers the real work of understanding the generated content to someone else.
If that is the case, I would argue that you were taking money for doing a job that should've been automated or abstracted already.
I would disagree. I would argue that if you aren't seeing gains in your productivity, you're either using the tools incorrectly, or you are in some ultra specific niche area of coding that AI isn't helpful on yet.
Don't know, maybe in technical circles, but for users the thrill is still going on, and rising.
The article already seems outdated on the first day. The key points about SFT are irrelevant in the era of RL.
remind me in 2 days
The math in the original paper is questionable. By leaving free the choice of divergence in Eq 3, Eq 4 has no practical value except when said divergence is zero exactly.
(in mice)
If only we could train people like that to see their reasoning output...
> LLMs are [...] sophisticated simulators of reasoning-like text
Most humans are unsophisticated simulators of reasoning-like text.
Except you, right? You're one of the special few who can actually reason, not like /those/ people.
You're completely missing the point of OP's comment, and strangely, ironically lending credence to your interpretation of that comment lol (self-inflicted harm).
We don't have a good scientific or philosophical handle on what it actually means to "think" (let alone consciousness).
Humanity has so far been really bad at even using relative heuristics based on our own experiences to recognize, classify, and reason about entities that "think."
So it's really amusing when authors just arbitrarily side-step this whole issue and describe these systems as categorically not being real but imitating the real thing... all the while not realizing such characterizations apply to humanity as well.
'Chain-of-thought AI "degrades significantly" when asked to generalize beyond training.' - yeah thanks Captain Obvious.