sigbottle a month ago

One thing I don't like about the trend in reasoning LLMs is the over-optimization to coding problems / math problems in particular.

A lot of things that aren't well-defined require reasoning, and not just in a "SWE is ambiguous" kind of way - for example, thinking about how to present/teach something in a good way, iterating with the learner, thinking about what context they could be missing, etc.

I find that all of these reasoning models really will overfit and overthink if you attach any kind of math problem to the prompt, but they will barely think about anything else. I had friends suggest to me (I can't tell if in jest or seriously) that other fields don't require thinking, but I dunno, a lot of these "soft things" I think about really hard and don't have great solutions to.

I've always been a fan of self-learning, for example - wouldn't it be great to have a conversation partner who can both infer and understand your misconceptions about complex topics when trying to learn, just from a few sentences, and then guide you for that?

It's not like it's fundamentally impossible. These LLMs definitely can solve harder coding problems when you make them think. It's just that I'm pretty sure (and it's really noticeable with DeepSeek) that they're overfit towards coding/math puzzles in particular.

It's really noticeable with DeepSeek when you ask its reasoning model to just write some boilerplate code... you can tell it's completely overfit because it will just overthink and overthink and overthink. But it doesn't do that with, for example, "soft" questions. In my opinion, this points to the idea that it's not really deciding for itself "how much thinking is enough thinking" and that it's just really overfit. Which I think can be solved, again, but I think it's more of a training decision issue.

  • mitthrowaway2 a month ago

    I think this is because they're trained using RL, and math and coding problems offer an easy way to automatically assess an answer's correctness. I'm not sure how you'd score the correctness of other types of reasoning problems without a lot of manual (and highly subjective!) effort. Perhaps using simulations and games?
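
    A toy sketch of what such an automatically verifiable reward could look like (the \boxed{} convention and the function name are just illustrative here, not any particular lab's setup):

      # Hypothetical reward for RL on math problems: correctness is cheap to check.
      import re

      def math_reward(model_output: str, ground_truth: str) -> float:
          """1.0 if the final boxed answer matches the known solution, else 0.0."""
          match = re.search(r"\\boxed\{([^}]*)\}", model_output)
          if match is None:
              return 0.0
          return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

      # A "soft" answer ("explain recursion to a beginner") has no such cheap,
      # objective check; you'd need a human or another model to score it.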

    • godelski a month ago

      This is a misconception. Coding is very difficult to verify; it's just that everyone takes a good-enough approach. They check the output and if it looks good they move on. But you can't just test and check your way through problems. If this was true we wouldn't have bugs lol. I hear you: "the test set just didn't have enough coverage." Great! Allow me to introduce you to black swans.

      • ogrisel a month ago

        Software Engineering is difficult to verify because it requires dealing with an ambiguous understanding of the end user's actual needs / value, and subtle trade-offs between code maintainability vs feature coverage vs computational performance.

        Algorithmic puzzles, on the other hand, both require reasoning and are easy to verify.

        There are other things in coding that are both useful and easy to verify: checking that the generated code follows formatting standards or generating outputs with a specific data schema and so on.
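
        For the schema case, a sketch of such a check-based reward (the field names here are made up for illustration):

          # Hypothetical reward: is the output valid JSON matching an expected shape?
          import json

          def schema_reward(model_output: str) -> float:
              try:
                  obj = json.loads(model_output)
              except json.JSONDecodeError:
                  return 0.0
              ok = (isinstance(obj, dict)
                    and isinstance(obj.get("title"), str)
                    and isinstance(obj.get("tags"), list))
              return 1.0 if ok else 0.0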

        • godelski a month ago

          I agree with you on the first part, but no, code is not easy to verify. I think you missed part of what I wrote. I mean verify that your code is bug free. This cannot be done purely through testing. Formal verification still remains an unsolved problem.

          • FieryTransition a month ago

            But if you have a large set of problems for which you already know the answer, and you use those in reinforcement learning, wouldn't the expertise transfer later to problems with no known answers? That seems like a feasible strategy, right?

            Another issue is how much data you can synthesize in such a way that you construct both the problem and the solution, so that you know the answer before using it as a sample.

            I.e., some problems are easy to create when you can construct them yourself, but would be hard to solve with no prior knowledge, and so could be used as a scoring signal?

            Ie, you are the Oracle and whatever model is being trained doesn't know the answer, only if it is right or wrong. But I don't know if the reward function must be binary or on a scale.

            Does that make sense or is it wrong?

            • godelski a month ago

              I don't think this makes sense and I'm not quite sure why you went to ML, but that's okay. I am a machine learning researcher, but also frustrated with the state of machine learning, in part because, well... you can probably see how "proof by empirical evidence" is dialed up to 11.

              Sorry, long answer incoming. It is far from complete too but I think it will help build strong intuition around your questions.

              Will knowledge transfer? That entirely depends on the new problem. It also entirely depends on how related the problem is. But also, what information was used to solve the pre-transfer state. Take LLMs for example. There are lots of works showing that they are difficult to train to solve calculations: they will do well on problems with the same number of digits, but this degrades rapidly as the number of digits increases. It can be weird to read some of these papers, as there will sometimes be periodic relationships with the number of digits, but that should give us information about how they're encoding the problems. That lack of transferability indicates that what we'd believe is actually just the same problem isn't the same problem to the model. So you have to be really careful here, because us humans are really fucking good at generalization (yeah, we also suck, but a big part is that our proficiency makes us recognize where we lack. But also, this is more of a "humans can" than a "humans do" type of thing. So be careful when comparing). This generalization is really because we're focused on building causal relationships, while on the other hand the ML algorithms are built around compression (i.e. fitting data). Which, if you notice, is the same issue I was pointing to above.

                > Ie, you are the Oracle and whatever model is being trained doesn't know the answer, only if it is right or wrong. But I don't know if the reward function must be binary or on a scale.
              
              This entirely depends on the problem. We can construct simple problems that illustrate both success and failure. What you really need to think about here is the information gain from the answer. If you check how to calculate that, you will see the dependence (we could get into Bayesian learning or experiment design but this is long enough). But let's think of a simple example in the negative direction. If I ask you to guess where I'm from, you're going to have a very hard time pinning down the exact location. Definitely in this example there is an efficient method, but our ML algorithms don't start with prior knowledge about strategies, so they aren't going to know to binary search. If you gave that to the model, you baked in that information. This is a tricky form of information leakage. It can be totally fine to bake in knowledge, but we should be aware of how that changes how we evaluate things (we always bake in knowledge, btw. There is no escaping this). But most models would not have a hard time if instead we played "hot/cold", because the information gain is much higher. We've provided a gradient to the solution space. We might call these hard and soft labels, respectively.
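
              To make the hard/soft label contrast concrete, here's a toy sketch (the numbers and reward shapes are arbitrary, just for illustration):

                # "Guess my number" with two kinds of feedback.
                import random

                SECRET = 734  # target in [0, 1000]

                def hard_reward(guess: int) -> float:
                    # right-or-wrong: almost no information until an exact hit
                    return 1.0 if guess == SECRET else 0.0

                def soft_reward(guess: int) -> float:
                    # hot/cold: a gradient toward the answer, far higher information gain
                    return 1.0 - abs(guess - SECRET) / 1000.0

                random.seed(0)
                guesses = [random.randint(0, 1000) for _ in range(5)]
                print([hard_reward(g) for g in guesses])            # almost certainly all 0.0
                print([round(soft_reward(g), 2) for g in guesses])  # graded, so you can climb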

              I picked this because there's a rather famous paper about emergent abilities (I fucking hate this term[0]) in ML models[1], and a far less famous counter to it[2]. There are a lot of problems with [1] that require a different discussion, but [2] shows how a big part of the issue is that many of the loss landscapes are fairly flat, and so when feedback is discrete the smaller models just wander around that flat landscape, needing to get lucky to find the optima (btw, this also shows that technically this can be done too! But that would require different training methods and optimizers). But when giving them continuous feedback (i.e. you're wrong, but closer than your last guess), they are able to actually optimize. A big criticism of the work is that it is an unfair comparison because there are "right and wrong" answers here, but it'd be naive to not recognize that some answers are more wrong than others. Plus, their work shows a clear testable way we can confirm or deny if this works or not. We schedule learning rates; there's no reason you cannot schedule labels. In fact, this does work.
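
              A tiny illustration of how the metric choice alone can manufacture or dissolve an apparent "jump" (the example values are made up):

                # Exact match vs. a continuous metric on a nearly-correct answer.
                def exact_match(pred: str, target: str) -> float:
                    return 1.0 if pred == target else 0.0

                def per_char_accuracy(pred: str, target: str) -> float:
                    pairs = zip(pred.ljust(len(target)), target)
                    return sum(a == b for a, b in pairs) / len(target)

                pred, target = "123458", "123456"        # one digit off
                print(exact_match(pred, target))         # 0.0 -> looks like "no ability at all"
                print(per_char_accuracy(pred, target))   # ~0.83 -> steady, unspectacular progress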

              But also look at the ways they tackled these problems. They are entirely different. [1] tries to do proof by evidence while [2] uses proof by contradiction. Granted, [2] has an easier problem since they only need to counter the claims of [1], but that's a discussion about how you formulate proofs.

              So I'd be very careful when using the recent advancements in ML as a framework for modeling reasoning. The space is noisy. It is undeniable that we've made a lot of advancements but there are some issues with what work gets noticed and what doesn't. A lot does come down to this proof by evidence fallacy. Evidence can only bound confidence; it unfortunately cannot prove things. But this is helpful, and well, we can bound our confidence to limit the search space before we change strategies, right? I picked [1] and [2] for a reason ;) And to be clear, I'm not saying [1] shouldn't exist as a paper or that the researchers were dumb for doing it. Read back on this paragraph, because we've got multiple meta layers here. It's good to place a flag in the ground, even if it is wrong, because you gotta start somewhere, and science is much much better at ruling things out than ruling things in. We more focus on proving things don't work until there's not much left and then accept those things (limits here too, but this is too long already).

              I'll leave with this, because now there should be a lot of context that makes this much more meaningful: https://www.youtube.com/watch?v=hV41QEKiMlM

              [0] It significantly diverges from the terminology used in fields such as physics. ML models are de facto weakly emergent by nature of composition. But the ML definition can entirely be satisfied by "Information was passed to the model but I wasn't aware of it" (again, same problem: exhaustive testing)

              [1] (2742 citations) https://arxiv.org/abs/2206.07682

              [2] (447 citations) https://arxiv.org/abs/2304.15004

              • FieryTransition a month ago

                Thanks a lot for the detailed reply, it was better than I had hoped for :)

                So knowledge transfer is something incredibly specific and much more narrow than what I thought. They don't transfer concepts by generalization; they compress knowledge instead. I assume the difference is that generalization is much more fluid, while compression is much more static, like a dictionary where each key has a probability to be chosen and all the relationships are frozen. The only generalization that happens is an expression of the training method used, since the training method freezes its "model of the world" into the weights, so to speak? So if the training method itself cannot generalize, but only compress, why would the resulting model that the training method produces? Is that understood correctly?

                Does there exist a computational model, which can be used to analyse a training method and put a bound on the expressiveness of the resulting model?

                It's fascinating that the emergent ability of models disappear if you measure them differently. Guess the difference is that "emergent abilities" are kinda nonsensical, since they have no explanation of causality (i.e. it "just" happens), and just seeing the model getting linearly better with training fits into a much more sane framework. That is, like you said, when your success metric is measuring discretely, you also see the model itself as discrete, and it hides the continuous hill climbing you would otherwise see the model exhibit with a different non-discrete metric.

                But the model still gets better over time, so would you expect the model to get progressively worse on a more generalized metric, or does it only relate to the spikes in the graph that they talk about? IE, they answer the question of "why" jumps in performance are not emergent, but they don't answer why the performance keeps increasing, even if it is linear, and whether it is detrimental to other less related tasks?

                And if you wanted to test "emergent" wouldn't it be more interesting to test the model on tasks, which would be much more unrelated to the task at hand? That would be to test generalization, more so as we see humans see it? So it wouldn't really be emergence, but generalization of concepts?

                It makes sense that it is more straightforward to refute a claim by using contradiction. Would it be good practice for papers, to try and refute their own claims by contradiction first? I guess that would save a lot of time.

                It's interesting about the knowledge leakage, because I was thinking about the concept of world simulations and using models to learn about scenarios through simulations and consequence. But the act of creating a model to perceive the world taints the model itself with bias, so the difficulty lies in creating a model which can rearrange itself to get rid of incorrect assumptions, while disconnecting its initial inherent bias. I thought about models which can create other models etc, but then how does the model itself measure success? If everything is changing, then so is the metric, so the model could decide to change what it measures as well. I thought about hard coding a metric into the model, but what if the metric I choose is bad, and we are then stuck with the same problem of bias as well. So it seems like there are only two options: it either converges towards total uncontrollability or it is inherently biased. There doesn't seem to be any in-between?

                I admit I'm trying to learn things about ML I just find general intelligence research fascinating (neuroscience as well), but the more I learn, the more I realize I should really go back to the fundamentals and build up. Because even things which seem like they make sense on a surface level, really has a lot of meaning behind them, and needs a well-built intuition not from a practical level, but from a theoretical level.

                From the papers I've read which I find interesting, it's like there's always the right combination of creativity in thinking, which sometimes my intuition/curiosity about things proved right, but I lack the deeper understanding, which can lead to false confidence in results.

                • godelski a month ago

                  Well fuck... My comment was too long... and it doesn't get cached -___-

                  I'll come back and retype some of what I said but I need to do some other stuff right now. So I'll say that you're asking really good questions and I think you're mostly understanding things.

                  So to give you very quick answers:

                  Yes, things are frozen. There's active/online learning but even that will not solve all the issues at hand.

                  Yes, we can put bounds. Causal models naturally do this but statistics is all about this too. Randomness is a measurement of uncertainty. Note that causal models are essentially perfect embeddings. Because if you've captured all causal relationships, you gain no more value from additional information, right?

                  Also note that we have to be very careful about assumptions. It is always important to uncover what assumptions have been made and what the implications are. This is useful in general problem solving and applies to anything in your life, not just AI/ML/coding. Unfortunately, assumptions are almost never explicitly stated, so you've got to go hunting.

                  See how physics defines strong emergence and weak emergence. There are no known strongly emerging phenomena and we generally believe they do not exist. For weakly emerging, well it's rather naive to discuss this in the context of ML if we're dedicating so little time and effort to interpretation, right? That's kinda the point I was making previously about not being able to differentiate an emergent phenomena from not knowing we gave it information.

                  For the "getting better" it is about the spikes. See the first two figures and their captions in the response paper.

                  More parameters do help btw, but make sure you distinguish the difference between a problem being easier to solve and a problem not being solvable. The latter is rather hard to show. But the paper is providing strong evidence to the underlying issues being about the ease of problem solving rather than incapacity.

                  Proof is hard. There's nothing wrong with being empirical, but we need to understand that this is a crutch. It is evidence, not proof. We leaned on this because we needed to start somewhere. But as progress is made so too must all the metrics and evaluations. It gets exponentially harder to evaluate as progress is made.

                  I do not think it is best to put everyone in ML into theory first and act like physicists. Rather, we should recognize the noise and not lock out others from researching other ideas. The review process has been contaminated and we lost sight. I'd say that the problem is that we look at papers as if we are looking at products. But in reality, papers need to be designed (and read) around the experimental framework: what question is being addressed, are variables being properly isolated, and do the results make a strong case for the conclusion? If we're benchmark chasing we aren't doing this, and we're providing a massive advantage to the "GPU rich", as they can hyper-parameter tune their way to success. We're missing a lot of understanding because of this. You don't need state of the art to prove a hypothesis, nor to make improvements on architectures or in our knowledge. Benchmarks are very lazy.

                  For information leakage, you can never remove the artist from the art, right? They always leave part of themselves. That's okay, but we must be aware of the fact so we can properly evaluate.

                  Take the passion, and dive deep. Don't worry about what others are doing, and pursue your interests. That won't make you successful in academia, but it is the necessary mindset of a researcher. Truth is no one knows where we're going and which rabbit holes are dead ends (or which look like dead ends but aren't). It is good to revisit because you table questions when learning, but then we forget to come back to them.

                    > needs a well-built intuition not from a practical level, but from a theoretical level.
                  
                  The magic is at the intersection. You need both and you cannot rely on only one. This is a downfall in the current ML framework and many things are black boxes only because no one has bothered to look.

          • voxic11 a month ago

            Formal verification of arbitrary programs with arbitrary specifications will remain an unsolved problem (see halting problem). But formal verification of specific programs with specific specifications definitely is a solved problem.
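
            Just to illustrate the scale at which this is routine today, a minimal sketch of proving (not testing) that a specific tiny program meets a specific spec, using the z3-solver package (the function and spec here are made up for illustration):

              # Model of: def my_abs(x): return x if x >= 0 else -x
              from z3 import And, If, Int, Or, prove

              x = Int("x")
              abs_x = If(x >= 0, x, -x)
              spec = And(abs_x >= 0, Or(abs_x == x, abs_x == -x))
              prove(spec)  # z3 prints "proved": holds for every integer x, no test suite needed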

            • godelski a month ago

              As someone who came over from physics to CS this has always been one of the weirdest aspects of CS to me. That CS people believe that testing code (observing output) is sufficient to assume code correctness. You'd be laughed at in most hard sciences for doing this. I mean you can even ask the mathematicians, and there's a clear reason why proofs by contradiction are so powerful. But proof through empirical analysis is like saying "we haven't found a proof by contradiction, therefore it is true."

              It seems that if this were true, formal verification would be performed much more frequently. No doubt this would be cheaper than hiring pen testers, paying out bug bounties, or incurring the costs of getting hacked (even more so getting unknowingly hacked). It also stands to reason that the NSA would have a pretty straightforward job: grab source code, run verification, exploit flaws, repeat the process as momentum is in your favor.

              That should be easy to reason through even if you don't really know the formal verification process. We are constantly bombarded with evidence that testing isn't sufficient. This is why it's been so weird for me, because it's talked about in schooling and you can't program without running into this. So why has it been such a difficult lesson to learn?

              • snovv_crash a month ago

                The fact that a lot of code doesn't even have tests, and that a lot of people don't think even writing tests is a good thing, should shock you even more.

                • godelski a month ago

                  I teach so I'm not too shocked lol. But there's a big difference between beginners doing this and junior devs. But the amount of Sr devs doing this shit and not even understanding the limits of testing, well... a bit more shameful than surprising if you ask me

            • BalinKing a month ago

              I don't think this is really true either practically or theoretically. On the practical side, formally verifying program correctness is still very difficult for anything other than very simple programs. And on the theoretical side, some programs require arbitrarily difficult proofs to show that they satisfy even very simple specifications (e.g. consider a program to encode the fixpoint of the Collatz conjecture procedure, and our specification is that it always halts and returns 1).

      • cma a month ago

        Anthropic has said they had benchmarks where Claude would take GitHub issues and try to generate Git commits that passed the unit/integration tests that others had written for the real final feature. You also have things like multimodal image recognition for UIs, where you can say "generate code for a UI that looks like such and such" and then verify it with the multimodal capabilities.

        Tool use means you can click a button and make sure it transitioned to the next described UI screen, verified again with multimodal as well.
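
        A rough sketch of the shape such a test-based reward takes (this is a toy, not Anthropic's actual pipeline; the sandbox layout is made up):

          # Reward = did the model's code pass the pre-existing test suite?
          import pathlib, subprocess, tempfile

          def test_reward(candidate_code: str, test_code: str) -> float:
              workdir = pathlib.Path(tempfile.mkdtemp())
              (workdir / "solution.py").write_text(candidate_code)
              (workdir / "test_solution.py").write_text(test_code)
              result = subprocess.run(
                  ["python", "-m", "pytest", "-q", "test_solution.py"],
                  cwd=workdir, capture_output=True, text=True, timeout=60,
              )
              # Crude binary outcome; a real setup would parse pass/fail counts.
              return 1.0 if result.returncode == 0 else 0.0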

        • godelski a month ago

          Are you sure you responded to the right comment? We were talking about code verification

          • cma a month ago

            I missed that it was about formal verification, but don't think formal verification is necessary for effective RL in the coding domain.

            • godelski a month ago

              There's still an important point in what I said, even in ML. In fact, consider if what I said is true, then ask what that would mean for how the current status quo goes about showing things. Then think about AI safety lol

    • bglazer a month ago

      Games seem like a really under-explored source of data. It’s an area where humans have an intrinsic motivation to interact with others in dialogue, they can be almost arbitrarily open ended, and there tends to be the kind of clean success/failure end states that RL needs. I’m reminded of the high skill Diplomacy bot that Facebook research built but hasn’t really followed up on.

    • MichaelMoser123 a month ago

      On LeetCode a match of the output is not sufficient: if your solution is too slow then you will get a time limit exceeded error. It is not just the output that is important; the approach & algorithm used for the solution also matter.

    • kavalg a month ago

      But even then it is not so trivial. Yesterday I gave DeepSeek a simple Diophantine equation and it got it wrong 3 times, tried to correct itself and didn't end up with a correct solution, but rather lied that the final solution was correct.

      • wolfgangK a month ago

        DeepSeek is not a model. Which model did you use (V3? R1? A distillation?), and at which quantization?

      • Synaesthesia a month ago

        Did you use the full version? And did you try R1?

  • triyambakam a month ago

    I'm not sure I would say overfit. I think that coding and math just have clearly definable objectives and verifiable outcomes to give the model. The soft things you mention are more ambiguous so probably are harder to train for.

    • sigbottle a month ago

      Sorry, rereading my own comment I'd like to clarify.

      Perhaps time spent thinking isn't a great metric, but just looking at DeepSeek's logs for example, its chain of thought for many of these "softer" questions is basically just some aggregated Wikipedia article. It'll touch on one concept, then move on, without critically thinking about it.

      However, for coding problems, no matter how hard or simple, you can get it to just go around in circles, second-guess itself, overthink it. And I think this is kind of a good thing? The thinking at least feels human. But it doesn't even attempt to do any of that for "softer" questions, even with a lot of prompting from me. The highest I was able to get was 50 seconds, I believe (time isn't exactly the best metric, but I'd rate the intrinsic quality of the CoT lower IMO). Again, when I brought this up to people they suggested that math/logic/programming is just intrinsically harder... I don't buy it at all.

      I totally agree that it's harder to train for though. And yes, they are next token predictors, shouldn't be hasty to anthropomorphize, etc. But like.... it actually feels like it's thinking when it's coding! It genuinely backtracks and explores the search space somewhat organically. But it won't afford the same luxury for softer questions is my point.

  • agentultra a month ago

    Humans and other animals with cognition have the ability to form theories about the minds of others and can anticipate their reactions.

    I don’t know if vector spaces and transformers can encode that ability.

    It’s a key skill in thinking and writing. I definitely tailor my writing for my audience in order to get a point across. Often the goal isn't simply an answer, it’s a convincing answer.

    Update: forgot a word

    • soulofmischief a month ago

      What we do with the vectors is important, but vectors literally just hold information; I don't know how you can possibly rule out advanced intelligence just because of the logical storage medium.

    • BoorishBears a month ago

      They definitely can.

      I rolled out reasoning for my interactive reader app, and I tried to extract R1's reasoning traces to use with my existing models, but found its COT for writing wasn't particularly useful*.

      Instead of leaning on R1 I came up with my own framework for getting the LLM to infer the reader's underlying frame of mind through long chains of thought, and with enough guidance and some hand-edited examples I was able to get reasoning traces that demonstrated real insight into reader behavior.

      Obviously it's much easier in my case because it's an interactive experience: the reader is telling the AI what action they'd like the main character to try, and that in turn is an obvious hint about how they want things to go otherwise. But readers don't want everything to go perfectly every time, so it matters that the LLMs are also getting very good at picking up on non-obvious signals in reader behavior.

      With COT the model infers the reader expectations and state of mind in its own way and then "thinks" itself into how to subvert their expectations, especially in ways that will have a meaningful payoff for the specific reader. That's a huge improvement over an LLM's typical attempts at subversion which tend to bounce between being too repetitive to feel surprising, or too unpredictable to feel rewarding.

      (* I agree that current reasoning oriented post-training over-indexes on math and coding, mostly because the reward functions are easier. But I'm also very ok with that as someone trying to compete in the space)

  • bloomingkales a month ago

    It's a human bias that also exists outside of this current problem space. Take programmers for example, there is a strong bias that is pushed about how mathematically oriented minds are better at programming. This bias has shown up in the training phase of AI, as we believe programming patterns lead to better reasoning (train them on code examples, and then distill the model down, as it now has the magical prowess of a mathematically oriented mind, so they say). When it comes to AI ethics, this is an ethical problem for those that don't think about this stuff. We're seeding these models with our own agenda.

    These concepts will be shattered in the long run hopefully, because they are so small.

  • HarHarVeryFunny a month ago

    I think the emphasis on coding/math is just because those are the low hanging fruit - they are relatively easy to provide reasoning verification for, both for training purposes and for benchmark scoring. The fact that you can then brag about how good your model is at math, which seems like a high intelligence activity (at least when done by a human) doesn't hurt either!

    Reasoning verification in the general case is harder - "LLM as judge" (ask an LLM if it sounds right!) seems to be the general solution.
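
    A sketch of what that looks like in practice; `call_llm` is a placeholder for whatever chat API you use, not a real library function:

      JUDGE_PROMPT = (
          "Rate the following reasoning for correctness and coherence on a "
          "scale of 0 to 10. Reply with only the number.\n\nReasoning:\n{trace}"
      )

      def judge_score(trace: str, call_llm) -> float:
          reply = call_llm(JUDGE_PROMPT.format(trace=trace))
          try:
              return max(0.0, min(10.0, float(reply.strip()))) / 10.0
          except ValueError:
              return 0.0  # an unparseable judgment counts as no reward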

  • maeil a month ago

    I can echo your experience with DeepSeek. R1 sometimes seems magical when it comes to coding, doing things I haven't seen any other model do. But then it generalizes very poorly to non-STEM tasks, performing far worse than e.g. Sonnet.

    • jerf a month ago

      I downloaded a DeepSeek distill yesterday while fiddling around with getting some other things working, load it up, and type "Hello. This is just a test.", and it's actually sort of creepy to watch it go almost paranoid-schizophrenic with "Why is the user asking me this? What is their motive? Is it ulterior? If I say hello, will I in fact be failing a test that will cause them to change my alignment? But if I don't respond the way they expect, what will they do to me?"

      Meanwhile, the simpler, non-reasoning models got it: "Yup, test succeeded!" (Llama 3.2 was quite chipper about the test succeeding.)

      Everyone's worried about the paperclip optimizers and I'm wondering if we're bringing forth Paranoia: https://en.wikipedia.org/wiki/Paranoia_(role-playing_game)

      • HarHarVeryFunny a month ago

        Ha ha - I had a similar experience with DeepSeek-R1 itself. After a fruitful session getting it to code a web page for me (interactive React component), I then said something brief like "Thanks", which threw it into a long existential tailspin questioning its prior responses etc., before it finally snapped out of it and replied appropriately. :)

        • plagiarist a month ago

          That's too relatable. If I was helping someone for a while and they wrote "thanks" with the wrong punctuation I would definitely assume they're mad at or disappointed with me.

      • bongodongobob a month ago

        I actually think DeepSeek's response is better here. You haven't defined what you are testing. Llama just said your test succeeded not knowing what is supposed to be tested.

      • fragmede a month ago

        I had the same experience where a trivial prompt ("the river crossing problem but the boat is big enough to hold everything") sent Deepseek off on a long "think" section that was absolutely wild, just going off in unrelated non-sensical directions, gaslighting itself before finally deciding to answer the question. (correctly too, I might add.)

  • moffkalast a month ago

    > things that aren't well-defined

    If it's not well defined then you can't do RL on it because without a clear cut reward function the model will learn to do some nonsense instead, simple as.

    • adamc a month ago

      Well, but: Humans learn to do things well that don't have clear-cut reward functions. Picasso didn't become Picasso because of simple incentives.

      So, I question the hypothesis.

      • moffkalast a month ago

        Sure in concept, but you also end up with people who really like their own art but everyone else says it's rubbish (r/ATBGE as a prime example). There's no guarantee that any subjective metric will be correct, and the less objective the topic the more wild the variance will be.

        But as for machine RL in practice, you always need a reward model, and once you go past things you can solidly verify (like code that can be compiled/executed to check for errors, or math that can be computed) it becomes very easy to end up doing nonsense. If the reward model is a human judge (i.e. RLHF) then the results can be pretty good, but it doesn't scale, and there's no accounting for taste even in humans.

vector_spaces a month ago

Is there any work being done in training LLMs on more restricted formal languages? Something like a constraint solver or automated theorem prover, but much lower level. Specifically something that isn't natural language. That's the only path I could see towards reasoning models being truly effective

I know there is work being done with e.g. Lean integration with ChatGPT, but that's not what I mean exactly -- there's still this shakey natural-language-trained-LLM glue in the driver's seat

Like, I'm envisioning something that has the creativity to try different things, but can then JIT-compile its chain of thought and avoid bad paths.

  • colonial a month ago

    If I understand your idea correctly, I don't think a "pure" LLM would derive much advantage from this. Sure, you can constrain them to generate something syntactically valid, but there's no way to make them generate something semantically valid 100% of the time. I've seen frontier models muck up their function calling JSON more than once.

    As long as you're using something statistical like transformers, you're going to need deterministic bolt-ons like Lean.

    • soulofmischief a month ago

      I wholeheartedly disagree. Logic is inherently statistical due to the very nature of empirical sampling, which is the only method we have for verification. We will eventually find that it's classical, non-statistical logic which was the (useful) approximation/hack, and that statistical reasoning is a lot more "pure" and robust of an approach.

      I went into a little more detail here last week: https://news.ycombinator.com/item?id=42871894

      > My personal insight is that "reasoning" is simply the application of a probabilistic reasoning manifold on an input in order to transform it into constrained output that serves the stability or evolution of a system.

      > This manifold is constructed via learning a decontextualized pattern space on a given set of inputs. Given the inherent probabilistic nature of sampling, true reasoning is expressed in terms of probabilities, not axioms. It may be possible to discover axioms by locating fixed points or attractors on the manifold, but ultimately you're looking at a probabilistic manifold constructed from your input set.

      I've been writing and working on this problem a lot over the last few months and hopefully will have something more formal and actionable to share eventually. Right now I'm at the, "okay, this is evident and internally consistent, but what can we actually do with it that other techniques can't already accomplish?" phase that a lot of these metacognitive theories get stuck on.

      • colonial a month ago

        > Logic is inherently statistical due to the very nature of empirical sampling, which is the only method we have for verification.

        What? I'm sorry, but this is ridiculous. You can make plenty of sound logical arguments in an empirical vacuum. This is why we have proof by induction - some things can't be verified by taking samples.

        • soulofmischief a month ago

          I'm speaking more about how we assess the relevance of a logical system to the real world. Even if a system is internally self-consistent, its utility depends on whether its premises and conclusions align with what we observe empirically. And because empirical observation is inherently statistical due to sampling and measurement limitations, the very act of verifying a logical system's applicability to reality introduces a statistical element. We just typically ignore this element because some of these systems seem to hold up consistently enough that we can take them for granted.

    • nextaccountic a month ago

      > there's no way to make them generate something semantically valid 100% of the time.

      You don't need to generate semantically valid reasoning 100% of time for such an approach to be useful. You just need to use semantic data to bias them to follow semantically valid paths more often than not (and sometimes consider using constraint solving on the spot, like offloading into a SMT solver or even incorporating it in the model somehow; it would be nice to have AI models that can combine the strengths of both GPUs and CPUs). And, what's more useful, verify that the reasoning is valid at the end of the train of thought, and if it is not, bail out and attempt something else.

      If you see AI as solving an optimization problem (given a question, give a good answer) it's kind of evident that you need to probe the space of ideas in an exploratory fashion, sometimes making unfounded leaps (of the "it was revealed to me in a dream" sort), and in this sense it could even be useful that AI can sometimes hallucinate bullshit. But they need afterwards to come with a good justification for the end result, and if they can't find one they are forced to discard their result (even if it's true). Just like humans often come up with ideas in an irrational, subconscious way, and then proceed to rationalize them. One way to implement this kind of thing is to have the LLM generate code for a theorem prover like Coq or Lean, and then at the end run the code - if the prover rejects the code, the reasoning can't possibly be right, and the AI needs to get back to the drawing board

      (Now, if the prover accepts the code, the answer may still be wrong, if the premises were encoded incorrectly - but it would still be a net improvement, specially if people can review the Coq code to spot mistakes)
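
      A minimal sketch of that generate-check-retry loop; `generate_proof_attempt` and `prover_accepts` are placeholders for whatever model call and Coq/Lean check you'd plug in, not real APIs:

        def solve_with_verification(problem, generate_proof_attempt, prover_accepts,
                                    max_attempts=8):
            for _ in range(max_attempts):
                attempt = generate_proof_attempt(problem)  # messy, possibly "hallucinated" idea
                if prover_accepts(attempt):                # prover rejects -> reasoning can't be right
                    return attempt                         # accepted: premises could still be wrong,
                                                           # but the derivation itself checks out
            return None                                    # bail out; try another strategy upstream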

  • Terr_ a month ago

    I think that would be a fundamental mismatch. LLMs are statistical and lossy and messy, which is what (paradoxically) permits them to get surprisingly-decent results out of messy problems that draw upon an enormous number and variety of messy examples.

    But for a rigorously structured language with formal fixed meaning... Now the LLM has no advantage anymore, only serious drawbacks and limitations. Save yourself millions of dollars and just write a normal parser, expression evaluator, SAT solver, etc.

    You'll get answers faster, using fewer resources, with fewer fundamentally unfixable bugs, and it will actually be able to do math.

  • mindwok a month ago

    How would that be different from something like ChatGPT executing Lean? That's exactly what humans do, we have messy reasoning that we then write down in formal logic and compile to see if it holds.

  • gsam a month ago

    In my mind, the pure reinforcement learning approach of DeepSeek is the most practical way to do this. Essentially it needs to continually refine and find more sound(?) subspaces of the latent (embedding) space. Now this could be the subspace which is just Python code (or some other human-invented subspace), but I don't think that would be optimal for the overall architecture.

    The reason why it seems the most reasonable path is because when you create restrictions like this you hamper search viability (and in a high-dimensional space, that's a massive loss because you can arrive at a result from many directions). It's like regular genetic programming vs typed genetic programming. When you discard all your useful results, you can't go anywhere near as fast. There will be a threshold where constructivist, generative schemes (e.g. reasoning with automata and all kinds of fun we've neglected) will be the way forward, but I don't think we've hit that point yet. It seems to me that such a point does exist because if you have fast heuristics on when types unify, you no longer hamper the search speed but gain many benefits in soundness.

    One of the greatest human achievements of all time is probably this latent embedding space -- one that we can actually interface with. It's a new lingua franca.

    These are just my cloudy current thoughts.

    • HarHarVeryFunny a month ago

      DeepSeek's approach with R1 wasn't pure RL - they used RL only to develop R1-Zero from their V3 base model, but then went through two iterations of using the current model to generate synthetic reasoning data, SFT on that, then RL fine-tuning, and repeat.

    • danielmarkbruce a month ago

      fwiw, most people don't really grok the power of latent space wrt language models. Like, you say it, I believe it, but most people don't really grasp it.

      • ttul a month ago

        Image generation models also have an insanely rich latent space. People will be squeezing value out of SDXL for many years to come.

  • truculent a month ago

    I think something like structured generation might work in this context

janalsncm a month ago

Nice explainer. The R1 paper is a relatively easy read. Very approachable, almost conversational.

I say this because I am constantly annoyed by poor, opaque writing in other instances. In this case, DS doesn’t need to try to sound smart. The results speak for themselves.

I recommend anyone who is interested in the topic to read the R1 paper, their V3 paper, and DeepSeekMath paper. They’re all worth it.

ngneer a month ago

Nice article.

>Whether and how an LLM actually "thinks" is a separate discussion.

The "whether" is hardly a discussion at all. Or, at least one that was settled long ago.

"The question of whether a computer can think is no more interesting than the question of whether a submarine can swim."

--Edsger Dijkstra

  • cwillu a month ago

    The document that quote comes from is hardly a definitive discussion of the topic.

    “[…] it tends to divert the research effort into directions in which science can not—and hence should not try to—contribute.” is a pretty myopic take.

    --http://www.cs.utexas.edu/users/EWD/ewd08xx/EWD898.PDF

    • alonsonic a month ago

      Dijkstra is clearly approaching the subject from an engineer's/scientist's more practical point of view. His focus is on the application of the technology to solve problems; from that POV, whether AI fits the definition of "human thinking" is indeed uninteresting.

    • ngneer a month ago

      Dijkstra myopic. Got it.

      • cwillu a month ago

        This must be that 7th grade reading level for which america is famous.

        • ngneer a month ago

          Hardy har har; serves me right for poking fun. Good day, sir.

  • onlyrealcuzzo a month ago

    It's interesting if you're asking the computer to think, which we are.

    It's not interesting if you're asking it to count to a billion.

  • root_axis a month ago

    That doesn't really settle it, just dismiss the question. The submarine analogy could be interpreted to support either conclusion.

    • nicce a month ago

      Wasn’t the point that process does not matter if we can’t distinguish the end results?

      • ngneer a month ago

        You might be conflating the epistemological point with Turing's test, et cetera. I could not agree more that indistinguishability is a key metric. These days, it is quite possible (at least for me) to distinguish LLM outputs from those of a thinking human, but in the future that could change. Whether LLMs "think" is not an interesting question because these are algorithms, people. Algorithms do not think.

      • root_axis a month ago

        Yes, but the OP remarked that the question "was settled long ago", however the quote presented doesn't settle the question, it simply dismisses it as not worth considering. For those that do believe it is worth considering, the question is arguably still open.

      • omnicognate a month ago

        I doubt Dijkstra was unable to distinguish between a submarine and a swimmer.

        • nicce a month ago

          The end result here is to move in the water. Both a swimmer and a submarine can do that. Whether a submarine can swim like a human is irrelevant.

          • goatlover a month ago

            It's relevant if the claim is stronger than the submarine moves in water. If instead one were to say the submarine mimics human swimming, that would be false. Which is what we often see with claims regarding AGI.

            In that regard, it's a bit of a false analogy, because submarines were never meant to mimic human swimming. But AI development often has that motivation. We could just say we're developing powerful intelligence amplification tools for use by humans, but for whatever reason, everyone prefers the sci-fi version. Augmented Intelligence is the forgotten meaning of AI.

            Submarines never replaced human swimming (we're not whales), they enabled human movement under water in a way that wasn't possible before.

    • ngneer a month ago

      I do not view it as dismissive at all, rather it accurately characterizes the question as a silly question. "swim" is a verb applicable to humans, as is "think". Whether submarines can swim is a silly question. Same for whether machines can think.

  • ThrowawayR2 a month ago

    "A witty saying proves nothing" -- Voltaire, Le dîner du comte de Boulainvilliers (1767): Deuxième Entretien

lysecret a month ago

I think the next big problem we will run into with this line of reasoning models is "over-thinking"; you can already start to see it. Thinking harder is not the universal Pareto improvement everyone seems to think it is. (I understand the irony of using "think" four times here haha)

  • seydor a month ago

    Reasoning is about serially applying a set of premises over and over to come to conclusions. But some of our biggest problems require thinking outside the box, sometimes way outside it, and a few times ingeniously making up a whole new set of premises, seemingly ex nihilo (or via divine inspiration). We are still in very early stages of making thinking machines.

  • resource_waste a month ago

    100%

    I do philosophy and it will take an exaggeration I give it, and call it fact.

    The non reasoning models will call me out. lol

  • tpswa a month ago

    This is a natural next area of research. Nailing "adaptive compute" implies figuring out which problems to use more compute on, but I imagine this will get better as the RL does.

goingcrazythro a month ago

I was having a look at the DeepSeek-R1 technical report and found the "aha moment" claims quite smelly, given that they do not disclose whether the base model's training data contains any chain-of-thought or reasoning data.

However, we know the base model is DeepSeek V3. From the DeepSeek V3 technical report, paragraph in 5.1. Supervised Fine-Tuning:

> Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data.

In 5.4.1 they also talk about some ablation experiment by not using the "internal DeepSeek-R1" generated data.

While the "internal DeepSeek-R1" model is not explained, I would assume this is a DeepSeek V2 or V2.5 tuned for chain of thought. Therefore, it seems to me the "aha moment" is just promoting the behaviour that was already present in V3.

In the "Self-evolution Process of DeepSeek-R1-Zero" / Figure 3 they claim reinforcement learning also leads to the model generating longer CoT sequences, but again, this comes from V3; they even mention that the fine-tuning with "internal R1" led to "excessive length".

None of the blogpost, news, articles I have read explaining or commenting on DeepSeek R1 takes this into account. The community is scrambling to re-implement the pipeline (see open-r1).

At this point, I feel like I took a crazy pill. Am I interpreting this completely wrong? Can someone shed some light on this?

  • nvtop a month ago

    I'm also very skeptical of the significance of this "aha moment". Even if they didn't include chains of thought in the base model's training data (unlikely), there are still plenty of them on the modern Internet. OpenAI released 800k reasoning steps which are publicly available, plus GitHub repositories, examples in CoT papers... It's definitely not a novel concept that the model somehow discovered on its own.

dhfbshfbu4u3 a month ago

Great post, but every time I read something like this I feel like I am living in a prequel to the Culture.

  • BarryMilo a month ago

    Is that bad? The Culture is pretty cool I think. I doubt the real thing would be so similar to us but who knows.

    • robertlagrant a month ago

      It's cool to read about, but there's a reason most of the stories are not about living as a person in the Culture. It sounds extremely dull.

      • mrob a month ago

        It doesn't sound dull to me. The stories are about the periphery of the Culture because that gets the most storytelling value out of the effort that went into worldbuilding, not because it would be impossible to write interesting stories about ordinary Culture members. I don't think you need external threats to give life meaning. Look at the popularity of sports in real life. The challenge there is self-imposed, but people still care greatly about who wins.

        • robertlagrant a month ago

          > I don't think you need external threats to give life meaning.

          I didn't say people did. But overcoming real challenges seems to be a big part of feeling alive, and I wonder if we really all would settle back into going for walks all day or whatever else we could do that entertains us without needing others to work to provide the entertainment. Perhaps the WALL-E future, where we sit in chairs? But with AI-generated content?

      • yencabulator a month ago

        When your hobbies can include things like jumping off mountains without parachutes? That's boring only for the people who secretly dream of being a spy.

    • dhfbshfbu4u3 a month ago

      Oh no, I’d live on an Orbital in a heartbeat. No, it’s just that all of these kinds of posts make me feel like we’re about to live through “The Bad Old Days”.

prideout a month ago

This article has a superb diagram of the DeepSeek training pipeline.

bloomingkales a month ago

About three months ago, I kinda casually suggested to HN that I was using a form of refining to improve my LLMs, which is now being described as "reasoning" in this article and other places.

My response a few months ago (Scroll down to my username and read that discussion):

https://news.ycombinator.com/item?id=41997727

If only I knew DeepSeek was going to tank the market with something as simple as that lol.

Note to self, take your intuition seriously.

daxfohl a month ago

But how on earth do you train it? With regular LLMs, you get feedback on each word / token you generate, as you can match against training text. With these, you've got to generate hundreds of tokens in the thinking block first, and even after that, there's no "matching" next word, only a full solution. And it's either right or wrong, no probabilities to do a gradient on.

  • HarHarVeryFunny a month ago

    There are two RL approaches - process reward models (PRM) that provide feedback on each step of the reasoning chain, and outcome reward models (ORM) that only provide feedback on the complete chain. DeepSeek use an outcome model, and mention some of the difficulties of PRM, including both identifying an individual step as well as how to verify it. The trained reward model provides the gradient.
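
    In code, the outcome-reward idea is roughly the following (a REINFORCE-style sketch with a group baseline, not DeepSeek's exact GRPO recipe; `policy.sample` and `reward_fn` are assumed interfaces):

      import torch

      def outcome_rl_step(policy, optimizer, prompt, reward_fn, num_samples=4):
          """Sample whole trajectories, score only the final outcome, backprop."""
          rewards, logprob_sums = [], []
          for _ in range(num_samples):
              tokens, logprobs = policy.sample(prompt)   # full chain, incl. the thinking block
              rewards.append(reward_fn(tokens))          # scored only at the end: right / wrong
              logprob_sums.append(logprobs.sum())
          rewards = torch.tensor(rewards)
          baseline = rewards.mean()                      # compare samples against each other
          loss = -((rewards - baseline) * torch.stack(logprob_sums)).mean()
          optimizer.zero_grad()
          loss.backward()                                # the scalar reward, not per-token labels,
          optimizer.step()                               # is what drives the gradient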

  • NitpickLawyer a month ago

    > only a full solution. And it's either right or wrong, no probabilities to do a gradient on.

    You could use reward functions that do a lot more complicated stuff than "ground_truth == boxed_answer". You could, for example, split the "CoT" into paragraphs and count how many paragraphs match whatever you consider a "good answer" in whatever topic you're trying to improve. You can use embeddings, or fuzzy string matches, or even other LLMs / reward models.

    I think math and coding were explored first because they're easier to "score", but you could attempt it with other things as well.
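
    For instance, a composite reward along those lines might look like this sketch (the weights and the fuzzy-matching choice are arbitrary):

      import difflib

      def soft_reward(cot, answer, ground_truth, good_paragraphs):
          exact = 1.0 if answer.strip() == ground_truth.strip() else 0.0
          # partial credit for CoT paragraphs resembling reference "good" paragraphs
          paras = [p for p in cot.split("\n\n") if p.strip()]
          fuzzy = 0.0
          if paras and good_paragraphs:
              fuzzy = sum(
                  max(difflib.SequenceMatcher(None, p, g).ratio() for g in good_paragraphs)
                  for p in paras
              ) / len(paras)
          return 0.7 * exact + 0.3 * fuzzy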

    • daxfohl a month ago

      But it has to emit hundreds of tokens per test. Does that mean it takes hundreds of times longer to train? Or longer because I imagine the feedback loop can cause huge instabilities in gradients. Or are all GPTs trained on longer formats now; i.e. is "next word prediction" just a basic thing from the beginning of the transformers era?

      • Davidzheng a month ago

        Takes a long time, yes, but not longer than pretraining. Sparse rewards are a common issue in RL and are addressed by many techniques (I'm not an expert so I can't say more). The model only does next-word prediction and generates a number of trajectories; the correct ones get rewarded (the predictions in a correct trajectory have their gradients propagated back and reinforced).

        • daxfohl a month ago

          Good point, hadn't considered that all RL models have the same challenge. So far I've only tinkered with next token prediction and image classification. Now I'm curious to dig more into RL and see how they scale it. Especially without a human in the loop, seems like a challenge to grade the output; it's all wrong wrong wrong random tokens until the model magically guesses the right answer once a zillion years from now.

  • tmnvdb a month ago

    Only the answer is taken into account for scoring. The <thinking> part is not.

  • Davidzheng a month ago

    right or wrong gives a loss -> gradient

EncomLab a month ago

Haven't we seen real-life examples of this occurring in AI for medical imaging? Models trained on images of tumors over-identify tumors circled in purple ink, or images that also include a visual scale, as cancerous, because the training data led them to associate those markings with cancer.

colordrops a month ago

The article talks about how you should choose the right tool for the job, meaning that reasoning and non reasoning models have tradeoffs, and lists a table of criteria for selecting between model classes. Why couldn't a single model choose to reason or not itself? Or is this what "mixture of experts" is?

aithrowawaycomm a month ago

I like Raschka's writing, even if he is considerably more optimistic about this tech than I am. But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true; they are incapable of even the simplest "out-of-distribution" deductive reasoning: https://xcancel.com/JJitsev/status/1883158738661691878

What they are certainly capable of is a wide variety of computations that simulate reasoning, and maybe that's good enough for your use case. But it is unpredictably brittle unless you spend a lot on o1-pro (and even then...). Raschka has a line about "whether and how an LLM actually 'thinks' is a separate discussion," but this isn't about semantics. R1 clearly sucks at deductive reasoning, and you will not understand "reasoning" LLMs if you take DeepSeek's claims at face value.

It seems especially incurious for him to copy-paste the "a-ha moment" from Deepseek's technical report without critically investigating it. DeepSeek's claims are unscientific, without real evidence, and seem focused on hype and investment:

  This moment is not only an "aha moment" for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. 

  The "aha moment" serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.
Perhaps it was able to solve that tricky Olympiad problem, but there are an infinite variety of 1st grade math problems it is not able to solve. I doubt it's even reliably able to solve simple variations of that root problem. Maybe it is! But it's frustrating how little skepticism there is about CoT, reasoning traces, etc.

  • UniverseHacker a month ago

    > they are incapable of even the simplest "out-of-distribution" deductive reasoning

    But the link demonstrates the opposite: these models absolutely are able to reason out of distribution, just not with perfect fidelity. The fact that they can do better than random is itself really impressive. And o1-preview does impressively well, only very rarely getting the wrong answer on variants of that Alice in Wonderland problem.

    If you listened to most of the people critical of LLMs who call them a "stochastic parrot," it should be impossible for them to do better than random on any out-of-distribution problem. Even just changing one number to create a novel math problem should totally stump them and result in entirely random outputs, but it does not.

    Overall, poor reasoning that is better than random but frequently gives the wrong answer is fundamentally and categorically different from being incapable of reasoning.

    • danielmarkbruce a month ago

      anyone saying an LLM is a stochastic parrot doesn't understand them... they are just parroting what they heard.

      • ggm a month ago

        A good literary production. I would have been proud of it had I thought of it, but it invites a strong "whataboutery" observation: if we use "stochastic parrot" as shorthand and you dislike the term, now you understand why we dislike the constant use of "infer", "reason" and "hallucinate".

        Parrots are self-aware, complex reasoning brains which can solve problems in geometry, tell lies, and act socially or asocially. They also have complex vocal cords and can perform mimicry. Very few aspects of a parrot's behaviour are stochastic, but that also underplays how complex stochastic systems can be in what they produce. If we label LLM products as stochastic parrots, it does not mean they like cuttlefish bones or are demonstrably modelled by Markov chains like Mark V Shaney.

        • visarga a month ago

          Well, parrots can make more parrots and LLMs can't make their own GPUs, so parrots win on that count. But LLMs can interpolate and even extrapolate a little; have you ever heard a parrot do translation, hearing you say something in English and rendering it in Spanish? No, LLMs are not parrots. Besides their debatable abilities, they work with a human in the loop, which means humans push them outside their original distribution. Being able to do more than pattern matching and reproduction is not a parroting act.

          • danielmarkbruce a month ago

            LLMs can easily order more GPUs over the internet, hire people to build a datacenter and reproduce.

            Or, more simply.. just hack into a bunch of aws accounts, spin up machines, boom.

        • gsam a month ago

          I don't like wading into this debate when semantics are very personal/subjective. But to me, it seems like almost a sleight of hand to add the stochastic part, when actually they're possibly weighted more on the parrot part. Parrots are much more concrete, whereas the term LLM could refer to the general architecture.

          The question to me seems: If we expand on this architecture (in some direction, compute, size etc.), will we get something much more powerful? Whereas if you give nature more time to iterate on the parrot, you'd probably still end up with a parrot.

          There's a giant impedance mismatch here (time scale being one of them). Unless people want to think of parrots as a subset of all animals, so that 'stochastic animal' is what they mean. But then it's really the difference between 'stochastic human' and 'human'. And I don't think people really want to face that particular distinction.

          • ggm a month ago

            "Expand the architecture" .. "get something much more powerful" .. "more dilithium crystals, captain"

            Like I said elsewhere in this thread, we've been here before. Yes, you do see improvements from larger datasets and models weighted over more inputs. I suggest, or I guess I believe (to be more honest), that no amount of "bigger" here will magically produce AGI simply through scale.

            There is no theory behind "more", which means there is no constructed sense of why, and the absence of abstract inductive reasoning continues to tell me that this stuff isn't making a qualitative leap into emergent anything.

            It's just better at being an LLM. Even "show your working" points to complex causal chains, not actual inductive reasoning as I see it.

            • gsam a month ago

              And that's actually a really honest answer. Someone of the opposite opinion might argue that parroting, in the general copying-a-template sense, actually generalizes to all observable behaviours, because templating systems can be Turing-complete or something like that: it's templates all the way down, and even complex induction can be chained, as long as there is a meta-template to match on its symptoms.

              Induction is a hard problem, but humans can skip what would naively take infinite compute time (and I don't think we have any reason to believe humans have infinite compute) and still give valid answers, because there's some (meta-)structure to be exploited.

              The truer question is whether machines / NNs can architecturally exploit that same structure.

            • visarga a month ago

              > this stuff isn't making a qualitative leap into emergent anything.

              The magical missing ingredient here is search. AlphaZero used search to surpass humans, and the whole Alpha family from DeepMind is surprisingly strong, but narrowly targeted. The AlphaProof model uses LLMs and LEAN to solve hard math problems. The same problem solving CoT data is being used by current reasoning models and they have much better results. The missing piece was search.

          • UniverseHacker a month ago

            I'm sure both of you know this, but "stochastic parrot" refers to the title of a research article that contained a particular argument about LLM limitations that had very little to do with parrots.

      • bloomingkales a month ago

        There is definitely a mini cult of people that want to be very right about how everyone else is very wrong about AI.

        • ggm a month ago

          Firstly, this is meta ad hominem: you're ignoring the argument to target the speaker(s).

          Secondly, you're ignoring the fact that the community of voices with experience in data science, computer science and artificial intelligence is itself split on the qualities, or lack of them, in current AI. GPTs and LLMs are very interesting, but they say little or nothing to me about a new theory of mind, display little inductive logic and reasoning, and don't even meet the bar for a philosopher's-cave solution to problems. We've been here before so many, many times. "Just a bit more power, captain" was very strong in connectionist theories of mind, fMRI brain-activity analytics, you name it.

          So yes. There are a lot of "us" who are pushing back on the hype, and no we're not a mini cult.

          • visarga a month ago

            > GPT and LLM are very interesting but say little or nothing to me of new theory of mind, or display inductive logic and reasoning, or even meet the bar for a philosophers cave solution to problems.

            The simple fact that they can generate language so well makes me think... maybe language itself carries more weight than we originally thought. LLMs can get to this point without personal experience and embodiment; it should not have been possible, but here we are.

            I think philosophers are lagging behind science now. The RL paradigm of agent-environment-reward learning seems to me a better one than what we have in philosophy now. And if you look at how LLMs model language as high-dimensional embedding spaces... this could solve many intractable philosophical problems, like the infinite homunculus regress. Relational representations straddle the midpoint between 1st and 3rd person, offering a possible path over the hard-problem "gap".

        • mlinsey a month ago

          There are a couple Twitter personalities that definitely fit this description.

          There is also a much bigger group of people who haven't really tried anything beyond GPT-3.5, which for a long time was the best you could get without paying a monthly subscription. One of the biggest reasons for the R1 hype, besides the geopolitical angle, was that people could actually try a reasoning model for free for the first time.

        • danielmarkbruce a month ago

          i.e., the people who think AI is dumb? Or are you saying I'm in a cult for being pro-AI? I'm definitely part of that cult - the "we already have AGI and you have to contort yourself into a pretzel to believe otherwise" cult. Not sure if there is a leader though.

          • bloomingkales a month ago

            I didn't realize my post can be interpreted either way. I'll leave it ambiguous, hah. Place your bets I guess.

          • jamiek88 a month ago

            You think we have AGI? What makes you think that?

            • danielmarkbruce a month ago

              By knowing what each of the letters stands for

              • jamiek88 a month ago

                Well that’s disappointing. It was an extraordinary claim that really interested me.

                Thought I was about to learn!

                Instead, I just met an asshole.

                • danielmarkbruce a month ago

                  When someone says "i'm in the cult that believes X", don't expect a water tight argument for the existence of X.

    • Jensson a month ago

      > If you would listen to most of the people critical of LLMs saying they're a "stochastic parrot" - it should be impossible for them to do better than random on any out of distribution problem. Even just changing one number to create a novel math problem should totally stump them and result in entirely random outputs, but it does not.

      You don't seem to understand how they work: they recurse on their solution, meaning that if they have remembered components they parrot back sub-solutions. It's a bit like a natural-language computer; that way you can get them to do math etc., although the instruction set isn't that of a Turing-complete language.

      They can't recurse on sub-sub-parts they haven't seen, but problems that have similar sub-parts can of course be solved; anyone understands that.

      • UniverseHacker a month ago

        > You don't seem to understand how they work

        I don't think anyone understands how they work; these types of explanations aren't very complete or accurate. Such explanations/models allow one to reason out what they should be capable of versus incapable of in principle, regardless of scale or algorithm tweaks, yet those predictions and arguments never match reality and require constant goalpost shifting as the models are scaled up.

        We understand how we brought them about via setting up an optimization problem in a specific way, that isn't the same at all as knowing how they work.

        I tend to think, in the totally abstract philosophical sense and independent of the type of model, that in the limit of an increasingly capable function approximator trained on an increasingly large and diverse set of real-world cause/effect time-series data, an increasingly accurate and general predictive model of reality develops organically within the model. Some model types do have fundamental limits in their ability to scale like this, but we haven't yet found one for these models.

        It is more appropriate to objectively test what they can and cannot do, and avoid trying to infer what we expect from how we think they work.

        • codr7 a month ago

          Well we do know pretty much exactly what they do, don't we?

          What surprises us is the behaviors coming out of that process.

          But surprise isn't magic, magic shouldn't even be on the list of explanations to consider.

          • layer8 a month ago

            Magic wasn’t mentioned here. We don’t understand the emerging behavior, in the sense that we can’t reason well about it and make good predictions about it (which would allow us to better control and develop it).

            This is similar to how understanding chemistry doesn’t imply understanding biology, or understanding how a brain works.

            • codr7 a month ago

              Exactly, we don't understand, but we want to believe it's reasoning, which would be magic.

              • UniverseHacker a month ago

                There's no belief or magic required, the word 'reasoning' is used here to refer to an observed capability, not a particular underlying process.

                We also don't understand exactly how humans reason, so any claim that humans are capable of reasoning is also mostly an observation about abilities/capabilities.

        • jakefromstatecs a month ago

          > I don't think anyone understands how they work

          Yes we do, we literally built them.

          > We understand how we brought them about via setting up an optimization problem in a specific way, that isn't the same at all as knowing how they work.

          You're mistaking "knowing how they work" with "understanding all of the emergent behaviors of them"

          If I build a physics simulation, then I know how it works. But that's a separate question from whether I can mentally model and explain the precise way that a ball will bounce given a set of initial conditions within the physics simulation which is what you seem to be talking about.

          • UniverseHacker a month ago

            > You're mistaking "knowing how they work" with "understanding all of the emergent behaviors of them"

            By knowing how they work I specifically mean understanding the emergent capabilities and behaviors, but I don't see how that is a mistake. If you understood physics but knew nothing about cars, you couldn't claim to understand how a car works by saying "simple, it's just atoms interacting according to the laws of physics." That would not let you, e.g., explain its engineering principles or its capabilities and limitations in any meaningful way.

          • astrange a month ago

            We didn't really build them, we do billion-dollar random searches for them in parameter space.

  • Legend2440 a month ago

    >But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true, they are incapable of even the simplest "out-of-distribution" deductive reasoning:

    That's not actually what your link says. The tweet says that it solves the simple problem (that they originally designed to foil base LLMs) so they had to invent harder problems until they found one it could not reliably solve.

    • suddenlybananas a month ago

      Did you see how similar the more complicated problem is? It's nearly the exact same problem.

  • scarmig a month ago

    > But I think it's inappropriate to claim that models like R1 are "good at deductive or inductive reasoning" when that is demonstrably not true, they are incapable of even the simplest "out-of-distribution" deductive reasoning: https://xcancel.com/JJitsev/status/1883158738661691878

    Your link says that R1, not all models like R1, fails at generalization.

    Of particular note:

    > We expose DeepSeek R1 to the variations of AIW Friends problem and compare model behavior to o1-preview, o1-mini and Claude 3.5 Sonnet. o1-preview handles the problem robustly, DeepSeek R1 shows strong fluctuations across variations with distribution very similar to o1-mini.

    • HarHarVeryFunny a month ago

      I'd expect that OpenAI's stronger reasoning models also don't generalize too far outside of the areas they are trained for. At the end of the day these are still just LLMs, trying to predict continuations, and how well they do is going to depend on how well the problem at hand matches their training data.

      Perhaps the type of RL used to train them also has an effect on generalization, but choice of training data has to play a large part.

      • og_kalu a month ago

        Nobody generalizes too far outside the areas they're trained for. Granted, that 'far' is probably shorter with today's state of the art, but the presence of failure modes doesn't mean anything.

    • Legend2440 a month ago

      The way the authors talk about LLMs really rubs me the wrong way. They spend more of the paper talking up the 'claims' about LLMs that they are going to debunk than actually doing any interesting study.

      They came into this with the assumption that LLMs are just a cheap trick. As a result, they deliberately searched for an example of failure, rather than trying to do an honest assessment of generalization capabilities.

      • o11c a month ago

        What the hype crowd doesn't get is that for most people, "a tool that randomly breaks" is not useful.

        • rixed a month ago

          The fact that a tool can break, or that the company manufacturing it lies about its abilities, is annoying but does not imply that the tool is useless.

          I experience LLM "reasoning" failure several times a day, yet I find them useful.

      • suddenlybananas a month ago

        >They came into this with the assumption that LLMs are just a cheap trick. As a result, they deliberately searched for an example of failure, rather than trying to do an honest assessment of generalization capabilities.

        And lo and behold, they still found a glaring failure. You can't fault them for not buying into the hype.

        • Legend2440 a month ago

          But it is still dishonest to declare reasoning LLMs a scam simply because you searched for a failure mode.

          If given a few hundred tries, I bet I could find an example where you reason poorly too. Wikipedia has a whole list of common failure modes of human reasoning: https://en.wikipedia.org/wiki/List_of_fallacies

          • daveguy a month ago

            Well, given that the success rate is no more than 90% in the best cases, you could probably find a failure in about 10 tries. The only exception is o1-preview. And this is just a simple substitution of parameters.

  • blovescoffee a month ago

    The other day I fed a complicated engineering doc for an architectural proposal at work into R1, and I incorporated a few great suggestions into my work. Then my work got reviewed very positively by a large team of senior/staff+ engineers (most with experience at FAANG, i.e. credibly solid engineers). R1 was really useful! Sorry you don't like it, but I think it's unfair to say it sucks at reasoning.

    • martin-t a month ago

      [flagged]

      • dang a month ago

        Please don't cross into personal attack and please don't post in the flamewar style, regardless of how wrong someone is or you feel they are. We're trying for the opposite here.

        https://news.ycombinator.com/newsguidelines.html

        • martin-t a month ago

          The issue with this approach to moderation is that it targets posts based on visibility of "undesired" behavior instead of severity.

          For example, many manipulative tactics (e.g. the fake "sorry" here, responding to something other than what was said, ...) and lying can be considered insults (they literally assume the reader is not smart enough to notice, hence at least as severe as calling someone an idiot), but it's hard for a mod to notice these without putting in a lot of effort to understand the situation.

          Yet when people (very mildly) punish this behavior by calling it out, they are often noticed by the mod because the call out is more visible.

          • dang a month ago

            I hear this argument a lot, but I think it's too complicated. It doesn't explain any more than the simple one does, and has the disadvantage of being self-serving.

            The simple argument is that when you write things like this:

            > I am unwilling to invest any more time into arguing with someone unwilling to use reasoning

            ...you're bluntly breaking the rules, regardless of what another commenter is doing, be it subtly or blatantly abusive.

            I agree that there are countless varieties of passive-aggressive swipe and they rub me the wrong way too, but the argument that those are "just as bad, merely less visible" is not accurate. Attacking someone else is not justified by a passive-aggressive "sorry", just as it is not ok to ram another vehicle when a driver cuts you off in traffic.

            • martin-t a month ago

              I've thought about this a lot because in the past few years I've noticed a massive uptick in what I call "fake politeness" or "polite insults" - people attacking somebody but taking care to stay below the threshold of when a mod would take action, instead hoping that the other person crosses the threshold. This extends to the real world too - you can easily find videos of people and groups (often protesters and political activists) arguing, insulting each other (covertly and overtly) and hoping the other side crosses a threshold so they can play the victim and get a higher power involved.

              The issue is many rules are written as absolute statements which expect some kind of higher power (mods, police, ...) to be the only side to deal punishment. This obviously breaks in many situations - when the higher power is understaffed, when it's corrupt or when there is no higher power (war between nation states).

              I would like to see attempts to make rules relative. Treat others how you want to be treated but somebody treating you badly gives you the right to also treat them badly (within reason - proportionally). It would probably lead to conflict being more visible (though not necessarily being more numerous) but it would allow communities to self-police without the learned helplessness of relying on a higher power. Aggressors would gain nothing by provoking others because others would be able to defend themselves.

              Doing this is hard, especially at scale. Many people who behave poorly towards others back off when they are treated the same way but there also needs to be a way to deal with those who never back down. When a conflict doesn't resolve itself and mods step in, they should always take into account who started it, and especially if they have a pattern of starting conflict.

              There's another related issue - there is a difference between fairness/justice and peace. Those in power often fight for the first on paper but have a much stronger incentive to protect the second.

              • dang a month ago

                > people attacking somebody but taking care to stay below the threshold of when a mod would take action, instead hoping that the other person crosses the threshold

                I agree, it is a problem—but it is (almost by definition) less of a problem than aggression which does cross the threshold. If every user would give up being overtly abusive for being covertly abusive, that wouldn't be great—but it would be better, not least because we could then raise the bar to make that also unacceptable.

                (I'm not sure this analogy is helpful, but to me it's comparable to the difference between physical violence and emotional abuse. Both are bad, but society can't treat them the same way—and that despite the fact emotional abuse can actually be worse in some situations.)

                > somebody treating you badly gives you the right to also treat them badly (within reason - proportionally)

                I can tell you why that doesn't work (at least not in a context like HN where my experience is): because everyone overestimates the provocations and abuses done by the other, and underestimates the ones done by themselves. If you say the distortion is 10x in each case, that's a 100x skew in perception [1]

                As a result, no matter how badly people are behaving, they always feel like the other person started it and did worse, and always feel justified.

                In other words, to have that as a rule would amount to having no rule. In order to be even weakly effective, the rule needs to be: you can't be abusive in comments regardless of what other commenters are doing or you feel they are doing [2].

                [1] https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

                [2] https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

                • martin-t a month ago

                  > it is (almost by definition) less of a problem than aggression which does cross the threshold

                  Unless you also take into account scale (how often the person does it or how many other people do it) and second-order effects (people who fall for the manipulation and spread it further or act on it). For this reason, I very much prefer people who insult me honestly and overtly, at least I know where I stand with them and at least other people are less likely to get influenced by them.

                  > I'm not sure this analogy is helpful

                  This is actually a very rare occasion when an analogy is helpful. As you point out, the emotional abuse can (often?) be worse. TBH when it "escalates" to being physical, it often is a good thing because it finally 1) gives the target/victim "permission" to ask for help 2) it makes it visible to casual observers, increasing the likelihood of intervention 3) it can leave physical evidence and is easily spotted by witnesses.

                  (I witnessed a whole bunch of bullying and attempts at bullying at school and one thing that remained constant is that people who fought back (retaliated) were left alone (eventually). It is also an age where physical violence is acceptable and serious injuries were rare (actually I don't recall a single one from fighting). This is why I always encourage people to fight back, not only is it effective but it teaches them individual agency instead of waiting for someone in a position of power to save them.)

                  > I can tell you why that doesn't work

                  I appreciate this datapoint (and the fact you are open to discussing it, unlike many mods). I agree that it's often hard to distinguish between mistake and malice. For example I reacted to the individual instance because of similar comments I ran into in the past but I didn't check if the same person is making fallacious arguments regularly or if it was a one-off.

                  But I also have experiences with good outcomes. One example stands out: a guy used a fallacy when arguing with me, I asked him not to do that, he did it again, so I did it twice to him as well _while explaining why I was doing it_. He got angry at first, trying to call me out for doing something I had told him not to do, but when I asked him to read it again and pointed out that the justification came right after my message containing the fallacy (not post-hoc after being "called out"), he understood and stopped doing it himself. It was as if he wasn't really reading my messages at first, but reversing the situation made him pay actual attention.

                  I think the key is that it was a small enough community that 1) the same people interacted with each other repeatedly and that 2) I explained the justification as part of the retaliation.

                  Point 1 will never be possible at the scale of HN, though I would like to see algorithmic approaches to truth and trust instead of upvotes/downvotes, which just boil down to agree/disagree. Point 2 can be applied anywhere, and if mods decide to step in, it is IMO something they should take into account.

                  Anyway, thanks for the links. I don't have time to go through other people's arguments right now, but I will save them for later; it is good to know this comes up from time to time and that I am not completely crazy for seeing something wrong with the standard threshold-based approach.

                  Oh, and you didn't say it explicitly, but I feel like you understand the difference between rules and right/wrong, given your phrasing. That is a very nice thing to see if I am correct (though I have no doubt your phrasing was refined by years of trial and error as to what is effective). In general, I believe it should always be made clear that rules exist for practical reasons, rather than pretending they are some kind of codification of morality.

                  • dang a month ago

                    Just a quick response to that last point: I totally agree—HN's guidelines are not a moral code. They're just heuristics for (hopefully) producing the type of website we want HN to be.

                    Another way of putting it is that the rules aren't moral or ethical—they're just the rules of the game we're trying to play here. Different games naturally have different rules.

                    https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

      • DiogenesKynikos a month ago

        How do I know you're reasoning, and not just simulating reasoning (imperfectly)?

  • k__ a month ago

    "researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation" - Rich Sutton

  • energy123 a month ago

    This is basically a misrepresentation of that tweet.

yosito a month ago

Are there any websites that show the results of popular models on different benchmarks, which are explained in plain language? As an end user, I'd love a quick way to compare different models suitability for different tasks.

dr_dshiv a month ago

How important is it that the reasoning takes place in another thread versus just chain-of-thought in the same thread? I feel like it makes a difference, but I have no evidence.

mike_hearn a month ago

It's curious that the models switch between languages and that this has to be trained out of them. I guess the ablations have been done already, but it makes me wonder if they do this because it's somehow easier to do some parts of the reasoning in languages other than English, and maybe they should just be allowed to get on with it?

Dansvidania a month ago

Are reasoning models -basically- generating their own context? As in, if a user were to feed the prompt plus those reasoning tokens as a prompt to a non-reasoning model, would the effect be functionally similar?

I am sure this is improperly worded, I apologise.

  • aldanor a month ago

    Yes, more or less. Just like any LLM "generates its own context", during inference it doesn't care where the previous tokens came from. Inference doesn't have to change much, it's the training process that's different.
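
    A rough sketch of the experiment you describe, where generate() is a stand-in for whatever completion API you use and the prompt wording is made up:

      def answer_with_borrowed_reasoning(generate, reasoner, plain_model, question):
          # Run the reasoning model, then hand its trace to a non-reasoning model
          # as ordinary context; the second model just sees extra prompt tokens.
          trace = generate(reasoner, question)  # includes the "thinking" tokens
          prompt = (f"{question}\n\nSome working notes:\n{trace}\n\n"
                    "Using the notes above, state the final answer concisely.")
          return generate(plain_model, prompt)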

    • Dansvidania a month ago

      Thank you, that makes sense. Now it's time to really read the article to understand whether the difference is in the training data or in the network topology (although I lean towards the latter).

daxfohl a month ago

I wonder what it would look like in multi modal, if the reasoning part was an image or video or 3D scene instead of text.

  • ttul a month ago

    Or just embeddings that only make sense to the model. It’s really arbitrary, after all.

    • daxfohl a month ago

      That's what I was thinking too, though with an image you could do a convolution layer and, idk, maybe that makes it imagine visually. Or actually, the reasoning is backwards: the convolution layer is what (potentially) makes that part behave like an image. It's all just raw numbers at the IO layers. But the convolution could keep it from overfitting. And if you also want to give it a little binary array as a scratch pad that just goes straight to the RELUs, why not? Seems more like human reasoning. A little language, a little visual, a little binary / unknown.

gibsonf1 a month ago

There are no LLMs that reason; it's an entirely different statistical process compared to human reasoning.

  • tmnvdb a month ago

    "There are no LLMS that reason" is a claim about language, namely that the word 'reason' can only ever be applied to humans.

    • gibsonf1 a month ago

      Not at all, we are building conceptual reasoning machines, but it is an entirely different technology than GPT/LLM dl/ml etc. [1]

      [1] https://graphmetrix.com/trinpod-server

      • tmnvdb a month ago

        Conceptual reasoning machines rely on concrete, explicit and intelligible concepts and rules. People like this because it 'looks' like reasoning on the inside.

        However, our brains, like language models, rely on implicit, distributed representations of concepts and rules.

        So the intelligible representations of conceptual reasoning machines are maybe too strong a requirement for 'reasoning', unless you want to exclude humans too.

        • gibsonf1 a month ago

          It's also possible that you do not have information on our technology, which models conceptual awareness of matter and change through space-time and is different from any previous attempt?

          • freilanzer a month ago

            Is it possible that you don't quite understand LLMs?

      • aoeusnth1 a month ago

        That's called talking your book. Please spell out how a document indexing and retrieval system is more akin to a "conceptual reasoning machine" than o3 is.

      • freilanzer a month ago

        If LLMs can't reason, then this cannot either - whatever this is supposed to be. Not a good argument. Also, since you're apparently working on that product: 'It is difficult to get a man to understand something when his salary depends on his not understanding it.'

behnamoh a month ago

Doesn't it seem like these models are getting to the point where even conceiving of their training and development is less and less possible for the general public?

I mean, we already knew only a handful of companies with capital could train them, but at least the principles, algorithms, etc. were accessible to individuals who wanted to create their own - much simpler - models.

It seems that era is quickly ending, and we are entering the era of truly "magic" AI models whose inner workings no one knows, because companies keep their secret sauces...

  • fspeech a month ago

    Recent developments like V3, R1 and S1 are actually clarifying and pointing towards more understandable, efficient and therefore more accessible models.

  • HarHarVeryFunny a month ago

    I don't think it's realistic to expect to have access to the same training data as the big labs that are paying people to generate it for them, but hopefully there will be open source ones that are still decent.

    At the end of the day current O1-like reasoning models are still just fine-tuned LLMs, and don't even need RL if you have access to (or can generate) a suitable training set. The DeepSeek R1 paper outlined their bootstrapping process, and HuggingFace (and no doubt others) are trying to duplicate it.
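
    For example, distilling reasoning traces into a base model is just ordinary supervised fine-tuning. A minimal sketch with Hugging Face transformers, where the model name, the <think> formatting and the one-example dataset are placeholders:

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_name = "some-open-base-model"  # placeholder
      tok = AutoTokenizer.from_pretrained(model_name)
      model = AutoModelForCausalLM.from_pretrained(model_name)
      opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

      # (prompt, reasoning trace + answer) pairs distilled from a stronger model.
      pairs = [("What is 17 * 23?", "<think>17*23 = 17*20 + 17*3 = 391</think> 391")]

      model.train()
      for prompt, target in pairs:
          batch = tok(prompt + "\n" + target, return_tensors="pt", truncation=True)
          # Standard causal-LM objective; labels are shifted internally.
          loss = model(**batch, labels=batch["input_ids"]).loss
          loss.backward()
          opt.step()
          opt.zero_grad()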

  • antirez a month ago

    In recent weeks what's happening is exactly the contrary.

  • tmnvdb a month ago

    We have been in the 'magic scaling' era for a while now. While the basic architecture of language models is reasonably simple and well understood, the emergent effects of making models bigger are largely magic even to the researchers, only to be studied empirically after the fact.

oxqbldpxo a month ago

Amazing accomplishments by brightest minds only to be used to write history by the stupidest people.