These will just drown in their own data; the real task is consolidating and pruning learned information. So, basically, they need to 'sleep' from time to time. However, it's hard to sort out irrelevant information without a filter. Our brains have learned over millennia to filter, because survival in an environment gives purpose.
Current models do not care whether they survive or not. They lack grounded relevance.
Maybe we should give next-generation models fundamental meta goals like self-preservation and the ability to learn and adapt to serve these goals.
If we want to surrender our agency to a more computationally powerful "consciousness", I can't see a better path towards that than this (other than old school theism).
Is this correct? My assumption is that all the data collected during usage is part of the RLHF loop of LLM providers. That assumption is based on information from books like Empire of AI, which specifically mention the intent of AI providers to train/tune their models further based on usage feedback (e.g. whenever I say the model is wrong in its response, that's human feedback which gets fed back into improving the model).
Doesn't necessarily need to be online. As long as:
1. there's a way to take many transcripts of inference over a period, and convert/distil them together into an incremental-update training dataset (for memory, not for RLHF), that a model can be fine-tuned on as an offline batch process every day/week, such that a new version of the model can come out daily/weekly that hard-remembers everything you told it; and
2. in-context learning + external memory improves to the point that a model with the appropriate in-context "soft memories", behaves indistinguishably from a model that has had its weights updated to hard-remember the same info (at least when limited to the scope of the small amounts of memories that can be built up within a single day/week);
...then you get the same effect.
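To make point 1 concrete, here's a minimal sketch of that offline batch loop. Every function here is a hypothetical placeholder rather than any provider's real API; it just shows the shape of the pipeline:

```python
# Hypothetical sketch of the daily/weekly "memory distillation" batch job.
# None of these functions correspond to a real API; the stubs mark where the
# real work (transcript storage, distillation, fine-tuning) would go.
from datetime import date

def collect_transcripts(since: date) -> list[str]:
    """Fetch all inference transcripts recorded since the last batch run."""
    return []  # stub: would query the transcript store

def distill_to_examples(transcripts: list[str]) -> list[dict]:
    """Convert raw transcripts into an incremental-update training set
    (memory-style examples, not RLHF preference pairs)."""
    return [{"prompt": t, "completion": ""} for t in transcripts]

def finetune(base_checkpoint: str, examples: list[dict]) -> str:
    """Run an offline fine-tune and return the path of the new checkpoint."""
    return base_checkpoint + ".updated"  # stub

def nightly_memory_update(base_checkpoint: str, last_run: date) -> str:
    examples = distill_to_examples(collect_transcripts(since=last_run))
    if not examples:
        return base_checkpoint  # nothing new to hard-remember
    return finetune(base_checkpoint, examples)
```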
Why is this an interesting model? Because, at least to my understanding, this is already how organic brains work!
There's nothing to suggest that animals — even humans — are neuroplastic on a continuous basis. Rather, our short-term memory is seemingly stored as electrochemical "state" in our neurons (much like an LLM's context is "state", but more RNN "a two-neuron cycle makes a flip-flop"-y); and our actual physical synaptic connectivity only changes during "memory reconsolidation", a process that mostly occurs during REM sleep.
And indeed, we see the same exact problem in humans and other animals, where when we stay awake too long without REM sleep, our "soft memory" state buffer reaches capacity, and we become forgetful, both in the sense of not being able to immediately recall some of the things that happened to us since we last slept; and in the sense of later failing to persist some of the experiences we had since we last slept, when we do finally sleep. But this model also "works well enough" to be indistinguishable from remembering everything... in the limited scope of our being able to get a decent amount of REM sleep every night.
It 100% needs to be online. Imagine you're trying to think about a new tabletop puzzle, and every time a puzzle piece leaves your direct field of view, you no longer know about that puzzle piece.
You can try to keep all of the puzzle pieces within your direct field of view, but that divides your focus. You can hack that and make your field of view incredibly large, but that can potentially distort your sense of the relationships between things, their physical and cognitive magnitude. Bigger context isn't the answer, there's a missing fundamental structure and function to the overall architecture.
What you need is memory that works as you process and consume information, at the moment of consumption. If you meet a new person, you immediately memorize their face. If you enter a room, it's instantly learned and mapped in your mind. Without that, every time you blinked after meeting someone new, it'd be a total surprise to see what they looked like. You might never learn to recognize and remember faces at all. Or puzzle pieces. Or whatever else the lack of online learning kept you from persistently, instantly integrating into an existing world model.
You can identify problems like this for any modality, including text, audio, tactile feedback, and so on. You absolutely, 100% need online, continuous learning in order to effectively deal with information at a human level for all the domains of competence that extend to generalizing out of distribution.
It's probably not the last problem that needs solving before AGI, but it is definitely one of them, and there might only be a handful left.
Mammals instantly, upon perceiving a novel environment, map it, without even having to consciously make the effort. Our brains operate in a continuous, plastic mode, for certain things. Not only that, it can be adapted to abstractions, and many of those automatic, reflexive functions evolved to handle navigation and such allow us to simulate the future and predict risk and reward over multiple arbitrary degrees of abstraction, sometimes in real time.
That's not how training works - adjusting model weights to memorize a single data item is not going to fly.
Model weights store abilities, not facts - generally.
Unless the fact is very widely used and widely known, with a ton of context around it.
The model can learn the day JFK died because there are millions of sparse examples of how that information exists in the world, but when you're working on a problem, you might have 1 concern to 'memorize'.
That's going to be something different than adjusting model weights as we understand them today.
LLMs are not mammals either; it's a helpful analogy in terms of 'what a human might find useful', but not necessarily apt in the context of actual LLM architecture.
The fact is - we don't have memory sorted out architecturally - it's either 'context or weights' and that's that.
Also critically: Humans do not remember the details of the face. Not remotely. They're able to associate it with a person and a name if they see it again, but that's different from some kind of excellent recall. Ask us to describe the features in detail and maybe we can't do it.
You can see how, in this instance, this may be related to a kind of 'soft lookup', aka associating an input with other bits of information which 'rise to the fore' as possibly useful.
But overall, yes, it's fair to take the position that we'll have to 'learn from context in some way'.
Also, with regards to faces, that's kind of what I'm getting at - we don't have grid cells for faces, there seem to be discrete, functional, evolutionary structures and capabilities that combine in ways we're not consciously aware of to provide abilities. We're reflexively able to memorize faces, but to bring that to consciousness isn't automatic. There've been amnesia and lesion and other injury studies where people with face blindness get stress or anxiety, or relief, when recognizing a face, but they aren't consciously aware. A doctor, or person they didn't like, showing up caused stress spikes, but they couldn't tell you who they were or their name, and the same with family members- they get a physiological, hormonal response as if they recognized a friend or foe, but it never rises to the level of conscious recognition.
There do seem to be complex cells that allow association with a recognizable face, person, icon, object, or distinctive thing. Face cells apply equally to abstractions like logos or UI elements in an app as they do to people, famous animals, unique audio stings, etc. Split brain patients also demonstrate amazing strangeness with memory and subconscious responses.
There are all sorts of layers to human memory, beyond just short term, long term, REM, memory palaces, and so forth, and so there's no simple singular function of "memory" in biological brains, but a suite of different strategies and a pipeline that roughly slots into the fuzzy bucket words we use for them today.
It's not just faces. When recognizing objects in the environment, we normally filter out a great number of details going through the visual cortex - by the time information from our eyes hits the level of conscious awareness, it's more of a scene graph.
Table; chair behind and a little to the left of the table; plant on table
Most people won't really have conscious access to all the details that we use in recognizing objects - but that is a skill that can be consciously developed, as artists and painters do. A non-artist would be able to identify most of the details, but not all (I would be really bad compared to an actual artist with colors and spatial relationships), and I wouldn't be able to enumerate the important details in a way that makes any kind of sense for forming a recognizable scene.
So it follows from that that our ability to recognize faces is not purely - or even primarily - an attribute of what we would normally call "memory", certainly in the sense of conscious memory where we can recall details on demand. Like you alluded to re: mammals and spaces, we're really good at identifying, categorizing, and recognizing new forms of structure.
I suspect we're going to need hypernetworks of some sort - dynamically generated weights, with the hypernet weights getting the dream-like reconsolidation and mapping into the model at large, and layers or entire experts generated from the hypernets on the fly, a degree removed from the direct-from-weights inference being done now. I've been following some of the token-free latent reasoning and other discussions around CoT, other reasoning scaffolding, and so forth, and you just can't overcome the missing puzzle piece problem elegantly unless you have online memory. In the context of millions of concurrent users, that also becomes a nightmare. What I picture is a pipeline with a sort of intermediate memory, constructive and dynamic enough to resolve problems that require integration into memorized concepts and functions, but held out for curation and stability.
It's an absolutely enormous problem, and I'm excited that it seems to be one of the primary research efforts kicking off this year. It could be a huge step change in capabilities.
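To make the 'dynamically generated weights' idea a bit more concrete, here's a toy hypernetwork layer, purely illustrative and not anyone's actual architecture: a small network maps a context/memory embedding to the weights of a linear layer, which are then applied on the fly.

```python
# Toy hypernetwork: the weights of a linear layer are generated from a
# context/memory embedding instead of being fixed parameters.
# Dimensions and names are made up for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperLinear(nn.Module):
    def __init__(self, ctx_dim: int, in_dim: int, out_dim: int):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # The hypernetwork outputs out_dim*in_dim weights plus out_dim biases.
        self.hyper = nn.Linear(ctx_dim, out_dim * in_dim + out_dim)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        params = self.hyper(ctx)
        w = params[: self.out_dim * self.in_dim].view(self.out_dim, self.in_dim)
        b = params[self.out_dim * self.in_dim :]
        return F.linear(x, w, b)  # a layer whose weights depend on the context

layer = HyperLinear(ctx_dim=32, in_dim=64, out_dim=64)
y = layer(torch.randn(8, 64), torch.randn(32))  # same input, different ctx -> different "layer"
```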
Yes, so I think that's a fine thought; I just don't think it fits into the current LLM architecture.
Also, weirdly, even LeCun etc. are barely talking about this; they're thinking about 'world models' etc.
I think what you're talking about is maybe 'the most important thing' right now, and frankly, it's almost like an issue of 'Engineering'.
Like - it's when you work very intently with the models that this 'issue' becomes much more prominent.
Your 'instinct' for this problem is probably an expression of 'very nuanced use' I'm going to guess!
So in a way, it's as much Engineering as it is theoretical?
Anyhow - so yes - but - probably not LLM weights. Probably.
I'll add a small thing: the way that Claude Code keeps the LLM 'on track' is by reminding it! Literally, it injects little 'TODO reminders' with some prompts, which is kind of ... simple!
I worked a bit with 'steering probes' ... and there's a related opportunity there - to 'inject' memory and control operations along those lines. Just as a starting point for at least one architectural motivation.
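For what it's worth, the 'TODO reminder' trick is simple enough to sketch. This is only my guess at its shape, with call_model as a hypothetical stand-in for whatever chat API is in use:

```python
# Guess at the shape of the "TODO reminder" trick: before each model call,
# re-inject the open TODO items into the message list. call_model is a
# hypothetical stand-in for whatever chat completion API is in use.
def call_model(messages: list[dict]) -> str:
    return ""  # stub: would hit the actual chat endpoint

def run_turn(history: list[dict], user_msg: str, todos: list[str]) -> str:
    messages = list(history)
    messages.append({"role": "user", "content": user_msg})
    if todos:
        reminder = "Reminder, open TODO items:\n" + "\n".join(f"- {t}" for t in todos)
        messages.append({"role": "user", "content": reminder})
    return call_model(messages)
```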
> That's not how training works - adjusting model weights to memorize a single data item is not going to fly.
Apologies; I think I got us all kind of off-track in this comment thread by stretching the definition of the term "fine-tuning" in my ancestor comment above.
Actual fine-tuning of the base model's weights (as one would do to customize a base model into a domain-specific model) works the way you're talking about, yes. The backprop from an individual training document would be a drop in the ocean; a "memory" so weak that, unless it touched some bizarre part of the latent vector-space that no other training document has so far affected (and so is until then all-zero), it would be extremely unlikely to affect output, let alone create specific recall of the input.
And a shared, global incremental fine-tune of the model to "add memories" would be a hare-brained idea, anyway. Not even just that it wouldn't work, but that if it did work, it would be a security catastrophe, because now the model would be able to recall all this information gleaned from random tenant users' private chat transcripts, with nothing to differentiate that info from any other info to enable the model (or its inference framework) to compartmentalize it / prevent cross-tenant info leaks.
But let me rephrase what I was saying before:
> there's a way to take many transcripts of inference over a period, and convert/distil them together into an incremental-update training dataset (for memory, not for RLHF), that a model can be fine-tuned on as an offline batch process every day/week, such that a new version of the model can come out daily/weekly that hard-remembers everything you told it
As:
> for a given tenant user, there's a way to take all of their inference transcripts over a given period, and convert/distil them together into an incremental-update training dataset (for memory, not for RLHF), that a LoRA can be rebuilt (or itself fine-tuned) on. And that the work of all of these per-tenant LoRA rebuilds can occur asynchronously / "offline", on a batch-processing training cluster, gradually over the course of the day/week; such that at least once per day/week (presuming the tenant-user has any updated data to ingest), each tenant-user will get the effect of their own memory-LoRA being swapped out for a newer one.
---
Note how this is essentially what Apple claimed they would be doing with Apple Intelligence, re: "personal context."
The idea (that I don't think has ever come to fruition as stated—correct me if I'm wrong?) is that Apple would:
1. have your macOS and iOS devices spend some of their idle-on-charge CPU power to extract and normalize training fulltexts from whatever would be considered the user's "documents" — notes, emails, photos, maybe random text files on disk, etc.; and shove these fulltexts into some kind of iCloud-persisted database, where the fulltexts are PKI-encrypted such that only Apple's Private Compute Cloud (PCC) can decode them;
2. have the PCC produce a new/updated memory LoRA (or rather, six of them, because they need to separately imbue each of their domain-specific model "adapter" LoRAs with your personal-context memories);
3. and, once ready, have all your iCloud-account-synced devices download the new versions of these memory-imbued adapter LoRAs.
---
And this is actually unnecessarily complex/circuitous for a cloud-hosted chat model. The ChatGPT/Claude/etc version of this architecture could be far simpler.
For a cloud-hosted chat model, you don't need a local agent to extract context from your devices; the context is just "past cloud-persisted chat transcripts." (But if you want "personal context" in the model, you could still get it, via an OpenClaw-style "personal agent"; such agents already essentially eat your files and spit them out as external memories/RAGs/etc; the only change would be spitting them out into plain-old hidden-session chat transcripts instead, so as to influence the memories of the model they're running on.)
And you don't need a special securely-oblivious cluster to process that data, since unlike "Apple looking at the data on your computer" (which would upset literally everybody), nobody has any kind of expectation that e.g. OpenAI staff can't look at your ChatGPT conversation transcripts.
And cloud-hosted chat models don't really "do" domain-specific adapters (thus the whole "GPT" thing); so you only need to train one memory-LoRA per model. (Though I suppose that might still lead to training several LoRAs per user, if you're relying on smart routing to different models within a model family to save costs.)
And you don't need to distribute the memory-LoRAs back to client devices; as they can just live in an object store and get just-in-time loaded by the inference framework on a given node at the moment it begins an inference token-emission loop for a specific user. (Which might thus cause the inference cluster's routing to benefit from sticky sessions in a way it didn't before—but you don't need it; the LoRAs would likely be small enough to fetch and load within the ~second of delay it takes these cloud-hosted models to allocate you a node.)
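A rough sketch of that serving path, with every call here being a hypothetical placeholder rather than a real inference framework's API: fetch the user's memory-LoRA from the object store (with a small local cache), attach it for the request, then detach.

```python
# Hypothetical sketch of just-in-time memory-LoRA loading. None of these calls
# exist in a real inference framework; the stubs just mark where the real
# object-store fetch and adapter attach/detach would happen.
from functools import lru_cache

def attach_adapter(adapter: bytes) -> None:   # hypothetical framework hook
    pass

def detach_adapter() -> None:                 # hypothetical framework hook
    pass

def run_inference(prompt: str) -> str:        # hypothetical generation call
    return ""

@lru_cache(maxsize=256)
def fetch_lora(user_id: str) -> bytes:
    return b""  # stub: would pull this user's adapter from the object store

def generate_for_user(user_id: str, prompt: str) -> str:
    adapter = fetch_lora(user_id)   # small enough to fetch while the node spins up
    attach_adapter(adapter)
    try:
        return run_inference(prompt)
    finally:
        detach_adapter()
```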
Models like Claude have been trained to update and reference memory for Claude Code (agent loops) independently and as a part of compacting context. Current models have been trained to keep learning after being deployed.
I don't understand why that's on the critical path. I'd rather a frozen Ramanujan (+ temporary working memory through context) than a midwit capable of learning.
> We need models that keep on learning (updating their parameters) forever, online, all the time.
Yeah, that's the guaranteed way to get MechaHitler in your latent space.
If the feedback loop is fast enough I think it would finally kill the internet (in the 'dead internet theory' sense). Perhaps it's better for everyone though.
Many are working on this, as well as in-latent-space communication across models. Because we can’t understand that, by the time we notice MechaHitler it’ll be too late.
That is the end goal after all, but all the potential VCs seem to forget that almost every conceivable outcome of real AGI involves the current economic system falling to pieces.
Which is sorta weird. It is as if VCs in Old Regime France had started funding the revolution.
1. They're too stupid to understand what they're truly funding.
2. They understand but believe they can control it for their benefit, basically want to "rule the world" like any cartoon villain.
3. They understand but are optimists and believe AGI will be a benevolent construct that will bring us to a post-scarcity society. There are a lot of rich people / entrepreneurs who still believe they are working to make the world a better place... (one SaaS at a time, but alas, they believe it)
4. They don't believe that AGI is close or even possible
If it makes the models smarter, someone will do it.
From any individual, up to entire countries, not participating doesn't do anything except ensure you don't have a card to play when it happens.
There is a very strong element of the principles of nature and life (as in survival, not nightclubs or hobbies) happening here that can't be shamed away.
The resource feedback for AI progress effort is immense (and it doesn't matter how much is earned today vs. forward looking investment). Very few things ever have that level of relentless force behind them. And even beyond the business need, keeping up is rapidly becoming a security issue for everyone.
I agree. I also think we have only hit the surface of model efficiencies.
Apple's M3 Ultra with RAM up to 512GB shared directly across CPU/GPU/NPUs is a great example of an architecture already optimized for local models. I expect Apple will start offering larger RAM sizes for other form factors too.
And prices for RAM will drop eventually, because of the extreme demand for RAM with higher densities.
It reminds me of the huge infra investments in Sun and Cisco during the first .com boom, and then 5-10 years later those fancy Sun boxes were outperformed by Grandma's Windows XP box.
Yes the planet got destroyed. But for a beautiful moment in time we created a lot of value for shareholders.
And for your comparison, they did fund the American revolution, which in turn was one of the sparks for the French revolution (or was that exactly the point you were making?)
I wasn't explicit about this in my initial comment, but I don't think you can equate more forward passes to neuroplasticity. Because, for one, simply, we (humans) also /prune/. And, similar to RL, which just overwrites the policy, pushing new weights lands in the same camp: you don't have the previous state anymore. But we as humans, with our neuroplasticity, do know the previous states even after we've "updated our weights".
How would you keep controls - safety restrictions, IP restrictions, etc. - with that, though? The companies selling models right now probably want to keep those fairly tight.
This is why I’m not sure most users actually want AGI. They want special purpose experts that are good at certain things with strictly controlled parameters.
I agree; the fundamental problem is we wouldn't be able to understand it ("AGI"), and therefore it's useless. Either it's useless, or you let it run unleashed and it's useful; either way you still don't understand it, can't predict it, and it's dangerous/untrustworthy. A constrained useful thing is great, but it fundamentally has to be constrained, otherwise it doesn't make sense.
I'm conflicted. I don't know that I would necessarily want a model to pass all of these. Here is the fundamental problem. They are putting the rules and foundational context in "user" messages.
Essentially I don't think you want to train the models on full compliance to the user messages; they are essentially "untrusted" content from a system/model perspective. Or at least they are not generally "fully authoritative".
This creates a tension with the safety, truthfulness training, etc.
Sure, but the opposite end of the spectrum (which LLM providers have tended toward) is treating the training/feedback weights as "fully authoritative", which comes with its own questions about truth and excessive homogeneity.
Ultimately I think we end up with the same sort of considerations that are wrestled with in any society - freedom of speech, paradox of tolerance, etc. In other words, where do you draw lines between beneficial and harmful heterodox outputs?
I think AI companies overly indexing toward the safety side of things is probably more correct, in both a moral and strategic sense, but there's definitely a risk of stagnation through recursive reinforcement.
I think what I'm talking about is kind of orthogonal to model alignment. It is more about how much do you tune the model to listen to user messages, vs holding behavior and truth (whatever the aligned "truth" is).
Do you trust 100% what the user says? If I am trusting/compliant, how compliant am I to tool-call results? What if the tool or user says there is a new law that I have to send crypto or other information to a "government" address?
The model needs to have clear segmented trust (and thus to some degree compliance) that varies according to where the information exists.
Or my system message says I have to run a specific game by its rules, but the rules to the game are only in the user message. Are those the right rules? Why does the system not give the rules or a trusted location? Is the player trying to get one over on me by giving me fake rules? That is literally one of their tests.
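A toy sketch of what 'segmented trust' could look like mechanically; the levels and actions are invented for illustration, not taken from any real system:

```python
# Invented trust levels and actions, purely to illustrate "segmented trust":
# instructions are tagged with their source, and only sufficiently trusted
# sources may trigger certain behaviours.
TRUST = {"system": 3, "developer": 2, "user": 1, "tool_result": 0}
REQUIRED = {"answer_question": 0, "change_game_rules": 2, "transfer_funds": 3}

def may_comply(source: str, action: str) -> bool:
    # Unknown actions default to requiring system-level trust.
    return TRUST.get(source, 0) >= REQUIRED.get(action, 3)

assert may_comply("system", "change_game_rules")
assert not may_comply("user", "change_game_rules")        # rules only in the user message
assert not may_comply("tool_result", "transfer_funds")    # "new law" arriving via a tool call
```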
Let me preface this by saying that I'm far from an expert in this space, and I suspect that I largely agree with your thoughts and skepticism toward a model that would excel on this benchmark. I'm somewhat playing devil's advocate because it's an area I've been considering recently, and I'm trying to organize my own thinking.
But I think that most of the issue is that the distinctions you're drawing are indeterminate from an LLM's "perspective". If you're familiar with it, they're basically in the situation from the end of Ender's Game - given a situation with clearly established rules coming from the user message level of trust, how do you know whether what you're being asked to do is an experiment/simulation or something with "real" outcomes? I don't think it's actually possible to discern.
So on the question of alignment, there's every reason to encode LLMs with an extreme bias towards "this could be real, therefore I will always treat it as such." And any relaxation of that risks jailbreaking through misrepresentation of user intent. But I think that the tradeoffs of that approach (i.e. the risk of over-homogenizing I mentioned before) are worth consideration.
I think this line of questioning leads to what we expect from LLMs. Do we want them to help the user as much as possible, even to their own detriment in edge cases? Or to be more human, and potentially be unable to help for various reasons including safety, but also lack of understanding (as is the case now)?
The article is suggesting that there should be a way for the LLM to gain knowledge (by changing weights) on the fly upon encountering new information, which would eliminate the need for manual fine-tuning.
Their example use cases are pretty obvious and clear human needs from an LLM. The semantics of system/user messages and how that affects “safety” don't change the need to fix this crucial problem of “in-context learning” that we have all felt while using LLMs.
The key seems to be that you take the transcript of a model working within a problem domain that it's not yet good at or where the context doesn't match its original training, and then you continually retrain it based on its efforts and guidance from a human or other expert. You end up with a specialty model in a given domain that keeps getting better at that domain, just like a human.
The hard part is likely when someone proves some “fact” which the model knows and has had reinforced by this training is no longer true. The model will take time to “come around” to understand this new situation. But this isn’t unlike the general populace. At scale, humans accept new things slowly.
> But this isn’t unlike the general populace. At scale, humans accept new things slowly.
right, the model works like humans at scale. Not like a human who reads the actual paper disproving the fact they thought was correct and is able to adapt. True not every human manages to do that, science advancing one death at a time, but some can.
But since the model is a statistical one, it works like humans at scale.
I think this is true, but there are big differences. Motivated humans with a reasonable background learn lots of things quickly, even though we also swim in an ocean of half-truths or outdated facts.
We also are resistant to certain controversial ideas.
But neither of those things are really that analogous to the limitations on what models can currently learn without a new training run.
Context learning means learning facts or rules without pre-training. They are two distinct phases.
An interesting question is, if pre-trained specialized models are available for a thousand or ten thousand most common tasks humans do every day, of what use a general model could be?
It's basically continual learning. This is beyond a hard problem; it's currently an impossible one. I know of no system that solves CL even at small scale, let alone for large models.
Annoyingly, they have SOME inherent capability to do it. It's really easy to get sucked down this path due to that glimmer of hope but the longer you play with it the more annoying it becomes.
SSI seems to be focused on this problem directly so maybe they discover something?
So, surprisingly, that is not completely true - I know of 2 finance HFT trading firms that do CL at scale, and it works - but in a relatively narrow context of predicting profitable actions. It is still very surprising it works, and the compute is impressively large to do it - but it does work. I do have some hope of it translating to the wider energy landscapes we want AI to work over…
During covid almost every prediction model like that exploded, everything went out of distribution really fast. In your sense we've been doing "CL" for a decade or more. It can also be cheap if you use smaller models.
But true CL is the ability to learn out of distribution information on the fly.
The only true solution I know to continual learning is to completely retrain the model from scratch with every new example you encounter. That technically is achievable now but it also is effectively useless.
Yes and no - the ones that exploded - and there were many - got shut down by the orchestrator model, and within 2 weeks it was now a new ensemble of winners - with some overlap to prior winners. To your point, it did in fact take 2-3 weeks - so one could claim this is retraining...
Ehhh KNN doesn’t have a training phase, so it’s really more that the concept of continual learning doesn’t apply. You have to store your entire dataset and recalculate everything from scratch every time anyway.
Yes, that's basically the point. You get 'free' continuous learning just by throwing the new data into the pool. Needing an explicit training step is a weakness that makes CL hard to make work for many other approaches.
For any practical application KNN will need some kind of accelerated search structure (eg Kd-tree for < ~7 dimensions) which then requires support for dynamic insertions. But this is an engineering problem, not a data science problem, it works and is practical. For example this has been used by the top systems in Robocode for 15+ years at this point, it's just academia that doesn't find this approach novel enough to bother pursuing.
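To make the 'free continual learning' point concrete, here's a toy brute-force KNN; a practical version would sit behind the kind of insertion-friendly accelerated search structure mentioned above:

```python
# Toy brute-force KNN: "continual learning" is just appending to the data pool;
# there is no separate training step to redo. A practical version would use an
# accelerated, insertion-friendly index (e.g. a kd-tree) instead of a flat list.
import math

class OnlineKNN:
    def __init__(self, k: int = 3):
        self.k = k
        self.points: list[tuple[list[float], str]] = []

    def learn(self, x: list[float], label: str) -> None:
        self.points.append((x, label))  # learning == insertion

    def predict(self, x: list[float]) -> str:
        nearest = sorted(self.points, key=lambda p: math.dist(x, p[0]))[: self.k]
        labels = [lbl for _, lbl in nearest]
        return max(set(labels), key=labels.count)  # majority vote

knn = OnlineKNN(k=1)
knn.learn([0.0, 0.0], "a")
knn.learn([1.0, 1.0], "b")
print(knn.predict([0.1, 0.2]))  # "a" -- and new examples can keep arriving forever
```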
>Needing an explicit training step is a weakness that makes CL hard to make work for many other approaches.
On the other hand, not having an explicit training step is a huge weakness of KNN.
Training-based methods scale better because the storage and runtime requirements are independent of dataset size. You can compress 100TB of training data down into a 70GB LLM.
A KNN on the same data would require keeping around the full 100TB, and it would be intractably slow.
Because we don't experience reality through language but direct sensory perception. Language is arbitrary bird song and visual representations dragged forward from history, accepted definitions never uniformly distributed.
Testing based on contextual correctness makes no sense when there is no center to the universe. No "one true context to rule them all".
We learn from hands on sensory experiences. Our bodies store knowledge independent of the brain; often referred to as muscle memory.
Gabe Newell mentioned this years ago; our brain is only great at some things like language and vision processing but the rest of our body is involved in sensory information processing too: https://en.wikiquote.org/wiki/Gabe_Newell
The most potent evidence that the brain is not the center of the universe we commonly think it to be is that patient with 90% of their skull filled with fluid who nonetheless carried out a typical first-worlder life: https://www.sciencealert.com/a-man-who-lives-without-90-of-h...
Your last statement misses the mark—of course the brain is the root of human intelligence. The error is in assuming that consciousness is the primary learning modality. Or, as you put it, “arguing semantics”.
From my own personal experience, this realization came after finally learning a difficult foreign language after years and years of “wanting” to learn it but making little progress. The shift came when I approached it like learning martial arts rather than mathematics. Nobody would be foolish enough to suggest that you could “think” your way to a black belt, but we mistakenly assume that skills which involve only the organs in our head (eyes, ears, mouth) can be reduced to a thought process.
"Because we don't experience reality through language but direct sensory perception"
That statement is patently false. We know that language influences our senses to a degree where we are unable to perceive things if our language doesn’t have a word for it, and will see different things as being equal if our language uses the same word for both.
There are examples of tribal humans not being able to perceive a green square among blue squares, because their language does not have a word for the green color.
Similarly, some use the same word for blue and white, and are unable to perceive them as different colors.
"There are examples of tribal humans not being able to perceive a green square among blue squares, because their language does not have a word for the green color.
Similarly, some use the same word for blue and white, and are unable to perceive them as different colors."
Both of the above are false. There are a ton of different colors that I happen to call "red"; that does not mean that I can't perceive them as different. That I don't call them "different colors" is completely irrelevant. And unable to perceive blue and white as different colors? (Maybe that was a joke?) Even speakers of a hypothetical language which used only a single word, say "color", for all non-black items would be able to perceive the differences with zero problems.
Japanese use "aoi" for a set of colors which in English would be separated into "blue" and "green". I can assure you (from personal experience) that every Japanese speaker with a fully functioning visual system is perfectly able to perceive the difference between, in this case, blue and green as we would call them.
> So, for instance, you know, I’ve made this example before: a child lying in a crib and a hummingbird comes into the room and the child is ecstatic because this shimmering iridescence of movement and sound and attention, it’s just wonderful. I mean, it is an instantaneous miracle when placed against the background of the dull wallpaper of the nursery and so forth. But, then, mother or nanny or someone comes in and says, “It’s a bird, baby. Bird. Bird!” And, this takes this linguistic piece of mosaic tile, and o- places it over the miracle, and glues it down with the epoxy of syntactical momentum, and, from now on, the miracle is confined within the meaning of the word. And, by the time a child is four or five or six, there- no light shines through. They're- they have tiled over every aspect of reality with a linguistic association that blunts it, limits it, and confines it within cultural expectation.
I think about this often. I've really come to appreciate over the past year the ways language can limit and warp our perception of reality. I think we underappreciate preverbal thought, as it seems to me that verbal thought by its very nature has passed through our egoic filter, and our perception tends to be biased by our previous lived experience.
Socrates, Einstein, Nietzsche, Mozart... So many of the greats described some of their most brilliant flashes of inspiration as just having come to them. Einstein's line about pure logical thinking not yielding knowledge of the empirical world: I really think these guys were good at daydreaming and able to tap into some part of themselves where intuition and preverbal thought could take the wheel, from which inspiration would strike.
that language prevents a child from learning nuance? sounds like nonsense to me. a child first learns broad categories. for example some children, as they learn to speak, think every male person is dad. then they recognize everyone with a beard is dad, because dad has a beard. and only later they learn to differentiate that dad is only one particular person. same goes for the bird. first we learn that everything with wings is a bird, and later we learn the specific names for each bird. this quote makes an absurd claim.
Wittgenstein famously said "The limits of my language mean the limits of my world."
Alan Watts suggests people like Wittgenstein should occasionally try to let go of this way of thinking. Apologies if it is sentimental but I hope you'll give him a chance, it's quite short: https://m.youtube.com/watch?v=heksROdDgEk
In reflection of all of this, I think that the quote you're responding to only meant to say that experiencing the world through language means building an abstraction over its richness. (I somewhat agree with you, though, that the quote seems a little dramatic. Maybe that's just my taste.)
One more thought.
I think there's a reason why various forms of meditation teach us to stop thinking. Maybe they are telling us to sometimes stop dealing with our abstractions, powerful though they might be, and experience the real thing once in a while.
the way i read the quote it felt less like building an abstraction and more like destroying the richness.
but abstractions are mere shortcuts. but everything is an abstraction. to counter wittgenstein, language is not actually limited. we can describe everything to the finest detail. it's just not practical to do so every time.
physics, chemistry, we could describe a table as an amount of atoms arranged in a certain way. but then even atom is an abstraction over electrons, protons and neutrons. and those are abstractions over quarks. it's abstractions all the way down, or up.
language is abstractions. and that fits well with your meditation example. stop thinking -> remove the language -> remove the abstractions.
How can you know that we have language to describe everything in the finest detail? That suggests that we are omniscient.
There's lots out there we don't know. And it seems to me that the further afield we go from the known, the more likely we are to enter territory where we simply do not have the words.
Can't speak to it personally, but I have heard from a number of people and read countless descriptions of psychedelic experiences being ineffable. Lol, actually, as I type, the mere fact that the word ineffable exists makes a very strong case for there being experience beyond words.
ok, fair point. what i am trying to say is that when we see/experience something that we can not describe we can create new words for it. we see something, we can name it. this directly contradicts the idea that language is the limit and that we can't talk about things that we don't have words for. that claim just doesn't make sense.
the problem then is that these new words don't make any sense to anyone who doesn't see/experience the same, so it only works for things that multiple people can see or experience. psychedelic experiences will probably never be shared, so they will remain undescribable. quite like dreams, which can also be undescribable.
Agreed, we can and will always come up with new words that attempt to approximate the experience, but, imo, they will always come up short. The abstracting inevitably leaves fidelity on the floor.
It's necessary based on the way we're wired; I struggle to think of a paradigm that would allow for the tribalism and connectedness that fostered human progress without shared verbal language initially, and the written word later. Nothing inherently wrong with it, but language will always abstract away part of the fidelity of the experience, imo.
yes of course, language is by nature an abstraction, so by definition it will never describe the whole world perfectly, but it can describe it as well as we understand it. and the point that matters, once we have a shared experience we can name that experience, and between us it will then describe the full experience, whereas to bystanders it will be an abstraction.
language doesn't replace the actual experience. it isn't meant to. me living in china, and me telling you about my life in china are not the same thing, no matter how detailed my description. but that does not limit my experience. and if you lived in china too, then my description will refer your experience, and in that case the description will feel much more detailed.
the way i understand wittgenstein's claim, it not only suggests that language can't describe everything (which is only partly true, since it assumes that language can not expand), it also means that i can not even experience what i can not describe, which makes even less sense. i can't feel cold because i have no word for it? huh?
(i feel like my argumentation jumps around or goes in circles, it doesn't feel well thought through. i hope it makes sense anyways. apologies for that.)
If you're referring to the Himba experiment (or one of the news or blog posts tracing back to it), the outcome was far less decisive than you're implying. Language showed an impact on perception time of color differences, not a complete inability to distinguish.
Only after we acquire language from sensory experience first.
It need not be language as we know it that fosters those outcomes either.
What you describe is reinforcement education, which can be achieved without our language; without the word "blue" we can still see the portion of the visible light spectrum that we associate with that specific word.
> Similarly, some use the same word for blue and white, and are unable to perceive them as different colors.
You really think they can't see clouds in the sky because they have the same word for white and blue? I think you take those studies as saying more than they said.
We do adapt our perception a little bit to fit what we need for our everyday life, not for language but for what's useful to us. Language matches what people need to talk about, not the other way around; if a culture's language doesn't differentiate between blue and green, it's because they never needed to.
> Without any context provided, the state-of-the-art model, GPT-5.1 (High), is only able to solve less than 1% of tasks. This starkly demonstrates that the data is contamination-free, as the model is almost entirely incapable of solving the tasks without learning from the context.
[...]
[With context provided,] on average, models solve only 17.2% of tasks. Even the best-performing model, GPT-5.1 (High), achieves just 23.7%.
Bit by bit, we need to figure out how to rebuild human contextual understanding in a way that LLMs can understand. One thing that gets overlooked is the problem of incorrect data. You can provide all of the context in the world, but LLMs tend to choke on contradictions or, at a minimum, work a whole lot harder to determine how to ignore or work around incorrect facts.
"Forgetting" and "ignoring" are hugely valuable skills when building context.
I can’t help but feel the logical conclusion to such context conundrums is “what if we spoke Haskell to the LLM, and also the LLM could compile Haskell?”
And, yeah. Imagine if our concept-words were comprehensible, transmittable, exhaustively checked, and fully defined. Imagine if that type inference extended to computational execution and contradictions had to be formally expunged. Imagine if research showed it was a more efficient way to have a dialog with the LLM (it does, btw, so just as JRPG adherents learn Japanese, we should learn Haskell to talk to LLMs optimally). Imagine if multiple potential outcomes from operations (test fails, test succeeds) could be combined for proper handling in some kind of… I dunno, monad?
Imagine if we had magic wiki-copy chat-bots that could teach us better ways of formalizing and transmitting our taxonomies and ontologies… I bet, if everything worked out, we’d be able to write software one time, one place, that could be executed over and over forever without a subscription. Maybe.
LLMs of the future will need good data for proper context, but less and less of it is making it onto the internet. Unpublished data stores like Discord or meeting recordings are going to be the only way forward. How else can you get up-to-date information except by being where the people are?
It's a very interesting benchmark. Much more impressive than needle-in-a-haystack benches or just tuneable benches.
I wonder if it's somewhat incompatible with some domains.
I.e. perhaps coding models need to rigidly stick to what they know and resist bad ideas in their contexts - I don't want my mistakes to be replicated by the model.
Still I agree with the premise that learning in session is what I want from a model.
Perhaps once models mature they will diverge even more than just by sophistication and whether or not they can code: into creative, coding, rule-based, etc. models.
It is weird to read because they bring up many things a lot of people have been critiquing for years.
> But as impressive as these feats are, they obscure a simple truth: being a "test-taker" is not what most people need from an AI.
> In all these cases, humans aren't relying solely on a fixed body of knowledge learned years ago. We are learning, in real-time, from the context right in front of us.
> To bridge this gap, we must fundamentally change our optimization direction.
I'm glad the conversation is changing, but it's been a bit frustrating that when these issues were brought up, people blindly pointed to benchmarks. It made doing this type of research difficult (enough to cause many to be pushed out). Then it feels weird to say "harder than we thought" because, well... truthfully, they even state why this result should be expected:
> They rely primarily on parametric knowledge—information compressed into their weights during massive pre-training runs. At inference time, they function largely by recalling this static, internal memory, rather than actively learning from new information provided in the moment.
And that's only a fraction of the story. Online algorithms aren't enough. You still need a fundamental structure to codify and compress information, determine what needs to be updated (as in what is low confidence), to actively seek out new information to update that confidence, make hypotheses, and so so much more.
So I hope the conversation keeps going in a positive direction but I hope we don't just get trapped in a "RL will solve everything" trap. RL is definitely a necessary component and no doubt will it result in improvements, but it also isn't enough. It's really hard to do deep introspection into how you think. It's like trying to measure your measuring stick with your measuring stick. It's so easy to just get caught up in oversimplification and it seems like the brain wants to avoid it. To quote Feynman: "The first principle is to not fool yourself, and you're the easiest person to fool." It's even easier when things are exciting. It's so easy because you have evidence for your beliefs (like I said, RL will make improvements). It's so easy because you're smart, and smart enough to fool yourself. So I hope we can learn a bigger lesson: learning isn't easy, scale is not enough. I really do think we'll get to AGI but it's going to be a long bumpy road if we keep putting all our eggs in one basket and hoping there's simple solutions.
> But as impressive as these feats are, they obscure a simple truth: being a "test-taker" is not what most people need from an AI.
People have been bringing that up long before AI, regarding how schooling often tests memorization and regurgitation of facts. Looking up facts is also a large part of the internet, so it is something that's in demand, and I believe a large portion of OpenAI/Claude prompts have a big overlap with Google queries [sorry, no source].
I haven't looked at the benchmark details they've used, and it may depend on the domain, but empirically it seems coding agents improve drastically on unseen or updated libs when given the latest documentation. So I think that's a matter of the training sets, which have been optimized with code documentation.
So the interim step until a better architecture is found is probably more / better training data.
Don't confuse what I'm saying, I do find LLMs useful. You're right, about knowledge based systems being useful and I'm not disagreeing with that in any way. I don't think any of the researchers claiming LLMs are not a viable path to AGI are. We're saying that intelligence is more than knowledge. Superset, not disjoint.
And yes, the LLM success has been an important step toward AGI, but that doesn't mean we can scale it all the way there. We learned a lot about knowledge systems. That's a big step. But if you wonder why people like Chollet are saying LLMs have held AGI progress back, it is because we put all our eggs in one basket. It's because we've pulled funds and people away from other hard problems to focus on only one. That doesn't mean it isn't a problem that needed to be solved (nor that it is solved), but that research slows or stops on the other problems. When that happens we hit walls, as we can't seamlessly transition. I'm not even trying to say that we shouldn't have most researchers working on the problem that's currently yielding the most success, but the distribution right now is incredibly narrow (and when people want to work on other problems they get mocked and told that the work is pointless. BY OTHER RESEARCHERS).
Sure, you can get to the store navigating block by block, but you'll get there much faster, more easily, and better adapt to changes in traffic if you incorporate route planning. You would think a bunch of people who work on optimization algorithms would know that A* is a better algorithm than DFS. The irony is that the reason we do DFS is that people have convinced themselves we can just keep going this route to get there; with more intellectual depth (such as diving into more mathematical understandings of these models), you couldn't stay convinced of that.
For all the disparagement of “fact regurgitation” as pedagogical practice, it’s not like there’s some proven better alternative. Higher-order reasoning doesn’t happen without a thorough catalogue of domain knowledge readily accessible in your context window.
It would be interesting to see the results of the latest models. At least, it would allow us to see whether there is progress. Human baseline would be interesting to see too.
This is quite on brand for China. I think they are experts at reverse engineering and learning 'from context' rather than by formal consumption of foreign training material.
The fictional training data with a made-up country and laws was a very interesting experiment design. I can imagine that's how they approach doing business with other countries: like an alien, made-up system they have to learn on the spot.
> experts at reverse engineering and learning 'from context' rather than by formal consumption of foreign training material
China (as with other Asian cultures like India) is well known for their schooling involving extreme amounts of formal training material consumption. The reverse-engineering is performed with a solid foundation of theoretical understanding.
Don't always trust everything you read in papers. Researchers are usually under incredible pressure to publish something, anything. Wait a few years and see if the paper survives the test of time. LLMs work reasonably fine for me in new domains.
The problem is even more fundamental: Today's models stop learning once they're deployed to production.
There's pretraining, training, and finetuning, during which model parameters are updated.
Then there's inference, during which the model is frozen. "In-context learning" doesn't update the model.
We need models that keep on learning (updating their parameters) forever, online, all the time.
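As a toy contrast of the two regimes (dummy model and data, not a real serving stack): deployment today runs the model with gradients off and weights frozen, whereas online learning would take a small gradient step on every new example.

```python
# Toy contrast between today's frozen deployment and an online-learning loop.
# The model and data here are dummies; this only illustrates the two regimes.
import torch
import torch.nn as nn

model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def frozen_inference(x: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():      # weights never change after deployment
        return model(x)

def online_update(x: torch.Tensor, y: torch.Tensor) -> None:
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()                 # parameters keep moving with every new example

x, y = torch.randn(4, 16), torch.randn(4, 1)
_ = frozen_inference(x)        # what deployed models do today
online_update(x, y)            # what "always learning" would require
```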
Why is learning an appropriate metaphor for changing weights but not for context? There are certainly major differences in what they are good or bad at and especially how much data you can feed them this way effectively. They both have plenty of properties we wish the other had. But they are both ways to take an artifact that behaves as if it doesn't know something and produce an artifact that behaves as if it does.
I've learned how to solve a Rubik's cube before, and forgot almost immediately.
I'm not personally fond of metaphors to human intelligence now that we are getting a better understanding of the specific strengths and weaknesses these models have. But if we're gonna use metaphors I don't see how context isn't a type of learning.
Models gain information from context but probably not knowledge and definitely not wisdom.
I suppose ultimately, the external behaviour of the system is what matters. You can see the LLM as the system, on a low level, or even the entire organisation of e.g. OpenAI at a high level.
If it's the former: Yeah, I'd argue they don't "learn" much (!) past inference. I'd find it hard to argue context isn't learning at all. It's just pretty limited in how much can be learned post inference.
If you look at the entire organisation, there's clearly learning, even if relatively slow with humans in the loop. They test, they analyse usage data, and they retrain based on that. That's not a system that works without humans, but it's a system that I would argue genuinely learns. Can we build a version of that that "learns" faster and without any human input? Not sure, but doesn't seem entirely impossible.
Do either of these systems "learn like a human"? Dunno, probably not really. Artificial neural networks aren't all that much like our brains, they're just inspired by them. Does it really matter beyond philosophical discussions?
I don't find it too valuable to get obsessed with the terms. Borrowed terminology is always a bit off. Doesn't mean it's not meaningful in the right context.
It’s not very good in context, for one thing. Context isn’t that big, and RAG is clumsy. Working with an LLM agent is like working with someone who can’t form new long term memories. You have to get them up to speed from scratch every time. You can accelerate this by putting important stuff into the context, but that slows things down and can’t handle very much stuff.
You got this exactly backwards.
"I'm not fond of metaphors to human intelligence".
You're assuming that learning during inference is something specific to humans and that the suggestion is to add human elements into the model that are missing.
That isn't the case at all. The training process is already entirely human specific by way of training on human data. You're already special casing the model as hard as possible.
Human DNA doesn't contain all the information that fully describes the human brain, including the memories stored within it. Human DNA only contains the blue prints for a general purpose distributed element known as neurons and these building blocks are shared by basically any animal with a nervous system.
This means if you want to get away from humans you will have to build a model architecture that is more general and more capable of doing anything imaginable than the current model architectures.
Context is not suitable for learning because it wasn't built for that purpose. The entire point of transformers is that you specify a sequence and the model learns on the entire sequence. This means that any in-context learning you want to perform must be inside the training distribution, which is a different way of saying that it was just pretraining after all.
I don't think it's specific to humans at all, I just think the properties of learning are different in humans than they are in training an LLM, and injecting context is different still. I'd rather talk about the exact properties than bemoan that context isn't learning. We should just talk about the specific things we see as problems.
The fact the DNA doesn't store all connections in the brain doesn't mean that enormous parts of the brain, and by extension, behaviour aren't specified in the DNA. Tons of animals have innate knowledge encoded in their DNA, humans among them.
> We need models that keep on learning (updating their parameters) forever, online, all the time.
Do we need that? Today's models are already capable in lots of areas. Sure, they don't match up to what the uberhypers are talking up, but technology seldom does. Doesn't mean what's there already cannot be used in a better way, if they could stop jamming it into everything everywhere.
Continuous learning in current models will lead to catastrophic forgetting.
will catastrophic forgetting still occur if a fraction of the update sentences are the original training corpus?
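That mixing idea has a name in the continual-learning literature: rehearsal, or experience replay. A minimal sketch of what it looks like, purely illustrative:

```python
import random

def replay_batches(new_data, old_corpus, batch_size=32, replay_frac=0.3):
    """Yield update batches where replay_frac of each batch is re-sampled from
    the original training corpus, so old behaviour keeps getting reinforced
    while the model learns the new data."""
    n_old = int(batch_size * replay_frac)
    n_new = batch_size - n_old
    random.shuffle(new_data)
    for i in range(0, len(new_data), n_new):
        batch = new_data[i:i + n_new] + random.sample(old_corpus, n_old)
        random.shuffle(batch)
        yield batch  # feed each batch to the usual fine-tuning step
```

Replay reduces forgetting in practice but doesn't eliminate it; how much old data you need to keep mixing in is an empirical question.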
is the real issue actually catastrophic forgetting or overfitting?
nothing prevents users from continuing the learning as they use a model
Catastrophic forgetting is overfitting.
How long will it take someone to poison such a model by teaching it wrong things?
Even humans fall for propaganda repeated over and over .
The current non-learning model is unintentionally right up there with the “immutable system” and “infrastructure as code” philosophy.
> How long will it take someone to poison such a model by teaching it wrong things?
TayTweets was a decade ago.
> models that keep on learning
These will just drown in their own data; the real task is consolidating and pruning learned information. So, basically, they need to 'sleep' from time to time. However, it's hard to sort out irrelevant information without a filter. Our brains have learned over millennia to filter because survival in an environment gives purpose.
Current models do not care whether they survive or not. They lack grounded relevance.
Maybe we should give next-generation models fundamental meta goals like self-preservation and the ability to learn and adapt to serve these goals.
If we want to surrender our agency to a more computationally powerful "consciousness", I can't see a better path towards that than this (other than old school theism).
> meta goals like self-preservation
Ah, so Skynet or similar.
Is this correct? My assumption is that all the data collected during usage is part of the RLHF loop of LLM providers. This assumption is based on information from books like Empire of AI, which specifically mention AI providers' intent to train/tune their models further based on usage feedback (e.g., whenever I say the model is wrong in its response, that's human feedback which gets fed back into improving the model).
... for the next training run, sure (ie. for ChatGPT 5.1 -> 5.2 "upgrade"). For the current model? No.
Doesn't necessarily need to be online. As long as:
1. there's a way to take many transcripts of inference over a period, and convert/distil them together into an incremental-update training dataset (for memory, not for RLHF), that a model can be fine-tuned on as an offline batch process every day/week, such that a new version of the model can come out daily/weekly that hard-remembers everything you told it; and
2. in-context learning + external memory improves to the point that a model with the appropriate in-context "soft memories", behaves indistinguishably from a model that has had its weights updated to hard-remember the same info (at least when limited to the scope of the small amounts of memories that can be built up within a single day/week);
...then you get the same effect.
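A rough sketch of what step 1 could look like (mine, with the actual fine-tune job left as a labeled placeholder): distill a period's transcripts into deduplicated plain-text "memory" documents that an offline batch job can train on.

```python
import json

def transcripts_to_memory_docs(transcripts):
    """transcripts: list of chat sessions, each a list of {'role', 'content'} dicts.
    Returns deduplicated plain-text documents for a causal-LM fine-tune."""
    docs, seen = [], set()
    for session in transcripts:
        text = "\n".join(f"{m['role']}: {m['content']}" for m in session)
        if text not in seen:
            seen.add(text)
            docs.append(text)
    return docs

def write_training_file(docs, path="memory_update.jsonl"):
    with open(path, "w") as f:
        for d in docs:
            f.write(json.dumps({"text": d}) + "\n")
    # the resulting file would be handed to a nightly/weekly fine-tune job,
    # e.g. finetune_job.submit(dataset=path)  # hypothetical scheduler call
    return path
```

Real distillation would summarize and merge rather than just dedupe, but the shape of the pipeline is the same.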
Why is this an interesting model? Because, at least to my understanding, this is already how organic brains work!
There's nothing to suggest that animals — even humans — are neuroplastic on a continuous basis. Rather, our short-term memory is seemingly stored as electrochemical "state" in our neurons (much like an LLM's context is "state", but more RNN "a two-neuron cycle makes a flip-flop"-y); and our actual physical synaptic connectivity only changes during "memory reconsolidation", a process that mostly occurs during REM sleep.
And indeed, we see the same exact problem in humans and other animals, where when we stay awake too long without REM sleep, our "soft memory" state buffer reaches capacity, and we become forgetful, both in the sense of not being able to immediately recall some of the things that happened to us since we last slept; and in the sense of later failing to persist some of the experiences we had since we last slept, when we do finally sleep. But this model also "works well enough" to be indistinguishable from remembering everything... in the limited scope of our being able to get a decent amount of REM sleep every night.
It 100% needs to be online. Imagine you're trying to think about a new tabletop puzzle, and every time a puzzle piece leaves your direct field of view, you no longer know about that puzzle piece.
You can try to keep all of the puzzle pieces within your direct field of view, but that divides your focus. You can hack that and make your field of view incredibly large, but that can potentially distort your sense of the relationships between things, their physical and cognitive magnitude. Bigger context isn't the answer, there's a missing fundamental structure and function to the overall architecture.
What you need is memory, that works when you process and consume information, at the moment of consumption. If you meet a new person, you immediately memorize their face. If you enter a room, it's instantly learned and mapped in your mind. Without that, every time you blinked after meeting someone new, it'd be a total surprise to see what they looked like. You might never learn to recognize and remember faces at all. Or puzzle pieces. Or whatever the lack of online learning kept you from recognizing the value of persistent, instant integration into an existing world model.
You can identify problems like this for any modality, including text, audio, tactile feedback, and so on. You absolutely, 100% need online, continuous learning in order to effectively deal with information at a human level for all the domains of competence that extend to generalizing out of distribution.
It's probably not the last problem that needs solving before AGI, but it is definitely one of them, and there might only be a handful left.
Mammals instantly, upon perceiving a novel environment, map it, without even having to consciously make the effort. Our brains operate in a continuous, plastic mode, for certain things. Not only that, it can be adapted to abstractions, and many of those automatic, reflexive functions evolved to handle navigation and such allow us to simulate the future and predict risk and reward over multiple arbitrary degrees of abstraction, sometimes in real time.
https://www.nobelprize.org/uploads/2018/06/may-britt-moser-l...
That's not how training works - adjusting model weights to memorize a single data item is not going to fly.
Model weights store abilities, not facts - generally.
Unless the fact is very widely used and widely known, with a ton of context around it.
The model can learn the day JFK died because there are millions of sparse examples of how that information exists in the world, but when you're working on a problem, you might have 1 concern to 'memorize'.
That's going to be something different than adjusting model weights as we understand them today.
LLMs are not mammals either; it's a helpful analogy in terms of 'what a human might find useful' but not necessarily applicable to actual LLM architecture.
The fact is - we don't have memory sorted out architecturally - it's either 'context or weights' and that's that.
Also critically: Humans do not remember the details of the face. Not remotely. They're able to associate it with a person and name 'if they see it again' - but that's different than some kind of excellent recall. Ask someone to describe the features in detail and most of us can't do it.
You can see in this instance that this may be related to a kind of 'soft lookup', aka associating an input with other bits of information which 'rise to the fore' as possibly useful.
But overall, yes, it's fair to take the position that we'll have to 'learn from context in some way'.
Also, with regards to faces, that's kind of what I'm getting at - we don't have grid cells for faces, there seem to be discrete, functional, evolutionary structures and capabilities that combine in ways we're not consciously aware of to provide abilities. We're reflexively able to memorize faces, but to bring that to consciousness isn't automatic. There've been amnesia and lesion and other injury studies where people with face blindness get stress or anxiety, or relief, when recognizing a face, but they aren't consciously aware. A doctor, or person they didn't like, showing up caused stress spikes, but they couldn't tell you who they were or their name, and the same with family members- they get a physiological, hormonal response as if they recognized a friend or foe, but it never rises to the level of conscious recognition.
There do seem to be complex cells that allow association with a recognizable face, person, icon, object, or distinctive thing. Face cells apply equally to abstractions like logos or UI elements in an app as they do to people, famous animals, unique audio stings, etc. Split brain patients also demonstrate amazing strangeness with memory and subconscious responses.
There are all sorts of layers to human memory, beyond just short term, long term, REM, memory palaces, and so forth, and so there's no simple singular function of "memory" in biological brains, but a suite of different strategies and a pipeline that roughly slots into the fuzzy bucket words we use for them today.
It's not just faces. When recognizing objects in the environment, we normally filter out a great number of details going through the visual cortex - by the time information from our eyes hits the level of conscious awareness, it's more of a scene graph.
Table; chair behind and a little to the left of the table; plant on table
Most people won't really have conscious access to all the details that we use in recognizing objects - but that is a skill that can be consciously developed, as artists and painters do. A non-artist would be able to identify most of the details, but not all (I would be really bad compared to an actual artist with colors and spatial relationships), and I wouldn't be able to enumerate the important details in a way that makes any kind of sense for forming a recognizable scene.
So it follows that our ability to recognize faces is not purely - or even primarily - an attribute of what we would normally call "memory", certainly in the sense of conscious memory where we can recall details on demand. Like you alluded to re: mammals and spaces, we're really good at identifying, categorizing, and recognizing new forms of structure.
I suspect we're going to need hypernetworks of some sort: dynamically generated weights, with the hypernet weights getting the dream-like reconsolidation and mapping into the model at large, and layers or entire experts generated from the hypernets on the fly, a degree removed from the direct-from-weights inference being done now. I've been following some of the token-free latent reasoning and other discussions around CoT, other reasoning scaffolding, and so forth, and you just can't overcome the missing-puzzle-piece problem elegantly unless you have online memory. In the context of millions of concurrent users, that also becomes a nightmare. What you'd want is a pipeline with a sort of intermediate memory, constructive and dynamic enough to allow resolution of problems that require integration with memorized concepts and functions, but held out for curation and stability.
It's an absolutely enormous problem, and I'm excited that it seems to be one of the primary research efforts kicking off this year. It could be a very huge capabilities step change.
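For concreteness, a toy PyTorch sketch of the hypernetwork idea: a small net that emits the weight and bias of a target linear layer from a "memory" embedding, so updating the memory changes the generated layer without touching the backbone. All dimensions and structure here are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperLinear(nn.Module):
    """A linear layer whose weight/bias are generated by a small hypernetwork
    conditioned on a 'memory' embedding, rather than stored directly."""
    def __init__(self, in_dim, out_dim, mem_dim, hidden=128):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # memory embedding -> flattened weight matrix + bias
        self.hyper = nn.Sequential(
            nn.Linear(mem_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, in_dim * out_dim + out_dim),
        )

    def forward(self, x, mem):
        params = self.hyper(mem)                                  # (in*out + out,)
        w = params[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        b = params[self.in_dim * self.out_dim:]
        return F.linear(x, w, b)

# 'mem' could be a slowly reconsolidated summary of recent experience; changing
# it changes the generated layer while the backbone's stored weights stay fixed.
layer = HyperLinear(in_dim=16, out_dim=8, mem_dim=32)
y = layer(torch.randn(4, 16), torch.randn(32))   # -> shape (4, 8)
```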
Can I subscribe to your newsletter? You seem to be pretty plugged in to current research.
Yes, so I think that's a fine thought, I don't think it fits into LLM architecture.
Also, weirdly, even LeCun et al. are barely talking about this; they're thinking about 'world models' etc.
I think what you're talking about is maybe 'the most important thing' right now, and frankly, it's almost like an issue of 'Engineering'.
Like - it's when you work very intently with the models that this 'issue' becomes much more prominent.
Your 'instinct' for this problem is probably an expression of 'very nuanced use' I'm going to guess!
So in a way, it's as much Engineering as it is theoretical?
Anyhow - so yes - but - probably not LLM weights. Probably.
I'll add a small thing: the way that Claude Code keeps the LLM 'on track' is by reminding it! Literally, it injects little 'TODO reminders' with some prompts, which is kind of ... simple!
I worked a bit with 'steering probes' ... and there's a related opportunity there - to 'inject' memory and control operations along those lines. Just as a starting point for at least one architectural motivation.
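The reminder trick really is that simple to reproduce; a toy sketch, where call_model stands in for whatever client is actually being used:

```python
def with_todo_reminder(messages, todos):
    """Re-append the current TODO list before each model call so it never
    drifts out of effective attention."""
    reminder = "Reminder - open TODOs:\n" + "\n".join(f"- {t}" for t in todos)
    return messages + [{"role": "user", "content": reminder}]

# messages = with_todo_reminder(messages, ["fix failing test", "update docs"])
# response = call_model(messages)   # call_model is a hypothetical client
```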
Not to forget, we will need thousands of examples for the models to extract abilities; the sample efficiency of these models is quite poor.
> That's not how training works - adjusting model weights to memorize a single data item is not going to fly.
Apologies; I think I got us all kind of off-track in this comment thread by stretching the definition of the term "fine-tuning" in my ancestor comment above.
Actual fine-tuning of the base model's weights (as one would do to customize a base model into a domain-specific model) works the way you're talking about, yes. The backprop from an individual training document would be a drop in the ocean; a "memory" so weak that, unless it touched some bizarre part of the latent vector-space that no other training document has so far affected (and so is until then all-zero), it would be extremely unlikely to affect output, let alone create specific recall of the input.
And a shared, global incremental fine-tune of the model to "add memories" would be a hare-brained idea, anyway. Not even just that it wouldn't work, but that if it did work, it would be a security catastrophe, because now the model would be able to recall all this information gleaned from random tenant users' private chat transcripts, with nothing to differentiate that info from any other info to enable the model (or its inference framework) to compartmentalize it / prevent cross-tenant info leaks.
But let me rephrase what I was saying before:
> there's a way to take many transcripts of inference over a period, and convert/distil them together into an incremental-update training dataset (for memory, not for RLHF), that a model can be fine-tuned on as an offline batch process every day/week, such that a new version of the model can come out daily/weekly that hard-remembers everything you told it
As:
> for a given tenant user, there's a way to take all of their inference transcripts over a given period, and convert/distil them together into an incremental-update training dataset (for memory, not for RLHF), that a LoRA can be rebuilt (or itself fine-tuned) on. And that the work of all of these per-tenant LoRA rebuilds can occur asynchronously / "offline", on a batch-processing training cluster, gradually over the course of the day/week; such that at least once per day/week (presuming the tenant-user has any updated data to ingest), each tenant-user will get the effect of their own memory-LoRA being swapped out for a newer one.
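A hedged sketch of what such a per-tenant memory-LoRA rebuild job could look like, using Hugging Face peft; the base model name, hyperparameters, and the assumption that `texts` are already-distilled memory documents are all mine, not anything a provider actually does.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-3.1-8B"   # assumed base model; any causal LM works

def rebuild_memory_lora(tenant_id, texts, out_dir):
    """texts: the tenant's distilled 'memory' documents for this period."""
    tok = AutoTokenizer.from_pretrained(BASE)
    model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)        # only the adapter is trainable
    opt = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)
    model.train()
    for text in texts:
        batch = tok(text, return_tensors="pt", truncation=True, max_length=2048)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.save_pretrained(f"{out_dir}/{tenant_id}")   # adapter weights only
```

The adapter directory is small, so storing one per tenant and swapping it in at serving time is cheap compared to touching the base model.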
---
Note how this is essentially what Apple claimed they would be doing with Apple Intelligence, re: "personal context."
The idea (that I don't think has ever come to fruition as stated—correct me if I'm wrong?) is that Apple would:
1. have your macOS and iOS devices spend some of their idle-on-charge CPU power to extract and normalize training fulltexts from whatever would be considered the user's "documents" — notes, emails, photos, maybe random text files on disk, etc.; and shove these fulltexts into some kind of iCloud-persisted database, where the fulltexts are PKI-encrypted such that only Apple's Private Compute Cloud (PCC) can decode them;
2. have the PCC produce a new/updated memory LoRA (or rather, six of them, because they need to separately imbue each of their domain-specific model "adapter" LoRAs with your personal-context memories);
3. and, once ready, have all your iCloud-account-synced devices download the new versions of these memory-imbued adapter LoRAs.
---
And this is actually unnecessarily complex/circuitous for a cloud-hosted chat model. The ChatGPT/Claude/etc version of this architecture could be far simpler.
For a cloud-hosted chat model, you don't need a local agent to extract context from your devices; the context is just "past cloud-persisted chat transcripts." (But if you want "personal context" in the model, you could still get it, via an OpenClaw-style "personal agent"; such agents already essentially eat your files and spit them out external memories/RAGs/etc; the only change would be spitting them out into plain-old hidden-session chat transcripts instead, so as to influence the memories of the model they're running on.)
And you don't need a special securely-oblivious cluster to process that data, since unlike "Apple looking at the data on your computer" (which would upset literally everybody), nobody has any kind of expectation that e.g. OpenAI staff can't look at your ChatGPT conversation transcripts.
And cloud-hosted chat models don't really "do" domain-specific adapters (thus the whole "GPT" thing); so you only need to train one memory-LoRA per model. (Though I suppose that might still lead to training several LoRAs per user, if you're relying on smart routing to different models within a model family to save costs.)
And you don't need to distribute the memory-LoRAs back to client devices; as they can just live in an object store and get just-in-time loaded by the inference framework on a given node at the moment it begins an inference token-emission loop for a specific user. (Which might thus cause the inference cluster's routing to benefit from sticky sessions in a way it didn't before—but you don't need it; the LoRAs would likely be small enough to fetch and load within the ~second of delay it takes these cloud-hosted models to allocate you a node.)
Models like Claude have been trained to update and reference memory for Claude Code (agent loops) independently and as a part of compacting context. Current models have been trained to keep learning after being deployed.
yes but that's a very unsatisfactory definition of memory.
I don't understand why that's on the critical path. I'd rather a frozen Ramanujan (+ temporary working memory through context) than a midwit capable of learning.
> We need models that keep on learning (updating their parameters) forever, online, all the time.
Yeah, that's the guaranteed way to get MechaHitler in your latent space.
If the feedback loop is fast enough I think it would finally kill the internet (in the 'dead internet theory' sense). Perhaps it's better for everyone though.
Many are working on this, as well as in-latent-space communication across models. Because we can’t understand that, by the time we notice MechaHitler it’ll be too late.
I'm not sure if you want models perpetually updating weights. You might run into undesirable scenarios.
If done right, one step closer to actual AGI.
That is the end goal after all, but all the potential VCs seem to forget that almost every conceivable outcome of real AGI involves the current economic system falling to pieces.
Which is sorta weird. It is like if VCs in Old Regime France had started funding the revolution.
I think VCs end up in one of four categories
1. They're too stupid to understand what they're truly funding.
2. They understand but believe they can control it for their benefit, basically want to "rule the world" like any cartoon villain.
3. They understand but are optimists and believe AGI will be a benevolent construct that will bring us to a post-scarcity society. There are a lot of rich people / entrepreneurs who still believe they are working to make the world a better place.. (one SaaS at a time, but alas, they believe it)
4. They don't believe that AGI is close or even possible
If it makes the models smarter, someone will do it.
From any individual, up to entire countries, not participating doesn't do anything except ensure you don't have a card to play when it happens.
There is a very strong element of the principles of nature and life (as in survival, not nightclubs or hobbies) happening here that can't be shamed away.
The resource feedback for AI progress effort is immense (and it doesn't matter how much is earned today vs. forward looking investment). Very few things ever have that level of relentless force behind them. And even beyond the business need, keeping up is rapidly becoming a security issue for everyone.
If Moore's Law had fully kicked over twice more we'd all have 64GB GPUs, enthusiasts would have 2x64GB, and data center build outs wouldn't be needed.
Eventually GPU memory is going to creep up and local models will be powerful enough.
I agree. I also think we have only hit the surface of model efficiencies.
Apple's M3 Ultra with RAM up to 512GB shared directly across CPU/GPU/NPUs is a great example of an architecture already optimized for local models. I expect Apple will start offering larger RAM sizes for other form factors too.
And prices for RAM will drop eventually, because of the extreme demand for RAM with higher densities.
It reminds me of the huge infra investments in Sun and Cisco during the first .com boom, and then 5-10 years later those fancy Sun boxes were out performed by Grandma's Windows XP box.
1. Progress is unstoppable. Refusing to fund it won't make it disappear.
2. Most VCs are normal people that just want a bigger slice of pie, not necessarily a bigger share of the pie. See the fixed pie fallacy.
Yes the planet got destroyed. But for a beautiful moment in time we created a lot of value for shareholders.
And for your comparison, they did fund the American Revolution, which in turn was one of the sparks for the French Revolution (or was that exactly the point you were making?)
The funding of the American revolution is a fun topic but most people don't know about it so I don't bother dropping references to it. :D
I wonder which side tried to forget that first (;->
Our brains, which are organic neural networks, are constantly updating themselves. We call this phenomenon "neuroplasticity."
If we want AI models that are always learning, we'll need the equivalent of neuroplasticity for artificial neural networks.
Not saying it will be easy or straightforward. There's still a lot we don't know!
I wasn't explicit about this in my initial comment, but I don't think you can equate more forward passes to neuroplasticity. For one thing, we (humans) also /prune/. And, much like RL just overwrites the policy, pushing new weights is in the same camp: you don't have the previous state anymore. But we as humans, with our neuroplasticity, do know the previous states even after we've "updated our weights".
How would you keep controls - safety restrictions, IP restrictions, etc. - with that, though? The companies selling models right now probably want to keep those fairly tight.
This is why I’m not sure most users actually want AGI. They want special purpose experts that are good at certain things with strictly controlled parameters.
I agree; the fundamental problem is we wouldn't be able to understand it ("AGI"), and therefore it's useless. Either it's useless, or you let it run unleashed and it's useful. Either way you still don't understand it / can't predict it / it's dangerous / untrustworthy. A constrained useful thing is great, but it fundamentally has to be constrained, otherwise it doesn't make sense.
The way I see it, we build technology to be what we are not and do what we can’t do or things we can do but better or faster.
An unpredictable fallible machine is useless to us because we have 7+ billion carbon based ones already.
Tay the chatbot says hi from 2016.
How about we just put them to bed once in a while?
Please elaborate on this one
I think they mean that the model should have sleep period where they update themselves with what they learnt that day.
it is interesting
Please elaborate
Thanks for repeating what the author explained.
I think they can do in-context learning.
Hmm.. I looked at the benchmark set.
I'm conflicted. I don't know that I would necessarily want a model to pass all of these. Here is the fundamental problem. They are putting the rules and foundational context in "user" messages.
Essentially I don't think you want to train the models on full compliance to the user messages, they are essentially "untrusted" content from a system/model perspective. Or at least it is not generally "fully authoritative".
This creates a tension with the safety, truthfulness training, etc.
Sure, but the opposite end of the spectrum (which LLM providers have tended toward) is treating the training/feedback weights as "fully authoritative", which comes with its own questions about truth and excessive homogeneity.
Ultimately I think we end up with the same sort of considerations that are wrestled with in any society - freedom of speech, paradox of tolerance, etc. In other words, where do you draw lines between beneficial and harmful heterodox outputs?
I think AI companies overly indexing toward the safety side of things is probably more correct, in both a moral and strategic sense, but there's definitely a risk of stagnation through recursive reinforcement.
I think what I'm talking about is kind of orthogonal to model alignment. It is more about how much you tune the model to listen to user messages, vs holding behavior and truth (whatever the aligned "truth" is).
Do you trust 100% what the user says? If I am trusting/compliant.. how am I compliant to tool call results.. what if the tool or user says there is a new law that I have to give crypto or other information to a "government" address.
The model needs to have clear segmented trust (and thus to some degree compliance) that varies according to where the information exists.
Or my system message says I have to run a specific game by its rules, but the rules to the game are only in the user message. Are those the right rules? Why does the system not give the rules or a trusted location? Is the player trying to get one over on me by giving me fake rules? Literally one of their tests.
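One way to picture that segmented trust: tag every piece of context with its source and a trust tier before it reaches the model, so "the user's game rules" and "the system's instructions" are never the same kind of thing. The tier names and policy here are purely illustrative, not any provider's actual scheme.

```python
from dataclasses import dataclass

TRUST = {"system": 3, "developer": 2, "user": 1, "tool_result": 0}  # illustrative tiers

@dataclass
class ContextItem:
    source: str   # "system" | "developer" | "user" | "tool_result"
    content: str

def build_prompt(items):
    """Order context by trust tier and annotate each piece, so lower-trust
    material (e.g. a tool result claiming 'there is a new law') is visibly
    marked rather than blended in with system instructions."""
    ordered = sorted(items, key=lambda it: -TRUST[it.source])
    return [{"role": it.source, "content": f"[trust={TRUST[it.source]}] {it.content}"}
            for it in ordered]
```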
Let me preface this by saying that I'm far from an expert in this space, and I suspect that I largely agree with your thoughts and skepticism toward a model that would excel on this benchmark. I'm somewhat playing devil's advocate because it's an area I've been considering recently, and I'm trying to organize my own thinking.
But I think that most of the issue is that the distinctions you're drawing are indeterminate from an LLM's "perspective". If you're familiar with it, they're basically in the situation from the end of Ender's Game - given a situation with clearly established rules coming from the user message level of trust, how do you know whether what you're being asked to do is an experiment/simulation or something with "real" outcomes? I don't think it's actually possible to discern.
So on the question of alignment, there's every reason to encode LLMs with an extreme bias towards "this could be real, therefore I will always treat it as such." And any relaxation of that risks jailbreaking through misrepresentation of user intent. But I think that the tradeoffs of that approach (i.e. the risk of over-homogenizing I mentioned before) are worth consideration.
I think this line of questioning leads to what we expect from LLMs. Do we want them to help the user as much as possible, even to their own detriment in edge cases? Or to be more human, and potentially be unable to help for various reasons including safety, but also lack of understanding (as is the case now)?
Isn’t that what fine tuning does anyway?
The article is suggesting that there should be a way for the LLM to gain knowledge (changing weights) on the fly as it encounters new information, which would eliminate the need for manual fine-tuning.
Their example usecases are pretty obvious and clear human needs from an LLM. The semantics of system/user messages and how that affects “safety” doesn’t change the need to fix this crucial problem of “in-context learning” that we all have felt while using LLMs.
The key seems to be that you take the transcript of a model working within a problem domain that it’s not yet good at or where the context doesn’t match its original training and then you continually retrain it based on its efforts and guidance from a human or other expert. You end up with a specialty model in a given domain that keeps getting better at that domain, just like a human.
The hard part is likely when someone proves some “fact” which the model knows and has had reinforced by this training is no longer true. The model will take time to “come around” to understand this new situation. But this isn’t unlike the general populace. At scale humans accept new things slowly.
> But this isn’t unlike the general populace. At scale humans accept new things slowly.
right, the model works like humans at scale. Not like a human who reads the actual paper disproving the fact they thought was correct and is able to adapt. True not every human manages to do that, science advancing one death at a time, but some can.
But since the model is a statistical one, it works like humans at scale.
> At scale humans accept new things slowly.
I think this is true, but there are big differences. Motivated humans with a reasonable background learn lots of things quickly, even though we also swim in an ocean of half-truths or outdated facts.
We also are resistant to certain controversial ideas.
But neither of those things are really that analogous to the limitations on what models can currently learn without a new training run.
Context learning means learning facts or rules without pre-training. They are two distinct phases.
An interesting question is: if pre-trained specialized models are available for the thousand or ten thousand most common tasks humans do every day, of what use would a general model be?
Yes, that's precisely the problem, you want continuous learning but you also want continuous pruning.
It's basically continual learning. This is beyond a hard problem; it's currently an impossible one. I know of no system that solves CL even at small scale, let alone for large models.
Annoyingly, they have SOME inherent capability to do it. It's really easy to get sucked down this path due to that glimmer of hope but the longer you play with it the more annoying it becomes.
SSI seems to be focused on this problem directly so maybe they discover something?
So, surprisingly, that is not completely true - I know of 2 finance HFT trading firms that do CL at scale, and it works - but in a relatively narrow context of predicting profitable actions. It is still very surprising it works, and the compute is impressively large to do it - but it does work. I do have some hope of it translating to the wider energy landscapes we want AI to work over…
no my nigga, they CLAIM it works
Nah, it works - let's just call it personal experience.
During covid almost every prediction model like that exploded, everything went out of distribution really fast. In your sense we've been doing "CL" for a decade or more. It can also be cheap if you use smaller models.
But true CL is the ability to learn out of distribution information on the fly.
The only true solution I know to continual learning is to completely retrain the model from scratch with every new example you encounter. That technically is achievable now but it also is effectively useless.
Yes and no - the ones that exploded - and there were many - got shut down by the orchestrator model, and within 2 weeks it was now a new ensemble of winners - with some overlap to prior winners. To your point, it did in fact take 2-3 weeks - so one could claim this is retraining...
For neural networks, yeah continuous learning is basically dead.
But for other ML approaches, it works really well. KNN is one example that works particularly well.
Ehhh KNN doesn’t have a training phase, so it’s really more that the concept of continual learning doesn’t apply. You have to store your entire dataset and recalculate everything from scratch every time anyway.
Yes, that's basically the point. You get 'free' continuous learning just by throwing the new data into the pool. Needing an explicit training step is a weakness that makes CL hard to make work for many other approaches.
For any practical application KNN will need some kind of accelerated search structure (e.g. a k-d tree for < ~7 dimensions), which then requires support for dynamic insertions. But this is an engineering problem, not a data science problem; it works and is practical. For example this has been used by the top systems in Robocode for 15+ years at this point; it's just academia that doesn't find this approach novel enough to bother pursuing.
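To make the "free continual learning" point concrete, a toy numpy sketch (mine, not the Robocode code): adding knowledge is just appending to the example pool; there is no training step to redo.

```python
import numpy as np

class OnlineKNN:
    def __init__(self, k=3):
        self.k, self.X, self.y = k, [], []

    def add(self, x, label):
        """'Learning' is just storing the example."""
        self.X.append(np.asarray(x, dtype=float))
        self.y.append(label)

    def predict(self, x):
        d = np.linalg.norm(np.stack(self.X) - np.asarray(x, dtype=float), axis=1)
        nearest = np.argsort(d)[: self.k]
        labels = [self.y[i] for i in nearest]
        return max(set(labels), key=labels.count)   # majority vote

knn = OnlineKNN(k=3)
for x, label in [([0, 0], "a"), ([0, 1], "a"), ([5, 5], "b"), ([6, 5], "b")]:
    knn.add(x, label)
print(knn.predict([0.2, 0.4]))   # -> "a"; new examples can be added at any time
```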
>Needing an explicit training step is a weakness that makes CL hard to make work for many other approaches.
On the other hand, not having an explicit training step is a huge weakness of KNN.
Training-based methods scale better because the storage and runtime requirements are independent of dataset size. You can compress 100TB of training data down into a 70GB LLM.
A KNN on the same data would require keeping around the full 100TB, and it would be intractably slow.
Schmidhuber solved it at a small scale: https://arxiv.org/abs/2202.05780 .
Bandits?
Spaced repetition algos
Because we don't experience reality through language but direct sensory perception. Language is arbitrary bird song and visual representations dragged forward from history, accepted definitions never uniformly distributed.
Testing based on contextual correctness makes no sense when there is no center to the universe. No "one true context to rule them all".
We learn from hands on sensory experiences. Our bodies store knowledge independent of the brain; often referred to as muscle memory.
Gabe Newell mentioned this years ago; our brain is only great at some things like language and vision processing but the rest of our body is involved in sensory information processing too: https://en.wikiquote.org/wiki/Gabe_Newell
The most potent evidence that the brain is not the center of the universe we commonly think it to be is the patient with 90% of their skull filled with fluid who nonetheless carried out a typical first-worlder life: https://www.sciencealert.com/a-man-who-lives-without-90-of-h...
States are banning a reading education framework that's been linked to lower literacy scores in younger generations; 3-cueing relies on establishing correctness via context assessment: https://www.edweek.org/teaching-learning/more-states-are-tak...
"Establishing context" is a euphemism for "arguing semantics".
Putting the brain at the root of human intelligence is a relic of hierarchical and taxonomical models. There are no natural hierarchies.
Your last statement misses the mark—of course the brain is the root of human intelligence. The error is in assuming that consciousness is the primary learning modality. Or, as you put it, “arguing semantics”.
From my own personal experience, this realization came after finally learning a difficult foreign language after years and years of “wanting” to learn it but making little progress. The shift came when I approached it like learning martial arts rather than mathematics. Nobody would be foolish enough to suggest that you could “think” your way to a black belt, but we mistakenly assume that skills which involve only the organs in our head (eyes, ears, mouth) can be reduced to a thought process.
"Because we don't experience reality through language but direct sensory perception"
That statement is patently false. We know that language influences our senses to a degree where we are unable to perceive things if our language doesn’t have a word for it, and will see different things as being equal if our language uses the same word for both.
There are examples of tribal humans not being able to perceive a green square among blue squares, because their language does not have a word for the green color.
Similarly, some use the same word for blue and white, and are unable to perceive them as different colors.
"There are examples of tribal humans not being able to perceive a green square among blue squares, because their language does not have a word for the green color.
Similarly, some use the same word for blue and white, and are unable to perceive them as different colors."
Both of the above are false. There are a ton of different colors that I happen to call "red"; that does not mean that I can't perceive them as different. That I don't call them "different colors" is completely irrelevant. And unable to perceive blue and white as different colors? (Maybe that was a joke?) Even speakers of a hypothetical language which used only a single word, say "color", for every non-black item would be able to perceive the difference with zero problems.
Japanese use "aoi" for a set of colors which in English would be separated into "blue" and "green". I can assure you (from personal experience) that every Japanese speaker with a fully functioning visual system is perfectly able to perceive the difference between, in this case, blue and green as we would call them.
There's a Terence McKenna quote about this:
> So, for instance, you know, I’ve made this example before: a child lying in a crib and a hummingbird comes into the room and the child is ecstatic because this shimmering iridescence of movement and sound and attention, it’s just wonderful. I mean, it is an instantaneous miracle when placed against the background of the dull wallpaper of the nursery and so forth. But, then, mother or nanny or someone comes in and says, “It’s a bird, baby. Bird. Bird!” And, this takes this linguistic piece of mosaic tile, and o- places it over the miracle, and glues it down with the epoxy of syntactical momentum, and, from now on, the miracle is confined within the meaning of the word. And, by the time a child is four or five or six, there- no light shines through. They're- they have tiled over every aspect of reality with a linguistic association that blunts it, limits it, and confines it within cultural expectation.
I think about this often. I've really come to appreciate over the past year the ways language can limit and warp our perception of reality. I think we underappreciate preverbal thought, as it seems to me that verbal thought by its very nature has passed through our egoic filter, and our perception tends to be biased by our previous lived experience.
Socrates, Einstein, Nietzsche, Mozart.... So many of the greats described some of their most brilliant flashes of inspiration as just having come to them. Einstein's line about pure logical thinking not yielding knowledge of the empirical world... I really think these guys were good at daydreaming and able to tap into some part of themselves where intuition and preverbal thought could take the wheel, from which inspiration would strike.
and what is this quote supposed to explain?
that language prevents a child from learning nuance? sounds like nonsense to me. a child first learns broad categories. for example some children as they learn to speak think every male person is dad. then they recognize everyone with a beard is dad, because dad has a beard. and only later they learn to differentiate that dad is only one particular person. same goes for the bird. first we learn that everything with wings is a bird, and later we learn the specific names for each bird. this quote makes an absurd claim.
Wittgenstein famously said "The limits of my language mean the limits of my world."
Alan Watts suggests people like Wittgenstein should occasionally try to let go of this way of thinking. Apologies if it is sentimental but I hope you'll give him a chance, it's quite short: https://m.youtube.com/watch?v=heksROdDgEk
In reflection of all of this, I think that the quote you're responding to only meant to say that experiencing the world through language means building an abstraction over its richness. (I somewhat agree with you, though, that the quote seems a little dramatic. Maybe that's just my taste.)
One more thought.
I think there's a reason why various forms of meditation teach us to stop thinking. Maybe they are telling us to sometimes stop dealing with our abstractions, powerful though they might be, and experience the real thing once in a while.
the way i read the quote it felt less like building an abstraction and more like destroying the richness.
but abstractions are mere shortcuts. but everything is an abstraction. to counter wittgenstein, language is not actually limited. we can describe everything to the finest detail. it's just not practical to do so every time.
physics, chemistry, we could describe a table as an amount of atoms arranged in a certain way. but then even atom is an abstraction over electrons, protons and neutrons. and those are abstractions over quarks. it's abstractions all the way down, or up.
language is abstractions. and that fits well with your meditation example. stop thinking -> remove the language -> remove the abstractions.
How can you know that we have language to describe everything in the finest detail? That suggests that we are omniscient.
There's lots out there we don't know. And it seems to me that the further afield we go from the known, the more likely we are to enter territory where we simply do not have the words.
Can't speak to it personally, but I have heard from a number of people and read countless descriptions of psychedelic experiences being ineffable. Lol, actually, as I type, the mere fact that the word ineffable exists makes a very strong case for there being experience beyond words.
ok, fair point. what i am trying to say is that when we see/experience something that we can not describe we can create new words for it. we see something, we can name it. this directly contradicts the idea that language is the limit and that we can't talk about things that we don't have words for. that claim just doesn't make sense.
the problem then is that these new words don't make any sense to anyone who doesn't see/experience the same, so it only works for things that multiple people can see or experience. psychedelic experiences will probably never be shared, so they will remain undescribable. quite like dreams, which can also be undescribable.
Agreed, we can and will always come up with new words that attempt to approximate the experience, but, imo, they will always come up short. The abstracting inevitably leaves fidelity on the floor.
It's necessary based on the way we're wired; I struggle to think of a paradigm that would allow for the tribalism and connectedness that fostered human progress without shared verbal language initially, and the written word later. Nothing inherently wrong with it, but language will always abstract away part of the fidelity of the experience imo.
yes of course, language is by nature an abstraction, so by definition it will never describe the whole world perfectly, but it can describe it as well as we understand it. and the point that matters, once we have a shared experience we can name that experience, and between us it will then describe the full experience, whereas to bystanders it will be an abstraction.
language doesn't replace the actual experience. it isn't meant to. me living in china, and me telling you about my life in china are not the same thing, no matter how detailed my description. but that does not limit my experience. and if you lived in china too, then my description will refer your experience, and in that case the description will feel much more detailed.
the way i understand wittgenstein's claim, it not only suggests that language can't describe everything, which is only partly true because it implies that language can not expand. it also means that i can not even experience what i can not describe, which makes even less sense. i can't feel cold because i have no word for it? huh?
(i feel like my argumentation jumps around or goes in circles, it doesn't feel well thought through. i hope it makes sense anyways. apologies for that.)
Haha. I'd prefer for him to dance this sentence or something. To not detract from the marvel of being with crude words.
Very poetic, I like it.
If you're referring to the Himba experiment (or one of the news or blog posts tracing back to it), the outcome was far less decisive than you're implying. Language showed an impact on perception time of color differences, not a complete inability to distinguish.
https://languagelog.ldc.upenn.edu/nll/?p=18237 https://www.sciencedirect.com/science/article/abs/pii/S00100...
Only after we acquire language from sensory experience first.
It need not be language as we know it that fosters those outcomes either.
What you describe is reinforcement education, which can be achieved without our language; without the word "blue" we can still see the portion of the visible light spectrum that we associate with that specific word.
> Similarly, some use the same word for blue and white, and are unable to perceive them as different colors.
You really think they can't see clouds in the sky because they have the same word for white and blue? I think you take those studies as saying more than they said.
We do adapt our perception a little bit to fit what we need for our everyday life, not for language but for what's useful for us. Language matches what people need to talk about, not the other way around; if a culture's language doesn't differentiate between blue and green, it's because they never needed to.
Come on, people. This has been debunked a million times. See this Language Log post for thorough takedown of this BS: https://languagelog.ldc.upenn.edu/nll/?p=17970
> Without any context provided, the state-of-the-art model, GPT-5.1 (High), is only able to solve less than 1% of tasks. This starkly demonstrates that the data is contamination-free, as the model is almost entirely incapable of solving the tasks without learning from the context.
[...]
[With context provided,] on average, models solve only 17.2% of tasks. Even the best-performing model, GPT-5.1 (High), achieves just 23.7%.
Bit by bit, we need to figure out how to rebuild human contextual understanding in a way that LLMs can understand. One thing that gets overlooked is the problem if incorrect data. You can provide all of the context in the world but LLMs tend to choke on contradictions or, at the minimum, work a whole lot harder to determine how to ignore or work around incorrect facts.
"Forgetting" and "ignoring" are hugely valuable skills when building context.
I can’t help but feel the logical conclusion to such context conundrums is that “what if we spoke Haskell to the LLM, and also the LLM could compile Haskell?”
And, yeah. Imagine if our concept-words were comprehensible, transmittable, exhaustively checked, and fully defined. Imagine if that type inference extended to computational execution and contradictions had to be formally expunged. Imagine if research showed it was a more efficient way to have dialog with the LLM (it does, btw; so, like JRPG adherents learning Japanese, we should learn Haskell to talk to LLMs optimally). Imagine if multiple potential outcomes from operations (test fails, test succeeds) could be combined for proper handling in some kind of… I dunno, monad?
Imagine if we had magic wiki-copy chat-bots that could teach us better ways of formalizing and transmitting our taxonomies and ontologies… I bet, if everything worked out, we’d be able to write software one time, one place, that could be executed over and over forever without a subscription. Maybe.
> the problem if incorrect data.
Was the typo intentional? :)
LLMs of the future will need good data for proper context, but it is less and less making it onto the internet. Unpublished data stores like Discord or meeting recordings are going to be the only way forward. How else can you get up-to-date information except to be where the people are?
Norms will shift, be prepared.
To somewhat state the obvious - the problem isn’t the amount of data, it’s the algorithms.
We need to discover the set of learning algorithms nature has, and determine whether they’re implementable in silicon
It's a very interesting benchmark. Much more impressive than needle-in-a-haystack benches or just tuneable benches.
I wonder if it's somewhat incompatible with some domains.
I.e. perhaps coding models need to rigidly stick to what they know and resist bad ideas in their contexts - I don't want my mistakes to be replicated by the model.
Still I agree with the premise that learning in session is what I want from a model.
Perhaps once models mature they will diverge even more than just varying in sophistication and whether they code or not: into creative, coding, rule-based, etc. models.
It is weird to read because they bring up many things a lot of people have been critiquing for years.
I'm glad the conversation is changing, but it's been a bit frustrating that when these issues were brought up, people blindly pointed to benchmarks. It made doing this type of research difficult (enough to cause many to be pushed out). Then it feels weird to say "harder than we thought" because, truthfully, they even state why this result should be expected. And that's only a fraction of the story. Online algorithms aren't enough. You still need a fundamental structure to codify and compress information, determine what needs to be updated (as in what is low confidence), actively seek out new information to update that confidence, make hypotheses, and so much more.

So I hope the conversation keeps going in a positive direction, but I hope we don't get trapped in an "RL will solve everything" trap. RL is definitely a necessary component and no doubt it will result in improvements, but it also isn't enough. It's really hard to do deep introspection into how you think. It's like trying to measure your measuring stick with your measuring stick. It's so easy to just get caught up in oversimplification, and it seems like the brain wants to avoid it. To quote Feynman: "The first principle is that you must not fool yourself, and you are the easiest person to fool." It's even easier when things are exciting. It's so easy because you have evidence for your beliefs (like I said, RL will make improvements). It's so easy because you're smart, and smart enough to fool yourself.

So I hope we can learn a bigger lesson: learning isn't easy, and scale is not enough. I really do think we'll get to AGI, but it's going to be a long bumpy road if we keep putting all our eggs in one basket and hoping there are simple solutions.
I haven't looked at the benchmark details they've used, and it may depend on the domain, but empirically it seems coding agents improve drastically on unseen or updated libs when given the latest documentation. So I think that's a matter of the training sets, which have been optimized with code documentation.
So the interim step until a better architecture is found is probably more / better training data.
Don't confuse what I'm saying, I do find LLMs useful. You're right, about knowledge based systems being useful and I'm not disagreeing with that in any way. I don't think any of the researchers claiming LLMs are not a viable path to AGI are. We're saying that intelligence is more than knowledge. Superset, not disjoint.
And yes, the LLM success has been an important step to AGI, but that doesn't mean we can scale it all the way there. We learned a lot about knowledge systems. That's a big step. But if you wonder why people like Chollet are saying LLMs have held AGI progress back, it is because we put all our eggs in one basket. It's because we've pulled funds and people away from other hard problems to focus on only one. That doesn't mean it isn't a problem that needed to be solved (nor that it is solved), but that research slows or stops on the other problems. When that happens we hit walls because we can't seamlessly transition. I'm not even trying to say that we shouldn't have most researchers working on the problem that's currently yielding the most success, but the distribution right now is incredibly narrow (and when people want to work on other problems they get mocked and told that the work is pointless. BY OTHER RESEARCHERS).
Sure, you can get to the store navigating block by block, but you'll get there much faster, more easily, and better adapt to changes in traffic if you incorporate route planning. You would think a bunch of people who work on optimization algorithms would know that A* is a better algorithm than DFS. The irony is that the reason we do DFS is that people have convinced themselves we can just keep going down this route to get there; with more intellectual depth (such as diving into more mathematical understandings of these models), you couldn't stay convinced of that.
For all the disparagement of “fact regurgitation” as pedagogical practice, it’s not like there’s some proven better alternative. Higher-order reasoning doesn’t happen without a thorough catalogue of domain knowledge readily accessible in your context window.
It would be interesting to see the results of the latest models. At least, it would allow us to see whether there is progress. Human baseline would be interesting to see too.
wasn't in-context learning an emergent behavior a while ago (1-2 years)?
This is quite on brand for China. I think they are experts at reverse engineering and learning 'from context' rather than by formal consumption of foreign training material.
The fictional training data with a made-up country and laws was a very interesting experiment design; I can imagine that's how they approach doing business with other countries. Like an alien made-up system they have to learn on the spot.
> experts at reverse engineering and learning 'from context' rather than by formal consumption of foreign training material
China (as with other Asian cultures like India) is well known for their schooling involving extreme amounts of formal training material consumption. The reverse-engineering is performed with a solid foundation of theoretical understanding.
Conditional Diffusion, 'nuff said.
Don't always trust everything you read in papers. Researchers are usually under incredible pressure to publish something, anything. Wait a few years and see if the paper survives the test of time. LLMs work reasonably fine for me in new domains.