I suspect that at the base level of knowledge there are chunks of concept; let's call them nodes. A node would appear at the weighted center of all the words that are synonyms for that concept.
Each word then has a direction from that node, and when you use words together, there's a vector between them.
It's like the recent thread[1] about autoencoders that could find very low-dimensional representations of lava lamps and other chaotic systems; I think that's how transformers and word2vec operate.
Also, recall that someone built a tool[2] that automatically finds notes sharing n-grams in the background; this is a similar idea.
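A rough sketch of the node idea, with random placeholder vectors standing in for real word embeddings (word2vec/GloVe) and made-up synonym weights:

    import numpy as np

    # Hypothetical stand-ins for pre-trained word vectors; in practice you
    # would load word2vec/GloVe embeddings instead of random values.
    rng = np.random.default_rng(0)
    vocab = ["happy", "glad", "joyful", "sad"]
    emb = {w: rng.normal(size=50) for w in vocab}

    # A concept "node" as the weighted center of its synonyms.
    synonyms = ["happy", "glad", "joyful"]
    weights = np.array([0.5, 0.3, 0.2])  # e.g. usage frequency (made up)
    node = np.average([emb[w] for w in synonyms], axis=0, weights=weights)

    # Each word has a direction from the node, and any two words used
    # together define an offset vector between them.
    direction_from_node = emb["happy"] - node
    offset_between_words = emb["sad"] - emb["happy"]
    print(direction_from_node.shape, offset_between_words.shape)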
A good way to think about knowledge modeling is from [1]. The base element of knowledge there is the knowledge component.
[1] http://pact.cs.cmu.edu/pubs/PSLC-Theory-Framework-Tech-Rep.p...
I have no ML experience so tell me if I’m wrong here.
The argument is that transformers spend most of their compute finding this language subspace. Once the subspace is found, it's very easy to add words/phrases/etc. to the model.
What this is proposing is that we should try to find a better way to represent this subspace.
We have no idea how to represent it at the moment, but maybe transformers can help us figure that out.
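As a toy illustration of what "representing this subspace" could mean, here is the intrinsic-dimension trick of training only a small vector that a fixed random projection maps into the full parameter space; all sizes and values below are made up:

    import numpy as np

    rng = np.random.default_rng(1)

    D = 10_000   # full parameter count (tiny compared to a real transformer)
    d = 32       # dimension of the subspace we actually train in

    theta0 = rng.normal(size=D) * 0.01        # pretrained/initial parameters
    P = rng.normal(size=(D, d)) / np.sqrt(d)  # fixed random projection defining the subspace
    z = np.zeros(d)                           # the only thing we "train"

    def parameters(z):
        # The full parameter vector is constrained to an affine d-dimensional subspace.
        return theta0 + P @ z

    # A gradient step in the subspace: project the full-space gradient down to d dims.
    full_grad = rng.normal(size=D)            # stand-in for a real loss gradient
    z -= 0.1 * (P.T @ full_grad)
    print(parameters(z).shape)                # still D parameters, but only d degrees of freedom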
Not a language subspace specifically, but a [learning, maybe?] subspace for representing data and the relations within it. A scaffolding that is highly transferable between domains.
His idea is that this structure might be similar to the brain's built-in structures, which make learning easy for children. Also, that language materials naturally surface the desirable properties, which is why large language models do so well on other types of problems.
I really like the idea of a learning subspace!
Once the learning scaffolding is modelled, you can build on top of it.
I wonder if the neocortex is the thing that we use together with the learning scaffolding.
Good question. That seems plausible to me as well.
Is anyone aware of a transformer architecture or implementation where every non-linearity is built out of ReLUs and linear operations (e.g. a max, which can be expressed with ReLUs, instead of the softmax)?
In particular, an attention module that satisfies this property?
I did some googling, but I only found papers like this one, https://arxiv.org/abs/2204.07731, which deals with the O() behaviour with respect to window size, not with the activations of the system.
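To be concrete, here is a toy sketch of the kind of attention I mean: a single head in NumPy, with the softmax swapped for a ReLU plus a sequence-length normalisation (one common choice, not taken from any specific paper):

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)  # max(x, 0), i.e. expressible purely with max/linear ops

    def relu_attention(Q, K, V):
        # Standard scaled dot-product scores...
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        # ...but with ReLU instead of softmax; dividing by the sequence length
        # keeps the attention weights at a comparable scale.
        weights = relu(scores) / K.shape[0]
        return weights @ V

    rng = np.random.default_rng(2)
    Q, K, V = (rng.normal(size=(8, 16)) for _ in range(3))  # seq_len=8, d_k=16
    print(relu_attention(Q, K, V).shape)                    # (8, 16)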
An analogy that comes to mind:
It sounds a bit like neural networks are the brain's hardware (at least when squinting hard), and training large language models forces the models to spend a large chunk of their training time/effort re-creating portions of the firmware. Once that's done, training on the language (or anything else) is almost trivial in comparison.
That sounds quite plausible to me. It would also explain why so many people feel like this is almost a taste of AGI.
Beautiful article. I have been using neural networks since the mid-1980s, and I sometimes think of a high-dimensional space representing a neural network and wonder if there are special places in that space. Clearly there must be an astronomically large number of “dead sub-spaces”.
From the article: “Well-trained transformers, regardless of task, occupy the same relatively small subspace of parameter space”
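A toy way to probe for that kind of shared subspace, with synthetic data standing in for the flattened weight vectors of several trained models (just PCA via SVD):

    import numpy as np

    rng = np.random.default_rng(3)

    # Synthetic stand-ins for flattened weight vectors of several trained models,
    # generated to lie near a shared low-dimensional subspace plus a little noise.
    n_models, D, k = 20, 5_000, 10
    basis = rng.normal(size=(k, D))
    W = rng.normal(size=(n_models, k)) @ basis + 0.01 * rng.normal(size=(n_models, D))

    # PCA via SVD of the centered weight matrix.
    Wc = W - W.mean(axis=0)
    _, s, _ = np.linalg.svd(Wc, full_matrices=False)
    explained = (s**2) / (s**2).sum()
    print(np.cumsum(explained)[:k])  # most variance captured by a handful of directions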