mikewarot 6 days ago

I suspect that at the base level of knowledge there are chunks of concept; let's call them nodes. A node would sit at the weighted center of all the words that are synonyms for that concept.

Each word has a direction from that node, and if you use words together, there's a vector between them: a direction relating one to the other.

Like the recent thread[1] about autoencoders that could find very low-dimensional representations of lava lamps and other chaotic systems, I think that's how transformers and word2vec operate.
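
To make the geometry concrete, here's a toy sketch (my own illustration, with random vectors standing in for real word2vec embeddings): the "node" is just the centroid of a synonym cluster, and relations show up as directions between word vectors.

  # Toy sketch: a "concept node" as the centroid of its synonyms' vectors.
  # The embeddings here are random stand-ins, not real word2vec output.
  import torch

  torch.manual_seed(0)
  dim = 8

  # Hypothetical embeddings for synonyms of one concept
  synonyms = {w: torch.randn(dim) for w in ["big", "large", "huge", "enormous"]}

  # The "node": the center of the synonym cluster
  node = torch.stack(list(synonyms.values())).mean(dim=0)

  # Each word's direction away from the node
  directions = {w: torch.nn.functional.normalize(v - node, dim=0)
                for w, v in synonyms.items()}

  # The vector between two words used together is itself a direction (a relation)
  relation = torch.nn.functional.normalize(synonyms["huge"] - synonyms["big"], dim=0)
  print(directions["big"], relation)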

Also, recall that someone built a tool[2] that automatically finds notes sharing n-grams in the background; this is a similar idea.

  1 - https://news.ycombinator.com/item?id=32294403
  2 - https://news.ycombinator.com/item?id=32282260
brad0 6 days ago

I have no ML experience so tell me if I’m wrong here.

The argument is that transformers spend most of their compute finding this language subspace. Once the subspace is found, it's comparatively easy to add words/phrases/etc. to the model.

What this is proposing is that we should try to find a better way to represent this subspace.

We have no idea how to represent it at the moment, but maybe transformers can help us figure that out.
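
For what it's worth, here's a minimal sketch of what "training inside a small subspace of parameter space" can look like, in the spirit of intrinsic-dimension experiments. The sizes, the random projection, and the toy loss are all made up; it's not the article's method, just an illustration of the idea.

  # Optimize only a d-dimensional coordinate z; the full D-dimensional
  # parameters live on the affine subspace theta0 + P @ z.
  import torch

  torch.manual_seed(0)
  D, d = 10_000, 50                       # full vs. subspace dimension (made up)

  theta0 = torch.randn(D) * 0.01          # frozen initialization of the full model
  P = torch.randn(D, d) / d ** 0.5        # fixed random projection defining the subspace
  z = torch.zeros(d, requires_grad=True)  # the only trainable parameters

  target = torch.randn(D)                 # toy objective standing in for a real training loss
  opt = torch.optim.Adam([z], lr=0.1)
  for step in range(200):
      loss = ((theta0 + P @ z - target) ** 2).mean()
      opt.zero_grad()
      loss.backward()
      opt.step()

  print(f"final loss: {loss.item():.4f} (optimized only {d} of {D} dims)")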

  • solarmist 5 days ago

    Not a language subspace specifically, but a [learning, maybe?] subspace for representing data and the relations within it. A scaffolding that is highly transferable between domains.

    His idea is that this structure might be similar to the brain's built-in structures, which make learning easy for children, and that language material naturally surfaces these desirable properties, which is why large language models do so well on other types of problems.

    • brad0 5 days ago

      I really like the idea of a learning subspace!

      Once the learning scaffolding is modelled, you can build on top of it.

      I wonder if the neocortex is the thing that we use together with the learning scaffolding.

      • solarmist 5 days ago

        Good question. That seems plausible to me as well.

freemint 5 days ago

Is anyone aware of a transformer architecture or implementation where every nonlinearity is built from ReLUs and linearities (e.g. max, which can be expressed with ReLUs, instead of softmax)?

In particular, an attention module that satisfies this property?

I did some googling, but I only found papers like this one, https://arxiv.org/abs/2204.07731, which deals with the O() behaviour with respect to the window size, not with the activations of the system.
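
Not an existing implementation, but a small sketch of the pieces you mention: max(a, b) = b + relu(a - b), so max is expressible from ReLU plus linear ops, and the softmax in attention can be swapped for ReLU scores with an L1 normalization. Note the normalization still involves a division, so this is only partway to "everything from ReLUs and linearities".

  import torch
  import torch.nn.functional as F

  def relu_max(a, b):
      # max(a, b) == b + relu(a - b), elementwise
      return b + F.relu(a - b)

  def relu_attention(q, k, v, eps=1e-6):
      # Replace softmax(q k^T / sqrt(d)) with relu(...) plus L1 normalization.
      # Weights stay non-negative, but the division is not itself ReLU/linear.
      d = q.shape[-1]
      scores = F.relu(q @ k.transpose(-2, -1) / d ** 0.5)
      weights = scores / (scores.sum(dim=-1, keepdim=True) + eps)
      return weights @ v

  a, b = torch.randn(5), torch.randn(5)
  assert torch.allclose(relu_max(a, b), torch.maximum(a, b))

  q, k, v = (torch.randn(2, 4, 8) for _ in range(3))
  print(relu_attention(q, k, v).shape)  # torch.Size([2, 4, 8])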

solarmist 5 days ago

An analogy that comes to mind:

It sounds a bit like neural networks are the brain's hardware (at least if you squint hard), and training large language models forces them to spend a large chunk of their training time/effort re-creating portions of the firmware. Once that's done, training on language (or anything else) is almost trivial in comparison.

That sounds quite plausible to me. It would also explain why so many people feel like this is almost a taste of AGI.

mark_l_watson 6 days ago

Beautiful article. I have been using neural networks since the mid-1980s, and I sometimes think of a high-dimensional space representing a neural network and wonder whether there are special places in that space. Clearly there must be an astronomically large number of “dead sub-spaces”.

From the article: “Well-trained transformers, regardless of task, occupy the same relatively small subspace of parameter space”