lucidrains 2 years ago

have an implementation of this over at https://github.com/lucidrains/memorizing-transformers-pytorc..., for any researcher exploring retrieval and memory with attention networks

  • knrz 2 years ago

    Dude, your repos are great, marvellous code quality too for cutting-edge papers. Keep it up!

    • lucidrains 2 years ago

      hey thanks! :^) hope someone makes the next big discovery with them

  • silencedogood3 2 years ago

    Neat! Can you explain what the kNN is doing? I can’t quite follow the paper.

    • visarga 2 years ago

      It's a sparse attention scheme. They store and reuse activations, thus "memorising" the past without the need for training. To keep the sequence short enough to fit into memory, they only recall the k most similar memories from a much larger log.
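
      Roughly, the lookup step looks something like this (a toy PyTorch sketch, not the paper's or lucidrains' actual code; all names here are made up):

        import torch

        # Memory: (key, value) activations saved from past segments, no gradients kept.
        mem_keys = torch.randn(10000, 64)   # (num_memories, dim)
        mem_vals = torch.randn(10000, 64)

        def knn_attend(queries, k=32):
            # queries: (seq_len, dim) for the current segment
            sims = queries @ mem_keys.t()            # similarity to every stored key
            top_sims, idx = sims.topk(k, dim=-1)     # keep only the k best matches per query
            weights = top_sims.softmax(dim=-1)       # attend over the retrieved subset only
            retrieved = mem_vals[idx]                # (seq_len, k, dim)
            return (weights.unsqueeze(-1) * retrieved).sum(dim=-2)

        out = knn_attend(torch.randn(128, 64))       # (128, 64)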

6gvONxR4sf7o 2 years ago

External memory with pretrained models (or more generally, external not-necessarily-differentiable memory) is one of the most exciting areas of ML right now. It opens up models to external things like facts and databases.

  • silencedogood3 2 years ago

    Can you explain what the big deal is? I’m still in the early learning stages.

    • 6gvONxR4sf7o 2 years ago

      As an example, say you want to encode all of the data in Wikipedia with embeddings and train a model to answer questions with that information. Historically, that would mean a model that encodes all of Wikipedia, encodes the question, uses all of encoded Wikipedia to decode an answer, then does backprop through all of that and updates the weights. Then it re-encodes all of Wikipedia with the new weights and does it all over again at each training step, while also somehow holding all of that in GPU memory. Meaning you basically couldn’t do it that way.

      Today, we’re seeing big models that can encode all of Wikipedia in useful ways. If the encodings are “good enough”, then you can encode all of Wikipedia once, before training another model that just has to encode a question, use encoded Wikipedia to decode an answer, and then do backprop through just the answer and question. If Wikipedia changes in the meantime, you can probably just update your database of encoded passages and your learned QA model will be able to incorporate that new information.
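
      A rough sketch of the “encode once, retrieve at training time” idea (hypothetical names and a stand-in encoder, just to illustrate the shape of it):

        import numpy as np

        rng = np.random.default_rng(0)
        passages = ["passage one ...", "passage two ..."]   # stand-in for the Wikipedia corpus

        def frozen_encoder(text):
            # placeholder for a big pretrained embedding model whose weights never change
            return rng.standard_normal(64)

        # One-time, offline step: embed every passage once and store the results.
        index = np.stack([frozen_encoder(p) for p in passages])

        def retrieve(question_embedding):
            scores = index @ question_embedding              # similarity against the fixed index
            return passages[int(scores.argmax())]            # no re-encoding of the corpus needed

      Only the question encoder and answer decoder ever see gradients; if the corpus changes, you just re-embed the affected passages and swap them into the index.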

      • amelius 2 years ago

        Replace Wikipedia with the internet, and you can replace Google Search with some (hopefully) soon-to-be-discovered algorithm based on these principles. Exciting times.

jerpint 2 years ago

The basic idea is to have a (key, value) cache of all the previously seen tokens that gets updated over time. The transformer can decide to do self-attention (and ignore the cache) or focus on elements from the cache (enabling it to attend to previously seen tokens). They mainly apply this to large documents; I'd be very curious to see a follow-up on time-dependent tasks like videos.
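
The "decide between self-attention and the cache" part is essentially a learned gate; something like this toy sketch (simplified shapes, made-up names, not the paper's code) gives the flavour:

    import torch

    def combined_attention(local_out, mem_out, gate_logit):
        # local_out:  ordinary self-attention output over the current segment
        # mem_out:    attention output over the k retrieved (key, value) memories
        # gate_logit: learned scalar deciding how much weight the memory gets
        g = torch.sigmoid(gate_logit)
        return g * mem_out + (1 - g) * local_out

    out = combined_attention(torch.randn(8, 64), torch.randn(8, 64), torch.tensor(0.0))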

shallichange 2 years ago

Top of my head: Rodimus, Bumblebee, Ratchet, Optimus Prime, Laserbeak, Megatron, Astro Train, Jazz

  • lukaszkups 2 years ago

    this is what I was expecting when clicking on this submission

tipsytoad 2 years ago

Could there be any merit in training this on a common-sense dataset such as Cyc?

https://www.lesswrong.com/tag/cyc

  • ipsum2 2 years ago

    Probably not; most common facts (a sand cat is a type of feline) are already known by transformers. Maybe some obscure ones.

mountainriver 2 years ago

Love it! It seems like a lot of the ideas from reinforcement learning are making their way into transformer land and NLP.

blackbear_ 2 years ago

> On benchmarks including code and mathematics, we find that the model is capable of making use of newly defined functions and theorems during test time.

Train on test, improved performance on test. Wow.

  • visarga 2 years ago

    > Wow.

    Transformers are very limited in the size of the attention window. They can take a few thousand tokens at maximum. But your data might not fit into the window, and you also don't want to have to fine-tune the model. This paper offers a solution.

  • spullara 2 years ago

    It isn't being trained on test. That's kind of the point of memory: you can change the memory at will and don't need to train on new information you've never seen before.

jameshart 2 years ago

The ‘ethics’ section seems surprisingly cursory and lacking in references.

“The ability to memorize large databases of facts could have potential ramifications for society, especially if those databases include sensitive personal information or copyrighted works. However, one advantage of using an external memory is that the memory can be easily cleared of all such information”

That’s it? Just ‘may have ramifications’?

No concern that this enables ‘Tay’-like failure modes where a system can be manipulated through input into generating particular output?

Or even just grappling with whether adding ‘memory of experiences’ to a language model might open the door to creating a system that has beliefs or opinions, and whether there might be some ethical concerns with just wiping those out?

  • ipsum2 2 years ago

    That'd be a waste of space. Most transformer models have the same ethical concerns, which have been addressed in countless other papers. Why bother copy-pasting the same essays into every minor tweak of transformers?

  • dotnet00 2 years ago

    The ethics sections for ML papers almost always seem extremely superfluous. It's like asking a CPU designer to talk about the danger that their CPU can run code for computing firing trajectories. It's a paper about providing memory to ML models; it'll have all the possible applications that require memory. What else does one need?

  • kettleballroll 2 years ago

    The ethics section is a tacked-on thing required by some large ML conferences. They're essentially a PR stunt. No ML researcher I know cares about it, or devotes more than the 5 minutes it takes to write some platitudes to the task. There are simply no incentives to write this properly. And quite frankly, I don't think there should be. We are educated, paid and motivated to push the boundaries of research, not to think about all potential fallout (which, let's face it, would usually require a whole additional paper for most meaningful contributions). I don't really see how we could change this.

    Tldr: as a general rule you can ignore the ethics section of ML papers.

    • 6gvONxR4sf7o 2 years ago

      > We are educated, paid and motivated to push the boundaries of research, not to think about all potential fallout

      That’s the whole problem that led to the introduction of these sections.

      • belval 2 years ago

        That's debatable. Would an "ethics" section on the original deepfake paper have changed anything?

        ML research isn't as inaccessible as genetics research; if there's something idiotic that people can do with DL, they will eventually do it. Acting as if having people add a paragraph to their paper where they "reflect" on the consequences will change anything only shows how disconnected you are from reality.

        Research is research; there shouldn't be any "forbidden knowledge". We have laws for a reason.

    • balthigor 2 years ago

      > not to think about all potential fallout

      You're doing it wrong then.

      Ignoring ethics is lazy.

    • YeGoblynQueenne 2 years ago

      >> Tldr: as a general rule you can ignore the ethics section of ML papers.

      More generally still, you can ignore the ethics of ML researchers, pretty much for the same reasons that you can ignore the Great Turnip of Justice in the sky.

  • refulgentis 2 years ago

    I'm not sure it's scientific or helpful to include the risk that a program develops "beliefs" or "opinions", or that terminating the program amounts to "wiping [someone] out".

  • visarga 2 years ago

    > No concern that this enables ‘Tay’-like failure modes where a system can be manipulated through input into generating particular output?

    Isn't that the core idea in prompting and few-shot learning for large language models?

  • changoplatanero 2 years ago

    My feeling is that those topics would be best addressed in a separate paper by authors who have more of a background in ethics.