This paper highlights something that should have been obvious: prediction and retrieval are two sides of the same coin. To predict effectively, you must first identify what's relevant. What's remarkable is that a 0.5B parameter model can perform perfect retrieval over 1M tokens when its natural attention patterns are leveraged properly.
It raises an interesting question: what if we designed architectures explicitly around retrieval capabilities? Transformer architectures were designed for prediction, and retrieval emerged as a byproduct. What would an architecture optimized specifically for retrieval look like?
A lot of money has been spent on building out large-scale RAG systems. If the performance improvements the paper promises hold up, the ramifications will be huge. Exciting to see that the authors are promising to release their code - it will be fun to see how this model performs on consumer hardware.
I think this could be expanded further. You could convert attention traces into a knowledge graph with arbitrary and/or dynamic density. Traversal could also get exotic: zooming in and expanding detail at arbitrary points along the way. And with a common format you could create topic/knowledge trace packs that could be shared, merged (subtracted?), etc.
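Rough toy sketch of what I mean (everything here - the networkx representation, the chunk labels, the thresholding scheme - is my own assumption, not anything from the paper): thresholding the attention matrix gives you the density knob, and re-running with a lower threshold is one way to "zoom in".

```python
import numpy as np
import networkx as nx

def attention_to_graph(attn: np.ndarray, labels: list[str], threshold: float) -> nx.DiGraph:
    """Build a directed graph from an (n, n) attention matrix.

    Lowering `threshold` makes the graph denser; raising it sparsifies it.
    """
    g = nx.DiGraph()
    g.add_nodes_from(labels)
    n = attn.shape[0]
    for i in range(n):
        for j in range(n):
            if i != j and attn[i, j] >= threshold:
                g.add_edge(labels[i], labels[j], weight=float(attn[i, j]))
    return g

# Toy example: three "chunks" attending to each other.
attn = np.array([[0.0, 0.7, 0.1],
                 [0.2, 0.0, 0.6],
                 [0.5, 0.1, 0.0]])
labels = ["intro", "method", "results"]
coarse = attention_to_graph(attn, labels, threshold=0.5)  # sparse overview
fine = attention_to_graph(attn, labels, threshold=0.1)    # "zoomed in" dense view
print(coarse.edges(data=True))
print(fine.edges(data=True))
```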
Using attention for the retrieval of relevant information seems super intuitive. Only feed the model what it deems relevant. Curious about the scenarios where this mechanism misses relevant information.
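The obvious failure mode is a chunk that is relevant but never attended to during the query pass - it gets silently dropped. A minimal sketch of the mechanism as I understand it (chunking, head/layer averaging, and top-k scoring are all my assumptions):

```python
import numpy as np

def select_chunks(attn: np.ndarray, chunk_bounds: list[tuple[int, int]],
                  query_slice: slice, k: int) -> list[int]:
    """attn: (seq, seq) attention weights averaged over heads/layers.

    Scores each context chunk by total attention from the query tokens
    and returns the indices of the k highest-scoring chunks.
    """
    query_rows = attn[query_slice]  # attention from query tokens to everything
    scores = [query_rows[:, start:end].sum() for start, end in chunk_bounds]
    return sorted(np.argsort(scores)[-k:].tolist())

# Toy example: 8-token sequence, last 2 tokens are the query.
rng = np.random.default_rng(0)
attn = rng.random((8, 8))
chunks = [(0, 2), (2, 4), (4, 6)]  # three 2-token context chunks
print(select_chunks(attn, chunks, slice(6, 8), k=2))
```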
Do I understand right that this requires access to the internals of the LLM and can't be used with today's models behind an API like ChatGPT or Claude?
An innovation that only applies to open-weight models running locally would be awesome.
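That's what open weights buy you here: a hosted API returns generated text (and maybe logprobs), while a locally loaded model exposes the per-layer attention tensors this technique needs. Minimal sketch with the Hugging Face transformers library (the model name is just an example small open-weight model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B"  # example only; any local open-weight model works
tok = AutoTokenizer.from_pretrained(name)
# Eager attention so the model can return attention weights.
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# One (batch, heads, seq, seq) tensor per layer -- not available over an API.
print(len(out.attentions), out.attentions[0].shape)
```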