briancleland 4 months ago

This paper highlights something that should have been obvious: prediction and retrieval are two sides of the same coin. To predict effectively, you must first identify what's relevant. What's remarkable is that a 0.5B parameter model can perform perfect retrieval over 1M tokens when its natural attention patterns are leveraged properly.
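As a rough illustration of the idea (not the paper's actual method - the model name, chunking, and the aggregation rule below are all placeholders), you could score context chunks by how much attention the question tokens pay to them:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "Qwen/Qwen2.5-0.5B"  # placeholder choice of a small model
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")

    def retrieve(question, chunks, top_k=3):
        """Rank context chunks by the attention the question tokens pay to them."""
        enc = tok("".join(chunks) + question, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, output_attentions=True)
        # average over layers and heads -> (seq, seq)
        attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]
        q_len = len(tok(question, add_special_tokens=False)["input_ids"])
        q_to_ctx = attn[-q_len:, :-q_len].mean(dim=0)  # question rows -> context columns
        # sum token-level scores per chunk (counts are approximate at chunk boundaries)
        scores, start = [], 0
        for c in chunks:
            n = len(tok(c, add_special_tokens=False)["input_ids"])
            scores.append(q_to_ctx[start:start + n].sum().item())
            start += n
        return sorted(range(len(chunks)), key=lambda i: -scores[i])[:top_k]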

It raises an interesting question: what if we designed architectures explicitly around retrieval capabilities? Transformer architectures were designed for prediction, and retrieval emerged as a byproduct. What would an architecture optimized specifically for retrieval look like?

A lot of money has been spent on building out large-scale RAG systems. If the performance improvements promised by the paper are real, the ramifications will be huge. Exciting to see that the authors are promising to release their code - it will be fun to see how this model performs on consumer hardware.

  • mirekrusin 4 months ago

    I think this could be expanded further. You can convert attention traces to a knowledge graph with arbitrary and/or dynamic density. Traversing it can also be exotic – zooming in/expanding details at arbitrary points during traversal. With a common format you could create topic/knowledge trace packs that could be shared, merged (subtracted?) etc.
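    To make the density idea concrete, here's a toy sketch (the thresholding rule and function names are just my assumptions, nothing from the paper): edges exist wherever averaged attention exceeds a threshold, and "zooming in" is just re-reading a node's neighbourhood at a lower threshold.

        import networkx as nx
        import numpy as np

        def attention_to_graph(attn, labels, threshold=0.05):
            """attn: (seq, seq) averaged attention matrix; labels: token/chunk names."""
            g = nx.DiGraph()
            g.add_nodes_from(labels)
            src, dst = np.where(attn > threshold)  # threshold sets graph density
            for i, j in zip(src, dst):
                if i != j:
                    g.add_edge(labels[i], labels[j], weight=float(attn[i, j]))
            return g

        def zoom(attn, labels, node, threshold=0.01):
            """Expand detail around one node by rebuilding its edges at a lower threshold."""
            i = labels.index(node)
            return [(labels[j], float(attn[i, j]))
                    for j in np.where(attn[i] > threshold)[0] if j != i]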

vignesh865 4 months ago

I read through the paper, and I found the insights to be excellent.

However, regarding the practical implementation, the paper assumes that the questions will be available in advance. For each question, it requires calculating attention scores between the question and the context chunks, which makes it impractical as a replacement for Retrieval-Augmented Generation (RAG). For instance, if there are 1,000 documents, each with 10 chunks, it would be infeasible to compute attention scores between 10,000 chunks and a user query every time.
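To put rough numbers on it (the 1,000 × 10 figure is from my example above; the chunk and query sizes are made up):

    n_chunks = 1_000 * 10        # 1,000 documents x 10 chunks each
    chunk_tokens = 512           # assumed average chunk length
    query_tokens = 50            # assumed query length

    # Embedding-based RAG: chunk vectors are precomputed once, so each query
    # only costs one query embedding plus a cheap vector search.
    rag_tokens_per_query = query_tokens

    # Attention-based scoring: the chunks have to be prefilled together with
    # the query on every request, because the scores depend on the query itself.
    attn_tokens_per_query = n_chunks * (chunk_tokens + query_tokens)

    print(rag_tokens_per_query)    # 50
    print(attn_tokens_per_query)   # 5,620,000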

riddelln 4 months ago

Am I correct in thinking that RAG, or SFT, would still be needed to introduce unseen context to the model?

maalouli 4 months ago

Using attention for the retrieval of relevant information seems super intuitive. Only feed the model what it deems relevant. Curious about the scenarios where this mechanism misses relevant information.

smallnix 4 months ago

Do I understand right that this requires access to the internals of the LLM and cannot be used with today's models behind an API like ChatGPT or Claude?

  • mirekrusin 4 months ago

    An innovation that's only applicable to open-weight models running locally would still be awesome.