From the results in Figure 5, it appears that this would only be advantageous for very long contexts.
In particular, it is slower when used with <30k token context.
High context is pretty normal these days though; as you keep interfacing with the LLMs, the context window just grows. And with MCPs and RAG it is trivial to get 30k+ token contexts in every query.
The system prompt for coding agents is already in the 30k range.
skimmed the paper - how well does this plug into real serving stacks (paged KV, vLLM, speculative decoding, caching)? layer-wise top-k chunk voting sounds compatible, but does it fight with RoPE scaling or sliding-window KV eviction policies?
Seeing frameworks like this pop up reminds me how much the LLM ecosystem is moving toward more modular and hardware-aware solutions. Performance at lower compute cost will be key as adoption spreads past tech giants. Curious to see how devs plug this into real-time apps; so much room for lightweight innovation now.
Love it, they're teaching LLMs how to skim texts properly, which is exactly the right approach for handling long contexts.
Wasn't this the attention sink concept to some degree? I mean, if the latency overhead isn't significant, it doesn't seem out of the realm of possibility that frontier models start adopting it, similar to DeepSeek OCR tech.
High speed improvement (4x) with low quality loss (2%). Sounds promising.