carbocation 6 hours ago

My kingdom for renaming this paper to something like "Tensor Product Attention is a Memory-Efficient Approach for Long-Sequence Language Modeling"

  • Zacharias030 4 hours ago

    If you don’t like the title, wait till you see this acronym: „… we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture…“

    • imjonse 4 hours ago

      There is a famous transformer model from Google named T5, and also S4, S5, and S6 (Mamba) in the LLM space, so it is not unusual naming.

      • svantana 26 minutes ago

        Yes, but T5 is at least a normal acronym: Text-To-Text Transfer Transformer (albeit a bit forced)

      • black_puppydog 2 hours ago

        "... is all you need" isn't unusual either, and yet GGP isn't happy about it (and I understand why)

bbcc90 2 hours ago

(trying to move the critique beyond the title...)

When trying to deploy LLMs with larger context windows in constrained environments, two things start to hurt: a) increased memory footprint from the longer KV cache, and b) slower decode speed due to the longer context window. This paper addresses a) only, which is useful, but we are still left with b) (right?)

  • verdverm 2 hours ago

    The more meaningful contribution may be this (Section 3.4):

    > These variants illustrate TPA’s versatility in balancing memory cost, computational overhead, and representation power. By choosing which dimensions (heads or tokens) remain contextual and adjusting ranks (R_Q, R_K, R_V), TPA unifies multiple existing attention mechanisms—such as MHA, MQA, and GQA—under one framework, while potentially reducing the KV cache size by an order of magnitude during autoregressive inference.
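
    To put a rough number on that "order of magnitude" claim, here is my own back-of-envelope, assuming TPA caches only the rank-R factors instead of the full per-head K and V (the head count, head dim, and ranks below are illustrative, not the paper's exact config):

        # KV cache per token, per layer, in stored numbers
        h, d_h = 32, 128                 # heads, head dim (illustrative)
        R_K = R_V = 2                    # contextual ranks (illustrative)

        mha = 2 * h * d_h                # full K and V: 8192
        tpa = (R_K + R_V) * (h + d_h)    # factored K and V: 640

        print(mha / tpa)                 # ~12.8x smaller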

    re: the title, it might be the true one if their proofs hold up

    ---

    I'm now curious if the "Element-wise Attention is All You Need" preprint can be fit into this framework. Sadly my math is not currently up to the task. It appears to offer even better computational savings during both training and inference while maintaining accuracy, though it has only been tested with a smaller model.

    https://arxiv.org/abs/2501.05730

whymauri 6 hours ago

I really can't with these paper titles anymore, man.

  • magicalhippo 6 hours ago

    There's an Ask HN thread going [1] asking what people have done with small LLMs. This seems like a possible application. I asked Granite 3.1 MoE 3B to generate a title based on the abstract and it came up with:

    Tensor Product Attention: A Memory-Efficient Solution for Longer Input Sequences in Language Models

    Maybe a Greasemonkey script to pass arXiv abstracts to a local Ollama could be something...

    [1]: https://news.ycombinator.com/item?id=42784365
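
    Something like this could serve as the non-Greasemonkey half (untested sketch; it assumes a local Ollama exposing its usual /api/generate endpoint with a Granite MoE model pulled, and the model tag may differ on your install):

        import json, re, urllib.request

        def fetch_abstract(arxiv_id):
            # arXiv's Atom export API; the <summary> element holds the abstract
            url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
            xml = urllib.request.urlopen(url).read().decode("utf-8")
            return re.search(r"<summary>(.*?)</summary>", xml, re.S).group(1).strip()

        def retitle(abstract, model="granite3.1-moe:3b"):
            # Ollama's local generate endpoint, non-streaming
            req = urllib.request.Request(
                "http://localhost:11434/api/generate",
                data=json.dumps({
                    "model": model,
                    "prompt": "Write a plain, descriptive title for a paper with this abstract:\n\n" + abstract,
                    "stream": False,
                }).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            return json.loads(urllib.request.urlopen(req).read())["response"].strip()

        # e.g. the element-wise attention preprint linked elsewhere in this thread
        print(retitle(fetch_abstract("2501.05730")))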

  • wisty 5 hours ago

    Clickbait paper titles considered harmful?

    • gbnwl 5 hours ago

      OK I'll admit I chuckled

  • ilove196884 6 hours ago

    I hate how paper titles are worded like SEO techniques.

    • spiritplumber 5 hours ago

      Turn something into a metric and it will be misused. Ever always was

      • jampekka 28 minutes ago

        Attention is all you need!

    • verdverm 5 hours ago

      This is a riff on the original "Attention is All You Need" paper; there have been a few of these lately.

      • Matthyze 2 hours ago

        A few? A multitude.

        • verdverm 2 hours ago

          This one might be right if they have in fact unified multiple attention approaches into a single framework

          see Section 3.4

  • amelius an hour ago

    And they don't formally show that the titles are correct, therefore I don't think these papers belong in CS.

  • anigbrowl 5 hours ago

    By 2038 all scientific papers will be titled 'Bruh.' While this might at first seem a recipe for confusion, the fundamental interconnectedness of all things, as demonstrated by Ollama (Googol 13), highlights the fact that pretty much any insight is as good as any other, and all are descriptions of the same underlying phenomenon. Freed from constraints like survival or the necessity to engage in economic activity, humanity in the 2030s will mainly devote itself to contemplating amusing but fundamentally interchangeable perspectives within increasingly comfy pleasure cubes.

hangonhn 2 hours ago

For those of us who are lay people outside of machine learning and AI, what was the critical insight that made “attention all you need” in the original Transformer paper?

  • yorwba 29 minutes ago

    The abstract https://arxiv.org/abs/1706.03762 explains it well:

    "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."

    They did not invent attention, but while previous language models had used attention as an auxiliary mechanism, they removed everything but the attention and the models still worked. Really, the title already says it all.

  • danielbln an hour ago

    I believe the insight was the introduction of attention mechanisms that allow the NN to look at all words (well, embeddings) in parallel and make connections between them, instead of processing things purely sequentially.
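
    Roughly, for a single head it boils down to this (a toy NumPy sketch of scaled dot-product attention, ignoring masking, batching, and multiple heads):

        import numpy as np

        def attention(X, Wq, Wk, Wv):
            # X: (seq_len, d_model); every token is projected to a query, key, value
            Q, K, V = X @ Wq, X @ Wk, X @ Wv
            scores = Q @ K.T / np.sqrt(K.shape[-1])   # all pairwise token "relevances" at once
            w = np.exp(scores - scores.max(-1, keepdims=True))
            w /= w.sum(-1, keepdims=True)             # softmax over the whole sequence
            return w @ V                              # each output mixes information from all tokens

        # e.g. a 5-token sequence with d_model = 16
        X = np.random.randn(5, 16)
        W = [np.random.randn(16, 16) for _ in range(3)]
        out = attention(X, *W)                        # (5, 16): one context-aware vector per token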

  • freilanzer an hour ago

    That attention works and is highly parallelisable.

esafak 5 hours ago

Tensor decomposition has traditionally suffered from high computational complexity. Is it an issue here?

  • verdverm 5 hours ago

    My math is rusty, but it looks to have a higher complexity than the original attention. I cannot say if it is an issue. Generally it seems we are willing to spend more computation at training time if it produces better results at inference time. In this case they are reducing the resources needed at inference time (an order of magnitude for the KV cache) or enabling longer sequences given the same resources.

    There's another paper I saw yesterday, "Element-wise Attention is All You Need", which looks like an early preprint, written by a solo author with a single A800 and tested on some smaller problems. If the results hold up for language benchmarks, it could reduce resource requirements during training as well. It looks to have lower complexity when scaling.

    https://arxiv.org/abs/2501.05730

  • davmre an hour ago

    They're not proposing to apply tensor decomposition to an existing collection of weights. It's an architecture in which the K, V, and Q tensors are constructed as a product of factors. The model works with the factors directly and you just need to compute their product on the forward pass (and adjoints on the backwards pass), so there's no decomposition.
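
    A minimal sketch of what I mean, where the rank-R outer-product form and the shapes are my own reading of the abstract rather than the authors' actual code:

        import numpy as np

        h, d_h, R = 32, 128, 2            # heads, head dim, rank (illustrative)
        a = np.random.randn(R, h)         # per-token head-side factors (output of a linear layer)
        b = np.random.randn(R, d_h)       # per-token dim-side factors (likewise)

        # the forward pass just multiplies the factors back together:
        K_t = np.einsum("rh,rd->hd", a, b) / R   # full (h, d_h) key block for this token

        # nothing is ever decomposed after the fact; the cache keeps a and b
        # (R * (h + d_h) = 320 numbers) instead of K_t (h * d_h = 4096 numbers)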

  • absolutelastone 5 hours ago

    Looks like it's just a matrix decomposition in the paper. I'm guessing anyway. These attention papers are always a painful mix of mathematical, quasi-mathematical, and information retrieval jargon.

    There is something in the GitHub repo about higher-order decompositions, but I don't see where the method for factoring is given.

    • verdverm 5 hours ago

      I chuckled when I read, in S-3.1

      > Specifically, for each token t, with a small abuse of notation, we define:

  • dartos 5 hours ago

    At a sniff test it would make sense.

    Trading computational complexity for space.

thunkingdeep 3 hours ago

If you don’t pay to read papers, you don’t get to complain about the titles, imo.

I hate ads, but I’m not paying for YouTube Premium either. That’s how it goes. I get ads.

  • jampekka 23 minutes ago

    Neither the authors nor their institutions get a penny from paywalled papers either. Oftentimes they have to pay to publish. Authors choose the titles; sometimes reviewers (not paid a dime either) can demand changes to the title, and the academic editor (paid zilch too) can require changes.

    This doesn't apply to arXiv though, as it is not peer reviewed nor edited and is funded by various institutions.

    Admittedly the academic publishing system is so corrupt that it's hard to fathom, so it's easy to misunderstand it.

    Typically you do pay for the papers (and publisher profits), either through taxes or inflated product prices.

  • Vampiero 2 hours ago

    > I hate ads, but I’m not paying for YouTube Premium either. That’s how it goes. I get ads.

    No, that's not how it goes. You get uBlock Origin and then you don't get ads. Simple as that.

    If you don't like ads and don't fight against them it means you accept ads and want to see more of them shoved down our collective throat. At least from the perspective of marketers and industries who rely on ads. That's how we ended up in this predicament in the first place. Lazy compliance.

    If Youtube isn't sustainable without ads, it should die so that natural selection can take over and so that a better ecosystem can finally take its place. Every single "Youtuber" hates the platform, mostly because it has zero transparency being a Google product. The viewers hate it too because it constantly takes down their favorite videos and creators and because it's full of ads.

    The only reason it's (still) the main site for hosting videos is quite literally just ad-fueled inertia due to the intrinsic cost of hosting videos. If ads didn't exist the only sustainable solution would be something less centralized like Peertube. And to me that's a desirable outcome.

  • black_puppydog 2 hours ago

    The authors are from, let's see, China and California. Then I guess a good chunk of the HN crowd is entitled to bitch about the title?

cute_boi 5 hours ago

> a novel attention mechanism

Why does every paper have to mention the word "novel"? And these titles are getting crazier day by day.

  • NitpickLawyer 3 hours ago

    If your paper is scored / gated on a "novelty factor" by admission committees, then applicants will overuse that term.

  • verdverm 5 hours ago

    There are a number of papers which aim to improve the attention aspect of models, all being some derivative of the original "Attention is All You Need" paper. A pattern of "'blank' Attention is All You Need" titles has emerged.

  • patrick451 5 hours ago

    Because to publish in a real journal, you typically need both novelty and for your work to be "interesting". The job of the abstract and introduction of a paper (where the word "novel" normally lives) is to sell the reviewer on publishing the paper and to sell you on reading and citing it.

joshdavham 4 hours ago

I'm sorry but can people please stop naming their papers "X is all you need"? It's super annoying.

  • recursive 4 hours ago

    Are you saying... you consider it harmful?

    • pepinator 3 hours ago

      A more precise title would be better, so, yeah, it's harmful.

  • WithinReason an hour ago

    Consider submitting an article titled "all you need is considered harmful"