carbocation 6 months ago

My kingdom for renaming this paper to something like "Tensor Product Attention is a Memory-Efficient Approach for Long-Sequence Language Modeling"

  • Zacharias030 6 months ago

    If you don’t like the title, wait till you see this acronym: „… we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture…“

    • imjonse 6 months ago

      There is a famous transformer model named T5 from Google, and also S4, S5, and S6 (Mamba) in the LLM space, so it is not unusual naming.

      • svantana 6 months ago

        Yes, but T5 is at least a normal acronym: Text-To-Text Transfer Transformer (albeit a bit forced)

      • TeMPOraL 6 months ago

        That it's not unusual tells us that too many researchers in the field are chasing citations and fame at the expense of doing quality work.

        • ben_w 6 months ago

          Mm. That, or they all share a sense of humour/in-jokes: I'm sure I'm not the only one here who immediately thought of "GOTO is all you need" and "Attention considered harmful".

          • TeMPOraL 6 months ago

            Right. But then, what they did to the title to make it collapse down to T6 is even worse than what I did to my nickname back in high school to squeeze in a long-forgotten in-joke about our city's municipal sanitation department (MPO).

          • EGreg 6 months ago

            Ironically, both are true!

      • black_puppydog 6 months ago

        "... is all you need" isn't unusual either, and yet GGP isn't happy about it (and I understand why)

    • superjan 6 months ago

      I propose T-POT (Tensor Product attentiOn Transformer)

bbcc90 6 months ago

(trying to move the critique beyond the title...)

When trying to deploy LLMs with larger context windows in constrained environments, two things start to hurt: a) the increased memory footprint of the longer KV cache, and b) slower decoding due to the longer context window. This paper addresses a) only, which is useful, but we are still left with b) (right?)

  • verdverm 6 months ago

    The more meaningful contribution may be (section 3.4)

    > These variants illustrate TPA’s versatility in balancing memory cost, computational overhead, and representation power. By choosing which dimensions (heads or tokens) remain contextual and adjusting ranks (R_Q, R_K, R_V), TPA unifies multiple existing attention mechanisms—such as MHA, MQA, and GQA—under one framework, while potentially reducing the KV cache size by an order of magnitude during autoregressive inference.
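
    A rough back-of-envelope of what that order of magnitude could look like, if I'm reading Section 3 right (the model shape and ranks below are assumed for illustration, not taken from the paper):

      # per-token KV cache: full multi-head K/V vs. caching only the TPA factors
      n_heads, d_head = 32, 128                        # assumed model shape
      r_k = r_v = 2                                    # assumed TPA ranks
      mha_floats = 2 * n_heads * d_head                # full K and V rows per token
      tpa_floats = (r_k + r_v) * (n_heads + d_head)    # head-side + feature-side factors
      print(mha_floats, tpa_floats, mha_floats / tpa_floats)   # 8192 640 12.8

    With small ranks, the ~10x figure at least looks arithmetically plausible.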

    re: the title, it might be the true one if their proofs hold up

    ---

    I'm now curious if the Element-wise Attention is All You Need preprint can be fit into this framework. Sadly my math is not currently up to the task. It appears to offer even better computational savings during both training and inference while maintaining accuracy, though only tested with a smaller model

    https://arxiv.org/abs/2501.05730

    • hansvm 6 months ago

      EA doesn't quite fit under the same umbrella. EA has a constant cache size (it's just another classical recurrent architecture inspired by approximating transformers), whereas this paper gives speedups to a variety of true attention mechanisms, which still require caches proportional to the sequence length.

      • verdverm 6 months ago

        very succinct and insightful, thank you!

    • ashupadhi01 6 months ago

      Curious to know what mathematics you are comfortable with. If you are able to understand the papers you mentioned, you must be in the 99th percentile.

      • verdverm 6 months ago

        I was never good at proof writing. I found group theory and algebra interesting, topology and analysis eluded me. It's just been a while since I did any serious math thinking

  • llm_trw 6 months ago

    It addresses b) too, since decompositions are always smaller than the original tensor. It's usually the case that memory access is also slower than matrix multiplications so this will be faster. Burning flops to save memory movement.

    • menaerus 6 months ago

      > It's usually the case that memory access is also slower than matrix multiplications so this will be faster. Burning flops to save memory movement.

      I haven't read this paper (yet), but isn't this the case that mostly applies to training and not so much to inference? A good example would be flash-attention: it trades higher flops for better memory utilization, but it's mostly irrelevant in inference workloads.

      • verdverm 6 months ago

        They claim inference-time savings in the KV cache

        • menaerus 6 months ago

          I skimmed through the paper real quickly. There's no performance data on inference speedups in the paper. Only the benchmarks relevant for training.

          They also, interestingly, don't compare against flash-attention. Flash-attention outperforms all of the other attention mechanisms mentioned in the paper: MHA, MQA, GQA, and MLA.

          • apophis-ren 5 months ago

            Flash attention is an implementation trick; you can implement MHA/GQA, for example, with flash attention.

          • verdverm 6 months ago

            They aren't claiming speedups; they're claiming up to an order of magnitude less space needed for the KV cache at runtime. This translates to a smaller GPU, or longer sequences on the same GPU.

            • menaerus 6 months ago

              Under what circumstances can you cut down your LOADs and STOREs from and to main memory by an order of magnitude without observing major improvements in the runtime of a memory-bound algorithm?

              • verdverm 6 months ago

                AI models are compute bound, it's why we use GPUs

                • menaerus 6 months ago

                  Incorrect. Self-attention is a highly parallel algorithm, which makes it a great candidate for being a memory-bound workload once you have enough compute.

                  Both datacenter-grade CPUs and GPUs have enough compute to carry out the self-attention computation, but only the latter has enough high-bandwidth memory to make the algorithm really perform. If that weren't the case, the theory behind flash-attention wouldn't materialize, and it does, the reason being that (main) memory is slow.

                  Deep FFWD networks, OTOH, are compute-bound.

                  • zaptrem 6 months ago

                    Transformers are deep feedforward networks that happen to also have attention. Causal LMs are super memory bound during inference due to kv caching as all of those linear layers need to be loaded onto the core to transform only a single token per step.
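
                    A quick roofline-style sanity check with assumed numbers (7B-ish model, fp16 weights, batch 1) of why single-token decode ends up memory-bound:

                      # batch-1 decode: FLOPs vs. bytes moved per generated token (rough)
                      params = 7e9
                      flops_per_token = 2 * params       # ~2 FLOPs per weight
                      bytes_per_token = 2 * params       # each fp16 weight read once
                      print(flops_per_token / bytes_per_token)   # ~1 FLOP per byte
                      # an A100-class GPU needs on the order of 100+ FLOPs/byte to be
                      # compute-bound, so weight traffic dominates at batch 1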

                    • menaerus 6 months ago

                      And I said something else?

                  • lostmsu 6 months ago

                    Memory bound only applies to low batch size scenarios AFAIK

                    • menaerus 6 months ago

                      This obviously depends on the hardware and the shape of the LLM model itself but, generally speaking, it's quite the opposite. The idea of batching is to grow the compute bandwidth per single request; bigger batch sizes with much more compute will put more stress on the underlying (cache, RAM) memory subsystem, no?

                      For N self-attention layers, there will be N compute (tensor) units doing the computation in parallel. To retire the computation, each compute unit will need to LOAD/STORE from and to the chip memory. At batch size B, this only becomes a bigger scale, e.g. B * (N, LOAD/STORE).

                      • lostmsu 6 months ago

                        If you have a batch of size 1, for every token you need to load the entire model from memory into cache as you go through it. If it is 32 you can produce 32 tokens while doing the same amount of loading from VRAM.

                        • menaerus 6 months ago

                          That's not how it works, because if what you're saying were true then the self-attention memory complexity would be O(1), i.e. regardless of the batch size. This obviously isn't the case, since each batch computation necessitates its own load/store memory bandwidth. I suggest reading one of the transformer papers to really understand how it works.

                          • lostmsu 6 months ago

                            This was a simplification. Of course you need some extra VRAM I/O based on your KV cache size.

                            But assuming your KV cache size is << model size, that simplification is pretty accurate.

                            See, e.g. https://www.databricks.com/blog/llm-inference-performance-en...

                            You can just scroll to the first chart they have that explains the idea.
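
                            To make the simplification concrete, a tiny sketch with assumed numbers (not from the article):

                              # arithmetic intensity grows roughly with batch size
                              params, kv = 7e9, 1e8   # fp16 weights, per-sequence KV cache
                              def intensity(batch):
                                  return (2 * params * batch) / (2 * (params + batch * kv))
                              # intensity(1) ~ 1, intensity(32) ~ 22: batching amortizes
                              # weight reads until per-sequence KV traffic starts to dominate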

  • wseqyrku 6 months ago

    > (trying to move the critique beyond the title...)

    This is kind of a theme on HN now. The top comments are completely beside the point of the article/story/etc.

    • msoad 6 months ago

      I know. It is sad. Naming can also be seen as a way of showing respect to a hugely impactful paper if you want to be positive about it.

whymauri 6 months ago

I really can't with these paper titles anymore, man.

  • magicalhippo 6 months ago

    There's an Ask HN thread going[1] asking about what people have done with small LLMs. This seems like a possible application. I asked Granite 3.1 MOE 3B to generate a title based on the abstract and it came up with:

    Tensor Product Attention: A Memory-Efficient Solution for Longer Input Sequences in Language Models

    Maybe a Greasemonkey script to pass arXiv abstracts to a local Ollama could be something...

    [1]: https://news.ycombinator.com/item?id=42784365

  • wisty 6 months ago

    Clickbait paper titles considered harmful?

    • moffkalast 6 months ago

      Clickbait paper titles cause cancer, study shows

      • hatthew 6 months ago

        Clickbait paper titles cure cancer (if you print them out, set them on fire, and incinerate the cancer cells*)

        *side effects TBD

    • gbnwl 6 months ago

      OK I'll admit I chuckled

    • llm_trw 6 months ago

      Only if the paper is handwritten.

  • anigbrowl 6 months ago

    By 2038 all scientific papers will be titled 'Bruh.' While this might at first seem a recipe for confusion, the fundamental interconnectedness of all things as demonstrated by Ollama (Googol 13) highlights the fact that pretty much any insight is as good as any other, and all are descriptions of the same underlying phenomenon. Freed from constraints like survival or the necessity to engage in economic activity, humanity in the 2030s will mainly devote itself to contemplating amusing but fundamentally interchangeable perspectives within increasingly comfy pleasure cubes.

    • smlacy 6 months ago

      Bruh is all you need

  • ilove196884 6 months ago

    I hate how paper titles are worded like SEO techniques.

    • spiritplumber 6 months ago

      Turn something into a metric and it will be misused. Always has been.

    • verdverm 6 months ago

      This is a riff on the original "attention is all you need" paper; there have been a few of these lately.

      • Matthyze 6 months ago

        A few? A multitude.

        • verdverm 6 months ago

          This one might be right if they have in fact unified multiple attention approaches into a single framework

          see Section 3.4

    • LPisGood 6 months ago

      Having a catchy title is great for shorthand. If it didn’t have such a catchy name I probably wouldn’t remember Flush+Reload, Spectre, or even Attention is All You Need.

    • Upvoter33 6 months ago

      On the one hand, sure, it's dumb.

      But, on the other hand, it's hard to get researchers to read your paper, esp. in fast-moving areas. Every little thing might be the difference between reading the abstract or not. Reading the abstract might lead to reading the intro. And so on.

      So, for better or worse, the competition for human eyeballs is real.

      Ironically, in this case, "attention" is all that the authors want.

  • WesolyKubeczek 6 months ago

    Preach, mate. The bloody Beatles and their bloody catchy refrains and hit songs. They sing “Love is all you need” once, and now it’s everywhere! Can’t hide from it. Even scientific papers! Especially scientific papers!

    Bloody hell and brimstone. Been a crazy 57 and a half years already.

  • amelius 6 months ago

    And they don't formally show that the titles are correct, therefore I don't think these papers belong in CS.

  • sva_ 6 months ago

    And I can't with the constant off-topic meta-discussions about the titles of papers.

  • jdefr89 6 months ago

    Same. At this point “All we need” is about a thousand different things..

    • taylorius 6 months ago

      A catchy title is all you need (to get attention).

    • philipov 6 months ago

      All you need is love.

esafak 6 months ago

Tensor decomposition has traditionally suffered from high computational complexity. Is it an issue here?

  • verdverm 6 months ago

    My math is rusty, but it looks to have a higher complexity than the original attention. I cannot say if it is an issue. Generally it seems we are willing to spend more computation at training time if it produces better results at inference time. In this case they are reducing the resources needed at inference time (an order of magnitude for the KV cache) or enabling longer sequences given the same resources.

    There's another paper I saw yesterday, "Element-wise Attention is All You Need", which looks like an early preprint, written by a solo author with a solo A800, and tested on some smaller problems. If the results hold up for language benchmarks, it could reduce resource requirements during training as well. It looks to have a lower complexity when scaling.

    https://arxiv.org/abs/2501.05730

  • davmre 6 months ago

    They're not proposing to apply tensor decomposition to an existing collection of weights. It's an architecture in which the K, V, and Q tensors are constructed as a product of factors. The model works with the factors directly and you just need to compute their product on the forward pass (and adjoints on the backwards pass), so there's no decomposition.
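
    A minimal sketch of roughly what that looks like for K, with made-up shapes (the actual architecture also factors Q and V and handles positional encoding, so treat this as illustrative only):

      import torch
      import torch.nn as nn

      class FactoredK(nn.Module):
          """Build per-token K heads as a product of two small learned factors."""
          def __init__(self, d_model=512, n_heads=8, d_head=64, rank=2):
              super().__init__()
              self.h, self.d, self.r = n_heads, d_head, rank
              self.a = nn.Linear(d_model, rank * n_heads)   # head-side factor
              self.b = nn.Linear(d_model, rank * d_head)    # feature-side factor

          def forward(self, x):                 # x: (batch, seq, d_model)
              B, T, _ = x.shape
              a = self.a(x).view(B, T, self.r, self.h)
              b = self.b(x).view(B, T, self.r, self.d)
              # only a and b would need to be cached; K is materialized on the fly
              k = torch.einsum('btrh,btrd->bthd', a, b) / self.r
              return k, (a, b)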

  • absolutelastone 6 months ago

    Looks like it's just a matrix decomposition in the paper. I'm guessing anyway. These attention papers are always a painful mix of mathematical, quasi-mathematical, and information retrieval jargon.

    There is something in the GitHub repo about higher-order decompositions. I don't see where the method for factoring is given.

  • dartos 6 months ago

    At a sniff test it would make sense.

    Trading computational complexity for space.

jdefr89 6 months ago

Every day there are literally tons of papers titled "XYZ is All You Need". At this point we apparently need thousands of things…

hangonhn 6 months ago

For those of us who are lay people outside of machine learning and AI, what was the critical insight that made “attention all you need” in the original Transformer paper?

  • yorwba 6 months ago

    The abstract https://arxiv.org/abs/1706.03762 explains it well:

    "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."

    They did not invent attention, but while previous language models had used attention as an auxiliary mechanism, they removed everything but the attention and the models still worked. Really, the title already says it all.

  • danielbln 6 months ago

    I believe the insight was the introduction of an attention mechanism that allows the NN to look at all words (well, embeddings) in parallel and make connections between them, instead of processing things purely sequentially.
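
    Stripped of the multi-head and masking details, the core computation is something like the sketch below, which is why every token can attend to every other token in one parallel step:

      import torch

      def attention(q, k, v):
          # q, k, v: (seq_len, d); each row is one token's projection
          scores = q @ k.T / k.shape[-1] ** 0.5     # compare all pairs of tokens at once
          weights = torch.softmax(scores, dim=-1)   # how strongly each token attends to each other
          return weights @ v                        # weighted mix of the value vectors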

  • imtringued 6 months ago

    I don't remember the contents of that paper, but I can give you some context based on my knowledge of "traditional" theoretical computer science.

    In theoretical CS you have state machines, pushdown automatons and Turing machines. What may surprise you is that the difference between those three does not lie in the way the algorithm for them is represented. They all use state transition diagrams!

    Pushdown automatons are more powerful than state machines, because they have a stack onto which they can push new data, peek at the top of the stack or pop data off the stack.

    Now here is the kicker! How do you get a Turing machine? You take a pushdown automaton and remove the restriction that you can only push, peek or pop at the head! You can now move the head pointing at the stack; your stack has turned into a tape!

    The key difference lies in the datastructure that the transition diagram is manipulating, not the algorithm itself!

    The attention mechanism of the transformer architecture is analogous to a tape that can only be written to once per blank and in theory that alone is enough to emulate a full Turing machine with read, write, rewrite and delete semantics.

  • freilanzer 6 months ago

    That attention works and is highly parallelisable.

AxesPushPatty 6 months ago

Another approach has been to have separate physics-informed neural networks learn the tensor product. They reformulated the initial optimization problem to be structured in terms of tensors. I assume that tensor products could be another factor in improving the actual computations.

https://arxiv.org/abs/2408.13101

cute_boi 6 months ago

> a novel attention mechanism

Why does every paper have to mention this word "novel"? And the titles are getting crazier by the day.

  • patrick451 6 months ago

    Because to publish in a real journal, you typically need both novelty and for your work to be "interesting". The job of the abstract and introduction of a paper (where the word "novel" normally lives) is to sell the reviewer that the paper should be published and to sell you that you should read and cite it.

  • NitpickLawyer 6 months ago

    If your paper is scored / gated on "novel factor" by admission committees, then applicants will over-use that term.

  • verdverm 6 months ago

    There are a number of papers which aim to improve the attention aspect of models, all being some derivative of the original "Attention is All You Need" paper. A pattern of "'blank' Attention is All You Need" has emerged.

  • LPisGood 6 months ago

    This is not new at all, by the way. Bringing novel ideas and techniques is kind of the whole point of research, and explicitly describing what novel thing you did is a good thing to do in an intro/abstract.

sva_ 6 months ago

The main contribution of the paper aside, I have to say that the background section 2 is very neatly and succinctly written.

t_mann 6 months ago

> Because memory consumption grows linearly with sequence length, the maximum context window is limited by practical hardware constraints

I thought the number of parameters grows quadratically with context window length - what do they mean?
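
Working it out with an assumed model shape: the parameter count doesn't depend on the context window at all, the KV cache is what grows linearly with it, and it's the prefill attention compute that grows quadratically.

  # assumed shape: 32 layers, 32 heads, head_dim 128, fp16
  def kv_cache_bytes(seq_len, layers=32, heads=32, d_head=128):
      return 2 * layers * heads * d_head * seq_len * 2       # linear in seq_len

  def prefill_attention_flops(seq_len, layers=32, heads=32, d_head=128):
      return 4 * layers * heads * d_head * seq_len ** 2      # quadratic in seq_len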

joshdavham 6 months ago

I'm sorry but can people please stop naming their papers "X is all you need"? It's super annoying.

  • recursive 6 months ago

    Are you saying... you consider it harmful?

    • joshdavham 6 months ago

      > Are you saying... you consider it harmful?

      No.

    • oliverx0 6 months ago

      I see what you did there

    • pepinator 6 months ago

      A more precise title would be better, so, yeah, it's harmful.

  • edflsafoiewq 6 months ago

    Why? It clearly and instantly communicates the genre of result the paper presents.

  • Deutschland314 6 months ago

    It's clearly a play on words, as it looks like a follow-up paper.

thunkingdeep 6 months ago

If you don’t pay to read papers, you don’t get to complain about the titles, imo.

I hate ads, but I’m not paying for YouTube Premium either. That’s how it goes. I get ads.

  • Vampiero 6 months ago

    > I hate ads, but I’m not paying for YouTube Premium either. That’s how it goes. I get ads.

    No that's not how it goes. You get Ublock Origin and then you don't get ads. Simple as that.

    If you don't like ads and don't fight against them it means you accept ads and want to see more of them shoved down our collective throat. At least from the perspective of marketers and industries who rely on ads. That's how we ended up in this predicament in the first place. Lazy compliance.

    If Youtube isn't sustainable without ads, it should die so that natural selection can take over and so that a better ecosystem can finally take its place. Every single "Youtuber" hates the platform, mostly because it has zero transparency being a Google product. The viewers hate it too because it constantly takes down their favorite videos and creators and because it's full of ads.

    The only reason it's (still) the main site for hosting videos is quite literally just ad-fueled inertia due to the intrinsic cost of hosting videos. If ads didn't exist the only sustainable solution would be something less centralized like Peertube. And to me that's a desirable outcome.

    • philipov 6 months ago

      "The forest must burn to make room for new trees."

  • jampekka 6 months ago

    Authors or their institutions don't get a penny from paywalled papers either. Oftentimes they have to pay to publish. Authors choose the titles; sometimes reviewers (not paid a dime either) can demand changes to the title, and the academic editor (paid zilch too) can require changes.

    This doesn't apply to arXiv though, as it is not peer reviewed nor edited and is funded by various institutions.

    Admittedly the academic publishing system is so corrupt that it's hard to fathom, so it's easy to misunderstand it.

    Typically you do pay for the papers (and publisher profits), either through taxes or inflated product prices.

  • black_puppydog 6 months ago

    The authors are from, let's see, China and California. Then I guess a good chunk of the HN crowd is entitled to bitch about the title?