rikimaru0345 20 minutes ago

Ok, I've read the paper, and now I wonder: why did they stop at the most interesting part?

They did all that work to figure out that learning "base conversion" is the difficult thing for transformers. Great! But then why not take that last remaining step to investigate why that specifically is hard for transformers? And how to modify the transformer architecture so that this becomes less hard / more natural / "intuitive" for the network to learn?

  • embedding-shape 3 minutes ago

    Why release one paper when you can release two? Easier to get citations if you spread your efforts, and if you're lucky, someone needs to reference both of them.

    A more serious answer might be that it was simply out of scope of what they set out to do, and they didn't want to fall into scope creep, which is easier said than done.

niek_pas an hour ago

Can someone ELI5 this for a non-mathematician?

  • esafak an hour ago

    The model partially solves the problem but fails to learn the correct loop length:

    > An investigation of model errors (Section 5) reveals that, whereas large language models commonly “hallucinate” random solutions, our models fail in principled ways. In almost all cases, the models perform the correct calculations for the long Collatz step, but use the wrong loop lengths, by setting them to the longest loop lengths they have learned so far.

    The article is saying the model struggles to learn a particular integer function. https://en.wikipedia.org/wiki/Collatz_conjecture
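
    If I'm reading the paper right, the "long Collatz step" they train on is roughly the following (a sketch of my understanding, not the authors' code); the "loop length" is the number of halvings after 3n+1:

        def long_collatz_step(n):
            """Map an odd n to the next odd number in its Collatz sequence."""
            assert n % 2 == 1
            m = 3 * n + 1
            loop_length = 0              # the part the models reportedly get wrong
            while m % 2 == 0:
                m //= 2
                loop_length += 1
            return m, loop_length

        # long_collatz_step(7) == (11, 1); long_collatz_step(5) == (1, 4)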

    • spuz 33 minutes ago

      That's a bit of an uncharitable summary. In bases 8, 12, 16, 24 and 32, their model achieved 99.7% accuracy, and they would never expect it to reach 100%. It would be like training a model to predict whether or not a given number is prime: a model that was 100% accurate would defy mathematical knowledge, but one that was 99.7% accurate would certainly be impressive.

      In this case, they show that the model works by categorising inputs into a number of binary classes which just happen to be very good predictors for this otherwise random-seeming sequence. I don't know whether or not some of these binary classes are new to mathematics, but either way, their technique does show that transformer models can be helpful in uncovering mathematical patterns, even in functions that are not continuous.
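
      My guess at what those binary classes look like (not from the paper, just the obvious candidate): group odd inputs by their last k bits; whenever the loop length is below k, every number in a class needs exactly the same number of halvings, so the class predicts the answer perfectly.

          from collections import defaultdict

          def halvings(n):                  # loop length of the long Collatz step
              m, v = 3 * n + 1, 0
              while m % 2 == 0:
                  m //= 2
                  v += 1
              return v

          classes = defaultdict(set)
          for n in range(1, 100001, 2):             # odd n only
              classes[n % 256].add(halvings(n))     # class = last 8 bits of n

          # Every class whose loop length is below 8 maps to a single value:
          print(all(len(v) == 1 for v in classes.values() if min(v) < 8))  # True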

      • jacquesm 22 minutes ago

        A pocket calculator that gave the right numbers 99.7% of the time would be fairly useless. The lack of determinism is a problem, and there is nothing 'uncharitable' about that interpretation. It is definitely impressive, but it is fundamentally broken, because when you start making chains of things that are 99.7% correct, you end up with garbage after very few iterations. That's precisely why digital computers won out over analog ones: the fact that they are deterministic.
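
        Back-of-the-envelope, assuming the per-step errors are independent:

            # Compounding a 99.7% per-step accuracy over a chain of steps:
            for steps in (10, 100, 231, 1000):
                print(steps, round(0.997 ** steps, 3))   # 0.97, 0.74, 0.5, 0.05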

        • pixl97 3 minutes ago

          Why do people keep using LLMs as algorithms?

          LLMs are not calculators. If you want a calculator use a calculator. Hell, have your LLM use a calculator.

          >That's precisely why digital computers won out over analog ones, the fact that they are deterministic.

          I mean, no, not really: digital computers are far easier to build and far more multi-purpose (and technically the underlying signals are analog).

          Again, if you have a deterministic solution that is 100% correct all the time, use it; it will be cheaper than an LLM. People use LLMs because there are problems that are either not deterministic or whose deterministic solution uses more energy than will ever be available in our local part of the universe. Furthermore, a lot of AI (not only LLMs) uses random noise at particular steps as a means of escaping local maxima.

        • fkarg 9 minutes ago

          Yeah, it's only correct in 99.7% of cases, but what if it's also 10,000 times faster? There are plenty of scenarios where that combination provides a lot of value.

  • poszlem an hour ago

    A transformer can. Here's Gemini:

    The Experiment: Researchers trained AI models (Transformers) to solve a complex arithmetic problem called the "long Collatz step".

    The "Language" Matters: The AI's ability to solve the problem depended entirely on how the numbers were written. Models using bases divisible by 8 (like 16 or 24) achieved nearly 100% accuracy, while those using odd bases struggled significantly.

    Pattern Matching, Not Math: The AI did not learn the actual arithmetic rules. Instead, it learned to recognize specific patterns in the binary endings of numbers (zeros and ones) to predict the answer.

    Principled Errors: When the AI failed, it didn't hallucinate random answers. It usually performed the correct calculation but misjudged the length of the sequence, defaulting to the longest pattern it had already memorized.

    Conclusion: These models solve complex math by acting as pattern recognizers rather than calculators. They struggle with the "control structure" (loops) of algorithms unless the input format reveals the answer through shortcuts.
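
    To make the "Language Matters" point concrete (my own illustration, not Gemini's or the paper's):

        # The last digit of n written in base b is n % b, so when 8 divides b
        # that one digit already fixes n mod 8, which tells you whether the
        # next Collatz step needs one, two, or at least three halvings.
        for n in (13, 23):                  # both end in '3' in base 10...
            print(n, n % 10, n % 16, n % 8)
        # ...yet 13 % 8 == 5 and 23 % 8 == 7: the decimal ending hides the
        # relevant bits, while the hex endings ('d' vs '7') expose them.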

    • embedding-shape 32 minutes ago

      Do you think maybe OP would have asked a language model themselves if they had wanted a language model's answer? Or, in your mind, does the parent not know about LLMs, and this is your way of introducing them to a completely new concept?

      • NitpickLawyer 22 minutes ago

        Funny that the "human" answer above took two people to be "complete" (i.e. an initial answer, followed by a correction and expansion of concepts), while the LLM one gave mostly the same explanation, but complete and in a single answer.

        • embedding-shape 15 minutes ago

          Maybe most of us here aren't just after whatever answer to whatever question; the human connection matters too, knowing that we're speaking with real humans who have real experience with real situations.

          Otherwise I'd just be sitting chatting with ChatGPT all day instead of wast...spending all day on HN.

Onavo 17 minutes ago

Interesting, what about the old proof that neural networks can't model arbitrary length sine waves?