> We study these questions by contrasting a standard fine-tuned model (SFT), which fails at multiplication, with a model trained with implicit chain-of-thought (ICoT) (Deng et al., 2024; 2023), which succeeds. ICoT trains with explicit chain-of-thought tokens but gradually removes them over the course of training, forcing the model to internalize the intermediate steps in its latent states.
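To make the token-removal curriculum concrete, here is a minimal sketch in Python. It assumes a stagewise schedule that drops a fixed number of leading chain-of-thought tokens per stage; the function name, the per-stage removal rate, and the toy multiplication example are all illustrative, not taken from the paper.

```python
# Hypothetical sketch of an ICoT-style curriculum: each stage deletes more
# leading chain-of-thought (CoT) tokens from the training targets, until
# only the question and the final answer remain.

def icot_targets(question, cot_tokens, answer, stage, removed_per_stage=1):
    """Build the supervision sequence for a given curriculum stage."""
    # Drop the first `stage * removed_per_stage` CoT tokens.
    k = min(stage * removed_per_stage, len(cot_tokens))
    remaining_cot = cot_tokens[k:]
    # With fewer explicit steps available, the model must carry the
    # removed reasoning in its latent states instead.
    return question + remaining_cot + answer

# Example: by the final stage the target is just question -> answer.
q   = ["12", "*", "34", "="]
cot = ["12*4=48", "12*30=360", "48+360=408"]
ans = ["408"]
for stage in range(4):
    print(stage, icot_targets(q, cot, ans, stage))
```

The last stage's targets contain no intermediate steps at all, which is what distinguishes the final ICoT model from an SFT model that never saw chain-of-thought supervision in the first place.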