What report left Elon Musk amazed?



Moonshot AI (the Kimi team) recently released a striking technical report, "Attention Residuals," which directly upgrades the residual connections Transformers have used for nearly 10 years. The result? Even Elon Musk commented on it, expressing his amazement.

The core idea of this paper can be summed up in one sentence:
"Stop having every layer mindlessly add information from all previous layers with equal weight. Let the model learn to use attention to pick which early layer signals are actually useful!"

In traditional Transformers (PreNorm structure), the output of each layer is:
x_l = x_{l-1} + Sublayer(LayerNorm(x_{l-1}))

Simple and brute-force: whether or not the information from the previous 100 layers is useful, it all gets added together with equal weight. As depth grows, important early signals get diluted by countless subsequent layers until they nearly disappear (the paper calls this "PreNorm dilution" or "representational dilution").
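A toy sketch of this dilution effect (illustrative only, not the paper's code): with a plain PreNorm residual stream, each layer stacks its update on top of the running sum, so the very first layer's contribution becomes a shrinking fraction of the final hidden state's norm.

```python
import numpy as np

# Toy PreNorm residual stream (hypothetical dimensions, random "sublayers").
rng = np.random.default_rng(0)
d, depth = 64, 48

def layer_norm(x):
    return (x - x.mean()) / (x.std() + 1e-6)

def sublayer(x, w):
    # stand-in for an attention/MLP block: a fixed random linear map
    return w @ layer_norm(x)

x = rng.normal(size=d)
first_contrib = None
for l in range(depth):
    update = sublayer(x, rng.normal(size=(d, d)) / np.sqrt(d))
    if l == 0:
        first_contrib = update.copy()
    x = x + update  # uniform-weight residual add: every layer counts equally

# Layer 1's signal ends up as a small fraction of the final stream's norm.
ratio = np.linalg.norm(first_contrib) / np.linalg.norm(x)
print(f"layer-1 share of final norm: {ratio:.3f}")
```

With 48 layers, the first update accounts for only a small slice of the final norm, which is the dilution the report is addressing.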

The Kimi team simply replaced that "+" sign with a lightweight cross-layer attention mechanism (depth-wise attention):

The new formula roughly looks like this (simplified):
x_l = Attention(Q = x_l^{pre}, K = summaries of all previous layers, V = the corresponding values) + other components

Their more practical implementation is called Block AttnRes: every few layers (such as every 8-16 layers), create a summary of key/value, then use attention to select these summaries instead of computing attention at every single layer. This adds minimal memory and computation overhead (inference latency <2%), but the performance gains are substantial.

Their experimental results (using their own Kimi Linear series models, 48B total / 3B active):
• At matched FLOPs, performance equivalent to a 1.25x compute advantage
• Notable improvements in long sequence inference and complex multi-step reasoning tasks
• More stable hidden state magnitude (norm), unlike traditional residual connections that either explode or decay with depth
• More uniform gradient propagation, making deeper layers easier to train


So why such a strong reaction from Musk?

"Residual connections have gone untouched for eight years, and finally someone dared to improve them, so elegantly and with such strong results?!"

Why does this matter? Because residual connections are practically the only lifeline that allows Transformers to train to 100+ layers, even thousands of layers. Everyone thought they were already optimal with no room for improvement. But Kimi, using the most familiar attention mechanism, turned around and fixed residual connections' own problems. It's like taking the phrase "attention is all you need" to a whole new level.

There are already Rust implementations (based on the burn framework) and various explanatory visualizations flooding X. Some say this is, after DeepSeek's mHC, another genuine architectural innovation that will make it into the next generation of open-source and closed-source large language models.

If you're working on large language models or training your own LLM, this report is worth reading overnight, along with the code (already open-sourced on GitHub).

Report:
Get ready to be amazed. 🚀