skip to content
fdke.vin

Attention Residuals

Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer’s contribution.

From Attention Residuals by Kimi Team.

Impressive.