✨ ICLR 2026

NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks

New York University
NerVE Framework Overview

Figure: NerVE tracks eigenspectrum dynamics at pre-activation (after Wup, before σ) and post-activation (after σ, before Wdown) points in each FFN layer, computing four complementary metrics: Spectral Entropy (SE) for dispersion, Participation Ratio (PR) for effective dimensionality, Eigenvalue Early Enrichment (EEE) for top-heaviness, and Jensen–Shannon Divergence (JS) to quantify the distributional shift, characterizing how nonlinearities restructure the latent geometry.

Abstract

We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Despite FFNs dominating the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our key insight is that FFN nonlinearities reinject variance across eigenmodes, fundamentally governing latent dimension utilization, and that optimizer geometry strongly modulates the extent of this variance reinjection.
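The four metrics can be sketched directly from a covariance eigenspectrum. The forms below are the standard definitions of spectral entropy, participation ratio, and JS divergence; the Eigenvalue Early Enrichment form (top-`k` variance fraction) and the cutoff `k` are assumptions for illustration, not necessarily the paper's exact definition:

```python
import numpy as np

def spectral_entropy(eigs):
    """Shannon entropy of the normalized eigenspectrum (dispersion)."""
    p = eigs / eigs.sum()
    return -np.sum(p * np.log(p + 1e-12))

def participation_ratio(eigs):
    """PR = (sum λ)^2 / (sum λ^2): effective number of active dimensions."""
    return eigs.sum() ** 2 / np.sum(eigs ** 2)

def early_enrichment(eigs, k=10):
    """Fraction of total variance in the top-k eigenvalues (top-heaviness).
    NOTE: assumed form; the paper's exact EEE definition may differ."""
    s = np.sort(eigs)[::-1]
    return s[:k].sum() / s.sum()

def js_divergence(eigs_a, eigs_b):
    """Jensen-Shannon divergence between two normalized eigenspectra
    of equal length (e.g. pre- vs post-activation)."""
    p = eigs_a / eigs_a.sum()
    q = eigs_b / eigs_b.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + 1e-12) / (b + 1e-12)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A flat spectrum maximizes SE and PR and minimizes EEE; a collapsed, top-heavy spectrum does the opposite, which is what makes the four metrics complementary views of the same geometry.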

We validate NerVE across model scales and diverse architectural and optimizer configurations, each of which uniquely shapes FFN dynamics: normalization schemes control variance flow; FFN weight geometries constrain the latent space; positional encodings and activation functions regulate information flow; and optimizer choices redistribute effective capacity across depth. Across these settings, NerVE consistently recovers stable spectral signatures that correlate with a model's generalization ability and respond predictably to design choices. The framework generalizes beyond Transformers to MLP-Mixer architectures, providing actionable guidance for architectural and optimizer decisions beyond trial-and-error.

What do FFN nonlinearities actually do?

GELU and ReLU eigenspectrum dynamics

Figure: Eigen-metrics (SE, PR, EEE, and JS) illustrate how FFN nonlinearities regulate information flow and reshape the eigenspectrum during training for GELU (top) and ReLU (bottom). Pre- and post-activation dynamics are shown for SE, PR, and EEE, highlighting how nonlinearities reinject variance and alter spectral structure. JS heatmaps (rightmost) capture the layer-wise distributional shift induced by the nonlinearity. In-panel titles report Pearson correlations (r) between each metric and evaluation loss, shown as orange curves.

Attention collapses rank, but what fights back? Dong et al. (ICML 2021) proved that pure self-attention drives representations toward a rank-1 matrix doubly exponentially with depth, a phenomenon they termed token uniformity. They identified a counteracting force: FFNs slow this collapse. But their analysis stopped there; the FFN side of this tug-of-war was characterized only as a passive brake on convergence speed, not as a mechanism in its own right. NerVE reveals what is actually happening on the other side: FFN nonlinearities actively reinject variance into under-utilized eigenmodes, inflating effective rank through structured spectral reorganization, not merely slowing collapse, but reversing it.

FFN nonlinearities reinject variance into inactive directions, inflating the effective rank. Before the nonlinearity fires, the FFN eigenspectrum is dominated by a handful of principal directions; most of the latent space is effectively unused. After the nonlinearity, the spectrum flattens: variance redistributes across many more dimensions, and previously silent directions begin carrying information. This is not a simple rescaling; the nonlinearity fundamentally reorganizes the eigenspectrum.
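This rank inflation is easy to reproduce in a synthetic sketch that assumes nothing about the trained models: draw Gaussian samples confined to a low-rank subspace (a stand-in for a collapsed pre-activation state), apply ReLU, and compare participation ratios. The pointwise nonlinearity pushes the representation out of the original subspace, so previously silent eigenmodes pick up variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5000, 256, 8  # samples, FFN width, pre-activation rank

# Pre-activation states confined to an r-dimensional subspace of R^d.
pre = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
post = np.maximum(pre, 0.0)  # ReLU

def pr(x):
    """Participation ratio of the (centered) sample covariance of x."""
    x = x - x.mean(axis=0)
    eigs = np.clip(np.linalg.eigvalsh(x.T @ x / len(x)), 0.0, None)
    return eigs.sum() ** 2 / np.sum(eigs ** 2)

print(f"PR before ReLU: {pr(pre):.1f}")   # bounded above by r
print(f"PR after  ReLU: {pr(post):.1f}")  # noticeably larger
```

The effect here comes purely from the nonlinearity, with no training involved; NerVE's contribution is tracking how this reinjection evolves over training and depth in real LLMs.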

Four metrics decompose this rank inflation into its geometric constituents. Spectral Entropy (SE) and Participation Ratio (PR) both rise after activation, indicating broader variance distribution and higher effective dimensionality. Eigenvalue Early Enrichment (EEE) drops, confirming a less top-heavy spectrum. Jensen-Shannon (JS) heatmaps reveal where, across depth, this redistribution is strongest: a structured, depth-localized transition band rather than a uniform effect. GELU and ReLU follow the same trajectory but differ in pace: GELU explores a broader subspace, which correlates with lower perplexity. All four metrics correlate strongly with evaluation loss (|r| > 0.92).
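The reported correlations are plain Pearson r between a metric's training trajectory and the evaluation-loss curve, computed over checkpoints. A sketch on synthetic stand-in series (the shapes and numbers here are illustrative only, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
steps = np.arange(50)

# Hypothetical per-checkpoint series: post-activation PR rises during
# training while evaluation loss falls (plus a little noise).
pr_series = 40 + 20 * (1 - np.exp(-steps / 15))
eval_loss = 4.0 - 1.5 * (1 - np.exp(-steps / 15)) + 0.01 * rng.standard_normal(50)

r = np.corrcoef(pr_series, eval_loss)[0, 1]  # Pearson r
print(f"r = {r:.3f}")  # strongly negative: PR up tracks loss down
```

Metrics that rise with dimension utilization (SE, PR) correlate negatively with loss, while EEE, which falls as the spectrum flattens, correlates positively.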

BibTeX

@inproceedings{jha2026nerve,
  title={NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks},
  author={Nandan Kumar Jha and Brandon Reagen},
  booktitle={The Fourteenth International Conference on Learning Representations (ICLR)},
  year={2026}
}