✨ ICLR 2026

NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks

New York University
NerVE Framework Overview

Figure: NerVE tracks eigenspectrum dynamics at pre-activation (after W_up, before σ) and post-activation (after σ, before W_down) points in each FFN layer, computing four complementary metrics: Spectral Entropy (SE) for dispersion, Participation Ratio (PR) for effective dimensionality, Eigenvalue Early Enrichment (EEE) for top-heaviness, and Jensen-Shannon Divergence (JS) to quantify the distributional shift, characterizing how nonlinearities restructure the latent geometry.

Abstract

We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Despite FFNs dominating the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our key insight is that FFN nonlinearities reinject variance across eigenmodes, fundamentally governing latent dimension utilization, and that optimizer geometry strongly modulates the extent of this variance reinjection.

We validate NerVE across model scales and diverse architectural and optimizer configurations, each of which uniquely shapes FFN dynamics: normalization schemes controlling variance flow; FFN weight geometries constraining the latent space; positional encodings and activation functions regulating information flow; and optimizer choices redistributing effective capacity across depth. Across these settings, NerVE consistently recovers stable spectral signatures that correlate with the model's generalization ability and respond predictably to design choices. The framework generalizes beyond transformers to MLP-Mixer architectures, providing actionable insights for architectural and optimizer choices beyond trial-and-error.

What do FFN nonlinearities actually do?

GELU and ReLU eigenspectrum dynamics

Figure: Eigen-metrics (SE, PR, EEE, and JS) illustrate how FFN nonlinearities regulate information flow and reshape the eigenspectrum during training for GELU (top) and ReLU (bottom). Pre- and post-activation dynamics are shown for SE, PR, and EEE, highlighting how nonlinearities reinject variance and alter spectral structure. JS heatmaps (rightmost) capture the layer-wise distributional shift induced by the nonlinearity. In-panel titles report Pearson correlations (r) between each metric and evaluation loss, shown as orange curves.

Attention-induced rank collapse. Dong et al. (ICML 2021) showed that self-attention has a strong inductive bias toward token uniformity: a pure self-attention network (with skip connections and FFNs disabled) loses expressive power doubly exponentially with depth. They observed a tug-of-war between self-attention and FFN nonlinearities: attention collapses rank, while the FFN nonlinearity somehow fights back and keeps transformer networks alive. However, the mechanism of this rank inflation through the FFN nonlinearity has not been well understood, and its precise role has not been quantified. NerVE provides the quantitative answer.

Nonlinearity-induced rank inflation. We show that FFN nonlinearities actively reinject variance into under-utilized directions of the latent space, reawakening dimensions that would otherwise remain inactive, a process we term nonlinearity-induced rank inflation. This is not a passive rescaling; the nonlinearity fundamentally reorganizes the eigenspectrum, flattening its top-heavy structure by spreading variance across a broader set of directions.

NerVE tracks this mechanism through four complementary metrics. Spectral Entropy (SE) and Participation Ratio (PR) both rise after activation, indicating broader variance distribution and higher effective dimensionality. Eigenvalue Early Enrichment (EEE) drops, confirming that the spectrum becomes less top-heavy. Jensen-Shannon divergence (JS) heatmaps reveal where this redistribution is strongest across depth and training: a structured, depth-localized transition band rather than a uniform effect.

NerVE metrics summary table

Table: Summary of NerVE's four complementary eigen-metrics, their inputs, ranges, spectral sensitivities, and what each captures about the latent-space geometry. SE, PR, and EEE characterize a single spectrum, while JS quantifies the information-theoretic distance between the pre- and post-activation spectra, characterizing the nonlinearity-induced geometric transformation. Here, λ denotes the raw eigenvalues, λ̂ the normalized eigenvalues, and D the FFN hidden dimension.
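For concreteness, the following is a minimal NumPy sketch of how these quantities could be computed from the eigenvalues of an FFN activation covariance matrix. It is not the paper's reference implementation: the log D scaling of SE and the top_frac definition of "early" eigenvalues in EEE are illustrative assumptions.

import numpy as np

def eigenspectrum(acts):
    """Eigenvalues (descending) of the covariance of (tokens x D) activations."""
    acts = acts - acts.mean(axis=0, keepdims=True)
    cov = acts.T @ acts / acts.shape[0]
    return np.linalg.eigvalsh(cov)[::-1].clip(min=0.0)

def spectral_entropy(eig, eps=1e-12):
    """SE: entropy of the normalized spectrum, scaled by log D to lie in [0, 1] (assumed convention)."""
    p = eig / (eig.sum() + eps)
    return float(-(p * np.log(p + eps)).sum() / np.log(len(eig)))

def participation_ratio(eig, eps=1e-12):
    """PR: (sum of eigenvalues)^2 / sum of squared eigenvalues, the effective number of active directions."""
    return float(eig.sum() ** 2 / ((eig ** 2).sum() + eps))

def eigenvalue_early_enrichment(eig, top_frac=0.01, eps=1e-12):
    """EEE (assumed form): share of total variance in the top `top_frac` of eigenvalues;
    values near 1 indicate a top-heavy spectrum."""
    k = max(1, int(top_frac * len(eig)))
    return float(eig[:k].sum() / (eig.sum() + eps))

def js_divergence(eig_pre, eig_post, eps=1e-12):
    """JS: Jensen-Shannon divergence between normalized pre- and post-activation spectra
    (both taken at the FFN hidden dimension D, so their lengths match)."""
    p = eig_pre / (eig_pre.sum() + eps)
    q = eig_post / (eig_post.sum() + eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: float((a * (np.log(a + eps) - np.log(b + eps))).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

Applied to one FFN layer, the pre- and post-activation spectra yield SE, PR, and EEE for each side and a single JS value for the pair.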

GELU vs. ReLU: who explores more of the latent space? GELU and ReLU follow the same qualitative trajectory (variance reinjection, spectral flattening, distributional reordering) but differ in pace and extent. ReLU stabilizes earlier; GELU progresses more gradually yet ultimately explores a broader subspace, correlating with its lower perplexity. All four metrics correlate strongly with evaluation loss (|r| > 0.92), confirming that spectral dynamics track generalization throughout training.

How FFN nonlinearities compensate for the absence of LayerNorm

Normalization-free eigenmetrics plots

Figure: Eigenspectrum dynamics for norm-free GPT-2 (125M) models with GELU (top), ReLU (middle), and learnable-slope Leaky ReLU (bottom). Each row shows layer-averaged SE (pre vs. post), PR gain (post-to-pre ratio), post-activation EEE, and JS divergence across layers and training steps. Norm-free GELU exhibits spectral inertia in layers 0 to 5 (EEE → 1, JS → 0), while ReLU and Leaky ReLU aggressively reinject variance (PR gain > 200x), flattening the spectrum (EEE < 0.3).

Removing LayerNorm shifts the entire burden of statistical regularization onto FFN activations, and not all activations survive. LayerNorm re-centers and rescales representations at every layer, quietly preventing variance from concentrating into a few dominant directions. Without it, the FFN nonlinearity is the last line of defense against spectral collapse. NerVE reveals that GELU and ReLU respond to this pressure in fundamentally different ways.

GELU exhibits spectral inertia: early FFNs fail to reinject variance, and information flows through a narrow subspace. In normalization-free models with GELU, the post-activation EEE remains near 1 and JS near 0 in early layers; the nonlinearity is effectively acting as a near-identity, leaving the top-heavy eigenspectrum untouched. This spectral bottleneck is the geometric signature of entropic overload (Jha & Reagen, NeurIPS ATTRIB 2024), where early attention heads are stuck in high-entropy states, starving deeper layers of representational diversity.

ReLU breaks spectral inertia through aggressive overcompensation (PR gains >200). In sharp contrast, ReLU and learnable-slope Leaky ReLU variants exhibit massive variance reinjection in the first two FFN layers, flattening the spectrum (EEE < 0.3) and producing non-overlapping pre/post spectral entropy curves. This compensatory behavior partially assumes the regularization role of LayerNorm, closing roughly 50% of the perplexity gap to the normalized baseline.
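As a rough illustration of how these two regimes could be flagged automatically from the metric sketches above, the following function labels a layer as spectrally inert or overcompensating. The thresholds (EEE > 0.9 and JS < 0.05 for inertia, PR gain > 100 for overcompensation) are hypothetical cutoffs chosen for illustration, not values used in the paper.

def layer_diagnostics(eig_pre, eig_post):
    """Flag spectral inertia vs. overcompensation for one FFN layer.
    Thresholds are illustrative; the paper reports EEE -> 1, JS -> 0 for inert
    norm-free GELU layers and PR gains > 200x for ReLU variants."""
    pr_gain = participation_ratio(eig_post) / max(participation_ratio(eig_pre), 1e-12)
    eee_post = eigenvalue_early_enrichment(eig_post)
    js = js_divergence(eig_pre, eig_post)
    return {
        "pr_gain": pr_gain,
        "eee_post": eee_post,
        "js": js,
        "spectral_inertia": eee_post > 0.9 and js < 0.05,  # nonlinearity acting as near-identity
        "overcompensation": pr_gain > 100.0,               # aggressive variance reinjection
    }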

        Baseline Models       Norm-free Models
        GELU      ReLU        GELU      ReLU      Leaky ReLU
PPL     2.714     2.774       3.223     2.988     3.081

Table: Evaluation perplexity (PPL) comparison across GPT-2 baseline models (GELU and ReLU), and the norm-free variants (GELU, ReLU, learnable-slope Leaky ReLU). All models trained from scratch on 2.1B tokens from the CodeParrot dataset.

How optimizer geometry determines FFN capacity allocation

Optimizer-dependent FFN eigenspectrum dynamics

Figure: Optimizer-dependent FFN eigenspectrum dynamics in GPT-2 (350M) trained on the FineWeb dataset. Rows show AdamW (top), Muon (middle), and Dion (bottom). AdamW exhibits large early PR gains and high JS with relatively high post-activation EEE, indicating optimizer-induced pre-activation collapse followed by aggressive but incomplete nonlinear repair. Muon shows the smallest PR gains, lowest JS, and lowest post-activation EEE, with flatter post-spectra. Dion falls between these two regimes, improving over AdamW but not matching Muon's spectral behavior. The perplexity ordering (Muon < Dion < AdamW) aligns with post-activation spectral flatness.

Repair or refinement: more effort does not mean better outcome. Under AdamW, FFN nonlinearities exhibit the largest PR gains and highest JS divergence across all three optimizers; they are working the hardest. But this effort is corrective, not productive: the nonlinearity spends its capacity undoing spectral collapse that the optimizer itself induced, and despite massive corrections, AdamW's post-activation effective dimensionality remains the lowest. Muon achieves the opposite: highest post-activation PR with the smallest gains and lowest JS. Its nonlinearities perform modest refinement on already healthy spectra rather than expensive repair. Dion falls between, improving over AdamW but not matching Muon's spectral efficiency. The perplexity ordering (Muon < Dion < AdamW) tracks this distinction: productive refinement outperforms heroic repair.

Layer-wise pre-activation PR over training

Figure: Layer-wise pre-activation PR over training for AdamW, Muon, and Dion on GPT-2 350M (24 layers) trained on the FineWeb dataset. Muon maintains the highest pre-activation PR (PR_pre) across almost all layers throughout training, Dion is intermediate, and AdamW shows early-layer collapse.

Muon preserves well-conditioned pre-activations; AdamW lets them collapse. The root cause of the repair-refinement divide lies in what each optimizer does to the pre-activation eigenspectrum. AdamW allows early-layer pre-activation PR to collapse during training; variance concentrates into a few dominant eigenmodes, handing the nonlinearity a spectrally damaged input. Muon maintains high pre-activation PR across nearly all layers throughout training, producing near-isotropic spectra before the nonlinearity even acts. Dion partially mitigates the early-layer collapse but does not match Muon's conditioning. These dynamics persist across model scales (160M, 350M) and context lengths (512, 1024), confirming that they are intrinsic to optimizer geometry rather than artifacts of a specific configuration.

Final post-activation PR per layer

Figure: Final post-activation PR per layer for AdamW, Muon, and Dion on GPT-2 350M (24 layers) trained on the FineWeb dataset. Muon concentrates the largest effective dimensionality in the middle FFN layers, the layers most critical for generalization.

Where capacity accumulates matters more than how much is injected. Muon concentrates the highest post-activation effective dimensionality in the middle FFN layers, the layers recent evidence identifies as disproportionately important for generalization (Queipo-de-Llano et al., ICLR 2026; Lad et al., NeurIPS 2025; Ikeda et al., COLM 2025; Skean et al., ICML 2025). AdamW inflates PR_post in early layers through aggressive repair but leaves middle layers underserved. Dion pushes capacity into early FFNs without yielding the best perplexity. The decisive pattern: perplexity tracks mid-layer spectral capacity, not early-layer effort. This suggests that optimizers should be evaluated not by aggregate training metrics alone but by where across depth they allocate effective representational capacity. These findings provide empirical evidence that optimizer geometry introduces qualitatively distinct representational biases, not merely different convergence rates, aligning with the recent position that optimizers should be leveraged as explicit sources of inductive bias (Pascanu et al., 2025).

Beyond attention: NerVE on MLP-Mixer

Eigenspectrum dynamics in MLP-Mixer

Figure: Eigenspectrum dynamics in MLP-Mixer under activation ablations. Rows correspond to the four activation configurations for the token-mixing (FFN1) and channel-mixing (FFN2) layers, and columns (left to right) show SE, PR, EEE, and JS for the channel-mixing FFNs (FFN2). Each panel traces pre- and post-activation metrics over training, showing that ReLU in the channel-mixing MLP (3rd and 4th rows) most strongly increases SE/PR and reduces EEE, reinjecting variance into low-energy directions and flattening the spectrum.

The variance-reinjection pattern is not transformer-specific; it emerges wherever deep FFNs meet nonlinearity. MLP-Mixer removes self-attention entirely, isolating the contribution of FFN nonlinear transformations from attention-specific dynamics such as rank collapse. We apply NerVE to MLP-Mixer (B/16) trained on CIFAR-100. The same core pattern holds: post-activation SE and PR rise above their pre-activation values throughout training, EEE drops, and the nonlinearity actively flattens the eigenspectrum across all four activation configurations tested. NerVE further reveals that activation choice in the channel-mixing MLP (the component analogous to transformer FFNs) has a far stronger spectral impact than activation choice in the token-mixing MLP, identifying which nonlinearity matters most. The optimizer story also extends: SGD achieves higher post-activation SE and PR than Adam throughout training, correlating with better accuracy (68.07% vs. 66.96%) and confirming that optimizer-dependent spectral dynamics are not a transformer-specific phenomenon.

NerVE metrics predict generalization without evaluation

NerVE metrics are not just descriptive; they track generalization with near-perfect correlation. Pre-activation SE and PR correlate with validation loss at |r| ≥ 0.97 across every FFN width configuration tested, throughout training. This means spectral health can be monitored with a single forward pass, no gradient computation, no validation set evaluation. Post-activation correlations strengthen as FFN width increases, suggesting that a modest width is needed before the spectral signal becomes generalization-predictive.
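One way such monitoring could look in practice is sketched below: PyTorch forward hooks capture the pre- and post-activation spectra of every FFN from a single evaluation batch. The module paths (transformer.h[i].mlp.c_fc for the up-projection output, mlp.act for the activation output) assume the Hugging Face GPT-2 layout; this is an illustrative sketch, not the paper's released tooling.

import torch

@torch.no_grad()
def collect_ffn_spectra(model, input_ids):
    """Capture pre-/post-activation eigenspectra for every FFN layer in one forward pass.
    Assumes a Hugging Face GPT-2 style module layout (block.mlp.c_fc / block.mlp.act)."""
    spectra, handles = {}, []

    def make_hook(key):
        def hook(_module, _inputs, output):
            acts = output.reshape(-1, output.shape[-1]).float()    # (tokens, D)
            acts = acts - acts.mean(dim=0, keepdim=True)
            cov = acts.T @ acts / acts.shape[0]
            eig = torch.linalg.eigvalsh(cov).flip(0).clamp_min(0)  # descending eigenvalues
            spectra[key] = eig.cpu().numpy()
        return hook

    for i, block in enumerate(model.transformer.h):
        handles.append(block.mlp.c_fc.register_forward_hook(make_hook((i, "pre"))))
        handles.append(block.mlp.act.register_forward_hook(make_hook((i, "post"))))

    model.eval()
    model(input_ids)            # single forward pass; no gradients, no labels
    for h in handles:
        h.remove()
    return spectra

Feeding each layer's pre/post spectra into the metric functions above then gives per-layer SE, PR, EEE, and JS at any checkpoint without a backward pass or a validation run.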

FFN Width Configuration (GPT-2 GELU)
Metric     D=1d     D=2d     D=3d     D=4d     D=5d     D=6d     D=7d     D=8d
SE_pre    -0.98    -0.98    -0.99    -0.99    -0.99    -0.99    -0.99    -0.99
SE_post   -0.84    -0.84    -0.86    -0.87    -0.87    -0.87    -0.87    -0.87
PR_pre    -0.97    -0.98    -0.98    -0.99    -0.98    -0.97    -0.98    -0.97
PR_post   -0.85    -0.93    -0.94    -0.94    -0.95    -0.95    -0.93    -0.93

Table A (Within-run tracking): Pearson r between each metric and validation loss over training checkpoints at each FFN width (D=1d to 8d). Pre-activation correlations reach |r| ≥ 0.97 at every width. Post-activation PR strengthens from |r| = 0.85 at D=1d to |r| ≥ 0.93 at D ≥ 2d, suggesting a modest FFN width is required for generalization-predictive spectral signatures.

Short runs can rank architectures without training to convergence. Across eight FFN width configurations and multiple activation variants, final spectral metric values correlate strongly with final perplexity (|r| ≥ 0.85). The notable exception: normalization-free ReLU (and its Leaky variant), where pre-activation correlations weaken while post-activation correlations strengthen. This directly reflects the compensatory dynamics identified earlier: when the nonlinearity overcompensates, the post-activation spectrum becomes the more informative diagnostic. NerVE tells you not only what to measure, but which measurement to trust in each regime.

                     GPT-2                          NormFree GPT-2
Metric     GELU     ReLU     GeGLU    SwiGLU     GELU     ReLU     LReLU
SE_pre    -0.99    -0.98    -0.95    -0.97     -0.82     0.03     0.03
SE_post   -1.00    -1.00    -0.57    -0.85     -0.92    -0.99    -1.00
PR_pre    -0.99    -0.98    -0.97    -0.97     -0.93    -0.55    -0.60
PR_post   -1.00    -0.97    -0.94    -0.89     -0.99    -0.94    -0.99

Table B (Cross-configuration ranking): Pearson r between final metric values and final perplexity across eight width configurations, for each architecture and activation variant. Correlations remain strong (|r| ≥ 0.85) across most configurations. The notable exceptions are NormFree ReLU and LReLU, where pre-activation correlations weaken markedly (e.g., SE_pre drops to 0.03) while post-activation correlations stay strong, reflecting the compensatory overcompensation identified earlier.
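The ranking check behind Table B needs only one final metric value and one final perplexity per configuration. The sketch below shows how the Pearson r could be computed; the listed numbers are placeholders for illustration, not the paper's measurements.

import numpy as np

def pearson_r(x, y):
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float((x * y).sum() / (np.sqrt((x ** 2).sum() * (y ** 2).sum()) + 1e-12))

# Hypothetical final values, one per FFN width configuration (D = 1d .. 8d);
# in practice these come from short training runs of each candidate architecture.
final_pr_post = [310, 510, 680, 790, 870, 930, 990, 1030]
final_ppl     = [3.10, 2.95, 2.88, 2.83, 2.80, 2.78, 2.77, 2.76]

print(f"r(final PR_post, final PPL) = {pearson_r(final_pr_post, final_ppl):.2f}")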

BibTeX

@inproceedings{jha2026nerve,
  title={NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks},
  author={Nandan Kumar Jha and Brandon Reagen},
  booktitle={The Fourteenth International Conference on Learning Representations (ICLR)},
  year={2026}
}