Towards Measuring Superposition with Sparse Autoencoders
Leonard Bereska, Reza Samavi*, Efstratios Gavves*
*equal contribution
Abstract
Neural networks achieve remarkable capabilities by representing more features than they have neurons through superposition, encoding features as shared directions in activation space. While this phenomenon explains many empirical observations, measuring superposition in real networks remains an open challenge, limiting our ability to engineer more interpretable models. We present an entropy-based framework for quantifying superposition using sparse autoencoders (SAEs) to recover underlying features. Our approach introduces scale-invariant metrics that work without ground-truth features and reveals systematic differences between statistical and algorithmic domains. On toy models, we achieve strong correlation with ground-truth measures (r > 0.94), while analysis of compiled transformers demonstrates how different computational tasks employ distinct feature organization strategies. Most surprisingly, we find that adversarial training increases measured feature counts while improving robustness, challenging previous theoretical predictions about superposition and vulnerability. These advances provide quantitative tools for understanding and controlling how neural networks organize information, with direct implications for interpretability and safety.
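To make the entropy-based idea concrete, a common scale-invariant notion of an "effective feature count" is the exponential of the Shannon entropy (the perplexity) of normalized feature importances, such as mean SAE feature activation magnitudes. The sketch below illustrates this generic construction only; the function name `effective_feature_count` and the choice of importance weights are assumptions for illustration, not the paper's exact metric.

```python
import math

def effective_feature_count(importances):
    """Exponential of the Shannon entropy of normalized importances.

    An illustrative, scale-invariant proxy for how many features a
    representation effectively uses: uniform importance over n features
    yields n; importance concentrated on one feature yields ~1.
    The choice of importance weights (e.g., mean SAE activation
    magnitudes) is an assumption here, not the paper's definition.
    """
    total = sum(importances)
    probs = [x / total for x in importances if x > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# Uniform importance over 4 features -> effective count of 4.0
print(effective_feature_count([1.0, 1.0, 1.0, 1.0]))
# Importance concentrated on one feature -> effective count near 1
print(effective_feature_count([100.0, 0.01, 0.01, 0.01]))
# Scale invariance: rescaling all importances leaves the count unchanged
print(effective_feature_count([5.0, 5.0, 5.0, 5.0]))
```

Because the importances are normalized to a distribution before the entropy is taken, multiplying all activations by a constant leaves the measure unchanged, which is one way to obtain the scale invariance the abstract refers to.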