Towards Measuring Superposition with Sparse Autoencoders
Leonard Bereska, Reza Samavi*, Efstratios Gavves*
*equal contribution
Abstract
Neural networks achieve remarkable capabilities by representing more features than they have neurons through superposition, encoding features as shared directions in activation space. While this phenomenon explains many empirical observations, measuring superposition in real networks remains an open challenge, limiting our ability to engineer more interpretable models. We present an entropy-based framework for quantifying superposition using sparse autoencoders (SAEs) to recover underlying features. Our approach introduces scale-invariant metrics that work without ground-truth features and reveals systematic differences between statistical and algorithmic domains. On toy models, we achieve strong correlation with ground-truth measures (r > 0.94), while analysis of compiled transformers demonstrates how different computational tasks employ distinct feature organization strategies. Most surprisingly, we find that adversarial training increases measured feature counts while improving robustness, challenging previous theoretical predictions about superposition and vulnerability. These advances provide quantitative tools for understanding and controlling how neural networks organize information, with direct implications for interpretability and safety.
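To make the entropy-based idea concrete, a common scale-invariant notion of an "effective feature count" is the exponential of the Shannon entropy (the perplexity) of normalized feature importances, such as mean SAE feature activation magnitudes. The sketch below illustrates this generic construction only; the function name `effective_feature_count` and the choice of importance weights are assumptions for illustration, not the paper's exact metric.

```python
import math

def effective_feature_count(importances):
    """Exponential of the Shannon entropy of normalized importances.

    An illustrative, scale-invariant proxy for how many features a
    representation effectively uses: uniform importance over n features
    yields n; importance concentrated on one feature yields ~1.
    The choice of importance weights (e.g., mean SAE activation
    magnitudes) is an assumption here, not the paper's definition.
    """
    total = sum(importances)
    probs = [x / total for x in importances if x > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# Uniform importance over 4 features -> effective count of 4.0
print(effective_feature_count([1.0, 1.0, 1.0, 1.0]))
# Importance concentrated on one feature -> effective count near 1
print(effective_feature_count([100.0, 0.01, 0.01, 0.01]))
# Scale invariance: rescaling all importances leaves the count unchanged
print(effective_feature_count([5.0, 5.0, 5.0, 5.0]))
```

Because the importances are normalized to a distribution before the entropy is taken, multiplying all activations by a constant leaves the measure unchanged, which is one way to obtain the scale invariance the abstract refers to.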