An Alternative Approach to Sparse Embeddings for AI Mechanistic Interpretability
Kevin Noel
Abstract
In the context of AI safety, Mechanistic Interpretability (MI) is a fundamental approach to understanding large neural networks, particularly Large Language Models (LLMs), by reverse engineering their internal computational mechanisms. The field has recently progressed through the use of Sparse Auto-Encoders (SAEs), which can effectively decompose complex neural networks into interpretable, monosemantic features, facilitating a cleaner understanding of model behavior and its safety implications. SAE methods have been successfully scaled to LLMs, revealing highly abstract, multilingual, and multimodal features that encompass both concrete and abstract representations. Building on the linear representation and superposition hypotheses, dictionary-learning SAE methods have been used to extract interpretable features. Findings reveal that this mechanistic interpretability approach, based on SAE-style learning, enables the identification of safety-relevant features related to deception, bias, and potentially harmful content.
Sparse Autoencoders (SAEs) have emerged as the de facto standard for analyzing and interpreting Large Language Models (LLMs). While SAEs have enabled significant advances in mechanistic interpretation and monosemantic feature detection, their widespread adoption faces significant barriers due to computational demands and training complexity. As a result, most studies rely on pre-trained SAEs provided by large AI labs.
In this initial working paper, we introduce an alternative approach for extracting sparse embeddings from LLMs, inspired by the field of Neural Information Retrieval. The method combines bi-directional LLM architectures with Sparse Lexical and Expansion (SPLADE) techniques from encoder-decoder language models to create an alternative for LLM probing through sparse embeddings. This approach leverages the complementary strengths of the two methods, the rich contextual representations of bi-directional LLMs and the interpretable sparse expansions of SPLADE, to generate meaningful sparse embeddings while reducing the computational overhead of traditional full SAE training. In this work, we present preliminary results, showing the potential of combining bi-directional LLMs with a Masked Language Model head for lexical expansion and the generation of sparse embeddings.
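To illustrate the kind of sparse expansion the abstract refers to, the sketch below implements the standard SPLADE pooling step, where masked-language-model logits over the vocabulary are passed through a log-saturated ReLU and max-pooled over token positions. This is a minimal illustration of the SPLADE formulation from the retrieval literature, not the paper's implementation; the array shapes and the synthetic logits are assumptions made for the example.

```python
import numpy as np

def splade_pool(logits, attention_mask):
    """SPLADE-style sparse embedding from MLM logits.

    logits:         (seq_len, vocab_size) vocabulary logits per token
    attention_mask: (seq_len,) 1 for real tokens, 0 for padding
    Returns a non-negative, mostly-zero vector of size vocab_size:
        w_j = max_i log(1 + ReLU(logit_ij))
    """
    weights = np.log1p(np.maximum(logits, 0.0))   # log-saturated ReLU
    weights = weights * attention_mask[:, None]   # zero out padded positions
    return weights.max(axis=0)                    # max-pool over tokens

# Synthetic stand-in for MLM logits (real ones would come from a
# bi-directional encoder's masked-LM head); mostly negative logits
# yield a sparse expansion vector.
rng = np.random.default_rng(0)
logits = rng.normal(loc=-2.0, scale=1.0, size=(8, 30522))
emb = splade_pool(logits, np.ones(8))
```

Because the ReLU clips all negative logits to zero before the log, each vocabulary dimension is active only if at least one token position assigns it a positive logit, which is what keeps the resulting embedding sparse and, dimension-by-dimension, aligned with human-readable vocabulary terms.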