How Effective Is Constitutional AI in Small LLMs? A Study on DeepSeek-R1 and Its Peers

Antonio-Gabriel Chacón Menke, Phan Xuan Tan

Abstract

Recent incidents highlight safety risks in Large Language Models (LLMs), motivating research into alignment methods like Constitutional AI (CAI). This paper explores CAI's self-critique mechanism on small, uncensored 7-9B-parameter models: DeepSeek-R1-8B, Gemma-2-9B, Llama 3.1-8B, and Qwen2.5-7B. We show that while Llama-based models exhibited significant harm reduction through self-critique, other architectures struggled with harm detection after abliteration (the removal of refusal behavior). These findings suggest that CAI's effectiveness may depend on model architecture and reasoning capabilities.
