YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation
Abstract
Activation steering offers a lightweight alternative to fine-tuning, but dense steering vectors often entangle multiple behaviors, limiting stability and fine-grained control. We propose YaPO, a reference-free method that learns sparse steering vectors in the latent space of a pretrained Sparse Autoencoder (SAE) using a bi-directional preference optimization objective. By optimizing sparse codes and reconstructing activations with residual correction, YaPO yields disentangled, interpretable, and efficient steering directions. Experiments show faster convergence and more stable training than dense baselines, with stronger cultural alignment and broad generalization to hallucination, jailbreak, power-seeking, and wealth-seeking behaviors. YaPO preserves general knowledge, with no measurable degradation on MMLU, providing a practical recipe for efficient, stable domain adaptation of LLMs.
Key Ideas
Dense steering vectors entangle multiple behaviors because individual neurons are polysemantic. Sparse Autoencoders provide a disentangled basis for fine-grained, interpretable steering.
YaPO learns steering vectors in sparse SAE space using a bi-directional preference objective, then decodes them back with residual correction for faithful activation reconstruction.
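A minimal sketch of the decode-with-residual-correction step, assuming the common SAE-steering recipe of adding back the autoencoder's reconstruction error so the edit stays faithful to the original activation; the function names, the encode/decode interface, and the exact form of the correction are illustrative assumptions, not YaPO's released implementation.

```python
import torch

def steer_with_residual_correction(h, sae, z, lam=1.0):
    """Steer activation h by a learned sparse code z in the frozen SAE's latent space.

    h   : activation at the target layer, shape (..., d_model)
    sae : frozen pretrained Sparse Autoencoder exposing encode()/decode()
    z   : learned sparse code (same dimensionality as the SAE latent)
    lam : steering multiplier
    """
    codes = sae.encode(h)          # sparse features of the current activation
    recon = sae.decode(codes)      # SAE reconstruction of h
    residual = h - recon           # the part of h the SAE fails to capture
    # Edit in sparse space, decode back, then restore the unexplained residual
    # so no information is lost to imperfect SAE reconstruction.
    return sae.decode(codes + lam * z) + residual
```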
Contributions
- First reference-free method to learn sparse steering vectors in SAE latent space from preference data.
- New cultural alignment benchmark spanning five language families and fifteen cultural contexts.
- Faster convergence, improved stability, and strong generalization across alignment behaviors without MMLU drop.
Observations
YaPO combines DPO-style preference optimization with SAE-based sparsity: it optimizes sparse codes while keeping the LLM and SAE frozen, and injects the decoded steering direction at a target layer. This yields stable training, fine-grained cultural adaptation, and broader alignment control without degrading general knowledge.
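A minimal sketch of one training step under stated assumptions: the frozen model without steering acts as the implicit DPO reference (so no separate reference model is kept), the decoded direction is injected additively through a forward hook at the target layer, and a randomly sampled sign makes the objective bi-directional. All names and hyperparameters are illustrative rather than the reference implementation; the residual-corrected decode sketched under Key Ideas could be substituted for the plain additive injection.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids):
    """Sum of token log-probabilities per sequence under teacher forcing."""
    logits = model(input_ids).logits[:, :-1]
    logps = F.log_softmax(logits, dim=-1)
    return logps.gather(-1, input_ids[:, 1:, None]).squeeze(-1).sum(-1)

def yapo_step(model, sae, z, target_layer, chosen_ids, rejected_ids,
              optimizer, lam=1.0, beta=0.1):
    """One preference step: only the sparse code z is trainable.

    The LLM and SAE parameters are assumed frozen outside this function;
    gradients flow only through sae.decode(z) into z.
    """
    # Bi-directional sign: +v should favour the chosen response,
    # -v should favour the rejected one.
    direction = 1.0 if torch.rand(1).item() < 0.5 else -1.0

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steer = sae.decode(z)                      # dense direction from the sparse code
        hidden = hidden + direction * lam * steer
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden

    # Reference log-probs from the unsteered frozen model.
    with torch.no_grad():
        ref_chosen = sequence_logprob(model, chosen_ids)
        ref_rejected = sequence_logprob(model, rejected_ids)

    # Policy log-probs with the decoded steering direction injected.
    handle = target_layer.register_forward_hook(hook)
    try:
        pol_chosen = sequence_logprob(model, chosen_ids)
        pol_rejected = sequence_logprob(model, rejected_ids)
    finally:
        handle.remove()

    # DPO-style objective; the sampled sign couples steering direction and preference.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    loss = -F.logsigmoid(direction * beta * margin).mean()

    optimizer.zero_grad()
    loss.backward()        # updates only z
    optimizer.step()
    return loss.item()
```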
Training Dynamics
Convergence: Across Egypt and Nepal, YaPO’s loss drops below 0.1 within ~150 steps, while BiPO remains above 0.3 even after 600 steps. We observe the same pattern consistently for other countries and attribute the gap to optimizing in sparse SAE space, which yields cleaner gradients and more stable optimization than dense residual-space steering.
Steering Multiplier Effect
Steering multiplier λ sensitivity: The multiplier curves show how accuracy changes as the steering strength varies; in these case studies, YaPO exhibits a clearer accuracy gain under positive multipliers than the baseline methods.
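A hypothetical multiplier sweep at inference time, reusing the hook-based injection from the training sketch above; eval_fn is a placeholder for whatever benchmark harness computes task accuracy, and the multiplier grid is illustrative.

```python
import torch

@torch.no_grad()
def sweep_multiplier(model, sae, z, target_layer, eval_fn,
                     multipliers=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Report accuracy as the steering strength λ is varied."""
    direction = sae.decode(z)                      # fixed decoded steering direction
    results = {}
    for lam in multipliers:
        def hook(module, inputs, output, lam=lam):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + lam * direction
            return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden
        handle = target_layer.register_forward_hook(hook)
        try:
            results[lam] = eval_fn(model)          # e.g. accuracy on the cultural benchmark
        finally:
            handle.remove()
    return results
```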