Project Page · Yet another Policy Optimization (YaPO)

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

YaPO learns sparse steering vectors in SAE space with a bi-directional preference objective for stable, interpretable control. It converges faster than dense baselines, improves cultural alignment, and preserves general knowledge on MMLU.
MBZUAI · École Polytechnique

Abstract

Activation steering offers a lightweight alternative to fine-tuning, but dense steering vectors often entangle multiple behaviors, limiting stability and fine-grained control. We propose YaPO, a reference-free method that learns sparse steering vectors in the latent space of a pretrained Sparse Autoencoder (SAE) using a bi-directional preference optimization objective. By optimizing sparse codes and reconstructing activations with residual correction, YaPO yields disentangled, interpretable, and efficient steering directions. Experiments show faster convergence and more stable training than dense baselines, with stronger cultural alignment and broad generalization to hallucination, jailbreak, power-seeking, and wealth-seeking behaviors. YaPO preserves general knowledge, with no measurable degradation on MMLU, providing a practical recipe for efficient, stable domain adaptation of LLMs.

Figure: YaPO method overview. YaPO projects target-layer activations into SAE space, optimizes a sparse steering vector with a bi-directional preference objective, and decodes it with residual correction before injecting the steered activation back into the model.
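For concreteness, here is a minimal PyTorch-style sketch of that injection step. The `ToySAE` module, the tensor shapes, and the reading of "residual correction" as adding back the SAE reconstruction error of the original activation are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ToySAE(nn.Module):
    """Minimal sparse autoencoder with a ReLU latent (illustrative only)."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def encode(self, h):   # (..., d_model) -> (..., d_sae)
        return torch.relu(self.enc(h))

    def decode(self, z):   # (..., d_sae) -> (..., d_model)
        return self.dec(z)

def steer_activations(h, sae, s, lam=1.0):
    """Apply a sparse steering vector s (in SAE latent space) to hidden states h.

    Residual correction here means keeping the SAE reconstruction error of the
    original activation, so only the steered component of h changes.
    """
    z = sae.encode(h)                 # project into SAE latent space
    residual = h - sae.decode(z)      # SAE reconstruction error
    return sae.decode(z + lam * s) + residual

# Toy usage at a single target layer.
d_model, d_sae = 256, 2048
sae = ToySAE(d_model, d_sae)
s = torch.zeros(d_sae)                # learnable sparse steering vector
s[[3, 17, 42]] = 0.5                  # only a handful of latents are active
h = torch.randn(2, 8, d_model)        # (batch, seq, d_model) activations
print(steer_activations(h, sae, s, lam=2.0).shape)   # torch.Size([2, 8, 256])
```

In a full pipeline this function would run inside a forward hook on the target layer, with the LLM and SAE weights frozen.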

Key Ideas

Dense steering vectors entangle multiple behaviors due to neuron multi-semanticity. Sparse Autoencoders provide a disentangled basis for fine-grained, interpretable steering.

YaPO learns steering vectors in sparse SAE space using a bi-directional preference objective, then decodes them back with residual correction for faithful activation reconstruction.

Contributions

- First reference-free method to learn sparse steering vectors in SAE latent space from preference data.

- New cultural alignment benchmark spanning five language families and fifteen cultural contexts.

- Faster convergence, improved stability, and strong generalization across alignment behaviors without MMLU drop.

Observations

YaPO combines DPO-style preference optimization with SAE-based sparsity: it optimizes sparse codes while keeping the LLM and SAE frozen, and injects the decoded steering direction at a target layer. This yields stable training, fine-grained cultural adaptation, and broader alignment control without degrading general knowledge.
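As a hedged sketch, the objective below shows one reference-free, bi-directional reading of this setup: steering with +s should make the model prefer the chosen response, and steering with -s the rejected one. The function name, sign convention, and temperature beta are assumptions; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def bidirectional_pref_loss(logp_pos_w, logp_pos_l, logp_neg_w, logp_neg_l, beta=0.1):
    # logp_pos_*: sequence log-probs of the chosen (w) / rejected (l) response
    #             under the model steered with +s
    # logp_neg_*: the same quantities under the model steered with -s
    # Assumption: +s should prefer the chosen response, -s the rejected one;
    # beta is an illustrative temperature, not a value from the paper.
    loss_pos = -F.logsigmoid(beta * (logp_pos_w - logp_pos_l))
    loss_neg = -F.logsigmoid(beta * (logp_neg_l - logp_neg_w))
    return (loss_pos + loss_neg).mean()

# Toy usage with made-up log-probs for a batch of 4 preference pairs.
# In a real run these would come from forward passes of the frozen LLM with the
# steering hook active, and only the sparse code s would require gradients:
#   s = torch.zeros(d_sae, requires_grad=True)
#   opt = torch.optim.Adam([s], lr=1e-3)
lp = lambda: torch.randn(4)
print(float(bidirectional_pref_loss(lp(), lp(), lp(), lp())))
```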

MCQ Performance Evaluation
Note: Results use Gemma-2-2B-it and report average MCQ accuracy across language categories; Baseline is the unsteered model.
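One standard way to score MCQs with a language model is to compare per-option log-likelihoods and pick the argmax; the sketch below illustrates that scoring convention, though it is not necessarily the exact protocol used for the results above.

```python
import torch

def mcq_accuracy(option_logprobs: torch.Tensor, gold: torch.Tensor) -> float:
    # option_logprobs: (n_questions, n_options) log-likelihood of each answer
    # option under the (steered or unsteered) model; gold: correct option index.
    pred = option_logprobs.argmax(dim=-1)
    return (pred == gold).float().mean().item()

# Toy usage: 3 questions with 4 options each.
logps = torch.tensor([[-1.2, -0.3, -2.0, -1.7],
                      [-0.9, -1.1, -0.2, -3.0],
                      [-2.2, -0.8, -1.5, -0.4]])
gold = torch.tensor([1, 2, 3])
print(mcq_accuracy(logps, gold))   # 1.0
```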

Training Dynamics


Convergence: For Egypt and Nepal, YaPO’s loss drops below 0.1 within roughly 150 steps, while BiPO remains above 0.3 even after 600 steps. We observe the same pattern for the other countries and attribute the gap to optimizing in sparse SAE space, which yields cleaner gradients and more stable optimization than dense residual-space steering.

Steering Multiplier Effect


Steering multiplier λ sensitivity: the curves track MCQ accuracy as the steering strength λ varies; in these case studies, YaPO shows a clearer gain under positive multipliers than the other methods.
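A small sketch of such a multiplier sweep at inference time is shown below. `evaluate_mcq` is a stand-in stub so the loop runs; a real sweep would re-score the benchmark with the steered model at each λ.

```python
import math

# Hypothetical evaluation stub: returns MCQ accuracy for a given steering
# multiplier. The stand-in curve just makes the sweep runnable.
def evaluate_mcq(lam: float) -> float:
    return 0.5 + 0.1 * math.tanh(lam)

# Sweep positive and negative multipliers to probe how strongly, and in which
# direction, the learned steering vector moves behavior.
for lam in (-2.0, -1.0, 0.0, 1.0, 2.0, 4.0):
    print(f"lambda={lam:+.1f}  MCQ accuracy={evaluate_mcq(lam):.3f}")
```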