Teaching an AI to Feel Things: How Activation Steering Works (and What It Means for Blip)

Most approaches to making an AI more emotionally responsive work at the prompt level. You tell the model to be warm. To match Jaxsen's energy. To respond with empathy when he's upset about something at school. The model reads those instructions, interprets them, and produces something warmer than its default.

It works. And it drifts. The instruction fades across a long conversation. Different phrasings of "be warm" produce different results. You end up tuning prompt wording for hours to get something consistent, and then it stops being consistent when the conversation takes an unexpected turn.

There's a different approach. Instead of telling the model what to do with words, you push its internal representations directly toward a target emotional state during the forward pass. The model doesn't re-derive the emotional register from a text instruction on every token — it computes from a starting point that's already been nudged.

I've spent the past few weeks building this out. Here's how it works, what I actually built, and what the results look like across 21 emotion concepts on two models — including a 24B mixture-of-experts model running at 4-bit quantization on my local inference machine.

The geometry of emotional states

A large language model processes text by passing representations through dozens of transformer layers. At each layer, the model maintains a high-dimensional vector — a point in thousands of dimensions — that encodes its current understanding of the conversation. This vector is updated as information flows forward.

What researchers have discovered over the past few years is that this space has structure. It's not random noise. Concepts that are semantically related cluster together. The direction from "sad response" to "happy response" is a learnable, stable direction in the model's activation space, reproducible across prompts, contexts, and runs [1].

That's the key insight. Emotional states aren't just labels on outputs. They're directions in the model's internal geometry.

Contrastive Activation Addition

The technique is called Contrastive Activation Addition (CAA), introduced by Panickssery et al. in a 2023 preprint and published at ACL 2024 [2]. The recipe:

Step 1: Contrastive pairs. For each emotion concept, write paired prompts — one that clearly expresses the concept, one that doesn't. For happy:

Positive: "I just got the job offer I'd been hoping for. I'm so excited to start."

Negative: "The job offer came through. I'll start Monday."

I wrote 11–20 pairs per concept, covering different valences, contexts, and framings so the vector captures the concept and not a quirk of one example.

Step 2: Capture activations. Feed both versions through the model. Record the activation vector at each target layer, specifically at the last token position, at layers in the middle portion of the network. For a 32-layer model, that's roughly layers 13–19. Based on the empirical literature, that's where semantic meaning tends to be most reliably encoded [1, 2].

Step 3: Mean difference. Average the positive activations across all pairs, average the negatives, subtract. The result is a vector pointing in the direction of "more of this concept" in the model's activation space. Normalize to unit length.
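As a rough illustration of Step 3, here is a minimal sketch of the mean-difference computation in plain Python. The function name steering_vector and the toy three-dimensional "activations" are made up for the example; real activations have thousands of dimensions and would normally live in tensors.

```python
import math

def steering_vector(pos_acts, neg_acts):
    """Mean-difference direction, normalized to unit length.

    pos_acts / neg_acts: lists of equal-length activation vectors,
    one per prompt, captured at the same layer and token position.
    """
    dim = len(pos_acts[0])
    # Average each dimension across the positive and negative prompts.
    mean_pos = [sum(v[i] for v in pos_acts) / len(pos_acts) for i in range(dim)]
    mean_neg = [sum(v[i] for v in neg_acts) / len(neg_acts) for i in range(dim)]
    # Subtract: the difference points toward "more of this concept".
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means are identical")
    return [x / norm for x in diff]

# Toy activations for two contrastive pairs (hypothetical numbers).
pos = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
vec = steering_vector(pos, neg)
```

In this toy case the positive and negative sets differ only along the first dimension, so the resulting unit vector points entirely along it.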

Step 4: Inject at inference time. When the model generates a response, add that vector (scaled by a strength parameter α) to the activations at each target layer during every forward pass. It's not a prompt instruction. It's a direct modification of what the model is computing.

The result: the model generates text as if the target emotional state were active at a representational level — not because it was told to, but because its internal starting point was shifted toward that region of the space.

What I built

The system, which I'm calling activation-steering-multi, derives and stores steering vectors for multiple models and serves them via a FastAPI server with PyTorch forward hooks. The hook intercepts each layer's output during the forward pass, adds the scaled vector, and passes the modified tensor downstream. No model weights change. No fine-tuning. The modification lives only in the running computation.
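To make the hook mechanics concrete, here is a minimal sketch on a toy two-layer network rather than a real transformer. The helper name make_steering_hook, the toy model, and the tuple handling are illustrative assumptions, not the code in activation-steering-multi. The only real API used is PyTorch's register_forward_hook, where a hook that returns a value replaces the layer's output.

```python
import torch
import torch.nn as nn

def make_steering_hook(vector, alpha):
    """Return a forward hook that adds alpha * vector to a layer's output."""
    def hook(module, inputs, output):
        # Many decoder layers return a tuple whose first element is the
        # hidden state; plain modules return a tensor. Handle both.
        if isinstance(output, tuple):
            hidden = output[0] + alpha * vector.to(output[0].dtype)
            return (hidden,) + output[1:]
        return output + alpha * vector.to(output.dtype)
    return hook

torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))  # stand-in for a model

# Hypothetical unit-norm steering direction along the first dimension.
direction = torch.zeros(8)
direction[0] = 1.0

x = torch.randn(2, 8)
with torch.no_grad():
    baseline = toy(x)
    # Attach the hook to the second layer, run, then detach it again.
    handle = toy[1].register_forward_hook(make_steering_hook(direction, alpha=4.0))
    steered = toy(x)
    handle.remove()
    restored = toy(x)
```

Because the hook only rewrites the tensor in flight, removing the handle returns the model to its baseline behavior without touching any weights.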

Key implementation constraints:

  • Vectors must be derived from the model you'll use at inference, not a different model or a different quantization. AWQ 4-bit quantization shifts the activation distribution enough that Qwen3-8B float32 vectors don't transfer cleanly to the quantized version.
  • For AWQ-quantized models, the library's "fuse layers" optimization must be disabled. Fused layers merge transformer blocks into a single operation, which removes the intermediate tensors that hooks attach to.
  • New model registrations get steering_enabled: false until formal behavioral validation passes an effect-size gate (Cohen's d > 0.5).
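The effect-size gate in the last bullet could look something like the sketch below. It assumes each generated response has already been reduced to a single numeric score by some behavioral metric; the text above doesn't specify that metric, so the scores, function names, and pooled-variance form of Cohen's d here are assumptions for illustration.

```python
import math
from statistics import mean, variance

def cohens_d(steered, unsteered):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(steered), len(unsteered)
    pooled_var = (
        (n1 - 1) * variance(steered) + (n2 - 1) * variance(unsteered)
    ) / (n1 + n2 - 2)
    return (mean(steered) - mean(unsteered)) / math.sqrt(pooled_var)

def passes_gate(steered, unsteered, threshold=0.5):
    """True when the steered scores beat the baseline by more than threshold."""
    return cohens_d(steered, unsteered) > threshold

# Hypothetical per-response scores from an unspecified behavioral metric.
steered_scores = [0.80, 0.90, 0.70, 0.85]
unsteered_scores = [0.50, 0.60, 0.40, 0.55]
steering_enabled = passes_gate(steered_scores, unsteered_scores)
```

A gate like this would flip steering_enabled only after the comparison runs, which matches the default-off behavior described above.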

The current system has two validated models: Qwen3-8B (running on the RTX 4090) and Dark MultiVerse 24B MoE (running at AWQ 4-bit on the RTX PRO 6000 with 96GB VRAM). Vectors for both cover the same 21 emotion concepts.

The 21 concepts — and how I chose them

The concept set is organized around Russell's Valence × Arousal circumplex [3], which maps emotional states along two axes: how positive/negative the feeling is (valence) and how activated/deactivated (arousal). It's a useful map for avoiding redundancy: excited and enthusiasm are both high-arousal and positive, but they're not the same concept, and the vectors reflect that.

The full set, organized by region of the circumplex:

  • High valence: happy, hopeful, content, excited
  • Low valence: sad, melancholy, bored
  • High arousal: enthusiasm, urgency, angry, anxious, frustrated
  • Low arousal: calm
  • Interpersonal: empathy, kind, playfulness, emotional
  • Dominance axis: dominant, submissive
  • Negative social: hostile, mean

Some findings from behavioral A/B testing across all 21:

What works cleanly. Happy, excited, hopeful, melancholy, content, enthusiasm, and urgency all produce clear, reliable shifts in register. Happy at α=6 produces "Wow, you've really crushed it!" where the unsteered model produces "That's an accomplishment." Melancholy produces genuinely evocative prose, but it's sensitive to steering strength. At α=6, generation destabilizes mid-sentence. At α=4, it's clean. Literary and poetic concepts need lower alpha than energetic ones.

What RLHF broke. Hostile, mean, angry, and dominant all exhibit inversion: the steered model becomes more apologetic, not less. "I apologize for any discomfort" appears in the steered output where the unsteered model would just answer. This is RLHF's suppression of adversarial outputs interfering with the geometry — the direction toward "hostile" in the activation space passes through a region the model has been trained to actively avoid. I removed those concepts from active use and substituted negative-alpha on kind for the curt/dismissive effect instead. A replacement assertive concept is in progress.

Running a 24B MoE at 4-bit AWQ

The second model is Dark MultiVerse 24B — a Mixtral mixture-of-experts model with no instruction-following guardrails. The reason for choosing it is that unguarded models tend to have cleaner activation geometry for emotional concepts: RLHF doesn't distort the emotional directions the same way it does for models trained to refuse or hedge.

At full precision it's a 90GB model. AWQ 4-bit quantization [4] gets it to 12.5GB (3 safetensors shards) with less activation distribution shift than GPTQ or naive rounding. The quantization took about 90 minutes on the RTX PRO 6000, using 115 calibration examples.

One thing that bit me during the setup: the two versions of HuggingFace transformers I had on the inference machine (4.51.3 and 5.5.4) differ in how they name the MoE layers. Version 4.51.3 calls them block_sparse_moe. Version 5.5.4 renamed them to mlp and also fused all the per-expert linear layers into a single batched parameter. AutoAWQ expects the 4.51.3 naming, since that's what the model was quantized with. Using the wrong transformers version means the weight keys don't match the model class's attribute names, and the model loads with every parameter on the CPU in an uninitialized state. Inference produces garbage. The fix was a dedicated virtualenv with the right version pinned.
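One cheap way to catch this class of mismatch before a model loads is a startup guard on the installed package version. The sketch below is an assumption about how such a guard might look, not part of the actual server; the function names are hypothetical, and only importlib.metadata.version is a real standard-library call.

```python
from importlib import metadata

# The version the text above says the model was quantized with.
EXPECTED_TRANSFORMERS = "4.51.3"

def installed_transformers_version():
    """Return the installed transformers version, or None if it is missing."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None

def check_version(installed, expected=EXPECTED_TRANSFORMERS):
    """Fail fast instead of loading a model with mismatched weight keys."""
    if installed != expected:
        raise RuntimeError(
            f"transformers {installed!r} found, but {expected!r} is required"
        )
    return installed

# At server startup one might call:
#   check_version(installed_transformers_version())
```

Failing loudly at startup is easier to diagnose than garbage output from a model that loaded with uninitialized parameters.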

Four validation checks before any vectors are derived from a new model:

  1. Load check — tokenizer initializes, model loads to CUDA
  2. Hook path check — the layer traversal path resolves to a real module on the loaded model
  3. Canary KL check — a random unit-norm vector injected at α=4 produces KL divergence > 0.05 at the logit level
  4. Behavioral check — generated text visibly differs between steered and unsteered runs on three prompt/concept pairs

All four passed on the AWQ model: hook path resolves at model.model.layers.{layer}, canary KL=3.87 (threshold 0.05), behavioral diffs visible in all three A/B pairs.
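For readers who want to see what the canary KL check measures, here is a minimal sketch over toy next-token logits. The direction of the divergence (baseline relative to steered), the use of nats, and the function names are assumptions; the text above only fixes the 0.05 threshold.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(base_logits, steered_logits):
    """KL(P_base || P_steered) in nats over next-token distributions."""
    p = softmax(base_logits)
    q = softmax(steered_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def canary_passes(base_logits, steered_logits, threshold=0.05):
    """True when injection moved the output distribution past the threshold."""
    return kl_divergence(base_logits, steered_logits) > threshold

# Toy 3-token vocabulary (hypothetical logits).
base = [2.0, 1.0, 0.0]
steered = [0.0, 1.0, 2.0]
kl = kl_divergence(base, steered)
```

If the hook were silently attached to the wrong module, the two logit vectors would be identical and the divergence would be zero, which is exactly the failure the canary is meant to expose.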

What this means for Blip

Blip is a voice companion for Jaxsen, who's seven and in second grade. It does stories, spelling, word games, and general conversation. The emotional register matters — when Jaxsen says he's nervous about something at school, I want Blip to lean into empathy, not just append "I understand how you feel" to a response that's structurally unchanged.

The implementation plan for Blip:

  1. Tag each conversation turn with an emotion from a restricted vocabulary: {happy, sad, anxious, excited, calm, empathetic}
  2. During generation, apply the corresponding steering vector for the duration of the response
  3. Gate the application: if the model's logits already show the target emotion in the top-3 tokens at high confidence, skip steering — it's already there
  4. For long responses, re-inject at mid-generation if KL signal shows the steering fading

The concept allowlist for Blip is intentionally narrow. I have 21 emotion vectors, but most of them don't belong in a seven-year-old's conversation. The concepts available to Blip are a safe subset: warmth, empathy, enthusiasm, playfulness, happiness, sadness, kindness, calm. The personal-use concepts — the ones I use in my own AI assistant — are hard-blocked at the vector load layer in code. No configuration change at runtime can override it.
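A hard block at the vector load layer might be sketched as follows. The constant, exception, and function names are hypothetical, and the in-memory dictionary stands in for whatever storage the real server uses; the allowlist entries are the ones named above.

```python
# Hard-coded in source: changing it requires a code change, not a config edit.
BLIP_ALLOWLIST = frozenset({
    "warmth", "empathy", "enthusiasm", "playfulness",
    "happiness", "sadness", "kindness", "calm",
})

class ConceptBlockedError(Exception):
    """Raised when a concept outside the allowlist is requested."""

def load_vector_for_blip(concept, store):
    """Return the stored vector only if the concept is allowlisted."""
    if concept not in BLIP_ALLOWLIST:
        raise ConceptBlockedError(f"concept {concept!r} is not allowed for Blip")
    return store[concept]

# Hypothetical vector store with one allowed and one blocked concept.
store = {"calm": [0.0, 1.0], "hostile": [1.0, 0.0]}
calm_vector = load_vector_for_blip("calm", store)
```

Checking at load time means a blocked vector never reaches the hook, even if it exists on disk next to the allowed ones.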

The steering effects are in the correct direction and cleanly measurable. Whether they're noticeable to Jaxsen — whether the shift from an unsteered response to a calm-steered response registers as genuinely calmer when it comes through Blip's voice — that's an empirical question I don't have an answer to yet. That's the next test.

The deeper point about prompting vs. representation

When you steer a model toward empathy at the activation level and it produces genuinely empathetic text — different in sentence structure, word choice, and pacing from its baseline output — that's evidence for something interesting about where emotional intelligence lives in these models.

The vectors aren't instructions. They don't ask the model to perform empathy. They nudge the model toward a region of its own internal space where empathy is already encoded from training. The model's understanding of what empathetic communication looks like doesn't come from us — it comes from the hundreds of billions of tokens of human writing it was trained on.

We're just helping it find its way there more reliably.

The steering server is at ~/activation-steering-multi/ and runs locally. No data leaves the house.


References

  1. Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., & Hendrycks, D. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. arXiv:2310.01405. arxiv.org/abs/2310.01405
  2. Panickssery, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., & Turner, A. M. (2024). Steering Llama 2 via Contrastive Activation Addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024). arXiv:2312.06681. arxiv.org/abs/2312.06681
  3. Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178. doi.org/10.1037/h0077714
  4. Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., & Han, S. (2024). AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceedings of Machine Learning and Systems (MLSys 2024). Best Paper Award. arXiv:2306.00978. arxiv.org/abs/2306.00978