Building & Tinkering
Maker projects, local AI, custom software, and building things — alongside clinical practice.
-
Teaching an AI to Feel Things: How Activation Steering Works (and What It Means for Blip)
Most approaches to making AI more emotionally responsive rely on prompt instructions. Activation steering works differently — injecting a direction into the model's internal representations during the forward pass. Here's the method, 21 emotion concepts, and what happened when RLHF fought back.
-
The Smaller Model Won, and the 70B Ran at 1.6 Tokens Per Second
I A/B/C tested three local models as the offline fallback for my voice assistant. GLM-4.7-Flash beat both Qwen3 32B and Llama 3.3 70B. Also, Llama ran six times slower than it should have, and I think I know why.
-
I Had a Council of AIs Grade Itself. Here's What It Said.
Built a 5-stage voting pipeline where multiple AI models extract atomic claims from their own deliberation and vote on each one. Shipped it in a day. Then asked it to evaluate its own design. It was sharper than I was.
-
I Benchmarked 78 AI Models and Almost Picked the Wrong Winner
A four-phase benchmark of 78 local and cloud AI models to choose a production default. The highest-scoring model turned out to be unusable. Here's what happened and what I picked instead.
-
My Test Suite Was Lying to Me
When I ran Blip's text emulator against the live Claude API for the first time, most of it failed. The cause: the Claude code path was returning placeholder strings and I hadn't noticed. After fixes: 18/18 pipeline utterances passed, including a new parent (Barry) voice; 5/9 fully passing, 0 failures.
-
Testing Blip Without a Kid in the Room
I built a synthetic test harness that generates child voice audio, pipes it through every stage of Blip's pipeline, and measures what breaks. 12/12 utterances passed. One of the findings is about Whisper and why it's bad at speech therapy.
-
I Trained Five Versions of Blip's Brain. TriviaQA Won, and I Still Didn't Use It.
Five fine-tuned blip-edu variants, one 28-test benchmark, and a surprising winner I'm not actually shipping — because winning a head-to-head tournament and being better for two specific kids are not the same thing.
-
Blip Is Getting Faster — Latency Fixes, a Local Model Win, and What Comes Next
The SER5 went from 8 seconds to 2.5 seconds end-to-end. blip-edu beat Claude on spelling drills. And I'm building a cloud brain so Blip can run on an iPhone. Three changes from the last two weeks.
-
The 31B Merge Finally Worked. The Qwen3 Fine-Tune Is Broken.
blip-edu-gemma4 scored 921/1150 on 28 tests — 9th out of 30 models, best blip-edu variant built so far. blip-edu-qwen3 timed out on 20 of the same tests and produced nonsense on the other 8. GLM-4.5 Air debuted at 8th. Here's what actually happened.
-
Deploying Blip on a $300 Mini PC Is Humbling
The Beelink SER5 MAX is the machine that actually goes in the kids' room. Getting it to work like the dev workstation — no CUDA, different audio hardware, everything GPU-intensive running over a LAN — has been an ongoing exercise in assumptions I didn't know I was making.
-
Fine-Tuning the Two Best Local Models I Own, and One of Them Won't Export
Gemma 4 31B scored 43.9 without any fine-tuning. So I trained a LoRA on it. The model trained fine — then every export path failed. Unsloth wraps Gemma 4's attention layers in a custom class that breaks the merge. Meanwhile Qwen3-32B is running clean on the inference box, and E4B is going on the 4090.
-
Gemma 4 Crashed the Benchmark and a Bug I Didn't Know I Had
Four new local models, a GLM-Z1 content bug that was scoring thinking traces instead of responses, and two blip-edu fine-tunes that generate 14,000-character walls of text. Gemma 4 31B is now the best local model in the benchmark — sitting between Claude Opus and Claude Sonnet.
-
When Two Kids Talk at Once, Blip Hears Gibberish
Blip's speech recognition works great with one kid. Add a sibling and a parent yelling from the kitchen, and Whisper blends them into word salad. I'm testing three fixes — with audio samples you can listen to.
-
26 Models, Half a Crash, and One Clear Winner
I tried to benchmark 26 models at once. The infrastructure buckled — 15 judge calls failed, several models emitted zero tokens. But GLM-5.1 walked in and won 4 tests, and the CHILDES fine-tunes scored their first wins.
-
My Kid Said "Bloop" and Two AI Models Lost Their Minds
I ran 201 recorded kid-speech turns through Whisper small.en and Whisper large-v3-turbo. The bigger model hallucinated Icelandic. What I learned about kid speech, multilingual language detection bugs, and why bigger isn't always better.
-
I Tried to Use the TurboQuant Paper on My GPU. Here's What Actually Ran.
Google Research published a beautiful 3-bit KV compression technique. No code exists yet. While I waited, I set up NVFP4 and speculative decoding on the inference box — the production equivalents that actually ship on a Blackwell GPU today.
-
Five Datasets, One Question: Does Training Data Source Matter for blip-edu?
I trained five variants of blip-edu, each blended 50/50 with a different open dataset — TinyStories, SmolTalk, UltraChat, TriviaQA, and a CHILDES proxy. Phase A of a two-phase A/B study on whether synthetic-only training data is leaving wins on the table.
-
Blip Build Log #2: Adding Speech Practice — and Why It's Harder Than It Sounds
I'm building a speech articulation practice module into Blip. As a clinician, I know what makes practice tools fail kids — so I'm trying to build one that doesn't.
-
From Flat Piper to Voice Cloning: Blip's Voice Overhaul
What happened when I replaced Blip's synthesized voice with F5-TTS voice cloning, iterated through several reference voices, added a continuous filler loop, and learned that real-time audio UX is far harder than it looks when the users are seven years old.
-
Teaching Blip to Remember Yesterday
My 8-year-old asked "what happens?" with zero context and Blip knew exactly what he meant. Here's how memory continuity works, what broke when a 7B model hallucinated a character name, and what I learned benchmarking 6 LLMs on memo quality.
-
After the Fix: Qwen-Coder vLLM BF16 vs Ollama Q4_K_M, Side by Side
Shipped a small runner.py fix that unblocked the vLLM endpoint and re-ran the 17-model benchmark. Got the cleanest A/B yet: same model, two serving stacks, dual-judged. vLLM BF16 wins on every dimension and posts the best voice_quality average rank in the whole benchmark.
-
Two New Local 32Bs Walked Into the Benchmark — Qwen3-32B and GLM-Z1
Added Qwen3-32B and GLM-Z1-32B to the Blip benchmark and turned on three inference calibration improvements at the same time. Both new locals debuted at 3 wins — Tier-1-adjacent. The blip-edu family jumped from 3 wins to 7 on pure calibration, no retraining. Also found a vLLM bug the hard way.
-
blip-edu v2: What I Learned Trying to Make My Kid-Tutor Model Better
14,000 examples, two new training runs, six GPU hours, and one falsified hypothesis later, here's what v2 and a Qwen-Coder A/B variant actually bought me. The short version: targeted dataset changes work in both directions, including a few that went the wrong way.
-
I Trained My Own Kid-Tutor LLM. Here's How It Did Against the Frontier Models
8,500 synthetic examples, 58 minutes of LoRA training, and a 4.7 GB GGUF later, blip-edu went into the same benchmark as Claude Opus and DeepSeek-V3. The result: 2 wins, the fastest model in the suite, and one important lesson about what fine-tuning can and can't teach.
-
Getting a Computer to Understand My 7-Year-Old
Kids' speech is harder to transcribe than adults'. Here's what I learned trying to get Whisper to reliably understand Adalind.
-
Blip Build Log #1: Getting the Audio Pipeline Working
What it took to wire together wake word detection, local speech-to-text, and text-to-speech — and get the first working loop.
-
I Ran the Same Benchmark Three Times and Got Three Different Answers
A third pass at the LLM benchmark was supposed to settle the qwen-coder question. Instead it revealed the noise floor: ±2-3 wins per model on a 28-test suite. Here's what survives three runs and what doesn't.
-
Two Judges Are Better Than One: What the Dual-Judge Benchmark Found
Adding a second non-Anthropic judge changed the results in ways that matter. GLM-5 lost five wins when the affinity came out in the wash, the Opus-distilled models are still hopeless, and the qwen-coder "win" was partially a calibration accident.
-
The Corrected Benchmark: 9 Models, 5 Suites, Calibrated Settings
After fixing three per-model calibration errors, I re-ran the Blip benchmark with 9 models across 5 task suites. Here's what changed, what didn't, and what the results mean for Blip's routing layer.
-
The Settings That Change Everything: Per-Model Calibration Before the Re-Run
The first benchmark ran all 11 models with identical settings. That was wrong for at least two of them. Here's the exact diff and why R1 and Hermes3 results can't be trusted yet.
-
308 Outputs, One Judge: Inside the Blip LLM Benchmark
The full benchmark data: real prompts, how Claude Opus scored 308 outputs, the bias-reduction steps, and the complete results table across 11 models and 9 task categories.
-
What I Learned Benchmarking 6 Local LLMs — and Why Most Model Comparisons Are Wrong
I ran a structured benchmark across 6 local LLMs for the Blip project. The results were only useful after I fixed three fundamental methodology errors that would have invalidated the whole thing.
-
I'm Building My Kids an AI Learning Console. Here's Why.
My kids ask Alexa questions all day. The answers are terrible. So I built something better — a voice-first AI learning console called Blip.
-
Welcome to Building & Tinkering
What this category is about and why I build things outside the clinic.