Local AI Inference

A dedicated inference machine on the home network. 96GB VRAM, connected over a 10Gb direct link, running Ollama with automatic profile switching. Used for Blip, LoRA training, and anything I don't want going to a cloud API.

Status: Live · Burn-in passed · Profile switching working · Benchmark harness built

Why a dedicated inference box

My workstation has an RTX 4090 with 24GB of VRAM. That is plenty for most tasks, but not for running a 32B+ parameter model alongside ComfyUI, Claude Code, and a browser. The inference box moves all model serving to a separate machine, so serving and desktop work no longer compete for VRAM.

The other reason is privacy. My kids' voices run through Blip. LoRA training involves personal images. I'd rather those workloads stay on hardware I control than route through a cloud provider's inference endpoint.

Hardware

Note: the RTX PRO 6000 Blackwell requires the open kernel module variant of the NVIDIA driver — the proprietary driver returns "RmInitAdapter failed" and never surfaces the GPU. The open driver works correctly.

Software stack

Profile switching

Different work needs different models loaded. A coding session wants a 32B code-tuned model. A captioning session wants a VLM. Fitting all of them into 96GB at once isn't the problem; swapping them cleanly between tasks is.

The profile manager handles this. When I open a Revive EHR session in Claude Code, the SessionStart hook calls the manager's API, which evicts the current model set and loads the coding profile. It's automatic and adds about 140ms to session start (sticky cache on subsequent loads).

| Profile | Primary model | Used for |
| --- | --- | --- |
| coding | Qwen2.5-Coder 32B | Revive EHR, llm-server, Claude Code sessions |
| vision | Qwen2.5-VL 32B + Moondream | LoRA captioning, ComfyUI, photo work |
| reasoning | DeepSeek-R1 32B | Architecture decisions, debugging, hard problems |
| development | Qwen3.5-abliterated 35B-A3B | General chat, prose, uncensored generations |
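
The evict-then-load step can be sketched against Ollama's HTTP API. This is a rough illustration, not the profile manager itself. It assumes Ollama is listening on its default port, and the model tags in `PROFILES` are guesses at how the profiles above might map to Ollama tags (the development profile is left out because its tag is unknown). The function names are hypothetical. The sketch relies on Ollama's documented behaviour that a generate request with no prompt loads a model, and that `keep_alive` set to 0 unloads it.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address

# Hypothetical profile -> Ollama tag mapping; the tags are illustrative guesses.
PROFILES = {
    "coding": ["qwen2.5-coder:32b"],
    "vision": ["qwen2.5vl:32b", "moondream"],
    "reasoning": ["deepseek-r1:32b"],
}


def plan_switch(loaded, profile):
    """Return (model, keep_alive) steps that move from `loaded` to `profile`."""
    wanted = PROFILES[profile]
    # keep_alive=0 asks Ollama to unload the model immediately.
    evict = [(m, 0) for m in loaded if m not in wanted]
    # A negative keep_alive asks Ollama to keep the model resident.
    load = [(m, -1) for m in wanted if m not in loaded]
    return evict + load


def apply_step(model, keep_alive):
    """Send one step to Ollama; with no prompt, the request only loads or unloads."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()


def switch_profile(loaded, profile):
    """Evict everything outside the target profile, then load what is missing."""
    for model, keep_alive in plan_switch(loaded, profile):
        apply_step(model, keep_alive)
```

Keeping the planning step separate from the network call means the eviction order can be checked without a running Ollama instance. Models already in the target profile are left alone rather than reloaded.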

Hardware monitoring

The inference box publishes GPU temperature, VRAM usage, power draw, CPU temperature, and service health to AWS CloudWatch every 30 seconds. SNS sends email alerts at configurable thresholds. Local Python handles emergency shutdown (GPU at 95°C, CPU at 100°C) — that path can't wait for an AWS round-trip.
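
As a rough sketch of the publishing side, the snippet below turns one sample of readings into the `MetricData` list that boto3's `put_metric_data` expects. The namespace, metric names, and `Host` dimension are all hypothetical. How the readings are collected is left out, and AWS credentials are assumed to be configured already.

```python
from datetime import datetime, timezone

NAMESPACE = "InferenceBox"  # hypothetical CloudWatch namespace

# CloudWatch has no unit for degrees or watts, so those metrics use "None".
UNITS = {"vram_used_mb": "Megabytes"}


def build_metric_data(sample, host="inference-box"):
    """Convert {metric_name: value} into a CloudWatch MetricData list."""
    now = datetime.now(timezone.utc)
    dimensions = [{"Name": "Host", "Value": host}]
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Timestamp": now,
            "Value": float(value),
            "Unit": UNITS.get(name, "None"),
        }
        for name, value in sample.items()
    ]


def publish(sample):
    """Push one sample to CloudWatch; intended to run on a 30-second timer."""
    import boto3  # imported lazily so the payload builder works without AWS set up

    client = boto3.client("cloudwatch")
    client.put_metric_data(Namespace=NAMESPACE, MetricData=build_metric_data(sample))
```

A call such as `publish({"gpu_temp_c": 71, "vram_used_mb": 40960, "power_w": 310})` would send three metrics in a single request. The values here are made up for illustration.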

I ran a 90-minute burn-in test after deployment to validate the thermal envelope:

The 80°C warn / 90°C critical / 95°C shutdown thresholds are well-calibrated for the current cooling setup. The 85°C peak leaves a thin margin: worth watching, but not worth tightening thresholds and generating false alarms during normal training runs.
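
The local decision path is small enough to sketch in a few lines. The thresholds are the ones quoted above. Treating each threshold as inclusive (`>=`) is an assumption, and so are the function names. The actual shutdown action is deliberately left out.

```python
# GPU thresholds from the text, checked from most to least severe.
GPU_THRESHOLDS = [(95, "shutdown"), (90, "critical"), (80, "warn")]
CPU_SHUTDOWN_C = 100


def classify_gpu(temp_c):
    """Map a GPU temperature in °C to an alert level."""
    for limit, level in GPU_THRESHOLDS:
        if temp_c >= limit:  # assumption: thresholds are inclusive
            return level
    return "ok"


def should_shutdown(gpu_c, cpu_c):
    """Local emergency check; it makes no network call, so it cannot stall on AWS."""
    return classify_gpu(gpu_c) == "shutdown" or cpu_c >= CPU_SHUTDOWN_C
```

Under these rules the 85°C burn-in peak classifies as `warn`, which fits the "worth watching" reading: above the warning line but below critical.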

LLM benchmarking

I built a benchmarking harness to objectively compare local models against cloud models on the actual tasks Blip needs to handle. The harness sends identical prompts to all models, captures raw outputs, then uses Claude Opus as an anonymized judge to rank responses.
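
The anonymization step is the part most worth illustrating. Below is a minimal sketch, not the harness itself. It shuffles the responses, hands the judge letter labels only, and keeps a private key for mapping the ranking back to model names. The function names and data shapes are hypothetical, and the call to the judge model is omitted.

```python
import random
import string


def anonymize(responses, seed=None):
    """Hide model names behind shuffled letter labels.

    responses: {model_name: response_text}
    Returns (labeled, key): labeled goes to the judge, key stays private.
    """
    rng = random.Random(seed)
    models = list(responses)
    rng.shuffle(models)  # the label order must not leak the model order
    labels = string.ascii_uppercase[: len(models)]
    labeled = {label: responses[model] for label, model in zip(labels, models)}
    key = dict(zip(labels, models))
    return labeled, key


def deanonymize(ranking, key):
    """Translate the judge's ranking of labels back into model names."""
    return [key[label] for label in ranking]
```

Shuffling separately for each task keeps a given model from always appearing under the same letter, so the judge cannot learn a label-to-model association across tasks. Letter labels cover up to 26 models, which is enough for the 11 compared here.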

After benchmarking 11 models across 28 Blip-specific tasks, the results were counterintuitive enough that I wrote about them:

What I Learned Benchmarking 6 Local LLMs — and Why Most Model Comparisons Are Wrong →

Short version: the dedicated code model (Qwen2.5-Coder) turned out to be the best local model for Blip's conversational tasks — better than the abliterated chat model I'd originally planned to use. Two 27B models distilled from Opus reasoning traces failed completely — they emitted 2,000+ tokens per response when Blip needs 40. The benchmark was worth running.

What's next