Local AI Inference
A dedicated inference machine on the home network. 96GB VRAM, connected over a 10Gb direct link, running Ollama with automatic profile switching. Used for Blip, LoRA training, and anything I don't want going to a cloud API.
Status: Live · Burn-in passed · Profile switching working · Benchmark harness built
Why a dedicated inference box
My workstation has an RTX 4090 with 24GB VRAM — plenty for most tasks, but not for running a 32B+ parameter model while also running ComfyUI, Claude Code, and a browser. The inference box offloads all model serving to a separate machine so neither job competes for VRAM.
The other reason is privacy. My kids' voices run through Blip. LoRA training involves personal images. I'd rather those workloads stay on hardware I control than route through a cloud provider's inference endpoint.
Hardware
- GPU: PNY RTX PRO 6000 Blackwell Max-Q — 96GB GDDR7 ECC
- OS: Ubuntu 22.04.5 LTS · CUDA 12.8 · Driver: nvidia-driver-570-open
- Network: 10GbE direct link to workstation (10.10.10.1 ↔ 10.10.10.2), 0.5ms ping
Note: the RTX PRO 6000 Blackwell requires the open kernel module variant of the NVIDIA driver — the proprietary driver returns "RmInitAdapter failed" and never surfaces the GPU. The open driver works correctly.
Software stack
- Ollama — systemd service, bound on 0.0.0.0:11434, flash attention on, 10-min keep-alive
- Profile manager — custom FastAPI service on :9000, 8 endpoints, handles model loading/unloading by profile
- SessionStart hook — on my workstation, detects the active project directory and automatically switches the inference profile to match
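A minimal sketch of how the hook's directory-to-profile step could work. The directory names, the `/profile/<name>` route, and the manager address are all assumptions made for illustration; the real manager's endpoints aren't listed on this page.

```python
import urllib.request
from pathlib import Path

# Hypothetical mapping from project directory name to profile name.
PROJECT_PROFILES = {
    "revive-ehr": "coding",
    "llm-server": "coding",
    "lora-captioning": "vision",
}
DEFAULT_PROFILE = "development"

# Assumed address of the profile manager; which end of the direct link
# is the inference box is a guess.
MANAGER_URL = "http://10.10.10.2:9000"


def profile_for(cwd: Path) -> str:
    """Return the profile of the innermost known project directory in cwd's path."""
    for part in reversed(cwd.parts):
        if part in PROJECT_PROFILES:
            return PROJECT_PROFILES[part]
    return DEFAULT_PROFILE


def switch_profile(profile: str, timeout: float = 5.0) -> int:
    """POST to the hypothetical /profile/<name> route and return the HTTP status."""
    req = urllib.request.Request(
        f"{MANAGER_URL}/profile/{profile}", data=b"", method="POST"
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

The hook itself would then be one call: `switch_profile(profile_for(Path.cwd()))`. Walking the path from the innermost component outward means a session opened in a subdirectory still resolves to its project's profile.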
Profile switching
Different work needs different models loaded. A coding session wants a 32B code-tuned model. A captioning session wants a VLM. Fitting them in 96GB isn't the problem; getting them swapped cleanly between tasks is.
The profile manager handles this. When I open a Revive EHR session in Claude Code, the SessionStart hook calls the manager's API, which evicts the current model set and loads the coding profile. It's automatic and adds about 140ms to session start (sticky cache on subsequent loads).
| Profile | Primary model | Used for |
|---|---|---|
| coding | Qwen2.5-Coder 32B | Revive EHR, llm-server, Claude Code sessions |
| vision | Qwen2.5-VL 32B + Moondream | LoRA captioning, ComfyUI, photo work |
| reasoning | DeepSeek-R1 32B | Architecture decisions, debugging, hard problems |
| development | Qwen3.5-abliterated 35B-A3B | General chat, prose, uncensored generations |
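The evict-then-load step can be sketched against Ollama's HTTP API, where a `/api/generate` request with `keep_alive` set to `0` unloads a model and a request with no prompt loads one. This is not the profile manager's actual code: the model tags and the `plan_switch` / `apply_plan` names are illustrative assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's port from the stack above

# Model tags are illustrative guesses, not necessarily the tags in use.
PROFILES = {
    "coding": ["qwen2.5-coder:32b"],
    "reasoning": ["deepseek-r1:32b"],
}


def plan_switch(loaded: list, target: str) -> list:
    """Build the /api/generate payloads that move from `loaded` to the target profile."""
    wanted = PROFILES[target]
    # keep_alive=0 asks Ollama to unload the model immediately.
    evict = [{"model": m, "keep_alive": 0} for m in loaded if m not in wanted]
    # A request with no prompt loads the model; keep_alive sets how long it stays resident.
    load = [{"model": m, "keep_alive": "10m"} for m in wanted if m not in loaded]
    return evict + load


def apply_plan(payloads: list) -> None:
    """Send each payload to Ollama in order, evictions first."""
    for body in payloads:
        req = urllib.request.Request(
            f"{OLLAMA_URL}/api/generate",
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=120).read()
```

Keeping the planning step separate from the network calls makes the switch easy to test, and a model already resident in the target profile is left alone rather than reloaded.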
Hardware monitoring
The inference box publishes GPU temperature, VRAM usage, power draw, CPU temperature, and service health to AWS CloudWatch every 30 seconds. SNS sends email alerts at configurable thresholds. Local Python handles emergency shutdown (GPU at 95°C, CPU at 100°C) — that path can't wait for an AWS round-trip.
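A rough sketch of the local shutdown path, assuming the GPU temperature is read with `nvidia-smi` and using the 80°C / 90°C / 95°C GPU thresholds and 100°C CPU limit from this page. The function names are hypothetical, and the actual monitor may read its sensors differently.

```python
import subprocess

# Thresholds from this page, in degrees Celsius.
GPU_WARN, GPU_CRITICAL, GPU_SHUTDOWN = 80, 90, 95
CPU_SHUTDOWN = 100


def read_gpu_temp() -> int:
    """Read the die temperature of the first GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.splitlines()[0].strip())


def classify_gpu(temp_c: int) -> str:
    """Map a GPU temperature to an alert level."""
    if temp_c >= GPU_SHUTDOWN:
        return "shutdown"
    if temp_c >= GPU_CRITICAL:
        return "critical"
    if temp_c >= GPU_WARN:
        return "warn"
    return "ok"


def must_shut_down(gpu_c: int, cpu_c: int) -> bool:
    """True when either sensor has reached its emergency limit."""
    return classify_gpu(gpu_c) == "shutdown" or cpu_c >= CPU_SHUTDOWN
```

A local loop would poll the sensors and power the machine off when `must_shut_down` returns true, with no dependency on CloudWatch or SNS being reachable.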
I ran a 90-minute burn-in test after deployment to validate the thermal envelope:
- GPU peak: 85°C under combined CPU + GPU load
- Zero thermal throttle events across 244 monitor samples
- Sustained 50.1 TFLOPS FP32 for the full 60-minute combined burn
- 5 SNS alerts fired correctly (3 warn, 2 recovery), validating the alert pipeline end-to-end
The 80°C warn / 90°C critical / 95°C shutdown thresholds are well-calibrated for the current cooling setup. The 85°C peak sits only 5°C below the critical threshold, which is thin margin: worth watching, but not worth tightening thresholds and generating false alarms during normal training runs.
LLM benchmarking
I built a benchmarking harness to objectively compare local models against cloud models on the actual tasks Blip needs to handle. The harness sends identical prompts to all models, captures raw outputs, then uses Claude Opus as an anonymized judge to rank responses.
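The anonymization step could look something like the sketch below: responses are shuffled and relabeled A, B, C, and so on before the judge sees them, and a private key maps the labels back to model names afterwards. The function names are hypothetical, and the harness's real implementation may differ.

```python
import random
import string
from typing import Optional


def anonymize(responses: dict, seed: Optional[int] = None) -> tuple:
    """Shuffle responses and relabel them A, B, C... so the judge never sees model names.

    Returns (blinded, key): blinded maps label -> response text,
    key maps label -> model name. Supports up to 26 models.
    """
    rng = random.Random(seed)
    models = list(responses)
    rng.shuffle(models)
    labels = string.ascii_uppercase[: len(models)]
    blinded = {label: responses[m] for label, m in zip(labels, models)}
    key = dict(zip(labels, models))
    return blinded, key


def deanonymize(ranking: list, key: dict) -> list:
    """Translate the judge's ranked labels back into model names."""
    return [key[label] for label in ranking]
```

Shuffling per task, rather than once per run, keeps any one model from always appearing in the same position, which matters if the judge has a position bias.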
After benchmarking 11 models across 28 Blip-specific tasks, the results were counterintuitive enough that I wrote about them:
What I Learned Benchmarking 6 Local LLMs — and Why Most Model Comparisons Are Wrong →
Short version: the dedicated code model (Qwen2.5-Coder) turned out to be the best local model for Blip's conversational tasks — better than the abliterated chat model I'd originally planned to use. Two 27B models distilled from Opus reasoning traces failed completely — they emitted 2,000+ tokens per response when Blip needs 40. The benchmark was worth running.
What's next
- revive-clinical model — a fine-tuned clinical reasoning model for the Revive EHR. Would run as a runtime profile applied by the EHR app, not the developer SessionStart hook. The 96GB card can hold a 27B BF16 model without quantization.
- vLLM and llama.cpp — deferred because Ollama handles current workloads. They would be needed if I want PagedAttention for high-concurrency serving (vLLM), or want to run GGUF models directly in llama.cpp without Ollama's quantization overhead.
- GDDR7 memory temperature monitoring — the burn-in used only ~1.6GB VRAM (the matrix triplet), so memory thermals weren't stressed. The first time I see thermal throttle during real LoRA training without the die temperature crossing 85°C, GDDR7 junction temperature is the prime suspect. Worth adding to the CloudWatch metrics.
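The BF16 claim in the revive-clinical item is simple arithmetic: BF16 stores each parameter in 2 bytes, so weights alone take roughly 2 GB per billion parameters. The helper below is a back-of-the-envelope sketch that counts weights only and ignores KV cache and activation memory.

```python
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in decimal GB (weights only, no KV cache)."""
    return params_billions * bytes_per_param


bf16_27b = weight_gb(27, 2)   # 27B parameters at 2 bytes each
headroom = 96 - bf16_27b      # what is left of the 96GB card for everything else
```

That puts a 27B BF16 model at about 54 GB of weights, leaving about 42 GB for KV cache, activations, and any secondary model.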