Skip to content

Fine-Tuning the Two Best Local Models I Own, and One of Them Won't Export

Last benchmark run, Gemma 4 31B scored 43.9 out of 50 without any fine-tuning. Pulled it from Ollama, ran it through the same 28 kid-tutor tests I use for everything, and it ended up between Claude Opus and Claude Sonnet. The obvious question: if the base model sits there without any task-specific training and already scores that high, what happens if I train a LoRA on it?

The training ran fine. Six hours on the RTX PRO 6000, 14,000 examples, rank-16 LoRA, three epochs. Loss curves looked good. Checkpoints saved correctly. Step 1,750 was the last clean save — two full epochs done, one to go.

Then I tried to export it and found out Gemma 4 is apparently not done yet, at least not from Unsloth's side.

The export bug

The normal path: save_pretrained_merged(save_method="merged_4bit_forced"). This dequantizes the bnb-4bit weights, applies the LoRA delta, and writes a merged model in 16-bit. For Gemma 4 31B it produced an empty directory. No error. No warning. Just nothing.

Second try: save_pretrained_gguf(), which converts directly to GGUF format without the intermediate merge step. That one fails with EOF when reading a line — Unsloth shells out to llama.cpp for quantization, and the subprocess dies before it finishes reading stdin.

At that point I went looking at why the merge produces nothing. The answer is Gemma4ClippableLinear. Unsloth patches Gemma 4's attention projection layers with a custom subclass — it clips activations for VRAM efficiency during training — and that wrapper breaks the dequantization path. When the merger tries to extract the underlying weights and apply the LoRA, it's looking at a wrapped layer that doesn't behave like a standard Linear. Result: empty output, no crash.

I tried going around Unsloth entirely, loading the adapter with plain PEFT and merging directly. That fails for the same reason — PEFT's injection mechanism can't handle the wrapped layers either. The LoRA was trained on top of ClippableLinear instances. Standard PEFT doesn't know what those are.

There is a fourth path: download the BF16 base model from HuggingFace, patch the adapter config to point at it, and merge using standard Linear layers that PEFT can actually work with. I got this working on paper. But Google gated the model. As of a few hours ago, though, they've opened it — no token required anymore. The BF16 base is downloading right now, about 62 GB across two shards. If the merge works, checkpoint-1750 (two epochs of training) might be usable after all.

The pivot

Google also dropped new Gemma 4 models today — E4B and 26B-A4B MoE. E4B is about 8 billion parameters total, efficient architecture. Unsloth already has a bnb-4bit version up, and — this is the thing that matters — other people have already successfully exported fine-tuned GGUFs from it. The community uploads exist. The pipeline works for E4B where it doesn't for 31B.

My guess is that E4B doesn't use the same ClippableLinear patching, or that Unsloth has a clean export path for smaller models that hasn't been fixed for 31B yet. I haven't confirmed it. But if someone already fine-tuned E4B and posted a GGUF, the export isn't broken.

So E4B is training on the local 4090 now. The workstation has been sitting idle while the inference box handled everything — seemed like a waste, and E4B at 8B fits comfortably in 24 GB of VRAM. Three epochs, same dataset, same hyperparameters. Should finish well before the Qwen3 run on the inference box.

Qwen3-32B fine-tune

The other model I'm training: Qwen3-32B, the one that scored 47.8 on trivia and 46.8 on math last run. Both category-leading scores. The hypothesis is simpler than the Gemma 4 one — Qwen3 already has the strongest factual recall and math performance in my local fleet, and if fine-tuning keeps that while adding the blip-edu behavioral characteristics (concise, age-appropriate, emotionally aware), it could clear 50/50 on categories it already nearly maxes out.

Qwen3 doesn't have the Unsloth export issue. It's a standard architecture with standard linear layers. The post-training pipeline — merge, convert to F16 GGUF, quantize to Q4_K_M, register in Ollama — is the same one I've run six or seven times now. I'm not worried about that part.

The run is at about step 40 out of 2,625. Roughly 5.7 seconds per step on the PRO 6000, which puts it finishing around 4 AM. A watcher script will handle the merge and registration automatically. I've been burned before by processes dying at 3 AM with nothing catching the output, so the watcher logs everything to a shared file and touches a signal file when it's done.

Why both at once

The short answer: two different hypotheses, and I don't want to wait a week between them to see which one holds.

The longer answer: Gemma 4 31B leads multi-turn and voice quality. Qwen3-32B leads trivia and math. Those are different task categories that Blip routes to different models anyway. If both fine-tunes work, I don't have to pick one — Qwen3 takes factual tasks, Gemma4 (probably E4B if 31B stays stuck) takes conversational ones. If neither works, I've learned something about whether fine-tuning high-baseline models at r=16 even moves the needle. If one works and one doesn't, I at least have half a win.

Either way it narrows the question faster than running them sequentially.

What's also joining the benchmark

GLM-4.5 Air. ZhipuAI's 106-billion-parameter MoE model with 12 billion active per token. It's already quantized to Q4_K_M and downloaded — 47 GB across two shards on the inference box's data drive. The problem is Ollama doesn't support sharded GGUFs, which is a known open issue. The fix is llama-cpp-python, which handles sharded files natively. I've got a server script that loads both shards on port 11435 and exposes an OpenAI-compatible endpoint. Once Qwen3 finishes training and frees up the 96 GB, that server starts automatically.

What I don't know yet is whether 12B active parameters at 106B total beats 32B dense on the blip-edu task mix. GLM-4.5 Air is a hybrid reasoning model — it emits thinking traces — and my runner strips them before scoring. The relevant question is whether the remaining response, after stripping, is better-calibrated for a seven-year-old than Gemma 4 31B at 3.9 seconds per call. I'll find out tomorrow.

Benchmark harness and training scripts at github.com/drbarry-blip. Training logs and raw results available on request.