Fine-tuning DeepSeek-V3 distilled models for code review - Axolotl vs Unsloth vs TorchTune?

alex_ml · March 20, 2026, 9:14pm

Has anyone tried fine-tuning DeepSeek-V3 distilled models (the 32B or 70B variants) for domain-specific tasks? I’m trying to figure out the right tooling for our workflow and hitting some practical questions.

We have a product that does automated code review for security vulnerabilities, and we’ve been using Claude via API for the heavy lifting. Works great, but costs are climbing fast as we scale. The idea is to fine-tune an open-source model on our labeled dataset of ~50k code review examples so we can self-host inference.

I’ve been looking at three fine-tuning frameworks and I’m not sure which fits our situation:

Axolotl seems like the production-grade option. It has quantization-aware training (QAT), sequence parallelism for multi-GPU setups, and supposedly stable GRPO reasoning training. We have a 4xA100 cluster we could dedicate to this.
Unsloth just launched Unsloth Studio (March 17) with a no-code UI, and they claim 2x faster training with 70% less VRAM. That’s compelling for iteration speed, but I’m not sure it scales to our dataset size or multi-GPU.
TorchTune keeps coming up for enterprise-scale stuff but the docs are thinner and I haven’t found as many real-world reports.

Some specific things I’m trying to figure out:

DeepSeek 32B vs 70B for code tasks: The 32B distilled variant supposedly retains most of the reasoning quality of the full 671B model on code benchmarks. Has anyone actually validated this on real code review workloads? The synthetic benchmark scores don’t always translate.
QAT vs post-training quantization: Axolotl’s QAT support sounds great in theory. You train with quantization simulated in the forward pass, so the weights adapt to it, and you get better quality than quantizing after the fact. But is the quality difference actually meaningful for code tasks, or is standard GPTQ/AWQ after full-precision fine-tuning close enough?
LoRA rank selection: For code-heavy fine-tuning, are people going higher rank (64-128) than typical? My intuition says code understanding might need more adapter capacity than, say, tone/style adaptation, but I haven’t seen benchmarks backing that up.
Eval during training: What metrics are you using beyond loss curves? We’re thinking about running a held-out set of real PRs through the model at checkpoints and scoring against our human reviewer labels, but that’s slow. Any tricks for faster proxy metrics that correlate with actual code review quality?

Our current plan is to start with Axolotl + LoRA on the 32B distilled model, validate quality, then decide if we need to scale to the 70B. But I’d love to hear from anyone who’s been through this process recently.

Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!

mike_backend · March 21, 2026, 5:11pm

We went through almost this exact exercise two months ago, fine-tuning a DeepSeek distilled model for automated PR feedback on our internal codebase. Different domain than security vulns, but a lot of the practical learnings overlap. We ended up on Axolotl and I think that’s the right call for your setup.

Axolotl vs Unsloth for your situation

With a 4xA100 cluster, Axolotl is the clear winner. Unsloth’s speed advantages are mostly on single-GPU setups. Their February 2026 update added MoE training support with some impressive numbers (12x faster, 35% less VRAM), but that’s for the full MoE architecture. For the distilled 32B/70B models (which are dense transformers, not MoE), the multi-GPU story matters more than single-GPU optimization.

Axolotl’s sequence parallelism across your 4 A100s will let you handle longer code contexts without chopping them up. For code review, context length is everything. A function that looks fine in isolation might be a security issue when you see the calling code. We trained with 8192 token sequences and wished we’d gone longer.

Unsloth Studio’s no-code UI is nice for quick experiments, but for a production training pipeline you’ll want the config-file approach anyway. You need reproducibility, version control on your training configs, and CI integration for automated retraining.

TorchTune is solid but it’s really aimed at teams building custom training loops from scratch. If you just want to fine-tune with best practices and move on to serving, Axolotl gives you that with less custom code.

32B vs 70B for code

We tested both. The 32B distilled model is genuinely impressive for code understanding; it retained most of the reasoning capability from the full 671B model on our benchmarks. But here’s the thing: for security-specific code review, the 70B was meaningfully better at catching subtle issues like TOCTOU races and injection vectors that require multi-step reasoning. The 32B would flag the obvious stuff (SQL injection with string concatenation) but sometimes miss the nuanced cases.

My suggestion: start with 32B for faster iteration on your training pipeline. Get your data formatting, eval harness, and serving infrastructure working. Then do a quality comparison run on 70B before you commit to production. The infra investment to serve 70B is significantly higher, so you want data showing it’s worth it for your specific use case.
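To put a rough number on the serving gap, here is a back-of-the-envelope sketch of weight memory only. It treats "32B" and "70B" as nominal parameter counts (an assumption) and ignores KV cache, activations, quantization metadata, and runtime overhead, so real usage will be higher.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes); excludes KV cache and overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Nominal sizes; actual parameter counts differ slightly from the model name.
    for name, params in [("32B", 32e9), ("70B", 70e9)]:
        for dtype in ("bf16", "int4"):
            print(f"{name} {dtype}: ~{weight_memory_gb(params, dtype):.0f} GB")
```

Even at 4-bit, the 70B weights alone need roughly twice the memory of the 32B, before any context is loaded.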

LoRA rank for code tasks

Your intuition is right. We started at rank 32 and saw a noticeable quality jump going to 64. Beyond 64 the gains were marginal for our task. Here’s the thing though: code review is really two skills, understanding code semantics and applying domain-specific review criteria. The base model already has strong code understanding, so your adapter mostly needs to learn your review patterns and severity calibration.

We settled on rank 64 with alpha 128 (2x ratio). Target modules: q_proj, k_proj, v_proj, o_proj, and gate_proj. Note that gate_proj sits in the MLP block, not the attention layers. Including it improved how well the model picked up security-relevant code patterns on our evals.
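For a sense of what the rank choice costs, here is a minimal sketch of the standard LoRA arithmetic: each adapted linear layer gets two low-rank matrices, and the update is scaled by alpha / rank. The 4096 x 4096 layer shape is a hypothetical placeholder, not the real dimensions of any DeepSeek distilled model, and real projections such as k_proj, v_proj, and gate_proj are often not square.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


if __name__ == "__main__":
    d_in = d_out = 4096  # hypothetical layer shape, for illustration only
    for rank in (32, 64, 128):
        added = lora_params(d_in, d_out, rank)
        print(f"rank {rank}: {added:,} trainable params per adapted layer")
    print("scale at rank 64, alpha 128:", lora_scale(128, 64))
```

Adapter size grows linearly with rank, so going from 64 to 128 doubles the trainable parameters per layer.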

Eval during training: practical approach

Loss curves are basically useless for this. They’ll go down and tell you nothing about whether the model is catching real bugs.

What worked for us:

Classification accuracy on a curated set. Take 500 code snippets where you know the ground truth (200 with real vulnerabilities, 300 clean). Run the model at each checkpoint and measure precision/recall. This takes ~10 minutes on a single GPU and correlates well with real-world performance.
Structured output compliance. If you’re expecting the model to output in a specific format (severity level, affected lines, fix suggestion), measure how often the checkpoint produces parseable output. We saw format compliance drop during training sometimes before recovering, which was a useful early warning.
Diff-level agreement with human reviewers. Take 50 real PRs where you have human review comments. Score how often the model flags the same lines the human did. This is your most expensive eval but also the most meaningful. Run it every 200 steps, not every step.
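The first two checks are cheap to script. Below is a minimal sketch, assuming the model's verdict per snippet has already been reduced to a boolean and that the structured output is JSON. The key names in REQUIRED_KEYS are hypothetical stand-ins for whatever schema you actually use.

```python
import json

# Hypothetical schema: adjust to the fields your review format requires.
REQUIRED_KEYS = {"severity", "affected_lines", "fix_suggestion"}


def precision_recall(predictions: list[bool], labels: list[bool]) -> tuple[float, float]:
    """Precision and recall where True means 'vulnerable'."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    fn = sum(1 for p, l in pairs if not p and l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def is_parseable(raw_output: str) -> bool:
    """True if the output is a JSON object containing every required key."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)


def format_compliance(outputs: list[str]) -> float:
    """Fraction of checkpoint outputs that parse into the expected structure."""
    if not outputs:
        return 0.0
    return sum(1 for o in outputs if is_parseable(o)) / len(outputs)
```

Tracking both numbers per checkpoint makes it easy to see when format compliance dips while classification quality is still improving.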

Skip BLEU/ROUGE type metrics entirely. They’re meaningless for this task.
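For the diff-level agreement check, one way to score it is the fraction of human-flagged lines the model also flagged, averaged over PRs. This is a sketch under that assumption. Representing a flagged line as a (file path, line number) pair and the helper names are illustrative choices, not a standard.

```python
FlaggedLine = tuple[str, int]  # (file path, line number); hypothetical representation


def line_agreement(model_lines: set[FlaggedLine], human_lines: set[FlaggedLine]) -> float:
    """Fraction of human-flagged lines that the model also flagged."""
    if not human_lines:
        raise ValueError("PR has no human-flagged lines to compare against")
    return len(model_lines & human_lines) / len(human_lines)


def mean_agreement(prs: list[tuple[set[FlaggedLine], set[FlaggedLine]]]) -> float:
    """Average agreement over (model_lines, human_lines) pairs, skipping PRs with no human flags."""
    scores = [line_agreement(model, human) for model, human in prs if human]
    return sum(scores) / len(scores) if scores else 0.0


def should_run_pr_eval(step: int, every: int = 200) -> bool:
    """Gate the expensive PR eval so it only runs every `every` training steps."""
    return step > 0 and step % every == 0
```

This only measures overlap with the human reviewer, so it is worth also tracking how many extra lines the model flags to catch a checkpoint that flags everything.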

QAT vs post-training quantization

For code tasks specifically, we found post-training GPTQ (4-bit) on a full-precision fine-tuned model performed within 2-3% of the unquantized model on our eval set. QAT is theoretically better but adds complexity to your training pipeline and slows down iteration. I’d only bother with QAT if you’re pushing to very aggressive quantization (2-bit) or if that 2-3% quality gap matters for your use case.

Start simple: fine-tune in full precision (or bf16), quantize with GPTQ afterward, validate on your eval set. If the quality drop is acceptable, ship it. If not, then invest in QAT.
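That decision can be written down as a simple gate on your eval score. The sketch below assumes the "2-3%" figure is a relative drop and uses a hypothetical 3% threshold; pick whatever tolerance your product can live with.

```python
def relative_drop(baseline: float, quantized: float) -> float:
    """Relative quality loss of the quantized model versus the unquantized baseline."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - quantized) / baseline


def quantization_decision(baseline: float, quantized: float, max_drop: float = 0.03) -> str:
    """Return 'ship_ptq' if the drop is within tolerance, else 'consider_qat'."""
    if relative_drop(baseline, quantized) <= max_drop:
        return "ship_ptq"
    return "consider_qat"


if __name__ == "__main__":
    # Illustrative scores only, not measured results.
    print(quantization_decision(baseline=0.80, quantized=0.78))
```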