Has anyone tried fine-tuning the DeepSeek-R1 distilled models (the 32B or 70B variants) for domain-specific tasks? I’m trying to figure out the right tooling for our workflow and I’m hitting some practical questions.
We have a product that does automated code review for security vulnerabilities, and we’ve been using Claude via API for the heavy lifting. Works great, but costs are climbing fast as we scale. The idea is to fine-tune an open-source model on our labeled dataset of ~50k code review examples so we can self-host inference.
I’ve been looking at three fine-tuning frameworks and I’m not sure which fits our situation:
- Axolotl seems like the production-grade option. It has quantization-aware training (QAT), sequence parallelism for multi-GPU setups, and supposedly stable GRPO reasoning training. We have a 4xA100 cluster we could dedicate to this.
- Unsloth just launched Unsloth Studio (March 17) with a no-code UI, and they claim 2x faster training with 70% less VRAM. That’s compelling for iteration speed, but I’m not sure it scales to our dataset size or multi-GPU.
- TorchTune keeps coming up for enterprise-scale stuff but the docs are thinner and I haven’t found as many real-world reports.
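For sizing against the 4xA100 cluster, here is a back-of-envelope sketch of weight memory alone. It assumes 80 GB cards (the 40 GB variant would halve the budget) and nominal parameter counts of 32B and 70B; `weight_gb` and `weights_fit` are hypothetical helper names, and the numbers ignore activations, gradients, optimizer state, and KV cache, which add a lot on top.

```python
# Back-of-envelope weight memory only. Activations, gradients,
# optimizer state, and KV cache are NOT included and add substantially.


def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


def weights_fit(params_billion: float, bits_per_param: int,
                n_gpus: int = 4, gpu_gb: float = 80.0) -> bool:
    """True if the weights fit in total GPU memory (assumed: 80 GB cards)."""
    return weight_gb(params_billion, bits_per_param) <= n_gpus * gpu_gb


if __name__ == "__main__":
    for size in (32, 70):        # nominal parameter counts, in billions
        for bits in (16, 4):     # bf16 weights vs 4-bit quantized weights
            gb = weight_gb(size, bits)
            print(f"{size}B @ {bits}-bit: {gb:.0f} GB of weights, "
                  f"fits on 4 GPUs: {weights_fit(size, bits)}")
```

Fitting the weights is a necessary condition, not a sufficient one. Whether a given LoRA or full fine-tuning run actually fits depends on sequence length, batch size, and the framework's sharding strategy.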
Some specific things I’m trying to figure out:
- DeepSeek 32B vs 70B for code tasks: The 32B distilled variant supposedly retains most of the reasoning quality of the full 671B model on code benchmarks. Has anyone actually validated this on real code review workloads? The synthetic benchmark scores don’t always translate.
- QAT vs post-training quantization: Axolotl’s QAT support sounds great in theory. You train with quantization simulated so the weights adapt to it, and you get better quality than quantizing after the fact. But is the quality difference actually meaningful for code tasks, or is standard GPTQ/AWQ after full-precision fine-tuning close enough?
- LoRA rank selection: For code-heavy fine-tuning, are people going higher rank (64-128) than typical? My intuition says code understanding might need more adapter capacity than, say, tone/style adaptation, but I haven’t seen benchmarks backing that up.
- Eval during training: What metrics are you using beyond loss curves? We’re thinking about running a held-out set of real PRs through the model at checkpoints and scoring against our human reviewer labels, but that’s slow. Any tricks for faster proxy metrics that correlate with actual code review quality?
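To make the checkpoint-scoring idea concrete, here is a minimal sketch of the kind of score I have in mind: precision, recall, and F1 of the model's flagged findings against the human reviewer labels. The `(pr_id, vulnerability_category)` schema and the `finding_scores` name are assumptions for illustration, not our actual label format, and whether this score tracks real review quality is exactly the open question.

```python
from typing import Set, Tuple

# Hypothetical schema: one finding = (pr_id, vulnerability_category).
Finding = Tuple[str, str]


def finding_scores(predicted: Set[Finding],
                   labeled: Set[Finding]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact-match findings."""
    true_positives = len(predicted & labeled)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labeled) if labeled else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    model_findings = {("pr1", "sqli"), ("pr1", "xss"), ("pr2", "path-traversal")}
    reviewer_labels = {("pr1", "sqli"), ("pr2", "path-traversal"), ("pr2", "ssrf")}
    p, r, f1 = finding_scores(model_findings, reviewer_labels)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Running this on a small fixed subset of the held-out PRs at every checkpoint, and the full set only occasionally, would be one way to trade accuracy for speed.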
Our current plan is to start with Axolotl + LoRA on the 32B distilled model, validate quality, then decide if we need to scale to the 70B. But I’d love to hear from anyone who’s been through this process recently.
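On the rank question, a quick way to see what higher rank costs is to count adapter parameters. LoRA adds `r * (d_in + d_out)` parameters per adapted weight matrix. The sketch below assumes adapters on the four attention projections only, and the dimensions in `ModelDims` are illustrative values for a 32B-class model with grouped-query attention; read the real ones from the model's `config.json` before trusting the totals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelDims:
    """Assumed, illustrative dimensions; check the model's config.json."""
    hidden: int = 5120   # hidden size (input/output of q and o projections)
    kv_dim: int = 1024   # output size of k and v projections under GQA
    layers: int = 64     # number of transformer layers


def lora_params(rank: int, dims: ModelDims = ModelDims()) -> int:
    """Trainable LoRA parameters when adapting q, k, v, and o projections."""
    shapes = [
        (dims.hidden, dims.hidden),  # q_proj
        (dims.hidden, dims.kv_dim),  # k_proj
        (dims.hidden, dims.kv_dim),  # v_proj
        (dims.hidden, dims.hidden),  # o_proj
    ]
    # Each adapter is A (rank x d_in) plus B (d_out x rank).
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * dims.layers


if __name__ == "__main__":
    for rank in (16, 64, 128):
        print(f"rank {rank}: {lora_params(rank) / 1e6:.1f}M trainable parameters")
```

Parameter count grows linearly with rank, so going from 16 to 128 is an 8x larger adapter. That says nothing about whether the extra capacity helps on code review, which still needs an empirical comparison.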
Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!