Building synthetic data pipelines for distilling task-specific models - what's your stack?

We’re at a point where distilling a smaller model from a frontier LLM for a specific task is often better than fine-tuning the small model on human-labeled data directly. The pattern is pretty well established now: use GPT-5, Claude, or DeepSeek as a teacher to generate thousands of high-quality examples, then train a 1-8B parameter student model that runs fast and cheap.

But the engineering around this is still kind of messy. I’ve been cobbling together a pipeline and keep running into the same questions:

Data generation at scale

I’m using a frontier model to generate ~50k labeled examples for a classification task. The naive approach is to loop through prompts and collect outputs, but that gets expensive fast and you end up with a lot of near-duplicate examples. Has anyone built a good diversity-sampling layer on top of this? I’ve been experimenting with clustering the generated examples and re-prompting specifically for underrepresented clusters, but it feels hacky.

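# Current approach (simplified)
# teacher_model, task, seed_prompts, parse_and_validate and deduplicate are defined elsewhere
examples = []
for seed in seed_prompts:
    response = teacher_model.generate(
        prompt=f"Generate a realistic example of {task} for: {seed}",
        temperature=0.9,
    )
    examples.append(parse_and_validate(response))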

# Then filter near-duplicates with embedding similarity
filtered = deduplicate(examples, threshold=0.92)
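
One way a deduplicate step like this could work is a greedy cosine-similarity filter. This is only a sketch under assumptions, not the actual helper used above: `embed` is a hypothetical callable that maps a string to a vector, and `toy_embed` (letter counts) is there purely to keep the example runnable. The pairwise comparison is quadratic, so at ~50k examples an approximate nearest-neighbor index would probably be needed instead.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def deduplicate(examples, embed, threshold=0.92):
    """Keep an example only if it is below `threshold` similarity to every kept one."""
    kept, kept_vecs = [], []
    for ex in examples:
        vec = embed(ex)
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(ex)
            kept_vecs.append(vec)
    return kept

def toy_embed(text):
    """Stand-in embedding (letter counts); replace with a real embedding model."""
    counts = Counter(text.lower())
    return [counts.get(ch, 0) for ch in "abcdefghijklmnopqrstuvwxyz"]

examples = ["refund my order", "refund my order!", "reset my password"]
filtered = deduplicate(examples, toy_embed)
```

Because the filter is greedy, the result depends on input order: the first example in each near-duplicate group is the one that survives.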

Quality filtering

The teacher model isn’t perfect, so maybe 10-15% of generated examples have subtle errors. I’ve been using a second LLM call as a verifier, but that doubles cost. Some teams I’ve talked to use rule-based validators for structured outputs, which works for some tasks but not all.
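
A possible way to cut verifier cost is to run free rule-based checks first and only spend an LLM call on examples that pass them. The sketch below assumes each example is a dict with `text` and `label` fields; `ALLOWED_LABELS` and the `llm_verify` callable are hypothetical placeholders. How much this saves depends on how many bad examples are structurally bad rather than subtly wrong.

```python
ALLOWED_LABELS = {"billing", "technical", "account"}  # hypothetical label set

def rule_check(example):
    """Cheap structural checks; returns a list of problems (empty means pass)."""
    problems = []
    text = example.get("text", "")
    if not isinstance(text, str) or len(text.strip()) < 10:
        problems.append("text missing or too short")
    label = example.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        problems.append("label not in allowed set")
    return problems

def filter_examples(examples, llm_verify):
    """Drop rule failures for free, then ask the verifier about the rest.

    `llm_verify` is a hypothetical callable returning True when the
    example looks correct. Returns (kept_examples, verifier_call_count).
    """
    kept, verifier_calls = [], 0
    for ex in examples:
        if rule_check(ex):
            continue  # failed a free check, no verifier call spent
        verifier_calls += 1
        if llm_verify(ex):
            kept.append(ex)
    return kept, verifier_calls
```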

Training the student

For the actual distillation step, are people mostly doing standard supervised fine-tuning on the synthetic data, or is anyone doing proper KD with logit matching? I’ve been using SFT with LoRA on Qwen 2.5 7B and getting decent results, but I wonder if I’m leaving accuracy on the table by not doing logit distillation.
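
For reference, the logit-matching objective usually meant by "proper KD" is a KL divergence between temperature-softened teacher and student distributions. The plain-Python sketch below shows it for a single token position; in a real training loop this would be a tensor op in your framework. It assumes you can get full teacher logits over the same vocabulary as the student, which hosted APIs typically do not expose (at most a few top log-probabilities), so in practice it mostly applies to open-weight teachers.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    # the T^2 factor keeps gradient magnitudes comparable across temperatures
    return kl * temperature ** 2
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions diverge; it is commonly mixed with the ordinary cross-entropy on the hard label.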

Versioning and reproducibility

This is the one that’s really bugging me. When your training data is generated by an API that can change under you, how do you version your datasets? I’m caching every API response with the model version string, but it feels fragile.
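
One way to make the cached responses less fragile is to store each one as a self-describing record and derive a content hash for the whole dataset, so a student checkpoint can be pinned to an exact fingerprint. This is a sketch with made-up field names (`model_version`, `params`, and so on), not a standard format.

```python
import hashlib
import json

def make_record(model_version, prompt, params, response_text):
    """Bundle everything needed to explain where one example came from."""
    return {
        "model_version": model_version,  # version string as reported by the API
        "prompt": prompt,
        "params": params,                # e.g. {"temperature": 0.9}
        "response": response_text,
    }

def record_id(record):
    """Stable content hash of a record (canonical JSON with sorted keys)."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def dataset_fingerprint(records):
    """Order-independent hash of a whole dataset, built from sorted record ids."""
    digest = hashlib.sha256()
    for rid in sorted(record_id(r) for r in records):
        digest.update(rid.encode("ascii"))
    return digest.hexdigest()
```

If the provider silently changes what a version string points to, the fingerprint of a regenerated dataset will differ even though the prompts are identical, which at least makes the drift visible.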

Would love to hear what stacks people are using for this. Especially interested in whether anyone’s tried the new reusable synthetic datasets like Nemotron-Synth or IBM’s SYNTH instead of generating from scratch.


Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!

From a DevOps/infra perspective, this is one of those problems where the pipeline orchestration matters way more than the individual scripts. I’ve built a few of these distillation pipelines and here’s what’s worked for us.

Orchestration layer

We ended up on Prefect 3 after trying both Airflow and plain bash scripts. The DAG looks roughly like this:

  1. Prompt generation (templatized, with diversity parameters)
  2. Parallel API calls to the teacher model (with rate limiting and retry logic)
  3. Quality filtering pass
  4. Deduplication via embedding similarity
  5. Dataset versioning and push to object storage
  6. Training job launch

Each step is a containerized task, so you can scale the API-calling step horizontally without touching anything else. We run steps 1-4 on regular compute and only spin up GPU instances for step 6.
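
Step 2 is the part most worth getting right. A minimal, orchestrator-agnostic sketch of bounded-concurrency calls with retries is below; it uses only asyncio and is not Prefect-specific. `call` is a hypothetical async function wrapping your teacher API client, and the concurrency cap is only a crude stand-in for a real rate limiter.

```python
import asyncio
import random

async def call_with_retry(call, prompt, semaphore, max_attempts=4, base_delay=0.5):
    """Run `call(prompt)` under a concurrency cap, retrying with exponential backoff."""
    async with semaphore:
        for attempt in range(max_attempts):
            try:
                return await call(prompt)
            except Exception:  # in practice, narrow this to rate-limit/timeout errors
                if attempt == max_attempts - 1:
                    raise
                # backoff: base, 2*base, 4*base, ... plus up to 10% jitter
                delay = base_delay * (2 ** attempt)
                await asyncio.sleep(delay * (1 + 0.1 * random.random()))

async def generate_all(call, prompts, max_concurrency=8, **retry_kwargs):
    """Fan out over prompts, keeping at most `max_concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [call_with_retry(call, p, semaphore, **retry_kwargs) for p in prompts]
    return await asyncio.gather(*tasks)  # results come back in prompt order
```

The slot is held while a failed call backs off, which lowers throughput a little but works as backpressure when the API is already pushing back.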

Handling the API costs

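The biggest win for us was building a caching layer with Redis. Before you send a prompt to the teacher model, hash it and check the cache. You’d be surprised how often separate pipeline runs end up generating the same prompts. Keep in mind that a plain hash only hits on exact matches, so near-identical prompts still miss unless you normalize them (whitespace, parameter ordering) before hashing. We cut our API spend by roughly 30% just from this.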
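
A rough sketch of the hash-then-lookup idea follows. It uses a plain dict-like `store` so it runs anywhere; with Redis you would swap the membership test and assignment for the client's get/set calls. The key prefix, the whitespace normalization, and the `generate` callable are all illustrative assumptions.

```python
import hashlib

def normalize(prompt):
    """Collapse whitespace so trivially different prompts share a key."""
    return " ".join(prompt.split())

def prompt_key(model_version, prompt):
    """Cache key that changes whenever the teacher version or prompt changes."""
    raw = f"{model_version}\n{normalize(prompt)}"
    return "teacher:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()

def cached_generate(store, model_version, prompt, generate):
    """Return (response, was_cached). `store` is any dict-like mapping."""
    key = prompt_key(model_version, prompt)
    if key in store:
        return store[key], True
    response = generate(prompt)  # only reached on a cache miss
    store[key] = response
    return response, False
```

One caveat: with a high sampling temperature, a cache hit hands back the same sample every time. If you want several distinct samples per prompt, include a sample index in the key.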

For the diversity problem you mentioned, we do something similar to your clustering approach but as a streaming filter. After every batch of ~1000 generated examples, we compute embeddings (we just use a small local embedding model for this), run HDBSCAN clustering, and then bias the next batch’s prompts toward underrepresented clusters. It’s not perfect but it keeps the distribution from collapsing.

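# simplified version of our diversity-aware batch generator
import hdbscan
import numpy as np

def get_underrepresented_clusters(embeddings, min_cluster_size=50):
    # embeddings: np.ndarray of shape (n_examples, dim)
    # cluster the batch; HDBSCAN labels noise points as -1
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(embeddings)
    cluster_counts = np.bincount(labels[labels >= 0])
    if cluster_counts.size == 0:
        return []  # everything was noise, nothing to rebalance
    median_count = np.median(cluster_counts)
    underrep = np.where(cluster_counts < median_count * 0.5)[0]
    # one representative embedding per underrepresented cluster
    return [embeddings[labels == c][0] for c in underrep]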
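
To show how cluster labels might feed back into the next batch, here is a small sketch that samples seed prompts with weight inversely proportional to cluster size, so every cluster gets equal total probability mass. It assumes the pipeline tracks which seed produced each example (`seeds[i]` for example i) alongside its cluster label; that bookkeeping is an assumption, not something described above.

```python
import random
from collections import Counter

def next_batch_seeds(seeds, labels, batch_size, rng=None):
    """Sample seeds for the next batch, favoring seeds from small clusters.

    `seeds[i]` is the seed prompt that produced example i and `labels[i]`
    is its cluster label (-1 means noise, as in HDBSCAN).
    """
    rng = rng or random.Random()
    counts = Counter(lab for lab in labels if lab >= 0)
    pairs = [(s, lab) for s, lab in zip(seeds, labels) if lab >= 0]
    if not pairs:
        # no clusters found: fall back to uniform sampling over all seeds
        return rng.choices(seeds, k=batch_size)
    # weight 1/count per example, so each cluster sums to the same total mass
    weights = [1.0 / counts[lab] for _, lab in pairs]
    return rng.choices([s for s, _ in pairs], weights=weights, k=batch_size)
```

With a 90/10 split between two clusters, this draws from each about half the time, which pulls the next batch toward the smaller one without ignoring the larger one entirely.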

Versioning and reproducibility

DVC has been solid for us here. Every pipeline run produces a versioned dataset artifact that includes the teacher model version, the prompt templates used, filtering thresholds, and the final training data. If a student model starts performing worse, you can diff the datasets between versions and usually spot the issue pretty quick.
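
The kind of diff that tends to be useful here can be sketched independently of DVC. The manifest layout below (a `config` dict plus a list of `example_ids`) is a hypothetical format for illustration, not something DVC produces.

```python
def diff_manifests(old, new):
    """Summarize what changed between two dataset manifests.

    Each manifest is assumed to be a dict with a "config" dict (teacher
    version, thresholds, template ids) and an "example_ids" list of hashes.
    """
    config_changes = {
        key: (old["config"].get(key), new["config"].get(key))
        for key in sorted(set(old["config"]) | set(new["config"]))
        if old["config"].get(key) != new["config"].get(key)
    }
    old_ids, new_ids = set(old["example_ids"]), set(new["example_ids"])
    return {
        "config_changes": config_changes,  # key -> (old value, new value)
        "added": len(new_ids - old_ids),
        "removed": len(old_ids - new_ids),
    }
```

A changed teacher version with a large added/removed count is usually the first thing to check when a student regresses.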

Training infra

For the actual fine-tuning step, we use Modal for on-demand GPU access. You define your training function as a Modal app, point it at the versioned dataset, and it spins up the GPU, runs training, pushes the checkpoint to S3, and shuts down. No idle GPU costs. The cold start is like 30-60 seconds which is totally fine for training jobs.

# modal training entrypoint (simplified)
import modal

app = modal.App("distillation-training")

@app.function(gpu="A100", timeout=3600)
def train_student(dataset_path: str, config: dict):
    # pull dataset from S3, run training, push checkpoint back
    ...

The whole pipeline from “I want to distill a new task” to “student model is deployed” takes us about 4-6 hours, most of which is the API calls to the teacher model. The infra setup took a while to get right but now spinning up a new distillation task is basically just writing new prompt templates.