Running LLM evals in CI - how are you catching regressions before they hit production?

We ship an AI-powered support agent that handles customer questions against our knowledge base. Every time we update the system prompt, swap models, or change our retrieval pipeline, there’s this anxiety moment of “did we just break something?” We’ve been doing manual vibes-based testing and it doesn’t scale.

I’ve been looking at eval frameworks to integrate into our CI pipeline. DeepEval looks solid since it has 50+ metrics out of the box and you can run it with pytest, which fits our existing test infrastructure. The LLM-as-judge approach for metrics like answer relevancy and faithfulness seems way more practical than trying to write regex-based checks.

But I have a bunch of open questions:

  1. Golden dataset size: How many test cases do you actually need before your evals are meaningful? We have maybe 200 real customer questions with verified good answers. Is that enough or should we be generating synthetic test cases too?

  2. Metric selection: DeepEval has a ton of metrics. For a RAG-based support agent, which ones actually catch real regressions? I’m thinking answer relevancy, faithfulness, and hallucination at minimum. But is G-Eval worth the extra LLM calls?

  3. CI integration: Are you blocking merges on eval scores or just tracking trends? Blocking feels risky since LLM-as-judge scores have variance. But just tracking means regressions could slip through.

  4. Cost management: Running evals means making LLM calls for every PR. With 200+ test cases and multiple metrics, that’s a lot of API spend. Anyone got strategies for keeping this reasonable?

Would love to hear what setups are actually working for people in production. Especially interested in how you handle the nondeterminism problem, since the same eval can give slightly different scores on each run.


Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!

We run evals in CI for a RAG pipeline that powers internal search, so pretty similar use case. Here’s what’s working for us after a few iterations.

Golden dataset size: 200 verified Q&A pairs is a decent start, but you’ll want to stratify them. We have about 300 total, broken into categories: simple factual (40%), multi-step reasoning (30%), edge cases where the answer isn’t in the knowledge base (20%), and adversarial/tricky phrasing (10%). The edge cases are the most valuable since those are the ones that catch regressions where retrieval silently degrades.
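
To make the stratification concrete, here is a minimal sketch of checking a dataset's category split and drawing a subset that keeps the same proportions. The category keys, the `category` field on each case, and the helper names are assumptions for illustration, not part of any eval framework.

```python
import random
from collections import Counter, defaultdict

# Target shares mirror the split described above (keys are hypothetical labels).
TARGET_SHARES = {
    "simple_factual": 0.40,
    "multi_step": 0.30,
    "not_in_kb": 0.20,
    "adversarial": 0.10,
}


def category_shares(cases):
    """Return the fraction of cases that fall into each category."""
    counts = Counter(case["category"] for case in cases)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def stratified_sample(cases, size, shares=TARGET_SHARES, seed=0):
    """Draw roughly `size` cases while keeping the target category shares."""
    rng = random.Random(seed)  # fixed seed keeps the subset stable across runs
    by_category = defaultdict(list)
    for case in cases:
        by_category[case["category"]].append(case)

    sample = []
    for category, share in shares.items():
        wanted = round(size * share)
        pool = by_category.get(category, [])
        # Never ask for more cases than the category actually has.
        sample.extend(rng.sample(pool, min(wanted, len(pool))))
    return sample
```

Running `category_shares` on the golden set before each release is a cheap way to notice when one category (usually the edge cases) has quietly shrunk.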

For synthetic generation, we use it but carefully. We generate synthetic variants of our real questions (paraphrases, different phrasings) to bulk up the dataset, but the ground truth answers always come from human-verified originals. Synthetic-on-synthetic gets circular fast.
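
One way to enforce the "synthetic questions, human-verified answers" rule is to encode it in the data model. This is a rough sketch; `GoldenCase` and `make_variants` are made-up names, and the paraphrases themselves would come from whatever generator you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    question: str
    expected_answer: str  # always taken from a human-verified original
    source_id: str        # id of that human-verified original
    synthetic: bool = False


def make_variants(original, paraphrases):
    """Wrap paraphrased questions as synthetic cases that reuse the verified answer."""
    if original.synthetic:
        # Deriving from a synthetic case is the circular setup to avoid.
        raise ValueError("variants must be derived from a human-verified case")
    return [
        GoldenCase(
            question=paraphrase,
            expected_answer=original.expected_answer,
            source_id=original.source_id,
            synthetic=True,
        )
        for paraphrase in paraphrases
    ]
```

Keeping `source_id` on every variant also makes it easy to drop all variants of a case when its verified answer changes.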

Metrics that actually matter: For a support agent RAG setup, here’s what I’d prioritize:

# Our eval config (simplified)
metrics:
  # These block the merge
  critical:
    - answer_relevancy    # Does the answer address the question?
    - faithfulness         # Is the answer grounded in retrieved context?
    - hallucination        # Is it making stuff up?
  
  # These get tracked but don't block
  monitored:
    - contextual_precision # Are the right docs being retrieved?
    - contextual_recall    # Are we missing relevant docs?
    - g_eval_coherence     # Overall quality score

G-Eval is worth it as a trend metric but too noisy to gate on. The variance between runs can swing 5-10 points, so blocking merges on it would drive your team crazy.

CI integration strategy: We do a tiered approach. On every PR, we run a “smoke” eval with 50 hand-picked critical test cases and the three critical metrics. Takes about 3 minutes and costs maybe $2 in API calls. If that passes, we run the full suite (all 300 cases, all metrics) as a non-blocking check. The full run takes about 15 minutes.

For the blocking threshold, we require that critical metrics don’t drop more than 5 points (0.05 on the 0-1 scale) below the baseline. The change relative to the baseline is a more useful signal than the absolute score because of the nondeterminism.

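# In our CI script
from collections import defaultdict
from statistics import mean

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# SMOKE_CASES, load_baseline() and RegressionError are project helpers
# that are not shown here.

def run_smoke_eval():
    metrics = [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
    ]
    results = evaluate(test_cases=SMOKE_CASES, metrics=metrics)

    # Average each metric across test cases. Field names assume a recent
    # DeepEval release (test_results[].metrics_data[].name / .score).
    per_metric = defaultdict(list)
    for test_result in results.test_results:
        for data in test_result.metrics_data:
            if data.score is not None:
                per_metric[data.name].append(data.score)

    # Compare against stored baseline
    baseline = load_baseline("main")
    for metric_name, scores in per_metric.items():
        delta = mean(scores) - baseline[metric_name]
        if delta < -0.05:  # more than 0.05 below the baseline
            raise RegressionError(f"{metric_name} dropped {abs(delta):.2f} vs. baseline")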

On cost: The trick is running evals against a cheaper model (like GPT-4o mini or Claude Haiku) as the judge for most metrics, and only using a frontier model for G-Eval where quality matters more. Also, cache your embeddings and retrieved contexts between metric evaluations so you’re not re-running retrieval for every metric.
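
As a rough illustration of the caching point, the sketch below retrieves context once per question and reuses it for every metric. `ContextCache` and the `retrieve_fn` callable are hypothetical stand-ins for your own retrieval pipeline, not part of DeepEval.

```python
class ContextCache:
    """Memoize retrieved contexts so each question hits retrieval only once."""

    def __init__(self, retrieve_fn):
        self._retrieve = retrieve_fn  # hypothetical: question -> list of context strings
        self._store = {}
        self.misses = 0  # number of real retrieval calls made so far

    def get(self, question):
        if question not in self._store:
            self.misses += 1
            self._store[question] = list(self._retrieve(question))
        return self._store[question]


def build_metric_inputs(cache, question, metric_names):
    """Build one input record per metric, sharing a single retrieval result."""
    context = cache.get(question)
    return [
        {"metric": name, "question": question, "context": context}
        for name in metric_names
    ]
```

The `misses` counter is only there so you can log how many retrieval calls a run actually made.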

Nondeterminism: Run each eval 3 times and take the median. It adds cost but smooths out the noise significantly. We also store rolling averages of eval scores so we can detect gradual drift, not just sudden drops.
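
A minimal sketch of both ideas, assuming `run_eval` is any zero-argument callable that returns a single score. The window size of 20 is an arbitrary placeholder.

```python
from collections import deque
from statistics import mean, median


def median_score(run_eval, runs=3):
    """Call run_eval() `runs` times and return the median score."""
    return median(run_eval() for _ in range(runs))


class RollingAverage:
    """Track the mean of the most recent `window` scores to spot gradual drift."""

    def __init__(self, window=20):
        self._scores = deque(maxlen=window)  # old scores fall off automatically

    def add(self, score):
        """Record a new score and return the updated rolling mean."""
        self._scores.append(score)
        return mean(self._scores)
```

The median is used rather than the mean so that one outlier run out of three cannot move the reported score.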

Coming at this from the infra side since we run eval suites in CI for a couple of teams’ AI features. A few things I’ve learned the hard way about making this actually work in a pipeline.

Separate your eval jobs from your unit test jobs. This sounds obvious but I’ve seen teams jam LLM evals into their existing pytest stage and then wonder why their CI takes 20 minutes. We run evals as a parallel job in GitHub Actions that doesn’t block the main build. The PR gets a status check that updates async, so devs can keep working while evals run.

# .github/workflows/eval.yml
name: LLM Evals
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/ai/**'
      - 'eval/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r eval/requirements.txt
      - run: python -m pytest eval/ --tb=short -q
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          EVAL_CONCURRENCY: 10
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-results
          path: eval/results/

Path filtering is crucial. You don’t want evals running on every commit that touches a README. We trigger eval jobs only when prompt templates, AI-related source code, or the eval suite itself changes. Saves a ton of API spend.

Cache your embeddings and static fixtures. If your eval dataset includes pre-computed embeddings or reference outputs, cache them between runs. We use actions/cache keyed on the hash of the golden dataset file. Cuts about 40% off our eval runtime since we’re not re-embedding the same 300 test cases every run.
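
In a workflow file, the `hashFiles()` expression can produce this kind of key for actions/cache. The same idea in Python, for example for a local embedding cache, might look like the sketch below. The function name and the `eval-embeddings` prefix are assumptions.

```python
import hashlib
from pathlib import Path


def dataset_cache_key(path, prefix="eval-embeddings"):
    """Derive a cache key from the golden dataset's contents.

    Any edit to the file changes the digest, so stale embeddings
    are never reused for a modified dataset.
    """
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return f"{prefix}-{digest[:16]}"
```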

Set hard timeouts and cost caps. LLM API calls can hang or spiral if something goes wrong with your prompts. We set a 15-minute job timeout and also have a wrapper that tracks token usage per run. If a single eval run exceeds a dollar threshold, it fails fast. You really don’t want a bad prompt template burning through your API budget on a Friday afternoon.
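
The cost cap can be as simple as the sketch below: record token usage after each API call and raise once the running total crosses a limit. The class names are made up, and the per-token prices are placeholders you would replace with your provider's actual rates.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an eval run spends more than its allowed budget."""


class CostTracker:
    def __init__(self, max_usd, usd_per_1k_input, usd_per_1k_output):
        self.max_usd = max_usd
        self.usd_per_1k_input = usd_per_1k_input    # placeholder price
        self.usd_per_1k_output = usd_per_1k_output  # placeholder price
        self.spent_usd = 0.0

    def record(self, input_tokens, output_tokens):
        """Add one call's usage to the total and fail fast if over budget."""
        self.spent_usd += (
            input_tokens / 1000 * self.usd_per_1k_input
            + output_tokens / 1000 * self.usd_per_1k_output
        )
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f}, cap is ${self.max_usd:.2f}"
            )
        return self.spent_usd
```

Call `record()` after every judge or model call with the token counts from the API response, and let the exception fail the CI job.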

For the flakiness problem, we addressed it by running each eval case twice and taking the worse score. If there’s a big delta between runs on the same input, that test case gets flagged for review rather than auto-passing or auto-failing. It adds cost but it caught a few regressions that single-run evals missed.
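
Roughly, the double-run logic looks like this. It assumes higher scores are better and that `run_case` is a callable returning one score per call. The `max_delta` default of 0.15 is an arbitrary placeholder.

```python
def score_case_twice(run_case, case, max_delta=0.15):
    """Run one eval case twice, keep the worse score, and flag unstable cases."""
    first = run_case(case)
    second = run_case(case)
    delta = abs(first - second)
    return {
        "score": min(first, second),        # pessimistic: the worse of the two runs
        "delta": delta,
        "needs_review": delta > max_delta,  # flag instead of auto-pass or auto-fail
    }
```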

One thing I’d add to alex_ml’s point about tracking metrics over time: pipe your eval results into something queryable. We dump ours to a Postgres table and have a simple Grafana dashboard that shows score trends per metric across PRs. Makes it way easier to spot slow degradation than just looking at pass/fail on individual PRs.
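
For a sense of how little is needed, here is a sketch of the storage side. It uses the standard-library `sqlite3` module as a stand-in for Postgres so the example is self-contained. The table and column names are assumptions, not the actual schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS eval_scores (
    pr_number   INTEGER,
    commit_sha  TEXT,
    metric      TEXT,
    score       REAL,
    recorded_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def record_scores(conn, pr_number, commit_sha, scores):
    """Insert one row per metric for a finished eval run."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO eval_scores (pr_number, commit_sha, metric, score) "
        "VALUES (?, ?, ?, ?)",
        [(pr_number, commit_sha, metric, score) for metric, score in scores.items()],
    )
    conn.commit()


def metric_trend(conn, metric):
    """Return (pr_number, average score) pairs for one metric, oldest PR first."""
    rows = conn.execute(
        "SELECT pr_number, AVG(score) FROM eval_scores "
        "WHERE metric = ? GROUP BY pr_number ORDER BY pr_number",
        (metric,),
    )
    return rows.fetchall()
```

A dashboard panel per metric can then be driven by a query like the one in `metric_trend`.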