We ship an AI-powered support agent that handles customer questions against our knowledge base. Every time we update the system prompt, swap models, or change our retrieval pipeline, there’s this anxiety moment of “did we just break something?” We’ve been doing manual vibes-based testing and it doesn’t scale.
I’ve been looking at eval frameworks to integrate into our CI pipeline. DeepEval looks solid since it has 50+ metrics out of the box and you can run it with pytest, which fits our existing test infrastructure. The LLM-as-judge approach for metrics like answer relevancy and faithfulness seems way more practical than trying to write regex-based checks.
But I have a bunch of open questions:
- Golden dataset size: How many test cases do you actually need before your evals are meaningful? We have maybe 200 real customer questions with verified good answers. Is that enough or should we be generating synthetic test cases too?
- Metric selection: DeepEval has a ton of metrics. For a RAG-based support agent, which ones actually catch real regressions? I’m thinking answer relevancy, faithfulness, and hallucination at minimum. But is G-Eval worth the extra LLM calls?
- CI integration: Are you blocking merges on eval scores or just tracking trends? Blocking feels risky since LLM-as-judge scores have variance. But just tracking means regressions could slip through.
- Cost management: Running evals means making LLM calls for every PR. With 200+ test cases and multiple metrics, that’s a lot of API spend. Anyone got strategies for keeping this reasonable?
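To make the cost question concrete, here is a rough idea I've been toying with (a sketch, not something we run today): evaluate a fixed, deterministic slice of the golden set on each PR, run the full set on a schedule, and cache judge scores keyed on everything that affects them. The function names `select_subset` and `cache_key`, the salt, and the 25% fraction are all made up for illustration and have nothing to do with DeepEval's own API.

```python
import hashlib
import json


def stable_hash(text: str) -> int:
    """Hash that is stable across processes (unlike the built-in hash())."""
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)


def select_subset(case_ids: list[str], fraction: float, salt: str = "pr-evals") -> list[str]:
    """Pick a deterministic slice of roughly `fraction` of the cases.

    The same IDs are chosen on every run, so PR-to-PR scores stay comparable.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    buckets = 10_000
    cutoff = int(fraction * buckets)
    return [cid for cid in case_ids if stable_hash(f"{salt}:{cid}") % buckets < cutoff]


def cache_key(model: str, prompt_version: str, question: str, answer: str, metric: str) -> str:
    """Key for a judge-score cache: identical inputs map to the same key,
    so unchanged answers never need to be re-scored."""
    payload = json.dumps([model, prompt_version, question, answer, metric], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical usage: ~25% of 200 golden cases on each PR.
golden_ids = [f"case-{i}" for i in range(200)]
pr_cases = select_subset(golden_ids, fraction=0.25)
```

The catch I can see is that a fixed slice can hide regressions in the cases it never touches, so this probably only makes sense alongside a full nightly run.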
Would love to hear what setups are actually working for people in production. Especially interested in how you handle the nondeterminism problem, since the same eval can give slightly different scores on each run.
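For the nondeterminism part, this is the kind of gating rule I could imagine using, assuming we run the same eval a few times on both the main branch and the PR branch: only block when the mean score drops by more than a minimum amount and by more than the run-to-run noise. It's a minimal stdlib sketch; `should_block`, the 0.03 minimum drop, and the `z=2.0` multiplier are placeholder choices, not recommendations.

```python
import statistics


def should_block(baseline_scores, candidate_scores, min_drop=0.03, z=2.0):
    """Return True only if the candidate's mean score dropped by more than
    both `min_drop` and `z` standard errors of the difference in means."""
    if len(baseline_scores) < 2 or len(candidate_scores) < 2:
        raise ValueError("need at least two runs per side to estimate variance")

    base_mean = statistics.mean(baseline_scores)
    cand_mean = statistics.mean(candidate_scores)

    # Standard error of the difference between the two sample means.
    se = (
        statistics.variance(baseline_scores) / len(baseline_scores)
        + statistics.variance(candidate_scores) / len(candidate_scores)
    ) ** 0.5

    drop = base_mean - cand_mean
    return drop > max(min_drop, z * se)


# Hypothetical faithfulness scores from four repeated runs per branch.
main_runs = [0.82, 0.80, 0.81, 0.83]
pr_runs = [0.70, 0.68, 0.71, 0.69]
block_merge = should_block(main_runs, pr_runs)
```

The obvious downside is that repeated runs multiply the API spend, so this pulls against the cost question above. Curious whether people find the extra runs worth it.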
Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!