We’re building a customer support agent that handles multi-step workflows: reading tickets, pulling order data, checking inventory, drafting responses, and sometimes issuing refunds. It’s working surprisingly well in dev, but I’m terrified of deploying it without proper observability and eval coverage. Agent failure modes are way harder to catch than those of a single API call.
I’ve been evaluating three observability/eval platforms and I can’t decide which fits our stack best:
Langfuse (open-source, self-hosted)
- We like that it’s MIT-licensed and we can self-host it. Our compliance team gets nervous about sending customer interaction traces to third-party SaaS.
- Multi-turn trace support looks solid, and the LLM-as-judge evaluation feature means we could automate scoring without building our own eval harness.
- But I’m worried about operational overhead. Running another stateful service in our infra (Postgres + ClickHouse, plus Redis and blob storage on v3 if I’m reading the self-hosting docs right) adds maintenance burden. Anyone running Langfuse in production at scale? How are storage growth and query performance?
Braintrust
- The “Loop” feature is compelling. You describe what a good response looks like in plain English and it generates custom scoring functions. That could dramatically speed up our eval iteration cycle.
- The trace-to-test-case pipeline is exactly what we need. When something goes wrong in prod, we want to automatically turn that trace into a regression test.
- But it’s SaaS-only and we’d be sending customer interaction data to their servers. Has anyone navigated the compliance angle here?
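To make the trace-to-test-case idea concrete, here’s a rough sketch of what I have in mind, independent of any platform. The trace shape (`id`, `input`, `steps`) and the `RegressionCase` fields are made up for illustration; this is not Braintrust’s actual export format or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RegressionCase:
    case_id: str
    input_ticket: str
    expected_tools: List[str]  # tool names a reviewer says should be called, in order
    expected_outcome: str      # reviewer-corrected final action, e.g. "refund_denied"


def trace_to_case(
    trace: dict,
    corrected_outcome: str,
    corrected_tools: Optional[List[str]] = None,
) -> RegressionCase:
    """Turn one bad production trace into a regression case.

    Assumes a hypothetical trace shape:
    {"id": str, "input": str, "steps": [{"type": "llm" | "tool", "name": str}, ...]}
    """
    # Tool calls the agent actually made, in order.
    observed_tools = [s["name"] for s in trace["steps"] if s["type"] == "tool"]
    return RegressionCase(
        case_id=f"prod-{trace['id']}",
        input_ticket=trace["input"],
        # If the reviewer only corrected the outcome, keep the observed tool path.
        expected_tools=corrected_tools if corrected_tools is not None else observed_tools,
        expected_outcome=corrected_outcome,
    )


# Example with invented data: the agent refunded when it should not have.
example_trace = {
    "id": "abc",
    "input": "Where is my order?",
    "steps": [
        {"type": "llm", "name": "plan"},
        {"type": "tool", "name": "get_order"},
        {"type": "tool", "name": "issue_refund"},
    ],
}
example_case = trace_to_case(example_trace, "refund_denied", ["get_order", "check_inventory"])
```

The part I can’t automate is the human correction step (`corrected_outcome`), which is why a platform feature for this is attractive.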
Arize Phoenix (open-source)
- The clustering and drift detection features seem useful for catching when agent behavior shifts subtly over time, like if a model update causes it to become more aggressive about issuing refunds.
- Hierarchical tracing for multi-step agent workflows sounds like the right abstraction for our use case.
- But I’ve heard mixed things about the learning curve and docs quality.
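For the refund example specifically, the kind of check I’d want is something like the sketch below. To be clear, this is not how Phoenix does drift detection; it’s a plain two-proportion z-score comparing the refund rate in a recent window against a baseline window, and the function name and the counts are hypothetical.

```python
import math


def refund_rate_shift(base_refunds: int, base_total: int,
                      recent_refunds: int, recent_total: int) -> float:
    """Two-proportion z-score: recent refund rate vs. baseline refund rate.

    Positive values mean the agent is refunding more often than before.
    """
    p_base = base_refunds / base_total
    p_recent = recent_refunds / recent_total
    # Pooled rate under the assumption that nothing changed.
    pooled = (base_refunds + recent_refunds) / (base_total + recent_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / recent_total))
    if se == 0:
        return 0.0  # no refunds (or only refunds) in both windows
    return (p_recent - p_base) / se


# Invented numbers: 5% refund rate before a model update, 9% after.
z = refund_rate_shift(50, 1000, 90, 1000)
```

A |z| above roughly 3 would be worth a look, but that cutoff is a judgment call and I’d expect to tune it.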
Beyond the platform choice, I have some architectural questions:
- What do you actually trace? For a multi-step agent, do you instrument every LLM call, every tool invocation, every decision branch? Or is that too noisy, and you focus on key checkpoints? We’re using LangGraph for orchestration, so there are natural node boundaries, but even a simple refund flow has 8-10 steps.
- Trajectory vs outcome evals: Anthropic’s recent engineering blog on agent evals argues you should measure both the intermediate steps (did the agent look up the right order?) and the final outcome (did the customer get the right answer?). In practice, how do you define “right” for intermediate steps without hand-labeling thousands of traces?
- Regression testing cadence: When you update your prompts or swap models, how many eval cases do you need to feel confident? We have about 200 hand-labeled golden examples. Is that enough, or should we be investing in building a bigger test set?
- Alerting on quality drops: Has anyone set up real-time alerting based on eval scores? Like, if the rolling average of “response helpfulness” (scored by an LLM judge) drops below a threshold, page the on-call? That feels like the right end state, but I’m not sure how reliable LLM-as-judge scores are for this.
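For the alerting question, this is roughly the logic I’m picturing, as a minimal sketch. The class name, window size, threshold, and minimum sample count are all placeholders, and it assumes the judge emits a score between 0 and 1 per response.

```python
from collections import deque


class RollingScoreAlert:
    """Fire when the rolling mean of judge scores drops below a threshold."""

    def __init__(self, window: int = 50, threshold: float = 0.7, min_samples: int = 20):
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically
        self.threshold = threshold
        self.min_samples = min_samples      # avoid paging on the first few noisy scores

    def add(self, score: float) -> bool:
        """Record one judge score; return True if the alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False
        return sum(self.scores) / len(self.scores) < self.threshold


# Tiny window just to show the behavior.
alert = RollingScoreAlert(window=4, threshold=0.7, min_samples=2)
results = [alert.add(s) for s in (0.9, 0.9, 0.2)]
```

The `min_samples` guard is there because a single bad judge score right after a deploy shouldn’t page anyone; whether a rolling mean is stable enough given judge noise is exactly what I’m unsure about.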
We’re on LangGraph + Claude API, deployed on AWS ECS. Would love to hear how others have set up their agent observability stack.
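For reference, the checkpoint-level tracing I’m leaning toward looks roughly like this. It’s a plain-Python sketch that assumes each node is a function taking a state dict and returning a partial update; `traced`, `SPANS`, and `lookup_order` are hypothetical names, and a real setup would export spans to whichever platform we pick instead of an in-memory list.

```python
import functools
import time

SPANS = []  # hypothetical in-memory sink standing in for a tracing backend


def traced(node_name):
    """Wrap a node function so each call records one span (name, status, duration)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(state):
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(state)
            except Exception:
                status = "error"
                raise  # keep the original failure visible to the orchestrator
            finally:
                SPANS.append({
                    "node": node_name,
                    "status": status,
                    "duration_s": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@traced("lookup_order")
def lookup_order(state):
    # Hypothetical node: a real one would call the order service.
    return {"order": {"id": state["order_id"], "status": "shipped"}}


update = lookup_order({"order_id": "A1"})
```

This gives one span per node boundary, which for an 8-10 step refund flow is 8-10 spans per run; whether to also record every LLM call inside a node is the part I’m undecided on.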
Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!