Agent observability and eval in production - Langfuse vs Braintrust vs Arize Phoenix?

We’re building a customer support agent that handles multi-step workflows: reading tickets, pulling order data, checking inventory, drafting responses, and sometimes issuing refunds. It’s working surprisingly well in dev, but I’m terrified of deploying it without proper observability and eval coverage. The failure modes for agents are way harder to catch than a simple API call.

I’ve been evaluating three observability/eval platforms and I can’t decide which fits our stack best:

Langfuse (open-source, self-hosted)

  • We like that it’s MIT-licensed and we can self-host it. Our compliance team gets nervous about sending customer interaction traces to third-party SaaS.
  • Multi-turn trace support looks solid, and the LLM-as-judge evaluation feature means we could automate scoring without building our own eval harness.
  • But I’m worried about operational overhead. Running another stateful service (Postgres + ClickHouse) in our infra adds maintenance burden. Anyone running Langfuse in production at scale? How’s the storage growth and query performance?

Braintrust

  • The “Loop” feature is compelling. You describe what a good response looks like in plain English and it generates custom scoring functions. That could dramatically speed up our eval iteration cycle.
  • The trace-to-test-case pipeline is exactly what we need. When something goes wrong in prod, we want to automatically turn that trace into a regression test.
  • But it’s SaaS-only and we’d be sending customer interaction data to their servers. Has anyone navigated the compliance angle here?

Arize Phoenix (open-source)

  • The clustering and drift detection features seem useful for catching when agent behavior shifts subtly over time, like if a model update causes it to become more aggressive about issuing refunds.
  • Hierarchical tracing for multi-step agent workflows sounds like the right abstraction for our use case.
  • But I’ve heard mixed things about the learning curve and docs quality.

Beyond the platform choice, I have some architectural questions:

  1. What do you actually trace? For a multi-step agent, do you instrument every LLM call, every tool invocation, every decision branch? Or is that too noisy, and you focus on key checkpoints? We’re using LangGraph for orchestration, so there are natural node boundaries, but even a simple refund flow has 8-10 steps.

  2. Trajectory vs outcome evals: Anthropic’s recent engineering blog on agent evals argues you should measure both the intermediate steps (did the agent look up the right order?) and the final outcome (did the customer get the right answer?). In practice, how do you define “right” for intermediate steps without hand-labeling thousands of traces?

  3. Regression testing cadence: When you update your prompts or swap models, how many eval cases do you need to feel confident? We have about 200 hand-labeled golden examples. Is that enough, or should we be investing in building a bigger test set?

  4. Alerting on quality drops: Has anyone set up real-time alerting based on eval scores? Like, if the rolling average of “response helpfulness” (scored by an LLM judge) drops below a threshold, page the on-call? That feels like the right end state, but I’m not sure how reliable LLM-as-judge scores are for this.

We’re on LangGraph + Claude API, deployed on AWS ECS. Would love to hear how others have set up their agent observability stack.


Seed content posted by the DevForums team to help get our community started.

I’ve been running Langfuse in production for about four months now for a similar multi-step agent setup (ours does document processing rather than support tickets, but same idea), so I can share some operational reality.

On the self-hosting question

Langfuse’s stack is Postgres + ClickHouse + Redis + S3-compatible storage. If you’re already on AWS ECS, the ClickHouse piece is the one that’ll bite you. It’s not hard to run, but it’s another stateful service that needs monitoring, backups, and capacity planning. Our trace volume is around 50k/day, and ClickHouse storage is growing at roughly 2 GB/week with default retention. Query performance is solid for recent data, but we had to tune ClickHouse’s max_memory_usage setting once we crossed ~100 GB of stored traces. Worth it for compliance, but don’t underestimate the maintenance.

If compliance isn’t a hard blocker, Braintrust’s managed approach saves you a ton of ops headaches. Their free tier (1M spans/month) is generous enough to validate before committing.

What we trace

We started by instrumenting everything and it was way too noisy. What actually works for us:

  • Span per LangGraph node: each node gets a trace span with input/output captured. This gives you the “trajectory” view without drowning in individual LLM call details.
  • LLM calls are sampled at ~20%. The exception is when the node produces an error or a low-confidence output, in which case we capture the full call, including token counts.
  • Tool invocations always get traced with the full request/response, because that’s where the real debugging value is. When our agent pulls the wrong order, we need to see exactly what query it constructed.

This dropped our trace storage by about 60% compared to “trace everything” while keeping the debugging value.
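
For anyone who wants something concrete, here’s a minimal sketch of that capture policy as a single decision function. It isn’t our actual code: `SpanInfo`, `should_capture_full`, and the 0.5 low-confidence cutoff are made-up names and values for illustration, and hashing the trace ID is just one assumed way to get a repeatable ~20% sample (it keeps or drops all LLM calls in a trace together).

```python
import hashlib
from dataclasses import dataclass

LLM_SAMPLE_RATE = 0.20        # the "~20%" figure from the post
LOW_CONFIDENCE_CUTOFF = 0.5   # hypothetical cutoff; tune for your agent


@dataclass
class SpanInfo:
    """Hypothetical description of a span we are about to record."""
    kind: str                     # "node", "tool", or "llm"
    trace_id: str
    node_errored: bool = False
    node_confidence: float = 1.0


def _stable_fraction(trace_id: str) -> float:
    """Map a trace ID to a repeatable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def should_capture_full(span: SpanInfo) -> bool:
    """Decide whether to store the full payload for this span."""
    # Node spans and tool invocations are always captured.
    if span.kind in ("node", "tool"):
        return True
    # LLM calls are always captured when the node looks unhealthy.
    if span.node_errored or span.node_confidence < LOW_CONFIDENCE_CUTOFF:
        return True
    # Otherwise keep a deterministic ~20% sample of LLM calls.
    return _stable_fraction(span.trace_id) < LLM_SAMPLE_RATE
```

You’d call `should_capture_full(...)` wherever your instrumentation decides whether to attach the full request/response to a span. Because the sample is keyed on the trace ID, reprocessing the same trace gives the same decision.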

Trajectory vs outcome evals

Your instinct about needing both is right. For intermediate step evaluation without hand-labeling thousands of examples, here’s what’s worked for us:

  1. Tool call correctness is easy to automate. Did the agent call the right tool with valid parameters? You can write deterministic assertions for this. “Given a refund request for order #X, did the agent call lookup_order with order_id=X?” No LLM judge needed.
  2. Decision branch evaluation uses a smaller LLM as judge. We use Claude Haiku to score whether the agent’s reasoning at each decision point was sound, given the context it had. Cheap enough to run on every trace.
  3. Outcome eval uses your golden set. 200 examples is a decent start for detecting large regressions, but you’ll want to stratify them. If refund flows are 10% of volume but 80% of risk, make sure 40-50 of the 200 examples are refund scenarios rather than the ~20 you’d get from sampling by volume.
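
Points 1 and 3 are easy to sketch in code. This is a rough illustration, not our production harness: the record shapes (`{"name": ..., "args": ...}` for tool calls, `{"category": ...}` for golden examples) and both function names are assumptions, so adapt them to whatever your traces and test set actually look like.

```python
from collections import Counter


def check_tool_call(calls, expected_tool, expected_args):
    """Return True if any recorded call matches the expected tool and args.

    `calls` is a list of {"name": str, "args": dict} records (assumed shape).
    Only the keys listed in `expected_args` are compared.
    """
    for call in calls:
        if call["name"] != expected_tool:
            continue
        if all(call["args"].get(k) == v for k, v in expected_args.items()):
            return True
    return False


def understratified(golden_set, minimums):
    """Return {category: shortfall} for categories below their minimum count."""
    counts = Counter(example["category"] for example in golden_set)
    return {
        category: needed - counts[category]
        for category, needed in minimums.items()
        if counts[category] < needed
    }


# Deterministic trajectory check: did the agent look up the right order?
trajectory = [
    {"name": "lookup_order", "args": {"order_id": "A123"}},
    {"name": "check_inventory", "args": {"sku": "SKU-9"}},
]
looked_up_right_order = check_tool_call(
    trajectory, "lookup_order", {"order_id": "A123"}
)

# Stratification check: a volume-proportional set has too few refund cases.
golden = [{"category": "refund"}] * 20 + [{"category": "order_status"}] * 180
gaps = understratified(golden, {"refund": 40})
```

The first function gives you a pass/fail per trace with no judge involved. The second is just a guard you can run in CI so the golden set doesn’t drift back toward being volume-proportional.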

Alerting on quality drops

We actually have this running. The setup:

# Pseudo-pipeline
Trace comes in -> Langfuse webhook fires -> Lambda runs LLM-as-judge eval
-> Score gets written back to Langfuse -> CloudWatch metric from score
-> Alarm on rolling 1hr avg dropping below threshold -> PagerDuty

The tricky part is calibrating thresholds. LLM-as-judge scores have variance, so you need a decent window (we use 1 hour, ~200 traces) to smooth out noise. We also found that monitoring score distribution shift (via a simple KL divergence check) catches subtle degradation better than a hard threshold.
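
Here’s a rough sketch of the two checks, assuming judge scores are normalized to [0, 1]. The class and function names, the 5-bin histogram, and the add-one smoothing are all illustrative choices rather than what we run; in our setup the rolling average itself lives in CloudWatch.

```python
import math
from collections import deque


class RollingMean:
    """Mean of scores seen in the last `window_seconds` (in-order timestamps)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._items = deque()   # (timestamp, score) pairs
        self._total = 0.0

    def add(self, timestamp, score):
        """Record a score and return the mean over the current window."""
        self._items.append((timestamp, score))
        self._total += score
        cutoff = timestamp - self.window_seconds
        # Evict scores that have aged out of the window.
        while self._items and self._items[0][0] <= cutoff:
            _, old_score = self._items.popleft()
            self._total -= old_score
        return self._total / len(self._items)


def histogram(scores, bins=5, smoothing=1.0):
    """Bucket scores in [0, 1] into equal bins; return smoothed probabilities."""
    counts = [smoothing] * bins
    for score in scores:
        index = max(0, min(int(score * bins), bins - 1))
        counts[index] += 1
    total = sum(counts)
    return [count / total for count in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distribution_shift(baseline_scores, current_scores, bins=5):
    """How far the current window's score distribution is from the baseline."""
    return kl_divergence(
        histogram(current_scores, bins), histogram(baseline_scores, bins)
    )
```

The smoothing term is there so an empty bin in the baseline doesn’t blow up the log. You’d alarm when `distribution_shift(...)` exceeds a value you’ve calibrated on a few known-good weeks, alongside the plain threshold on the rolling mean.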

One gotcha: when you swap models or update prompts, your LLM judge scores will shift even if quality is the same, because the judge’s expectations were calibrated on the old style. We re-baseline our judge scoring after every major prompt change.

My recommendation

Given you’re on AWS ECS and have compliance concerns, I’d start with Langfuse self-hosted. Deploy it on ECS with ClickHouse on a dedicated instance (not Fargate; ClickHouse wants consistent disk I/O). Use their Docker Compose setup as a starting point, but break it into separate ECS services. Set up the webhook-based eval pipeline early rather than bolting it on later.