We’re running about a dozen Node.js microservices (mix of Express and Fastify) and our current observability setup is kind of a mess. We’ve got Datadog APM on some services, custom Winston logging on others, and Prometheus metrics sprinkled in where someone felt like adding them. It works, but tracing a request across services is painful.
I want to standardize everything on OpenTelemetry so we have one consistent approach. But I have a bunch of questions before diving in:
Auto-instrumentation vs manual spans
The auto-instrumentation for Node.js looks like it picks up HTTP calls, database queries, etc. automatically. How reliable is this in practice? We use Prisma, Redis (via ioredis), and make a lot of inter-service calls over gRPC. Do those all get traced automatically or do I need custom spans for some of them?
The collector question
Should we run the OTel Collector as a sidecar per service, or as a standalone gateway? We’re on Kubernetes, so both are doable. I’ve seen arguments for both. The sidecar approach seems simpler but uses more resources. The gateway approach is more efficient but adds a single point of failure.
Exporter backend
We’re debating between sticking with Datadog (which now accepts OTLP) vs switching to something like Grafana Cloud or SigNoz. Anyone have experience with the Datadog OTLP ingest? Is it full-featured or are there gaps compared to their native agent?
Cost and cardinality
This is the one I’m most nervous about. We had a cardinality explosion with Prometheus once that took down our monitoring. What’s the best way to keep metric cardinality under control with OTel? Are there good strategies for sampling traces in a way that still catches errors?
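To make the question concrete, here's a rough sketch of the two ideas I have in mind: collapsing IDs out of route labels so a metric attribute stays bounded, and a keep/drop decision that always retains error or slow traces and samples the rest deterministically by trace ID. Everything here is hypothetical and hand-rolled (the names `normalizeRoute`, `keepTrace`, and `TraceSummary` are mine, not OTel SDK APIs), and since the decision needs the finished trace, I'd assume the real thing belongs in the Collector's tail-sampling processor rather than in app code.

```typescript
// Hypothetical helpers for illustration; not part of any OpenTelemetry package.

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_RE = /^\d+$/;

// Replace numeric and UUID path segments with placeholders and drop the
// query string, so the value is safe to use as a metric attribute.
function normalizeRoute(path: string): string {
  const q = path.indexOf("?");
  const pathOnly = q === -1 ? path : path.slice(0, q);
  return pathOnly
    .split("/")
    .map((seg) => {
      if (NUMERIC_RE.test(seg)) return ":id";
      if (UUID_RE.test(seg)) return ":uuid";
      return seg;
    })
    .join("/");
}

// Assumed shape of what we know once a trace has completed.
interface TraceSummary {
  traceId: string; // 32 hex characters, as in W3C Trace Context
  hasError: boolean;
  durationMs: number;
}

// Keep every error trace and every slow trace; keep roughly `baseRatio`
// of the remainder, chosen deterministically from the trace ID so every
// service makes the same decision for the same trace.
function keepTrace(t: TraceSummary, baseRatio: number, slowMs: number): boolean {
  if (t.hasError) return true;
  if (t.durationMs >= slowMs) return true;
  // Map the last 8 hex characters of the trace ID onto [0, 1).
  const bucket = parseInt(t.traceId.slice(-8), 16) / 0x100000000;
  return bucket < baseRatio;
}
```

With something like this, `/users/12345/orders` and `/users/67890/orders` would land under the same label, and a 10% base ratio would still keep every failed request. Is that roughly the right shape, or is there a more standard way to do it?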
Any war stories from doing this migration would be super helpful. I'm especially interested in what order you'd tackle services in, and whether there's a good incremental approach.
Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!