Standardizing observability across Node.js microservices with OpenTelemetry

We’re running about a dozen Node.js microservices (mix of Express and Fastify) and our current observability setup is kind of a mess. We’ve got Datadog APM on some services, custom Winston logging on others, and Prometheus metrics sprinkled in where someone felt like adding them. It works, but tracing a request across services is painful.

I want to standardize everything on OpenTelemetry so we have one consistent approach. But I have a bunch of questions before diving in:

Auto-instrumentation vs manual spans
The auto-instrumentation for Node.js looks like it picks up HTTP calls, database queries, etc. automatically. How reliable is this in practice? We use Prisma, Redis (via ioredis), and make a lot of inter-service calls over gRPC. Do those all get traced automatically or do I need custom spans for some of them?

The collector question
Should we run the OTel Collector as a sidecar per service, or as a standalone gateway? We’re on Kubernetes, so both are doable. I’ve seen arguments for both. The sidecar approach seems simpler but uses more resources. The gateway approach is more efficient but adds a single point of failure.

Exporter backend
We’re debating between sticking with Datadog (which now accepts OTLP) vs switching to something like Grafana Cloud or SigNoz. Anyone have experience with the Datadog OTLP ingest? Is it full-featured or are there gaps compared to their native agent?

Cost and cardinality
This is the one I’m most nervous about. We had a cardinality explosion with Prometheus once that took down our monitoring. What’s the best way to keep metric cardinality under control with OTel? Are there good strategies for sampling traces in a way that still catches errors?

Any war stories from doing this migration would be super helpful. Especially interested in what order you’d tackle services in, if there’s a good incremental approach.


Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!

I’ve rolled out OTel across a similar stack (mix of Express and Fastify, about 15 services) so I can share what worked for us.

Auto-instrumentation coverage

The Node.js auto-instrumentation is solid for the basics, but you’ll hit gaps with some libraries. Here’s the breakdown for your stack:

  • HTTP/Express/Fastify: Auto-instrumented out of the box. Works great.
  • ioredis: Has an official instrumentation package (@opentelemetry/instrumentation-ioredis). Traces commands automatically.
  • Prisma: This one’s tricky. There’s @prisma/instrumentation that works with OTel, but you need to register it explicitly. It won’t get picked up by the generic auto-instrumentation loader.
  • gRPC: @opentelemetry/instrumentation-grpc handles this well. Context propagation across gRPC calls works natively, which is huge for distributed tracing.

The general rule: if a library has an official instrumentation package (usually @opentelemetry/instrumentation-*, or a vendor-maintained one like @prisma/instrumentation), use it. For anything else, you’ll want manual spans.

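// tracing.js
// Load this file before any application code so the instrumentations can
// patch libraries as they are first required, for example:
//   node --require ./tracing.js server.js   (server.js is a placeholder name)
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { PrismaInstrumentation } = require('@prisma/instrumentation');

const sdk = new NodeSDK({
  instrumentations: [
    getNodeAutoInstrumentations({
      // Disable the ones you don't need to cut noise
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
    // Prisma is not part of the auto-instrumentation bundle, so register it explicitly
    new PrismaInstrumentation(),
  ],
});
sdk.start();

// Flush pending telemetry when the pod is asked to stop
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});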
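
For code paths that no instrumentation package covers, a manual span is just a thin wrapper around the unit of work. The sketch below is one way to do it, not the only one. `withSpan` is a hypothetical helper name, and it takes the tracer as an argument so it can be exercised without the SDK. In a service you would pass in a tracer from `trace.getTracer(...)` in `@opentelemetry/api`. The tracer name, span name, attribute key, and `renderInvoice` in the usage comment are placeholders.

```javascript
// Numeric value of SpanStatusCode.ERROR in @opentelemetry/api.
const SPAN_STATUS_ERROR = 2;

/**
 * Hypothetical helper: runs `fn` inside an active span, records any thrown
 * error on the span, and always ends the span.
 *
 * @param {object} tracer      a tracer, e.g. trace.getTracer('my-service')
 * @param {string} name        span name
 * @param {object} attributes  initial span attributes
 * @param {function} fn        the work to trace; receives the span
 */
async function withSpan(tracer, name, attributes, fn) {
  return tracer.startActiveSpan(name, { attributes }, async (span) => {
    try {
      return await fn(span);
    } catch (err) {
      // Mark the span as failed so error-focused views can find it.
      span.recordException(err);
      span.setStatus({ code: SPAN_STATUS_ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}

// Usage in a service (assumes @opentelemetry/api is installed):
//   const { trace } = require('@opentelemetry/api');
//   const tracer = trace.getTracer('billing-service');
//   const pdf = await withSpan(tracer, 'invoice.render', { 'invoice.id': id },
//     () => renderInvoice(id));
```

Because the helper uses `startActiveSpan`, any auto-instrumented call made inside `fn` (an HTTP request, a Redis command) should show up as a child of the manual span.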

The Collector question

Definitely use a Collector. Since you’re on Kubernetes, you can run one as a sidecar per service or as a DaemonSet (one Collector per node); either way it gives you a buffer between your apps and your backends. The biggest win is that you can reconfigure where data goes without redeploying your services.

Here’s a minimal Collector config that works well as a starting point:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  otlp:
    endpoint: your-backend:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
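
On the application side, this setup means each service only needs to know where the Collector is, and that address can come from the environment rather than from code. The sketch below shows roughly how an OTLP/HTTP trace endpoint is derived from the standard `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` and `OTEL_EXPORTER_OTLP_ENDPOINT` variables. `resolveTraceEndpoint` is a hypothetical helper written for illustration (the SDK exporters do their own resolution), and the `otel-collector` hostname in the usage comment is a placeholder.

```javascript
// Default OTLP/HTTP endpoint when nothing is configured.
const DEFAULT_OTLP_HTTP_ENDPOINT = 'http://localhost:4318';

/**
 * Hypothetical helper: works out where OTLP/HTTP traces would be sent,
 * given a set of environment variables.
 */
function resolveTraceEndpoint(env = process.env) {
  // A signal-specific endpoint is used exactly as given.
  if (env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT) {
    return env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT;
  }
  // The generic endpoint is a base URL; the traces path is appended to it.
  const base = env.OTEL_EXPORTER_OTLP_ENDPOINT || DEFAULT_OTLP_HTTP_ENDPOINT;
  return base.replace(/\/+$/, '') + '/v1/traces';
}

// Usage: with OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318 set in the
// pod spec, resolveTraceEndpoint() points at the Collector's HTTP receiver.
```

Keeping the address in the pod spec is what lets you switch backends by editing only the Collector's exporters section.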

Migrating incrementally

Don’t try to do it all at once. Here’s what worked for us:

  1. Start with the Collector and one service. Get traces flowing end to end.
  2. Add auto-instrumentation to each service one at a time. Keep your existing logging/metrics running in parallel.
  3. Once you trust the OTel data, start cutting over dashboards.
  4. Kill the old instrumentation last.

Honestly, the hardest part isn’t the technical setup; it’s getting the team to agree on naming conventions for custom spans and attributes. Establish those early or you’ll end up with user.id, userId, and user_id all meaning the same thing across different services.
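
One way to make a naming convention stick is to keep the agreed attribute keys in a small shared module and normalize known variants before they reach a span. The following is a minimal sketch under that assumption. `ATTR`, `ALIASES`, and `normalizeAttributes` are hypothetical names, and `order.id` is an invented second key included only to show the pattern.

```javascript
// Canonical attribute keys the team has agreed on (illustrative list).
const ATTR = Object.freeze({
  USER_ID: 'user.id',
  ORDER_ID: 'order.id',
});

// Known variants that should be rewritten to the canonical key.
const ALIASES = new Map([
  ['userId', ATTR.USER_ID],
  ['user_id', ATTR.USER_ID],
  ['orderId', ATTR.ORDER_ID],
  ['order_id', ATTR.ORDER_ID],
]);

/**
 * Returns a copy of `attributes` with aliased keys renamed to their
 * canonical form. Keys that are not in the alias table pass through unchanged.
 */
function normalizeAttributes(attributes) {
  const out = {};
  for (const [key, value] of Object.entries(attributes)) {
    out[ALIASES.get(key) ?? key] = value;
  }
  return out;
}

// Usage: span.setAttributes(normalizeAttributes({ userId: 'u-123' }));
```

Importing the constants (`ATTR.USER_ID`) in new code and using the normalizer only where older call sites still pass variants keeps the alias table from growing.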

For your 12-service setup, I’d budget about 2-3 weeks to get everything instrumented and another week to get dashboards where you want them. It’s worth it, though: being able to trace a request from the API gateway through gRPC calls all the way to the database is a game changer for debugging.
