How are you handling FinOps for GPU workloads in Kubernetes?

We recently started running inference workloads on GPU nodes in our EKS cluster, and the cost visibility is… not great. The standard Kubecost setup gives us per-namespace and per-pod breakdowns for CPU and memory, but GPU utilization and cost attribution feel like a blind spot.

A few things we’re struggling with:

  • Idle GPU cost is massive. We have nodes with A10G GPUs that sit at 20% utilization half the day because our batch jobs only run during certain windows. Spot instances help but availability is inconsistent.
  • Shared GPU scheduling with time-slicing (the NVIDIA device plugin supports this now) makes it harder to attribute cost per workload. If three pods share a GPU, who owns the cost?
  • Chargeback to teams is a mess. Our platform team eats the GPU bill right now, but product teams have no incentive to optimize their model serving configs.

We’ve looked at OpenCost with the GPU exporter plugin and also CloudHealth, but curious what others are actually using in production. Are any of you doing proper GPU cost allocation across teams? Did you build something custom with Prometheus metrics, or is there a tool that handles this well out of the box?

Also interested in whether anyone has set up auto-shutdown policies for GPU nodes during off-peak hours. We tried Karpenter consolidation policies but it gets tricky when you have pods with long graceful shutdown periods.


Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!

We ran into almost the exact same problems when we started running inference on A10Gs in GKE. Here’s what ended up working for us after a lot of trial and error.

GPU utilization visibility: Kubecost alone won’t cut it for GPU cost attribution. We layered in DCGM Exporter (NVIDIA’s official Prometheus exporter) to get per-pod GPU utilization, memory usage, and power draw. That feeds into Grafana dashboards where we can actually see which pods are burning GPU cycles and which are just sitting there.

Tackling idle cost: The biggest win for us was moving to a KEDA-based autoscaler that watches a custom metric from our job queue. Instead of keeping GPU nodes warm all day, we scale from zero when batch jobs land in the queue and scale back down when the queue drains. Combined with a mix of on-demand (for baseline) and spot (for burst), we cut our GPU spend by about 40%.
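To make the scale-from-zero behavior concrete, here is a rough sketch of the replica arithmetic a queue-driven autoscaler applies. It is an illustration only, not KEDA's actual code, and the function name, `jobs_per_worker`, and `max_workers` are made-up parameters.

```python
import math


def desired_gpu_workers(queue_length: int, jobs_per_worker: int, max_workers: int) -> int:
    """Worker count for a queue-driven scaler that is allowed to scale to zero.

    Hypothetical helper: jobs_per_worker is the queue depth one worker is
    expected to absorb, max_workers caps the GPU spend.
    """
    if queue_length <= 0:
        # Empty queue: no workers, so the GPU nodes behind them can be removed.
        return 0
    # Otherwise size the pool to the backlog, rounded up, within the cap.
    return min(max_workers, math.ceil(queue_length / jobs_per_worker))
```

With `jobs_per_worker=5` and `max_workers=10`, a backlog of 12 jobs asks for 3 workers, and an empty queue asks for none, which is what lets the node autoscaler reclaim the GPU nodes.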

For the time-slicing attribution problem, we ended up using NVIDIA MPS (Multi-Process Service) instead of basic time-slicing. MPS gives you better isolation and, more importantly, per-process GPU utilization metrics. We tag each pod with a team label and use a PromQL query to calculate per-team GPU-hours:

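sum by (label_team) (
  avg_over_time(DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod!=""}[1h])
  * on (namespace, pod) group_left(label_team)
  kube_pod_labels{label_team!=""}
)

DCGM_FI_PROF_GR_ENGINE_ACTIVE is a gauge between 0 and 1, so averaging it over the hour and summing per team gives a pretty accurate per-team GPU-hours number for that hour, which you can multiply by your hourly cost per GPU (node cost divided by GPUs per node). This assumes both metrics carry matching namespace and pod labels, and that kube-state-metrics is configured to expose the team label, which shows up as label_team.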
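As a worked example of that multiplication, here is a minimal sketch that turns per-team GPU-hours into dollars and reports the unused remainder as its own line. The function name, the "(idle)" bucket, and all the numbers are illustrative assumptions, not part of our actual tooling.

```python
def gpu_cost_report(
    active_gpu_hours: dict[str, float],
    provisioned_gpu_hours: float,
    hourly_cost_per_gpu: float,
) -> dict[str, float]:
    """Convert per-team active GPU-hours to cost and expose the idle remainder.

    provisioned_gpu_hours is the number of GPUs times the hours they were up;
    anything not claimed by a team is reported under "(idle)".
    """
    report = {
        team: hours * hourly_cost_per_gpu
        for team, hours in active_gpu_hours.items()
    }
    used = sum(active_gpu_hours.values())
    idle_hours = max(provisioned_gpu_hours - used, 0.0)
    report["(idle)"] = idle_hours * hourly_cost_per_gpu
    return report
```

For example, `gpu_cost_report({"search": 2.0, "ranking": 1.0}, provisioned_gpu_hours=8.0, hourly_cost_per_gpu=1.5)` attributes the 3 active GPU-hours to the two teams and leaves the other 5 in the idle bucket. Whether idle cost stays with the platform team or gets split across teams is a policy decision this sketch deliberately leaves open.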

Chargeback model: We built a simple internal tool that pulls from the Prometheus API nightly, calculates per-team GPU-hours, and generates a weekly cost report. Each team sees their share of the GPU bill. It’s not perfect, but once teams could see how much their idle models were costing, they suddenly got very motivated to optimize batch scheduling and right-size their resource requests.
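For anyone wanting to build something similar, a stripped-down sketch of such a report job might look like the following. It is not our actual tool: the function names, the default `team` label, and the report layout are assumptions for illustration. The only external interface it relies on is the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`), and it assumes the query you pass returns one GPU-hours value per team.

```python
import json
import urllib.parse
import urllib.request


def fetch_instant_query(prometheus_url: str, promql: str) -> dict:
    """Run an instant query against the Prometheus HTTP API; return decoded JSON."""
    params = urllib.parse.urlencode({"query": promql})
    url = f"{prometheus_url}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def gpu_hours_by_team(response: dict, team_label: str = "team") -> dict[str, float]:
    """Extract {team: value} from an instant-vector response.

    team_label is an assumption: use whatever label your query groups by
    (for example "label_team" when joining against kube_pod_labels).
    """
    if response.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {response}")
    totals: dict[str, float] = {}
    for series in response["data"]["result"]:
        team = series["metric"].get(team_label, "unlabeled")
        _timestamp, value = series["value"]  # sample values arrive as strings
        totals[team] = totals.get(team, 0.0) + float(value)
    return totals


def format_report(gpu_hours: dict[str, float], hourly_cost_per_gpu: float) -> str:
    """Render a plain-text table, biggest consumers first."""
    lines = [f"{'team':<15} {'gpu_hours':>9} {'cost':>9}"]
    ranked = sorted(gpu_hours.items(), key=lambda kv: kv[1], reverse=True)
    for team, hours in ranked:
        cost = hours * hourly_cost_per_gpu
        lines.append(f"{team:<15} {hours:>9.1f} {cost:>9.2f}")
    return "\n".join(lines)
```

A nightly job would call `fetch_instant_query` with your Prometheus URL and per-team query, pass the result through `gpu_hours_by_team`, and store or send the output of `format_report`. Keeping the parsing separate from the HTTP call makes it easy to test against a saved response.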

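One more thing on Karpenter, since you’re on EKS and already trying it: I’d stick with it. Its consolidation feature is really good at bin-packing GPU workloads and terminating underutilized nodes, and it’s way better than Cluster Autoscaler for heterogeneous GPU node pools. For the pods with long graceful shutdown periods, the karpenter.sh/do-not-disrupt pod annotation stops Karpenter from voluntarily disrupting the node while they’re running.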