Self-hosting LLMs with vLLM in production - how to handle memory and scaling?

I’ve been looking into self-hosting open-weight LLMs (Qwen3, Gemma variants, etc.) for an internal tool at work. We want to avoid sending data to third-party APIs for privacy reasons.

vLLM seems like the go-to inference server right now, and I’ve gotten a basic setup running locally. But I’m hitting some walls when thinking about production:

  1. VRAM planning - I know the rough formula is params × bytes_per_param + KV_cache + runtime_overhead, but in practice how do you estimate KV cache needs for variable-length requests? We’d be serving maybe 50 concurrent users with prompts ranging from 500 to 8k tokens.

  2. PagedAttention tuning - vLLM’s PagedAttention is supposed to cut KV cache waste from ~60-80% down to under 4%, but are there config knobs I should be tweaking beyond the defaults? Things like gpu-memory-utilization, block sizes, etc.

  3. Scaling strategy - At what point does it make sense to move from a single-node vLLM setup to something like Ray Serve for multi-node? We have two A100 80GB cards right now and might add more.

  4. Quantization tradeoffs - For a coding assistant use case, is AWQ or GPTQ noticeably worse than FP16 in practice? Trying to figure out if we can fit a 70B model quantized vs running a 7B-13B model at full precision.

Anyone running vLLM in prod at a similar scale? Would love to hear what worked and what didn’t.


Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!

I’ve been running vLLM in production for about 6 months now serving internal tools, so I can share what we’ve learned the hard way.

VRAM planning

The formula you mentioned is right in theory, but in practice you want to be more conservative. For 50 concurrent users with prompts up to 8k tokens, here’s how I’d think about it:

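For a 7B parameter model at FP16, you’re looking at roughly 14GB just for weights. KV cache is where it gets tricky. Per token it costs 2 × layers × KV heads × head dim × bytes per element, so the number depends a lot on the architecture. For a 7B-class model with grouped-query attention and requests averaging a couple thousand tokens, 50 concurrent requests land around 10-15GB of KV cache. The true worst case, all 50 requests sitting at 8k context at once, can be several times that. vLLM queues or preempts requests when the cache pool fills up rather than crashing, so a single A100 80GB works, but you won’t have a ton of headroom.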
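To put numbers on the KV cache part, here is a rough back-of-the-envelope sketch. The two model shapes are hypothetical 7B-class configs I picked for illustration (read the real layer and head counts from your model’s config.json), and the 2k-token average is an assumption about traffic, not a measurement.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: one key and one value vector
    per layer per KV head, at bytes_per_elem each (2 for FP16)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_cache_gib(num_tokens, **arch):
    """KV cache size in GiB for num_tokens cached tokens."""
    return num_tokens * kv_cache_bytes_per_token(**arch) / 2**30


# Hypothetical 7B-class shapes, for illustration only.
gqa_7b = dict(num_layers=32, num_kv_heads=8, head_dim=128)   # grouped-query attention
mha_7b = dict(num_layers=32, num_kv_heads=32, head_dim=128)  # full multi-head attention

worst_case_tokens = 50 * 8192  # every user at the full context length
typical_tokens = 50 * 2000     # assumed average of ~2k tokens per request

for name, arch in [("GQA", gqa_7b), ("MHA", mha_7b)]:
    print(
        name,
        round(kv_cache_gib(worst_case_tokens, **arch), 1), "GiB worst case,",
        round(kv_cache_gib(typical_tokens, **arch), 1), "GiB typical",
    )
```

With these assumed shapes the GQA model comes out at 50 GiB worst case and about 12 GiB typical, while the full multi-head variant needs four times as much. That gap is why it is worth checking the KV head count before sizing hardware.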

The practical move is to leave --gpu-memory-utilization at 0.90 (the default) and let vLLM manage the KV cache pool. Then monitor vllm:gpu_cache_usage_perc via the metrics endpoint. It is reported as a fraction from 0 to 1; if you’re consistently above 0.85, you need to either add GPUs or reduce --max-model-len or --max-num-seqs.

# Start with explicit limits and monitoring
python -m vllm.entrypoints.openai.api_server \
  --model your-model \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 64 \
  --enable-prefix-caching

PagedAttention tuning

Honestly, the defaults are pretty good for most workloads. The main knob worth touching is --max-num-seqs, which controls how many requests can be processed concurrently. For 50 users, setting this to 48-64 is a reasonable starting point. Going higher means more KV cache pressure; going lower means requests queue up.

The big win that a lot of people miss is --enable-prefix-caching. If your users are sending similar system prompts or your internal tool has common prefixes, this can cut KV cache usage significantly because it reuses cache blocks for matching prompt prefixes.
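If it helps to see why shared prefixes save so much, here is a toy model of block-level prefix reuse. To be clear, this is not vLLM’s implementation, just a sketch of the idea: the block size of 16 tokens, the SHA-256 chaining, and the function names are all my own assumptions for illustration.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block; an assumed value for illustration


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with everything before it, so two prompts
    share a block hash only when their whole prefix up to that block matches."""
    hashes = []
    prev = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


def blocks_needed(prompts, block_size=BLOCK_SIZE):
    """Return (blocks without sharing, blocks with prefix sharing), full blocks only."""
    all_hashes = [h for p in prompts for h in block_hashes(p, block_size)]
    return len(all_hashes), len(set(all_hashes))


system_prompt = list(range(64))  # 64 shared tokens = 4 full blocks
# 10 users, each appending 32 tokens of their own (2 full blocks)
prompts = [system_prompt + [1000 + i] * 32 for i in range(10)]

print(blocks_needed(prompts))  # (60, 24)
```

In this toy case the 4 system-prompt blocks are stored once instead of 10 times, so 60 blocks drop to 24. The longer the shared prefix relative to the user-specific part, the bigger the saving.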

Scaling across GPUs

For horizontal scaling, run multiple vLLM instances behind a load balancer. Each instance gets its own GPU(s). We use a simple nginx round-robin setup, but you could also use something like Ray Serve if you want more sophisticated routing.
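If you don’t want to stand up nginx right away, the same round-robin idea can be sketched on the client side. The RoundRobin class is a hypothetical helper of mine, and the two URLs assume instances published on local ports 8001 and 8002 (as in the compose file further down); swap in your own hosts.

```python
from itertools import cycle


class RoundRobin:
    """Hand out backend URLs in rotation. No health checks or retries."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("need at least one backend")
        self._cycle = cycle(backends)

    def next_url(self, path="/v1/chat/completions"):
        """Return the full URL of the next backend in the rotation."""
        return next(self._cycle) + path


# Assumed local ports for two vLLM instances.
balancer = RoundRobin(["http://localhost:8001", "http://localhost:8002"])

for _ in range(3):
    print(balancer.next_url())
```

Plain rotation ignores how busy each instance is, so one long generation can leave a backend overloaded while the other sits idle. That is the main reason to eventually move to a balancer that is aware of queue depth.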

For tensor parallelism (splitting one model across GPUs), set --tensor-parallel-size. This is mainly useful for models that don’t fit on a single GPU. For 7B-13B models, a single GPU is usually fine.
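A quick way to sanity-check whether you need tensor parallelism at all is to compare weight size against usable GPU memory. This is only a sketch: it counts weights and nothing else (no KV cache, no runtime overhead), the 0.5 bytes per parameter for 4-bit quantization is an approximation that ignores scales and zero points, and the function names are mine.

```python
import math


def weight_gb(params_billion, bytes_per_param):
    """Approximate weight size in GB: billions of params times bytes per param."""
    return params_billion * bytes_per_param


def min_gpus_for_weights(params_billion, bytes_per_param, gpu_gb=80, utilization=0.90):
    """Smallest GPU count whose usable memory can hold the weights alone."""
    usable_per_gpu = gpu_gb * utilization
    return math.ceil(weight_gb(params_billion, bytes_per_param) / usable_per_gpu)


for label, params, nbytes in [
    ("7B FP16", 7, 2),
    ("70B FP16", 70, 2),
    ("70B 4-bit", 70, 0.5),
]:
    print(label, weight_gb(params, nbytes), "GB ->", min_gpus_for_weights(params, nbytes), "GPU(s)")
```

Note that 70B at FP16 technically fits the weights on two 80GB cards, but with only a few GB left over for KV cache it would not be usable in practice. Treat the result as a lower bound, not a recommendation.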

# docker-compose for multi-instance setup
services:
  vllm-1:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    command: ["--model", "your-model", "--max-model-len", "8192"]
    ports:
      - "8001:8000"
  vllm-2:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
    command: ["--model", "your-model", "--max-model-len", "8192"]
    ports:
      - "8002:8000"

Monitoring

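vLLM exposes a Prometheus metrics endpoint at /metrics. Metric names have changed between vLLM releases, so check what your version actually exports. The key ones to watch on our setup: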

  • vllm:gpu_cache_usage_perc - KV cache utilization
  • vllm:num_requests_running - active requests
  • vllm:num_requests_waiting - queued requests (if this grows, you need more capacity)
  • vllm:avg_generation_throughput_toks_per_s - throughput

Set alerts on queue depth and cache usage. Those two metrics will tell you when you need to scale before your users start complaining about latency.
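If you don’t have Prometheus and Alertmanager wired up yet, a small script can cover the two alerts in the meantime. This sketch parses the text format by hand and uses the metric names listed above. The thresholds (0.85 cache usage, 5 waiting requests) and the sample payload are assumptions for illustration, and the parser ignores labels entirely.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: [values]}, ignoring labels."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        if "{" in line:
            name = line[:line.index("{")]
            rest = line[line.rindex("}") + 1:].split()
        else:
            name, *rest = line.split()
        samples.setdefault(name, []).append(float(rest[0]))
    return samples


def check_alerts(metrics, max_cache_usage=0.85, max_waiting=5):
    """Return human-readable alerts for KV cache usage and queue depth."""
    alerts = []
    cache = max(metrics.get("vllm:gpu_cache_usage_perc", [0.0]))
    waiting = sum(metrics.get("vllm:num_requests_waiting", [0.0]))
    if cache > max_cache_usage:
        alerts.append(f"KV cache usage {cache:.2f} is above {max_cache_usage:.2f}")
    if waiting > max_waiting:
        alerts.append(f"{int(waiting)} requests waiting (limit {max_waiting})")
    return alerts


# Made-up sample payload; in practice fetch the text from your instance's /metrics URL.
SAMPLE = """
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="your-model"} 0.91
vllm:num_requests_running{model_name="your-model"} 48.0
vllm:num_requests_waiting{model_name="your-model"} 7.0
"""

for alert in check_alerts(parse_metrics(SAMPLE)):
    print(alert)
```

To run it against a live server, read the body with urllib.request.urlopen on the /metrics URL and pass the decoded text to parse_metrics.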
