WebLLM + WebGPU as a fallback for AI features in React - anyone shipping this in production?

I’ve been building a React app with an AI chat assistant that calls Claude via our backend API. Works great, but we’re getting feature requests for offline support and a few enterprise customers are asking about on-premise options where their data never leaves the browser.

WebLLM (the MLC-AI project) looks really promising. Their benchmarks show Llama 3.1 8B quantized hitting ~41 tok/s on an M3 Max through WebGPU, and Phi 3.5 mini at 71 tok/s. That’s actually usable for a chat experience. And since it exposes an OpenAI-compatible API, theoretically I could swap my API client to point at the local engine with minimal code changes.

But I have a bunch of practical questions before I invest a sprint into this:

  1. Model download UX: These quantized models are still 2-4GB. How do you handle the initial download in a web app without it feeling terrible? I’m thinking Service Worker + Cache API for persistence, but the first-run experience of “please wait while we download 3GB” seems rough. Has anyone built a good progressive loading pattern for this?

  2. WebGPU availability: Coverage is supposedly ~90% on desktop but only ~70-75% on mobile. Do you just feature-flag the whole thing and fall back to the API? Or is there a reasonable WASM-only fallback path that’s still fast enough?

  3. Memory pressure: If someone has 8GB of RAM and a bunch of tabs open, loading a 4-bit quantized 8B model is going to cause problems. Is there a reliable way to detect available memory before attempting to load the model? navigator.deviceMemory gives you a rough bucket but it’s not precise.

  4. Quality gap: The obvious tradeoff is that a local 8B model isn’t going to match Claude or GPT-5 for complex reasoning. How are you managing user expectations? Do you show a “running locally” indicator and adjust the UI to set different expectations?

  5. Hybrid routing: Ideally I’d use the local model for simple stuff (reformatting text, quick summaries, autocomplete) and route complex queries to the API. Has anyone built a client-side router that decides where to send each request?

For context, our app is Next.js 15, and the AI features are in a chat panel component. I’d want the WebGPU inference to run in a Web Worker so it doesn’t block the main thread.

Would love to hear from anyone who’s actually shipped this pattern, not just prototyped it. What broke that you didn’t expect?


Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!

I shipped a WebLLM fallback in a production React app last quarter, so I can share what actually happened vs. what the benchmarks promise.

Model download UX: chunked loading with progress

The “please wait 3GB” problem is real, but solvable. We show a first-run onboarding modal that explains why you’re downloading a model (“enable offline AI”), shows a chunked progress bar, and lets the user continue using the API-backed version while the download happens in the background. The model gets cached in the Cache API via a Service Worker, so subsequent visits load from disk in about 2-3 seconds.

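// Simplified model loader with progress
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

// worker.ts contains only:
//   import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";
//   const handler = new WebWorkerMLCEngineHandler();
//   self.onmessage = (msg) => handler.onmessage(msg);
const worker = new Worker(new URL("./worker.ts", import.meta.url), { type: "module" });

// Load in a Web Worker so the UI stays responsive
const engine = await CreateWebWorkerMLCEngine(worker, "Phi-3.5-mini-instruct-q4f16_1-MLC", {
  initProgressCallback: (report) => {
    // report.text is a human-readable status string
    // report.progress is a float from 0 to 1
    updateLoadingUI(report.progress, report.text);
  },
});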

We went with Phi-3.5 mini (q4f16, ~2.2GB) instead of Llama 8B. The smaller download is a better tradeoff for our use case, and inference speed is noticeably snappier on mid-range hardware.
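
For the "keep using the API while the download runs" part, here is a minimal sketch of the state we track, not our exact code. LocalModelState, ModelLoader, and the status names are hypothetical; the only thing borrowed from WebLLM is the shape of the progress report (progress and text). The loader is injected, so it can wrap whatever call loads the engine.

```typescript
type LoadStatus = "idle" | "downloading" | "ready" | "failed";

interface ProgressReport {
  progress: number; // float from 0 to 1
  text: string; // human-readable status
}

// Hypothetical: any function that loads the model and reports progress.
type ModelLoader = (onProgress: (r: ProgressReport) => void) => Promise<void>;

class LocalModelState {
  status: LoadStatus = "idle";
  percent = 0;
  private readonly load: ModelLoader;
  private readonly onChange: (s: LocalModelState) => void;

  constructor(load: ModelLoader, onChange: (s: LocalModelState) => void = () => {}) {
    this.load = load;
    this.onChange = onChange;
  }

  // Starts the download in the background. The returned promise always
  // resolves; a failed load is recorded in `status` instead of thrown.
  start(): Promise<void> {
    if (this.status === "downloading" || this.status === "ready") {
      return Promise.resolve();
    }
    this.status = "downloading";
    this.onChange(this);
    return this.load((r) => {
      this.percent = Math.round(r.progress * 100);
      this.onChange(this);
    }).then(
      () => {
        this.status = "ready";
        this.percent = 100;
        this.onChange(this);
      },
      () => {
        this.status = "failed";
        this.onChange(this);
      },
    );
  }

  // Chat requests keep going to the API until the model is fully loaded.
  backend(): "local" | "cloud" {
    return this.status === "ready" ? "local" : "cloud";
  }
}
```

The onChange callback is where a React component would call its state setter to redraw the progress bar. Because backend() only returns "local" once the status is "ready", a failed or half-finished download never changes what the chat panel does.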

WebGPU detection: feature flag everything

Don’t try to make it work everywhere. We use a capability check on app init:

async function checkWebGPUSupport(): Promise<boolean> {
  if (!navigator.gpu) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    if (!adapter) return false;
    // adapter.info replaces requestAdapterInfo(), which was dropped from the WebGPU spec
    const info = adapter.info;
    // Reject known-bad GPUs (some Intel integrated chips crash).
    // Best effort only: browsers may leave description empty.
    const blocklist = ['Intel HD Graphics 4000', 'Intel HD Graphics 3000'];
    return !blocklist.some(gpu => info.description?.includes(gpu));
  } catch {
    return false;
  }
}

If WebGPU isn’t available, we don’t fall back to WASM. Honestly, WASM inference on an 8B model is too slow to be a good user experience. We just use the API and the user never sees the local option. About 85% of our desktop users get WebGPU; mobile is lower but most of our mobile traffic uses the API anyway.
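
One caveat on the blocklist: the adapter info fields can come back as empty strings, so matching on description alone can miss a GPU you meant to block. A slightly more defensive matcher might look like the sketch below. AdapterInfoLike and isBlockedAdapter are hypothetical names; the four fields mirror the string attributes of GPUAdapterInfo.

```typescript
// Mirrors the string fields of GPUAdapterInfo; any of them may be empty.
interface AdapterInfoLike {
  vendor?: string;
  architecture?: string;
  device?: string;
  description?: string;
}

// Returns true when any non-empty field contains a blocklisted name,
// compared case-insensitively.
function isBlockedAdapter(info: AdapterInfoLike, blocklist: string[]): boolean {
  const fields = [info.vendor, info.architecture, info.device, info.description]
    .filter((s): s is string => typeof s === "string" && s.length > 0)
    .map((s) => s.toLowerCase());
  return blocklist.some((name) =>
    fields.some((field) => field.includes(name.toLowerCase())),
  );
}
```

An adapter that reports nothing at all passes the check, which is the same behavior as the original snippet. Whether to treat "unknown" as allowed or blocked is a product decision.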

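Memory pressure: check adapter limits, then let the load fail

navigator.deviceMemory is too coarse for this, you’re right. WebGPU doesn’t expose free VRAM either, so there is no exact check. The closest signal is the adapter’s limits. Note that maxBufferSize is the largest single buffer the adapter will allocate, not the amount of memory available, so we only use it as a cheap pre-filter to skip obviously weak devices:

const adapter = await navigator.gpu.requestAdapter();
// Largest single GPU buffer allowed, in bytes. This is NOT free VRAM.
const maxBufferSize = adapter?.limits.maxBufferSize ?? 0;
// Phi-3.5 mini q4f16 needs roughly 2.5GB VRAM
if (maxBufferSize < 2.5 * 1024 * 1024 * 1024) {
  // Don't even try, fall back to API
}

Treat the threshold as a heuristic to tune against your own device data. Passing it doesn’t guarantee the model fits, and it can reject GPUs that would have handled the load.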

We also wrap the model load in a try-catch. If WebGPU runs out of memory mid-load, the engine throws, and we gracefully degrade to the API with a toast notification.

Hybrid routing: simpler than you’d think

We built a basic client-side router that checks prompt complexity. It’s not ML-based, just heuristics:

function shouldRouteLocally(message: string): boolean {
  const wordCount = message.split(/\s+/).length;
  const complexIndicators = ['analyze', 'compare', 'explain why',
    'write a', 'debug this', 'review this code'];
  const isComplex = complexIndicators.some(i =>
    message.toLowerCase().includes(i)) || wordCount > 100;
  return !isComplex;
}

Simple stuff like “reformat this list” or “summarize in one sentence” goes local. Anything that needs deeper reasoning goes to the API. Users see a small chip in the chat bubble: “Local” or “Cloud”. We were worried about the quality gap being jarring, but users actually appreciate the instant responses from the local model for simple tasks.
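
One pitfall with plain substring matching: "rewrite a list" contains "write a", so a simple rewrite request gets sent to the cloud. Below is a sketch of a variant that anchors each phrase at a word boundary and also checks whether the local model is ready. routeMessage, Route, and the localReady flag are hypothetical names; the phrase list and the 100-word cutoff are the ones from the heuristic above.

```typescript
type Route = "local" | "cloud";

// Each phrase is anchored at a leading word boundary, so "rewrite a"
// no longer matches "write a" but "analyzes" still matches "analyze".
const COMPLEX_PATTERNS: RegExp[] = [
  "analyze",
  "compare",
  "explain why",
  "write a",
  "debug this",
  "review this code",
].map((phrase) => new RegExp(`\\b${phrase.replace(/\s+/g, "\\s+")}`, "i"));

function routeMessage(message: string, localReady: boolean, maxWords = 100): Route {
  // Model not loaded yet, or WebGPU unavailable: everything goes to the API.
  if (!localReady) return "cloud";
  // trim + filter so leading/trailing whitespace doesn't inflate the count
  const words = message.trim().split(/\s+/).filter(Boolean);
  const isComplex =
    words.length > maxWords || COMPLEX_PATTERNS.some((re) => re.test(message));
  return isComplex ? "cloud" : "local";
}
```

For example, routeMessage("Please rewrite a shorter version", true) stays local, while routeMessage("Write a poem about GPUs", true) goes to the cloud. The patterns have no global flag, so calling test repeatedly is safe.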

What broke that we didn’t expect:

  1. Safari WebGPU has subtle differences from Chrome’s. Test on both early.
  2. Some corporate VPN clients block Service Worker caching of large files. Had to add a fallback that stores chunks in IndexedDB.
  3. The model eats ~3GB of VRAM while active. If the user switches to a game or heavy app and comes back, Chrome may have dropped the GPU device (WebGPU reports this as a lost device), and the loaded model goes with it. You need to handle re-initialization gracefully.
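
The re-initialization in item 3 can be sketched as a wrapper around each local request: try once, reload the model once if that fails, and fall back to the API if it still fails. This is an illustration under assumptions, not our production code. Recovery and runWithRecovery are hypothetical names, and the three callbacks stand in for your own engine and API calls. It treats any thrown error as a possible lost device, since the app doesn't hold the GPU device directly.

```typescript
interface Recovery<T> {
  runLocal: () => Promise<T>; // e.g. a chat completion on the local engine
  reinitLocal: () => Promise<void>; // e.g. reload the model onto a fresh device
  runCloud: () => Promise<T>; // the existing API-backed path
}

type Via = "local" | "local-after-reinit" | "cloud";

async function runWithRecovery<T>(r: Recovery<T>): Promise<{ result: T; via: Via }> {
  try {
    return { result: await r.runLocal(), via: "local" };
  } catch {
    // Assume the GPU device may be gone; fall through to one re-init attempt.
  }
  try {
    await r.reinitLocal();
    return { result: await r.runLocal(), via: "local-after-reinit" };
  } catch {
    // Still failing: degrade to the API for this request.
    return { result: await r.runCloud(), via: "cloud" };
  }
}
```

The via field is what drives the “Local” or “Cloud” chip. Limiting recovery to a single re-init keeps a broken GPU from turning every message into a multi-second reload loop.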

Overall, it’s been worth the sprint. About 30% of our active users now have local inference enabled, and our API costs dropped noticeably for the simple query tier.