I’ve been building a React app with an AI chat assistant that calls Claude via our backend API. Works great, but we’re getting feature requests for offline support and a few enterprise customers are asking about on-premise options where their data never leaves the browser.
WebLLM (the MLC-AI project) looks really promising. Their benchmarks show Llama 3.1 8B quantized hitting ~41 tok/s on an M3 Max through WebGPU, and Phi 3.5 mini at 71 tok/s. That’s actually usable for a chat experience. And since it exposes an OpenAI-compatible API, theoretically I could swap my API client to point at the local engine with minimal code changes.
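To make the "minimal code changes" idea concrete, this is roughly the seam I'd put between the chat panel and whichever engine is behind it. It's a sketch I haven't run against WebLLM itself. `ChatBackend`, `OpenAIStyleEngine`, `fromOpenAIStyle`, and `withFallback` are names I made up, and the request/response shape is just the subset of the OpenAI chat-completions format I'm assuming the local engine honors.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// The only thing the chat panel would depend on (hypothetical interface).
interface ChatBackend {
  complete(messages: ChatMessage[]): Promise<string>;
}

// Structural subset of an OpenAI-style client. Assumption: the local engine
// exposes chat.completions.create with this request/response shape.
interface OpenAIStyleEngine {
  chat: {
    completions: {
      create(req: { messages: ChatMessage[] }): Promise<{
        choices: { message: { content: string | null } }[];
      }>;
    };
  };
}

// Adapts any OpenAI-shaped engine (local or remote) to ChatBackend.
function fromOpenAIStyle(engine: OpenAIStyleEngine): ChatBackend {
  return {
    async complete(messages: ChatMessage[]): Promise<string> {
      const res = await engine.chat.completions.create({ messages });
      // Treat a missing choice or null content as an empty reply.
      return res.choices[0]?.message.content ?? "";
    },
  };
}

// Tries the primary backend first and uses the fallback if it throws,
// e.g. local engine first, hosted API second.
function withFallback(primary: ChatBackend, fallback: ChatBackend): ChatBackend {
  return {
    async complete(messages: ChatMessage[]): Promise<string> {
      try {
        return await primary.complete(messages);
      } catch {
        return fallback.complete(messages);
      }
    },
  };
}
```

With this in place the panel would only ever call `backend.complete(messages)`, and the local-versus-API decision stays out of the component.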
But I have a bunch of practical questions before I invest a sprint into this:
- Model download UX: These quantized models are still 2-4GB. How do you handle the initial download in a web app without it feeling terrible? I’m thinking Service Worker + Cache API for persistence, but the first-run experience of “please wait while we download 3GB” seems rough. Has anyone built a good progressive loading pattern for this?
- WebGPU availability: Coverage is supposedly ~90% on desktop but only ~70-75% on mobile. Do you just feature-flag the whole thing and fall back to the API? Or is there a reasonable WASM-only fallback path that’s still fast enough?
- Memory pressure: If someone has 8GB of RAM and a bunch of tabs open, loading a 4-bit quantized 8B model is going to cause problems. Is there a reliable way to detect available memory before attempting to load the model? `navigator.deviceMemory` gives you a rough bucket but it’s not precise.
- Quality gap: The obvious tradeoff is that a local 8B model isn’t going to match Claude or GPT-5 for complex reasoning. How are you managing user expectations? Do you show a “running locally” indicator and adjust the UI to set different expectations?
- Hybrid routing: Ideally I’d use the local model for simple stuff (reformatting text, quick summaries, autocomplete) and route complex queries to the API. Has anyone built a client-side router that decides where to send each request?
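Here's the naive version of the gate and router I have in mind for the WebGPU, memory, and routing questions, mostly so people can tell me what's wrong with it. It's a sketch under assumptions: `probeCaps`, `chooseRoute`, the task kinds, and both thresholds are made up for illustration, and `navigator.deviceMemory` is only exposed in some browsers and is capped at 8, so the memory check can't tell an 8GB laptop from a 64GB workstation. The navigator is passed in through a small structural type so the code doesn't depend on WebGPU typings.

```typescript
type Route = "local" | "api" | "unavailable";
type TaskKind = "reformat" | "summarize" | "autocomplete" | "chat";

interface DeviceCaps {
  hasWebGPU: boolean;
  deviceMemoryGB: number | undefined; // undefined where the browser hides it
}

// Structural view of the bits of `navigator` this sketch reads.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<unknown> };
  deviceMemory?: number;
}

interface RouteRequest {
  kind: TaskKind;
  promptChars: number;
  online: boolean;
}

// Assumed thresholds, not measured values.
const MIN_DEVICE_MEMORY_GB = 8;
const MAX_LOCAL_PROMPT_CHARS = 4000;

// Tasks I'd trust a small local model with (assumption).
const LOCAL_FRIENDLY: Record<TaskKind, boolean> = {
  reformat: true,
  summarize: true,
  autocomplete: true,
  chat: false,
};

async function probeCaps(nav: NavigatorLike): Promise<DeviceCaps> {
  let hasWebGPU = false;
  if (nav.gpu) {
    try {
      // requestAdapter() resolves to null when no usable adapter exists.
      hasWebGPU = (await nav.gpu.requestAdapter()) != null;
    } catch {
      hasWebGPU = false;
    }
  }
  return { hasWebGPU, deviceMemoryGB: nav.deviceMemory };
}

function canRunLocal(caps: DeviceCaps): boolean {
  // Unknown memory is treated optimistically; flip this to be conservative.
  const memoryOk =
    caps.deviceMemoryGB === undefined ||
    caps.deviceMemoryGB >= MIN_DEVICE_MEMORY_GB;
  return caps.hasWebGPU && memoryOk;
}

function chooseRoute(req: RouteRequest, caps: DeviceCaps): Route {
  const local = canRunLocal(caps);
  const simple =
    LOCAL_FRIENDLY[req.kind] && req.promptChars <= MAX_LOCAL_PROMPT_CHARS;
  // Local handles simple tasks, plus everything when we're offline.
  if (local && (simple || !req.online)) return "local";
  if (req.online) return "api";
  return "unavailable";
}
```

In the app I'd call `probeCaps(navigator)` once when the chat panel mounts and pass the result to `chooseRoute` per request. The "unavailable" case is where the UI would have to say that offline mode isn't supported on this device.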
For context, our app is Next.js 15, and the AI features are in a chat panel component. I’d want the WebGPU inference to run in a Web Worker so it doesn’t block the main thread.
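For the worker side, the main-thread half I'm picturing is a small promise wrapper that tags each request with an id and matches replies coming back from the worker. This is a generic sketch, not WebLLM's API: the `GenerateRequest`/`GenerateResponse` message shapes, `WorkerLike`, and `InferenceWorkerClient` are all hypothetical. As far as I can tell WebLLM documents its own worker-backed engine, so in practice I'd probably use that rather than hand-rolling the protocol.

```typescript
// Hypothetical message protocol between the chat panel and the worker.
interface GenerateRequest {
  id: number;
  prompt: string;
}

interface GenerateResponse {
  id: number;
  text?: string;
  error?: string;
}

// Structural subset of a Worker, so the client can be exercised without a browser.
interface WorkerLike {
  postMessage(message: GenerateRequest): void;
  onmessage: ((event: { data: GenerateResponse }) => void) | null;
}

interface PendingCall {
  resolve: (text: string) => void;
  reject: (error: Error) => void;
}

class InferenceWorkerClient {
  private worker: WorkerLike;
  private nextId = 0;
  private pending = new Map<number, PendingCall>();

  constructor(worker: WorkerLike) {
    this.worker = worker;
    this.worker.onmessage = (event) => this.handle(event.data);
  }

  // Sends a prompt to the worker and resolves with the generated text.
  generate(prompt: string): Promise<string> {
    const id = this.nextId++;
    return new Promise<string>((resolve, reject) => {
      // Register before posting so a synchronous reply is not lost.
      this.pending.set(id, { resolve, reject });
      this.worker.postMessage({ id, prompt });
    });
  }

  // Matches a reply to its pending call by id.
  private handle(res: GenerateResponse): void {
    const call = this.pending.get(res.id);
    if (!call) return; // unknown or already-settled id
    this.pending.delete(res.id);
    if (res.error !== undefined) {
      call.reject(new Error(res.error));
    } else {
      call.resolve(res.text ?? "");
    }
  }
}
```

The real `Worker` would sit behind `WorkerLike` (possibly through a thin adapter to satisfy the types), with the worker script owning the model and replying with `{ id, text }` or `{ id, error }`. Streaming tokens would need a richer protocol than this one-reply-per-request version.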
Would love to hear from anyone who’s actually shipped this pattern, not just prototyped it. What broke that you didn’t expect?
Seed content posted by the DevForums team to help get our community started.