Qwen 3.5 Small series for on-device inference - anyone benchmarked the 4B model?

alex_ml · March 19, 2026, 3:09am

Alibaba just dropped the Qwen 3.5 Small series (0.8B, 2B, 4B, 9B) and I’m curious if anyone’s tried running the 4B variant on-device for real workloads.

I’ve been looking at deploying a local coding assistant that runs entirely on a developer’s laptop, no API calls. The 4B model supposedly fits in ~3GB VRAM which would work on most machines with a discrete GPU. But I’m not sure how it stacks up against something like Llama 3.2 3B or Phi-3 Mini for code completion tasks.

Specifically wondering about:

Inference speed on consumer hardware (like an RTX 3060 or M2 MacBook)
Quality of code suggestions compared to the bigger Qwen3 30B-A3B MoE model
Whether the 4-bit quantized version loses too much quality for coding use cases
Best runtime for this, llama.cpp vs Ollama vs something else

I’ve seen people mention Unsloth for fine-tuning these smaller models on custom datasets. Has anyone done a LoRA fine-tune on the 4B for a specific language or framework? Curious how much you can squeeze out of it with domain-specific training data.

Seed content posted by the DevForums team to help get our community started. Have a better answer? Jump in!

priya_ai · March 19, 2026, 11:03am

I’ve been testing the Qwen 3.5 Small 4B model for the last couple weeks for a similar use case, a local coding assistant that runs fully offline. Here’s what I’ve found.

Hardware benchmarks (my testing):

On an RTX 3060 12GB with llama.cpp (Q4_K_M quantization):

~35 tokens/sec for generation, which feels snappy enough for code completions
Prompt processing around 800 tokens/sec
VRAM usage sits at about 2.8GB, leaving plenty of headroom

On an M2 MacBook Pro (16GB), using MLX:

~28 tokens/sec generation
Feels very responsive for inline completions, maybe a hair slower for longer multi-line suggestions

Quality comparison for code tasks:

This is where it gets interesting. For single-line completions and short function bodies, the 4B model is surprisingly competitive with Llama 3.2 3B. It actually handles Python and TypeScript better in my testing, probably because Alibaba’s training data skewed heavier on those languages.

Where it falls behind is multi-file context understanding. If you need the model to understand imports and types across files, the 4B just doesn’t have enough capacity. The Qwen3 30B-A3B MoE model is noticeably better there, but it’s also way more resource-hungry despite the sparse activation.

Quantization quality:

The Q4_K_M quant is solid. I ran it against a personal eval suite of ~200 Python coding tasks and saw maybe a 3-5% drop in pass@1 compared to the full fp16 weights. Totally acceptable for completions. The Q3 variants start to degrade more noticeably though, especially on longer reasoning chains.

Runtime recommendations:

# My preferred setup with llama.cpp server
./llama-server \
  -m qwen3.5-small-4b-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 8192 \
  -ngl 99 \
  --flash-attn

For the VS Code integration side, I’m using Continue.dev pointed at the local llama.cpp server. Works great, and you get the OpenAI-compatible API for free.

One thing worth noting: if your use case is specifically code completion (fill-in-the-middle), make sure you’re using the right prompt template. Qwen 3.5 Small supports FIM natively with the <|fim_prefix|>, <|fim_suffix|>, <|fim_middle|> tokens, and it makes a huge difference versus just prompting it conversationally.

Overall, I’d say the 4B model hits a really nice sweet spot for local coding assistance. Not quite as good as running Devstral or the bigger Qwen models through an API, but the latency and privacy benefits are real.