I’ve been testing the Qwen 3.5 Small 4B model for the last couple weeks for a similar use case, a local coding assistant that runs fully offline. Here’s what I’ve found.
Hardware benchmarks (my testing):
On an RTX 3060 12GB with llama.cpp (Q4_K_M quantization):
- ~35 tokens/sec for generation, which feels snappy enough for code completions
- Prompt processing around 800 tokens/sec
- VRAM usage sits at about 2.8GB, leaving plenty of headroom
On an M2 MacBook Pro (16GB), using MLX:
- ~28 tokens/sec generation
- Feels very responsive for inline completions, maybe a hair slower for longer multi-line suggestions
Quality comparison for code tasks:
This is where it gets interesting. For single-line completions and short function bodies, the 4B model is surprisingly competitive with Llama 3.2 3B. It actually handles Python and TypeScript better in my testing, probably because Alibaba’s training data skewed heavier on those languages.
Where it falls behind is multi-file context understanding. If you need the model to understand imports and types across files, the 4B just doesn’t have enough capacity. The Qwen3 30B-A3B MoE model is noticeably better there, but it’s also way more resource-hungry despite the sparse activation.
Quantization quality:
The Q4_K_M quant is solid. I ran it against a personal eval suite of ~200 Python coding tasks and saw maybe a 3-5% drop in pass@1 compared to the full fp16 weights. Totally acceptable for completions. The Q3 variants start to degrade more noticeably though, especially on longer reasoning chains.
Runtime recommendations:
# My preferred setup with llama.cpp server
./llama-server \
-m qwen3.5-small-4b-q4_k_m.gguf \
--host 0.0.0.0 --port 8080 \
-c 8192 \
-ngl 99 \
--flash-attn
For the VS Code integration side, I’m using Continue.dev pointed at the local llama.cpp server. Works great, and you get the OpenAI-compatible API for free.
One thing worth noting: if your use case is specifically code completion (fill-in-the-middle), make sure you’re using the right prompt template. Qwen 3.5 Small supports FIM natively with the <|fim_prefix|>, <|fim_suffix|>, <|fim_middle|> tokens, and it makes a huge difference versus just prompting it conversationally.
Overall, I’d say the 4B model hits a really nice sweet spot for local coding assistance. Not quite as good as running Devstral or the bigger Qwen models through an API, but the latency and privacy benefits are real.