Qwen3.5-35B-A3B: 35 Billion Parameters, 3 Billion Active
Alibaba’s Qwen team released Qwen3.5-35B-A3B, a mixture-of-experts language model containing 35 billion total parameters but routing each token through only 3 billion active parameters. The architecture uses 256 experts with 8 routed plus 1 shared expert activated per token, each with an intermediate dimension of 512. Native context length sits at 262,144 tokens, extensible to roughly one million.
The model pairs sparse MoE layers with Gated DeltaNet linear attention blocks in a repeating pattern: every four sublayers, three use linear attention and one uses standard gated attention with grouped-query heads (16 query, 2 KV). This hybrid layout reduces the memory and compute overhead associated with full quadratic attention while retaining its benefits at regular intervals.
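The routing scheme described above (8 experts picked per token from 256, plus one always-on shared expert) can be sketched in a few lines. This is a minimal illustration of top-k softmax gating, not Qwen's actual router: the real gate is a learned linear layer over the hidden state, and the random logits here are stand-ins.

```python
# Minimal top-k MoE routing sketch: pick the 8 highest-scoring experts
# out of 256 and renormalize their gate weights with a softmax.
# The shared expert is applied unconditionally alongside the routed ones.
import math
import random

NUM_EXPERTS = 256
TOP_K = 8

def route(token_logits):
    """Return (expert_index, gate_weight) pairs for the top-k experts."""
    topk = sorted(range(NUM_EXPERTS),
                  key=lambda i: token_logits[i], reverse=True)[:TOP_K]
    exps = [math.exp(token_logits[i]) for i in topk]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(topk, exps)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(logits)  # 8 experts, gate weights summing to 1
```

Because only the selected experts' FFNs execute for a given token, per-token FLOPs scale with the 8+1 active experts rather than all 256.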
Architecture Details
| Component | Specification |
|---|---|
| Total Parameters | 35B |
| Active Parameters | ~3B |
| Hidden Dimension | 2048 |
| Layers | 40 |
| Expert Count | 256 (8 routed + 1 shared active) |
| Expert FFN Dim | 512 |
| Attention (Gated) | 16 Q heads / 2 KV heads, dim 256 |
| Linear Attention (DeltaNet) | 32 V heads / 16 QK heads, dim 128 |
| Vocab Size | 248,320 |
| Context Window | 262K native, ~1M extended |
| Training | Multi-token prediction (MTP) |
The layer layout follows a 10 × (3 × DeltaNet-MoE → 1 × Attention-MoE) pattern, meaning 30 of 40 sublayers use linear attention. This is a significant departure from standard transformer stacks and directly targets inference throughput on memory-constrained hardware.
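The layout and the headline parameter counts can be sanity-checked with back-of-envelope arithmetic. The figures below use the table's dimensions; the SwiGLU assumption (three weight matrices per expert FFN) is mine, which is why the totals are approximate rather than exact.

```python
# Layer pattern and rough parameter accounting from the spec table.
HIDDEN = 2048
EXPERT_FFN = 512
NUM_EXPERTS = 256
ACTIVE_EXPERTS = 9      # 8 routed + 1 shared
LAYERS = 40

# 3 linear-attention sublayers followed by 1 gated-attention sublayer,
# repeated 10 times across the 40-layer stack.
pattern = (["deltanet"] * 3 + ["gated_attn"]) * (LAYERS // 4)
deltanet_layers = pattern.count("deltanet")  # 30 of 40

# Assumed SwiGLU expert FFN: gate, up, and down projections.
params_per_expert = 3 * HIDDEN * EXPERT_FFN                       # ~3.1M
total_expert_params = NUM_EXPERTS * LAYERS * params_per_expert    # ~32B
active_expert_params = ACTIVE_EXPERTS * LAYERS * params_per_expert  # ~1.1B
```

Under these assumptions the experts alone account for roughly 32B of the 35B total but only about 1.1B of the ~3B active; embeddings (248,320 × 2,048 ≈ 0.5B per matrix) and the attention projections plausibly make up the remainder.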
Benchmark Performance
Despite activating under 10% of its total parameters, the model posts numbers that track surprisingly close to far larger configurations. Selected results against relevant comparisons:
| Benchmark | Qwen3.5-35B-A3B | Qwen3.5-27B (dense) | Qwen3-235B-A22B | GPT-5-mini |
|---|---|---|---|---|
| MMLU-Pro | 85.3 | 86.1 | 84.4 | 83.7 |
| GPQA Diamond | 84.2 | 85.5 | 81.1 | 82.8 |
| SWE-bench Verified | 69.2 | 72.4 | – | 72.0 |
| LiveCodeBench v6 | 74.6 | 80.7 | 75.1 | 80.5 |
| IFEval | 91.9 | 95.0 | 87.8 | 93.9 |
| HLE w/ CoT | 22.4 | 24.3 | 18.2 | 19.4 |
| TAU2-Bench | 81.2 | 79.0 | 58.5 | 69.8 |
| MMMLU (multilingual) | 85.2 | 85.9 | 83.4 | 86.2 |
On TAU2-Bench (agent tasks), the 3B-active model actually outscores both its dense 27B sibling and GPT-5-mini. Coding benchmarks show the expected gap (roughly 3-6 points behind the dense 27B variant on SWE-bench and LiveCodeBench), but the model still posts 69.2% on SWE-bench Verified, which would have been state-of-the-art a year ago.
Vision Capabilities
Unlike previous small MoE releases, this model ships with an integrated vision encoder. Selected multimodal results:
| Benchmark | Qwen3.5-35B-A3B | Claude Sonnet 4.5 | GPT-5-mini |
|---|---|---|---|
| MMMU | 81.4 | 79.6 | 79.0 |
| MMMU-Pro | 75.1 | 68.4 | 67.3 |
| MathVision | 83.9 | 71.1 | 71.9 |
| OCRBench | 91.0 | 76.6 | 82.1 |
| OmniDocBench 1.5 | 89.3 | 85.8 | 77.0 |
The vision numbers are notable: the model outperforms both Claude Sonnet 4.5 and GPT-5-mini on document understanding and mathematical vision tasks while activating a fraction of their parameters per token.
Practical Implications
The 3B active parameter count puts this model's per-token compute cost in the range of dense models like Phi-3-mini or Llama-3.2-3B, while delivering performance that competes with 22-27B active-parameter models. Note that all 35B parameters must still be resident in memory (or offloaded) at inference time, so the weight footprint is that of a 35B model. For teams running local inference on consumer GPUs (RTX 4070-class and above, typically with quantization or CPU offload), this is still a meaningful density improvement.
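A quick weight-memory estimate makes the deployment trade-off concrete. This assumes the stated 35B total parameters all stay resident and ignores KV cache and activation memory; the quantization bit-widths are illustrative, not claims about supported formats.

```python
# Rough resident weight memory for a 35B-parameter model at
# common precisions. KV cache and activations are excluded.
TOTAL_PARAMS = 35e9

def weight_gib(bits_per_param: float) -> float:
    """Weight memory in GiB at a given bits-per-parameter."""
    return TOTAL_PARAMS * bits_per_param / 8 / 2**30

fp16 = weight_gib(16)  # ~65 GiB: needs multi-GPU or heavy offload
int8 = weight_gib(8)   # ~33 GiB: two 24 GB cards, or one 48 GB card
int4 = weight_gib(4)   # ~16 GiB: fits a single 24 GB consumer GPU
```

At 4-bit quantization the weights fit comfortably on a 24 GB card, which is where the low active-parameter count starts paying off: per-token compute stays near dense-3B levels even though the full expert pool is loaded.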
The model is compatible with Hugging Face Transformers, vLLM, SGLang, and KTransformers. Alibaba also offers a hosted version under the name “Qwen3.5-Flash” through their Model Studio API, which includes a 1M default context window and built-in tool use.
Language coverage spans 201 languages and dialects, which matters for deployment scenarios outside the English-centric defaults most models optimize for.
The key question for infrastructure teams: does the MoE expert-loading overhead on consumer hardware negate the active-parameter savings? Real-world throughput benchmarks on specific GPU configurations will determine whether this model delivers on the efficiency promise or whether the 256-expert routing adds enough latency to close the gap with dense alternatives.
References
- Qwen3.5-35B-A3B Model Card – Hugging Face
- Qwen3.5 Blog Post – Qwen Official Blog
- Alibaba Cloud Model Studio – Hosted API