Qwen3.5-35B-A3B: 35 Billion Parameters, 3 Billion Active
Alibaba’s Qwen team released Qwen3.5-35B-A3B, a mixture-of-experts language model containing 35 billion total parameters but routing each token through only 3 billion active parameters. The architecture uses 256 experts with 8 routed plus 1 shared expert activated per token, each with an intermediate dimension of 512. Native context length sits at 262,144 tokens, extensible to roughly one million.
The model pairs sparse MoE layers with Gated DeltaNet linear attention blocks in a repeating pattern: every four sublayers, three use linear attention and one uses standard gated attention with grouped-query heads (16 query, 2 KV). This hybrid layout reduces the memory and compute overhead associated with full quadratic attention while retaining its benefits at regular intervals.
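The routing scheme described above (8 experts picked per token from 256, plus one always-on shared expert) can be sketched in a few lines. This is a minimal illustration of top-k softmax gating, not Qwen's actual router: the real gate is a learned linear layer over the hidden state, and the random logits here are stand-ins.

```python
# Minimal top-k MoE routing sketch: pick the 8 highest-scoring experts
# out of 256 and renormalize their gate weights with a softmax.
# The shared expert is applied unconditionally alongside the routed ones.
import math
import random

NUM_EXPERTS = 256
TOP_K = 8

def route(token_logits):
    """Return (expert_index, gate_weight) pairs for the top-k experts."""
    topk = sorted(range(NUM_EXPERTS),
                  key=lambda i: token_logits[i], reverse=True)[:TOP_K]
    exps = [math.exp(token_logits[i]) for i in topk]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(topk, exps)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(logits)  # 8 experts, gate weights summing to 1
```

Because only the selected experts' FFNs execute for a given token, per-token FLOPs scale with the 8+1 active experts rather than all 256.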
Architecture Details
| Component | Specification |
|---|---|
| Total Parameters | 35B |
| Active Parameters | ~3B |
| Hidden Dimension | 2048 |
| Layers | 40 |
| Expert Count | 256 (8 routed + 1 shared active) |
| Expert FFN Dim | 512 |
| Attention (Gated) | 16 Q heads / 2 KV heads, dim 256 |
| Linear Attention (DeltaNet) | 32 V heads / 16 QK heads, dim 128 |
| Vocab Size | 248,320 |
| Context Window | 262K native, ~1M extended |
| Training | Multi-token prediction (MTP) |
The layer layout follows a 10 × (3 × DeltaNet-MoE → 1 × Attention-MoE) pattern, meaning 30 of 40 sublayers use linear attention. This is a significant departure from standard transformer stacks and directly targets inference throughput on memory-constrained hardware.
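The layout and the headline parameter counts can be sanity-checked with back-of-envelope arithmetic. The figures below use the table's dimensions; the SwiGLU assumption (three weight matrices per expert FFN) is mine, which is why the totals are approximate rather than exact.

```python
# Layer pattern and rough parameter accounting from the spec table.
HIDDEN = 2048
EXPERT_FFN = 512
NUM_EXPERTS = 256
ACTIVE_EXPERTS = 9      # 8 routed + 1 shared
LAYERS = 40

# 3 linear-attention sublayers followed by 1 gated-attention sublayer,
# repeated 10 times across the 40-layer stack.
pattern = (["deltanet"] * 3 + ["gated_attn"]) * (LAYERS // 4)
deltanet_layers = pattern.count("deltanet")  # 30 of 40

# Assumed SwiGLU expert FFN: gate, up, and down projections.
params_per_expert = 3 * HIDDEN * EXPERT_FFN                       # ~3.1M
total_expert_params = NUM_EXPERTS * LAYERS * params_per_expert    # ~32B
active_expert_params = ACTIVE_EXPERTS * LAYERS * params_per_expert  # ~1.1B
```

Under these assumptions the experts alone account for roughly 32B of the 35B total but only about 1.1B of the ~3B active; embeddings (248,320 × 2,048 ≈ 0.5B per matrix) and the attention projections plausibly make up the remainder.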
Benchmark Performance
Despite activating under 10% of its total parameters, the model posts numbers that track surprisingly close to far larger configurations. Selected results against relevant comparisons:
| Benchmark | Qwen3.5-35B-A3B | Qwen3.5-27B (dense) | Qwen3-235B-A22B | GPT-5-mini |
|---|---|---|---|---|
| MMLU-Pro | 85.3 | 86.1 | 84.4 | 83.7 |
| GPQA Diamond | 84.2 | 85.5 | 81.1 | 82.8 |
| SWE-bench Verified | 69.2 | 72.4 | – | 72.0 |
| LiveCodeBench v6 | 74.6 | 80.7 | 75.1 | 80.5 |
| IFEval | 91.9 | 95.0 | 87.8 | 93.9 |
| HLE w/ CoT | 22.4 | 24.3 | 18.2 | 19.4 |
| TAU2-Bench | 81.2 | 79.0 | 58.5 | 69.8 |
| MMMLU (multilingual) | 85.2 | 85.9 | 83.4 | 86.2 |
On TAU2-Bench (agent tasks), the 3B-active model actually outscores both its dense 27B sibling and GPT-5-mini. Coding benchmarks show the expected gap (roughly 3-6 points behind the dense 27B variant on SWE-bench and LiveCodeBench), but the model still posts 69.2% on SWE-bench Verified, which would have been state-of-the-art a year ago.
Vision Capabilities
Unlike previous small MoE releases, this model ships with an integrated vision encoder. Selected multimodal results:
| Benchmark | Qwen3.5-35B-A3B | Claude Sonnet 4.5 | GPT-5-mini |
|---|---|---|---|
| MMMU | 81.4 | 79.6 | 79.0 |
| MMMU-Pro | 75.1 | 68.4 | 67.3 |
| MathVision | 83.9 | 71.1 | 71.9 |
| OCRBench | 91.0 | 76.6 | 82.1 |
| OmniDocBench 1.5 | 89.3 | 85.8 | 77.0 |
The vision numbers are notable: the model outperforms both Claude Sonnet 4.5 and GPT-5-mini on document understanding and mathematical vision tasks while activating a fraction of their parameters per token.
Practical Implications
The 3B active parameter count puts this model's per-token compute cost in the range of dense models like Phi-3-mini or Llama-3.2-3B, while delivering performance that competes with 22-27B active-parameter models. Note that all 35B parameters must still be resident in memory (or offloaded) at inference time, so the weight footprint is that of a 35B model. For teams running local inference on consumer GPUs (RTX 4070-class and above, typically with quantization or CPU offload), this is still a meaningful density improvement.
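A quick weight-memory estimate makes the deployment trade-off concrete. This assumes the stated 35B total parameters all stay resident and ignores KV cache and activation memory; the quantization bit-widths are illustrative, not claims about supported formats.

```python
# Rough resident weight memory for a 35B-parameter model at
# common precisions. KV cache and activations are excluded.
TOTAL_PARAMS = 35e9

def weight_gib(bits_per_param: float) -> float:
    """Weight memory in GiB at a given bits-per-parameter."""
    return TOTAL_PARAMS * bits_per_param / 8 / 2**30

fp16 = weight_gib(16)  # ~65 GiB: needs multi-GPU or heavy offload
int8 = weight_gib(8)   # ~33 GiB: two 24 GB cards, or one 48 GB card
int4 = weight_gib(4)   # ~16 GiB: fits a single 24 GB consumer GPU
```

At 4-bit quantization the weights fit comfortably on a 24 GB card, which is where the low active-parameter count starts paying off: per-token compute stays near dense-3B levels even though the full expert pool is loaded.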
The model is compatible with Hugging Face Transformers, vLLM, SGLang, and KTransformers. Alibaba also offers a hosted version under the name “Qwen3.5-Flash” through their Model Studio API, which includes a 1M default context window and built-in tool use.
Language coverage spans 201 languages and dialects, which matters for deployment scenarios outside the English-centric defaults most models optimize for.
The key question for infrastructure teams: does the MoE expert-loading overhead on consumer hardware negate the active-parameter savings? Real-world throughput benchmarks on specific GPU configurations will determine whether this model delivers on the efficiency promise or whether the 256-expert routing adds enough latency to close the gap with dense alternatives.
References
- Qwen3.5-35B-A3B Model Card – Hugging Face
- Qwen3.5 Blog Post – Qwen Official Blog
- Alibaba Cloud Model Studio – Hosted API