Qwen
Open Foundation Models for Language Intelligence
Dense & Mixture-of-Experts models for language, vision, audio, code, math, and reasoning — from 0.6B to 235B parameters.
The Qwen Family
A comprehensive family of open foundation models spanning language, vision, audio, code, math, and reasoning.
Mixture of Experts
Activate only a fraction of parameters per token — achieving large-model quality at small-model cost.
How It Works
Each MoE layer holds many expert FFN sub-networks. A learned gating router selects the top-K experts for each token, so different experts can specialize in different kinds of knowledge while per-token compute stays constant regardless of the model's total parameter count.
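The routing step above can be sketched in a few lines. This is a minimal toy illustration, not Qwen's actual implementation: the sizes, the single-matrix "experts", and the gating weights are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_experts, top_k = 8, 16, 2   # hypothetical sizes, not a real Qwen config

# A real expert is a full FFN; a single linear map stands in for brevity.
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
W_gate = rng.standard_normal((d, n_experts)) / np.sqrt(d)

def moe_layer(x):
    """Route one token vector x through its top-K experts."""
    logits = x @ W_gate
    top = np.argsort(logits)[-top_k:]        # indices of the K highest gate scores
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                 # softmax over the selected experts only
    # Only K experts execute: compute scales with K, not with n_experts.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_layer(rng.standard_normal(d))
```

Only 2 of the 16 experts run per token here, which is the whole trick: total parameters grow with `n_experts`, active compute grows with `top_k`.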
Active vs Total Parameters
Architecture Evolution
Linear Attention & Gated DeltaNet
Qwen 3.5 introduces a hybrid architecture that replaces 75% of standard attention layers with Gated DeltaNet — a linear attention mechanism with error-correcting memory updates and adaptive gating.
Standard Softmax Attention
Full attention matrix grows quadratically with sequence length
Gated DeltaNet (Linear Attention)
Fixed-size state matrix — constant memory per token at decode time
The Delta Rule — Error-Correcting Memory
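The delta rule updates a fixed-size state S by first reading out what the memory currently stores under key k, then writing the *difference* between the new value and that prediction: S_t = α_t S_{t−1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ, where α_t is the forget gate that makes DeltaNet "gated". A toy sketch under those definitions (sizes and values are illustrative only):

```python
import numpy as np

d = 4  # toy head dimension

def gated_delta_step(S, k, v, beta, alpha):
    """One Gated DeltaNet update on the d×d state S.

    alpha: forget gate in (0, 1] that decays the whole state.
    beta:  write strength; the (v - S k) term *corrects* whatever
           is currently stored under key k instead of adding to it.
    Algebraically identical to alpha*S @ (I - beta*k kᵀ) + beta*v kᵀ.
    """
    S = alpha * S
    pred = S @ k                         # value the memory currently returns for k
    return S + beta * np.outer(v - pred, k)

S = np.zeros((d, d))
k  = np.array([1.0, 0.0, 0.0, 0.0])      # unit-norm key
v1 = np.array([0.0, 1.0, 0.0, 0.0])
v2 = np.array([0.0, 0.0, 1.0, 0.0])

S = gated_delta_step(S, k, v1, beta=1.0, alpha=1.0)   # store   k -> v1
S = gated_delta_step(S, k, v2, beta=1.0, alpha=1.0)   # rewrite k -> v2
retrieved = S @ k                                      # returns v2 exactly
```

With a plain additive update the second write would return v1 + v2; the delta rule cleanly overwrites the old association, which is the "error-correcting" behavior the heading refers to.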
Qwen 3.5 Hybrid Layer Pattern
32 layers total: 8 repeating blocks of [3× Gated DeltaNet + 1× Softmax Attention]. Full attention layers provide global context; linear layers give O(1)-per-token decoding.
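The stated 32-layer pattern can be written out directly; a minimal sketch of the layout described above:

```python
# 8 repeating blocks of [3x Gated DeltaNet + 1x softmax attention] = 32 layers.
BLOCK = ["gated_deltanet"] * 3 + ["softmax_attention"]
layers = BLOCK * 8

# Every 4th layer (0-indexed 3, 7, 11, ...) is full softmax attention,
# providing the periodic global-context mixing; the other 24 layers decode
# with O(1) memory per token.
full_attn_idx = [i for i, kind in enumerate(layers) if kind == "softmax_attention"]
```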
Constant Decode Memory
State matrix is d×d per head — memory doesn't grow with sequence length, unlike KV cache.
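A back-of-envelope comparison makes the claim concrete. The head count and head dimension below are assumptions for illustration, not Qwen's real configuration; fp16 storage (2 bytes per value) is also assumed.

```python
BYTES = 2                      # fp16
n_heads, d_head = 16, 128      # hypothetical sizes, not a real Qwen config

def kv_cache_bytes(n_tokens):
    # Softmax attention caches a K and a V vector for every past token, per head:
    # grows linearly with sequence length.
    return 2 * n_tokens * n_heads * d_head * BYTES

def linear_state_bytes():
    # Linear attention keeps one d_head x d_head state matrix per head:
    # independent of sequence length.
    return n_heads * d_head * d_head * BYTES

kv_at_1m = kv_cache_bytes(1_000_000)   # ~8.2 GB for a single layer at 1M tokens
state    = linear_state_bytes()        # ~0.5 MB, at any sequence length
```

Under these assumptions one softmax layer's KV cache at 1M tokens is roughly four orders of magnitude larger than the linear layer's fixed state, which is why the hybrid design keeps only a quarter of the layers as full attention.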
Error-Correcting Updates
The delta rule corrects past associations instead of just accumulating, giving sharper memory.
1M+ Context
With YaRN extension, the native 256K context window scales past 1 million tokens efficiently.
Model Evolution
From the original Qwen to the latest Qwen3 — each generation brings architectural innovations, more data, and broader capabilities.
Qwen 3.6
Apr 2026
Latest generation — VL-capable dense & MoE models with Gated DeltaNet linear attention.
Qwen 3.5
Feb – Mar 2026
Hybrid Gated DeltaNet architecture — 75% linear attention for O(1) decode memory, 201 languages.
Qwen 3
Apr – Dec 2025
Hybrid thinking, 36T tokens, 119 languages, massive MoE scaling, and multimodal expansions.
Hybrid thinking modes, 36T tokens, 119 languages.
128 pure routed experts, no shared experts, global-batch balancing.
Largest open-weight MoE, competitive with top proprietary models.
Agentic coding MoE trained on 7.5T tokens (70% code), 256K context.
Qwen 3-Next
First Qwen MoE with hybrid Gated DeltaNet linear attention.
End-to-end omni-modal: text, images, audio, video, and streaming speech output.
Qwen 2.5
Late 2024 – Early 2025
Flagship LLMs, coder/math specialists, VL, omni-modal, and reasoning models.
18T tokens, 7 sizes, structured output, 128K context.
5.5T code tokens, 6 sizes, multi-language code specialist.
Tool-integrated reasoning for competition-level math.
Dynamic resolution v2, long video, and agentic vision.
Reasoning model with extended thinking, o1-competitive.
Qwen 2
Mid 2024
GQA architecture, 57B MoE variant, and expanded multimodal family.
GQA, SwiGLU, RoPE — 5 dense sizes plus a 57B MoE.
Scaled MoE with shared + routed experts.
Dynamic resolution, video understanding, and visual agents.
Instruction-following audio understanding and voice chat.
Qwen 1.5
Early 2024
8 size tiers, 29 languages, improved alignment, and the first MoE model.
Improved alignment and 29-language support across 8 sizes.
First Qwen MoE — 7B-level quality at 25% compute.
Qwen
2023
The foundation — bilingual language, vision, and audio models.
First Qwen vision-language model with image + text understanding.