Qwen

Open Foundation Models for Language Intelligence

Dense & Mixture-of-Experts models for language, vision, audio, code, math, and reasoning — from 0.6B to 235B parameters.

7
Generations
25+
Model Families
119
Languages
36T
Training Tokens

The Qwen Family

A comprehensive family of open foundation models spanning language, vision, audio, code, math, and reasoning.

7
Generations
26
Model Families
75
Model Variants
7
MoE Models

Mixture of Experts

Activate only a fraction of parameters per token — achieving large-model quality at small-model cost.

[Diagram: input token → gating router → experts 1–8 → weighted sum → output]

How It Works

Each MoE layer holds many expert FFN sub-networks. A learned gating router picks the top-K experts per token so the model can specialize different experts for different knowledge — while keeping compute constant regardless of total size.
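The routing step above can be sketched in a few lines. This is a minimal toy illustration, assuming a generic linear router and tanh-FFN experts — the names, shapes, and gating details are illustrative, not Qwen's actual implementation:

```python
import numpy as np

def moe_forward(x, experts, gate_w, top_k):
    """Route one token vector x through the top_k highest-scoring experts."""
    logits = x @ gate_w                        # router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the chosen experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    # Only the chosen experts execute, so per-token compute stays fixed
    # no matter how many experts exist in total.
    return sum(w * experts[i](x) for i, w in zip(top, weights))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
experts = [
    (lambda W: (lambda x: np.tanh(x @ W)))(rng.standard_normal((d, d)) / np.sqrt(d))
    for _ in range(n_experts)
]
gate_w = rng.standard_normal((d, n_experts))
y = moe_forward(rng.standard_normal(d), experts, gate_w, top_k=2)  # only 2 of 16 experts run
```

Note the compute argument: with top-2 of 16 experts, the FFN cost per token is that of 2 experts, while the parameter count is that of all 16.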

Active vs Total Parameters

Qwen 1.5-MoE — 2.7B active / 14.3B total (19%)
Qwen 2-57B-A14B — 14B active / 57.4B total (24%)
Qwen 3-30B-A3B — 3.3B active / 30.5B total (11%)
Qwen 3-235B-A22B — 22B active / 235B total (9%)

Architecture Evolution

Mar 2024
Qwen 1.5-MoE
2.7B active of 14.3B
64 experts · top-4 · 4 shared
Shared + Routed experts
Jun 2024
Qwen 2-57B-A14B
14B active of 57.4B
64 experts · top-8 · 8 shared
Shared + Routed experts
Apr 2025
Qwen 3-30B-A3B
3.3B active of 30.5B
128 experts · top-8
Pure routed, global-batch balancing
Apr 2025
Qwen 3-235B-A22B
22B active of 235B
128 experts · top-8
Pure routed, global-batch balancing

Linear Attention & Gated DeltaNet

Qwen 3.5 introduces a hybrid architecture that replaces 75% of standard attention layers with Gated DeltaNet — a linear attention mechanism with error-correcting memory updates and adaptive gating.

Standard Softmax Attention

[Diagram: Q, K, V → n × n attention matrix QKᵀ → output; cost O(n² · d)]

Full attention matrix grows quadratically with sequence length

Gated DeltaNet (Linear Attention)

[Diagram: Q, K, V → fixed d × d state S (delta update + gate) → output; cost O(n · d²)]

Fixed-size state matrix — constant memory per token at decode time
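The memory difference can be made concrete with a back-of-the-envelope count of floats stored per attention head at decode time (the head dimension here is a generic example, not a specific Qwen config):

```python
# Per-head decode memory: a softmax-attention KV cache stores n keys and
# n values of dimension d, while a linear-attention layer keeps a single
# d×d state matrix regardless of sequence length.
def kv_cache_entries(n, d):
    return 2 * n * d          # grows linearly with context length n

def linear_state_entries(d):
    return d * d              # constant in n

d = 128
for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  KV cache: {kv_cache_entries(n, d):>11,}  state: {linear_state_entries(d):,}")
```

At a million tokens the KV cache holds 256M entries per head while the linear state still holds 16,384 — the crossover in favor of the fixed state happens as soon as n exceeds d/2.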

The Delta Rule — Error-Correcting Memory

prediction = Sₜ₋₁ · k
error = v − prediction
Sₜ = Sₜ₋₁ + β · error · kᵀ
S
State memory
β
Learning rate
Gate
Adaptive decay
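One update step under the equations above can be sketched as follows. This is a simplified scalar-gated variant for illustration — in the actual model both the gate and β are input-dependent, and the published recurrence differs in detail:

```python
import numpy as np

def gated_delta_step(S, k, v, beta=0.5, gate=0.9):
    """One simplified gated delta-rule write into the d×d state S."""
    prediction = S @ k                   # what the memory currently recalls for key k
    error = v - prediction               # delta rule: write the correction, not the raw value
    S = gate * S                         # adaptive decay of old associations
    return S + beta * np.outer(error, k)

d = 4
S = np.zeros((d, d))
k = np.array([1.0, 0.0, 0.0, 0.0])       # unit-norm key
v = np.array([0.0, 2.0, 0.0, 0.0])       # target value
for _ in range(30):
    S = gated_delta_step(S, k, v)
recall = S @ k                            # repeated corrections drive recall toward v
```

Because each step writes the *error* rather than the raw value, repeated writes to the same key converge instead of piling up — the "sharper memory" the delta rule provides over plain accumulation.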

Qwen 3.5 Hybrid Layer Pattern

32 layers total: 8 repeating blocks of [3× Gated DeltaNet + 1× Softmax Attention]. Full attention layers provide global context; linear layers give O(1)-per-token decoding.

[Layer stack: 8 × (GDN · GDN · GDN · Attn)]
Gated DeltaNet (75%) · Softmax Attention (25%)
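The stated 3:1 ratio works out as follows — a trivial sketch of the repeating pattern, with placeholder layer names rather than Qwen's actual module names:

```python
# 8 repeating blocks of [3× Gated DeltaNet + 1× softmax attention]
layers = ["GDN", "GDN", "GDN", "Attn"] * 8

print(len(layers))                          # 32 layers total
print(layers.count("GDN") / len(layers))    # 0.75 -> 75% linear attention
```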
O(1)
per token

Constant Decode Memory

State matrix is d×d per head — memory doesn't grow with sequence length, unlike KV cache.

Δ
delta rule

Error-Correcting Updates

The delta rule corrects past associations instead of just accumulating, giving sharper memory.

1M+
tokens

1M+ Context

Native 256K context extends past 1 million tokens efficiently via YaRN scaling.

Model Evolution

From the original Qwen to the latest Qwen3 — each generation brings architectural innovations, more data, and broader capabilities.

3.6

Qwen 3.6

Apr 2026

Latest generation — VL-capable dense & MoE models with Gated DeltaNet linear attention.

Qwen 3.6

Gated DeltaNet
Apr 2026

Dense & MoE with Gated DeltaNet, native vision-language, 262K context.

Gated DeltaNet (3:1) · Native VL input · 262K→1M+ context · MTP support
Qwen3.6-27B (27B)
Qwen3.6-35B-A3B (35B, 3B active)
3.5

Qwen 3.5

Feb – Mar 2026

Hybrid Gated DeltaNet architecture — 75% linear attention for O(1) decode memory, 201 languages.

Qwen 3.5

Gated DeltaNet
Feb 2026

Gated DeltaNet + Transformer hybrid, 256K native, 1M+ extended, 201 languages.

75% Gated DeltaNet · 256K→1M+ context · 201 languages · Multi-token prediction
Qwen3.5-0.8B (0.8B)
Qwen3.5-2B (2B)
Qwen3.5-4B (4B)
3

Qwen 3

Apr – Dec 2025

Hybrid thinking, 36T tokens, 119 languages, massive MoE scaling, and multimodal expansions.

Qwen 3

Apr 2025

Hybrid thinking modes, 36T tokens, 119 languages.

Hybrid thinking · 36T tokens · 119 languages · 6 dense + 2 MoE
Qwen3-0.6B (0.6B)
Qwen3-1.7B (1.7B)
Qwen3-4B (4B)

Qwen 3-30B-A3B

MoE
Apr 2025

128 pure routed experts, no shared experts, global-batch balancing.

128 experts, top-8 · No shared experts · Global-batch balancing
30.5B
Total
3.3B
Active
128
Experts
8
Activated
Qwen3-30B-A3B (30.5B, 3.3B active)
MMLU: 81.5 · HumanEval: 74.4

Qwen 3-235B-A22B

MoE
Apr 2025

Largest open-weight MoE, competitive with top proprietary models.

Largest open MoE · 128 experts, top-8 · Hybrid thinking
235B
Total
22B
Active
128
Experts
8
Activated
Qwen3-235B-A22B (235B, 22B active)
MMLU: 87.8 · HumanEval: 82.9 · MATH: 85.7

Qwen 3-Coder

MoE
Jul 2025

Agentic coding MoE trained on 7.5T tokens (70% code), 256K context.

7.5T tokens (70% code) · 256K context · Repo-scale reasoning
Qwen3-Coder-480B-A35B (480B, 35B active)
Qwen3-Coder-30B-A3B (30B, 3B active)

Qwen 3-Next

MoE · Gated DeltaNet
Sep 2025

First Qwen model with hybrid Gated DeltaNet linear attention.

Gated DeltaNet (3:1 ratio) · Ultra-sparse MoE · Multi-token prediction
Qwen3-Next-80B-A3B (80B, 3B active)

Qwen 3-Omni

MoE
Sep 2025

End-to-end omni-modal: text, images, audio, video, and streaming speech output.

All-modal I/O · Real-time speech · 119 text + 19 speech languages
Qwen3-Omni-30B-A3B (30B, 3B active)

Qwen 3-VL

Sep 2025

Vision-language with native 256K interleaved context, dense + MoE variants.

256K interleaved context · Dense + MoE variants · Thinking mode
Qwen3-VL-2B (2B)
Qwen3-VL-8B (8B)
Qwen3-VL-32B (32B)
2.5

Qwen 2.5

Late 2024 – Early 2025

Flagship LLMs, coder/math specialists, VL, omni-modal, and reasoning models.

Qwen 2.5

Sep 2024

18T tokens, 7 sizes, structured output, 128K context.

18T tokens · 0.5B–72B · 29+ languages · 128K context
Qwen2.5-0.5B (0.5B)
MMLU: 47.5
Qwen2.5-1.5B (1.5B)
MMLU: 60.9
Qwen2.5-3B (3B)
MMLU: 65.6

Qwen 2.5-Coder

Nov 2024

5.5T code tokens, 6 sizes, multi-language code specialist.

5.5T code tokens · 0.5B–32B · Code gen & repair
Qwen2.5-Coder-1.5B (1.5B)
HumanEval: 43.9
Qwen2.5-Coder-7B (7B)
HumanEval: 61.6
Qwen2.5-Coder-32B (32B)
HumanEval: 65.9

Qwen 2.5-Math

Sep 2024

Tool-integrated reasoning for competition-level math.

Tool-integrated reasoning · CoT & TIR modes · Reward models
Qwen2.5-Math-1.5B (1.5B)
MATH: 55.2
Qwen2.5-Math-7B (7B)
MATH: 75.5
Qwen2.5-Math-72B (72B)
MATH: 83.1

Qwen 2.5-VL

Jan 2025

Dynamic resolution v2, long video, and agentic vision.

Dynamic resolution v2 · Long video · Visual grounding
Qwen2.5-VL-3B (3B)
Qwen2.5-VL-7B (7B)
Qwen2.5-VL-32B (32B)

QwQ

Mar 2025

Reasoning model with extended thinking, o1-competitive.

Deep reasoning · Extended thinking · Strong math & code
QwQ-32B (32B)
AIME24: 79.5 · LiveCodeBench: 63.4

Qwen 2.5-Omni

Mar 2025

End-to-end text, image, audio, and video input/output.

Text+image+audio+video in · Audio output · Real-time streaming
Qwen2.5-Omni-3B (3B)
Qwen2.5-Omni-7B (7B)
2

Qwen 2

Mid 2024

GQA architecture, 57B MoE variant, and expanded multimodal family.

Qwen 2

Jun 2024

GQA, SwiGLU, RoPE — 5 dense sizes plus a 57B MoE.

GQA architecture · Up to 128K context · 29+ languages
Qwen2-0.5B (0.5B)
MMLU: 45.4
Qwen2-1.5B (1.5B)
MMLU: 56.5
Qwen2-7B (7B)
MMLU: 70.3 · HumanEval: 51.2

Qwen 2-57B-A14B

MoE
Jun 2024

Scaled MoE with shared + routed experts.

64 routed + 8 shared · Top-8 routing · ~25% compute of 72B
57.4B
Total
14B
Active
64
Experts
8
Activated
Qwen2-57B-A14B (57.4B, 14B active)
MMLU: 75.5 · HumanEval: 53

Qwen 2-VL

Aug 2024

Dynamic resolution, video understanding, and visual agents.

Dynamic resolution · Video understanding · Multilingual OCR
Qwen2-VL-2B (2B)
Qwen2-VL-7B (7B)
Qwen2-VL-72B (72B)

Qwen 2-Audio

Jul 2024

Instruction-following audio understanding and voice chat.

Voice interaction · Audio analysis · Instruction following
Qwen2-Audio-7B (7B)
Qwen2-Audio-7B-Instruct (7B)

Qwen 2-Math

Aug 2024

Math-specialized with chain-of-thought reasoning.

Chain-of-thought · Tool-integrated reasoning · Competition math
Qwen2-Math-1.5B (1.5B)
Qwen2-Math-7B (7B)
GSM8K: 89.9 · MATH: 58.1
Qwen2-Math-72B (72B)
GSM8K: 93.2 · MATH: 69.9
1.5

Qwen 1.5

Early 2024

8 size tiers, 29 languages, improved alignment, and the first MoE model.

Qwen 1.5

Feb 2024

Improved alignment and 29-language support across 8 sizes.

29 languages · 0.5B–110B · RLHF alignment · 32K context
Qwen1.5-0.5B (0.5B)
MMLU: 39.2
Qwen1.5-1.8B (1.8B)
MMLU: 46.8
Qwen1.5-4B (4B)
MMLU: 56.1

Qwen 1.5-MoE-A2.7B

MoE
Mar 2024

First Qwen MoE — 7B-level quality at 25% compute.

64 experts (4 shared + 60 routed) · Top-4 routing · ~25% compute
14.3B
Total
2.7B
Active
64
Experts
4
Activated
Qwen1.5-MoE-A2.7B (14.3B, 2.7B active)
MMLU: 62.5 · HumanEval: 34.8

CodeQwen 1.5

Apr 2024

Code specialist trained on 3T tokens across 92 languages.

92 programming languages · 3T code tokens · 128K context
CodeQwen1.5-7B (7B)
HumanEval: 51.8 · MBPP: 61.8

Qwen

2023

The foundation — bilingual language, vision, and audio models.

Qwen

Aug 2023

Bilingual (Chinese & English) LLMs up to 72B.

Bilingual · Up to 72B · Coding & math
Qwen-7B (7B)
MMLU: 58.2 · HumanEval: 29.9 · GSM8K: 51.7
Qwen-14B (14B)
MMLU: 66.3 · HumanEval: 32.3 · GSM8K: 61.3
Qwen-72B (72B)
MMLU: 77.4 · HumanEval: 35.4 · GSM8K: 78.9

Qwen-VL

Aug 2023

First vision-language model with image + text understanding.

Image understanding · Visual grounding · Multi-image
Qwen-VL (~9.6B)
Qwen-VL-Chat (~9.6B)

Qwen-Audio

Nov 2023

Speech, music, and environmental sound understanding.

Audio understanding · Speech recognition · Sound analysis
Qwen-Audio (~8B)
Qwen-Audio-Chat (~8B)