Qwen
Open Foundation Models for Language Intelligence
Dense & Mixture-of-Experts models for language, vision, audio, code, math, and reasoning — from 0.6B to 235B parameters.
The Qwen Family
A comprehensive family of open foundation models spanning language, vision, audio, code, math, and reasoning.
Mixture of Experts
Activate only a fraction of parameters per token — achieving large-model quality at small-model cost.
How It Works
Each MoE layer holds many expert FFN sub-networks. A learned gating router selects the top-K experts for each token, so different experts can specialize in different kinds of knowledge while per-token compute stays constant regardless of the model's total parameter count.
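The routing step above can be sketched in a few lines. This is a minimal toy illustration, not Qwen's actual implementation: the sizes, the single-matrix "experts", and the gating weights are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_experts, top_k = 8, 16, 2   # hypothetical sizes, not a real Qwen config

# A real expert is a full FFN; a single linear map stands in for brevity.
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
W_gate = rng.standard_normal((d, n_experts)) / np.sqrt(d)

def moe_layer(x):
    """Route one token vector x through its top-K experts."""
    logits = x @ W_gate
    top = np.argsort(logits)[-top_k:]        # indices of the K highest gate scores
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                 # softmax over the selected experts only
    # Only K experts execute: compute scales with K, not with n_experts.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_layer(rng.standard_normal(d))
```

Only 2 of the 16 experts run per token here, which is the whole trick: total parameters grow with `n_experts`, active compute grows with `top_k`.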
Active vs Total Parameters
Architecture Evolution
Linear Attention & Gated DeltaNet
Qwen 3.5 introduces a hybrid architecture that replaces 75% of standard attention layers with Gated DeltaNet — a linear attention mechanism with error-correcting memory updates and adaptive gating.
Standard Softmax Attention
Full attention matrix grows quadratically with sequence length
Gated DeltaNet (Linear Attention)
Fixed-size state matrix — constant memory per token at decode time
The Delta Rule — Error-Correcting Memory
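The delta rule updates a fixed-size state S by first reading out what the memory currently stores under key k, then writing the *difference* between the new value and that prediction: S_t = α_t S_{t−1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ, where α_t is the forget gate that makes DeltaNet "gated". A toy sketch under those definitions (sizes and values are illustrative only):

```python
import numpy as np

d = 4  # toy head dimension

def gated_delta_step(S, k, v, beta, alpha):
    """One Gated DeltaNet update on the d×d state S.

    alpha: forget gate in (0, 1] that decays the whole state.
    beta:  write strength; the (v - S k) term *corrects* whatever
           is currently stored under key k instead of adding to it.
    Algebraically identical to alpha*S @ (I - beta*k kᵀ) + beta*v kᵀ.
    """
    S = alpha * S
    pred = S @ k                         # value the memory currently returns for k
    return S + beta * np.outer(v - pred, k)

S = np.zeros((d, d))
k  = np.array([1.0, 0.0, 0.0, 0.0])      # unit-norm key
v1 = np.array([0.0, 1.0, 0.0, 0.0])
v2 = np.array([0.0, 0.0, 1.0, 0.0])

S = gated_delta_step(S, k, v1, beta=1.0, alpha=1.0)   # store   k -> v1
S = gated_delta_step(S, k, v2, beta=1.0, alpha=1.0)   # rewrite k -> v2
retrieved = S @ k                                      # returns v2 exactly
```

With a plain additive update the second write would return v1 + v2; the delta rule cleanly overwrites the old association, which is the "error-correcting" behavior the heading refers to.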
Qwen 3.5 Hybrid Layer Pattern
32 layers total: 8 repeating blocks of [3× Gated DeltaNet + 1× Softmax Attention]. Full attention layers provide global context; linear layers give O(1)-per-token decoding.
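The stated 32-layer pattern can be written out directly; a minimal sketch of the layout described above:

```python
# 8 repeating blocks of [3x Gated DeltaNet + 1x softmax attention] = 32 layers.
BLOCK = ["gated_deltanet"] * 3 + ["softmax_attention"]
layers = BLOCK * 8

# Every 4th layer (0-indexed 3, 7, 11, ...) is full softmax attention,
# providing the periodic global-context mixing; the other 24 layers decode
# with O(1) memory per token.
full_attn_idx = [i for i, kind in enumerate(layers) if kind == "softmax_attention"]
```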
Constant Decode Memory
State matrix is d×d per head — memory doesn't grow with sequence length, unlike KV cache.
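A back-of-envelope comparison makes the claim concrete. The head count and head dimension below are assumptions for illustration, not Qwen's real configuration; fp16 storage (2 bytes per value) is also assumed.

```python
BYTES = 2                      # fp16
n_heads, d_head = 16, 128      # hypothetical sizes, not a real Qwen config

def kv_cache_bytes(n_tokens):
    # Softmax attention caches a K and a V vector for every past token, per head:
    # grows linearly with sequence length.
    return 2 * n_tokens * n_heads * d_head * BYTES

def linear_state_bytes():
    # Linear attention keeps one d_head x d_head state matrix per head:
    # independent of sequence length.
    return n_heads * d_head * d_head * BYTES

kv_at_1m = kv_cache_bytes(1_000_000)   # ~8.2 GB for a single layer at 1M tokens
state    = linear_state_bytes()        # ~0.5 MB, at any sequence length
```

Under these assumptions one softmax layer's KV cache at 1M tokens is roughly four orders of magnitude larger than the linear layer's fixed state, which is why the hybrid design keeps only a quarter of the layers as full attention.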
Error-Correcting Updates
The delta rule corrects past associations instead of just accumulating, giving sharper memory.
1M+ Context
With YaRN extension, the native 256K context window scales past 1 million tokens efficiently.
Model Evolution
From the original Qwen to the latest Qwen3 — each generation brings architectural innovations, more data, and broader capabilities.
Qwen 3.6
Apr 2026
Latest generation — VL-capable dense & MoE models with Gated DeltaNet linear attention.
Qwen 3.5
Feb – Mar 2026
Hybrid Gated DeltaNet architecture — 75% linear attention for O(1) decode memory, 201 languages.
Qwen 3
Apr – Dec 2025
Hybrid thinking, 36T tokens, 119 languages, massive MoE scaling, and multimodal expansions.
Hybrid thinking modes, 36T tokens, 119 languages.
128 pure routed experts, no shared experts, global-batch balancing.
Largest open-weight MoE, competitive with top proprietary models.
Agentic coding MoE trained on 7.5T tokens (70% code), 256K context.
Qwen 3-Next
First Qwen MoE with hybrid Gated DeltaNet linear attention.
End-to-end omni-modal: text, images, audio, video, and streaming speech output.
Qwen 2.5
Late 2024 – Early 2025
Flagship LLMs, coder/math specialists, VL, omni-modal, and reasoning models.
18T tokens, 7 sizes, structured output, 128K context.
5.5T code tokens, 6 sizes, multi-language code specialist.
Tool-integrated reasoning for competition-level math.
Dynamic resolution v2, long video, and agentic vision.
Reasoning model with extended thinking, o1-competitive.
Qwen 2
Mid 2024
GQA architecture, 57B MoE variant, and expanded multimodal family.
GQA, SwiGLU, RoPE — 5 dense sizes plus a 57B MoE.
Scaled MoE with shared + routed experts.
Dynamic resolution, video understanding, and visual agents.
Instruction-following audio understanding and voice chat.
Qwen 1.5
Early 2024
8 size tiers, 29 languages, improved alignment, and the first MoE model.
Improved alignment and 29-language support across 8 sizes.
First Qwen MoE — 7B-level quality at 25% compute.
Qwen
2023
The foundation — bilingual language, vision, and audio models.
First Qwen vision-language model with image + text understanding.