Qwen3

Posted on 2025-06-24, updated on 2025-06-25, filed under Papers

Qwen3 Technical Report^[1]

The authors are the Qwen Team. Reference [1]: Yang, An et al. “Qwen3 Technical Report.” (2025).

Time

May 2025

### Key Words

Summary

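Qwen3 is a family of LLMs covering both dense and MoE architectures, with parameter counts from 0.6B to 235B. A key innovation in Qwen3 is the integration of a thinking mode (multi-step reasoning) and a non-thinking mode (rapid, context-driven responses) into a single framework. Qwen3 also introduces a thinking budget mechanism that lets users allocate computational resources flexibly at inference time, trading latency against performance. In addition, by leveraging the knowledge of the flagship models, the computational resources needed to build the smaller models are reduced substantially.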

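Pre-training uses close to 36 trillion tokens, a large dataset curated to ensure linguistic and domain diversity. To expand the training data efficiently, the authors take a multi-modal approach: Qwen2.5-VL is fine-tuned to extract text from a large number of PDF documents. They also use domain-specific models to generate synthetic data: Qwen2.5-Math for mathematical content and Qwen2.5-Coder for code-related data. Pre-training follows a three-stage strategy. In the first stage, the model is trained on about 30 trillion tokens to build a foundation of general knowledge. In the second stage, it is trained further on knowledge-intensive data, such as science, technology, and coding, to strengthen reasoning. In the third stage, it is trained on long-context data, raising its maximum context length from 4,096 to 32,768 tokens.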

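To better align the foundation models with human preferences and downstream applications, the authors use a multi-stage post-training approach that supports both the thinking (reasoning) and non-thinking modes. The first two stages develop strong reasoning ability through chain-of-thought (CoT) cold-start fine-tuning and reinforcement learning focused on mathematics and coding tasks. In the last two stages, data with and without reasoning are combined into a unified dataset for further fine-tuning, so that the model handles both kinds of input effectively, and general-domain reinforcement learning is then applied to improve performance across a range of downstream tasks. For the smaller models, the authors use strong-to-weak distillation, with both off-policy and on-policy knowledge transfer from the larger models. According to the report, distilling from advanced teacher models significantly outperforms reinforcement learning in both performance and training efficiency.

The Qwen3 series has six dense models (Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B, Qwen3-14B, and Qwen3-32B) and two MoE models, Qwen3-30B-A3B and Qwen3-235B-A22B (the suffix gives the number of activated parameters, e.g. 22B). The dense architecture is similar to Qwen2.5: it uses Grouped Query Attention, SwiGLU, Rotary Positional Embeddings, and RMSNorm with pre-normalization. In addition, the QKV-bias used in Qwen2 is removed and QK-Norm is introduced into the attention mechanism to keep Qwen3 training stable.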
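
To illustrate why QK-Norm can help training stability, here is a minimal sketch in plain Python. It assumes that QK-Norm means an RMSNorm applied to each query and key vector over the head dimension before the dot product. The names `rms_norm` and `attention_logit` are illustrative and not taken from the report, and RoPE, GQA, and multi-head handling are left out.

```python
import math

def rms_norm(vec, weight, eps=1e-6):
    """RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale * w for x, w in zip(vec, weight)]

def attention_logit(q, k, q_weight=None, k_weight=None, use_qk_norm=True):
    """Scaled dot-product logit for one (query, key) pair of a single head."""
    head_dim = len(q)
    if use_qk_norm:
        # Assumption: unit gain unless learned weights are passed in.
        q = rms_norm(q, q_weight or [1.0] * head_dim)
        k = rms_norm(k, k_weight or [1.0] * head_dim)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim)

q = [3.0, -4.0, 0.0, 0.0]
k = [300.0, -400.0, 0.0, 0.0]  # same direction as q, 100x larger
print(attention_logit(q, k, use_qk_norm=False))  # 1250.0
print(attention_logit(q, k))                     # close to 2.0
```

With unit weights, each normalized vector has an RMS close to 1, so the scaled logit is bounded by roughly `sqrt(head_dim)` no matter how large the raw activations grow. That bound is one plausible reading of how QK-Norm keeps the softmax inputs well behaved.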

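The Qwen3 MoE models share the same basic architecture as the dense models. They have 128 experts, of which 8 are activated per token. Unlike Qwen2.5-MoE, the Qwen3-MoE design has no shared experts. It also adopts a global-batch load balancing loss to encourage expert specialization. The report credits these architectural and training changes with substantial gains on downstream tasks.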
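
The routing described above can be sketched as follows. This is a rough illustration, not the report's implementation: it assumes a softmax router with top-k selection and gate renormalization, and it uses the common Switch-Transformer-style auxiliary loss (number of experts times the sum over experts of routed fraction times mean router probability). All function names are hypothetical. "Global-batch" is modeled only by passing in statistics for every token of the whole batch rather than one micro-batch.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Return ({expert_index: gate_weight}, full_probabilities) for one token."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    gates = {i: probs[i] / norm for i in top}  # renormalize over chosen experts
    return gates, probs

def load_balancing_loss(all_probs, all_gates, num_experts, top_k):
    """Auxiliary loss over all tokens in the batch; equals 1.0 when balanced."""
    num_tokens = len(all_probs)
    counts = [0] * num_experts
    for gates in all_gates:
        for expert in gates:
            counts[expert] += 1
    # Fraction of routing slots that went to each expert.
    frac = [c / (num_tokens * top_k) for c in counts]
    # Mean router probability assigned to each expert.
    mean_prob = [sum(p[e] for p in all_probs) / num_tokens for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Qwen3-MoE sizes from the text: 128 experts, 8 active per token.
gates, _ = route_token([float(i % 7) for i in range(128)], top_k=8)
print(len(gates), round(sum(gates.values()), 6))
```

In this formulation the loss is minimized when tokens spread evenly across experts, which is the behavior a load balancing term is meant to encourage.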

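Qwen3 uses Qwen's tokenizer, which implements byte-level byte-pair encoding.

Post-training in Qwen3 has two core objectives: 1. Thinking control: integrating two distinct modes, non-thinking and thinking, and giving control over the depth of thinking. 2. Strong-to-weak distillation: streamlining and optimizing post-training for lightweight models by leveraging knowledge from large-scale models, which reduces the computational cost and development effort of building smaller-scale models.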
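- Long-CoT Cold Start: The authors build a comprehensive dataset in which every problem is paired with a verified reference answer or code-based test cases; this dataset is the foundation of the long-CoT cold-start stage. Its construction involves a rigorous two-phase filtering process: query filtering and response filtering. In query filtering, Qwen2.5-72B-Instruct is used to identify and remove queries that are not easily verifiable, including those with multiple sub-questions or those that ask for general text generation. Queries that Qwen2.5-72B-Instruct can answer correctly without CoT reasoning are also excluded, which keeps the model from relying on superficial guessing and ensures that only complex problems requiring deeper reasoning remain. The authors also use Qwen2.5-72B-Instruct to annotate each query's domain so that domain representation in the dataset stays balanced.
  After reserving a validation query set, the authors generate N candidate responses for each remaining query with QwQ-32B. When QwQ-32B fails to produce a correct solution, human annotators assess the accuracy of the responses. After careful selection and refinement, the resulting dataset is used for the initial cold start of reasoning patterns. The goal of this stage is to instill foundational reasoning patterns in the model without overemphasizing immediate reasoning performance. This keeps the model's potential unconstrained and leaves more flexibility and room for improvement in the subsequent RL stage.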
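- Reasoning RL: The query-verifier pairs used in the RL stage must satisfy four criteria: 1. they were not used during the cold-start stage; 2. they are learnable for the cold-start model; 3. they are as challenging as possible; 4. they cover a broad range of sub-domains. In total, 3,995 query-verifier pairs are collected, and GRPO is used to update the model parameters.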
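- Thinking Mode Fusion: This stage integrates non-thinking capabilities into the previously developed thinking model, which lets developers manage and control reasoning behavior while reducing the cost and complexity of deploying separate models. To achieve this, the authors perform continual SFT on the Reasoning RL model and design a chat template that fuses the two modes. They also find that models able to handle both modes perform consistently well under different thinking budgets.
  - Construction of SFT data: The SFT dataset contains both thinking and non-thinking data. To ensure that the performance of the Stage 2 model is not compromised by the additional SFT, the thinking data is generated by rejection sampling on Stage 1 queries. The non-thinking data is carefully curated to cover a diverse range of tasks.
  - Chat Template Design: The authors introduce /think and /no_think flags in the user query to select the mode.
  - Thinking Budget: The model can also handle intermediate cases, generating a response from incomplete thinking. Concretely, when the model's thinking reaches a user-defined threshold, the thinking is cut off and the model answers from the reasoning accumulated so far.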
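
The mode flags and the thinking budget from the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the flag is assumed to sit at the end of the user query, the reasoning is assumed to be wrapped in `<think>` and `</think>` tags, and the wording of `STOP_THINKING_NOTE` is a placeholder rather than the instruction the authors actually use. The helper names are hypothetical.

```python
THINK_FLAG = "/think"
NO_THINK_FLAG = "/no_think"

# Placeholder text; the real stop-thinking instruction is not given in this post.
STOP_THINKING_NOTE = "I have to answer now based on the thinking so far."

def resolve_mode(user_query, default="think"):
    """Return (mode, query_without_flag) from a trailing /think or /no_think flag."""
    text = user_query.rstrip()
    for flag, mode in ((NO_THINK_FLAG, "no_think"), (THINK_FLAG, "think")):
        if text.endswith(flag):
            return mode, text[: -len(flag)].rstrip()
    return default, text

def apply_thinking_budget(thinking_tokens, budget):
    """Keep at most `budget` reasoning tokens; report whether a cut happened."""
    if len(thinking_tokens) <= budget:
        return list(thinking_tokens), False
    return list(thinking_tokens[:budget]), True

def build_thinking_block(thinking_tokens, budget):
    """Assemble the (possibly truncated) thinking section that precedes the answer."""
    kept, was_cut = apply_thinking_budget(thinking_tokens, budget)
    parts = ["<think>", " ".join(kept)]
    if was_cut:
        parts.append(STOP_THINKING_NOTE)
    parts.append("</think>")
    return "\n".join(parts)

print(resolve_mode("Solve 2+2 /no_think"))
print(build_thinking_block(["first", "second", "third"], budget=2))
```

In a real deployment the cut would be applied during decoding, with the model continuing generation after the closing tag; the sketch only shows the bookkeeping.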
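
General RL: This stage aims to broadly improve the model's capabilities and stability across diverse scenarios. To support this, the authors build a sophisticated reward system covering more than 20 distinct tasks, which mainly target the following core capabilities:
- Instruction Following: This ensures that the model accurately interprets and follows user instructions and produces responses that match the user's expectations.
- Format Following: Beyond explicit instructions, the model is expected to adhere to specific formatting conventions.
- Abilities for Specialized Scenarios: For more specialized scenarios, the authors design tasks tailored to the specific context. In RAG, for example, they introduce reward signals that guide the model toward accurate and contextually appropriate responses.

To provide feedback for these tasks, the authors use three types of rewards:
- Rule-based Reward: Rule-based rewards are widely used in the Reasoning RL stage and are also useful for general tasks such as instruction following. Well-designed rule-based rewards can assess the correctness of model outputs with high precision and prevent issues like reward hacking.
- Model-based Reward with Reference Answer: The authors provide a reference answer for each query and prompt Qwen2.5-72B-Instruct to score the model's response against it. This handles diverse tasks flexibly without requiring strict formatting and avoids the false negatives that purely rule-based rewards can produce.
- Model-based Reward without Reference Answer: Using human preference data, the authors train a reward model that assigns scalar scores to model responses. Because it does not depend on a reference answer, this approach can handle a broader range of queries while effectively improving the model's engagement.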
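
As a small illustration of the first reward type, the sketch below scores a response by exact match against a reference answer. It assumes, purely for the example, that the final answer is written as `\boxed{...}`; the report's actual rules are not described in this post, and both function names are hypothetical.

```python
import re

def extract_boxed_answer(response):
    """Return the content of the first \\boxed{...} in the response, or None."""
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    return match.group(1).strip() if match else None

def rule_based_reward(response, reference):
    """1.0 if the boxed answer equals the reference exactly, else 0.0."""
    answer = extract_boxed_answer(response)
    if answer is None:
        return 0.0  # format violation: no boxed answer found
    return 1.0 if answer == reference.strip() else 0.0

print(rule_based_reward("So the result is \\boxed{42}.", "42"))  # 1.0
print(rule_based_reward("The answer is 42.", "42"))              # 0.0
print(rule_based_reward("\\boxed{1/2}", "0.5"))                  # 0.0
```

The last call shows the kind of false negative a strict rule produces: `1/2` and `0.5` are the same value but fail an exact-match check, which is the gap a model-based reward with a reference answer is meant to close.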
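
Strong-to-Weak Distillation: This pipeline is designed to optimize the lightweight models, which comprise five dense models and one MoE model. It improves model performance while effectively imparting robust mode-switching capabilities.
- Off-policy Distillation: In the initial phase, the authors combine teacher-model outputs generated in both /think and /no_think modes for response distillation. This gives the student model basic reasoning skills and the ability to switch between thinking modes.
- On-policy Distillation: In this phase, the student model generates on-policy sequences for fine-tuning. Specifically, prompts are sampled and the student produces responses in either /think or /no_think mode. The student is then fine-tuned by aligning its logits with those of the teacher model so as to minimize the KL divergence.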
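
The on-policy objective can be sketched as a per-token KL divergence between the student and teacher distributions, averaged over a sequence that the student itself generated. The direction of the KL (student relative to teacher) is an assumption here, since this post only says the KL divergence is minimized, and the function names are illustrative.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(student_logits, teacher_logits):
    """KL(student || teacher) over the vocabulary at one token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))

def distillation_loss(student_seq_logits, teacher_seq_logits):
    """Mean per-token KL over a student-generated (on-policy) sequence."""
    per_token = [
        kl_divergence(s, t)
        for s, t in zip(student_seq_logits, teacher_seq_logits)
    ]
    return sum(per_token) / len(per_token)

# Toy example: 2 token positions, vocabulary of 3.
student = [[2.0, 0.5, 0.1], [0.0, 1.0, 0.0]]
teacher = [[1.5, 0.7, 0.2], [0.0, 2.0, 0.0]]
print(distillation_loss(student, teacher))
```

The loss is zero only when the student's distribution matches the teacher's at every position, so gradient steps on it pull the student's logits toward the teacher's on exactly the sequences the student tends to produce.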

\(Fig.1^{[1]}\) Post-training pipeline of Qwen3 series models
