[Paper Review] EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes

Posted by Euisuk's Dev Log on August 29, 2025

https://arxiv.org/pdf/2507.11407

Introduction

One of the clearest trends in the LLM ecosystem is the rise of hybrid models that offer both "fast responses" and "deep reasoning" in a single model.

Here, "fast responses" refers to situations that need an immediate answer, such as everyday conversation or summarization, while "deep reasoning" refers to situations that need step-by-step thinking (Chain-of-Thought), such as hard math problems or coding tasks. Until recently, these two capabilities were usually shipped as separate models. OpenAI, for example, operates the general-purpose GPT-4o alongside the reasoning-focused o1/o3, and DeepSeek operates DeepSeek V3 (general-purpose) alongside DeepSeek R1 (reasoning). Users had to switch models depending on the task, and service providers carried the burden of deploying and managing two models at once.

Recently, major model families such as Qwen 3 and DeepSeek have been converging on merging the two modes into one, and LG AI Research's EXAONE 4.0 sits squarely in this trend.

LG AI Research has been developing its own foundation model series, EXAONE. The previous release, EXAONE 3.5, specialized in general-purpose Instruction Following (accurately following a wide range of user instructions), while EXAONE Deep specialized in reasoning performance on math and coding. The core of EXAONE 4.0 is that it merges the strengths of these two models into a single model, exposed as a NON-REASONING mode and a REASONING mode. Users pick NON-REASONING mode when they need a quick answer and REASONING mode when they need a complex problem worked through.

On top of this, EXAONE 4.0 adds Tool Use, aimed at the Agentic AI era. Tool Use is the ability of a model to autonomously use external tools (API calls, database queries, web search, and so on) to resolve a user's request. As LLMs move beyond plain text generation toward "Agents" that carry out real tasks, Tool Use is becoming an essential capability. Other changes include added Spanish support (extending the previous English/Korean coverage), pretraining scaled up substantially to 14T tokens, and a context length extended to 128K tokens.

The model comes in two sizes: a high-performance 32B (32 billion parameters) and an on-device 1.2B (1.2 billion parameters). The 32B model targets high-performance inference in server environments, and the 1.2B model targets local execution in constrained environments such as smartphones and edge devices. Both are released on Hugging Face for research purposes.

This post follows the flow of the EXAONE 4.0 Technical Report and works through it systematically, from the architectural design decisions to the post-training pipeline and the benchmark results.

Model Configurations: Key Architectural Changes

Hybrid Attention: Combining Global and Sliding Window Attention

The most visible architectural change in EXAONE 4.0 32B is the attention mechanism.

First, some background. Self-attention in a Transformer computes the relationship between every pair of tokens in the sequence. For an input of 1,000 tokens, for example, each token computes how it relates to each of the other 999. This is called Global Attention. It is powerful for understanding context, but its compute cost is $O(n^2)$ in the sequence length $n$: doubling the sequence length quadruples the work. For EXAONE 4.0, which has to handle up to 128K tokens (well over 100,000 characters of text), this cost becomes a practical bottleneck.

The efficient alternative is Sliding Window Attention (Local Attention). Instead of the whole sequence, attention is computed only within a fixed range (window) around each token. With a window size of 4K, each token relates only to tokens inside its 4K-token window (in a causal decoder, that means the most recent 4K tokens). The compute cost drops to $O(n \times w)$, where $w$ is the window size, but the limitation is that a token cannot directly see tokens that lie outside its window.
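As a rough illustration of the two attention patterns described above, here is a minimal sketch that builds both masks for a toy sequence and counts how many query-key pairs each one scores. The function names and toy sizes are made up for this example, and whether the window counts the current token is a convention that varies between implementations.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Global (full causal) attention: token i sees every token j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n: int, window: int) -> list[list[bool]]:
    """Causal sliding window: token i sees the last `window` tokens, itself included."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Number of (query, key) pairs that are actually scored."""
    return sum(sum(row) for row in mask)


n, w = 16, 4  # toy sizes; the post describes w = 4K on sequences up to 128K
print(attended_pairs(causal_mask(n)))             # 136, grows like n^2 / 2
print(attended_pairs(sliding_window_mask(n, w)))  # 58, grows like n * w
```

Growing `n` while holding `w` fixed makes the gap between the two counts widen quickly, which is the whole point of the local layers.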

EXAONE 3.5 used Global Attention in every layer. EXAONE 4.0 changes this and mixes Sliding Window Attention (local) with Global Attention at a 3:1 ratio. Of the 64 layers, 48 use Sliding Window Attention with a 4K window, and 16 use Global Attention over the full sequence.
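To get a feel for what the 3:1 mix buys at 128K tokens, the sketch below lays out 64 layers and compares the number of scored query-key pairs against an all-global stack. The exact position of the global layer inside each group of four is an assumption made here for illustration, and the count covers attention score pairs only, not total model FLOPs.

```python
from typing import Optional


def layer_types(num_layers: int = 64, local_per_global: int = 3) -> list[str]:
    """Assumed layout: every (local_per_global + 1)-th layer is global."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def scored_pairs(seq_len: int, window: Optional[int] = None) -> int:
    """Causal (query, key) pairs scored by one layer; window=None means global."""
    if window is None or window >= seq_len:
        return seq_len * (seq_len + 1) // 2
    # The first `window` tokens see a growing prefix, the rest see a full window.
    return window * (window + 1) // 2 + (seq_len - window) * window


layers = layer_types()
seq_len, window = 131_072, 4_096
hybrid = sum(scored_pairs(seq_len, window if kind == "local" else None) for kind in layers)
all_global = len(layers) * scored_pairs(seq_len)

print(layers.count("local"), layers.count("global"))  # 48 16
print(f"hybrid / all-global score pairs: {hybrid / all_global:.3f}")
```

Under these assumptions the hybrid stack scores a bit under a third of the pairs an all-global stack would, and almost all of what remains comes from the 16 global layers.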

The intuition behind this design is straightforward. In most layers it is enough to capture relationships between nearby tokens (local context), and the periodically placed Global Attention layers take on the job of integrating context across the whole sequence. As an analogy, when reading a long novel you spend most of your time focused on the current paragraph and only occasionally recall the overall plot to keep the big picture in view. Recent work such as Gemma 2/3 and Llama 4 also reports that long-context performance holds up when Global Attention is applied to only a minority of layers, and heterogeneous architectures such as Mamba have likewise found that a small amount of periodic Global Attention helps with global context.

Two design decisions here deserve attention.

First, the Global Attention layers do not use RoPE (Rotary Position Embedding). RoPE injects relative position information into the attention computation and is standard in most recent LLMs. EXAONE 4.0 deliberately removes it from the Global Attention layers. The reasoning is that RoPE gives the model a distance-dependent bias (a tendency to weight nearby tokens more heavily), whereas the job of Global Attention is to look across the whole sequence evenly, so these layers are designed to keep a truly global view without position-based bias. The Local Attention layers keep RoPE so they can capture positional relationships between neighboring tokens precisely.

Second, Sliding Window Attention was chosen over Chunked Attention for the local layers. Chunked Attention splits the sequence into fixed-size chunks and computes attention only within each chunk. It is efficient, but information is cut off at chunk boundaries. Sliding Window Attention moves the window one position per token, so it has no boundary discontinuity. EXAONE 4.0 cites both the theoretical stability of sliding windows and a practical reason, broad support in open-source frameworks (vLLM, TGI, and others), for this choice. Setting the window size to 4K, to minimize any negative effect on short-context performance, is another pragmatic call.
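The boundary problem is easy to see on a toy example. The sketch below lists which positions a given token can attend to under each scheme; the helper names are invented for this illustration and both schemes are assumed to be causal.

```python
def chunked_visible(i: int, chunk: int) -> list[int]:
    """Chunked attention: token i sees only earlier tokens in its own chunk."""
    start = (i // chunk) * chunk
    return list(range(start, i + 1))


def sliding_visible(i: int, window: int) -> list[int]:
    """Sliding window: token i sees the last `window` tokens, itself included."""
    return list(range(max(0, i - window + 1), i + 1))


# Token 8 is the first token of the third chunk when chunk size is 4.
print(chunked_visible(8, 4))  # [8]
print(sliding_visible(8, 4))  # [5, 6, 7, 8]
```

With chunking, the first token of every chunk starts with no local history at all, while the sliding window always carries the same amount of recent context.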

The 1.2B model, by contrast, is not hybrid and uses Global Attention throughout. A plausible reading is that with only 30 layers the efficiency gain from a hybrid layout is limited, and that for a small model it may matter more to spend its already limited capacity on capturing global context.

QK-Reorder-LN: Repositioning Layer Normalization

The second major architectural change is where LayerNorm is placed.

Understanding this change requires some background on normalization placement in Transformers. Layer Normalization (LayerNorm) normalizes each layer's output to stabilize training. The original Transformer paper ("Attention Is All You Need") used a Post-LN structure, applying normalization after the layer output. The Pre-LN structure (normalization applied to the layer input) came later, improved training stability considerably, and became the de facto standard in most recent LLMs.

However, the recent study "The Curse of Depth in Large Language Models" points out a structural problem with Pre-LN. As model depth increases, the variance of the output grows exponentially, and as a result the deeper layers stop contributing meaningfully to model performance. Intuitively, in a 64-layer model the outputs of the early layers stay at a reasonable scale, but the range of output values becomes extreme toward the later layers, so the last few dozen layers are effectively "dead layers": their parameters exist, but they make no meaningful contribution to the prediction.

EXAONE 4.0 adopts QK-Reorder-LN to address this. Concretely, inside the attention block it applies RMSNorm to the Query and Key vectors before the attention operation, and then applies RMSNorm once more to the attention output. RMSNorm (Root Mean Square Normalization) is a lighter variant of LayerNorm that skips the mean subtraction and normalizes by the RMS value alone, which makes it cheaper to compute. RMSNorm itself has been used since EXAONE 3.0 and is kept; what changes is where it is applied.
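The placement described above can be sketched for a single query and a single head. This is only an illustration under simplifying assumptions: the learned RMSNorm gain is omitted, there is no RoPE or masking, and `attention_qk_norm` is a name made up for this example rather than anything from the EXAONE code.

```python
import math


def rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """Divide by the root mean square; no mean subtraction, no learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def softmax(scores: list[float]) -> list[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_qk_norm(q, keys, values):
    """Normalize Q and K, attend, then normalize the attention output."""
    q = rms_norm(q)
    keys = [rms_norm(k) for k in keys]
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)
    out = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
    return rms_norm(out)


q = [1.0, 2.0, 3.0, 4.0]
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention_qk_norm(q, keys, values)
out_scaled = attention_qk_norm([1000.0 * v for v in q], keys, values)
print(out, out_scaled)
```

Because Q and K are normalized before the dot product, blowing up the scale of the query barely changes the result, which is one way to see how this placement keeps activation scale from compounding with depth.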

The trade-off is more computation than plain Pre-LN, since extra normalization operations are added inside the attention block. However, studies such as OLMoE and OLMo 2 report that QK-Normalization and similar approaches improve downstream task performance, and the EXAONE team also found experimentally that QK-Reorder-LN performs better than simply scaling the variance.

Model Spec Summary

| Item | 32B | 1.2B |
|---|---|---|
| $d_\text{model}$ (Hidden Dimension) | 5,120 | 2,048 |
| Layers | 64 | 30 |
| Normalization | QK-Reorder-LN | QK-Reorder-LN |
| Non-linearity | SwiGLU | SwiGLU |
| FFN Dimension | 27,392 | 4,096 |
| Attention Type | Hybrid (Local:Global = 3:1) | Global |
| Head Type / Heads / KV Heads | GQA / 40 / 8 | GQA / 32 / 8 |
| Head Size | 128 | 64 |
| Max Seq Length | 131,072 (128K) | 65,536 (64K) |
| RoPE $\theta$ | 1,000,000 | 1,000,000 |
| Tokenizer / Vocab | BBPE / 102,400 | BBPE / 102,400 |
| Tied Embedding | False | True |

Both models use GQA (Grouped Query Attention). GQA sets the number of Key and Value heads lower than the number of Query heads in multi-head attention, which saves KV cache memory while largely preserving quality. The 32B model uses 40 Query heads with 8 KV heads, and the 1.2B model uses 32 Query heads with 8 KV heads.
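A back-of-the-envelope calculation makes the memory effect concrete for the 32B configuration in the table. The sketch assumes 2-byte (bf16) cache entries, batch size 1, and that local layers only need to keep the last 4,096 tokens in cache; none of these serving details come from the report itself.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Keys plus values (factor 2) cached for `tokens` positions across `layers`."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


GIB = 1024 ** 3
seq_len, window = 131_072, 4_096

mha = kv_cache_bytes(seq_len, layers=64, kv_heads=40, head_dim=128)  # no KV sharing
gqa = kv_cache_bytes(seq_len, layers=64, kv_heads=8, head_dim=128)   # GQA, all global
hybrid = (kv_cache_bytes(seq_len, layers=16, kv_heads=8, head_dim=128)    # global layers
          + kv_cache_bytes(window, layers=48, kv_heads=8, head_dim=128))  # local layers

print(f"MHA, all global : {mha / GIB:.2f} GiB")     # 160.00 GiB
print(f"GQA, all global : {gqa / GIB:.2f} GiB")     # 32.00 GiB
print(f"GQA + 3:1 hybrid: {hybrid / GIB:.2f} GiB")  # 8.75 GiB
```

Under these assumptions GQA alone cuts the cache by 5x (40 heads down to 8), and the hybrid layout removes most of what is left.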

The tokenizer is BBPE (Byte-level Byte Pair Encoding), with a vocabulary of 102,400 entries split almost evenly between Korean and English tokens. For parameter efficiency, the 1.2B model uses Tied Word Embedding, in which the input embedding and the output projection share the same weights.

Pre-training: Scaling Data Volume and Quality Together

The EXAONE 4.0 32B model was pretrained on 14T (14 trillion) tokens, roughly twice the 6.5T used for EXAONE 3.5. The 1.2B model was also trained on 12T tokens, a very large amount of data for its size. The compute spent (FLOPs) is $2.69 \times 10^{24}$ for the 32B model and $8.65 \times 10^{22}$ for the 1.2B model. For reference, pretraining is the initial training stage in which the model learns language patterns, factual knowledge, and reasoning ability from large-scale text, as distinct from the fine-tuning that follows.
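The reported FLOPs can be sanity-checked with the common rule of thumb that training compute is about 6 × parameters × tokens. The rule and the nominal parameter counts (32e9 and 1.2e9) are assumptions brought in for this check; the report may count parameters or compute differently.

```python
def training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6.0 * params * tokens


runs = [
    ("32B", 32e9, 14e12, 2.69e24),
    ("1.2B", 1.2e9, 12e12, 8.65e22),
]
for name, params, tokens, reported in runs:
    estimate = training_flops(params, tokens)
    print(f"{name}: estimate {estimate:.3e}, reported {reported:.2e}, "
          f"ratio {estimate / reported:.3f}")
```

Both estimates land within about 1% of the reported figures, so the numbers in the post are at least internally consistent.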

The increase in data is not just a matter of volume. Strengthening World Knowledge was set as an explicit goal, and data from specialist domains such as STEM (Science, Technology, Engineering, Mathematics) was specially curated. "Curation" here means selecting and cleaning data against quality criteria rather than simply collecting it. Knowledge-oriented benchmarks such as MMLU-Redux do show a clear improvement.

In addition, a recent study ("Four Habits of Highly Effective STaRs") reports that reasoning performance is strongly influenced by the cognitive behaviors learned during pretraining. Cognitive behaviors are the thinking patterns a model shows when generating text, such as breaking a problem into steps, verifying assumptions, or exploring alternatives. These patterns are learned mainly from documents in the pretraining data (textbooks, academic papers, structured debates, and the like). Reflecting this, EXAONE 4.0 is said to apply strict data curation from the pretraining stage onward, composing the data with not only knowledge but also downstream reasoning performance in post-training in mind.

Context Length Extension: From 4K to 128K

EXAONE 4.0 supports a context length of up to 128K tokens. Context length is the maximum amount of input text the model can process at once. 128K tokens corresponds to roughly 300 pages of English text and is a key capability for long-document summarization, large-scale code analysis, and QA that draws on several documents at once.

Training at 128K tokens from the start, however, is inefficient. Long sequences are expensive to compute, and most training data is far shorter than 128K. EXAONE 4.0 therefore goes through a two-stage progressive extension.

The model pretrained with a 4K context is first extended to 32K and then extended again to 128K. At each stage, performance is validated with the NIAH (Needle In A Haystack) test. NIAH is a standard long-context evaluation that hides a specific piece of information (the needle) inside a long text (the haystack) and checks whether the model retrieves it correctly. Optimization is repeated until a "green light" (the information is found correctly) is confirmed at every position (start, middle, end) and every length.
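The shape of such a test can be sketched in a few lines. Everything here is hypothetical scaffolding: the filler sentence, the needle, and especially `lookup_model`, a regex stand-in for a real model call so that the example runs on its own.

```python
import re
from typing import Callable


def build_haystack(filler: str, needle: str, total_sentences: int, depth: float) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    sentences = [filler] * total_sentences
    sentences.insert(round(depth * total_sentences), needle)
    return " ".join(sentences)


def niah_grid(answer_fn: Callable[[str, str], str],
              lengths: list[int], depths: list[float]) -> dict[tuple[int, float], bool]:
    """Run the retrieval check for every (length, depth) cell; True is a green light."""
    secret = "7421"
    needle = f"The magic number is {secret}."
    question = "What is the magic number?"
    results = {}
    for length in lengths:
        for depth in depths:
            context = build_haystack("The sky was grey that day.", needle, length, depth)
            results[(length, depth)] = secret in answer_fn(context, question)
    return results


def lookup_model(context: str, question: str) -> str:
    """Stand-in for a real LLM call (hypothetical): finds the needle with a regex."""
    match = re.search(r"magic number is (\d+)", context)
    return match.group(1) if match else ""


results = niah_grid(lookup_model, lengths=[10, 100, 1000], depths=[0.0, 0.5, 1.0])
print(all(results.values()))  # True for the stand-in model
```

A real run would swap `lookup_model` for an actual inference call and measure length in tokens rather than sentences.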

It is worth noting that a careful data selection methodology and a progressive training recipe were applied to prevent degradation in the short-context regime. Losing existing short-context ability during long-context fine-tuning is a common problem, a form of "catastrophic forgetting", and EXAONE 4.0 manages it explicitly.

The 1.2B model is extended to 64K tokens. Given that most models in the 1B-parameter range support at most 32K, this is about twice the context length of its peers.

Post-training: A Five-Stage Pipeline

If pretraining is the stage that learns "raw language ability", post-training is the stage that refines that ability so the model follows user instructions, reasons accurately, and responds in line with human preferences. EXAONE 4.0's post-training is an elaborate five-stage pipeline, organized broadly around three pillars: SFT (Supervised Fine-Tuning) → RL (Reinforcement Learning) → Preference Learning.

Large-Scale Supervised Fine-Tuning

SFT is the process by which the model learns the "desired form of a response" from high-quality input-output pairs, typically human-written. EXAONE 4.0's SFT data is organized across five domains.

In the World Knowledge domain, problems collected from web sources are filtered by educational value. The aim is not rote memorization of facts but distillation of knowledge across many fields and difficulty levels. Specialized, high-difficulty data is specifically sampled and used for REASONING mode training.

In the Math, Code, and Logic domains, the fundamental constraint is that accurate ground truth is hard to obtain. Math problems have clear answers, but producing high-quality problems in bulk is hard, and code problems need verification through test cases. Rather than forcing out problems that cannot be verified, the team generates diverse responses to problems that do have verifiable answers. An interesting empirical finding is that generating several different solutions (responses) to one problem works about as well as increasing the number or diversity of unique problems. This is a practical insight that can cut data construction cost substantially.

Math/Code responses in REASONING mode tend to be long because they include extended thinking, which raises the risk of degeneration (repetitive or meaningless text) and language inconsistency (for example, answering a Korean question in English), so careful filtering is applied. The Code domain also goes beyond algorithmic problem-solving and includes software engineering datasets focused on full-stack development.

In the Long Context domain, context length and the position of the key information are varied systematically to train the ability to identify and reason over scattered information. For example, the model is trained on cases where the key information sits at the beginning, the middle, and the end of a long document. For Korean, legal, administrative, and technical documents are cleaned and restructured into a variety of long-context input formats.

The Agentic Tool Use domain covers more realistic scenarios than a simple single tool call (one API called once). The team builds complex long-horizon tool-calling data in which the agent refines requirements through dialogue with the user, calls several tools in sequence, decides its next action from intermediate results, and looks for alternatives when execution fails. The data is organized in multi-step (several tool calls in a row) and multi-turn (several rounds of dialogue) formats to support learning agentic tool use effectively.

In the Multilinguality domain, data for both Korean and Spanish is built with cultural and historical knowledge and natural conversational ability as the targets. Translations of existing English samples are used as queries, and new instructions native to each language are also generated. For Korean in particular, topics related to education and industry experts are curated to strengthen the handling of domain-specific queries.

Unified Mode Training. The key design decision is to train on NON-REASONING and REASONING data together rather than sequentially. Sequential training (NON-REASONING first, then REASONING) risks catastrophic forgetting, where the mode learned later overwrites the earlier one. Joint training reduces this risk, but the data ratio between the two modes then becomes important. Based on an ablation study, the token ratio of REASONING to NON-REASONING data was set to 1.5:1. When the REASONING share was too high, the model produced unnecessarily long thinking even in NON-REASONING mode. Conversely, if it were too low, reasoning quality in REASONING mode would be expected to suffer.

After Unified Mode training, a second round of SFT reuses high-quality REASONING data from the Code and Tool Use domains to correct domain imbalance. These domains made up a relatively small share of the overall data, so the intent is to shore up performance in those areas.

Reasoning Reinforcement Learning: AGAPO

If SFT is about "imitating the pattern of good responses", RL (Reinforcement Learning) is about "discovering better strategies through trial and error". The model generates several responses to a problem, receives a reward based on whether each is correct, and is updated to raise the probability of producing correct answers.

EXAONE 4.0 performs online RL after SFT and proposes a new algorithm, AGAPO (Asymmetric Sampling and Global Advantage Policy Optimization), that addresses several limitations of GRPO (Group Relative Policy Optimization).

First, a brief explanation of GRPO. GRPO is an RL algorithm proposed by DeepSeek that generates several responses (a group) per problem, computes each response's relative performance within the group (its advantage), and updates the policy accordingly. Unlike PPO (Proximal Policy Optimization), it needs no separate critic (value) model, which makes it memory-efficient, and it is very effective when combined with verifiable rewards (objectively checkable rewards such as whether a math answer is correct or whether code passes its tests). It does have some structural limitations, which AGAPO addresses systematically.

The training data covers four categories: math, code, science, and Instruction Following. For efficient training, eight responses are generated from the SFT model, and samples for which all eight are correct (problems already easy for the model) are removed by pre-filtering, since the model has little left to learn from easy problems.

Reward functions are designed per category. Math uses a rule-based verifier (whether the answer matches the reference), code uses test-case pass/fail, science falls back to an LLM judge for a second, more flexible check when the rule-based verifier fails, and Instruction Following gives 1 when every constraint is satisfied and 0 otherwise.
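The pre-filtering rule and two of the reward rules above are simple enough to write down directly. This is a minimal sketch with invented function names; the `judge` callable is a hypothetical stand-in for an LLM-judge call, and the actual verifiers used in the report are not shown.

```python
from typing import Callable


def keep_for_rl(correct_flags: list[bool]) -> bool:
    """Drop prompts that the SFT model already solves on every sampled response."""
    return not all(correct_flags)


def instruction_following_reward(constraints_met: list[bool]) -> float:
    """1 only when every constraint is satisfied, otherwise 0."""
    return 1.0 if all(constraints_met) else 0.0


def science_reward(rule_based_ok: bool, judge: Callable[[], bool]) -> float:
    """Rule-based check first; the (hypothetical) LLM judge runs only if it fails."""
    if rule_based_ok:
        return 1.0
    return 1.0 if judge() else 0.0


print(keep_for_rl([True] * 8))                       # False: too easy, filtered out
print(keep_for_rl([True] * 7 + [False]))             # True: still something to learn
print(instruction_following_reward([True, False]))   # 0.0
print(science_reward(False, judge=lambda: True))     # 1.0: judge accepts the answer
```

Passing the judge as a callable keeps the expensive second check lazy, so it only runs for answers the rule-based verifier rejected.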

AGAPO has four core design elements.

First, removing the clipped objective. PPO uses "clipping" to limit the size of each policy update for training stability. Specifically, when the probability ratio between the new and old policy leaves a set range, the gradient is cut off. This helps stability, but it has the side effect of blocking gradient updates for low-probability tokens, that is, tokens the model rarely generates right now but that may play an important role. Such tokens are often tied to reflective behavior that acts as a fork in the reasoning path (for example, "Wait, this approach is wrong, let me rethink"). AGAPO removes clipping and uses the standard policy gradient loss so that these exploratory tokens can contribute fully to learning.
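A small numeric check shows why clipping hits rare tokens hardest. The sketch compares the slope of the PPO clipped surrogate with the slope of a plain log-probability objective for a single token; the probabilities, the clip range of 0.2, and the helper names are illustrative choices, not values from the report.

```python
import math


def ppo_clipped_surrogate(p_new: float, p_old: float, advantage: float,
                          eps: float = 0.2) -> float:
    """Per-token PPO objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = p_new / p_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def slope(f, x: float, h: float = 1e-6) -> float:
    """Central finite difference, standing in for the gradient."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


advantage = 1.0
# Rare token: 0.01 -> 0.02 is a tiny absolute change but a ratio of 2, far past 1 + eps.
rare = slope(lambda p: ppo_clipped_surrogate(p, 0.01, advantage), 0.02)
# Common token: 0.50 -> 0.51 is a ratio of 1.02, comfortably inside the clip range.
common = slope(lambda p: ppo_clipped_surrogate(p, 0.50, advantage), 0.51)
# Plain policy gradient term A * log p keeps a signal for the rare token.
plain = slope(lambda p: advantage * math.log(p), 0.02)

print(rare, common, plain)
```

The rare token's slope under clipping is exactly zero, while the unclipped log-probability term still pushes its probability up.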

Second, asymmetric sampling. In standard GRPO, when every response generated for a problem is correct, or every one is incorrect, there is no relative difference within the group and the advantage is zero. Such samples contribute nothing to learning and are discarded. But the "all responses incorrect" case still carries a useful signal: the fact that the model is especially weak on this type of problem is itself valuable. Drawing on recent work on negative sample reinforcement ("The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning"), AGAPO does not throw away samples whose responses are all incorrect. Instead it assigns them a small negative reward through the advantage computation, so the model learns to actively avoid wrong reasoning paths. The name "asymmetric" comes from treating all-correct groups (discarded) and all-incorrect groups (kept) differently.

Third, group and global advantage. GRPO computes advantages only within each group (the responses to one problem). This captures relative differences inside the group but ignores the difficulty distribution across the batch. For example, to give an all-incorrect group a negative reward of sensible magnitude, you need to know how bad that group is relative to the whole batch. AGAPO introduces a two-step advantage computation for this. First, advantages are computed within the group in LOO (Leave-One-Out) fashion: the mean reward of the other responses is subtracted from each response's reward, measuring how much better or worse that response is relative to its group. Then the values are normalized across the whole minibatch (subtract the mean, divide by the standard deviation) to produce the final global advantage.

$$A_{\text{loo},i} = r_i - \frac{1}{G-1}\sum_{j \neq i} r_j, \qquad A_{\text{global},i} = \frac{A_{\text{loo},i} - \text{mean}\left(\{A_{\text{loo},k}\}_k\right)}{\text{std}\left(\{A_{\text{loo},k}\}_k\right)}$$

Here $r_i$ is the reward of the $i$-th response, $G$ is the group size, and $k$ indexes all responses in the minibatch.
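The two-step computation translates almost line for line into code. This is a sketch of the formula as written above, not the report's implementation: the use of the population standard deviation and the small `eps` guard against a zero denominator are assumptions added here.

```python
from statistics import fmean, pstdev


def loo_advantages(rewards: list[float]) -> list[float]:
    """Leave-one-out: each reward minus the mean reward of the other responses."""
    total, group_size = sum(rewards), len(rewards)
    return [r - (total - r) / (group_size - 1) for r in rewards]


def global_advantages(groups: list[list[float]], eps: float = 1e-8) -> list[float]:
    """Normalize LOO advantages over every response in the minibatch."""
    loo = [a for rewards in groups for a in loo_advantages(rewards)]
    mean, std = fmean(loo), pstdev(loo)  # population std is an assumption
    return [(a - mean) / (std + eps) for a in loo]  # eps guard is an assumption


# Two prompts, four sampled responses each; reward 1 = correct, 0 = incorrect.
groups = [[1.0, 0.0, 0.0, 1.0], [1.0, 1.0, 1.0, 0.0]]
advantages = global_advantages(groups)
print([round(a, 3) for a in advantages])
```

The single wrong answer in the mostly-correct second group receives the most negative advantage in the batch. One property worth noting in this toy setting: leave-one-out advantages sum to zero inside each group, so when every group lies wholly inside the batch the batch mean is zero and the global step only rescales by the batch standard deviation. How the report's implementation assigns the negative signal to all-incorrect groups is not reproduced here.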

Fourth, sequence-level cumulative KL. While RL strengthens reasoning, other abilities learned during SFT (natural conversation, Instruction Following, and so on) can be damaged. A KL divergence penalty is applied to prevent this. The KL penalty constrains the current policy being updated by RL ($\pi_\theta$) from drifting too far from the post-SFT reference policy ($\pi_{\text{ref}}$). AGAPO adopts a cumulative KL at the sequence level rather than the token level, managing distribution shift at the level of whole responses rather than small probability changes on individual tokens.

The final objective is:

$$J_{\text{AGAPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_\theta(\mathcal{O} \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( A_{\text{global},i} \log \pi_\theta(o_i \mid q) - \beta\, D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) \right) \right]$$

The $A_{\text{global},i} \log \pi_\theta(o_i \mid q)$ term is the policy gradient part, which raises the probability of good responses and lowers that of bad ones. The $\beta D_{\text{KL}}$ term is a regularizer that keeps the policy from straying too far from the reference model. $\beta$ is a hyperparameter that balances the two terms.
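For a single sampled response, the two terms of the objective can be sketched from per-token log-probabilities. The KL estimator used here, the sum over sampled tokens of the log-probability gap between policy and reference, is one common single-sample estimate and is an assumption; the report's exact estimator and its value of $\beta$ are not given in this post, so `beta=0.1` is purely illustrative.

```python
def agapo_sequence_term(advantage: float, logp_policy: list[float],
                        logp_ref: list[float], beta: float) -> float:
    """Advantage-weighted sequence log-prob minus a sequence-level KL penalty."""
    sequence_logp = sum(logp_policy)  # log pi_theta(o_i | q)
    # Assumed estimator: cumulative log-ratio over the sampled tokens.
    cumulative_kl = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return advantage * sequence_logp - beta * cumulative_kl


def agapo_objective(samples: list[tuple[float, list[float], list[float]]],
                    beta: float = 0.1) -> float:
    """Average the per-sequence terms; `samples` holds (advantage, logp, logp_ref)."""
    terms = [agapo_sequence_term(a, lp, lr, beta) for a, lp, lr in samples]
    return sum(terms) / len(terms)


same = agapo_sequence_term(1.0, [-0.5, -0.25], [-0.5, -0.25], beta=0.1)
drifted = agapo_sequence_term(1.0, [-0.5, -0.25], [-1.0, -0.25], beta=0.1)
print(same, drifted)
```

When the policy matches the reference the penalty vanishes, and once the policy assigns more probability than the reference to the sampled tokens the objective is pulled down in proportion to `beta`.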

The problem each AGAPO component addresses can be summarized as follows.

| Technique | Problem addressed | Key idea |
|---|---|---|
| Remove Clipped Objective | PPO clipping blocks gradients for exploratory tokens | Use the standard policy gradient loss |
| Asymmetric Sampling | Information lost by discarding all-incorrect samples | Keep all-incorrect samples and give them a small negative reward |
| Group & Global Advantage | GRPO ignores the batch-wide distribution | LOO (within group) → global normalization (across batch) |
| Seq-Level Cumulative KL | Preserving abilities learned in SFT | Sequence-level cumulative KL penalty |

Preference Learning: Two-Stage Human Alignment

The RL stage focuses on improving accuracy through verifiable rewards, that is, the objective signal of "correct or incorrect". That alone is not enough. It does not teach the response style humans prefer (concise, natural, polite, and so on), and as the model specializes in reasoning tasks, performance drops are observed on other kinds of tasks. An additional Preference Learning phase is introduced to compensate.

Preference Learning trains directly on human preferences, using comparison data of the form "this response is better than that one" (chosen/rejected pairs). The best-known framework is DPO (Direct Preference Optimization). EXAONE 4.0 uses SimPER (Simple alignment with Perplexity optimization), a DPO-family method that needs no reference model. Not needing a reference model means no extra model has to be held in memory during training, which makes it more efficient.

The way the dataset is built is distinctive. Rather than having external human annotators label it directly, it is based on on-policy responses generated by the post-RL model itself. For each query, 4 to 16 responses are generated, and chosen and rejected responses are selected with a hybrid reward that combines a verifiable reward, a preference reward (response quality as rated by an LLM judge), a language consistency reward (how well the response language matches the question language), and a conciseness reward (how well the response conveys the point without unnecessary verbosity).

Stage 1 focuses on token efficiency. The goal is to cut unnecessarily long thinking in REASONING mode while keeping answers correct. The verifiable reward and the conciseness reward are combined so that the shortest correct response is selected as chosen. This directly reduces inference cost.
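The Stage 1 selection rule can be sketched as follows. The "shortest correct response is chosen" rule comes from the description above, but how the rejected response is picked is not stated in this post, so the rule used here (the longest incorrect response if there is one, otherwise the longest correct one) is purely a placeholder assumption, as are the `Response` fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    text: str
    num_tokens: int
    correct: bool


def stage1_pair(responses: list[Response]) -> Optional[tuple[Response, Response]]:
    """Return (chosen, rejected), or None when no usable pair exists."""
    correct = [r for r in responses if r.correct]
    if not correct:
        return None  # nothing verified correct, so nothing can be chosen
    chosen = min(correct, key=lambda r: r.num_tokens)  # shortest correct response
    # Assumption (not from the report): prefer an incorrect response as rejected,
    # falling back to the longest correct one.
    wrong = [r for r in responses if not r.correct]
    rejected = max(wrong or correct, key=lambda r: r.num_tokens)
    if rejected is chosen:
        return None  # only one candidate, no contrast to learn from
    return chosen, rejected


samples = [
    Response("long but right", 900, True),
    Response("short and right", 300, True),
    Response("wrong", 500, False),
]
chosen, rejected = stage1_pair(samples)
print(chosen.text, "|", rejected.text)  # short and right | wrong
```

Whatever rule is used for the rejected side, the effect the post describes comes from the chosen side: among verified-correct answers, the model is steered toward the shortest one.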

Stage 2 focuses on human alignment and combines the preference reward with the language consistency reward. An important design decision is that, for REASONING mode data, preference labeling is done only on the final answer and not on the reasoning (thinking) part. The point is to focus on the quality and appeal of the answer the user actually sees rather than on the style of the thinking process. For training stability, part of the Stage 1 data is reused in Stage 2.

Evaluation: Benchmark Performance Analysis

Evaluation Framework

EXAONE 4.0 is evaluated on benchmarks in six categories.

World Knowledge: MMLU-REDUX (an improved version of the multi-domain knowledge benchmark), MMLU-PRO (a more challenging multi-domain knowledge benchmark), GPQA-DIAMOND (graduate-level biology, physics, and chemistry questions).

Math/Coding: AIME 2025 (the American Invitational Mathematics Examination), HMMT FEB 2025 (the Harvard-MIT Mathematics Tournament), LIVECODEBENCH V5/V6 (continuously updated competitive coding problems).

Instruction Following: IFEVAL (instruction compliance), MULTI-IF (multilingual, multi-turn instructions).

Long Context: HELMET (a comprehensive long-context evaluation), RULER (synthetic long-context tests), LONGBENCH (a bilingual long-context benchmark).

Agentic Tool Use: BFCL-V3 (function-calling ability), TAU-BENCH (simulated user-agent tool use).

Multilinguality: Korean (KMMLU-PRO, KMMLU-REDUX, KSM) and Spanish (MMMLU, MATH500, WMT24++).

The comparison set ranges from mid-size models (Qwen 3 32B, Gemma 3 27B, Phi 4, Mistral Small, and others) to frontier-class models (DeepSeek R1-0528 671B, Qwen 3 235B, Llama 4 Maverick 402B, and others). REASONING mode uses temperature 0.6 and top-p 0.95, and for AIME/HMMT in particular, $n=32$ responses are sampled and the average accuracy is reported. This accounts for the stochastic nature of sampled answers on reasoning problems.

REASONING Mode: Dominant Results in Math/Coding

The most striking results of the 32B model in REASONING mode are in Math/Coding.

๋ฒค์น˜๋งˆํฌ EXAONE 4.0 32B Qwen 3 32B Qwen 3 235B DeepSeek R1-0528
AIME 2025 85.3 72.9 81.5 87.5
HMMT FEB 2025 72.9 50.4 62.5 79.4
LIVECODEBENCH V6 66.7 60.1 58.9 70.3

The 32B model outperforms Qwen 3 235B, which has roughly seven times as many parameters, on all three Math/Coding benchmarks listed. This suggests that training methodology (AGAPO, systematically constructed SFT data) can matter more for reasoning performance than simply scaling up model size. It also comes close to the 671B DeepSeek R1-0528 (AIME 85.3 vs. 87.5), which points to very high efficiency relative to model size.
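To make the size-versus-score comparison concrete, the sketch below recomputes the parameter ratios and score margins from the numbers in the table above. The dictionary layout and the `margin` helper are illustrative choices, not anything from the paper.

```python
BASE = "EXAONE 4.0 32B"

# REASONING-mode scores copied from the table above.
scores = {
    "AIME 2025": {BASE: 85.3, "Qwen 3 235B": 81.5, "DeepSeek R1-0528": 87.5},
    "HMMT FEB 2025": {BASE: 72.9, "Qwen 3 235B": 62.5, "DeepSeek R1-0528": 79.4},
    "LIVECODEBENCH V6": {BASE: 66.7, "Qwen 3 235B": 58.9, "DeepSeek R1-0528": 70.3},
}
# Nominal parameter counts in billions, as quoted in the text.
params_b = {BASE: 32, "Qwen 3 235B": 235, "DeepSeek R1-0528": 671}


def margin(bench: str, rival: str) -> float:
    """Score of the 32B model minus the rival's score, in points."""
    return scores[bench][BASE] - scores[bench][rival]


for rival in ("Qwen 3 235B", "DeepSeek R1-0528"):
    ratio = params_b[rival] / params_b[BASE]
    print(f"{rival}: {ratio:.1f}x the parameters")
    for bench in scores:
        print(f"  {bench}: {margin(bench, rival):+.1f} points")
```

Positive margins mean the 32B model is ahead. Against Qwen 3 235B all three margins are positive, while against DeepSeek R1-0528 all three are negative but small relative to the roughly 21x difference in size.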

REASONING Mode: World Knowledge and Tool Use

In World Knowledge, the model scores 75.4 on GPQA-DIAMOND, ahead of Qwen 3 235B (71.1) and second only to DeepSeek R1-0528 (81.0). GPQA-DIAMOND requires graduate-level expert knowledge, so this result speaks to the effect of STEM data curation. The MMLU-REDUX score of 92.3 likewise reflects the 14T-token pretraining.

In Instruction Following, it records 83.7 on IFEVAL and 73.5 on MULTI-IF. Staying competitive despite merging NON-REASONING and REASONING modes suggests that the 1.5:1 ratio used in Unified Mode Training works well. Some models score far lower, for example Magistral Small at 37.9 on IFEVAL, which illustrates that a reasoning-specialized model can be weak at general instruction following.

In Tool Use, it scores 51.5 on TAU-BENCH (Airline), close to DeepSeek R1-0528 (53.5), and 62.8 on TAU-BENCH (Retail), ahead of most baselines. TAU-BENCH evaluates scenarios in which the model converses with a simulated user while handling realistic tasks such as changing a flight booking or returning a product. Given that Agentic Tool Use is newly introduced in EXAONE 4.0, this is an encouraging start.

NON-REASONING Mode: Competitive Across the Board

In NON-REASONING mode as well, EXAONE 4.0 32B is among the best overall in the mid-size class. It scores 89.8 on MMLU-REDUX and 77.6 on MMLU-PRO, ahead of Phi 4 (88.3/70.4), Mistral Small (85.9/69.1), and Gemma 3 27B (85.0/67.5); the margin over Phi 4 is narrow on MMLU-REDUX (1.5 points) but wider on MMLU-PRO (7.2 points). In Math/Coding, 35.9 on AIME 2025 and 43.1 on LIVECODEBENCH V6 put it well ahead of its size class. That the model clearly exceeds other models' non-reasoning scores in math and coding even in NON-REASONING mode suggests that Unified Mode Training transfers some REASONING ability to NON-REASONING mode.

In the Long Context evaluation, it scores 88.2 on RULER, above Qwen 3 32B (85.6) and Gemma 3 27B (66.0). By contrast, Llama 4 Maverick scores 2.9 on RULER at 128K, an effectively complete failure, which points to the benefit of the Hybrid Attention architecture. On HELMET, it leads every compared model in the Recall category (the ability to locate specific information in long text) with 94.06, but Summarization (25.64) is a relative weakness. The result shows that finding information in long text and summarizing it are different skills.

Korean performance also stands out. With 60.0 on KMMLU-PRO and 76.9 on KO-LONGBENCH, it is the highest outside the frontier-scale models. KO-LONGBENCH is an in-house benchmark for Korean long-context understanding that includes QA over legal, administrative, and technical documents, dialogue understanding, and table QA. The lead of more than 20 points over Mistral Small (55.4) reflects the effect of Korean long-context data curation.

1.2B Model: The Potential of On-device Reasoning

The 1.2B model's performance is remarkable given that it has only 1.2 billion parameters, a size small enough to run on a smartphone.

In REASONING mode, it reaches 45.2 on AIME 2025 and 45.3 on LIVECODEBENCH V6, well ahead of SmolLM 3B (36.7 and 29.1), which has roughly 2.5 times as many parameters (3B vs. 1.2B). On GPQA-DIAMOND, its 52.0 is about 12 points above Qwen 3 1.7B (40.1), and its 60.6 on Korean math (KSM) is the best in its class. It does trail EXAONE Deep 2.4B (a reasoning-only model twice its size) slightly on AIME (45.2 vs. 47.9), which shows that a hybrid model pays a small performance trade-off against a dedicated reasoning model.

In NON-REASONING mode, it also posts best-in-class scores on most benchmarks. In Long Context in particular, 77.4 on RULER and 69.8 on KO-LONGBENCH validate its ability to handle contexts of up to 64K tokens. The gap is stark next to Qwen 3 0.6B, which scores 16.4 on KO-LONGBENCH.

On WMT24++ (Spanish translation quality), however, it scores 65.9, far behind SmolLM 3B (84.0), which suggests that Spanish support is still at an early stage. Its TAU-BENCH (Airline) score in NON-REASONING mode is also low at 10.0, so complex Tool Use scenarios remain challenging for small models.

Reasoning Budget: The Trade-off Between Inference Cost and Performance

In REASONING mode, the model generates a thinking process (Thinking Tokens) before its final answer. Capping the number of Thinking Tokens reduces inference cost (time and compute) but also affects performance. The Reasoning Budget experiment quantifies this trade-off.

Varying the number of reasoning tokens from 1K to 64K and observing performance yields practically useful findings. When generation reaches the maximum token budget, the model is forced to wrap up its thinking and is steered to produce an answer.
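One way to picture the budget cap is as a wrapper around the token stream that closes the thinking span once the budget is spent. This is a simplified sketch under stated assumptions: the `</think>` marker and the function name are hypothetical, and the actual experiment may steer the model differently (for example by inserting a short wrap-up prompt).

```python
from typing import Iterable, Iterator

# Hypothetical marker that closes the thinking span.
END_OF_THINKING = "</think>"


def think_with_budget(token_stream: Iterable[str], budget: int) -> Iterator[str]:
    """Yield thinking tokens, forcing the closing marker once `budget` is spent."""
    used = 0
    for token in token_stream:
        if token == END_OF_THINKING:
            yield token  # the model finished thinking on its own
            return
        if used >= budget:
            break        # budget exhausted: stop consuming thinking tokens
        yield token
        used += 1
    yield END_OF_THINKING  # forced close, so answer generation can begin


stream = ["a", "b", "c", "d", END_OF_THINKING]
print(list(think_with_budget(stream, budget=2)))   # ['a', 'b', '</think>']
print(list(think_with_budget(stream, budget=10)))  # ['a', 'b', 'c', 'd', '</think>']
```

After the closing marker, decoding would continue normally to produce the final answer; that part is omitted here.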

For the 32B model, LIVECODEBENCH V6 actually rises slightly from 64K (66.7) to 32K (67.3), and a noticeable drop appears only at 16K (53.0). This suggests that many coding problems can be solved within 32K tokens of thinking and that extra thinking up to 64K does not necessarily help. AIME 2025 falls from 85.3 at 64K to 74.8 at 32K, a drop of 10.5 points (about 12% relative), yet that figure is still above Qwen 3 32B's 72.9. Olympiad-level math problems tend to need longer reasoning chains than coding, which would explain the greater sensitivity to a smaller budget.

For the 1.2B model, AIME 2025 is nearly unchanged from 64K (45.2) to 32K (45.3), and LIVECODEBENCH V6 drops only from 45.3 at 64K to 43.0 at 32K, a 2.3-point decline (about 5% relative). A plausible reading is that small models struggle to make effective use of very long reasoning chains near 64K in the first place, so 32K is enough.
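Because the discussion mixes point differences and relative changes, it can help to compute both side by side. The sketch below does so for the 64K and 32K scores quoted above; the `drop` helper and the dictionary layout are illustrative.

```python
def drop(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = before - after
    return absolute, 100.0 * absolute / before


# (model, benchmark): (score at 64K, score at 32K), as quoted in the text above.
results = {
    ("32B", "AIME 2025"): (85.3, 74.8),
    ("32B", "LIVECODEBENCH V6"): (66.7, 67.3),
    ("1.2B", "AIME 2025"): (45.2, 45.3),
    ("1.2B", "LIVECODEBENCH V6"): (45.3, 43.0),
}

for (model, bench), (s64, s32) in results.items():
    points, relative = drop(s64, s32)
    print(f"{model} {bench}: {points:+.1f} points ({relative:+.1f}% relative)")
```

Positive values are drops and negative values are gains. The only large change is the 32B model on AIME 2025, at 10.5 points or about 12.3% relative.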

These results mean that a 32K budget is enough to secure adequate performance in most cases in real deployments, with olympiad-level math on the 32B model as the main exception. Fewer reasoning tokens directly reduce inference latency and GPU cost, so this insight can be applied immediately to cost optimization in real-time services.

Limitations

The limitations stated in the paper include ones common to all LLMs. Because the model depends on the statistical properties of its training data, it may generate inappropriate or biased responses (for example regarding age, gender, or race), and information after the knowledge cut-off (November 2024) is not reflected. In addition, owing to the inherent nature of probabilistic text generation, it may produce semantically or syntactically incorrect sentences.

The license is the EXAONE AI Model License Agreement 1.2 - NC, which permits use only for research and educational purposes. Commercial use requires a separate license agreement with LG AI Research, and using EXAONE 4.0 models or their output to develop competing models is explicitly prohibited. This is a restrictive license, unlike open-source licenses such as Apache 2.0 or MIT, so care is needed when using the model.

Conclusion and Takeaways

EXAONE 4.0 demonstrates the viability of the hybrid paradigm of "one model, two modes." The 32B model outperforming the roughly seven-times-larger Qwen 3 235B in Math/Coding, and the 1.2B model exceeding a 3B-class model, show the combined effect of architecture design (Hybrid Attention, QK-Reorder-LN), data curation (14T tokens, domain-specific data), and RL optimization through AGAPO.

AGAPO in particular is an algorithm that systematically addresses specific limitations of GRPO (the clipped objective suppressing exploration, the discarding of all-incorrect samples, and the failure to reflect the batch-level distribution), making a practical contribution to research on RL-based reasoning. The concrete design choices in Hybrid Attention, namely the 3:1 ratio and the 4K window size, along with the decision to remove RoPE from global attention, are practical guidance that other model designs can draw on.
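The Hybrid Attention numbers can be made concrete with a small sketch. It reads the 3:1 ratio as local (sliding-window) to global layers and assumes each group is ordered as three local layers followed by one global layer; that ordering, the function names, and the exact window boundary convention are assumptions for illustration. Position encoding (including the RoPE removal in global layers) is not modeled.

```python
LOCAL_PER_GLOBAL = 3  # 3:1 ratio, read here as local:global (assumption)
WINDOW = 4096         # 4K sliding window for local layers


def layer_kinds(num_layers: int) -> list[str]:
    """Assumed ordering: three local layers, then one global layer, repeated."""
    period = LOCAL_PER_GLOBAL + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def can_attend(kind: str, query_pos: int, key_pos: int) -> bool:
    """Whether a query position may attend to a key position in this layer kind."""
    if key_pos > query_pos:
        return False                     # causal: no looking ahead
    if kind == "global":
        return True                      # full causal attention
    return query_pos - key_pos < WINDOW  # local: only the most recent WINDOW positions


print(layer_kinds(8))
# ['local', 'local', 'local', 'global', 'local', 'local', 'local', 'global']
print(can_attend("local", 10_000, 100), can_attend("global", 10_000, 100))  # False True
```

In this layout only every fourth layer pays the full long-context attention cost, while the remaining layers look at a fixed 4K window regardless of sequence length.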

The strength in Korean (KO-LONGBENCH 76.9, KSM 87.6) is a clear differentiator for Korean-language users, and the introduction of Agentic Tool Use lays the groundwork for LLM-based agent development. That said, the non-commercial (NC) license and relative weaknesses in areas such as summarization and translation are factors to weigh in practical use.

Thank you for reading.