[Paper Review] Qwen-Image Technical Report

Posted by Euisuk's Dev Log on August 29, 2025

์›๋ณธ ๊ฒŒ์‹œ๊ธ€: https://velog.io/@euisuk-chung/Paper-Review-Qwen-Image-Technical-Report

https://arxiv.org/abs/2508.02324

[1] Wu, Chenfei, et al. "Qwen-Image Technical Report." arXiv preprint arXiv:2508.02324, 2025.

๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ๋ณต์žกํ•œ ํ…์ŠคํŠธ ๋ Œ๋”๋ง๊ณผ ์ •๋ฐ€ํ•œ ์ด๋ฏธ์ง€ ํŽธ์ง‘์—์„œ ์ƒ๋‹นํ•œ ์ง„๋ณด๋ฅผ ์ด๋ฃฌ Qwen ์‹œ๋ฆฌ์ฆˆ์˜ ์ด๋ฏธ์ง€ ์ƒ์„ฑ foundation model์ธ Qwen-Image๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.

  1. ์„œ๋ก 

text-to-image ์ƒ์„ฑ(T2I)๊ณผ ์ด๋ฏธ์ง€ ํŽธ์ง‘(TI2I)์„ ๋ชจ๋‘ ํฌํ•จํ•˜๋Š” ์ด๋ฏธ์ง€ ์ƒ์„ฑ ๋ชจ๋ธ์€ ํ˜„๋Œ€ ์ธ๊ณต์ง€๋Šฅ์˜ ๊ธฐ๋ณธ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ๋“ฑ์žฅํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ธฐ๊ณ„๊ฐ€ ํ…์ŠคํŠธ ํ”„๋กฌํ”„ํŠธ์—์„œ ์‹œ๊ฐ์ ์œผ๋กœ ๋งค๋ ฅ์ ์ด๊ณ  ์˜๋ฏธ์ ์œผ๋กœ ์ผ๊ด€๋œ ์ฝ˜ํ…์ธ ๋ฅผ ํ•ฉ์„ฑํ•˜๊ฑฐ๋‚˜ ์ˆ˜์ •ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์ง€๋‚œ ๋ช‡ ๋…„ ๋™์•ˆ ์ด ๋ถ„์•ผ์—์„œ ๋†€๋ผ์šด ์ง„์ „์ด ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ํŠนํžˆ fine-grained semantic detail์„ ์บก์ฒ˜ํ•˜๋ฉด์„œ ๊ณ ํ•ด์ƒ๋„ ์ด๋ฏธ์ง€ ์ƒ์„ฑ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” diffusion-based architecture์˜ ์ถœํ˜„๊ณผ ํ•จ๊ป˜ ๋ง์ž…๋‹ˆ๋‹ค.

์ด๋Ÿฌํ•œ ์ง„์ „์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ๋‘ ๊ฐ€์ง€ ์ค‘์š”ํ•œ ๋ฌธ์ œ๊ฐ€ ์ง€์†๋ฉ๋‹ˆ๋‹ค:

์ฒซ์งธ, text-to-image ์ƒ์„ฑ์—์„œ ๋ณต์žกํ•˜๊ณ  ๋‹ค๋ฉด์ ์ธ ํ”„๋กฌํ”„ํŠธ์™€ ๋ชจ๋ธ ์ถœ๋ ฅ์˜ ์ •๋ ฌ์€ ์—ฌ์ „ํžˆ ์ค‘์š”ํ•œ ์žฅ๋ฒฝ์ž…๋‹ˆ๋‹ค. ์šฐ๋ฆฌ์˜ ํ‰๊ฐ€์— ๋”ฐ๋ฅด๋ฉด GPT Image 1์ด๋‚˜ Seedream 3.0๊ณผ ๊ฐ™์€ state-of-the-art ์ƒ์šฉ ๋ชจ๋ธ๋“ค๋„ multi-line ํ…์ŠคํŠธ ๋ Œ๋”๋ง, non-alphabetic language ๋ Œ๋”๋ง(์˜ˆ: ์ค‘๊ตญ์–ด), ์ง€์—ญํ™”๋œ ํ…์ŠคํŠธ ์‚ฝ์ž…, ๋˜๋Š” ํ…์ŠคํŠธ์™€ ์‹œ๊ฐ์  ์š”์†Œ์˜ ๋งค๋„๋Ÿฌ์šด ํ†ตํ•ฉ์„ ์š”๊ตฌํ•˜๋Š” ํƒœ์Šคํฌ์— ์ง๋ฉดํ–ˆ์„ ๋•Œ ์–ด๋ ค์›€์„ ๊ฒช์Šต๋‹ˆ๋‹ค.

๋‘˜์งธ, ์ด๋ฏธ์ง€ ํŽธ์ง‘์—์„œ ํŽธ์ง‘๋œ ์ถœ๋ ฅ๊ณผ ์›๋ณธ ์ด๋ฏธ์ง€ ๊ฐ„์˜ ์ •ํ™•ํ•œ ์ •๋ ฌ์„ ๋‹ฌ์„ฑํ•˜๋Š” ๊ฒƒ์€ ์ด์ค‘ ๋„์ „์„ ์ œ๊ธฐํ•ฉ๋‹ˆ๋‹ค: (i) visual consistency - ๋Œ€์ƒ ์˜์—ญ๋งŒ ์ˆ˜์ •๋˜์–ด์•ผ ํ•˜๊ณ  ๋‹ค๋ฅธ ๋ชจ๋“  ์‹œ๊ฐ์  ์„ธ๋ถ€์‚ฌํ•ญ์€ ๋ณด์กด๋˜์–ด์•ผ ํ•จ(์˜ˆ: ์–ผ๊ตด ์„ธ๋ถ€์‚ฌํ•ญ์„ ๋ณ€๊ฒฝํ•˜์ง€ ์•Š๊ณ  ๋จธ๋ฆฌ ์ƒ‰๊น” ๋ณ€๊ฒฝ) (ii) semantic coherence - ๊ตฌ์กฐ์  ๋ณ€ํ™” ์ค‘์—๋„ ์ „์—ญ semantic์„ ๋ณด์กดํ•ด์•ผ ํ•จ(์˜ˆ: ์ •์ฒด์„ฑ๊ณผ ์žฅ๋ฉด ์ผ๊ด€์„ฑ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ์‚ฌ๋žŒ์˜ ์ž์„ธ ์ˆ˜์ •)

๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” ํฌ๊ด„์ ์ธ data engineering, progressive learning ์ „๋žต, ๊ฐ•ํ™”๋œ multi-task training paradigm, ๊ทธ๋ฆฌ๊ณ  ํ™•์žฅ ๊ฐ€๋Šฅํ•œ infrastructure ์ตœ์ ํ™”๋ฅผ ํ†ตํ•ด ์ด๋Ÿฌํ•œ ๋„์ „์„ ๊ทน๋ณตํ•˜๋„๋ก ์„ค๊ณ„๋œ Qwen ์‹œ๋ฆฌ์ฆˆ์˜ ์ƒˆ๋กœ์šด ์ด๋ฏธ์ง€ ์ƒ์„ฑ ๋ชจ๋ธ์ธ Qwen-Image๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

์ฃผ์š” ๊ธฐ์—ฌ์‚ฌํ•ญ

Qwen-Image์˜ ์ฃผ์š” ๊ธฐ์—ฌ์‚ฌํ•ญ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์š”์•ฝ๋ฉ๋‹ˆ๋‹ค:

  • ๋›ฐ์–ด๋‚œ ํ…์ŠคํŠธ ๋ Œ๋”๋ง: Qwen-Image๋Š” multiline layout, paragraph-level semantic, fine-grained detail์„ ํฌํ•จํ•œ ๋ณต์žกํ•œ ํ…์ŠคํŠธ ๋ Œ๋”๋ง์— ํƒ์›”ํ•ฉ๋‹ˆ๋‹ค. alphabetic language(์˜ˆ: ์˜์–ด)์™€ logographic language(์˜ˆ: ์ค‘๊ตญ์–ด) ๋ชจ๋‘๋ฅผ ๋†’์€ ์ถฉ์‹ค๋„๋กœ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค.
  • ์ผ๊ด€๋œ ์ด๋ฏธ์ง€ ํŽธ์ง‘: ๊ฐ•ํ™”๋œ multi-task training paradigm์„ ํ†ตํ•ด Qwen-Image๋Š” ํŽธ์ง‘ ์ž‘์—… ์ค‘ semantic meaning๊ณผ visual realism์„ ๋ชจ๋‘ ๋ณด์กดํ•˜๋Š” ๋ฐ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • ๊ฐ•๋ ฅํ•œ cross-benchmark ์„ฑ๋Šฅ: ์—ฌ๋Ÿฌ benchmark์—์„œ ํ‰๊ฐ€ํ•œ ๊ฒฐ๊ณผ, Qwen-Image๋Š” ๋‹ค์–‘ํ•œ ์ƒ์„ฑ ๋ฐ ํŽธ์ง‘ ํƒœ์Šคํฌ์—์„œ ๊ธฐ์กด ๋ชจ๋ธ๋“ค์„ ์ง€์†์ ์œผ๋กœ ๋Šฅ๊ฐ€ํ•˜์—ฌ ์ด๋ฏธ์ง€ ์ƒ์„ฑ์„ ์œ„ํ•œ ๊ฐ•๋ ฅํ•œ foundation model์„ ํ™•๋ฆฝํ•ฉ๋‹ˆ๋‹ค.
  1. ๋ชจ๋ธ

์ด ์„น์…˜์—์„œ๋Š” ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์™€ ํ›ˆ๋ จ ์„ธ๋ถ€์‚ฌํ•ญ์— ๋Œ€ํ•œ ํฌ๊ด„์ ์ธ ๊ฐœ์š”์™€ ํ•จ๊ป˜ Qwen-Image ๋ชจ๋ธ์˜ ์•„ํ‚คํ…์ฒ˜ ์„ค๊ณ„๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.

2.1 ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜

Figure 6์—์„œ ๋ณด๋“ฏ์ด, Qwen-Image ์•„ํ‚คํ…์ฒ˜๋Š” ๊ณ ์ถฉ์‹ค๋„ text-to-image ์ƒ์„ฑ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด ์กฐํ™”๋กญ๊ฒŒ ์ž‘๋™ํ•˜๋Š” ์„ธ ๊ฐ€์ง€ ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค:

  1. Multimodal Large Language Model (MLLM): serves as the condition encoder, extracting features from textual inputs.
  2. Variational AutoEncoder (VAE): serves as the image tokenizer, compressing input images into compact latent representations and decoding them back during inference.
  3. Multimodal Diffusion Transformer (MMDiT): functions as the backbone diffusion model, modeling the complex joint distribution between noise and image latents under textual guidance.
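
To make the division of labor concrete, here is a minimal sketch of how the three components might compose at inference time. The module interfaces (`mllm.encode`, `mmdit(...)`, `vae.decode`) and the latent shape are illustrative assumptions, not the released API.

```python
import torch

@torch.no_grad()
def generate(prompt, mllm, mmdit, vae, steps=50):
    """Toy three-component pipeline: the MLLM encodes the prompt, the
    MMDiT iteratively denoises a latent under that guidance, and the
    VAE decoder maps the final latent back to pixels."""
    h = mllm.encode(prompt)              # condition features from text
    x = torch.randn(1, 16, 64, 64)       # latent-space noise (shape assumed)
    dt = 1.0 / steps
    for i in reversed(range(steps)):     # simple Euler ODE solver, t: 1 -> 0
        t = (i + 1) * dt
        v = mmdit(x, t, h)               # predicted velocity at time t
        x = x - v * dt                   # step toward the image manifold
    return vae.decode(x)                 # latent -> RGB image
```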

2.2 Multimodal Large Language Model

Qwen-Image๋Š” ํ…์ŠคํŠธ ์ž…๋ ฅ์„ ์œ„ํ•œ feature extraction module๋กœ Qwen2.5-VL ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์„ธ ๊ฐ€์ง€ ์ฃผ์š” ์ด์œ ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. Qwen2.5-VL์˜ language space์™€ visual space๊ฐ€ ์ด๋ฏธ ์ •๋ ฌ๋˜์–ด ์žˆ์–ด Qwen3์™€ ๊ฐ™์€ language-based model๋ณด๋‹ค text-to-image ํƒœ์Šคํฌ์— ๋” ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค.
  2. Qwen2.5-VL์€ language model์— ๋น„ํ•ด ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ์ €ํ•˜ ์—†์ด ๊ฐ•๋ ฅํ•œ language modeling ๋Šฅ๋ ฅ์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.
  3. Qwen2.5-VL์€ multimodal ์ž…๋ ฅ์„ ์ง€์›ํ•˜์—ฌ Qwen-Image๊ฐ€ ์ด๋ฏธ์ง€ ํŽธ์ง‘๊ณผ ๊ฐ™์€ ๋” ๊ด‘๋ฒ”์œ„ํ•œ ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

2.3 Variational AutoEncoder

๊ฐ•๋ ฅํ•œ VAE representation์€ ๊ฐ•๋ ฅํ•œ ์ด๋ฏธ์ง€ foundation model์„ ๊ตฌ์ถ•ํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ํ˜„์žฌ ์ด๋ฏธ์ง€ foundation model๋“ค์€ ์ผ๋ฐ˜์ ์œผ๋กœ ๋Œ€๊ทœ๋ชจ ์ด๋ฏธ์ง€ dataset์—์„œ 2D convolution์œผ๋กœ ์ด๋ฏธ์ง€ VAE๋ฅผ ํ›ˆ๋ จํ•˜์—ฌ ๊ณ ํ’ˆ์งˆ ์ด๋ฏธ์ง€ representation์„ ์–ป์Šต๋‹ˆ๋‹ค.

์šฐ๋ฆฌ์˜ ์ž‘์—…์€ ์ด๋ฏธ์ง€์™€ ๋น„๋””์˜ค ๋ชจ๋‘์™€ ํ˜ธํ™˜๋˜๋Š” ๋” ์ผ๋ฐ˜์ ์ธ visual representation์„ ๊ฐœ๋ฐœํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ์กด์˜ joint image-video VAE๋“ค์€ ์ผ๋ฐ˜์ ์œผ๋กœ ์ด๋ฏธ์ง€ reconstruction ๋Šฅ๋ ฅ์ด ์ €ํ•˜๋˜๋Š” ์„ฑ๋Šฅ trade-off๋ฅผ ๊ฒช์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด single-encoder, dual-decoder ์•„ํ‚คํ…์ฒ˜๋ฅผ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.

To improve reconstruction fidelity, especially for small text and fine-grained details, the decoder is trained on an in-house corpus of text-rich images. The dataset consists of real-world documents (PDFs, PowerPoint slides, posters) and synthetic paragraphs covering both alphabetic (e.g., English) and logographic (e.g., Chinese) languages.

2.4 Multimodal Diffusion Transformer

Qwen-Image๋Š” ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€๋ฅผ jointly ๋ชจ๋ธ๋งํ•˜๊ธฐ ์œ„ํ•ด Multimodal Diffusion Transformer(MMDiT)๋ฅผ ์ฑ„ํƒํ•ฉ๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ๋ฒ•์€ FLUX ์‹œ๋ฆฌ์ฆˆ์™€ Seedream ์‹œ๋ฆฌ์ฆˆ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ์ž‘์—…์—์„œ ํšจ๊ณผ์ ์ž„์ด ์ž…์ฆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

๊ฐ block ๋‚ด์—์„œ ์ƒˆ๋กœ์šด positional encoding ๋ฐฉ๋ฒ•์ธ Multimodal Scalable RoPE(MSRoPE)๋ฅผ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค. Figure 8์—์„œ ๋ณด๋“ฏ์ด, ๋‹ค์–‘ํ•œ text-image joint positional encoding ์ „๋žต์„ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค.

MSRoPE์˜ ํŠน์ง•:

  • ํ…์ŠคํŠธ ์ž…๋ ฅ์„ ์–‘์ชฝ ์ฐจ์›์— ๋™์ผํ•œ position ID๊ฐ€ ์ ์šฉ๋œ 2D tensor๋กœ ์ฒ˜๋ฆฌ
  • ํ…์ŠคํŠธ๊ฐ€ ์ด๋ฏธ์ง€์˜ ๋Œ€๊ฐ์„ ์„ ๋”ฐ๋ผ ์—ฐ๊ฒฐ๋œ ๊ฒƒ์œผ๋กœ ๊ฐœ๋…ํ™”
  • ์ด๋ฏธ์ง€ ์ธก๋ฉด์—์„œ resolution scaling ์žฅ์ ์„ ํ™œ์šฉํ•˜๋ฉด์„œ ํ…์ŠคํŠธ ์ธก๋ฉด์—์„œ 1D-RoPE์™€ ๊ธฐ๋Šฅ์ ์œผ๋กœ ๋™๋“ฑํ•จ์„ ์œ ์ง€
  1. ๋ฐ์ดํ„ฐ

3.1 ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘

์ด๋ฏธ์ง€ ์ƒ์„ฑ ๋ชจ๋ธ์˜ ํ›ˆ๋ จ์„ ์ง€์›ํ•˜๊ธฐ ์œ„ํ•ด ์ˆ˜์‹ญ์–ต ๊ฐœ์˜ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ์„ ์ฒด๊ณ„์ ์œผ๋กœ ์ˆ˜์ง‘ํ•˜๊ณ  ์ฃผ์„์„ ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. raw dataset์˜ ๊ทœ๋ชจ์—๋งŒ ์ง‘์ค‘ํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ๊ณผ ๊ท ํ˜• ์žกํžŒ ๋ฐ์ดํ„ฐ ๋ถ„ํฌ๋ฅผ ์šฐ์„ ์‹œํ•˜์—ฌ ์‹ค์ œ ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ๋ฐ€์ ‘ํ•˜๊ฒŒ ๋ฐ˜์˜ํ•˜๋Š” ์ž˜ ๊ท ํ˜• ์žกํžˆ๊ณ  ๋Œ€ํ‘œ์ ์ธ dataset์„ ๊ตฌ์ถ•ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ–ˆ์Šต๋‹ˆ๋‹ค.

Figure 9์—์„œ ๋ณด๋“ฏ์ด, dataset์€ ๋„ค ๊ฐ€์ง€ ์ฃผ์š” ๋„๋ฉ”์ธ์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค:

  1. Nature (~55%): includes a wide range of subcategories such as Objects, Landscape, Cityscape, Plants, Animals, Indoor, and Food.
  2. Design (~27%): includes structured visual content such as Posters, User Interfaces, and Presentation Slides, as well as various forms of art including painting, sculpture, crafts, and digital art.
  3. People (~13%): includes subcategories such as Portrait, Sports, and Human Activities.
  4. Synthetic Data (~5%): data synthesized via controlled text rendering techniques.

3.2 ๋ฐ์ดํ„ฐ ํ•„ํ„ฐ๋ง

์ด๋ฏธ์ง€ ์ƒ์„ฑ ๋ชจ๋ธ์˜ ๋ฐ˜๋ณต์  ๊ฐœ๋ฐœ ๊ณผ์ •์—์„œ ๊ณ ํ’ˆ์งˆ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋ฅผ ํ๋ ˆ์ด์…˜ํ•˜๊ธฐ ์œ„ํ•ด Figure 10์— ๋‚˜ํƒ€๋‚œ ๋ฐ”์™€ ๊ฐ™์ด 7๋‹จ๊ณ„์˜ ์ˆœ์ฐจ์  ๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑ๋œ multi-stage ํ•„ํ„ฐ๋ง ํŒŒ์ดํ”„๋ผ์ธ์„ ์ œ์•ˆํ–ˆ์Šต๋‹ˆ๋‹ค.

Stage 1: Initial Pre-training Data Curation

  • Train on 256p-resolution images
  • Apply the Broken Files Filter, File Size Filter, Resolution Filter, Deduplication Filter, and NSFW Filter

Stage 2: Image Quality Enhancement

  • Apply the Rotation Filter, Clarity Filter, Luma Filter, Saturation Filter, Entropy Filter, and Texture Filter

Stage 3: Image-Text Alignment Improvement

  • Split into a Raw Caption Split, Recaption Split, and Fused Caption Split
  • Apply the Chinese CLIP Filter, SigLIP Filter, Token Length Filter, and Invalid Caption Filter

Stage 4: Text Rendering Enhancement

  • Classify into an English Split, Chinese Split, Other Language Split, and Non-Text Split
  • Apply the Intensive Text Filter and Small Character Filter

Stage 5: High-Resolution Refinement

  • Move to 640p resolution
  • Apply the Image Quality Filter, Resolution Filter, Aesthetic Filter, and Abnormal Element Filter

Stage 6: Category Balance and Portrait Augmentation

  • Reclassify into three main categories: General, Portrait, and Text Rendering
  • Use keyword-based retrieval and image retrieval techniques

Stage 7: Balanced Multi-Scale Training

  • 640p์™€ 1328p ํ•ด์ƒ๋„์—์„œ joint ํ›ˆ๋ จ
  • hierarchical taxonomy system ๊ธฐ๋ฐ˜ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜
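
As referenced above, the seven stages compose naturally as successive passes of predicate filters over (image, caption) records. The sketch below shows only the pattern; the named predicates are hypothetical stand-ins for the filters listed in each stage.

```python
def run_pipeline(records, stages):
    """Multi-stage filtering: each stage is a list of predicates, and a
    record survives a stage only if it passes every predicate in it."""
    for stage in stages:
        records = [r for r in records if all(check(r) for check in stage)]
    return records

# Hypothetical wiring for the first two stages:
# stage1 = [not_broken, size_ok, resolution_ok, not_duplicate, not_nsfw]
# stage2 = [upright, sharp, luma_ok, saturation_ok, entropy_ok, textured]
# clean = run_pipeline(raw_records, [stage1, stage2])
```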

3.3 ๋ฐ์ดํ„ฐ ์ฃผ์„ ์ž‘์„ฑ

๋ฐ์ดํ„ฐ ์ฃผ์„ ํŒŒ์ดํ”„๋ผ์ธ์—์„œ ํฌ๊ด„์ ์ธ ์ด๋ฏธ์ง€ ์„ค๋ช…๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ํ•„์ˆ˜ ์ด๋ฏธ์ง€ ์†์„ฑ๊ณผ ํ’ˆ์งˆ ํŠน์„ฑ์„ ์บก์ฒ˜ํ•˜๋Š” ๊ตฌ์กฐํ™”๋œ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด ๋Šฅ๋ ฅ์žˆ๋Š” ์ด๋ฏธ์ง€ captioner(์˜ˆ: Qwen2.5-VL)๋ฅผ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.

captioning๊ณผ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ์ถ”์ถœ์„ ๋…๋ฆฝ์ ์ธ ํƒœ์Šคํฌ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๋Œ€์‹ , captioner๊ฐ€ ๋™์‹œ์— ์‹œ๊ฐ์  ์ฝ˜ํ…์ธ ๋ฅผ ์„ค๋ช…ํ•˜๊ณ  JSON๊ณผ ๊ฐ™์€ ๊ตฌ์กฐํ™”๋œ ํ˜•์‹์œผ๋กœ ์„ธ๋ถ€ ์ •๋ณด๋ฅผ ์ƒ์„ฑํ•˜๋Š” ์ฃผ์„ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์„ค๊ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค.

3.4 ๋ฐ์ดํ„ฐ ํ•ฉ์„ฑ

์‹ค์ œ ์ด๋ฏธ์ง€์—์„œ ํ…์ŠคํŠธ ์ฝ˜ํ…์ธ ์˜ long-tail distribution, ํŠนํžˆ ์ค‘๊ตญ์–ด์™€ ๊ฐ™์€ non-Latin ์–ธ์–ด์—์„œ ์ˆ˜๋งŽ์€ ๋ฌธ์ž๊ฐ€ ๊ทน๋„๋กœ ๋‚ฎ์€ ๋นˆ๋„๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด multi-stage text-aware ์ด๋ฏธ์ง€ ํ•ฉ์„ฑ ํŒŒ์ดํ”„๋ผ์ธ์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

์„ธ ๊ฐ€์ง€ ๋ณด์™„์  ์ „๋žต:

  1. Pure Rendering in Simple Backgrounds: the most direct and effective method for training character recognition and generation.
  2. Compositional Rendering in Contextual Scenes: inserts synthetic text into realistic visual contexts, mimicking how it appears in everyday settings.
  3. Complex Rendering in Structured Templates: a synthesis strategy based on programmatic editing of predefined templates, intended to improve the model's ability to follow complex, structured prompts.
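
As referenced in the list above, the first strategy is easy to picture: paint strings onto plain canvases to manufacture perfectly labeled (text, image) pairs. A toy version using PIL follows; the font path is an assumption, and any CJK-capable font would do for Chinese characters.

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_sample(text, size=(512, 512),
                       font_path="NotoSansSC-Regular.otf"):
    """Toy 'pure rendering in simple backgrounds': draw a string on a
    plain white canvas, yielding an exact text-image training pair."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, 48)
    draw.text((32, size[1] // 2), text, fill="black", font=font)
    return img
```

Rare characters can be oversampled simply by choosing which strings to render, which is what makes this strategy effective against the long-tail problem.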

  4. ํ›ˆ๋ จ

4.1 Pre-training

Qwen-Image๋ฅผ pre-trainํ•˜๊ธฐ ์œ„ํ•ด flow matching ํ›ˆ๋ จ ๋ชฉํ‘œ๋ฅผ ์ฑ„ํƒํ•˜์—ฌ ordinary differential equation(ODE)์„ ํ†ตํ•œ ์•ˆ์ •์ ์ธ ํ•™์Šต dynamics๋ฅผ ์ด‰์ง„ํ•˜๋ฉด์„œ maximum likelihood ๋ชฉํ‘œ์™€์˜ ๋™๋“ฑ์„ฑ์„ ๋ณด์กดํ•ฉ๋‹ˆ๋‹ค.

ํ›ˆ๋ จ ๊ณผ์ •:

  • ์ž…๋ ฅ ์ด๋ฏธ์ง€์˜ latent z=E(x)z = E(x)z=E(x) (VAE encoder๋ฅผ ํ†ตํ•ด)
  • random noise vector x1โˆผN(0,I)x_1 \sim N(0,I)x1โ€‹โˆผN(0,I)์—์„œ ์ƒ˜ํ”Œ๋ง
  • ์‚ฌ์šฉ์ž ์ž…๋ ฅ SSS์— ๋Œ€ํ•ด guidance latent h=ฯ•(S)h = \phi(S)h=ฯ•(S) (MLLM์—์„œ)
  • diffusion timestep ttt๋ฅผ logit-normal distribution์—์„œ ์ƒ˜ํ”Œ๋ง

์†์‹ค ํ•จ์ˆ˜:
L=E(x0,h)โˆผD,x1,tโˆฅvฮธ(xt,t,h)โˆ’vtโˆฅ2L = E_{(x_0,h)\sim D,x_1,t} |v_\theta(x_t, t, h) - v_t|^2L=E(x0โ€‹,h)โˆผD,x1โ€‹,tโ€‹โˆฅvฮธโ€‹(xtโ€‹,t,h)โˆ’vtโ€‹โˆฅ2
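
A minimal PyTorch-style sketch of this objective, assuming the linear (rectified-flow) interpolation path implied by the velocity target above:

```python
import torch

def flow_matching_loss(v_theta, x0, h):
    """Flow-matching objective (sketch): x0 is the image latent z = E(x),
    h is the MLLM guidance latent, v_theta predicts the path velocity."""
    x1 = torch.randn_like(x0)                      # noise sample x1 ~ N(0, I)
    t = torch.sigmoid(torch.randn(x0.shape[0]))    # logit-normal timesteps
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))       # broadcast over latent dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the linear path
    vt = x1 - x0                                   # target velocity dx_t/dt
    return ((v_theta(xt, t, h) - vt) ** 2).mean()  # MSE regression
```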

4.1.1 Producer-Consumer Framework

๋Œ€๊ทœ๋ชจ GPU cluster๋กœ ํ™•์žฅํ•  ๋•Œ ๋†’์€ throughput๊ณผ ํ›ˆ๋ จ ์•ˆ์ •์„ฑ์„ ๋ชจ๋‘ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ์™€ ๋ชจ๋ธ ํ›ˆ๋ จ์„ ๋ถ„๋ฆฌํ•˜๋Š” Ray์—์„œ ์˜๊ฐ์„ ๋ฐ›์€ Producer-Consumer ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์ฑ„ํƒํ•ฉ๋‹ˆ๋‹ค.

Producer ์ธก๋ฉด:

  • raw ์ด๋ฏธ์ง€-caption ์Œ์„ ์‚ฌ์ „ ์ •์˜๋œ ๊ธฐ์ค€์— ๋”ฐ๋ผ ํ•„ํ„ฐ๋ง
  • MLLM ๋ชจ๋ธ๊ณผ VAE๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ latent representation์œผ๋กœ ์ธ์ฝ”๋”ฉ
  • ์ฒ˜๋ฆฌ๋œ ์ด๋ฏธ์ง€๋ฅผ ํ•ด์ƒ๋„๋ณ„๋กœ ๋น ๋ฅธ ์•ก์„ธ์Šค cache bucket์— ๊ทธ๋ฃนํ™”

Consumer side:

  • Deployed on the GPU-intensive cluster
  • Dedicated solely to model training
  • Distributes MMDiT parameters in a 4-way tensor-parallel layout
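
As referenced above, the pattern is an ordinary bounded-queue pipeline; the toy sketch below uses Python's `multiprocessing` in place of Ray-style actors, with `encode_fn` and `train_step` as hypothetical stand-ins for the MLLM/VAE encoding and the MMDiT update.

```python
import multiprocessing as mp

def producer(raw_batches, encode_fn, queue):
    """CPU side: filter/encode raw pairs into latents and enqueue them,
    so trainers never stall on preprocessing."""
    for batch in raw_batches:
        queue.put(encode_fn(batch))
    queue.put(None)                       # end-of-stream sentinel

def consumer(queue, train_step):
    """GPU side: dedicated solely to training on pre-encoded batches."""
    while (batch := queue.get()) is not None:
        train_step(batch)

if __name__ == "__main__":
    q = mp.Queue(maxsize=8)               # bounded cache between the sides
    data = [list(range(4)) for _ in range(10)]
    p = mp.Process(target=producer, args=(data, sum, q))  # toy encode_fn
    p.start()
    consumer(q, lambda b: print("step on", b))
    p.join()
```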

4.1.2 ๋ถ„์‚ฐ ํ›ˆ๋ จ ์ตœ์ ํ™”

Qwen-Image ๋ชจ๋ธ์˜ ํฐ parameter ํฌ๊ธฐ๋ฅผ ๊ณ ๋ คํ•˜์—ฌ FSDP๋งŒ์œผ๋กœ๋Š” ๊ฐ GPU์— ๋ชจ๋ธ์„ ๋งž์ถ”๊ธฐ์— ๋ถˆ์ถฉ๋ถ„ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ํ›ˆ๋ จ์„ ์œ„ํ•ด Megatron-LM์„ ํ™œ์šฉํ•˜๊ณ  ๋‹ค์Œ ์ตœ์ ํ™”๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค:

Hybrid Parallelism Strategy: a hybrid parallelism strategy combining data parallelism and tensor parallelism was adopted.

Distributed Optimizer and Activation Checkpointing: a distributed optimizer and activation checkpointing were explored to relieve GPU memory pressure.
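
For intuition, the hybrid DP × TP layout can be expressed as a 2-D device mesh. The snippet below uses PyTorch's `init_device_mesh` purely as an illustration of the factorization: the 4-way TP matches the text, the 32-GPU total is an assumption, and the paper's actual setup is Megatron-LM.

```python
from torch.distributed.device_mesh import init_device_mesh

# Run under torchrun with 32 processes: 8 data-parallel groups x 4-way
# tensor parallelism.
mesh = init_device_mesh("cuda", (8, 4), mesh_dim_names=("dp", "tp"))
dp_group = mesh.get_group("dp")   # gradient all-reduce happens here
tp_group = mesh.get_group("tp")   # weight-shard communication happens here
```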

4.1.3 ํ›ˆ๋ จ ์ „๋žต

๋ฐ์ดํ„ฐ ํ’ˆ์งˆ, ์ด๋ฏธ์ง€ ํ•ด์ƒ๋„, ๋ชจ๋ธ ์„ฑ๋Šฅ์„ ์ ์ง„์ ์œผ๋กœ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•˜๋Š” multi-stage pre-training ์ „๋žต์„ ์ฑ„ํƒํ•ฉ๋‹ˆ๋‹ค:

  1. ํ•ด์ƒ๋„ ํ–ฅ์ƒ: 256ร—256 pixel โ†’ 640ร—640 pixel โ†’ 1328ร—1328 pixel
  2. ํ…์ŠคํŠธ ๋ Œ๋”๋ง ํ†ตํ•ฉ: Non-text โ†’ Text
  3. ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ๊ฐœ์„ : Massive Data โ†’ Refined Data
  4. ๋ฐ์ดํ„ฐ ๋ถ„ํฌ ๊ท ํ˜•: Unbalanced โ†’ Balanced
  5. ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•: Real-World Data โ†’ Synthetic Data

4.2 Post-training

Qwen-Image๋ฅผ ์œ„ํ•œ post-training ํ”„๋ ˆ์ž„์›Œํฌ๋Š” supervised fine-tuning(SFT)๊ณผ reinforcement learning(RL)์˜ ๋‘ ๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค.

4.2.1 Supervised Fine-Tuning (SFT)

SFT ๋‹จ๊ณ„์—์„œ๋Š” semantic ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๊ณ„์ธต์ ์œผ๋กœ ๊ตฌ์„ฑ๋œ dataset์„ ๊ตฌ์ถ•ํ•˜๊ณ  ์„ธ์‹ฌํ•œ ์ธ๊ฐ„ ์ฃผ์„์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์˜ ํŠน์ • ๋‹จ์ ์„ ํ•ด๊ฒฐํ•ฉ๋‹ˆ๋‹ค.

4.2.2 Reinforcement Learning (RL)

๋‘ ๊ฐ€์ง€ ์„œ๋กœ ๋‹ค๋ฅธ RL ์ „๋žต์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

(A) Direct Preference Optimization (DPO)

  • Excels at one-step flow-matching preference modeling over offline preference pairs
  • Computationally efficient

(B) Group Relative Policy Optimization (GRPO)

  • ํ›ˆ๋ จ ์ค‘ on-policy sampling ์ˆ˜ํ–‰
  • reward model๋กœ ๊ฐ trajectory ํ‰๊ฐ€
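
As referenced above, here is a hedged sketch of the DPO branch. It follows the Diffusion-DPO formulation as an assumption about how preference pairs are scored for a flow-matching model; the exact objective used by Qwen-Image is not reproduced here.

```python
import torch.nn.functional as F

def flow_dpo_loss(err_win, err_lose, ref_err_win, ref_err_lose, beta=1.0):
    """Each `err` is a flow-matching MSE on the preferred (win) or
    rejected (lose) sample, under the trained policy or a frozen
    reference. Lowering error on winners relative to the reference,
    more than on losers, drives the loss down."""
    margin = (err_win - ref_err_win) - (err_lose - ref_err_lose)
    return -F.logsigmoid(-beta * margin).mean()
```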

4.3 Multi-task Training

text-to-image(T2I) ์ƒ์„ฑ ์™ธ์—๋„, text์™€ image ์ž…๋ ฅ์„ ๋ชจ๋‘ ํฌํ•จํ•˜๋Š” multimodal ์ด๋ฏธ์ง€ ์ƒ์„ฑ ํƒœ์Šคํฌ๋ฅผ ํƒ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด base model์„ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค.

ํฌํ•จ๋œ ํƒœ์Šคํฌ:

  • instruction-based ์ด๋ฏธ์ง€ ํŽธ์ง‘
  • novel view synthesis
  • depth estimation๊ณผ ๊ฐ™์€ computer vision ํƒœ์Šคํฌ
  1. ์‹คํ—˜

5.1 ์ธ๊ฐ„ ํ‰๊ฐ€

Qwen-Image์˜ ์ผ๋ฐ˜์ ์ธ ์ด๋ฏธ์ง€ ์ƒ์„ฑ ๋Šฅ๋ ฅ์„ ์ข…ํ•ฉ์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜๊ณ  state-of-the-art closed-source API์™€ ๊ฐ๊ด€์ ์œผ๋กœ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด Elo rating system์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ตฌ์ถ•๋œ ์˜คํ”ˆ ๋ฒค์น˜๋งˆํ‚น ํ”Œ๋žซํผ์ธ AI Arena๋ฅผ ๊ฐœ๋ฐœํ–ˆ์Šต๋‹ˆ๋‹ค.

AI Arena ํŠน์ง•:

  • ๊ณต์ •ํ•˜๊ณ  ๋™์ ์ธ ์˜คํ”ˆ ๊ฒฝ์Ÿ ํ”Œ๋žซํผ
  • ๊ฐ ๋ผ์šด๋“œ์—์„œ ๊ฐ™์€ ํ”„๋กฌํ”„ํŠธ๋กœ ์ƒ์„ฑ๋œ ๋‘ ์ด๋ฏธ์ง€๋ฅผ ์ต๋ช…์œผ๋กœ ์‚ฌ์šฉ์ž์—๊ฒŒ ์ œ์‹œ
  • 5,000๊ฐœ์˜ ๋‹ค์–‘ํ•œ ํ”„๋กฌํ”„ํŠธ ํ๋ ˆ์ด์…˜
  • 200๋ช… ์ด์ƒ์˜ ๋‹ค์–‘ํ•œ ์ „๋ฌธ ๋ฐฐ๊ฒฝ์„ ๊ฐ€์ง„ ํ‰๊ฐ€์ž ์ฐธ์—ฌ
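
As referenced above, arena-style leaderboards typically rest on the standard Elo update, sketched below (the K-factor is a conventional choice, not a value reported by the paper):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Standard Elo update: score_a is 1 for a win by A, 0.5 for a tie,
    0 for a loss; ratings move by K times the surprise of the outcome."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```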

๊ฒฝ์Ÿ์ž:

  • Imagen 4 Ultra Preview 0606
  • Seedream 3.0
  • GPT Image 1 [High]
  • FLUX.1 Kontext [Pro]
  • Ideogram 3.0

๊ฒฐ๊ณผ: Qwen-Image๋Š” ์œ ์ผํ•œ ์˜คํ”ˆ์†Œ์Šค ์ด๋ฏธ์ง€ ์ƒ์„ฑ ๋ชจ๋ธ๋กœ์„œ AI Arena์—์„œ 3์œ„๋ฅผ ์ฐจ์ง€ํ–ˆ์Šต๋‹ˆ๋‹ค.

5.2 ์ •๋Ÿ‰์  ๊ฒฐ๊ณผ

5.2.1 VAE Reconstruction Performance

์—ฌ๋Ÿฌ state-of-the-art ์ด๋ฏธ์ง€ tokenizer๋ฅผ ์ •๋Ÿ‰์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜์—ฌ reconstruction ํ’ˆ์งˆ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด Peak Signal-to-Noise Ratio(PSNR)์™€ Structural Similarity Index Measure(SSIM)๋ฅผ ๋ณด๊ณ ํ•ฉ๋‹ˆ๋‹ค.

Table 2 ๊ฒฐ๊ณผ: Qwen-Image-VAE๋Š” ํ‰๊ฐ€๋œ ๋ชจ๋“  ๋ฉ”ํŠธ๋ฆญ์—์„œ state-of-the-art reconstruction ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

5.2.2 Text-to-Image Generation Performance

๋‘ ๊ฐ€์ง€ ๊ด€์ ์—์„œ Qwen-Image์˜ T2I ํƒœ์Šคํฌ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค: ์ผ๋ฐ˜์ ์ธ ์ƒ์„ฑ ๋Šฅ๋ ฅ๊ณผ ํ…์ŠคํŠธ ๋ Œ๋”๋ง ๋Šฅ๋ ฅ.

์ฃผ์š” ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ:

  • DPG: Qwen-Image๊ฐ€ ๊ฐ€์žฅ ๋†’์€ ์ „์ฒด ์ ์ˆ˜ ๋‹ฌ์„ฑ (88.32)
  • GenEval: RL ๊ฐœ์„  ํ›„ 0.91 ์ ์ˆ˜๋กœ 0.9 ์ž„๊ณ„๊ฐ’์„ ์ดˆ๊ณผํ•˜๋Š” ์œ ์ผํ•œ foundation model
  • OneIG-Bench: ์ค‘๊ตญ์–ด์™€ ์˜์–ด ํŠธ๋ž™ ๋ชจ๋‘์—์„œ ๊ฐ€์žฅ ๋†’์€ ์ „์ฒด ์ ์ˆ˜
  • ChineseWord: ๋ชจ๋“  ์„ธ ๋‹จ๊ณ„์—์„œ ๊ฐ€์žฅ ๋†’์€ ๋ Œ๋”๋ง ์ •ํ™•๋„
  • LongText-Bench: ์ค‘๊ตญ์–ด ๊ธด ํ…์ŠคํŠธ์—์„œ ๊ฐ€์žฅ ๋†’์€ ์ •ํ™•๋„, ์˜์–ด ๊ธด ํ…์ŠคํŠธ์—์„œ ๋‘ ๋ฒˆ์งธ๋กœ ๋†’์€ ์ •ํ™•๋„

5.2.3 ์ด๋ฏธ์ง€ ํŽธ์ง‘ ์„ฑ๋Šฅ

text์™€ image๋ฅผ conditioning ์ž…๋ ฅ์œผ๋กœ ๋งค๋„๋Ÿฝ๊ฒŒ ํ†ตํ•ฉํ•˜๋Š” Qwen-Image์˜ multi-task ๋ฒ„์ „์„ ์ด๋ฏธ์ง€ ํŽธ์ง‘(TI2I) ํƒœ์Šคํฌ๋ฅผ ์œ„ํ•ด ์ถ”๊ฐ€๋กœ ํ›ˆ๋ จํ–ˆ์Šต๋‹ˆ๋‹ค.

์ฃผ์š” ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ:

  • GEdit: ์˜์–ด์™€ ์ค‘๊ตญ์–ด leaderboard ๋ชจ๋‘์—์„œ 1์œ„
  • ImgEdit: ์ „์ฒด์ ์œผ๋กœ ๊ฐ€์žฅ ๋†’์€ ์ˆœ์œ„
  • Novel view synthesis: GSO dataset์—์„œ ๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” ๊ฒฐ๊ณผ
  • Depth Estimation: ์—ฌ๋Ÿฌ key metric์—์„œ state-of-the-art ์„ฑ๋Šฅ

5.3 ์ •์„ฑ์  ๊ฒฐ๊ณผ

5.3.1 Qualitative Results on VAE Reconstruction

Figure 17์€ state-of-the-art ์ด๋ฏธ์ง€ VAE๋“ค๋กœ ํ…์ŠคํŠธ๊ฐ€ ํ’๋ถ€ํ•œ ์ด๋ฏธ์ง€๋ฅผ reconstructionํ•œ ์ •์„ฑ์  ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์šฐ๋ฆฌ ๊ฒฐ๊ณผ์—์„œ โ€œdouble-aspectโ€๋ผ๋Š” ๊ตฌ๋ฌธ์ด ๋ช…ํ™•ํ•˜๊ฒŒ ์ฝ์„ ์ˆ˜ ์žˆ๊ฒŒ ๋‚จ์•„์žˆ๋Š” ๋ฐ˜๋ฉด, ๋‹ค๋ฅธ ๋ชจ๋ธ๋“ค์˜ reconstruction์—์„œ๋Š” ์ธ์‹ํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

5.3.2 ์ด๋ฏธ์ง€ ์ƒ์„ฑ์—์„œ์˜ ์ •์„ฑ์  ๊ฒฐ๊ณผ

Qwen-Image์˜ text-to-image ์ƒ์„ฑ ๋Šฅ๋ ฅ์„ ์ข…ํ•ฉ์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ๋„ค ๊ฐ€์ง€ ์ธก๋ฉด์—์„œ ์ •์„ฑ์  ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

  1. ์˜์–ด ํ…์ŠคํŠธ ๋ Œ๋”๋ง: ๋” ํ˜„์‹ค์ ์ธ ์‹œ๊ฐ์  ์Šคํƒ€์ผ๊ณผ ๋” ๋‚˜์€ ๋ Œ๋”๋ง ํ’ˆ์งˆ
  2. ์ค‘๊ตญ์–ด ํ…์ŠคํŠธ ๋ Œ๋”๋ง: ์˜ˆ์ƒ๋˜๋Š” ์ค‘๊ตญ์–ด couplet์„ ์ •ํ™•ํ•˜๊ฒŒ ์ƒ์„ฑ
  3. Multi-Object ์ƒ์„ฑ: ๋ชจ๋“  ํ•„์š”ํ•œ ๋™๋ฌผ์„ ์ •ํ™•ํ•˜๊ฒŒ ์ƒ์„ฑํ•˜๊ณ  ์ง€์ •๋œ ์œ„์น˜๋ฅผ ์ถฉ์‹คํžˆ ๋ณด์กด
  4. ๊ณต๊ฐ„ ๊ด€๊ณ„ ์ƒ์„ฑ: ๋ณต์žกํ•œ ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ดํ•ดํ•˜๊ณ  ์ •ํ™•ํ•˜๊ฒŒ ๋”ฐ๋ฅด๋Š” ๊ฐ•๋ ฅํ•œ ๋Šฅ๋ ฅ

5.3.3 ์ด๋ฏธ์ง€ ํŽธ์ง‘์—์„œ์˜ ์ •์„ฑ์  ๊ฒฐ๊ณผ

Qwen-Image์˜ ์ด๋ฏธ์ง€ ํŽธ์ง‘(TI2I) ๋Šฅ๋ ฅ์„ ์ข…ํ•ฉ์ ์œผ๋กœ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์„ฏ ๊ฐ€์ง€ ์ฃผ์š” ์ธก๋ฉด์— ์ดˆ์ ์„ ๋งž์ถ˜ ์ •์„ฑ์  ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

  1. ํ…์ŠคํŠธ ๋ฐ ์žฌ๋ฃŒ ํŽธ์ง‘: ๋›ฐ์–ด๋‚œ ์žฌ๋ฃŒ ๋ Œ๋”๋ง ๋ฐ instruction-following ๋Šฅ๋ ฅ
  2. ๊ฐ์ฒด ์ถ”๊ฐ€/์ œ๊ฑฐ/๊ต์ฒด: ํŽธ์ง‘๋˜์ง€ ์•Š์€ ์˜์—ญ ๋ณด์กด์—์„œ ์ผ๋ฐ˜์ ์œผ๋กœ ์ข‹์€ ์„ฑ๋Šฅ
  3. ์ž์„ธ ์กฐ์ž‘: pose ํŽธ์ง‘ ์ค‘ ์„ธ๋ถ€์‚ฌํ•ญ๊ณผ ์ผ๊ด€์„ฑ ๋ณด์กด์—์„œ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ
  4. ์—ฐ์‡„ ํŽธ์ง‘: ์ „์ฒด ํŽธ์ง‘ ์ฒด์ธ์„ ํ†ตํ•ด ๊ตฌ์กฐ์  ํŠน์ง• ๋ณด์กด
  5. Novel View Synthesis: ๋ณต์žกํ•œ ํŽธ์ง‘ ํƒœ์Šคํฌ์—์„œ ๋›ฐ์–ด๋‚œ ๊ณต๊ฐ„ ๋ฐ semantic coherence

  6. ๊ฒฐ๋ก 

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋ณต์žกํ•œ ํ…์ŠคํŠธ ๋ Œ๋”๋ง๊ณผ ์ •๋ฐ€ํ•œ ์ด๋ฏธ์ง€ ํŽธ์ง‘์—์„œ ์ฃผ์š”ํ•œ ์ง„์ „์„ ๋‹ฌ์„ฑํ•œ Qwen ์‹œ๋ฆฌ์ฆˆ์˜ ์ด๋ฏธ์ง€ ์ƒ์„ฑ foundation model์ธ Qwen-Image๋ฅผ ์†Œ๊ฐœํ–ˆ์Šต๋‹ˆ๋‹ค. ํฌ๊ด„์ ์ธ ๋ฐ์ดํ„ฐ ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌ์ถ•ํ•˜๊ณ  progressive curriculum learning ์ „๋žต์„ ์ฑ„ํƒํ•จ์œผ๋กœ์จ Qwen-Image๋Š” ์ƒ์„ฑ๋œ ์ด๋ฏธ์ง€ ๋‚ด์—์„œ ๋ณต์žกํ•œ ํ…์ŠคํŠธ๋ฅผ ๋ Œ๋”๋งํ•˜๋Š” ๋Šฅ๋ ฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.

๊ฐœ์„ ๋œ multi-task training paradigm๊ณผ dual-encoding ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ํ†ตํ•ด ์ด๋ฏธ์ง€ ํŽธ์ง‘์˜ ์ผ๊ด€์„ฑ๊ณผ ํ’ˆ์งˆ์„ ํ˜„์ €ํžˆ ํ–ฅ์ƒ์‹œ์ผœ semantic coherence์™€ visual fidelity๋ฅผ ๋ชจ๋‘ ํšจ๊ณผ์ ์œผ๋กœ ๊ฐœ์„ ํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ณต๊ฐœ benchmark์—์„œ์˜ ๊ด‘๋ฒ”์œ„ํ•œ ์‹คํ—˜์€ ๋‹ค์–‘ํ•œ ์ด๋ฏธ์ง€ ์ƒ์„ฑ ๋ฐ ํŽธ์ง‘ ํƒœ์Šคํฌ์—์„œ Qwen-Image์˜ state-of-the-art ์„ฑ๋Šฅ์„ ์ผ๊ด€๋˜๊ฒŒ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๋” ๊นŠ์€ ์˜๋ฏธ์™€ ์ค‘์š”์„ฑ:

  • ์ด๋ฏธ์ง€ โ€œ์ƒ์„ฑโ€ ๋ชจ๋ธ๋กœ์„œ์˜ Qwen-Image: ๋‹จ์ˆœํžˆ photorealism์ด๋‚˜ ๋ฏธ์  ํ’ˆ์งˆ์„ ์ตœ์ ํ™”ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€ ๊ฐ„์˜ ์ •ํ™•ํ•œ ์ •๋ ฌ, ํŠนํžˆ ํ…์ŠคํŠธ ๋ Œ๋”๋ง์˜ ์–ด๋ ค์šด ํƒœ์Šคํฌ๋ฅผ ๊ฐ•์กฐํ•ฉ๋‹ˆ๋‹ค.
  • ์ด๋ฏธ์ง€ โ€œ์ƒ์„ฑโ€ ๋ชจ๋ธ๋กœ์„œ์˜ Qwen-Image: generative framework๊ฐ€ ๊ณ ์ „์ ์ธ ์ดํ•ด ํƒœ์Šคํฌ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
  • โ€œ์ด๋ฏธ์ง€โ€ ์ƒ์„ฑ ๋ชจ๋ธ๋กœ์„œ์˜ Qwen-Image: 2D ์ด๋ฏธ์ง€ ํ•ฉ์„ฑ์„ ๋„˜์–ด์„  ๊ฐ•๋ ฅํ•œ ์ผ๋ฐ˜ํ™”๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
  • โ€œ์‹œ๊ฐ์  ์ƒ์„ฑโ€ ๋ชจ๋ธ๋กœ์„œ์˜ Qwen-Image: ํ†ตํ•ฉ๋œ ์ดํ•ด์™€ ์ƒ์„ฑ์˜ ๋น„์ „์„ ๋ฐœ์ „์‹œํ‚ต๋‹ˆ๋‹ค.

Qwen-Image๋Š” ๋‹จ์ˆœํžˆ state-of-the-art ์ด๋ฏธ์ง€ ์ƒ์„ฑ ๋ชจ๋ธ ์ด์ƒ์ž…๋‹ˆ๋‹ค. multimodal foundation model์„ ๊ฐœ๋…ํ™”ํ•˜๊ณ  ๊ตฌ์ถ•ํ•˜๋Š” ๋ฐฉ์‹์˜ ํŒจ๋Ÿฌ๋‹ค์ž„ ์ „ํ™˜์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ๊ธฐ์ˆ ์  benchmark๋ฅผ ๋„˜์–ด์„  ๊ธฐ์—ฌ๋ฅผ ํ†ตํ•ด generative model์ด perception, ์ธํ„ฐํŽ˜์ด์Šค ์„ค๊ณ„, ์ธ์ง€ ๋ชจ๋ธ๋ง์—์„œ ๋งก๋Š” ์—ญํ• ์„ ์žฌ๊ณ ํ•˜๋„๋ก ์ปค๋ฎค๋‹ˆํ‹ฐ์— ๋„์ „์žฅ์„ ๋‚ด๋ฐ‰๋‹ˆ๋‹ค.


