[Paper Review] Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Posted by Euisuk's Dev Log on August 29, 2025

Original post: https://velog.io/@euisuk-chung/Paper-Review-Qwen-Audio-Advancing-Universal-Audio-Understanding-via-Unified-Large-Scale-Audio-Language-Models

https://arxiv.org/abs/2311.07919

Chu, Yunfei, et al. "Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models." arXiv preprint arXiv:2311.07919 (2023).

Abstract

Recently, instruction-following audio-language models have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” Qwen-Audio ๋ชจ๋ธ์„ ๊ฐœ๋ฐœํ•˜์—ฌ ์ด๋Ÿฌํ•œ ํ•œ๊ณ„๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค. ์ธ๊ฐ„ ์Œ์„ฑ, ์ž์—ฐ์Œ, ์Œ์•…, ๋…ธ๋ž˜ ๋“ฑ ๋‹ค์–‘ํ•œ ์˜ค๋””์˜ค ์œ ํ˜•์„ ํฌํ•จํ•˜์—ฌ 30๊ฐœ ์ด์ƒ์˜ ์ž‘์—…์„ ๋‹ค๋ฃจ๋Š” audio-language ์‚ฌ์ „ ํ›ˆ๋ จ์„ ํ™•์žฅํ•จ์œผ๋กœ์จ ๋ฒ”์šฉ ์˜ค๋””์˜ค ์ดํ•ด ๋Šฅ๋ ฅ์„ ์ด‰์ง„ํ•ฉ๋‹ˆ๋‹ค.

However, directly co-training all tasks and datasets can lead to interference problems, because the text labels associated with different datasets vary considerably due to differences in task focus, language, annotation granularity, and text structure.

To overcome this one-to-many interference, we carefully design a multi-task training framework that conditions the decoder on a sequence of hierarchical tags, encouraging knowledge sharing through shared tags while avoiding interference through task-specific tags.

์ฃผ๋ชฉํ•  ์ ์€ Qwen-Audio๊ฐ€ ์ž‘์—…๋ณ„ fine-tuning ์—†์ด๋„ ๋‹ค์–‘ํ•œ ๋ฒค์น˜๋งˆํฌ ์ž‘์—…์—์„œ ์ธ์ƒ์ ์ธ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์—ฌ ๊ธฐ์กด ๋ชจ๋ธ๋“ค์„ ๋Šฅ๊ฐ€ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. Qwen-Audio์˜ ๊ธฐ๋Šฅ์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ์šฐ๋ฆฌ๋Š” ๋‹ค์–‘ํ•œ ์˜ค๋””์˜ค์™€ ํ…์ŠคํŠธ ์ž…๋ ฅ์„ ํ—ˆ์šฉํ•˜๊ณ  multi-turn dialogue๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋ฉฐ ๋‹ค์–‘ํ•œ ์˜ค๋””์˜ค ์ค‘์‹ฌ ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ์ง€์›ํ•˜๋Š” Qwen-Audio-Chat์„ ์ถ”๊ฐ€๋กœ ๊ฐœ๋ฐœํ–ˆ์Šต๋‹ˆ๋‹ค.

  1. Introduction

Large Language Models (LLMs) have greatly advanced the field of Artificial General Intelligence (AGI) thanks to their strong knowledge retention, complex reasoning, and problem-solving abilities. However, language models lack the human-like ability to perceive non-textual modalities such as images and audio.

์Œ์„ฑ์€ ์ค‘์š”ํ•œ modality๋กœ์„œ, ์ธ๊ฐ„ ์Œ์„ฑ์˜ ๊ฐ์ •, ํ†ค, ์˜๋„, ์ž์—ฐ์Œ์˜ ๊ธฐ์ฐจ ๊ธฐ์ , ์‹œ๊ณ„ ์ข…์†Œ๋ฆฌ, ์ฒœ๋‘ฅ, ๊ทธ๋ฆฌ๊ณ  ์Œ์•…์˜ ๋ฉœ๋กœ๋”” ๋“ฑ ํ…์ŠคํŠธ๋ฅผ ๋„˜์–ด์„œ๋Š” ๋‹ค์–‘ํ•˜๊ณ  ๋ณต์žกํ•œ ์‹ ํ˜ธ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. LLMs๊ฐ€ ์˜ค๋””์˜ค ์ƒํ˜ธ์ž‘์šฉ์„ ์œ„ํ•ด ํ’๋ถ€ํ•œ ์˜ค๋””์˜ค ์‹ ํ˜ธ๋ฅผ ์ธ์‹ํ•˜๊ณ  ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๊ฒƒ์€ ๊ด‘๋ฒ”์œ„ํ•œ ๊ด€์‹ฌ์„ ๋ฐ›๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

๊ธฐ์กด์˜ instruction following ์—ฐ๊ตฌ๋“ค์€ ์ฃผ๋กœ large (multimodal) LLMs์˜ ๋Šฅ๋ ฅ์„ ์ƒ์†๋ฐ›๊ณ  ๊ฐ€๋ฒผ์šด supervised fine-tuning์„ ์ฑ„ํƒํ•˜์—ฌ ์‚ฌ์šฉ์ž ์˜๋„์— ๋งž์ถฐ ๋ชจ๋ธ์˜ ๋Šฅ๋ ฅ์„ ํ™œ์„ฑํ™”ํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋Œ€๋ถ€๋ถ„์˜ ์—ฐ๊ตฌ๋“ค์€ ๋‹ค์–‘ํ•œ ์˜ค๋””์˜ค ์œ ํ˜•๊ณผ ์ž‘์—…์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ์‚ฌ์ „ ํ›ˆ๋ จ๋œ audio-language ๋ชจ๋ธ์˜ ๋ถ€์กฑ์œผ๋กœ ์ธํ•ด ์˜ค๋””์˜ค ์ƒํ˜ธ์ž‘์šฉ ๋Šฅ๋ ฅ ๋ฉด์—์„œ ์ œ์•ฝ์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค.

๊ธฐ์กด์˜ ๋Œ€ํ‘œ์ ์ธ audio-language multi-task language ๋ชจ๋ธ๋“ค์ธ SpeechNet, SpeechT5, VIOLA, Whisper, Pengi ๋“ฑ์€ ์ธ๊ฐ„ ์Œ์„ฑ์ด๋‚˜ ์ž์—ฐ์Œ๊ณผ ๊ฐ™์€ ํŠน์ • ์˜ค๋””์˜ค ์œ ํ˜• ์ฒ˜๋ฆฌ์— ์ œํ•œ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

To foster the growth and development of the audio-text multimodal community, we introduce Qwen-Audio, a large-scale audio-language model. Qwen-Audio is a multi-task language model conditioned on audio and text inputs that extends the Qwen-7B language model to effectively perceive audio signals through the connection of a single audio encoder.

์ฃผ๋กœ ์ธ๊ฐ„ ์Œ์„ฑ๊ณผ ๊ฐ™์€ ๋‹จ์ผ ์˜ค๋””์˜ค ์œ ํ˜•์— ์ดˆ์ ์„ ๋งž์ถ”๊ฑฐ๋‚˜ ์Œ์„ฑ ์ธ์‹ ๋ฐ ์บก์…˜๊ณผ ๊ฐ™์€ ํŠน์ • ์ž‘์—…์— ์ง‘์ค‘ํ•˜๊ฑฐ๋‚˜ ๋‹จ์ผ ์–ธ์–ด๋กœ ๋ชจ๋ธ์„ ์ œํ•œํ•˜๋Š” ์ด์ „ ์—ฐ๊ตฌ๋“ค๊ณผ ๋‹ฌ๋ฆฌ, ์šฐ๋ฆฌ๋Š” ๋ฒ”์šฉ ์˜ค๋””์˜ค ์ดํ•ด ๋Šฅ๋ ฅ ๋ฐœ์ „์„ ์œ„ํ•ด 8๊ฐœ ์–ธ์–ด์™€ ๋‹ค์–‘ํ•œ ์œ ํ˜•์˜ ์˜ค๋””์˜ค๋ฅผ ํฌํ•จํ•˜์—ฌ 30๊ฐœ ์ด์ƒ์˜ ์ž‘์—…์„ ๋‹ค๋ฃจ๋Š” ์ˆ˜์‹ญ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ›ˆ๋ จ์„ ํ™•์žฅํ–ˆ์Šต๋‹ˆ๋‹ค.

multi-task ํ•™์Šต์˜ ์ค‘์š”ํ•œ ๊ณผ์ œ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹๊ณผ ์—ฐ๊ด€๋œ ํ…์ŠคํŠธ ๋ ˆ์ด๋ธ”์˜ ์ƒ๋‹นํ•œ ๋ณ€ํ™”์—์„œ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ณ€ํ™”๋Š” ์ž‘์—… ์ดˆ์ , ์–ธ์–ด, ์ฃผ์„ ์„ธ๋ถ„ํ™” ๋ฐ ํ…์ŠคํŠธ ๊ตฌ์กฐ(๊ตฌ์กฐํ™” ๋˜๋Š” ๋น„๊ตฌ์กฐํ™”)์˜ ์ฐจ์ด์—์„œ ๋น„๋กฏ๋ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ one-to-many ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” ๊ณ„์ธต์  ํƒœ๊ทธ์˜ ์‹œํ€€์Šค๋ฅผ ์กฐ๊ฑด์œผ๋กœ ํ•˜๋Š” decoder๋ฅผ ํ†ตํ•ด ์ง€์‹ ๊ณต์œ ๋ฅผ ์žฅ๋ คํ•˜๊ณ  ๊ณต์œ  ํƒœ๊ทธ์™€ ํŠน์ • ํƒœ๊ทธ๋ฅผ ๊ฐ๊ฐ ํ†ตํ•ด ๊ฐ„์„ญ์„ ์™„ํ™”ํ•˜๋Š” multi-task ํ›ˆ๋ จ framework๋ฅผ ์‹ ์ค‘ํ•˜๊ฒŒ ์„ค๊ณ„ํ–ˆ์Šต๋‹ˆ๋‹ค.

In addition, we incorporate speech recognition with word-level timestamp prediction (SRWT) into training, a task commonly neglected in previous multi-task learning research. We find that this task not only improves grounding and grounding-based QA tasks for signals beyond speech, such as sounds and music, but also improves ASR performance.

Figure 1์—์„œ ๋ณด๋“ฏ์ด, ๊ด‘๋ฒ”์œ„ํ•œ ํ‰๊ฐ€๋ฅผ ํ†ตํ•ด Qwen-Audio๊ฐ€ ์ž‘์—…๋ณ„ fine-tuning ์—†์ด๋„ ๋‹ค์–‘ํ•œ ์ž‘์—… ๋ฒ”์œ„์—์„œ ์ด์ „ multi-task ํ›ˆ๋ จ ๋ชจ๋ธ๋“ค์„ ๋Šฅ๊ฐ€ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. Qwen-Audio์˜ ์ฃผ๋ชฉํ•  ๋งŒํ•œ ์„ฑ๊ณผ๋Š” Aishell1, cochlscene, ClothoAQA, VocalSound์˜ ํ…Œ์ŠคํŠธ ์…‹์—์„œ ์ตœ์ฒจ๋‹จ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

Qwen-Audio์˜ ๊ธฐ๋Šฅ์„ ํ™œ์šฉํ•˜์—ฌ, ์šฐ๋ฆฌ๋Š” supervised instruction fine-tuning์„ ํ†ตํ•ด Qwen-Audio-Chat์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” multi-turn dialogue์—์„œ ์˜ค๋””์˜ค์™€ ํ…์ŠคํŠธ modality ๋ชจ๋‘๋กœ๋ถ€ํ„ฐ ์œ ์—ฐํ•œ ์ž…๋ ฅ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋ฉฐ, ์ธ๊ฐ„ ์ง€์‹œ์‚ฌํ•ญ์— ๋”ฐ๋ฅธ ํšจ๊ณผ์ ์ธ ์ƒํ˜ธ์ž‘์šฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋ณธ ๋…ผ๋ฌธ์˜ ๊ธฐ์—ฌ๋„:

• We introduce Qwen-Audio, a fundamental multi-task audio-language model that serves as a universal audio understanding model supporting diverse tasks, languages, and audio types. Building on Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning to enable multi-turn dialogue and support diverse audio-oriented scenarios. Both the Qwen-Audio and Qwen-Audio-Chat models are open-sourced to foster the growth and development of the audio-text multimodal community.

โ€ข audio-language ์‚ฌ์ „ ํ›ˆ๋ จ์„ ํ™•์žฅํ•˜๊ธฐ ์œ„ํ•ด, ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹๊ณผ ์—ฐ๊ด€๋œ ํ…์ŠคํŠธ ๋ ˆ์ด๋ธ”์˜ ๋ณ€ํ™” ๋ฌธ์ œ๋ฅผ multi-task ํ›ˆ๋ จ framework๋ฅผ ์ œ์•ˆํ•˜์—ฌ ํ•ด๊ฒฐํ•˜๊ณ , ์ง€์‹ ๊ณต์œ ๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋ฉฐ one-to-many ๊ฐ„์„ญ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค. ์šฐ๋ฆฌ ๋ชจ๋ธ์€ 30๊ฐœ ์ด์ƒ์˜ ์ž‘์—…์„ ํ†ตํ•ฉํ•˜๋ฉฐ ๊ด‘๋ฒ”์œ„ํ•œ ์‹คํ—˜์„ ํ†ตํ•ด ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

โ€ข audio-language ์‚ฌ์ „ ํ›ˆ๋ จ์„ ์ด‰์ง„ํ•˜๊ธฐ ์œ„ํ•ด, ์˜ค๋””์˜ค multimodal ์—ฐ๊ตฌ ์ปค๋ฎค๋‹ˆํ‹ฐ์—์„œ ์ข…์ข… ๊ฐ„๊ณผ๋˜๋Š” SRWT ์ž‘์—…์„ ํ†ตํ•ฉํ•˜๋Š” ๊ฒƒ์ด ์Œ์„ฑ ์‹ ํ˜ธ๋ฅผ ๋„˜์–ด์„  grounding ๋ฐ grounding ๊ธฐ๋ฐ˜ ์งˆ๋ฌธ ๋‹ต๋ณ€ ์ž‘์—…๊ณผ ASR ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

• Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without any task-specific fine-tuning, surpassing existing models. In particular, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, CochlScene, ClothoAQA, and VocalSound.

  2. Related Work

Multi-task Audio-Text Learning

multi-task ํ›ˆ๋ จ์˜ ๋ชฉํ‘œ๋Š” ํ†ตํ•ฉ๋œ ๋ชจ๋ธ ์•„ํ‚คํ…์ฒ˜์™€ ๋ฐ์ดํ„ฐ ํ˜•์‹์„ ํ†ตํ•ด ์„œ๋กœ ๋‹ค๋ฅธ ์ž‘์—… ๊ฐ„์— ์ง€์‹์„ ์ „๋‹ฌํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์˜ค๋””์˜ค ์ฒ˜๋ฆฌ ์˜์—ญ์—์„œ๋Š” ์ธ๊ฐ„ ์Œ์„ฑ, ์ž์—ฐ์Œ, ์Œ์•…, ๋…ธ๋ž˜์™€ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ์˜ค๋””์˜ค ์‹ ํ˜ธ๊ฐ€ ์กด์žฌํ•˜๊ณ  ์ด๋“ค์˜ ๋ผ๋ฒจ๋ง ํ˜•์‹์ด ํฌ๊ฒŒ ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋“  ์˜ค๋””์˜ค ์ฒ˜๋ฆฌ ์ž‘์—…์„ ํ†ตํ•ฉํ•˜๋Š” ๊ฒƒ์ด ์–ด๋ ต์Šต๋‹ˆ๋‹ค.

SpeechNet and SpeechT5 treat human speech tasks in a speech/text-input to speech/text-output format and leverage a shared encoder-decoder framework for pre-training. Many works unify data formats and tasks by directly feeding speech representations or by encoding continuous speech signals into discrete codes, treating different human speech tasks as conditional generation tasks.

VoiceBox uses a non-autoregressive continuous normalizing flow model for human speech synthesis and speech editing tasks. Whisper proposes a template for multi-task training that accounts for the annotation granularity of datasets (with or without sentence-level timestamps) and the task type (human speech recognition or translation).

์ด์ „ ์—ฐ๊ตฌ๋“ค์€ ๋Œ€๋ถ€๋ถ„ ์Œ์„ฑ ์ธ์‹ ๋ฐ ๋ฒˆ์—ญ๊ณผ ๊ฐ™์€ ์ธ๊ฐ„ ์Œ์„ฑ ์ฒ˜๋ฆฌ ์ž‘์—…์—๋งŒ ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์ž์—ฐ์Œ์ด๋‚˜ ์Œ์•…๊ณผ ๊ฐ™์€ ๋‹ค๋ฅธ ์˜ค๋””์˜ค ์œ ํ˜•์„ ๋ฌด์‹œํ•ฉ๋‹ˆ๋‹ค. Pengi๋Š” ์ž์—ฐ์Œ ์ดํ•ด ์ž‘์—…์— ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์ด๋Ÿฌํ•œ ์ž‘์—…์„ ํ…์ŠคํŠธ ์ƒ์„ฑ ์ž‘์—…์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

๋ณธ ์—ฐ๊ตฌ์—์„œ Qwen-Audio๋Š” ์ธ๊ฐ„ ์Œ์„ฑ, ์ž์—ฐ์Œ, ์Œ์•…, ๋…ธ๋ž˜์™€ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ์˜ค๋””์˜ค ์œ ํ˜•์„ ํ†ตํ•ฉํ•˜๊ณ , ์ด์งˆ์ ์ธ ๋ฐ์ดํ„ฐ์—์„œ ์†Œ์‹ฑ๋˜๊ณ  ์„œ๋กœ ๋‹ค๋ฅธ ๋ผ๋ฒจ๋ง ์„ธ๋ถ„ํ™”๋ฅผ ํŠน์ง•์œผ๋กœ ํ•˜๋Š” ๋ฐ์ดํ„ฐ์…‹์—์„œ์˜ ๊ณต๋™ ํ›ˆ๋ จ์„ ์ด‰์ง„ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ํ†ตํ•ฉ๋œ ํ•™์Šต framework์˜ ๋„์ž…์„ ํ†ตํ•ด ๋‹ฌ์„ฑ๋ฉ๋‹ˆ๋‹ค.

Interact with LLMs through Multiple Modality

Recently, large language models such as ChatGPT have demonstrated impressive abilities in knowledge retention, reasoning, and coding following human instructions. To extend the reach of LLMs beyond pure text tasks, many LLM-based multimodal models have been developed.

์‹œ๊ฐ์  modality์˜ ๊ฒฝ์šฐ, GPT4, Flamingo, Kosmos, BLIP, Shikra, Emu, Qwen-VL ๋“ฑ์ด LLMs์— ๋Œ€ํ•œ ์ด๋ฏธ์ง€ ์ดํ•ด ๋˜๋Š” ์ƒ์„ฑ ๋Šฅ๋ ฅ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋‹ค์–‘ํ•œ ํ†ตํ•ฉ ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ–ˆ์Šต๋‹ˆ๋‹ค.

For the audio modality, there have been attempts such as AudioGPT and HuggingGPT to leverage LLMs as versatile interfaces while employing well-trained audio foundation models as tools. These efforts involve instructing LLMs to generate commands for controlling external tools, or to transcribe human speech into text before feeding it into the LLMs.

๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฌํ•œ ์ ‘๊ทผ ๋ฐฉ์‹๋“ค์€ ์ธ๊ฐ„ ์Œ์„ฑ์˜ ์šด์œจ(prosody)๊ณผ ๊ฐ์ •๊ณผ ๊ฐ™์€ ์ค‘์š”ํ•œ ์ •๋ณด์˜ ํฌํ•จ์ด ๋ถ€์กฑํ•˜๋ฉฐ, ํŠน์ • ๊ฒฝ์šฐ์—๋Š” ์ž์—ฐ์Œ๊ณผ ๊ฐ™์€ ๋น„ํ…์ŠคํŠธ ์˜ค๋””์˜ค๋ฅผ ๋ณ€ํ™˜ํ•˜๋Š” ๋ฐ ์‹คํŒจํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ LLMs์—์„œ ์Œ์„ฑ modality๋กœ์˜ ์ง€์‹ ์ „๋‹ฌ์— ์žฅ์• ๋ฌผ์ด ๋ฐœ์ƒํ•˜๊ณ , LLMs๋Š” ์˜ค๋””์˜ค ์‹ ํ˜ธ๋ฅผ ์ธ์‹ํ•˜๊ณ  ์ดํ•ดํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ๋Šฅ๋ ฅ์ด ๋ถ€์กฑํ•ฉ๋‹ˆ๋‹ค.

์ตœ๊ทผ์˜ ๋…ธ๋ ฅ๋“ค์€ ์ง์ ‘์ ์ธ ์Œ์„ฑ ์ƒํ˜ธ์ž‘์šฉ์„ ์œ„ํ•œ end-to-end audio-text LLMs ํ›ˆ๋ จ์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค. SpeechGPT๋Š” ๋จผ์ € ์ธ๊ฐ„ ์Œ์„ฑ์„ discrete HuBERT tokens๋กœ ๋ณ€ํ™˜ํ•˜๊ณ , paired speech ๋ฐ์ดํ„ฐ, speech instruction ๋ฐ์ดํ„ฐ, chain-of-modality instruction ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด 3๋‹จ๊ณ„ ํ›ˆ๋ จ ํŒŒ์ดํ”„๋ผ์ธ์„ ์„ค๊ณ„ํ•ฉ๋‹ˆ๋‹ค.

BLSP aligns representations by requiring the LLM to generate the same text continuation when given human speech as when given the corresponding transcript. LLaSM creates a large-scale speech instruction dataset by generating spoken questions with the Microsoft TTS API, and trains to enable end-to-end interaction between human speech and text.

LTU creates a 5M audio QA dataset and performs supervised fine-tuning (SFT) on the audio module and the LoRA adapters of LLaMA to improve the alignment between sound perception and reasoning.

SALMMON์€ ํ…์ŠคํŠธ encoder์™€ speech encoder๋ฅผ ๋ชจ๋‘ ํ™œ์šฉํ•˜์—ฌ ๋‹ค์–‘ํ•œ ์ข…๋ฅ˜์˜ ์˜ค๋””์˜ค์™€ ํ…์ŠคํŠธ ์ž…๋ ฅ์—์„œ representation์„ ์ถ”์ถœํ•˜๊ณ , Q-former ์Šคํƒ€์ผ attention์„ ํ†ตํ•ด ์ž˜ ํ›ˆ๋ จ๋œ LLM์— ์ž…๋ ฅ์„ ์—ฐ๊ฒฐํ•˜์—ฌ ์‘๋‹ต์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

๋ณธ ์—ฐ๊ตฌ์—์„œ Qwen-Audio๋Š” ํ…์ŠคํŠธ ๋Œ€ํ™” ๋Šฅ๋ ฅ์„ ๋ณด์กดํ•˜๋ฉด์„œ ์˜ค๋””์˜ค ์ž…๋ ฅ์„ ์ธ์‹ํ•˜๊ณ  ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋Š” ํ†ตํ•ฉ๋œ audio-text multi-task multilingual LLMs ํ›ˆ๋ จ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค. Qwen-Audio๋Š” ๋ชจ๋“  ์˜ค๋””์˜ค์— ๋Œ€ํ•ด ๋‹จ์ผ encoder๋ฅผ ์‚ฌ์šฉํ•˜๊ณ , ์ž์—ฐ์Œ ํƒ์ง€, ์ธ๊ฐ„ ์Œ์„ฑ ์ธ์‹ ๋ฐ grounding, ์˜ค๋””์˜ค ์บก์…˜ ์ž‘์—…๊ณผ ๊ฐ™์€ ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ์ง€์›ํ•˜๊ธฐ ์œ„ํ•ด ๋Œ€๊ทœ๋ชจ end-to-end ํ›ˆ๋ จ์„ ํ†ตํ•ด ์˜ค๋””์˜ค์™€ ํ…์ŠคํŠธ modality ๊ฐ„์˜ ๊ฒฉ์ฐจ๋ฅผ ์—ฐ๊ฒฐํ•ฉ๋‹ˆ๋‹ค.

  3. Methodology

์ด ์„น์…˜์€ ๋ฒ”์šฉ ์˜ค๋””์˜ค ์ดํ•ด์™€ ์ธ๊ฐ„ ์ง€์‹œ์‚ฌํ•ญ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ์œ ์—ฐํ•œ ์ƒํ˜ธ์ž‘์šฉ์„ ์œ„ํ•ด ์„ค๊ณ„๋œ Qwen-Audio์™€ Qwen-Audio-Chat์˜ ์„ธ๋ถ€์‚ฌํ•ญ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. Qwen-Audio์™€ Qwen-Audio-Chat์˜ ๋ชจ๋ธ ๊ตฌ์กฐ๋Š” ๋จผ์ € Section 3.1์—์„œ ์ œ์‹œ๋ฉ๋‹ˆ๋‹ค.

์šฐ๋ฆฌ ๋ชจ๋ธ์˜ ํ›ˆ๋ จ ๊ณผ์ •์€ ๋‘ ๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค: multitask pretraining๊ณผ supervised fine-tuning์ž…๋‹ˆ๋‹ค. Section 3.2์—์„œ๋Š” multitask ํ•™์Šต์„ ํ†ตํ•œ Qwen-Audio์˜ ํ›ˆ๋ จ์„ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ Section 3.3์—์„œ๋Š” ์œ ์—ฐํ•œ ์ธ๊ฐ„ ์ƒํ˜ธ์ž‘์šฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” supervised fine-tuning์„ ํ†ตํ•œ Qwen-Audio-Chat์„ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

3.1 Model Architecture

Qwen-Audio ๋ชจ๋ธ์˜ ์•„ํ‚คํ…์ฒ˜๋Š” Figure 3์— ๋ฌ˜์‚ฌ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. Qwen-Audio๋Š” audio encoder์™€ large language model์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. paired ๋ฐ์ดํ„ฐ (a, x)๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ์—ฌ๊ธฐ์„œ a์™€ x๋Š” ๊ฐ๊ฐ ์˜ค๋””์˜ค ์‹œํ€€์Šค์™€ ํ…์ŠคํŠธ ์‹œํ€€์Šค๋ฅผ ๋‚˜ํƒ€๋‚ด๋ฉฐ, ํ›ˆ๋ จ ๋ชฉํ‘œ๋Š” ๋‹ค์Œ ํ…์ŠคํŠธ ํ† ํฐ ํ™•๋ฅ ์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

$$P_\theta(x_t \mid x_{<t}, \text{Encoder}_\phi(a))$$

conditioned on the audio representation and the previous text sequence $x_{<t}$, where $\theta$ and $\phi$ denote the trainable parameters of the LLM and the audio encoder, respectively.
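As a quick illustration, below is a minimal PyTorch-style sketch of this objective; `audio_encoder` and `llm` are hypothetical stand-ins for $\text{Encoder}_\phi$ and the Qwen-7B decoder, not the authors' actual training code.

```python
import torch.nn.functional as F

def training_loss(audio_encoder, llm, audio, text_ids):
    """Next-token cross-entropy conditioned on audio features.

    audio_encoder and llm are hypothetical nn.Modules standing in for
    Encoder_phi and the Qwen-7B decoder that parameterizes P_theta.
    """
    audio_feats = audio_encoder(audio)           # audio representation, shape (B, T_a, d)
    # The decoder predicts each text token x_t from the prefix x_{<t},
    # attending to the audio features.
    logits = llm(text_ids[:, :-1], audio_feats)  # (B, T-1, vocab)
    targets = text_ids[:, 1:]                    # shifted next-token targets
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```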

Audio Encoder: Qwen-Audio employs a single audio encoder to process various types of audio. The audio encoder is initialized from the Whisper-large-v2 model, a 32-layer Transformer that includes two convolutional down-sampling layers as a stem. The audio encoder consists of 640M parameters.

Although Whisper is trained with supervision for speech recognition and translation, its encoded representations still contain rich information, such as background noise, and can even be used to recover the original speech.

์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ๋ฅผ ์ „์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด, Whisper๋Š” ์ด๋ฅผ 16kHz ์ฃผํŒŒ์ˆ˜๋กœ ๋ฆฌ์ƒ˜ํ”Œ๋งํ•˜๊ณ  25ms์˜ window size์™€ 10ms์˜ hop size๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ raw waveform์„ 80-channel mel-spectrogram์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ์˜ค๋””์˜ค representation์˜ ๊ธธ์ด๋ฅผ ์ค„์ด๊ธฐ ์œ„ํ•ด stride๊ฐ€ 2์ธ pooling ๋ ˆ์ด์–ด๊ฐ€ ํ†ตํ•ฉ๋ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ encoder ์ถœ๋ ฅ์˜ ๊ฐ ํ”„๋ ˆ์ž„์€ ์›๋ณธ ์˜ค๋””์˜ค ์‹ ํ˜ธ์˜ ์•ฝ 40ms ์„ธ๊ทธ๋จผํŠธ์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค. ํ›ˆ๋ จ ์‹œ์—๋Š” ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•์œผ๋กœ SpecAugment๊ฐ€ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

Large Language Model: Qwen-Audio incorporates a large language model as its foundational component. The model is initialized with pre-trained weights derived from Qwen-7B. Qwen-7B is a 32-layer Transformer decoder model with a hidden size of 4096, comprising 7.7B parameters in total.

3.2 Multitask Pretraining

์˜ค๋””์˜ค ์ฒ˜๋ฆฌ ์˜์—ญ์—์„œ Table 1์—์„œ ๋ณด๋“ฏ์ด ํŠน์ • ์ž‘์—…์„ ๋‹ค๋ฃจ๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ์…‹๋“ค์ด ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. Qwen-Audio๋Š” ๊ด‘๋ฒ”์œ„ํ•œ ์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ณต๋™ ํ›ˆ๋ จ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ๋ชฉํ‘œ๋Š” ๋ชจ๋“  ์˜ค๋””์˜ค ์ž‘์—…์„ ์ง€์›ํ•  ์ˆ˜ ์žˆ๋Š” ํ†ตํ•ฉ๋œ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•˜์—ฌ, ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ์ฒ˜๋ฆฌํ•  ๋•Œ ๋ฒˆ๊ฑฐ๋กœ์šด ๋ชจ๋ธ ์ „ํ™˜์˜ ํ•„์š”์„ฑ์„ ์—†์• ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

More importantly, tasks can benefit each other during co-training: 1) similar tasks can benefit from knowledge sharing and collaborative learning, since they share a common focus on the fundamental information embedded in the audio signal; 2) tasks that rely on lower-level perceptual abilities can assist tasks that require higher-level understanding or reasoning abilities.

However, different datasets exhibit considerable variation in their text labels due to differences in task focus, language, annotation granularity, and text structure. Simply mixing these diverse datasets to train the network on various tasks cannot lead to mutual improvement; instead, it introduces interference.

๊ธฐ์กด์˜ ๋Œ€๋ถ€๋ถ„์˜ multi-task ํ›ˆ๋ จ ์ ‘๊ทผ ๋ฐฉ์‹๋“ค์€ ์œ ์‚ฌํ•œ ์ž‘์—…๋“ค์„ ๊ทธ๋ฃนํ™”ํ•˜๊ฑฐ๋‚˜(์˜ˆ: ์˜ค๋””์˜ค ์บก์…˜, ์ „์‚ฌ) ๊ฐ„์„ญ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ ๋ฐ์ดํ„ฐ์…‹์— dataset ID๋ฅผ ํ• ๋‹นํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ์ ‘๊ทผ ๋ฐฉ์‹๋“ค์ด ์ผ์ •ํ•œ ํšจ๊ณผ๋ฅผ ๋‹ฌ์„ฑํ–ˆ์ง€๋งŒ, ์—ฌ์ „ํžˆ ์ƒ๋‹นํ•œ ๊ฐœ์„ ์˜ ์—ฌ์ง€๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

Whisper proposes a multitask training format that specifies tasks and conditioning information as a sequence of special input tokens to the language decoder, such as voice activity detection, language identification, and sentence-level timestamp tags. However, Whisper focuses only on speech translation and recognition tasks.

Multi-task Training Format Framework: Inspired by Whisper, to unify different kinds of audio, we propose the following multitask training format framework (a sketch of an assembled tag sequence follows the list below):

• Transcription Tag: The start of the prediction is marked with a transcription tag. <|startoftranscripts|> is used to denote tasks that accurately transcribe spoken words and capture the linguistic content of a speech recording, such as speech recognition and speech translation. For other tasks, the <|startofanalysis|> tag is used.

• Audio Language Tag: Next, a language tag indicating the language spoken in the audio is incorporated. This tag uses a unique token assigned to each of the eight languages present in the training set. For audio segments that contain no speech, such as natural sounds and music, the model is trained to predict an <|unknown|> token.

• Task Tag: The subsequent tokens specify the task. The collected audio tasks are categorized into five categories: <|transcribe|>, <|translate|>, <|caption|>, <|analysis|>, and <|question-answer|>. For question-answer (QA) tasks, the corresponding question is appended after the tag.

• Text Language Tag: A tag token specifies the language of the output text sequence.

• Timestamps Tag: The presence of a <|timestamps|> or <|notimestamps|> token determines whether the model must predict timestamps. Unlike the sentence-level timestamps used in Whisper, the <|timestamps|> tag requires the model to perform fine-grained word-level timestamp prediction, abbreviated as SRWT (Speech Recognition with Word-level Timestamps). The prediction of these timestamps is interleaved with the transcribed words: a start-time token is predicted before each transcribed token, and an end-time token is predicted after it. According to our experiments, SRWT improves the model's ability to align audio signals with timestamps. This improved alignment contributes to the model's comprehensive understanding of speech signals and brings notable advances in many tasks, such as speech recognition and audio QA.

• Output Instruction: Finally, an output instruction is provided to further specify the task and desired format for different subtasks, and then the text output begins.
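To make the tag hierarchy concrete, here is a small sketch that assembles such a prefix. The tag tokens are the ones listed above, but `build_prefix` itself, its argument names, and the time-token format in the SRWT example are illustrative assumptions rather than the authors' implementation.

```python
def build_prefix(transcription: bool, audio_lang: str, task: str,
                 text_lang: str, timestamps: bool, question: str = "") -> str:
    """Assemble a hypothetical hierarchical tag prefix for one sample."""
    parts = [
        "<|startoftranscripts|>" if transcription else "<|startofanalysis|>",
        f"<|{audio_lang}|>",   # audio language tag ("unknown" for non-speech audio)
        f"<|{task}|>",         # one of the five task categories
        question,              # appended only for question-answer tasks
        f"<|{text_lang}|>",    # output text language tag
        "<|timestamps|>" if timestamps else "<|notimestamps|>",
    ]
    return "".join(p for p in parts if p)

# English ASR with word-level timestamps (SRWT):
print(build_prefix(True, "en", "transcribe", "en", True))
# <|startoftranscripts|><|en|><|transcribe|><|en|><|timestamps|>

# Question answering over non-speech audio:
print(build_prefix(False, "unknown", "question-answer", "en", False,
                   question="What sound can be heard in the background?"))

# With <|timestamps|>, a start-time token precedes and an end-time token
# follows each transcribed word (the time-token format here is assumed):
srwt_target = "<|0.00|> Good <|0.32|> <|0.35|> morning <|0.71|>"
```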

์šฐ๋ฆฌ framework์˜ ์ง€๋„ ์›๋ฆฌ๋Š” ๊ณต์œ  ํƒœ๊ทธ๋ฅผ ํ†ตํ•ด ์œ ์‚ฌํ•œ ์ž‘์—…๋“ค ๊ฐ„์˜ ์ง€์‹ ๊ณต์œ ๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋™์‹œ์—, ์šฐ๋ฆฌ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์ž‘์—…๊ณผ ์ถœ๋ ฅ ํ˜•์‹์ด ๊ตฌ๋ณ„๋  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์—ฌ ๋ชจ๋ธ์˜ one-to-many ๋งคํ•‘ ๋ฌธ์ œ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค.

3.3 Supervised Fine-tuning

multitask ๋ชจ๋ธ์˜ ๊ด‘๋ฒ”์œ„ํ•œ ์‚ฌ์ „ ํ›ˆ๋ จ์€ ์˜ค๋””์˜ค์— ๋Œ€ํ•œ ๊ด‘๋ฒ”์œ„ํ•œ ์ดํ•ด๋กœ ๋ชจ๋ธ์„ ๊ฐ–์ถ”๊ฒŒ ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ, ์šฐ๋ฆฌ๋Š” instruction ๊ธฐ๋ฐ˜ fine-tuning ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์ธ๊ฐ„ ์˜๋„์— ๋งž์ถฐ ๋ชจ๋ธ์˜ ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œ์ผœ Qwen-Audio-Chat์ด๋ผ๋Š” ๋Œ€ํ™”ํ˜• ์ฑ„ํŒ… ๋ชจ๋ธ์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

To achieve this, we manually create demonstrations for each task. These demonstrations consist of raw text labels, questions, and answers. GPT-3.5 is then leveraged to generate additional questions and answers based on the provided raw text labels.

๋˜ํ•œ, ์ˆ˜๋™ ์ฃผ์„, ๋ชจ๋ธ ์ƒ์„ฑ ๋ฐ ์ „๋žต ์—ฐ๊ฒฐ์„ ์‚ฌ์šฉํ•˜์—ฌ audio-dialogue ๋ฐ์ดํ„ฐ์˜ ๋ฐ์ดํ„ฐ์…‹์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐ์ดํ„ฐ์…‹์€ ์ถ”๋ก , ์Šคํ† ๋ฆฌ ์ƒ์„ฑ ๋ฐ multi-image comprehension ๋Šฅ๋ ฅ์„ ๋ชจ๋ธ์— ํ†ตํ•ฉํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค.

To effectively handle multi-audio dialogue and multiple audio inputs, we introduce the convention of labeling different audios with "Audio id:", where id corresponds to the order of the audio in the dialogue input. In terms of dialogue format, the instruction tuning dataset is constructed using the ChatML format. In this format, each utterance in an interaction is marked with two special tokens (<im_start> and <im_end>) to facilitate dialogue termination.
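Below is a hypothetical two-turn training sample in this format. The <im_start>/<im_end> tokens and the "Audio 1:" labeling follow the description above, while the audio markup and the wording of the turns are illustrative assumptions.

```python
# A hypothetical ChatML-style sample; not taken from the actual dataset.
sample = """<im_start>user
Audio 1: <audio>path/to/clip.wav</audio>
What is the speaker saying in this audio?<im_end>
<im_start>assistant
The speaker says: "Good morning, everyone."<im_end>
<im_start>user
Is there any background music?<im_end>
<im_start>assistant
No, only a single speaker is audible, with no music in the background.<im_end>"""
```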

multi-turn dialogue ๋‚ด์—์„œ ์˜ค๋””์˜ค์™€ ์ˆœ์ˆ˜ ํ…์ŠคํŠธ modality ๋ชจ๋‘๋กœ๋ถ€ํ„ฐ ๋‹ค์–‘ํ•œ ์ž…๋ ฅ์„ ์ด‰์ง„ํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” ์ด ํ›ˆ๋ จ ๊ณผ์ •์—์„œ ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ audio-centric instruction ๋ฐ์ดํ„ฐ์™€ ์ˆœ์ˆ˜ ํ…์ŠคํŠธ instruction ๋ฐ์ดํ„ฐ์˜ ์กฐํ•ฉ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ ๋ฐฉ์‹์„ ํ†ตํ•ด ๋ชจ๋ธ์ด ๋‹ค์–‘ํ•œ ํ˜•ํƒœ์˜ ์ž…๋ ฅ์„ ์›ํ™œํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. instruction tuning ๋ฐ์ดํ„ฐ์˜ ์ด๋Ÿ‰์€ 20k์ž…๋‹ˆ๋‹ค.

  4. Experiments

4.1 Setup

multi-task ์‚ฌ์ „ ํ›ˆ๋ จ์˜ ๊ฒฝ์šฐ, LLM์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณ ์ •ํ•˜๊ณ  audio encoder๋งŒ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค. ์ด ํ›ˆ๋ จ๋œ ๋ชจ๋ธ์„ Qwen-Audio๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ํ›„์† supervised fine-tuning ๋‹จ๊ณ„์—์„œ๋Š” audio encoder์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณ ์ •ํ•˜๊ณ  LLM๋งŒ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ ๋ชจ๋ธ์€ Qwen-Audio-Chat์œผ๋กœ ํ‘œ์‹œ๋ฉ๋‹ˆ๋‹ค. ๋‘ ๋‹จ๊ณ„ ๋ชจ๋‘์˜ ์ž์„ธํ•œ ํ›ˆ๋ จ ๊ตฌ์„ฑ์€ Table 6์— ๋‚˜์—ด๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

4.2 Evaluation

Qwen-Audio์˜ ๋ฒ”์šฉ ์ดํ•ด ๋Šฅ๋ ฅ์„ ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•ด, Table 2์—์„œ ๋ณด๋“ฏ์ด Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Automatic Audio Captioning (AAC), Acoustic Scene Classification (ASC), Speech Emotion Recognition (SER), Audio Question and Answering (AQA), Vocal Sound Classification (VSC), Music Note Analysis (MNA)๋ฅผ ํฌํ•จํ•œ ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ํฌ๊ด„ํ•˜๋Š” ์ข…ํ•ฉ์ ์ธ ํ‰๊ฐ€๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

์ด ํ‰๊ฐ€๋Š” 12๊ฐœ์˜ ๋ฐ์ดํ„ฐ์…‹์— ๊ฑธ์ณ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค. ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ์…‹๋“ค์€ ๋ฐ์ดํ„ฐ ๋ˆ„์ถœ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์—์„œ ์—„๊ฒฉํ•˜๊ฒŒ ์ œ์™ธ๋ฉ๋‹ˆ๋‹ค.

4.3 Main Results

์ด ์„น์…˜์—์„œ๋Š” ์ž‘์—…๋ณ„ fine-tuning ์—†์ด ๋‹ค์–‘ํ•œ ์ž‘์—…์— ๊ฑธ์นœ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•˜๋Š” Qwen-Audio ๋ชจ๋ธ์˜ ์ข…ํ•ฉ์ ์ธ ํ‰๊ฐ€๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค.

Examining first the English Automatic Speech Recognition (ASR) results depicted in Table 3, Qwen-Audio shows superior performance compared to previous multi-task learning models. Specifically, it achieves WERs of 2.0% and 4.2% on the LibriSpeech test-clean and test-other datasets, respectively.

Similarly, the Chinese Mandarin ASR results demonstrate Qwen-Audio's competitive performance compared to previous approaches. To the best of our knowledge, Qwen-Audio achieves state-of-the-art results on the Aishell1 dev and test sets.

Additionally, Qwen-Audio's speech translation performance is evaluated on the CoVoST2 dataset. The results show that Qwen-Audio outperforms the baselines by a substantial margin across all seven translation directions.

Finally, Qwen-Audio's performance is analyzed across various audio analysis tasks, including AAC, SRWT, ASC, SER, AQA, VSC, and MNA, as summarized in Table 3. On these tasks, Qwen-Audio consistently outperforms the baselines by a large margin. In particular, it achieves state-of-the-art results on CochlScene, ClothoAQA, and VocalSound, demonstrating the model's strong audio understanding abilities.

4.4 Results of Interactive Chat

Qwen-Audio-Chat's conversational abilities are demonstrated through the example cases depicted in Figure 2. In addition, the authors plan to provide public access to the trained model for online chat interaction.

4.5 The Analysis of Word-level Timestamps Prediction

We propose the task of speech recognition with word-level timestamps (SRWT) by training Qwen-Audio not only to recognize speech transcripts but also to predict the timestamp of each word. The purpose of SRWT is twofold: first, to improve the model's ability to align audio signals with fine-grained timestamps; second, to support grounding of speech and audio, and grounding-based QA tasks, in Qwen-Audio-Chat.

์ด ์„น์…˜์—์„œ๋Š” ๋‹ค๋ฅธ ์ž‘์—…๋“ค์€ ๊ทธ๋Œ€๋กœ ์œ ์ง€ํ•˜๋ฉด์„œ multitask pretraining์—์„œ SRWT ์ž‘์—…์˜ ํ›ˆ๋ จ์„ ์ œ์™ธํ•ฉ๋‹ˆ๋‹ค. ์ฃผ๋ชฉํ•  ์ ์€ SRWT ์ œ๊ฑฐ๊ฐ€ SRWT ์ž‘์—…์ด automatic speech recognition (ASR) ์ž‘์—…๊ณผ ๋™์ผํ•œ ์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ์…‹์„ ๊ณต์œ ํ•˜๋ฏ€๋กœ ํ›ˆ๋ จ์„ ์œ„ํ•œ ์˜ค๋””์˜ค ๋ฐ์ดํ„ฐ์…‹ ์ปค๋ฒ„๋ฆฌ์ง€์— ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š๋Š”๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

The results are shown in Table 4 and Table 5: models trained with SRWT achieve superior performance on automatic speech recognition and on audio question answering tasks, including natural sound QA and Music QA. These results highlight the effectiveness of incorporating fine-grained word-level timestamps for enhancing general audio signal grounding abilities and, in turn, the performance of sound and music signal QA tasks.

  5. Conclusion

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋ฒ”์šฉ ์˜ค๋””์˜ค ์ดํ•ด ๋Šฅ๋ ฅ์„ ๊ฐ–์ถ˜ ๋Œ€๊ทœ๋ชจ audio-language ๋ชจ๋ธ ์„ธํŠธ์ธ Qwen-Audio ์‹œ๋ฆฌ์ฆˆ๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. ๊ณต๋™ ํ›ˆ๋ จ์„ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ์ข…๋ฅ˜์˜ ์˜ค๋””์˜ค๋ฅผ ํ†ตํ•ฉํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” ์œ ์‚ฌํ•œ ์ž‘์—…๋“ค ๊ฐ„์˜ ์ง€์‹ ๊ณต์œ ๋ฅผ ์ด‰์ง„ํ•˜๊ณ  ์„œ๋กœ ๋‹ค๋ฅธ ํ…์ŠคํŠธ ํ˜•์‹์œผ๋กœ ์ธํ•œ one-to-many ๋งคํ•‘ ๋ฌธ์ œ๋ฅผ ๋ฐฉ์ง€ํ•˜๋Š” ํ†ตํ•ฉ๋œ multi-task ํ•™์Šต framework๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

Without any task-specific fine-tuning, the resulting Qwen-Audio models outperform previous works across diverse benchmarks, demonstrating universal audio understanding abilities. Through supervised instruction fine-tuning, Qwen-Audio-Chat exhibits strong capabilities aligned with human intent, supporting multilingual and multi-turn dialogue from both audio and text inputs.

  6. Acknowledgements

The authors express their gratitude to Jinze Bai, Shuai Bai, Peng Wang, Sinan Tan, and Shijie Wang for insightful discussions, and thank Juan Zhu, Junyang Lin, Siqi Zheng, Jiaming Wang, and Zhihao Du for their support of this project.


This work represents a significant advance in the field of audio-language modeling, contributing in particular to the development of universal audio understanding abilities through multi-task learning and word-level timestamp prediction.


