[Lecture Notes] RAG From Scratch: Query Indexing Techniques

Posted by Euisuk's Dev Log on September 14, 2024

Original post: https://velog.io/@euisuk-chung/RAG-From-Scratch-12-14

  • This blog post covers Parts 12 - 14 of the RAG From Scratch : Coursework lecture series.
Video | Summary | Lecture link | Slides
Part 12 (Multi-Representation Indexing) | Discusses how to index document summaries for efficient retrieval while keeping them linked to the full documents for complete context. | 📌 Lecture | 📖 References
Part 13 (RAPTOR) | Introduces RAPTOR, a technique that captures high-level concepts through document summarization and clustering. | 📌 Lecture | 📖 References
Part 14 (ColBERT) | Explores ColBERT for enhanced token-based retrieval within a RAG framework. | 📌 Lecture | 📖 References

Installing the required packages

# ! pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain youtube-transcript-api pytube

Part 12 (Indexing / Multi-Representation Indexing)

  • Following the stages of the RAG pipeline figure, here is a brief description of the role of each step:

    1. Question: the natural-language question the user enters into the system. This is the starting point of the whole process.
    2. Query Translation: converts the user's natural-language question into a form the system can work with, using natural language processing techniques.
    3. Routing: directs the translated query to the appropriate processing path or data source. The best way to handle the question is chosen based on its characteristics.
    4. Query Construction: builds, from the routed information, a query in a form that the actual database or search engine can execute.
    5. Indexing (this chapter 📌): structures and organizes the data in a database or document collection so that it can be searched efficiently. This is mostly done when the system is built.
    6. Retrieval: uses the constructed query to pull relevant information out of the indexed data. This step finds the information most relevant to the question.
    7. Generation: produces an answer to the question from the retrieved information, mainly using natural language generation.
    8. Answer: the final generated answer is returned to the user as a natural-language response to the original question.
  • This lecture covers Indexing, and in particular the concept of Multi-Representation Indexing.
  • Multi-Representation Indexing is an important technique for retrieving information efficiently from a vector store.

    • It enables well-targeted document retrieval for natural-language questions and is especially useful with long-context LLMs.

1. What is Indexing?

  • Indexing is the process of storing documents or data and preparing them so they can be retrieved later.
    • A vector store keeps documents as vectors: the main features of each document are indexed in vector form, and related documents are later found through similarity search.
    • For example, the important keywords of a document are extracted, converted into a vector, and stored; at query time, documents whose keywords are similar to the question are retrieved.
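The store-then-search idea above can be sketched in plain Python. This is a toy illustration rather than the lecture's code: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `ToyVectorStore` is an illustrative name, not a library class.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Stores (vector, text) pairs and returns the texts most similar to a query."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def add_texts(self, texts: list[str]) -> None:
        # Indexing step: embed each text once and keep the vector next to it.
        self.items.extend((embed(t), t) for t in texts)

    def similarity_search(self, query: str, k: int = 1) -> list[str]:
        # Retrieval step: embed the query and rank stored texts by similarity.
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "agents use short-term and long-term memory",
    "human annotators label training data",
])
top_hit = store.similarity_search("memory in agents", k=1)
```

A real vector store would replace `embed` with a learned embedding model and the linear scan with an approximate nearest-neighbor index, but the two phases (index once, search many times) stay the same.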

2. What is Multi-Representation Indexing?

  • Multi-Representation Indexing stores and retrieves documents using more than one representation of each document.
    • Rather than simply splitting a document and storing the pieces, it uses an LLM to summarize the document and stores the summary as a vector.
    • At query time, the search runs against the summaries, and the full document is returned.

  • Main steps:

    1. Summarize: summarize each original document into a summary that contains its important keywords.
    2. Index the summaries: convert the summaries into vectors and store them in a vector store.
    3. Store the full documents: keep the original documents in a separate document store (Doc Store) so the original text can be returned after the search.
    4. Retrieve: when the user asks a question, find the summary most similar to the question in the vector store, then return the original document linked to that summary from the document store.
  • Searching over summaries and then returning the full document improves retrieval quality and works well for long documents.
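The four steps above can be sketched without any external libraries. This is a minimal sketch under stated assumptions: `summarize` is a hypothetical stand-in for the LLM call (it keeps only the first sentence), `overlap` is a toy relevance score used in place of vector similarity, and the two sample documents are made up for illustration.

```python
import uuid


def summarize(doc: str) -> str:
    """Hypothetical stand-in for an LLM summary: keep only the first sentence."""
    return doc.split(".")[0]


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


docs = [
    "Agents combine planning memory and tool use. The rest of the post goes into detail.",
    "Human data quality depends on annotators. The rest of the post goes into detail.",
]

# Steps 1-3: summarize, index the summaries, and keep the originals in a doc store.
doc_ids = [str(uuid.uuid4()) for _ in docs]
summary_index = [(summarize(d), doc_id) for d, doc_id in zip(docs, doc_ids)]
docstore = dict(zip(doc_ids, docs))


def retrieve(query: str) -> str:
    """Step 4: match the query against summaries, then return the linked full document."""
    _, best_id = max(summary_index, key=lambda pair: overlap(query, pair[0]))
    return docstore[best_id]


full_doc = retrieve("memory in agents")
```

The key design point is the shared ID: the summary is what gets searched, but the ID stored next to it is what lets the retriever hand back the untruncated original.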

3. Advantages of Multi-Representation Indexing

  • Fast search: because the summaries are what gets indexed as vectors, search is fast.
  • Accurate document return: after the search, the entire original document is returned, giving enough context to answer the question.
  • Integration with LLMs: the LLM summarizes each document and the summary is used for search, so retrieval is driven by the document's core content and becomes more accurate.

4. Code walkthrough

  • The following example loads two blog posts, summarizes them, and performs Multi-Representation Indexing.

4.1 Loading web documents and generating summaries

  • The code below loads two documents from the web. These documents are later summarized, and the summaries are stored in a vector store.
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

len(docs) # 2 

4.2 Summarizing documents with an LLM

  • This code shows how the documents are summarized with an LLM. Each document is turned into a summary, and the summaries are later stored in the vector store.
import uuid
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(model="gpt-3.5-turbo", max_retries=0)  # the original post used a preconfigured Azure chat model object (Azure_Chat) here
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})
summaries
  • In chain.batch(docs, {"max_concurrency": 5}), the setting max_concurrency: 5 means the following:

    1. Number of concurrent documents: at most 5 documents are processed in parallel at a time.
    2. Throughput: running up to 5 calls concurrently shortens the total processing time compared with handling the documents one by one.
    3. Resource management: capping concurrency limits system resource usage and helps prevent overload.
  • In this example there are only 2 documents, so even with max_concurrency: 5 the actual processing goes as follows:

    1. Concurrent processing: both documents are processed at the same time. Although max_concurrency is set to 5, only 2 calls run in parallel because there are only 2 documents.
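The effect of a concurrency cap can be illustrated with the standard library. This is a rough analogy, not LangChain's implementation: `fake_summarize` is a hypothetical stand-in for one LLM call, and `batch` is an illustrative helper that mimics the behavior described above with a thread pool.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

active = 0          # number of calls running right now
peak = 0            # highest value `active` ever reached
lock = threading.Lock()


def fake_summarize(doc: str) -> str:
    """Hypothetical stand-in for one LLM call; tracks how many calls overlap."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.05)  # simulate network latency
    with lock:
        active -= 1
    return doc[:10]


def batch(docs: list[str], max_concurrency: int) -> list[str]:
    # At most `max_concurrency` calls run at once; results keep the input order.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fake_summarize, docs))


summaries = batch(["first document text", "second document text"], max_concurrency=5)
```

With two inputs, `peak` can never exceed 2 no matter how large the cap is, which matches the point made above: the cap is an upper bound, not a target.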

4.3 Setting up the vector store and the document store

  • Here the summaries go into a vector store, while the original documents go into a separate document store, so that a search can return the originals.
from langchain.storage import InMemoryByteStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever

# Vector store that holds the summaries
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=OpenAIEmbeddings())

# Document store that holds the original documents
store = InMemoryByteStore()

# Metadata key that links each summary to its original document
id_key = "doc_id"

# Retriever that searches the summaries and returns the linked originals
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)


4.4 Retrieving and returning documents

  • This code searches the summaries for the query "Memory in agents" and returns the original document linked to the matching summary.
# Link each summary to its original document through a shared ID
doc_ids = [str(uuid.uuid4()) for _ in docs]
summary_docs = [Document(page_content=s, metadata={id_key: doc_ids[i]}) for i, s in enumerate(summaries)]

# Add the summaries to the vector store
retriever.vectorstore.add_documents(summary_docs)

# Add the original documents to the document store
retriever.docstore.mset(list(zip(doc_ids, docs)))

# Search query
query = "Memory in agents"

# Search the summaries directly (returns the matching summary)
sub_docs = vectorstore.similarity_search(query, k=1)

# Search through the retriever (returns the linked original document)
retrieved_docs = retriever.get_relevant_documents(query, n_results=1)
retrieved_docs[0].page_content[0:500]

Summary

  • Multi-Representation Indexing searches quickly over summaries and returns the original document linked to the matching summary. It improves retrieval quality and handles long-context documents efficiently. The approach is especially useful for vector-store-based document retrieval, where an LLM is used to summarize the documents that are then searched.

Part 13 (RAPTOR)

  • RAPTOR is a hierarchical indexing technique used to retrieve information efficiently from large documents or text collections.

1. The RAPTOR concept

  • RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) summarizes a set of documents over several levels to form a tree. This helps extract information that fits different kinds of questions.
    • When a question needs detailed information, lower nodes (individual documents or small chunks) are retrieved; when a question is about higher-level concepts, upper nodes (document summaries) are retrieved.

2. Hierarchical abstraction

  • The core idea of RAPTOR is hierarchical abstraction.
    • Documents or data are split into many small pieces, groups of pieces are summarized at the next level up, and the process repeats until a top-level summary of the whole corpus is produced.
    • The resulting summary tree makes it possible to pull information at the level of detail the question calls for.
    • Detailed questions draw on individual documents or small chunks, while broader questions draw on the summaries at higher levels.

3. How RAPTOR works

To explain RAPTOR in more depth, the code's workflow can be broken down into the following concrete steps.

  • 3.1. Loading and vectorizing documents:

    • First load the documents' text, then vectorize it. Vectorization converts text into numbers so that it can be computed on. In this step each document is represented as a high-dimensional vector that reflects the semantic similarity between words and sentences. The vectors are used later for clustering and retrieval.

      docs = loader.load()  # load the documents
      docs_texts = [d.page_content for d in docs]  # extract the text
      text_embeddings = embd.embed_documents(docs_texts)  # embed the documents
      
  • 3.2. Clustering and summarization:

    • Documents are clustered by similarity. Clustering groups similar documents together; a Gaussian Mixture Model (GMM) is used to determine the number of clusters and to decide which cluster each document belongs to. The documents in each cluster are then summarized to capture the cluster's main content concisely.

      df_clusters, df_summary = embed_cluster_summarize_texts(docs_texts, level=1)

    • The embed_cluster_summarize_texts function

      def embed_cluster_summarize_texts(
          texts: List[str], level: int
      ) -> Tuple[pd.DataFrame, pd.DataFrame]:

      • Purpose: embeds the given list of texts, clusters them, and summarizes the documents within each cluster. It returns the clustered document information and a summary for each cluster.
      • Arguments:

        • texts: a list of documents, each represented as a string. This is the text data to be clustered.
        • level: a parameter indicating the depth of clustering and summarization. It increases as the recursive summarization proceeds.
      • Returns:

        • df_clusters: a DataFrame holding each document with its embedding and cluster assignment.
        • df_summary: a DataFrame holding the summary of each cluster.
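To make the GMM step more concrete, here is a small pure-Python sketch of expectation-maximization for a Gaussian mixture. It makes several simplifying assumptions that are not in the lecture code: each "document" is a single number instead of a high-dimensional embedding, the number of components is fixed at two rather than selected automatically, and `fit_gmm_1d` is an illustrative name.

```python
import math


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs: list[float], n_iter: int = 50) -> list[list[float]]:
    """Fit a two-component 1-D Gaussian mixture with EM; return membership probabilities."""
    means = [min(xs), max(xs)]           # deterministic initialization
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp: list[list[float]] = []
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in xs:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300     # guard against division by zero
            resp.append([pi / total for pi in p])
        # M-step: re-estimate the weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk, 1e-3
            )
    return resp


# One number per "document" stands in for its embedding.
points = [0.1, 0.3, 0.2, 9.8, 10.1, 10.0]
probs = fit_gmm_1d(points)
labels = [max(range(2), key=lambda k: row[k]) for row in probs]
```

Unlike hard clustering, a GMM yields a probability per cluster for every document, so a pipeline can choose to place a document in more than one cluster when its probabilities are close.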
  • 3.3. Recursive summarization:

    • After the per-cluster summaries are produced once, they are themselves clustered and summarized at the next level up, and the process repeats. This exposes the hierarchical structure of the documents and builds progressively higher-level summaries, which helps in understanding the corpus as a whole.

      results = recursive_embed_cluster_summarize(docs_texts, level=1, n_levels=3)

    • The recursive_embed_cluster_summarize function

      def recursive_embed_cluster_summarize(
          texts: List[str], level: int = 1, n_levels: int = 3
      ) -> Dict[int, Tuple[pd.DataFrame, pd.DataFrame]]:

      • Purpose: recursively clusters and summarizes the given list of texts. By default it recurses up to 3 levels and stores the clusters and summaries produced at each level.
      • Arguments:

        • texts: the list of texts to cluster and summarize.
        • level: the current recursion level. The default is 1, and it increases with each recursive call.
        • n_levels: the maximum recursion depth. The default is 3, so clustering and summarization are repeated for at most 3 levels.
      • Returns:

        • A Dict holding the cluster and summary information for each recursion level.
        • The result for each level is a tuple (df_clusters, df_summary).
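The recursion described above can be sketched with toy stand-ins. Everything here is an assumption made for illustration: `cluster` simply groups neighboring texts instead of embedding them and fitting a GMM, `summarize` replaces the LLM call, plain lists replace the DataFrames, and the stopping rule (stop when only one cluster remains) is a guess at a sensible termination condition.

```python
from typing import Dict, List, Tuple


def cluster(texts: List[str], size: int = 2) -> List[List[str]]:
    """Hypothetical stand-in for embedding + GMM clustering: group neighbors."""
    return [texts[i:i + size] for i in range(0, len(texts), size)]


def summarize(texts: List[str]) -> str:
    """Hypothetical stand-in for an LLM summary: join the first word of each text."""
    return " ".join(t.split()[0] for t in texts)


def recursive_cluster_summarize(
    texts: List[str], level: int = 1, n_levels: int = 3
) -> Dict[int, Tuple[List[List[str]], List[str]]]:
    """Cluster and summarize, then recurse on the summaries."""
    clusters = cluster(texts)
    summaries = [summarize(c) for c in clusters]
    results = {level: (clusters, summaries)}
    # Recurse while levels remain and there is still more than one cluster to merge.
    if level < n_levels and len(clusters) > 1:
        results.update(recursive_cluster_summarize(summaries, level + 1, n_levels))
    return results


leaf_texts = ["alpha one", "beta two", "gamma three", "delta four"]
tree = recursive_cluster_summarize(leaf_texts)

# Collapse the tree: the leaves plus every level's summaries go into one flat list.
all_texts = list(leaf_texts)
for lvl in sorted(tree):
    all_texts.extend(tree[lvl][1])
```

Flattening the leaves and all summaries into one list is what lets a single vector index answer both detailed and high-level questions.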
  • 3.4. Implementing retrieval:

    • The vectorized texts are indexed in a vector store so that relevant documents can be retrieved when the user asks a question. This lets the user find the information they want quickly within a large body of documents.

      vectorstore = Chroma.from_texts(texts=all_texts, embedding=embd)
      retriever = vectorstore.as_retriever()

    • The vector store holds and manages the document vectors and returns the documents most relevant to a new input question.
    • The retriever finds similar documents at query time by computing distances between vectors and returning the closest ones.
  • 3.5. Building the RAG chain:

    • A RAG (Retrieval-Augmented Generation) chain is built to generate answers from the retrieved documents. RAG is an approach that grounds generation in retrieved documents, which makes the answers more specific and better matched to the context.

      rag_chain = (
          {"context": retriever | format_docs, "question": RunnablePassthrough()}
          | prompt
          | model
          | StrOutputParser()
      )
      rag_chain.invoke("How to define a RAG chain?")

    • Given a question, the chain combines the content of the retrieved documents with the question and generates an answer based on that content.
    • StrOutputParser converts the model's output message into a plain string.
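The data flow of the chain above can be mimicked with ordinary functions. This is a toy sketch, not LangChain code: `retriever`, `prompt`, `model`, and `parse_str` are hypothetical stand-ins, the two-sentence corpus is made up, and `format_docs` shows one plausible definition of the helper that the chain refers to.

```python
from typing import List


def retriever(question: str) -> List[str]:
    """Hypothetical retriever: returns stored texts that share a word with the question."""
    corpus = [
        "A RAG chain pipes retrieved context and the question into a prompt.",
        "Gaussian mixtures group similar documents.",
    ]
    words = set(question.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]


def format_docs(docs: List[str]) -> str:
    """Join the retrieved documents into one context string."""
    return "\n\n".join(docs)


def prompt(inputs: dict) -> str:
    """Fill a simple prompt template with the context and the question."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def model(prompt_text: str) -> dict:
    """Hypothetical LLM: echoes the first context line as its answer message."""
    return {"content": prompt_text.splitlines()[1]}


def parse_str(message: dict) -> str:
    """Plays the role of StrOutputParser: pull the text out of the message."""
    return message["content"]


def rag_chain(question: str) -> str:
    # The question feeds two branches: the retriever branch and a pass-through branch.
    inputs = {"context": format_docs(retriever(question)), "question": question}
    return parse_str(model(prompt(inputs)))


answer = rag_chain("How to define a RAG chain?")
```

The dictionary at the start of the chain is the part worth noticing: the same question is sent both to the retriever (to build the context) and straight through (to fill the question slot of the prompt).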

Part 14 (ColBERT)

  • This lecture covers ColBERT, a different approach to retrieval. ColBERT addresses limitations of conventional embedding-based retrieval and delivers better document search quality. The sections below look at the concept and at how to use it.

1. The ColBERT concept

  • The following package must be installed first:
! pip install -U ragatouille

1.1 Limitations of conventional embedding methods

  • Conventional document retrieval embeds an entire document into a single vector and searches with K-Nearest Neighbors (KNN). The whole document is compressed into one vector, the question is vectorized in the same way, and the most similar document is found by comparing the two vectors.
    • The limitation is that compressing a whole document into a single vector can lose much of its nuance.

1.2 ColBERT's solution

  • ColBERT (Contextualized Late Interaction over BERT) addresses this by not collapsing the document into one vector. Instead, it splits the document into tokens and produces a separate embedding for each token.
  • The question is embedded in the same way, and the similarity between each question token and every document token is computed. Each question token is matched to its most similar document token, and the sum of these maximum similarities is used as the overall similarity between the document and the question.

2. How ColBERT works

2.1 Tokenization and embedding

  • When embedding a document, ColBERT tokenizes it instead of compressing the whole document into one vector. Tokens correspond to the document's words or other meaningful units.
  • A separate vector (embedding) is produced for each token, and each token's positional information is reflected in the process.

2.2 Processing the question

  • The question is tokenized and embedded in the same way as the document. An embedding vector is produced for each question token, ready for comparison against the document.

2.3 Computing similarity

  • For each question token, the similarity to every document token is computed. The most similar document token is found for each question token, and that similarity value is kept.
  • After repeating this for all question tokens, the sum of the maximum similarities is used as the final score.
  • Instead of compressing the whole document into one vector, this approach preserves fine-grained detail and therefore gives more accurate search results.
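The scoring rule described above, score(q, d) = sum over query tokens of the maximum similarity to any document token, can be written in a few lines. This is a minimal sketch with hypothetical 2-D token vectors and dot-product similarity; real token embeddings come from a BERT-style encoder. It also contrasts the rule with a single-vector baseline that averages the token vectors, which is an assumption chosen for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def mean_vector(tokens: List[Vector]) -> List[float]:
    """Single-vector baseline: average all token vectors into one."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


# Hypothetical 2-D token embeddings.
query = [[1.0, 0.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # contains an exact match for the query token
doc_b = [[0.5, 0.5], [0.5, 0.5]]   # same average as doc_a, but no exact match

pooled_a = dot(mean_vector(query), mean_vector(doc_a))
pooled_b = dot(mean_vector(query), mean_vector(doc_b))
late_a = maxsim_score(query, doc_a)
late_b = maxsim_score(query, doc_b)
```

Here the pooled scores are identical for both documents, while the token-level score ranks `doc_a` higher, which is the loss of nuance that late interaction is meant to avoid.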

3. Characteristics of ColBERT

  • Because ColBERT captures fine-grained document detail, it performs well on long documents and complex questions.
    • However, computing similarities between all token pairs can make it slow, which may cause performance problems in settings that need real-time responses.

4. ColBERT implementation example

  • Let's now look at example code that uses ColBERT.

4.1 Load Pretrained Model

from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

4.2 Load Docs from Wikipedia

  • This code fetches Hayao Miyazaki's Wikipedia page, then tokenizes and indexes it with the ColBERT model.
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}

    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extracting page content
    page = next(iter(data["query"]["pages"].values()))
    return page["extract"] if "extract" in page else None

# Get Document
full_document = get_wikipedia_page("Hayao_Miyazaki")

# Create Index
RAG.index(
    collection=[full_document],
    index_name="Miyazaki-123",
    max_document_length=180,
    split_documents=True,
)
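The index call above passes max_document_length=180 together with split_documents=True, so the long Wikipedia page is broken into shorter passages before indexing. The sketch below illustrates that idea only: it counts whitespace-separated words as a stand-in for model tokens, and `split_document` is a hypothetical helper, not the splitter RAGatouille actually uses.

```python
from typing import List


def split_document(text: str, max_length: int) -> List[str]:
    """Split text into consecutive chunks of at most `max_length` whitespace words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_length])
        for start in range(0, len(words), max_length)
    ]


sample = " ".join(f"w{i}" for i in range(400))   # a 400-word placeholder document
chunks = split_document(sample, max_length=180)
chunk_sizes = [len(c.split()) for c in chunks]
```

Each resulting passage is indexed separately, so a search returns the passages that match rather than the whole page.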

4.3 Search Query

  • Now the indexed document can be queried.
results = RAG.search(query="What animation studio did Miyazaki found?", k=3)

results
  • Running the code above retrieves the passages most relevant to the question "What animation studio did Miyazaki found?".

4.4 Merge with LangChain

  • ColBERT can also be used as a retriever inside LangChain to support more complex pipelines.
retriever = RAG.as_langchain_retriever(k=3)

retriever.invoke("What animation studio did Miyazaki found?")
  • This lets a LangChain pipeline use ColBERT for its retrieval step.


