[Lecture Notes] Text Splitting For Retrieval

Posted by Euisuk's Dev Log on September 16, 2024

Original post: https://velog.io/@euisuk-chung/The-5-Levels-Of-Text-Splitting-For-Retrieval-강의요약

Introduction

One of the most effective ways to improve the performance of applications built on large language models (LLMs) is to split large text data into smaller pieces. Giving the LLM only the information it needs lets the model work as efficiently as possible. Text splitting looks like a simple technique, but there is a good deal of science and art hidden inside it.

  • Video title: The 5 Levels Of Text Splitting For Retrieval
  • Video link: https://youtu.be/8OJC21T2SL4

This post introduces the five levels of text splitting.

(*This post is a summary written after studying the YouTube video above.)

Background

Language models generally have a limit called the context length: the amount of data they can process at one time is bounded. So rather than handing a large volume of data to the model all at once, it is more efficient to split the data into small pieces and provide only the parts that are needed. This is where text splitting matters.

Text splitting is not just cutting data up; it is a strategy for building the best possible information structure. Done well, it raises the signal-to-noise ratio so that the model can focus on the information that matters most.

The 5 Levels of Text Splitting

Text splitting can be divided broadly into five levels. Each level is more sophisticated than the last, and together they cover perspectives ranging from the physical structure of the text to its semantic structure.

① Level 1: Character Splitting - Simple static character chunks of data

② Level 2: Recursive Character Text Splitting - Recursive chunking based on a list of separators

③ Level 3: Document Specific Splitting - Various chunking methods for different document types (PDF, Python, Markdown)

④ Level 4: Semantic Splitting - Embedding walk based chunking

⑤ Level 5: Agentic Splitting - Experimental method of splitting text with an agent-like system.


Level 1: Character Splitting

The first level is the most basic approach: splitting the text into chunks of a fixed number of characters. It is simple to implement, but because it ignores the context and meaning of the text, it is not often used in practice.

Splitting at a fixed character count can cut a word in the middle, which hurts readability. For this reason the approach is rarely used in real applications, and more sophisticated splitting techniques are preferred.

text = "This is the text I would like to chunk up. It is the example text for this exercise"
chunks = []
chunk_size = 35

for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
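
The loop above can also be wrapped in a small reusable function. The following is a minimal sketch, not part of the original lecture code: the function names `split_by_characters` and `count_mid_word_cuts` are hypothetical, and the optional `chunk_overlap` parameter is an assumption borrowed from the parameter of the same name used by the Level 2 splitter. It also makes the mid-word cut problem visible on the sample text.

```python
def split_by_characters(text, chunk_size, chunk_overlap=0):
    """Cut text into fixed-size character chunks, optionally overlapping."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def count_mid_word_cuts(chunks):
    """Count boundaries that cut a word in two (only meaningful with chunk_overlap=0)."""
    return sum(
        1
        for left, right in zip(chunks, chunks[1:])
        if left[-1].isalnum() and right[0].isalnum()
    )


text = "This is the text I would like to chunk up. It is the example text for this exercise"
chunks = split_by_characters(text, chunk_size=35)
print(chunks)
# ['This is the text I would like to ch', 'unk up. It is the example text for ', 'this exercise']
print(count_mid_word_cuts(chunks))  # 1: "chunk" is cut into "ch" and "unk"
```

With `chunk_overlap=0` this reproduces the loop above exactly; a non-zero overlap simply repeats the tail of each chunk at the head of the next one.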

Pros:

  • Very simple and fast.
  • Easy to implement.

Cons:

  • It ignores the meaning and structure of the text, so it is rarely used in practice.
  • Splits can fall in the middle of a word, which hurts readability.

Level 2: Recursive Character Splitting

The second level is recursive character splitting. A tool such as RecursiveCharacterTextSplitter splits the text recursively. It divides the text on a given list of separators (for example paragraph breaks, sentence boundaries, and spaces), and when one separator does not produce small enough pieces, it moves on to the next separator to break the text down further.

This method is characterized by using separators that follow the natural flow of the text. It first splits into large units and then, where necessary, recursively splits those into smaller ones, which tends to produce more meaningful chunks.

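# Newer LangChain releases also expose this class from the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# `text` is the sample string defined in the Level 1 example.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
texts = text_splitter.split_text(text)  # returns a list of string chunks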
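
To make the "try the next separator" idea concrete, here is a minimal pure-Python sketch of recursive splitting. It is an illustration under stated assumptions, not LangChain's implementation: the function name `recursive_split` is hypothetical, the separator order (paragraph break, line break, space, then a hard character cut) is assumed, and chunk overlap is left out to keep the logic short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring coarse separators and falling back to finer ones."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: no separator left, cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in (p for p in text.split(separator) if p):
        # Try to merge the piece into the chunk being built.
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with the finer separators.
            current = ""
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee"
print(recursive_split(sample, 7))
# ['aaa bbb', 'ccc', 'ddd eee']
```

The first paragraph is too long for one chunk, so it is split again on spaces, while the second paragraph fits and is kept whole.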

Pros:

  • Takes the structure of the text into account, so it can split into meaningful units.
  • Flexible, because it can use a variety of separators.

Cons:

  • Splitting still depends on a fixed character count.
  • Complex documents may need additional tuning.

Level 3: Document-Specific Splitting

The third level is document-specific splitting, which splits differently depending on the type of document. Formats such as Markdown, Python code, and PDF each have their own structural features. A Markdown document can be split on its headings, and a code file can be split by function or class.

Complex documents such as PDFs can contain images, tables, and other elements, and these need splitting methods suited to them.

Python splitting example:

# Example: splitting a Python source file
from langchain.text_splitter import PythonCodeTextSplitter

code = """
class Example:
    def function(self):
        pass
"""
# chunk_overlap defaults to 200, so set it explicitly when chunk_size is this small.
splitter = PythonCodeTextSplitter(chunk_size=200, chunk_overlap=0)
chunks = splitter.split_text(code)
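
Splitting "by function or class" can also be sketched with nothing but the standard library. The example below is an assumption-laden illustration rather than what LangChain does internally: the function name `split_python_source` is hypothetical, it uses the `ast` module (Python 3.8+ for `end_lineno`) to find top-level definitions, and it ignores any chunk size limit.

```python
import ast


def split_python_source(source):
    """Return one chunk per top-level function or class,
    preceded by a single chunk holding the remaining module-level code."""
    tree = ast.parse(source)
    lines = source.splitlines()
    definitions, loose = [], []
    for node in tree.body:
        first_line = node.lineno
        # Decorators sit above the `def`/`class` line, so include them.
        for decorator in getattr(node, "decorator_list", []):
            first_line = min(first_line, decorator.lineno)
        segment = "\n".join(lines[first_line - 1:node.end_lineno])
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            definitions.append(segment)
        else:
            loose.append(segment)
    return (["\n".join(loose)] if loose else []) + definitions


source = '''import os

def greet(name):
    return "hi " + name

class Example:
    def function(self):
        pass
'''
for chunk in split_python_source(source):
    print(chunk)
    print("-----")
```

Because the boundaries come from the parsed syntax tree, a function or class body is never cut in half, at the cost of chunks that can vary widely in size.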

JS splitting example:

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

javascript_text = """
// Function is called, the return value will end up in x
let x = myFunction(4, 3);

function myFunction(a, b) {
// Function returns the product of a and b
  return a * b;
}
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=65, chunk_overlap=0
)

js_splitter.create_documents([javascript_text])

PDF splitting example:

from PyPDF2 import PdfReader

reader = PdfReader('sample.pdf')
chunks = [page.extract_text() for page in reader.pages]

Markdown splitting example:

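from langchain.text_splitter import MarkdownTextSplitter

# Hypothetical sample input; replace it with your own Markdown document.
markdown_text = "# Title\n\nIntro paragraph.\n\n## Section\n\nSection body."

text_splitter = MarkdownTextSplitter(chunk_size=200, chunk_overlap=0)
chunks = text_splitter.split_text(markdown_text)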
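
For comparison, heading-based splitting can be sketched in plain Python. This is a simplified illustration, not the library's algorithm: the function name `split_markdown_by_heading` is hypothetical, only ATX headings (`#` to `######`) are recognized, and lines starting with `#` inside fenced code blocks would be misread as headings.

```python
import re

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown_text):
    """Return a list of {'title', 'text'} sections, one per heading.
    Text before the first heading gets the title None."""
    sections = []
    title, body = None, []

    def flush():
        # Skip the empty preamble that exists before any content is seen.
        if title is not None or any(line.strip() for line in body):
            sections.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections


sample_markdown = "# Intro\n\nHello.\n\n## Details\n\nMore text.\n"
print(split_markdown_by_heading(sample_markdown))
# [{'title': 'Intro', 'text': 'Hello.'}, {'title': 'Details', 'text': 'More text.'}]
```

Keeping the heading alongside each chunk is a design choice: the title can later be stored as metadata next to the chunk text.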

Pros:

  • Enables tailored splitting that takes the document format (for example Markdown, Python code, PDF) into account.
  • Splitting along the document's structure preserves meaning better.

Cons:

  • Requires custom code for each document format.
  • Handling several document formats at once may require additional tools or libraries.

Level 4: Semantic Splitting

The fourth level is semantic splitting. The earlier levels split on the physical structure of the text (paragraphs, sentences, words), whereas this level splits on its meaning and content. In other words, the text is divided according to how similar the sentences are in topic and content.

This method reflects the meaning of the text better, but it is complex to implement and costly. It uses text embeddings to compute the similarity between sentences and groups semantically related sentences into a single chunk.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_distances

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
sentences = ["This is a test.", "How does it work?", "Let's see!"]
embeddings = model.encode(sentences)

distances = cosine_distances(embeddings)
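
The snippet above stops at the distance matrix. One way to turn distances into chunks is sketched below, under stated assumptions: the function name `semantic_chunks` is hypothetical, only the distance between each sentence and the next one is used, a new chunk starts wherever that distance exceeds a percentile threshold (the value 80 is arbitrary), and the toy 2-D vectors stand in for real embeddings such as the output of `model.encode(sentences)`.

```python
import numpy as np


def semantic_chunks(sentences, embeddings, percentile=80):
    """Group consecutive sentences, starting a new chunk wherever the cosine
    distance to the previous sentence exceeds the given percentile."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if len(sentences) == 0:
        return []
    if len(sentences) == 1:
        return [list(sentences)]

    vectors = np.asarray(embeddings, dtype=float)
    # Normalize rows so a dot product equals cosine similarity
    # (assumes no embedding is the zero vector).
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    gaps = 1.0 - np.sum(unit[:-1] * unit[1:], axis=1)  # distance to next sentence
    threshold = np.percentile(gaps, percentile)

    chunks, current = [], [sentences[0]]
    for sentence, gap in zip(sentences[1:], gaps):
        if gap > threshold:
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


toy_sentences = [
    "Cats purr.",
    "Cats nap a lot.",
    "Stocks fell today.",
    "Markets closed lower.",
]
# Hypothetical 2-D embeddings: the first two point one way, the last two another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_chunks(toy_sentences, toy_embeddings))
# [['Cats purr.', 'Cats nap a lot.'], ['Stocks fell today.', 'Markets closed lower.']]
```

The percentile controls granularity: a lower value produces more, smaller chunks, and a higher value fewer, larger ones.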

Pros:

  • Considers semantic similarity, so splits are natural and meaningful.
  • Groups semantically related sentences into one chunk, which helps information retrieval and analysis.

Cons:

  • Embedding and similarity computation make it computationally expensive.
  • Processing time can grow long for large volumes of data or complex documents.

Level 5: Agentic Splitting

The final level is agentic splitting, an experimental method that actively evaluates the text in order to split it. Each sentence is treated as a "proposition", and each proposition is evaluated to decide whether it belongs in an existing chunk or should start a new one. Every time a sentence is added, the chunk's summary and title are updated automatically.

This approach is very slow and expensive, but it makes high-quality splitting possible across complex and varied documents.

class AgenticChunker:
    """Skeleton only: the method bodies are left as placeholders."""

    def __init__(self):
        self.chunks = []

    def add_proposition(self, proposition):
        # Evaluate which existing chunk the proposition should be added to
        pass

    def create_new_chunk(self, proposition):
        # Create a new chunk for the proposition
        pass

How it works:

  • Given a proposition (sentence), it either decides which existing chunk the proposition belongs in or creates a new chunk.
  • Whenever a chunk is created, an LLM such as GPT-4 generates its summary and title, and whenever a proposition is added, the chunk's summary and title are updated.

Pros:

  • Evaluates sentences actively as it splits, and keeps each chunk's summary and title up to date.
  • Responds flexibly to shifts in meaning within the document.

Cons:

  • It relies on an LLM, so it is computationally expensive and slow.
  • It is an experimental technique and is not yet used in typical production environments.

More on AgenticChunker:

The AgenticChunker class is a system that sorts sentences logically into chunks, summarizes the content of each chunk, and generates a title for it. It is suited to splitting and managing documents at a fine-grained level, and is especially useful when processing large documents or data with many sentences.

AgenticChunker provides the following key functions:

  • Deciding whether to add a sentence to an existing chunk or create a new chunk.
  • Automatically generating and updating each chunk's summary and title through GPT-4.
  • Evaluating the meaning of each added proposition and managing chunks dynamically.
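
The control flow described above can be sketched without calling any LLM. Everything below is a hypothetical stand-in rather than the lecture's implementation: `keyword_judge` replaces the LLM's "which chunk does this belong to?" decision with a crude word-overlap rule, `naive_describe` replaces LLM-generated titles and summaries with simple string operations, and the class is named `AgenticChunkerSketch` to keep it distinct from the real AgenticChunker. In practice both stand-ins would be replaced by calls to a model such as GPT-4.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    title: str = ""
    summary: str = ""
    propositions: list = field(default_factory=list)


def keyword_judge(proposition, chunks, min_shared_words=2):
    """Stand-in for the LLM decision: return the index of the chunk sharing
    the most words with the proposition, or None to request a new chunk."""
    words = set(proposition.lower().split())
    best_index, best_shared = None, 0
    for index, chunk in enumerate(chunks):
        chunk_words = set(" ".join(chunk.propositions).lower().split())
        shared = len(words & chunk_words)
        if shared > best_shared:
            best_index, best_shared = index, shared
    return best_index if best_shared >= min_shared_words else None


def naive_describe(propositions):
    """Stand-in for LLM summarization: returns (title, summary)."""
    return propositions[0][:30], " ".join(propositions)


class AgenticChunkerSketch:
    def __init__(self, judge=keyword_judge, describe=naive_describe):
        self.chunks = []
        self.judge = judge
        self.describe = describe

    def add_proposition(self, proposition):
        index = self.judge(proposition, self.chunks)
        if index is None:
            chunk = Chunk()  # no suitable chunk: start a new one
            self.chunks.append(chunk)
        else:
            chunk = self.chunks[index]
        chunk.propositions.append(proposition)
        # Refresh the title and summary every time the chunk changes.
        chunk.title, chunk.summary = self.describe(chunk.propositions)
        return chunk


chunker = AgenticChunkerSketch()
for sentence in [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Interest rates rose in March.",
]:
    chunker.add_proposition(sentence)

for chunk in chunker.chunks:
    print(chunk.title, "->", len(chunk.propositions), "proposition(s)")
```

Passing the judge and the describer in as functions keeps the chunking loop independent of any particular model or API, so the expensive LLM calls can be swapped in or mocked out.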

Conclusion

Text splitting is an essential step in getting the most out of a language model application.

This post briefly introduced the five levels of text splitting presented in the lecture "The 5 Levels Of Text Splitting For Retrieval". Each method divides text in a different way, and what matters is choosing a splitting strategy that fits the characteristics of the document, its meaning, and the purpose of the task.

I hope you will experiment with the range of options, from basic character splitting through semantic splitting to agentic splitting, and find the method that best suits your application. Beyond text splitting, advanced techniques such as multi-vector indexing and graph structure extraction can be used to build even more sophisticated retrieval systems.


