[NLP] 3. Natural Language Preprocessing

Posted by Euisuk's Dev Log on July 29, 2024


์›๋ณธ ๊ฒŒ์‹œ๊ธ€: https://velog.io/@euisuk-chung/NLP-3.-์ž์—ฐ์–ด-์ „์ฒ˜๋ฆฌ-๊ธฐ

  1. ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ(NLP) ๊ฐœ์š”

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์˜ ์ผ๋ฐ˜์ ์ธ ์ˆœ์„œ

  • ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ๋Š” ์Œ์„ฑ์„ ํ…์ŠคํŠธ๋กœ ๋ณ€ํ™˜ํ•˜๊ณ , ํ•ด๋‹น ํ…์ŠคํŠธ๋ฅผ ๋ถ„์„ ๋ฐ ์˜๋ฏธ๋ฅผ ์ถ”์ถœํ•œ ๋’ค, ์ด๋ฅผ ๋‹ค์‹œ ์Œ์„ฑ์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ณผ์ •์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. (์•„๋ž˜ ๊ทธ๋ฆผ ์ฐธ๊ณ )
  • ์ด ๊ณผ์ •์€ ํฌ๊ฒŒ STT(Speech to Text)์™€ TTS(Text to Speech)๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค.

1.1 ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์˜ ์ฃผ์š” ๋ถ„์•ผ

  • ์Œ์šด๋ก , ํ˜•ํƒœ๋ก , ๊ตฌ๋ฌธ๋ก , ์˜๋ฏธ๋ก , ํ™”์šฉ๋ก , ๋‹ด๋ก ๊ณผ ๊ฐ™์€ Classical Categorization์€ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ๋ฅผ ์—ฐ๊ตฌํ•˜๊ณ  ์‘์šฉํ•˜๋Š” ๋ฐ ์žˆ์–ด์„œ ์–ธ์–ด์˜ ๊ฐ ์ธก๋ฉด์„ ๋ถ„์„ํ•˜๊ณ  ์ดํ•ดํ•˜๋Š” ๋‹ค์–‘ํ•œ ์ˆ˜์ค€์„ ๊ธฐ์ค€์œผ๋กœ ๋ถ„๋ฅ˜ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ์ด๋Ÿฌํ•œ ๋ถ„๋ฅ˜๋Š” ๊ฐ๊ฐ์˜ ์–ธ์–ด์  ํ˜„์ƒ์„ ๊ฐœ๋ณ„์ ์œผ๋กœ ๋‹ค๋ฃจ๊ณ , ๊ฐ ํ˜„์ƒ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์—ฐ๊ตฌํ•˜๋Š”๋ฐ ์ค‘์ ์„ ๋‘ก๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋ณต์žกํ•œ ์ž์—ฐ์–ด๋ฅผ ๋‹ค์–‘ํ•œ ์ธต์œ„์—์„œ ์ฒด๊ณ„์ ์œผ๋กœ ๋ถ„์„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  1. ์Œ์šด๋ก (Phonology):

    • ๊ธฐ์ค€: ์†Œ๋ฆฌ์™€ ๊ด€๋ จ๋œ ์–ธ์–ด์  ํ˜„์ƒ
    • ์„ค๋ช…: ์Œ์šด๋ก ์€ ์–ธ์–ด์˜ ์†Œ๋ฆฌ ์ฒด๊ณ„์™€ ์Œ์†Œ(phoneme)๋ฅผ ์—ฐ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ๋Š” ์Œ์„ฑ์„ ํ…์ŠคํŠธ๋กœ ๋ณ€ํ™˜ํ•˜๊ฑฐ๋‚˜, ์Œ์„ฑ ์ธ์‹ ์‹œ์Šคํ…œ์„ ๊ฐœ๋ฐœํ•  ๋•Œ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค.
  2. ํ˜•ํƒœ๋ก (Morphology):

    • ๊ธฐ์ค€: ๋‹จ์–ด์˜ ๊ตฌ์กฐ์™€ ํ˜•์„ฑ
    • ์„ค๋ช…: ํ˜•ํƒœ๋ก ์€ ๋‹จ์–ด์˜ ๋‚ด๋ถ€ ๊ตฌ์กฐ๋ฅผ ์—ฐ๊ตฌํ•˜๋ฉฐ, ๋‹จ์–ด๋ฅผ ์˜๋ฏธ ์žˆ๋Š” ๋‹จ์œ„(morpheme)๋กœ ๋ถ„ํ•ดํ•˜๋Š” ์ž‘์—…์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ๋Š” ํ˜•ํƒœ์†Œ ๋ถ„์„(morphological analysis) ๋ฐ ํ˜•ํƒœ์†Œ ๊ธฐ๋ฐ˜ ํ† ํฐํ™”(tokenization)์— ํ™œ์šฉ๋ฉ๋‹ˆ๋‹ค.
  3. ๊ตฌ๋ฌธ๋ก (Syntax):

    • ๊ธฐ์ค€: ๋ฌธ์žฅ ๊ตฌ์กฐ์™€ ๋ฌธ๋ฒ• ๊ทœ์น™
    • ์„ค๋ช…: ๊ตฌ๋ฌธ๋ก ์€ ๋‹จ์–ด๋“ค์ด ๊ฒฐํ•ฉ๋˜์–ด ๋ฌธ์žฅ์„ ์ด๋ฃจ๋Š” ๊ทœ์น™์„ ์—ฐ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ๋Š” ๋ฌธ์žฅ ๊ตฌ๋ฌธ ๋ถ„์„(parsing)์„ ํ†ตํ•ด ๋ฌธ๋ฒ•์ ์œผ๋กœ ์˜ฌ๋ฐ”๋ฅธ ๋ฌธ์žฅ์„ ๋ถ„์„ํ•˜๊ณ  ์ƒ์„ฑํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.
  4. ์˜๋ฏธ๋ก (Semantics):

    • ๊ธฐ์ค€: ์˜๋ฏธ์™€ ํ•ด์„
    • ์„ค๋ช…: ์˜๋ฏธ๋ก ์€ ๋‹จ์–ด, ๊ตฌ, ๋ฌธ์žฅ์˜ ์˜๋ฏธ๋ฅผ ์—ฐ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ๋Š” ํ…์ŠคํŠธ์˜ ์˜๋ฏธ๋ฅผ ์ถ”์ถœํ•˜๊ณ , ๋ฌธ๋งฅ์„ ์ดํ•ดํ•˜๋ฉฐ, ์˜๋ฏธ ๊ธฐ๋ฐ˜ ๊ฒ€์ƒ‰ ๋ฐ ์งˆ์˜์‘๋‹ต ์‹œ์Šคํ…œ์— ํ™œ์šฉ๋ฉ๋‹ˆ๋‹ค.
  5. ํ™”์šฉ๋ก (Pragmatics):

    • ๊ธฐ์ค€: ์–ธ์–ด ์‚ฌ์šฉ์˜ ๋งฅ๋ฝ๊ณผ ๋ชฉ์ 
    • ์„ค๋ช…: ํ™”์šฉ๋ก ์€ ์–ธ์–ด๊ฐ€ ์‹ค์ œ ์ƒํ™ฉ์—์„œ ์–ด๋–ป๊ฒŒ ์‚ฌ์šฉ๋˜๋Š”์ง€๋ฅผ ์—ฐ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ๋Š” ๋Œ€ํ™” ์‹œ์Šคํ…œ, ์ฑ—๋ด‡ ๋“ฑ์—์„œ ์‚ฌ์šฉ์ž์˜ ์˜๋„๋ฅผ ํŒŒ์•…ํ•˜๊ณ  ์ ์ ˆํ•œ ์‘๋‹ต์„ ์ƒ์„ฑํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.
  6. ๋‹ด๋ก (Discourse):

    • ๊ธฐ์ค€: ๋ฌธ์žฅ ๊ฐ„์˜ ๊ด€๊ณ„์™€ ๋งฅ๋ฝ
    • ์„ค๋ช…: ๋‹ด๋ก ์€ ๋ฌธ์žฅ๋“ค์ด ๊ฒฐํ•ฉ๋˜์–ด ๋” ํฐ ํ…์ŠคํŠธ๋‚˜ ๋Œ€ํ™”๋ฅผ ํ˜•์„ฑํ•˜๋Š” ๋ฐฉ์‹์„ ์—ฐ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ๋Š” ํ…์ŠคํŠธ ์š”์•ฝ, ๋ฌธ์„œ ๋ถ„๋ฅ˜, ๋Œ€ํ™”์˜ ์ผ๊ด€์„ฑ ์œ ์ง€ ๋“ฑ์— ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

1.2 ์šฐ๋ฆฌ๋Š” ์ง€๊ธˆ ์–ด๋А ๋‹จ๊ณ„์ธ๊ฐ€?

  • ํ˜„์žฌ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ(NLP)์˜ ๋ฐœ์ „ ์ˆ˜์ค€์„ ํ‰๊ฐ€ํ•  ๋•Œ, ์Œ์šด๋ก , ํ˜•ํƒœ๋ก , ๊ตฌ๋ฌธ๋ก , ์˜๋ฏธ๋ก , ํ™”์šฉ๋ก , ๋‹ด๋ก ์˜ 6๊ฐ€์ง€ ์˜์—ญ์—์„œ ๊ฐ๊ฐ์˜ ์„ฑ๊ณผ๋ฅผ ๊ณ ๋ คํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ฐ๊ฐ์˜ ์˜์—ญ์—์„œ ํ˜„์žฌ ๊ธฐ์ˆ ์ด ์–ด๋А ์ •๋„ ์„ฑ์ˆ™๋˜์—ˆ๋Š”์ง€ ์‚ดํŽด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

    1. ์Œ์šด๋ก  (Phonology)

    • ์Œ์„ฑ ์ธ์‹(Speech Recognition): ๋งค์šฐ ๋†’์€ ์ •ํ™•๋„๋กœ ์Œ์„ฑ์„ ํ…์ŠคํŠธ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ธฐ์ˆ ์ด ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
    • ํ…์ŠคํŠธ ์Œ์„ฑ ๋ณ€ํ™˜(Text-to-Speech): ์ž์—ฐ์Šค๋Ÿฌ์šด ์Œ์„ฑ์„ ์ƒ์„ฑํ•˜๋Š” TTS ๊ธฐ์ˆ ๋„ ์ƒ๋‹นํžˆ ๋ฐœ์ „ํ–ˆ์Šต๋‹ˆ๋‹ค. DeepMind์˜ WaveNet๊ณผ ๊ฐ™์€ ๋ชจ๋ธ์ด ๊ทธ ์˜ˆ์ž…๋‹ˆ๋‹ค.
    • ์ˆ˜์ค€: ๋งค์šฐ ๋ฐœ์ „๋˜์–ด ์‹ค์šฉํ™” ๋‹จ๊ณ„.
    • ์ƒ์šฉํ™” ์˜ˆ์‹œ: Siri, Google Assistant, Amazon Alexa ๋“ฑ์˜ ์Œ์„ฑ ๋น„์„œ ๊ธฐ๋Šฅ

2. ํ˜•ํƒœ๋ก  (Morphology)

  • ํ˜•ํƒœ์†Œ ๋ถ„์„(Morphological Analysis): ํ˜•ํƒœ์†Œ ๋ถ„์„๊ธฐ์™€ ํ† ํฌ๋‚˜์ด์ €(tokenizer)๊ฐ€ ๋‹ค์ˆ˜ ๊ฐœ๋ฐœ๋˜์—ˆ์œผ๋ฉฐ, ํ•œ๊ตญ์–ด, ์ผ๋ณธ์–ด, ํ•€๋ž€๋“œ์–ด ๋“ฑ ํ˜•ํƒœ๊ฐ€ ๋ณต์žกํ•œ ์–ธ์–ด์—์„œ๋„ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค.
  • ์ˆ˜์ค€: ์„ฑ์ˆ™ ๋‹จ๊ณ„๋กœ, ๋‹ค์–‘ํ•œ ์–ธ์–ด์— ๋Œ€ํ•ด ๋†’์€ ์ •ํ™•๋„๋ฅผ ๋ณด์ž„.
  • ์ƒ์šฉํ™” ์˜ˆ์‹œ: Grammarly, Microsoft, ํ•œ์ปด ๋“ฑ์˜ ๋งž์ถค๋ฒ• ๊ฒ€์‚ฌ ๊ธฐ๋Šฅ

3. ๊ตฌ๋ฌธ๋ก  (Syntax)

  • ๊ตฌ๋ฌธ ๋ถ„์„(Parsing): ๊ตฌ๋ฌธ ๋ถ„์„๊ธฐ๋Š” CFG, PCFG, Dependency Parsing ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐฉ์‹์œผ๋กœ ๋ฐœ์ „๋˜์–ด ์™”์Šต๋‹ˆ๋‹ค. ์ตœ๊ทผ์—๋Š” BERT, GPT ๋“ฑ Transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค์ด ๊ตฌ๋ฌธ ๋ถ„์„์—์„œ๋„ ์ข‹์€ ์„ฑ๊ณผ๋ฅผ ๋ณด์ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ˆ˜์ค€: ์‹ค์šฉํ™” ๋‹จ๊ณ„๋กœ, ๋งŽ์€ NLP ์‘์šฉ์—์„œ ํ™œ์šฉ ๊ฐ€๋Šฅ.
  • ์‚ฌ์šฉํ™” ์˜ˆ์‹œ: Stanford Parser, spaCy, NLTK ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ

๐Ÿค”Stanford Parser, spaCy, NLTK?

  • Stanford Parser:
    1. A natural language parser developed at Stanford University.
    2. Supports parsing for many languages (English, Chinese, Arabic, German, French, Spanish, etc.).
    3. Provides both constituency parsing and dependency parsing.
    4. Includes PCFG (Probabilistic Context-Free Grammar), shift-reduce, and neural-network-based dependency parsers.
    5. Open source, and free for non-commercial use.
  • spaCy:
    1. A Python library for industrial-strength natural language processing.
    2. Supports a wide range of NLP tasks: tokenization, POS tagging, named entity recognition, dependency parsing, and more.
    3. Known for fast processing speed and high accuracy.
    4. Ships pretrained models that are easy to use.
    5. Offers multilingual support.
  • NLTK (Natural Language Toolkit):

    NLTK was designed for education and research and supports a wide range of NLP tasks.

    1. Tokenization: splits text into sentences and words.
    2. POS tagging: assigns part-of-speech information to each word.
    3. Parsing: supports both constituency and dependency parsing.
    4. Lexical resources: includes resources such as WordNet for analyzing word meaning (see the sketch just below).
    5. Text preprocessing: supports stopword removal, stemming, lemmatization, and other preprocessing steps.

4. ์˜๋ฏธ๋ก  (Semantics)

  • ์˜๋ฏธ ๋ถ„์„(Semantic Analysis): ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ(word embedding), ๋ฌธ์žฅ ์ž„๋ฒ ๋”ฉ ๋“ฑ์˜ ๊ธฐ์ˆ ์ด ๋ฐœ์ „ํ•˜๋ฉด์„œ ๋ฌธ์žฅ์˜ ์˜๋ฏธ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. BERT, GPT-3 ๊ฐ™์€ ๋Œ€ํ˜• ์–ธ์–ด ๋ชจ๋ธ๋“ค์ด ๋ฌธ๋งฅ์„ ์ดํ•ดํ•˜๊ณ  ์ ์ ˆํ•œ ์‘๋‹ต์„ ์ƒ์„ฑํ•˜๋Š” ๋Šฅ๋ ฅ์„ ๋ณด์œ ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ˆ˜์ค€: ๋งค์šฐ ๋ฐœ์ „๋œ ๋‹จ๊ณ„๋กœ, ์‹ค์ƒํ™œ ์‘์šฉ์—์„œ ํ™œ์šฉ ์ค‘.
  • ์‚ฌ์šฉํ™” ์˜ˆ์‹œ: Word2Vec, BERT, GPT ๋“ฑ์˜ ์–ธ์–ด ๋ชจ๋ธ

5. ํ™”์šฉ๋ก  (Pragmatics)

  • ๋Œ€ํ™” ์‹œ์Šคํ…œ(Dialogue Systems): ์‚ฌ์šฉ์ž ์˜๋„๋ฅผ ์ดํ•ดํ•˜๊ณ  ์ ์ ˆํžˆ ์‘๋‹ตํ•˜๋Š” ์ฑ—๋ด‡๊ณผ ๋Œ€ํ™” ์—์ด์ „ํŠธ๊ฐ€ ํ™œ๋ฐœํžˆ ์—ฐ๊ตฌ๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์‹ค์ œ ์ƒ์šฉ ์ œํ’ˆ๋“ค๋„ ๋“ฑ์žฅํ–ˆ์ง€๋งŒ, ์•„์ง ์™„๋ฒฝํ•˜์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค.
  • ๋งฅ๋ฝ ์ดํ•ด(Context Understanding): ์ƒํ™ฉ์— ๋งž๋Š” ์ ์ ˆํ•œ ์–ธ์–ด ์‚ฌ์šฉ์„ ์ดํ•ดํ•˜๋Š” ๋ฐ ์žˆ์–ด ์ œํ•œ์ ์ธ ์„ฑ๊ณผ๋ฅผ ๋ณด์ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ˆ˜์ค€: ๋ฐœ์ „ ์ค‘, ์ƒ๋‹นํ•œ ๊ฐœ์„  ์—ฌ์ง€ ์žˆ์Œ
  • ์‚ฌ์šฉํ™” ์˜ˆ์‹œ: ์ฑ—๋ด‡, ๊ณ ๊ฐ ์„œ๋น„์Šค ์ž๋™ํ™” ์‹œ์Šคํ…œ

6. ๋‹ด๋ก  (Discourse)

  • ํ…์ŠคํŠธ ์ผ๊ด€์„ฑ(Text Coherence): ๊ธด ๋ฌธ๋งฅ์„ ์œ ์ง€ํ•˜๊ณ  ์ผ๊ด€์„ฑ ์žˆ๋Š” ํ…์ŠคํŠธ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋Šฅ๋ ฅ์ด ํ–ฅ์ƒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. GPT-3์™€ ๊ฐ™์€ ๋ชจ๋ธ๋“ค์ด ๊ธด ๋ฌธ์„œ ์ƒ์„ฑ์—์„œ ์ข‹์€ ์„ฑ๊ณผ๋ฅผ ๋ณด์ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋‹ด๋ก  ๋ถ„์„(Discourse Analysis): ๋ฌธ์žฅ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ•˜๊ณ , ์ผ๊ด€๋œ ๋Œ€ํ™”๋ฅผ ์œ ์ง€ํ•˜๋Š” ๋ฐ์—๋Š” ์•„์ง ํ•œ๊ณ„๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ˆ˜์ค€: ์ดˆ๊ธฐ ๋‹จ๊ณ„, ๋งŽ์€ ๊ฐœ์„  ํ•„์š”
  • ์‚ฌ์šฉํ™” ์˜ˆ์‹œ: ์ž๋™ ์š”์•ฝ ๋„๊ตฌ, ๋Œ€ํ™”ํ˜• AI ์‹œ์Šคํ…œ

๐Ÿ”Ž ์ฑ—๋ด‡ vs ๋Œ€ํ™”ํ˜• ์ฑ—๋ด‡, ๋ฌด์—‡์ด ๋‹ค๋ฅธ๊ฐ€?

์ฑ—๋ด‡๊ณผ ๋Œ€ํ™”ํ˜• AI์˜ ์ฐจ์ด์ ์„ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด ๋‘ ๊ธฐ์ˆ ์„ ๋น„๊ตํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‘ ๊ธฐ์ˆ ์˜ ์ฃผ์š” ์ฐจ์ด์ ์„ ์ •๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

์ฑ—๋ด‡ (Chatbots) ๐Ÿ—ฃ๏ธ

  • ๊ธฐ๋ฐ˜ ๊ธฐ์ˆ : ๊ทœ์น™ ๊ธฐ๋ฐ˜(rule-based) ๋˜๋Š” ์‚ฌ์ „์— ์ •์˜๋œ ์Šคํฌ๋ฆฝํŠธ(predefined scripts)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํŠน์ • ์งˆ์˜๋‚˜ ๋ช…๋ น์— ์‘๋‹ตํ•ฉ๋‹ˆ๋‹ค.
  • ์ดํ•ด ๋Šฅ๋ ฅ: ์ž์—ฐ์–ด๋ฅผ ์ดํ•ดํ•˜๊ณ  ์ƒ์„ฑํ•˜๋Š” ๋Šฅ๋ ฅ์ด ์ œํ•œ์ ์ด๋ฉฐ, ๋ณต์žกํ•œ ๋Œ€ํ™”๋ฅผ ์ž˜ ์ฒ˜๋ฆฌํ•˜์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค.
  • ์‘๋‹ต ๋ฐฉ์‹: ํŠน์ • ์ž…๋ ฅ์— ๋Œ€ํ•ด ๋ฏธ๋ฆฌ ์ •์˜๋œ ์‘๋‹ต์„ ์ œ๊ณตํ•˜๋Š” ๋ฐ ์ค‘์ ์„ ๋‘ก๋‹ˆ๋‹ค.
  • ์ ์šฉ ๋ถ„์•ผ: ์ฃผ๋กœ ๊ณ ๊ฐ ์ง€์›, FAQ, ๊ฐ„๋‹จํ•œ ์ •๋ณด ์ œ๊ณต ๋“ฑ์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

๋Œ€ํ™”ํ˜• AI (Conversational AI) ๐ŸŽ™๏ธ

  • ๊ธฐ๋ฐ˜ ๊ธฐ์ˆ : ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ(NLP), ๋จธ์‹  ๋Ÿฌ๋‹(ML), ์ธ๊ณต์ง€๋Šฅ(AI) ๋“ฑ์˜ ๋‹ค์–‘ํ•œ ๊ธฐ์ˆ ์„ ํฌํ•จํ•˜๋Š” ๋„“์€ ๋ฒ”์œ„์˜ ๊ธฐ์ˆ ์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๊ธฐ๊ณ„๊ฐ€ ์ž์—ฐ์–ด๋กœ ์ธ๊ฐ„๊ณผ ๊ฐ™์€ ์‘๋‹ต์„ ์ดํ•ดํ•˜๊ณ , ์ฒ˜๋ฆฌํ•˜๋ฉฐ, ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ดํ•ด ๋Šฅ๋ ฅ: ๋” ๋ฐœ์ „๋œ ๊ธฐ์ˆ ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ๋งฅ, ๊ฐ์ •, ์–ธ์–ด์˜ ๋ฏธ๋ฌ˜ํ•œ ์ฐจ์ด๋ฅผ ์ดํ•ดํ•˜๊ณ  ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์‘๋‹ต ๋ฐฉ์‹: ์ƒํ˜ธ์ž‘์šฉ์—์„œ ํ•™์Šตํ•˜๊ณ , ์‚ฌ์šฉ์ž ์ž…๋ ฅ์— ๋”ฐ๋ผ ์ ์‘ํ•˜๋ฉฐ, ๋” ๋ณต์žกํ•˜๊ณ  ๋™์ ์ธ ๋Œ€ํ™”๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ ์šฉ ๋ถ„์•ผ: ๊ณ ๊ฐ ์ง€์›, ๋น„์ฆˆ๋‹ˆ์Šค ํ”„๋กœ์„ธ์Šค ์ž๋™ํ™”, ๊ฐœ์ธ ๋น„์„œ, ๋ณต์žกํ•œ ์งˆ์˜์‘๋‹ต ์‹œ์Šคํ…œ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ถ„์•ผ์—์„œ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

  • ํ˜„์žฌ NLP ๊ธฐ์ˆ ์˜ ๋ฐœ์ „ ๋‹จ๊ณ„๋Š” 4~5 ์˜์—ญ์— ๊ฑธ์ณ ์žˆ์œผ๋ฉฐ, ๊ฐ ์˜์—ญ์—์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํŠน์ง•์„ ๋ณด์ž…๋‹ˆ๋‹ค:

    1. ์˜๋ฏธ๋ก  (Semantics) ์˜์—ญ์—์„œ๋Š” ์ƒ๋‹นํ•œ ์ง„์ „์„ ์ด๋ฃจ์–ด ์‹ค์šฉ์ ์ธ ์ˆ˜์ค€์— ๋„๋‹ฌํ–ˆ์ง€๋งŒ, ์—ฌ์ „ํžˆ ์ธ๊ฐ„ ์ˆ˜์ค€์˜ ์ดํ•ด์™€ ํ•ด์„์—๋Š” ๋ฏธ์น˜์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค.

      • ๊ณผ์ œ: ๋ณต์žกํ•œ ์ถ”๋ก , ์ƒ์‹์  ์ดํ•ด, ์€์œ  ํ•ด์„ ๋“ฑ์—์„œ ์—ฌ์ „ํžˆ ๊ฐœ์„ ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
    2. ํ™”์šฉ๋ก  (Pragmatics) ์˜์—ญ์€ ์˜๋ฏธ๋ก ๋ณด๋‹ค๋Š” ๋œ ๋ฐœ์ „ํ–ˆ์ง€๋งŒ, ์‹ค์ œ ์‘์šฉ ๋ถ„์•ผ์—์„œ ์ ์ฐจ ํ™œ์šฉ๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ํŠนํžˆ ๋Œ€ํ™” ์‹œ์Šคํ…œ๊ณผ ๊ฐ์ • ๋ถ„์„ ๋“ฑ์—์„œ ์œ ์šฉ์„ฑ์„ ๋ณด์ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

      • ๊ณผ์ œ: ํ™”์ž์˜ ์˜๋„, ์•”์‹œ์  ์˜๋ฏธ, ์‚ฌํšŒ๋ฌธํ™”์  ๋งฅ๋ฝ ์ดํ•ด ๋“ฑ์—์„œ ์ธ๊ฐ„ ์ˆ˜์ค€์˜ ์ดํ•ด๋ฅผ ๋‹ฌ์„ฑํ•˜๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ž…๋‹ˆ๋‹ค.
    3. ๋‹ด๋ก  (Discourse) ์˜์—ญ์€ ๊ฐ€์žฅ ๋ณต์žกํ•˜๊ณ  ๋„์ „์ ์ธ ์˜์—ญ์œผ๋กœ, ํ˜„์žฌ ์ดˆ๊ธฐ ์—ฐ๊ตฌ ๋‹จ๊ณ„์— ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ์˜์—ญ์˜ ๋ฐœ์ „์€ ํ–ฅํ›„ NLP ๊ธฐ์ˆ ์˜ ํฐ ๋„์•ฝ์„ ๊ฐ€์ ธ์˜ฌ ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒ๋ฉ๋‹ˆ๋‹ค.

      • ๊ณผ์ œ: ๊ธด ๋ฌธ๋งฅ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ์ผ๊ด€์„ฑ ์žˆ๋Š” ๋Œ€ํ™”๋ฅผ ์ด์–ด๊ฐ€๋Š” ๋Šฅ๋ ฅ, ๋ณต์žกํ•œ ๋…ผ๋ฆฌ ๊ตฌ์กฐ ์ดํ•ด ๋“ฑ์ด ์ฃผ์š” ๊ณผ์ œ์ž…๋‹ˆ๋‹ค.

์ „๋ฐ˜์ ์œผ๋กœ, NLP ๊ธฐ์ˆ ์€ ์ด ์„ธ ์˜์—ญ์—์„œ ์ง€์†์ ์œผ๋กœ ๋ฐœ์ „ํ•˜๊ณ  ์žˆ์ง€๋งŒ, ์ธ๊ฐ„ ์ˆ˜์ค€์˜ ์–ธ์–ด ์ดํ•ด์™€ ์ƒ์„ฑ์„ ์œ„ํ•ด์„œ๋Š” ์•„์ง ๋งŽ์€ ์—ฐ๊ตฌ์™€ ํ˜์‹ ์ด ํ•„์š”ํ•œ ์ƒํƒœ์ž…๋‹ˆ๋‹ค. (=> ์ง€๊ธˆ๋„ ํ™œ๋ฐœํ•˜๊ฒŒ ์—ฐ๊ตฌ๊ฐ€ ์ง„ํ–‰๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค)

  1. NLP์˜ ์–ด๋ ค์›€๊ณผ ์—ญ์‚ฌ

  • ์™œ NLP๊ฐ€ ์–ด๋ ค์šด๊ฐ€? : ์ž์—ฐ์–ด๋Š” ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์–ธ์–ด์™€ ๋‹ฌ๋ฆฌ ๋ฐฉ๋Œ€ํ•œ ๋‹จ์–ด์˜ ์–‘, ๋ณต์žกํ•œ ๊ตฌ๋ฌธ, ๋ชจํ˜ธ์„ฑ ๋“ฑ์œผ๋กœ ์ธํ•ด ๋‹ค๋ฃจ๊ธฐ ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ์‚ฌ์šฉ๋˜๋Š” ๋ฐฉ๋Œ€ํ•œ ๋‹จ์–ด์˜ ์–‘ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, ๊ตฌ๋ฌธ์˜ ๋ณต์žก์„ฑ ๋ฐ ๋ชจํ˜ธ์„ฑ, ์‹œ๊ฐ„์˜ ํ๋ฆ„์— ๋”ฐ๋ผ ์ง„ํ™”ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ดˆ๊ธฐ NLP์—ฐ๊ตฌ์— ์–ด๋ ค์›€์ด ์กด์žฌํ–ˆ์Šต๋‹ˆ๋‹ค.

  • NLP ์—ฐ๊ตฌ์˜ ์—ญ์‚ฌ : ์ดˆ๊ธฐ์—๋Š” ๊ทœ์น™ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ•์ด ์ฃผ๋กœ ์‚ฌ์šฉ๋˜์—ˆ์ง€๋งŒ, ์ž์—ฐ์–ด์˜ ๋™์  ํŠน์„ฑ์„ ๊ณ ๋ คํ•˜์ง€ ๋ชปํ•˜๋Š” ๋‹จ์ ์ด ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ํ˜„์žฌ๋Š” ํ†ต๊ณ„์  ๋ฐฉ๋ฒ•๊ณผ ๊ทœ์น™ ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•์„ ๊ฒฐํ•ฉํ•˜์—ฌ ์‚ฌ์šฉํ•˜๋ฉฐ, ๋จธ์‹  ๋Ÿฌ๋‹ ๋ฐ ๋”ฅ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ•์ด ์ฃผ๋ฅ˜๋ฅผ ์ด๋ฃจ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

  3. Lexical Analysis

3.1. ํ† ํฐํ™” (Tokenization)

ํ† ํฐํ™”๋Š” ํ…์ŠคํŠธ๋ฅผ ์ž‘์€ ์˜๋ฏธ ๋‹จ์œ„๋กœ ๋ถ„ํ• ํ•˜๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋‹จ์œ„๋Š” ๋‹จ์–ด, ๊ตฌ, ์‹ฌ์ง€์–ด๋Š” ๊ฐœ๋ณ„ ๋ฌธ์ž์ผ ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, โ€œHello, world!โ€๋ผ๋Š” ๋ฌธ์žฅ์€ [โ€œHelloโ€, โ€œ,โ€, โ€œworldโ€, โ€œ!โ€]๋กœ ๋ถ„ํ• ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ† ํฐํ™”๋Š” ํ…์ŠคํŠธ์˜ ๊ตฌ์กฐ๋ฅผ ์ดํ•ดํ•˜๊ณ  ๊ฐ ๋‹จ์–ด๊ฐ€ ์–ด๋–ป๊ฒŒ ์‚ฌ์šฉ๋˜๋Š”์ง€๋ฅผ ๋ถ„์„ํ•˜๋Š” ๋ฐ ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค.

ํ† ํฐํ™”์—๋Š” ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์ด ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๊ณต๋ฐฑ ๊ธฐ๋ฐ˜ ํ† ํฐํ™”: ๊ณต๋ฐฑ์„ ๊ธฐ์ค€์œผ๋กœ ๋‹จ์–ด๋ฅผ ๋ถ„๋ฆฌํ•ฉ๋‹ˆ๋‹ค.
  • ๊ตฌ๋‘์  ๊ธฐ๋ฐ˜ ํ† ํฐํ™”: ๊ตฌ๋‘์ ๋„ ๋ณ„๋„์˜ ํ† ํฐ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.
  • ์–ด์ ˆ ๊ธฐ๋ฐ˜ ํ† ํฐํ™”: ์–ธ์–ด์  ์˜๋ฏธ๋ฅผ ๊ณ ๋ คํ•˜์—ฌ ์–ด์ ˆ ๋‹จ์œ„๋กœ ๋ถ„๋ฆฌํ•ฉ๋‹ˆ๋‹ค. (ํ•œ๊ตญ์–ด์™€ ๊ฐ™์€ ์–ธ์–ด์—์„œ ํŠนํžˆ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค)

์ฝ”๋“œ ์˜ˆ์‹œ

import nltk
nltk.download('punkt')  # tokenizer models (first run only)
from nltk.tokenize import word_tokenize

sentence = "Hello, world! Welcome to NLP."
tokens = word_tokenize(sentence)
print(tokens)

์ถœ๋ ฅ: ['Hello', ',', 'world', '!', 'Welcome', 'to', 'NLP', '.']

3.2. ํ˜•ํƒœ์†Œ ๋ถ„์„ (Morphological Analysis)

ํ˜•ํƒœ์†Œ ๋ถ„์„์€ ๋‹จ์–ด์˜ ๋‚ด๋ถ€ ๊ตฌ์กฐ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ๋ณ€ํ™˜ํ•˜๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค. ์ด๋Š” ๋‹จ์ˆœํžˆ ๋‹จ์–ด๋ฅผ ๋ถ„๋ฆฌํ•˜๋Š” ํ† ํฐํ™” ๋‹จ๊ณ„๋ณด๋‹ค ๋” ๊นŠ์€ ๋ถ„์„์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค. ํ˜•ํƒœ์†Œ ๋ถ„์„์€ ์ฃผ๋กœ ๋‘ ๊ฐ€์ง€ ์ฃผ์š” ์ž‘์—…์œผ๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค:

  • ํ˜•ํƒœ์†Œ(Morpheme) ์‹๋ณ„: ๋‹จ์–ด๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ์ตœ์†Œ ์˜๋ฏธ ๋‹จ์œ„์ธ ํ˜•ํƒœ์†Œ๋ฅผ ์‹๋ณ„ํ•ฉ๋‹ˆ๋‹ค. ํ˜•ํƒœ์†Œ๋Š” ์ž๋ฆฝ ํ˜•ํƒœ์†Œ(๋…๋ฆฝ์ ์œผ๋กœ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ๋Š” ๋‹จ์–ด)์™€ ์˜์กด ํ˜•ํƒœ์†Œ(๋‹ค๋ฅธ ํ˜•ํƒœ์†Œ์™€ ๊ฒฐํ•ฉ๋˜์–ด์•ผ ์˜๋ฏธ๋ฅผ ๊ฐ–๋Š” ๋‹จ์–ด)๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค.

    • ์˜ˆ์‹œ: โ€œunhappinessโ€๋Š” โ€œun-โ€œ, โ€œhappyโ€, โ€œ-nessโ€์˜ ์„ธ ๊ฐ€์ง€ ํ˜•ํƒœ์†Œ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค.
  • ํ˜•ํƒœ์†Œ ๋ณ€ํ˜•(Morphological Inflection): ๋‹จ์–ด๊ฐ€ ๋ฌธ๋ฒ•์  ํ˜•ํƒœ๋ฅผ ๋‚˜ํƒ€๋‚ด๊ธฐ ์œ„ํ•ด ๋ณ€ํ˜•๋˜๋Š” ๊ณผ์ •์„ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋‹ค์Œ ๋‘ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์œผ๋กœ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค:

    • Stemming: ๋‹จ์–ด์˜ ์–ด๊ฐ„(Stem)์„ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋‹จ์–ด์˜ ๋ณ€ํ˜•๋œ ํ˜•ํƒœ๋“ค์„ ๋™์ผํ•˜๊ฒŒ ์ธ์‹ํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, โ€œrunningโ€, โ€œrunnerโ€, โ€œranโ€์€ ๋ชจ๋‘ โ€œrunโ€์œผ๋กœ ๋ณ€ํ™˜๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Stemming์€ ๊ทœ์น™ ๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ฃผ๋กœ ์‚ฌ์šฉํ•˜๋ฉฐ, ๋‹จ์–ด์˜ ์ผ๋ถ€๋ถ„์„ ์ œ๊ฑฐํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๊ตฌํ˜„๋ฉ๋‹ˆ๋‹ค.
      • ์˜ˆ์‹œ: Porter Stemmer, Snowball Stemmer
    • Lemmatization: ๋‹จ์–ด์˜ ๊ธฐ๋ณธ ํ˜•ํƒœ(Lemma)๋ฅผ ์ฐพ์•„๋‚ด๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค. ์ด๋Š” ์–ดํœ˜ ํ˜•ํƒœ์†Œ ๋ถ„์„์„ ํฌํ•จํ•˜์—ฌ ๋‹จ์–ด์˜ ํ’ˆ์‚ฌ ์ •๋ณด๋ฅผ ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, โ€œrunningโ€์€ โ€œrunโ€์œผ๋กœ, โ€œbetterโ€๋Š” โ€œgoodโ€์œผ๋กœ ๋ณ€ํ™˜๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Lemmatization์€ ์–ธ์–ด์  ์ง€์‹(์‚ฌ์ „)์„ ๋ฐ”ํƒ•์œผ๋กœ ๋‹จ์–ด๋ฅผ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
      • ์˜ˆ์‹œ: WordNet Lemmatizer

์ฝ”๋“œ ์˜ˆ์‹œ

import nltk
nltk.download('wordnet')  # WordNet data for the lemmatizer (first run only)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word1 = "running"
word2 = "better"

print(stemmer.stem(word1))   # Output: run
print(lemmatizer.lemmatize(word1, pos='v'))  # Output: run
print(lemmatizer.lemmatize(word2, pos='a'))  # Output: good

์ถœ๋ ฅ:

run
run
good

3.3. ๋ฌธ์žฅ ๋ถ„ํ•  (Sentence Splitting)

๋ฌธ์žฅ ๋ถ„ํ• ์€ ํ…์ŠคํŠธ๋ฅผ ๊ฐœ๋ณ„ ๋ฌธ์žฅ์œผ๋กœ ๋ถ„๋ฆฌํ•˜๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค. ์ด๋Š” ๊ตฌ๋‘์ ์ด๋‚˜ ๋ฌธ์žฅ์˜ ๊ตฌ์กฐ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค. ๋ฌธ์žฅ ๋ถ„ํ• ์€ ํ…์ŠคํŠธ ๋ถ„์„์˜ ๋‹จ์œ„๋ฅผ ๋ช…ํ™•ํžˆ ํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

์ฝ”๋“œ ์˜ˆ์‹œ

from nltk.tokenize import sent_tokenize  # uses the 'punkt' models downloaded earlier

text = "Hello world! How are you doing? NLP is interesting."
sentences = sent_tokenize(text)
print(sentences)

์ถœ๋ ฅ: ['Hello world!', 'How are you doing?', 'NLP is interesting.']

3.4. ํ’ˆ์‚ฌ ํƒœ๊น… (Part-of-Speech Tagging)

ํ’ˆ์‚ฌ ํƒœ๊น…์€ ๋ฌธ์žฅ์—์„œ ๊ฐ ๋‹จ์–ด์— ํ•ด๋‹นํ•˜๋Š” ํ’ˆ์‚ฌ๋ฅผ ํ• ๋‹นํ•˜๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค. ์ด๋Š” ๋ฌธ์žฅ์˜ ๊ตฌ๋ฌธ ๊ตฌ์กฐ๋ฅผ ์ดํ•ดํ•˜๊ณ , ๋‹จ์–ด ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๋ฐ ๋„์›€์„ ์ค๋‹ˆ๋‹ค.

ํ’ˆ์‚ฌ ํƒœ๊น…์€ ์ฃผ๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

  • ๊ทœ์น™ ๊ธฐ๋ฐ˜ ํƒœ๊น…(Rule-Based Tagging): ์ •ํ•ด์ง„ ๋ฌธ๋ฒ• ๊ทœ์น™์„ ์ด์šฉํ•˜์—ฌ ๋‹จ์–ด์˜ ํ’ˆ์‚ฌ๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.
  • ํ†ต๊ณ„์  ํƒœ๊น…(Statistical Tagging): ์ฝ”ํผ์Šค ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ ํ’ˆ์‚ฌ๋ฅผ ํ™•๋ฅ ์ ์œผ๋กœ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. HMM(Hidden Markov Model)์ด๋‚˜ Maximum Entropy ๋ชจ๋ธ์ด ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.
  • ๊ธฐ๊ณ„ ํ•™์Šต ๊ธฐ๋ฐ˜ ํƒœ๊น…(Machine Learning-Based Tagging): ์ง€๋„ ํ•™์Šต์„ ํ†ตํ•ด ํ’ˆ์‚ฌ๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. SVM, Decision Tree, ๊ทธ๋ฆฌ๊ณ  ์ตœ๊ทผ์—๋Š” ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ(Neural Networks)์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
    • ์˜ˆ์‹œ: CRF(Conditional Random Fields), Bi-LSTM

์ฝ”๋“œ ์˜ˆ์‹œ

import nltk
from nltk.tokenize import word_tokenize  # needed here; it was only imported in the earlier snippet

nltk.download('averaged_perceptron_tagger')  # POS tagger model (first run only)

sentence = "I love eating chicken."
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)

์ถœ๋ ฅ: [('I', 'PRP'), ('love', 'VBP'), ('eating', 'VBG'), ('chicken', 'NN')]

  4. Named Entity Recognition (NER)

  • ๊ฐ์ฒด๋ช… ์ธ์‹ (Named Entity Recognition, NER) : ๊ฐ์ฒด๋ช… ์ธ์‹์€ ๋ฌธ์žฅ์—์„œ ํŠน์ • ์š”์†Œ๋ฅผ ๋ฏธ๋ฆฌ ์ •์˜๋œ ์นดํ…Œ๊ณ ๋ฆฌ(์˜ˆ: ์‚ฌ๋žŒ, ์žฅ์†Œ, ์กฐ์ง, ๋‚ ์งœ ๋“ฑ)๋กœ ๋ถ„๋ฅ˜ํ•˜๋Š” ์ž‘์—…์ž…๋‹ˆ๋‹ค. ์ด๋Š” ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์—์„œ ์ค‘์š”ํ•œ ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

์ ‘๊ทผ ๋ฐฉ๋ฒ•

  1. ์‚ฌ์ „ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ• (Dictionary / Rule-based Approach)
  • List Lookup: ์‚ฌ์ „์— ์ •์˜๋œ ๋ฆฌ์ŠคํŠธ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ…์ŠคํŠธ์—์„œ ํ•ด๋‹น ๋‹จ์–ด๋ฅผ ๊ฒ€์ƒ‰ํ•ฉ๋‹ˆ๋‹ค.
    • ์žฅ์ : ๊ฐ„๋‹จํ•˜๊ณ  ๋น ๋ฅด๋ฉฐ, ํŠน์ • ์–ธ์–ด์— ํŠนํ™”๋œ ์ฒ˜๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
    • ๋‹จ์ : ๋ฆฌ์ŠคํŠธ์˜ ํฌ๊ธฐ์™€ ์œ ์ง€ ๊ด€๋ฆฌ๊ฐ€ ์–ด๋ ต๊ณ , ์ƒˆ๋กœ์šด ๋‹จ์–ด๋‚˜ ๋ณ€ํ™”์— ๋ฏผ๊ฐํ•ฉ๋‹ˆ๋‹ค.
  • Shallow Parsing Approach: ๊ทผ๊ฑฐ ์žˆ๋Š” ์ฆ๊ฑฐ(evidence)๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ…์ŠคํŠธ์—์„œ ๊ฐ์ฒด๋ฅผ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.
    • ์˜ˆ์‹œ: โ€œWall Streetโ€์—์„œ โ€œStreetโ€๊ณผ ๊ฐ™์€ ๋‹จ์–ด๋ฅผ ํ†ตํ•ด ์ง€๋ช…์„ ์ธ์‹ํ•ฉ๋‹ˆ๋‹ค.
  1. ๋ชจ๋ธ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ• (Model-based Approach)
    • CRF (Conditional Random Fields): ๋ฌธ๋งฅ์„ ๊ณ ๋ คํ•˜์—ฌ ๋‹จ์–ด์˜ ์นดํ…Œ๊ณ ๋ฆฌ๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
    • ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜ ๋ชจ๋ธ (Neural Network-based Models): RNN, CNN ๋“ฑ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
      • ์žฅ์ : ๋ฌธ๋งฅ์„ ์ž˜ ๋ฐ˜์˜ํ•˜์—ฌ ๋†’์€ ์ •ํ™•๋„๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค.
      • ๋‹จ์ : ๋งŽ์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ๊ณ„์‚ฐ ์ž์›์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

์ฝ”๋“œ ์˜ˆ์‹œ

import spacy

# The small English model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii.")

for ent in doc.ents:
    print(ent.text, ent.label_)

์ถœ๋ ฅ: Barack Obama PERSON, Hawaii GPE

  5. Syntax Analysis

๊ตฌ๋ฌธ ๋ถ„์„ (Syntax Analysis) : ๊ตฌ๋ฌธ ๋ถ„์„์€ ๋ฌธ์žฅ์—์„œ ๋‹จ์–ด๋“ค์ด ์–ด๋–ป๊ฒŒ ๊ตฌ์กฐ์ ์œผ๋กœ ์—ฐ๊ฒฐ๋˜์–ด ์žˆ๋Š”์ง€๋ฅผ ๋ถ„์„ํ•˜๋Š” ์ž‘์—…์ž…๋‹ˆ๋‹ค. ์ด๋Š” ๋ฌธ์žฅ์˜ ๋ฌธ๋ฒ•์  ๊ตฌ์กฐ๋ฅผ ์ดํ•ดํ•˜๊ณ  ๊ฐ ๋‹จ์–ด์˜ ๊ด€๊ณ„๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

5.1. Parsing

  • Top-down parsing: starts from the sentence symbol and applies grammar rules downward to derive the structure.
  • Bottom-up parsing: starts from the words and applies grammar rules in reverse, building the structure upward. (Both strategies are contrasted in the sketch below.)
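NLTK happens to ship a parser for each strategy, so the sketch below contrasts them on the same toy grammar; the grammar and sentence are invented for illustration.

import nltk

grammar = nltk.CFG.fromstring("""
  S -> NP VP
  NP -> Det N
  VP -> V NP
  Det -> "the"
  N -> "dog" | "cat"
  V -> "chased"
""")
tokens = "the dog chased the cat".split()

# Top-down: expand from S toward the words
for tree in nltk.RecursiveDescentParser(grammar).parse(tokens):
    print(tree)

# Bottom-up: shift words onto a stack and reduce them into constituents
for tree in nltk.ShiftReduceParser(grammar).parse(tokens):
    print(tree)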

5.2. ๋ชจํ˜ธ์„ฑ (Ambiguity)

๊ตฌ๋ฌธ ๋ถ„์„์—์„œ ์ค‘์š”ํ•œ ๋ฌธ์ œ๋Š” ๋ชจํ˜ธ์„ฑ์„ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

  • ๊ตฌ์กฐ์˜ ๋ชจํ˜ธ์„ฑ (Structural Ambiguity): ๊ฐ™์€ ๋ฌธ์žฅ์ด ์—ฌ๋Ÿฌ ๋ฐฉ์‹์œผ๋กœ ํ•ด์„๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    • ์˜ˆ์‹œ: โ€œJohn saw Mary in the park.โ€ (์กด์ด ๊ณต์›์—์„œ ๋ฉ”๋ฆฌ๋ฅผ ๋ดค๋‹ค / ์กด์ด ๋ฉ”๋ฆฌ๋ฅผ ๋ดค๋Š”๋ฐ, ๋ฉ”๋ฆฌ๋Š” ๊ณต์›์— ์žˆ๋‹ค)
  • ์–ดํœ˜์˜ ๋ชจํ˜ธ์„ฑ (Lexical Ambiguity): ๊ฐ™์€ ๋‹จ์–ด๊ฐ€ ์ƒํ™ฉ์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ํ˜•ํƒœ์†Œ๋กœ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    • ์˜ˆ์‹œ: โ€œTime flies like an arrow.โ€ (์‹œ๊ฐ„์€ ํ™”์‚ด์ฒ˜๋Ÿผ ๋‚ ์•„๊ฐ„๋‹ค / ํŒŒ๋ฆฌ๋Š” ํ™”์‚ด์„ ์ข‹์•„ํ•œ๋‹ค)

์ฝ”๋“œ ์˜ˆ์‹œ

import nltk

# No final period: the toy grammar below has no rule covering punctuation
sentence = "John saw the man with a telescope"
grammar = nltk.CFG.fromstring("""
  S -> NP VP
  VP -> V NP | VP PP
  PP -> P NP
  V -> "saw"
  NP -> "John" | "man" | "telescope" | Det N | NP PP
  Det -> "a" | "the"
  N -> "man" | "telescope"
  P -> "with"
""")
parser = nltk.ChartParser(grammar)
# Both PP attachments are grammatical, so two trees are printed
for tree in parser.parse(sentence.split()):
    print(tree)
  1. ์–ธ์–ด ๋ชจ๋ธ๋ง (Language Modeling)

ํ™•๋ฅ ์  ์–ธ์–ด ๋ชจ๋ธ (Probabilistic Language Modeling) : ์–ธ์–ด ๋ชจ๋ธ๋ง์€ ์ฃผ์–ด์ง„ ๋ฌธ์žฅ์ด ์‹ค์ œ ๋ฌธ๋ฒ•์ ์œผ๋กœ ํƒ€๋‹นํ•œ์ง€ ํ‰๊ฐ€ํ•˜๋Š” ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ์ด๋Š” ๋ฌธ์žฅ์˜ ๋ฌธ๋ฒ•์  ๊ตฌ์กฐ์™€ ๋‹จ์–ด ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

์ ‘๊ทผ ๋ฐฉ๋ฒ•

  1. N-๊ทธ๋žจ ๋ชจ๋ธ (N-gram Models)
  • ๋‹จ์–ด์˜ ์—ฐ์†์ ์ธ ํŒจํ„ด์„ ๋ถ„์„ํ•˜์—ฌ ๋‹ค์Œ ๋‹จ์–ด์˜ ํ™•๋ฅ ์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
  • Unigram: ๋‹จ์–ด ํ•˜๋‚˜์˜ ํ™•๋ฅ 
  • Bigram: ๋‘ ๋‹จ์–ด์˜ ์—ฐ์† ํ™•๋ฅ 
  • Trigram: ์„ธ ๋‹จ์–ด์˜ ์—ฐ์† ํ™•๋ฅ 
  1. ์‹ ๊ฒฝ๋ง ๊ธฐ๋ฐ˜ ๋ชจ๋ธ (Neural Network-based Models)
    • RNN, LSTM, Transformer์™€ ๊ฐ™์€ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
    • BERT, GPT-2: ์‚ฌ์ „ ํ•™์Šต๋œ ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ๋กœ, ๋ฌธ๋งฅ์„ ์ž˜ ๋ฐ˜์˜ํ•˜์—ฌ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค.
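To make the n-gram idea concrete, the sketch below estimates bigram probabilities by simple counting; the one-line corpus is a toy assumption.

from collections import Counter, defaultdict

# Toy corpus; a real model is trained on a large text collection
corpus = "i love nlp . i love language models .".split()

# Count how often each word follows each preceding word
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def bigram_prob(prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

print(bigram_prob("i", "love"))    # 1.0 in this toy corpus
print(bigram_prob("love", "nlp"))  # 0.5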

์‘์šฉ ๋ถ„์•ผ

  • ๊ธฐ๊ณ„ ๋ฒˆ์—ญ (Machine Translation)
  • ์ฒ ์ž ๊ต์ • (Spell Correction)
  • ์Œ์„ฑ ์ธ์‹ (Speech Recognition)
  • ์š”์•ฝ (Summarization)
  • ์งˆ์˜ ์‘๋‹ต ์‹œ์Šคํ…œ (Question Answering)

์ฝ”๋“œ ์˜ˆ์‹œ

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_text = "Natural language processing is"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
# Greedy decoding: extend the prompt up to a total of 50 tokens
output = model.generate(input_ids, max_length=50, num_return_sequences=1)

print(tokenizer.decode(output[0], skip_special_tokens=True))

