[NLP] 6. What Is Topic Modeling?

Posted by Euisuk's Dev Log on January 28, 2025


Original post: https://velog.io/@euisuk-chung/NLP-6.-Topic-Modeling์ด๋ž€

1. What Is Topic Modeling?

๋ณธ ๊ฐ•์˜๋Š” DSBA ๊ฐ•ํ•„์„ฑ ๊ต์ˆ˜๋‹˜์˜ ๊ฐ•์˜๋ฅผ ์ฐธ์กฐํ•˜์—ฌ ์ž‘์„ฑ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Topic Modeling์€ ๊ธฐ๊ณ„ ํ•™์Šต ๋ฐ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ถ„์•ผ์—์„œ ๋ฌธ์„œ ์ง‘ํ•ฉ ๋‚ด์—์„œ ์ž ์žฌ์ ์ธ ์ฃผ์ œ(Latent Topic)๋ฅผ ๋ฐœ๊ฒฌํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•˜๋Š” ํ†ต๊ณ„์  ๋ชจ๋ธ๋ง ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค.

  • ์ฃผ์–ด์ง„ ๋ฌธ์„œ์—์„œ ๋ฐ˜๋ณต์ ์œผ๋กœ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด ํŒจํ„ด์„ ๋ถ„์„ํ•˜์—ฌ ๋ฌธ์„œ๋ฅผ ํŠน์ • ์ฃผ์ œ๋กœ ๋ถ„๋ฅ˜ํ•˜๋Š” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค.

Topic Modeling์„ ํ™œ์šฉํ•˜๋ฉด ๋ฐฉ๋Œ€ํ•œ ๋ฌธ์„œ ๋ฐ์ดํ„ฐ์—์„œ ์ž ์žฌ์ ์ธ ์˜๋ฏธ ๊ตฌ์กฐ(Latent Structure)๋ฅผ ์ž๋™์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • ํŠนํžˆ, ๋น„์ง€๋„ ํ•™์Šต(Unsupervised Learning) ๊ธฐ๋ฐ˜์œผ๋กœ ์ž‘๋™ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์‚ฌ์ „์— ๋ ˆ์ด๋ธ”๋ง๋œ ๋ฐ์ดํ„ฐ ์—†์ด๋„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด๋Ÿฌํ•œ ํŠน์„ฑ์œผ๋กœ ์ธํ•ด ๋ฐ์ดํ„ฐ ๋ ˆ์ด๋ธ”์ด ๋ถ€์กฑํ•œ ์ƒํ™ฉ์—์„œ ์œ ์šฉํ•˜๊ฒŒ ํ™œ์šฉ๋  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋ฌธ์„œ ๋ถ„๋ฅ˜, ์ถ”์ฒœ ์‹œ์Šคํ…œ, ์ •๋ณด ๊ฒ€์ƒ‰ ๋“ฑ์˜ ๋‹ค์–‘ํ•œ ์‘์šฉ ๋ถ„์•ผ์—์„œ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค.

์ฃผ์š” ํŠน์ง•

  • ๋ฌธ์„œ ๋‚ด์—์„œ ์ฃผ์š” ์ฃผ์ œ๋ฅผ ์ถ”์ถœํ•˜๋Š” ๊ธฐ๋ฒ•
  • ๋น„์ง€๋„ ํ•™์Šต(Unsupervised Learning)์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ž‘๋™
  • ๋‹จ์–ด ๊ฐ„์˜ ์ถœํ˜„ ๋นˆ๋„์™€ ๊ด€๊ณ„๋ฅผ ๋ถ„์„ํ•˜์—ฌ ์ฃผ์ œ ๋ถ„ํฌ(Topic Distribution)๋ฅผ ๋„์ถœ
  • ๋‰ด์Šค ๊ธฐ์‚ฌ, ์—ฐ๊ตฌ ๋…ผ๋ฌธ, ๊ณ ๊ฐ ๋ฆฌ๋ทฐ ๋“ฑ์˜ ๋Œ€๋Ÿ‰์˜ ๋ฌธ์„œ ๋ฐ์ดํ„ฐ์—์„œ ์ฃผ์š” ์ฃผ์ œ๋ฅผ ์ž๋™์œผ๋กœ ๋ถ„๋ฅ˜ํ•˜๋Š” ๋ฐ ํ™œ์šฉ๋จ
  • ๋ฌธ์„œ ๋‚ด์˜ ๋‹จ์–ด ์ถœํ˜„ ํŒจํ„ด์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์˜๋ฏธ์  ๊ตฌ์กฐ๋ฅผ ์ฐพ์•„๋‚ด์–ด ์ˆจ๊ฒจ์ง„ ๊ด€๊ณ„๋ฅผ ๋“œ๋Ÿฌ๋‚ผ ์ˆ˜ ์žˆ์Œ

2. Approaches to Topic Modeling

There are various approaches to performing Topic Modeling.

  • They can be broadly divided into matrix factorization-based approaches and probabilistic model-based approaches.
  1. Matrix factorization-based approach

    • Uses dimensionality-reduction techniques based on Singular Value Decomposition (SVD)
    • Interprets the relationship between words and documents in a low-dimensional space
    • Examples
      • Latent Semantic Analysis (LSA)
      • Non-negative Matrix Factorization (NMF)
  2. Probabilistic model-based approach

    • Probabilistic models in which documents and words follow probabilistic topic distributions
    • Learns each document's topic distribution via Bayesian inference
    • Examples
      • Probabilistic Latent Semantic Analysis (pLSA)
      • Latent Dirichlet Allocation (LDA)

2.1 Matrix Factorization-Based Approach

  • Definition:

    • The matrix factorization approach transforms the term-document matrix into a low-dimensional latent semantic space through matrix factorization techniques.
  • Core ideas:

    • Compress the high-dimensional term-document matrix into a low-dimensional latent semantic space to extract topics
    • Factorize the matrix with techniques such as SVD or NMF and learn topics from the resulting factors

      • SVD: Singular Value Decomposition
      • NMF: Non-negative Matrix Factorization
    • Analyze the relationship between words and documents numerically within a vector space
  • Main techniques:

    1. Latent Semantic Analysis (LSA)

      • Factorizes the term-document matrix using Singular Value Decomposition (SVD)
      • Maps words and documents into a low-dimensional vector space to analyze semantic similarity
      • Implicitly assumes Gaussian noise, which makes a probabilistic interpretation difficult
    2. Non-negative Matrix Factorization (NMF)

      • Factorizes the term-document matrix into non-negative matrices for interpretation
      • Since every element is at least 0, the result is intuitive to interpret and well suited to extracting meaningful topics

(Note) SVD and NMF can also be regarded as dimensionality-reduction techniques, since both decompose a matrix.

  • There is a good reference related to this that I'd like to share; the topic will be covered in more detail below.
  • Advantages:

    ✔ Relatively fast to compute and easy to implement

    ✔ Removes noise and compresses meaning, making it well suited to document-similarity analysis

    ✔ Dimensionality reduction enables visual analysis of word-document relationships

  • Disadvantages:

    ✖ Not a probabilistic approach, so uncertainty is hard to handle

    ✖ The implicit Gaussian assumption may not reflect the actual data distribution

    ✖ Adding a new document requires retraining the existing model


2.2 Probabilistic Model-Based Approach

  • Definition:

    • The probabilistic model approach infers the topic distribution of the words in a document under two assumptions: (1) each document is composed of several latent topics, and (2) each topic follows a probability distribution over specific words.
  • Core ideas:

    • A document consists of several topics, and each topic has a distribution over particular words
    • The topic distribution is estimated from the probability that each word is generated in the given document
    • Probabilistic topic modeling is carried out using Bayesian inference
  • Main techniques:

    1. Probabilistic Latent Semantic Analysis (pLSA)

      • Explains the word-occurrence probabilities of a document through latent topic distributions
        • Uses a probabilistic model to learn the document-topic distribution $P(z \mid d)$ and the topic-word distribution $P(w \mid z)$
        • Estimates the parameters with the EM (Expectation-Maximization) algorithm
    2. Latent Dirichlet Allocation (LDA)

      • A Bayesian probabilistic model that improves on the limitations of pLSA
      • Samples the document-topic distribution ($\theta$) and topic-word distribution ($\phi$) from Dirichlet distributions
      • Estimates the probabilities using Gibbs sampling or variational inference
  • Advantages:

    ✔ Can learn a probabilistic topic distribution within each document

    ✔ New documents can be handled using the pre-trained topic distributions, without retraining the whole model

    ✔ A Bayesian approach reflects the uncertainty in the data more faithfully

  • Disadvantages:

    ✖ Training is relatively complex and computationally expensive

    ✖ Requires Gibbs sampling or variational inference, so training can be slow on large datasets

    ✖ The number of topics (K) must be set in advance, and a poor choice degrades the model

📌 Matrix Factorization vs. Probabilistic Model: Comparison

| Aspect | Matrix Factorization | Probabilistic Model |
| --- | --- | --- |
| Core idea | Compress the term-document matrix into a low-dimensional space | Documents consist of several topics, and each topic has a probability distribution over words |
| Main techniques | LSA (SVD), NMF | pLSA, LDA |
| Interpretation | Linear algebra | Probability |
| Probabilistic interpretation | ❌ (not a probabilistic model) | ✅ (probabilistic model) |
| Handling new documents | Requires retraining the model | Can reuse the trained model |
| Computational cost | Relatively low | Relatively high |
| Use cases | Search engines, recommender systems, document-similarity analysis | Document classification, topic exploration, trend analysis |
| Weakness | Hard to interpret probabilistically | High computational cost, complex training |

(Note) A Naïve Approach (MLE, Maximum Likelihood Estimation)

A way of estimating the probability that a word $w$ appears in a particular document $d$ via MLE (maximum likelihood estimation).

  • MLE can be seen as the simplest member of the probabilistic family of approaches.
    • Unlike LDA or pLSA, however, it has no notion of a latent variable; it simply computes probabilities from per-document word frequencies, which sets it apart from topic modeling proper.
  • Idea:
    • Compute the probability of a word occurring in a particular document directly from its raw frequency in that document.

  • Limitations:

    • MLE is an intuitive frequency-based estimator, but it can generalize poorly.
    • A word absent from a document gets probability 0, and the data-sparsity problem is hard to overcome.
  • Remedy:

    • When the zero-frequency problem arises, smoothing techniques are needed to compensate (a minimal sketch follows the table below).

โ“ Zero-Frequency Problem (๋นˆ๋„ 0 ๋ฌธ์ œ)์ด๋ž€?

Zero-Frequency Problem์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ์—†๋Š” ๋‹จ์–ด๋‚˜ ๋‹จ์–ด ์กฐํ•ฉ์ด ๋“ฑ์žฅํ–ˆ์„ ๋•Œ ํ™•๋ฅ ์ด 0์ด ๋˜์–ด ๋ชจ๋ธ์ด ์ผ๋ฐ˜ํ™”๋˜์ง€ ๋ชปํ•˜๋Š” ๋ฌธ์ œ์ž…๋‹ˆ๋‹ค.

  • ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ์Šค๋ฌด๋”ฉ(Smoothing) ๊ธฐ๋ฒ•์ด ์‚ฌ์šฉ๋˜๋ฉฐ, ๋ฐ์ดํ„ฐ์˜ ํฌ์†Œ์„ฑ ์ •๋„์— ๋”ฐ๋ผ ์ ์ ˆํ•œ ๋ฐฉ๋ฒ•์„ ์„ ํƒํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
    • ๊ฐ„๋‹จํ•œ ๋ฌธ์ œ ํ•ด๊ฒฐ์—๋Š” Add-k ์Šค๋ฌด๋”ฉ(Laplace)์ด ์ ์ ˆํ•ฉ๋‹ˆ๋‹ค.
    • ๋ฐ์ดํ„ฐ ํฌ์†Œ์„ฑ์ด ๋†’์€ ๊ฒฝ์šฐ์—๋Š” Good-Turing ์Šค๋ฌด๋”ฉ์ด ํšจ๊ณผ์ ์ž…๋‹ˆ๋‹ค.
    • ์‹ค์ œ ์–ธ์–ด ๋ชจ๋ธ์—์„œ ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ์ข‹์€ ๋ฐฉ๋ฒ•์€ Kneser-Ney ์Šค๋ฌด๋”ฉ์ž…๋‹ˆ๋‹ค.
  • ๋”ฐ๋ผ์„œ, ์Šค๋ฌด๋”ฉ ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜๋ฉด ํฌ์†Œํ•œ ๋ฐ์ดํ„ฐ์—์„œ๋„ ์–ธ์–ด ๋ชจ๋ธ์ด ๋ณด๋‹ค ์ผ๋ฐ˜ํ™”๋œ ํ™•๋ฅ ์„ ์ถ”์ •ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๊ฒ€์ƒ‰, ๋ฒˆ์—ญ, ํ…์ŠคํŠธ ์ƒ์„ฑ ๋“ฑ ๋‹ค์–‘ํ•œ NLP ์ž‘์—…์—์„œ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
์Šค๋ฌด๋”ฉ ๊ธฐ๋ฒ• ํŠน์ง• ์žฅ์  ๋‹จ์ 
Additive Smoothing (Add-k) ๋ชจ๋“  ๋นˆ๋„์— ์ž‘์€ ๊ฐ’์„ ์ถ”๊ฐ€ ๊ตฌํ˜„์ด ๊ฐ„๋‹จ ๊ณผ๋„ํ•˜๊ฒŒ ๊ท ๋“ฑํ•œ ํ™•๋ฅ  ๋ถ„ํฌ
Good-Turing Smoothing ๋นˆ๋„ 0์ธ ๋‹จ์–ด๋ฅผ ๋‚ฎ์€ ๋นˆ๋„๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ถ”์ • ํฌ์†Œํ•œ ๋ฐ์ดํ„ฐ์— ์ ํ•ฉ ๋†’์€ ๋นˆ๋„์˜ ๋‹จ์–ด์— ์ ์šฉ์ด ์–ด๋ ค์›€
Kneser-Ney Smoothing ๋ฌธ๋งฅ ๋‚ด ์‚ฌ์šฉ ๋นˆ๋„๋ฅผ ๊ณ ๋ ค ๊ฐ€์žฅ ํšจ๊ณผ์ ์ด๊ณ  ์ •ํ™•๋„ ๋†’์Œ ๊ตฌํ˜„์ด ๋ณต์žก
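As a minimal illustration of the zero-frequency problem and the add-k remedy (the toy corpus and function names below are made up for this post, not the author's code):

from collections import Counter

# Toy tokenized corpus (hypothetical data)
docs = [["ai", "stock", "market", "ai"], ["delivery", "delay", "refund"]]
vocab = {w for d in docs for w in d}
V = len(vocab)

def p_mle(word, doc):
    # MLE: the word's relative frequency within the document
    return Counter(doc)[word] / len(doc)

def p_add_k(word, doc, k=1.0):
    # Add-k (Laplace) smoothing: every vocabulary word gets k pseudo-counts,
    # so unseen words no longer receive probability 0
    return (Counter(doc)[word] + k) / (len(doc) + k * V)

print(p_mle("ai", docs[0]))        # 0.5
print(p_mle("refund", docs[0]))    # 0.0 -> zero-frequency problem
print(p_add_k("refund", docs[0]))  # 0.1 -> small but non-zero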

์œ„์—์„œ Topic Modeling์„ ๋ถ„๋ฅ˜ํ•ด๋ดค์œผ๋‹ˆ ์ด์ œ ๊ฐ๊ฐ์˜ ์ปจ์…‰์— ๋Œ€ํ•ด์„œ ์ž์„ธํ•˜๊ฒŒ ์‚ดํŽด๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.


3. Latent Semantic Analysis (LSA)

3.1. Latent Structure

3.1.1 Concept

  • When data is represented as a matrix, the matrix may be too large or too complex to analyze, or no clear structure may be apparent.

    • A way to interpret such data more easily is needed.
  • Finding the latent structure means uncovering the patterns or structure hidden inside the data.

    • This yields a simpler, more meaningful representation.

3.1.2 Problems

  • The data matrix is too large.
  • The data is too complicated.
  • The data lacks clear structure.
  • It may contain missing or noisy entries.

3.1.3 Remedy: Find the Latent Structure

  • Can the data be represented more simply?
  • Is there a latent structure hidden in the data?
  • How can such structure be found?

3.2. Matrix Decomposition

3.2.1 Concept

  • Decomposing a data matrix into smaller matrices that keep only the important structure of the original data.
  • Matrix decomposition yields a lower-dimensional representation of the original data and uncovers meaningful patterns.

3.2.2 Mathematical Formulation

$A \approx \hat{A} = L \cdot R$

  • $A$: the original data matrix
  • $\hat{A}$: the approximated matrix
  • $L$: the left factor ($n \times q$) → relates documents to the latent variables (topics)
  • $R$: the right factor ($q \times m$) → relates the latent variables (topics) to words
  • $q$: the latent dimension → the number of topics

3.2.3 Properties

  • Shrinks the data → reduces computation.
  • Extracts the latent structure → useful for analyzing patterns.
  • Each factor reflects the latent topics → yields a meaningful representation of the data (see the NumPy sketch below).
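For example, the factorization $A \approx L \cdot R$ can be sketched in NumPy by building the two factors from a rank-$q$ truncated SVD (all values below are toy data, just to show the shapes involved):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))   # original n x m data matrix (toy values)
q = 2                    # latent dimension (number of "topics")

U, s, Vt = np.linalg.svd(A, full_matrices=False)
L = U[:, :q] * s[:q]     # left factor,  n x q
R = Vt[:q, :]            # right factor, q x m
A_hat = L @ R            # rank-q approximation of A

print(np.linalg.norm(A - A_hat))  # reconstruction error of the approximation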

3.3. LSA (Latent Semantic Analysis)

3.3.1 Concept

  • LSA uses matrix decomposition to analyze the relationships between documents and words in a latent semantic space.
    • For topic modeling, LSA applies Singular Value Decomposition (SVD) to map the term-document matrix into a low-dimensional space.

❓ What Is Singular Value Decomposition (SVD)?

  • Singular Value Decomposition (SVD) is a linear-algebra technique that factorizes a matrix into the product of two orthogonal matrices and a diagonal matrix.
    • It maps a high-dimensional matrix into a low-dimensional latent semantic space and is widely used for dimensionality reduction, noise removal, and data compression.
  • The core ideas of SVD
    • Decompose the matrix into smaller, meaningful components to extract its main patterns
    • Reduce the dimensionality of the data while keeping only the important information
    • Use the singular values to learn the matrix's most important structure

💌 The SVD Equation

SVD decomposes the input matrix $A$ into three matrices, each playing a different role:

$A = U \Sigma V^T$

💬 Notation

  • $A$ (input matrix, $m \times n$)

    • The original matrix containing the data
    • Its structure captures the relationships between rows and columns
  • $U$ (left orthogonal matrix, $m \times m$)

    • Represents the principal components of the data along the column direction of $A$
      • That is, the column vectors of $U$ form a new orthonormal basis for expressing the columns of $A$
    • Each row is the corresponding row of $A$ expressed in the new basis
    • Each column is a vector capturing a major feature of $A$ (a principal component of $A$'s columns)
      • e.g., in face recognition, a vector representing a particular facial feature
  • $\Sigma$ (diagonal matrix, $m \times n$)

    • Contains the important information of $A$: the singular values
    • Only the diagonal entries are non-zero, and each singular value measures the strength of one of $A$'s principal patterns
    • Larger singular values correspond to more important information
    • Small singular values correspond to noise or less important information
      • e.g., in a compression task, dropping the small singular values keeps only the important information
  • $V^T$ (right orthogonal matrix, $n \times n$)

    • Represents the principal components of the data along the row direction of $A$
      • That is, the row vectors of $V^T$ express the rows of $A$ in a new orthogonal basis
    • Each column is the corresponding column of $A$ expressed in the new basis
    • Each row is a vector capturing a major feature of $A$ (a principal component of $A$'s rows)
      • e.g., in document analysis, a vector representing a topic
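To make the notation concrete, here is a quick NumPy check (on random toy data, not part of the post's example) that the three factors behave as described:

import numpy as np

rng = np.random.default_rng(1)
A = rng.random((4, 3))                    # toy m x n matrix

U, s, Vt = np.linalg.svd(A)               # full SVD: A = U @ Sigma @ V^T
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)                # singular values on the diagonal, descending

print(np.allclose(A, U @ Sigma @ Vt))     # True: exact reconstruction
print(np.allclose(U.T @ U, np.eye(4)))    # True: U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(3)))  # True: V is orthogonal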

(Reference) 👀 A Geometric Interpretation of SVD

📌 SVD essentially interprets the linear map $A$ as a sequence of "rotation → scaling → rotation":

$A = U \Sigma V^T$

  • 1️⃣ $V^T$: basis change
    • Transforms the data (matrix $A$) into an appropriate orthogonal coordinate system
    • That is, it maps the original coordinate system to a new one
  • 2️⃣ $\Sigma$: scaling along the principal axes
    • Stretches or shrinks the data along each coordinate axis
    • That is, it rescales the data along the important (principal) directions
  • 3️⃣ $U$: final rotation back to the original space
    • Rotates the transformed data back into the original coordinate system
  • 💡 In short, SVD transforms the data into a well-aligned (orthogonal) coordinate system, rescales it along the principal axes, and then maps it back to the original space.

์‚ด์ง ๊ฐœ๋…์ ์ธ ์ด์•ผ๊ธฐ๋ฅผ ํ•˜๋‹ค๊ฐ€ ์˜†์œผ๋กœ ๋น ์กŒ๋Š”๋ฐ ๋‹ค์‹œ LSA์— ๋Œ€ํ•ด์„œ ์‚ดํŽด๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

  • LSA๋Š” ๋ฌธ์„œ-๋‹จ์–ด ํ–‰๋ ฌ(document-term matrix)์— SVD๋ฅผ ์ ์šฉํ•˜์—ฌ ์ˆจ๊ฒจ์ง„ ์˜๋ฏธ(latent semantics)๋ฅผ ์ถ”์ถœํ•˜๋Š” ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค.
  • LSA๋ฅผ ํ†ตํ•ด ์•„๋ž˜์™€ ๊ฐ™์€ ๊ฒƒ๋“ค์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

    • ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์น˜ํ™”ํ•˜๊ณ , ๋ฌธ์„œ-๋‹จ์–ด ํ–‰๋ ฌ๋กœ ํ‘œํ˜„
    • SVD๋ฅผ ์ ์šฉํ•˜์—ฌ ์ค‘์š”ํ•œ ์˜๋ฏธ๋งŒ ๋‚จ๊ธฐ๊ณ , ๋…ธ์ด์ฆˆ ์ œ๊ฑฐ
    • ์ฐจ์› ์ถ•์†Œ๋ฅผ ํ†ตํ•ด ๋ฌธ์„œ ๊ฐ„ ์˜๋ฏธ์  ์œ ์‚ฌ๋„๋ฅผ ํŒŒ์•…

3.3.2 Mathematical Formulation of LSA

Expressed as a matrix factorization (truncated SVD), LSA is:

$A \approx U_k \Sigma_k V_k^T$

  • $A$ (term-document matrix): the input data
  • $U_k$ (left singular vectors): the relationship between words and topics
  • $\Sigma_k$ (singular values): the importance of each topic
  • $V_k^T$ (right singular vectors): the relationship between documents and topics

3.3.3 The Main Steps of LSA

Step 1: Factorize the original matrix $A$ with SVD

$A \approx U_k \Sigma_k V_k^T$

  • Here $k$ is the chosen number of latent dimensions

    • $k$ is usually set smaller than the original dimension so that only the meaningful information is kept.
  • $U_k, \Sigma_k, V_k^T$ are used to approximate the original matrix.

Step 2: Analyze the data using the approximated matrix $A_k$

$U_k^T A_k = U_k^T U_k \Sigma_k V_k^T = I\, \Sigma_k V_k^T = \Sigma_k V_k^T$

  • This step reduces the dimensionality while preserving the important (latent) meaning of the data; a quick numerical check follows below.
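A sketch of that identity on random data (illustrative only, since $U_k^T U_k = I$ for orthonormal columns):

import numpy as np

rng = np.random.default_rng(2)
A = rng.random((8, 6))                  # toy term-document matrix
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vkt = U[:, :k], np.diag(s[:k]), Vt[:k, :]
A_k = Uk @ Sk @ Vkt                     # rank-k approximation

# Projecting A_k with U_k^T leaves exactly Sigma_k V_k^T:
print(np.allclose(Uk.T @ A_k, Sk @ Vkt))  # True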

Step 3: Use the factors for analysis

The matrices produced by LSA can each be used for data analysis.

  • The factors from LSA ($U_k$, $\Sigma_k$, $V_k^T$) can be used in various ways; representative applications are the following.

    Image Source : https://medium.com/analytics-vidhya/what-is-topic-modeling-161a76143cae

1๏ธโƒฃ UkU_kUkโ€‹: โ€œํ† ํ”ฝ๋ณ„ ์ฃผ์š” ๋‹จ์–ด ์ฐพ๊ธฐโ€

๐Ÿ“Œ ๋ชฉ์ : ๊ฐ ํ† ํ”ฝ(์ž ์žฌ ์˜๋ฏธ)์„ ๋Œ€ํ‘œํ•˜๋Š” ๋‹จ์–ด๋ฅผ ์ถ”์ถœํ•˜์—ฌ ์ฃผ์ œ๋ฅผ ํ•ด์„ํ•˜๊ธฐ ์œ„ํ•จ.

๐Ÿค” ์„ค๋ช…: UkU_kUkโ€‹๋Š” ๋‹จ์–ด์™€ ํ† ํ”ฝ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํ–‰๋ ฌ๋กœ, ํŠน์ • ์ฃผ์ œ๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ์ค‘์š”ํ•œ ๋‹จ์–ด๋“ค์„ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • ์ด๋ฅผ ํ™œ์šฉํ•˜๋ฉด ํ† ํ”ฝ ๋ชจ๋ธ๋ง์„ ํ†ตํ•ด ๊ฐ ์ฃผ์ œ๋ฅผ ๋Œ€ํ‘œํ•˜๋Š” ๋‹จ์–ด๋ฅผ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ“Š ํ™œ์šฉ ์˜ˆ์‹œ:

  • ๋‰ด์Šค ๊ธฐ์‚ฌ์—์„œ ์ฃผ์ œ๋ณ„ ํ•ต์‹ฌ ํ‚ค์›Œ๋“œ ์ถ”์ถœ
  • ๋…ผ๋ฌธ, ๋ฆฌ๋ทฐ ๋ฐ์ดํ„ฐ์—์„œ ์ฃผ์š” ํ† ํ”ฝ์„ ๋ถ„์„

2๏ธโƒฃ VkV_kVkโ€‹: โ€œ๋ฌธ์„œ ๊ฐ„ ์œ ์‚ฌ๋„ ๊ณ„์‚ฐโ€

๐Ÿ“Œ ๋ชฉ์ : ๋ฌธ์„œ ๊ฐ„์˜ ์˜๋ฏธ์  ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ๋น„์Šทํ•œ ๋ฌธ์„œ๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•จ.

๐Ÿค” ์„ค๋ช…: VkV_kVkโ€‹๋Š” ๋ฌธ์„œ์™€ ํ† ํ”ฝ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ํ–‰๋ ฌ์ž…๋‹ˆ๋‹ค.

  • ์ด๋ฅผ ํ™œ์šฉํ•˜๋ฉด ๋ฌธ์„œ ๊ฐ„์˜ ์˜๋ฏธ์  ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ๋น„์Šทํ•œ ๋ฌธ์„œ๋ฅผ ์ฐพ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ“Š ํ™œ์šฉ ์˜ˆ์‹œ:

  • ๋‰ด์Šค ๊ธฐ์‚ฌ ์ค‘ ๋น„์Šทํ•œ ๋‚ด์šฉ์„ ๋‹ค๋ฃฌ ๋ฌธ์„œ ์ถ”์ฒœ
  • ๋…ผ๋ฌธ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์—์„œ ์œ ์‚ฌํ•œ ์—ฐ๊ตฌ ๋…ผ๋ฌธ ์ถ”์ฒœ
  • ๊ณ ๊ฐ ๋ฆฌ๋ทฐ์—์„œ ๋น„์Šทํ•œ ์˜๊ฒฌ์„ ๊ฐ€์ง„ ๋ฆฌ๋ทฐ ๊ตฐ์ง‘ํ™”

3๏ธโƒฃ VkV_kVkโ€‹: โ€œ์˜๋ฏธ ๊ธฐ๋ฐ˜ ๊ฒ€์ƒ‰ (Query Expansion)โ€

๐Ÿ“Œ ๋ชฉ์ : ๋‹จ์ˆœ ํ‚ค์›Œ๋“œ ๋งค์นญ์ด ์•„๋‹Œ, ์˜๋ฏธ์ ์œผ๋กœ ์—ฐ๊ด€๋œ ๋ฌธ์„œ๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•จ.

๐Ÿค” ์„ค๋ช…: VkV_kVkโ€‹๋ฅผ ํ™œ์šฉํ•˜๋ฉด ๋‹จ์ˆœ ํ‚ค์›Œ๋“œ ๋งค์นญ์ด ์•„๋‹Œ, ์˜๋ฏธ์ ์œผ๋กœ ๊ด€๋ จ ์žˆ๋Š” ๋ฌธ์„œ๋ฅผ ๊ฒ€์ƒ‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • ์ด๋Š” ์ „ํ†ต์ ์ธ ๊ฒ€์ƒ‰ ์—”์ง„๊ณผ ๋‹ฌ๋ฆฌ ์˜๋ฏธ์  ์œ ์‚ฌ์„ฑ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ํ™•์žฅํ•˜๋Š” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ“Š ํ™œ์šฉ ์˜ˆ์‹œ:

  • ์‚ฌ์šฉ์ž๊ฐ€ โ€œAI ์ฃผ์‹ ์‹œ์žฅโ€์„ ๊ฒ€์ƒ‰ํ•˜๋ฉด โ€œ์—”๋น„๋””์•„ ์ฃผ๊ฐ€ ํ•˜๋ฝโ€, โ€œ๋ฐ˜๋„์ฒด ์ฃผ์‹ ๋ณ€๋™โ€ ๊ฐ™์€ ๋ฌธ์„œ ์ถ”์ฒœ
  • ๊ณ ๊ฐ์ด โ€œ๋ฐฐ์†ก ๋Šฆ์Œโ€์„ ๊ฒ€์ƒ‰ํ•˜๋ฉด โ€œ๋ฐฐ์†ก ์ง€์—ฐ ๋ณด์ƒ ์ •์ฑ…โ€ ๋ฌธ์„œ ์ถ”์ฒœ
  • ๋…ผ๋ฌธ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์—์„œ โ€œ๊ฐ•ํ™” ํ•™์Šตโ€์„ ๊ฒ€์ƒ‰ํ•˜๋ฉด โ€œ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ๊ฐ•ํ™” ํ•™์Šตโ€ ๋…ผ๋ฌธ๋„ ์ถ”์ฒœ

3.4 Characteristics of LSA

✅ Advantages:

  • Dimensionality reduction removes noise and keeps only the important semantic structure.
  • Finds semantic similarity between words (e.g., synonyms end up close together in the latent space).
  • Applicable to search, recommender systems, document classification, and more.

❌ Limitations:

  • Not a probabilistic model, so the results are hard to interpret (often contrasted with the probabilistic LDA).
  • Generalizes poorly to new documents → the model must be re-trained when the training data changes.
  • Implicitly assumes Gaussian noise in the data → this may not match the distribution of real language data.

3.5 LSA in Practice

The following is a hands-on LSA example.

  • It takes recent real news sentences and runs LSA on them.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Prepare the news-article data (example documents)

documents = [
    "중국의 저비용 인공지능(AI) 딥시크(DeepSeek)의 등장에 엔비디아 등 미국의 인공지능(AI) 관련 빅테크 기업들이 흔들리고 있다.",
    "딥시크 등장에 기존 인공지능 기업들의 경쟁력이 의심받으며 최악의 주가 폭락이 일어났다.",
    "중국이 값싸고 뛰어난 성능의 인공지능을 개발함으로써, 이 분야에서 미국을 앞설 수 있다는 분석도 나온다.",
    "27일 미국 증시에서는 챗지피티(ChatGPT) 등 생성형 인공지능 출시 이후 증시에서 최대 빅테크 기업으로 성장한 엔비디아가 무려 17% 폭락해, 5890억달러가 증발됐다.",
    "엔비디아 등 미 증시에서 비중이 큰 빅테크 기업들이 일제히 폭락하며 나스닥 지수는 3.1%, 에스앤피(S&P)500 지수는 1.5%나 떨어졌다.",
    "하지만, 빅테크 기업이 편입되지 않은 다우존스 지수는 0.7% 올랐다.",
    "특히, 인공지능 및 반도체 관련주로 구성된 필라델피아 반도체지수는 9.15%나 폭락해, 지난해 9월3일 7.75% 이후 최대로 떨어졌다.",
    "이 지수가 9% 이상 폭락하기는 코로나19 충격이 가해졌던 지난 2020년 3월18일 이후 처음이다."
]

# 2. TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# 3. Apply LSA (dimensionality reduction via truncated SVD)
num_topics = 2  # extract 2 main topics
svd = TruncatedSVD(n_components=num_topics)
X_lsa = svd.fit_transform(X)

# 4. Word-topic matrix (U_k)
terms = vectorizer.get_feature_names_out()
word_topic_matrix = pd.DataFrame(svd.components_.T, index=terms, columns=[f"Topic {i+1}" for i in range(num_topics)])

# 5. Document-topic matrix (V_k)
document_topic_matrix = pd.DataFrame(X_lsa, index=[f"Doc {i+1}" for i in range(len(documents))], columns=[f"Topic {i+1}" for i in range(num_topics)])

# 6. Top words per topic (print the 10 most important words)
print("📌 LSA 토픽 별 주요 단어")
for i in range(num_topics):
    print(f"\n🔹 Topic {i+1}")
    print(word_topic_matrix[f"Topic {i+1}"].abs().sort_values(ascending=False).head(10))

# 7. Visualize the document-topic distribution
plt.figure(figsize=(10, 6))
sns.heatmap(document_topic_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("문서별 토픽 분포 (LSA 기반)")
plt.xlabel("토픽")
plt.ylabel("문서")
plt.show()

# 8. Print the results
print("\n📌 LSA 단어-토픽 행렬 (U_k)")
print(word_topic_matrix)

print("\n📌 LSA 문서-토픽 행렬 (V_k)")
print(document_topic_matrix)

Code walkthrough

  1. TF-IDF vectorization converts the documents into numeric vectors (building A)

  2. Truncated SVD reduces the dimensionality (k = 2)

  3. The top words per topic are read off the word-topic matrix (U_k)

  4. The document-topic distribution (V_k) is visualized as a heatmap

๋ถ„์„ ์ˆ˜ํ–‰

(1) UkU_kUkโ€‹, ฮฃk\Sigma_kฮฃkโ€‹, VkV_kVkโ€‹ ๋ถ„ํ•ด ๊ฒฐ๊ณผ

  • ์œ„ ์ฝ”๋“œ๋ฅผ ๋Œ๋ฆฌ๋ฉด ๊ฐ๊ฐ ๋‹ค์Œ๊ณผ ๊ฐ™์ด UkU_kUkโ€‹, ฮฃk\Sigma_kฮฃkโ€‹, VkV_kVkโ€‹๊ฐ€ ๋„์ถœ๋˜์–ด ๋‚˜์˜ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

    • ๋‹จ์–ด-ํ† ํ”ฝ ํ–‰๋ ฌ (UkU_kUkโ€‹)

    • Topic Importance (๐šบ)

    • ๋ฌธ์„œ-ํ† ํ”ฝ ํ–‰๋ ฌ (VkV_kVkโ€‹)

(2) Topic ๋ถ„ํ•ด ๊ฒฐ๊ณผ - truncated svd (k=2)

  • LSA๋ฅผ ํ†ตํ•ด ๋‰ด์Šค ๊ธฐ์‚ฌ์—์„œ 2๊ฐœ์˜ ์ฃผ์š” ํ† ํ”ฝ์„ ์ถ”์ถœํ–ˆ์Šต๋‹ˆ๋‹ค.
    • Topic 1: โ€œAI & ์ฆ์‹œ ํญ๋ฝ ๊ด€๋ จโ€

      => โ€œAI ๊ธฐ์—… ๋ฐ ์ฆ์‹œ ํญ๋ฝ ๊ด€๋ จ ๋‰ด์Šคโ€๋กœ ํ•ด์„ ๊ฐ€๋Šฅ

    • Topic 2: โ€œ์ฆ์‹œ ๋ฐ ๋‹ค์šฐ์กด์Šค ์ƒ์Šน/ํ•˜๋ฝ ๊ด€๋ จโ€

      => โ€œ์ฆ์‹œ ๋ณ€ํ™”์™€ ๋‹ค์šฐ์กด์Šค ๊ด€๋ จ ๋‰ด์Šคโ€๋กœ ํ•ด์„ ๊ฐ€๋Šฅ

📌 LSA 토픽 별 주요 단어

🔹 Topic 1
지수는     0.314311
인공지능    0.287213
빅테크     0.274496
ai      0.218333
기업들이    0.198402
엔비디아    0.198402
증시에서    0.170829
떨어졌다    0.167795
등장에     0.163324
딥시크     0.163324
Name: Topic 1, dtype: float64

🔹 Topic 2
지수는     0.374815
인공지능    0.248398
올랐다     0.207143
하지만     0.207143
않은      0.207143
다우존스    0.207143
기업이     0.207143
편입되지    0.207143
딥시크     0.196706
등장에     0.196706
Name: Topic 2, dtype: float64

(3) Topic distribution per document (sentence)

The following figure shows which topic each sentence of the news article is associated with.

It can be read as follows:

  • Doc 1, 5 → strongly associated with Topic 1 (AI & market-crash news)
  • Doc 5, 6 → strongly associated with Topic 2 (indices & Dow Jones news)
  • Doc 3 is barely associated with either topic (generic content, or a document with weak semantic ties under LSA)
  • Doc 2, 7 → closely tied to Topic 1 but opposite in sign to Topic 2 (i.e., likely AI-centric news)

🎓 (Advanced) Additional Analysis Scenarios

  • To dig deeper into "Step 3: Use the factors for analysis" defined above, you can run the analyses below.

1๏ธโƒฃ UkU_kUkโ€‹ : ํ† ํ”ฝ๋ณ„ ์ฃผ์š” ๋‹จ์–ด ์ฐพ๊ธฐ

top_n_words = 5
lsa_topic_words = {}
for i in range(num_topics):
    # Take the highest-|loading| words per topic from the word-topic matrix
    lsa_topic_words[f"Topic {i+1}"] = word_topic_matrix[f"Topic {i+1}"].abs().sort_values(ascending=False).head(top_n_words)
lsa_topic_words_df = pd.DataFrame(lsa_topic_words)
lsa_topic_words_df

2๏ธโƒฃ VkV_kVkโ€‹ : ๋ฌธ์„œ ๊ฐ„ ์œ ์‚ฌ๋„ ๊ณ„์‚ฐ

from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity between the document-topic vectors (rows of V_k)
doc_similarity = cosine_similarity(document_topic_matrix)
doc_similarity_df = pd.DataFrame(doc_similarity, index=[f"Doc {i+1}" for i in range(len(documents))], columns=[f"Doc {i+1}" for i in range(len(documents))])
doc_similarity_df

3๏ธโƒฃ VkV_kVkโ€‹ : ์˜๋ฏธ ๊ธฐ๋ฐ˜ ๊ฒ€์ƒ‰

query = ["AI 주식 시장"]
query_vector = vectorizer.transform(query)  # TF-IDF transform
query_lsa = svd.transform(query_vector)  # project into the LSA space
query_similarity = cosine_similarity(query_lsa, X_lsa)
top_n = 3
most_relevant_docs = query_similarity.argsort()[0][-top_n:][::-1]
for _doc in most_relevant_docs:
    print(documents[_doc])


4. Probabilistic Latent Semantic Analysis (pLSA)

4.1 Why pLSA Emerged

  • Latent Semantic Analysis (LSA) uses Singular Value Decomposition (SVD) to represent the relationships between documents and words in a low-dimensional vector space.

However, LSA has the following limitations.

  1. Limits of a linear model

    • LSA is a linear, matrix-factorization-based model that represents word-document relationships in a plain vector space.
      • Language data, however, often involves non-linear relationships, and simple matrix operations struggle to learn its semantic structure effectively.
  2. Difficulty of probabilistic interpretation

    • LSA's outputs are real-valued vectors that are hard to interpret probabilistically.
      • For example, it is hard to infer the probability that a document contains a particular topic, or that a word arises from a particular topic.
  3. No generalization to new documents

    • Because LSA is trained on a fixed term-document matrix, the model is hard to extend when an unseen document arrives.
      • Handling a new document requires rebuilding the matrix and re-running SVD.
  4. Memory and compute cost

    • SVD has O(n³) complexity, so the computation grows rapidly with the number of documents.
      • Applying LSA to large datasets is therefore often impractical.

Probabilistic Latent Semantic Analysis (pLSA) was introduced to address these limitations of LSA.


4.2 The pLSA Model

  • pLSA is a probabilistic model that assumes the words in a document are generated by latent topics.
    • That is, each word in a document is generated probabilistically from one of that document's topics.

🙌 Core assumptions of pLSA

  • A document consists of several topics.
  • Each topic has a probability distribution over words.
  • Each word has a probability of arising from a particular topic, and a document's word distribution is a mixture of these topic distributions.

The probabilistic structure of pLSA

  • pLSA defines the probability that word $w$ occurs in document $d$ as:

$P(w \mid d) = \sum_{z} P(z \mid d)\, P(w \mid z)$

  • $P(z \mid d)$: the probability that document $d$ contains topic $z$
  • $P(w \mid z)$: the probability that word $w$ occurs under topic $z$
  • $z$: a latent topic

In other words, the words of a document are generated from the combination of the document's topic distribution and each topic's word distribution; the short sketch below computes this mixture as a matrix product.
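A minimal sketch with made-up parameter values: since $P(w \mid d) = \sum_z P(z \mid d) P(w \mid z)$ sums over topics, it is exactly a matrix product of the two parameter tables.

import numpy as np

# Toy pLSA parameters (hypothetical numbers): 3 documents, 2 topics, 4 words
P_z_given_d = np.array([[0.7, 0.3],   # P(z|d): one row per document, rows sum to 1
                        [0.2, 0.8],
                        [0.5, 0.5]])
P_w_given_z = np.array([[0.4, 0.3, 0.2, 0.1],   # P(w|z): one row per topic, rows sum to 1
                        [0.1, 0.1, 0.3, 0.5]])

# P(w|d) = sum_z P(z|d) P(w|z), written as a matrix product
P_w_given_d = P_z_given_d @ P_w_given_z
print(P_w_given_d)              # 3 x 4 matrix of word probabilities per document
print(P_w_given_d.sum(axis=1))  # each row sums to 1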


4.3 Training pLSA: the EM Algorithm

  • pLSA is trained with the Expectation-Maximization (EM) algorithm.

🤔 What is the EM algorithm?

The EM algorithm alternates between an expectation (E) step, which computes the expected log-likelihood under the current parameter estimates, and a maximization (M) step, which finds the parameter estimates that maximize that expectation. The parameters computed in the M step are then used as the estimates in the next E step.

Source: Wikipedia, EM algorithm (link)

1) E-Step (Expectation Step)

  • The E-step uses the current model parameters ($P(z \mid d)$ and $P(w \mid z)$) to estimate the latent variable's posterior, $P(z \mid d, w)$.
    • That is, it computes the probability that topic $z$ is responsible for word $w$ in document $d$.
  • The formula is:

    $P(z \mid d, w) = \dfrac{P(z \mid d)\, P(w \mid z)}{\sum_{z'} P(z' \mid d)\, P(w \mid z')}$

    • $P(z \mid d)$: the probability that topic $z$ is chosen in document $d$
    • $P(w \mid z)$: the probability that topic $z$ generates word $w$
    • $\sum_{z'} P(z' \mid d)\, P(w \mid z')$: the sum over all possible topics $z'$ given document $d$ and word $w$
  • In this step, the current model parameters are used to compute, for each word, the probability that it was generated by each topic.

2) M-Step (Maximization Step)

  • The M-step uses the $P(z \mid d, w)$ computed in the E-step to update the model parameters $P(z \mid d)$ and $P(w \mid z)$.
    • Document-topic distribution $P(z \mid d)$ update:

      $P(z \mid d) = \dfrac{\sum_{w} n(w, d)\, P(z \mid d, w)}{\sum_{w, z} n(w, d)\, P(z \mid d, w)}$

    • Topic-word distribution $P(w \mid z)$ update:

      $P(w \mid z) = \dfrac{\sum_{d} n(w, d)\, P(z \mid d, w)}{\sum_{w', d} n(w', d)\, P(z \mid d, w')}$

  • Here $n(w, d)$ is the number of occurrences of word $w$ in document $d$.

    • $P(z \mid d)$: the new probability that topic $z$ is chosen in document $d$
    • $P(w \mid z)$: the new probability that topic $z$ generates word $w$
  • In this step, the document-topic and topic-word distributions are recomputed from the E-step's $P(z \mid d, w)$.

3) Iterate

  • The E-step and M-step are repeated until the model's log-likelihood converges.
    • The log-likelihood measures how well the model explains the given data; training stops when it no longer changes meaningfully.

💻 Training loop in summary

    1. Initialization:
      • Initialize $P(z \mid d)$ and $P(w \mid z)$ to random values.
    2. E-Step (Expectation):
      • Compute $P(z \mid d, w)$ from the current $P(z \mid d)$ and $P(w \mid z)$.
      • This estimates the latent variable using the full dataset.
    3. M-Step (Maximization):
      • Update $P(z \mid d)$ and $P(w \mid z)$ using the $P(z \mid d, w)$ from the E-step.
      • This maximizes the parameters using the full dataset.
    4. Iteration/Convergence:
      • Repeat the E-step and M-step until convergence.
      • Each iteration uses the whole dataset, so it plays a role similar to an epoch in deep learning. A minimal NumPy implementation of this loop is sketched below.

Image Source: https://www.geeksforgeeks.org/ml-expectation-maximization-algorithm/


4.4 Advantages of pLSA

  1. Probabilistic interpretation

    • Modeling the word-document relationship probabilistically makes the semantic relationship between topics and words explicit.
  2. Generalization to new documents

    • Whereas LSA struggles with unseen documents, pLSA can keep the learned topic-word distribution $P(w \mid z)$ fixed and estimate a new document's topic distribution $P(z \mid d)$.
  3. A structure better suited to language modeling

    • Word occurrences can be interpreted through a generative model, enabling more natural document classification and information retrieval.

4.5 Limitations of pLSA

  1. Overfitting

    • pLSA must learn a separate distribution $P(z \mid d)$ for every document $d$, so the number of parameters grows in proportion to the number of documents.
    • With little data this can lead to overfitting.
  2. Not a Bayesian model

    • pLSA is trained by plain maximum likelihood estimation (MLE), which makes it hard to impose a prior distribution.
    • Because pLSA directly optimizes the document-topic distribution $P(z \mid d)$, its generalization performance is limited.
  3. The arrival of LDA (Latent Dirichlet Allocation)

    • Latent Dirichlet Allocation (LDA) was proposed to resolve these limitations.
    • LDA extends pLSA by placing Dirichlet priors on the document-topic distributions, which regularizes them, mitigates overfitting, and improves generalization.

5. Latent Dirichlet Allocation (LDA)

5.1 Why the Field Moved from pLSA to LDA

Probabilistic Latent Semantic Analysis (pLSA) helped resolve LSA's problems (no probabilistic interpretation, no generalization, compute cost), but still had the following limitations.

  1. Parameter growth (overfitting)

    • pLSA must estimate a separate document-topic distribution $P(z \mid d)$ for every training document.
    • As the number of documents $D$ grows, the number of parameters grows in proportion, raising the risk of overfitting.
  2. Generalization (lack of a full generative model)

    • pLSA cannot directly infer a probability distribution for a new document.
    • Since $P(z \mid d)$ cannot be estimated immediately for an unseen document, handling one requires re-running training.
  3. No prior

    • pLSA is trained by pure maximum likelihood estimation (MLE), so the model lacks a proper prior distribution.
    • This can hurt generalization and lead to unstable results.

Latent Dirichlet Allocation (LDA) was proposed to solve these problems.

  • LDA is a Bayesian probabilistic model that places Dirichlet priors on the document-topic and topic-word distributions, improving the model's generalization.


5.2 The LDA Model

LDA is a generative model that assumes the words of a document were generated from several latent topics.

  • That is, it assumes a document is composed of several topics, and each topic has a probability distribution over words.
  1. Generative assumptions:

    • Each document is a mixture of one or more topics, and each word is generated from one of the document's topics.
    • Each topic has a probability distribution over words, and the model learns the probability of a word arising from each topic.
  2. Inference goals:

    • Estimate which topics each document in the given corpus contains.
    • Infer each document's topic distribution ($\theta_d$) and each topic's word distribution ($\phi_k$).
    • Estimate which topic each word within a document was assigned to.

5.3 LDA's Generative Process

LDA assumes documents are generated by the following procedure (a small NumPy simulation follows at the end of this subsection).

1) Setup

  • Topic distribution ($\theta_d$)
    • The topic distribution of document $d$ is sampled from the Dirichlet distribution $\mathrm{Dir}(\alpha)$.
  • Word distribution ($\phi_k$)
    • The word distribution of topic $k$ is sampled from the Dirichlet distribution $\mathrm{Dir}(\beta)$.

2) Generating a document

  1. For document $d$, sample the topic distribution $\theta_d$ from the Dirichlet distribution $\mathrm{Dir}(\alpha)$.
  2. For each word $w_{dn}$:
    • Choose a topic $z_{dn}$: sampled from the document's topic distribution $\theta_d$.
    • Choose a word $w_{dn}$: sampled from the chosen topic's word distribution $\phi_{z_{dn}}$.

3) Mathematical formulation

The full generative process of LDA is expressed probabilistically as:

$P(W, Z, \theta, \phi \mid \alpha, \beta) = \prod_{k=1}^{K} P(\phi_k \mid \beta)\, \prod_{d=1}^{D} P(\theta_d \mid \alpha) \prod_{n=1}^{N_d} P(z_{dn} \mid \theta_d)\, P(w_{dn} \mid \phi_{z_{dn}})$

  • $W$: the words
  • $Z$: the topic assignments of the words
  • $\theta$: the document-topic distributions
  • $\phi$: the topic-word distributions
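The generative story can be simulated directly; here is a small NumPy sketch (all sizes and hyperparameter values below are toy assumptions):

import numpy as np

rng = np.random.default_rng(0)
K, V, n_words = 3, 10, 15     # topics, vocabulary size, words per document
alpha, beta = 0.5, 0.1        # Dirichlet hyperparameters (toy values)

# phi_k ~ Dir(beta): one word distribution per topic
phi = rng.dirichlet(np.full(V, beta), size=K)

# theta_d ~ Dir(alpha): the document's topic mixture
theta = rng.dirichlet(np.full(K, alpha))

# For each word: sample its topic z_dn from theta, then the word from phi_{z_dn}
z = rng.choice(K, size=n_words, p=theta)
words = [rng.choice(V, p=phi[zn]) for zn in z]

print(theta.round(2))  # topic proportions of the generated document
print(list(z))         # topic assignment of each word
print(words)           # generated word ids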

5.4 Training and Inference in LDA

The central problem in LDA is estimating the latent variables $\theta, \phi, Z$ from the observed words $W$ alone.

To solve it, LDA uses the following inference methods.

1) Collapsed Gibbs Sampling

  • The most widely used inference technique for LDA: it samples the word-topic assignment variables $Z$ and estimates $\theta$ and $\phi$ from them.
  • Sampling is driven by the conditional probability below, repeated until convergence:

$P(z_{dn} = k \mid Z_{-dn}, W, \alpha, \beta) \propto \dfrac{n_{d,k}^{-dn} + \alpha}{\sum_{k'} \left(n_{d,k'}^{-dn} + \alpha\right)} \cdot \dfrac{n_{k,w}^{-dn} + \beta}{\sum_{w'} \left(n_{k,w'}^{-dn} + \beta\right)}$

  • $n_{d,k}^{-dn}$: the number of words in document $d$ assigned to topic $k$ (excluding the current word)
  • $n_{k,w}^{-dn}$: the number of times word $w$ is assigned to topic $k$ (excluding the current word)

2) Variational Inference

  • A faster approximate inference method than Gibbs sampling; it maximizes the ELBO (Evidence Lower Bound).
  • LDA is treated as a probabilistic graphical model, and a variational distribution $q(\theta, \phi, Z)$ is optimized (see the scikit-learn sketch below).
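In practice these inference routines are rarely implemented by hand. As one example, scikit-learn's LatentDirichletAllocation uses online variational inference; here is a minimal sketch reusing the `documents` list from the LSA example above:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# LDA models raw word counts, so use CountVectorizer rather than TF-IDF
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(documents)  # `documents` from the LSA example

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X_counts)        # per-document topic distribution (theta)

# Top words per topic, read off the (unnormalized) topic-word matrix (phi)
terms = count_vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]
    print(f"Topic {k+1}:", [terms[i] for i in top])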


6. Conclusion

Topic Modeling is a powerful technique for structuring unstructured text data and discovering meaning; it can effectively extract the hidden topics within documents.

  • Representative techniques such as LSA, pLSA, and LDA each have their own characteristics and limitations, so it is important to choose the right method for the purpose and the data.

    • LSA is a linear-algebraic matrix factorization that is fast and intuitive, but it is hard to interpret probabilistically and limited in handling new documents.
    • pLSA introduces a probabilistic model that captures document-topic distributions more precisely, but suffers from overfitting and poor generalization.
    • LDA models the document-topic relationship with Bayesian inference, generalizes well, and has grown even more powerful as many variants have been developed.
  • A key factor in Topic Modeling is choosing an appropriate number of topics; a poor choice can cause overfitting or semantically muddled topics.

    • Use methods such as perplexity, coherence scores, and human evaluation to decide the optimal number of topics (a small sketch follows).
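For instance, a rough perplexity-based scan over candidate topic counts might look like this (a sketch reusing `X_counts` from the LDA example above; lower perplexity is better, though coherence and human judgment should be weighed as well):

from sklearn.decomposition import LatentDirichletAllocation

# Compare perplexity for several candidate numbers of topics K
for k in (2, 3, 5, 8):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_counts)
    print(k, lda.perplexity(X_counts))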

In recent research, variant models building on LDA have continued to push performance further.

Thanks for reading 😎


