[NLP] 4. Natural Language Embeddings

Posted by Euisuk's Dev Log on July 29, 2024


Original post: https://velog.io/@euisuk-chung/NLP-4.-Natural-Language-Embeddings

Natural Language Embeddings

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ(NLP)์—์„œ ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๋‹ค๋ฃจ๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ์ž์—ฐ์–ด ์ž„๋ฒ ๋”ฉ ๊ธฐ๋ฒ•(Natural Language Embedding)์ด ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ์ด๋ฒˆ ํฌ์ŠคํŠธ์—์„œ๋Š” ์ด๋Ÿฌํ•œ ๊ธฐ๋ฒ•๋“ค์„ ์ž์„ธํžˆ ์„ค๋ช…ํ•˜๊ณ , ๊ฐ ๊ธฐ๋ฒ•์˜ ์˜ˆ์‹œ๋ฅผ ํ†ตํ•ด ์ดํ•ด๋ฅผ ๋•๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค.

🔎 What Are Text Representation Learning and Natural Language Embedding?

  • Text representation learning is the process of converting unstructured text data into structured data so that computers can understand and process it.
  • Natural language embedding covers the various techniques used in this process to convert text data into vector form.
  • The key requirement is to preserve semantic similarity between words and contextual information while converting the data.

  1. Converting Unstructured Data into Structured Data

  • Unstructured data refers to non-structured data such as raw text; converting it into structured data is the first step of NLP.
  • The goal of this step is to turn the data into a form a computer can understand.
  • Unstructured data can be made structured by representing it as vectors or matrices.
  • Main steps:

    • ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘: ์„œ์ , ์ธํ„ฐ๋„ท ๋“ฑ ๋‹ค์–‘ํ•œ ์†Œ์Šค๋กœ๋ถ€ํ„ฐ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•ฉ๋‹ˆ๋‹ค.
    • ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ: ์ˆ˜์ง‘ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ NLP ๊ธฐ๋ฒ•์„ ์ด์šฉํ•ด ์ „์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.
      • ๋Œ€์†Œ๋ฌธ์ž ํ†ต์ผ: โ€œTheyโ€์™€ โ€œtheyโ€๋ฅผ ๋ชจ๋‘ ์†Œ๋ฌธ์ž๋กœ ํ†ต์ผ.
      • ๋ถˆํ•„์š”ํ•œ ๋ฌธ์žฅ ๊ธฐํ˜ธ ์ œ๊ฑฐ: ์˜ˆ๋ฅผ ๋“ค์–ด, โ€œ!โ€๋‚˜ โ€œ?โ€ ๊ฐ™์€ ๊ธฐํ˜ธ๋ฅผ ์ œ๊ฑฐ.
      • ์ˆซ์ž ์ œ๊ฑฐ: ํ•„์š” ์—†๋Š” ์ˆซ์ž๋Š” ์ œ๊ฑฐ.
      • ๋ถˆ์šฉ์–ด ์ œ๊ฑฐ: ๋ฌธ๋ฒ• ์š”์†Œ ๋“ฑ ์ค‘์š”ํ•œ ์˜๋ฏธ๋ฅผ ๋‹ด๊ณ  ์žˆ์ง€ ์•Š์€ ๋‹จ์–ด ์ œ๊ฑฐ.
    • ์ •ํ˜• ๋ฐ์ดํ„ฐ๋กœ ๋ณ€ํ™˜: ๋น„์ •ํ˜• ๋ฐ์ดํ„ฐ๋ฅผ ๋ฒกํ„ฐ ๋˜๋Š” ๋งคํŠธ๋ฆญ์Šค ํ˜•ํƒœ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
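
A minimal sketch of these preprocessing steps in Python, using only the standard library (the tiny STOPWORDS set is a hypothetical stand-in for a real stopword list such as NLTK's):

    import re

    # Hypothetical minimal stopword list; in practice use a full one (e.g., NLTK's)
    STOPWORDS = {"the", "is", "a", "and", "they"}

    def preprocess(text):
        text = text.lower()                  # case normalization
        text = re.sub(r"[^\w\s]", "", text)  # punctuation removal
        text = re.sub(r"\d+", "", text)      # number removal
        # stopword removal
        return [t for t in text.split() if t not in STOPWORDS]

    print(preprocess("They love Machine Learning, and it is fun in 2024!"))
    # ['love', 'machine', 'learning', 'it', 'fun', 'in']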
  2. Text Vectorization Techniques

This section introduces various techniques for converting text into vectors.

2.1 Bag-of-Words (BoW) Model

Bag-of-Words is the simplest way to represent text as vectors: it ignores word order and vectorizes a document based on the frequency of each word within it.

Combining the BoW vectors of several documents yields a Document-Term Matrix (DTM) or, transposed, a Term-Document Matrix (TDM). This extends the representation so that different documents can be compared with one another.

  • ๊ฐœ๋…: ๋‹จ์–ด์˜ ๋“ฑ์žฅ ์ˆœ์„œ๋ฅผ ๊ณ ๋ คํ•˜์ง€ ์•Š๊ณ , ๋ฌธ์„œ ๋‚ด ๋‹จ์–ด์˜ ๋นˆ๋„๋ฅผ ์„ธ์–ด ๋ฒกํ„ฐ๋ฅผ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • BoW ์ƒ์„ฑ ๋ฐฉ์‹:
1
2
    (1) ๊ฐ ๋‹จ์–ด์— ๊ณ ์œ ํ•œ ์ •์ˆ˜ ์ธ๋ฑ์Šค ๋ถ€์—ฌ.
    (2) ๊ฐ ์ธ๋ฑ์Šค์˜ ์œ„์น˜์— ๋‹จ์–ด ํ† ํฐ์˜ ๋“ฑ์žฅ ํšŸ์ˆ˜๋ฅผ ๊ธฐ๋กํ•œ ๋ฒกํ„ฐ ์ƒ์„ฑ. 
  • ํŠน์ง•: ๋‹จ์–ด์˜ ์ˆœ์„œ๋ฅผ ๋ฌด์‹œํ•˜๋ฉฐ ๋นˆ๋„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฒกํ„ฐ๋ฅผ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  • Example:

    Document 1: "I love machine learning."
    Document 2: "Machine learning is fun."
    
    BoW vectors:
    Document 1: [1, 1, 1, 1, 0, 0]  # in the order "I", "love", "machine", "learning", "is", "fun"
    Document 2: [0, 0, 1, 1, 1, 1]
  • Example code:
from konlpy.tag import Okt

okt = Okt()

def build_bag_of_words(document):
  # Strip periods, then tokenize into morphemes
  document = document.replace('.', '')
  tokenized_document = okt.morphs(document)

  word_to_index = {}
  bow = []

  for word in tokenized_document:  
    if word not in word_to_index.keys():
      word_to_index[word] = len(word_to_index)  
      # First occurrence: append a default count of 1
      bow.insert(len(word_to_index) - 1, 1)
    else:
      # Index of a word that has appeared before
      index = word_to_index.get(word)
      # Increment the count at that word's index
      bow[index] = bow[index] + 1

  return word_to_index, bow
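
For English text, and to illustrate the DTM mentioned above, scikit-learn's CountVectorizer builds the same frequency vectors for several documents at once; a minimal sketch:

    from sklearn.feature_extraction.text import CountVectorizer

    documents = ["I love machine learning.", "Machine learning is fun."]

    # Each row of the resulting matrix is one document's BoW vector (together, a DTM).
    # Note: the default token pattern drops single-character tokens such as "I".
    vectorizer = CountVectorizer()
    dtm = vectorizer.fit_transform(documents)

    print(vectorizer.get_feature_names_out())  # vocabulary (columns of the DTM)
    print(dtm.toarray())                       # document-term matrix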

2.2 Word Weighting: TF-IDF

To measure how important a word is within a document, TF-IDF combines term frequency (TF) and document frequency (DF).

  • ๊ฐœ๋…: ์ž์ฃผ ๋“ฑ์žฅํ•˜์ง€ ์•Š๋Š” ๋‹จ์–ด๊ฐ€ ํŠน์ • ๋ฌธ์„œ์—์„œ ๋งŽ์ด ๋“ฑ์žฅํ•  ๊ฒฝ์šฐ ํ•ด๋‹น ๋‹จ์–ด์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๋†’์ž…๋‹ˆ๋‹ค.
  • TF-IDF ์ƒ์„ฑ ๋ฐฉ์‹:

    • TF (Term Frequency): tf(d,t), ํŠน์ • ๋ฌธ์„œ d์—์„œ์˜ ํŠน์ • ๋‹จ์–ด t์˜ ๋“ฑ์žฅ ํšŸ์ˆ˜.
    • DF (Document Frequency): df(t), ํŠน์ • ๋‹จ์–ด t๊ฐ€ ๋“ฑ์žฅํ•œ ๋ฌธ์„œ์˜ ์ˆ˜.
    • IDF (Inverse Document Frequency): idf(t), ์ „์ฒด ๋ฌธ์„œ์—์„œ ๋‹จ์–ด๊ฐ€ ๋“ฑ์žฅํ•œ ๋นˆ๋„์˜ ์—ญ์ˆ˜

      • ์ˆ˜์‹: log(n/(1+df(t)))log({n}/{(1+df(t))})log(n/(1+df(t)))
    • TF-IDF: TF์™€ IDF๋ฅผ ๊ณฑํ•˜์—ฌ ๊ณ„์‚ฐ.
  • ํŠน์ง•: ๋‹จ์ˆœ ๋นˆ๋„์˜ ๋‹จ์ ์„ ๋ณด์™„ํ•˜์—ฌ ๋‹จ์–ด์˜ ์ค‘์š”์„ฑ์„ ๋ฐ˜์˜ํ•ฉ๋‹ˆ๋‹ค. (๊ฐ€์ค‘์น˜ ๊ฐœ๋… ์ถ”๊ฐ€)
  • Example:

    • Document 1: "I love machine learning."
    • Document 2: "Machine learning is fun."
    • For the word 'machine':
      • n = 2 (total number of documents)
      • df(t) = 2 (number of documents containing 'machine')
    • The IDF is therefore (using a base-10 log):
      • IDF(machine) = log(2 / (1 + 2)) = log(2/3) = log(0.6667) ≈ -0.176
    • TF:
      • TF of 'machine' in Document 1: 1
      • TF of 'machine' in Document 2: 1
    • TF-IDF:
      • Document 1: TF-IDF(machine) = 1 × (-0.176) = -0.176
      • Document 2: TF-IDF(machine) = 1 × (-0.176) = -0.176
  • Example code:
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    documents = ["I love machine learning.", "Machine learning is fun."]
    
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
    
    print(tfidf_matrix.toarray())
    print(tfidf_vectorizer.get_feature_names_out())
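
As a sanity check on the worked numbers above, a few lines of Python reproduce this post's idf(t) = log10(n / (1 + df(t))) variant (note that scikit-learn's own IDF formula uses a natural log with different smoothing, so its output will not match these values exactly):

    import math

    n = 2            # total number of documents
    df_machine = 2   # documents containing 'machine'

    idf = math.log10(n / (1 + df_machine))
    print(round(idf, 3))      # -0.176
    print(round(1 * idf, 3))  # TF-IDF of 'machine' in each document: -0.176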
    

2.3 N-Grams Model

๋‹จ์–ด์˜ ์ˆœ์„œ๋ฅผ ๊ณ ๋ คํ•˜๊ธฐ ์œ„ํ•ด N-Grams ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋‹จ์–ด์˜ ์—ฐ์†๋œ N๊ฐœ์˜ ๋‹จ์–ด ์กฐํ•ฉ์„ ํŠน์ง•์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค.

  • ๊ฐœ๋…: N๊ฐœ์˜ ์—ฐ์†๋œ ๋‹จ์–ด๋ฅผ ํ•˜๋‚˜์˜ ํ† ํฐ์œผ๋กœ ์ทจ๊ธ‰ํ•˜์—ฌ ๋ฒกํ„ฐ๋ฅผ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.
    • ์œ ๋‹ˆ๊ทธ๋žจ(Unigram): ๋‹จ์ผ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋œ N-๊ทธ๋žจ. (N=1)
    • ๋ฐ”์ด๊ทธ๋žจ(Bigram): ๋‘ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋œ N-๊ทธ๋žจ. (N=2)
    • ํŠธ๋ผ์ด๊ทธ๋žจ(Trigram): ์„ธ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋œ N-๊ทธ๋žจ. (N=3)
    • 4-๊ทธ๋žจ(4-gram): ๋„ค ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋œ N-๊ทธ๋žจ. (N=4)
  • N-Gram ์ƒ์„ฑ ๋ฐฉ์‹:
1
2
3
    (1) ํ…์ŠคํŠธ๋ฅผ N๊ฐœ์˜ ๋‹จ์–ด ๋ฌถ์Œ์œผ๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค.
    (2) ๊ฐ N-Gram ๋ฌถ์Œ์„ ๋ฒกํ„ฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
    
  • Characteristics: reflects word order, so some context is taken into account.
  • Example:
    Sentence: "I love machine learning"
    
    Unigram: ["I", "love", "machine", "learning"]
    Bigram: ["I love", "love machine", "machine learning"]
    Trigram: ["I love machine", "love machine learning"]
    4-gram: ["I love machine learning"]
  • Example code:
    from sklearn.feature_extraction.text import CountVectorizer
    
    bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
    trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))
    
    documents = ["I love machine learning."]
    
    bigram_matrix = bigram_vectorizer.fit_transform(documents)
    trigram_matrix = trigram_vectorizer.fit_transform(documents)
    
    print("Bigram:\\n", bigram_matrix.toarray())
    print("Bigram Features:\\n", bigram_vectorizer.get_feature_names_out())
    
    print("Trigram:\\n", trigram_matrix.toarray())
    print("Trigram Features:\\n", trigram_vectorizer.get_feature_names_out())
    
  3. Distributed Representation

Distributed representation expresses each word as a dense vector in a way that preserves similarity between words: words with similar meanings are placed close together in the vector space, which lets a model learn relationships between words.

3.1 NNLM (Neural Network Language Model)

  • ๊ฐœ๋…: NNLM์€ ๋‹จ์–ด์˜ ์‹œํ€€์Šค๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ ๋‹ค์Œ์— ์˜ฌ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ์ž…๋ ฅ๋œ ๋‹จ์–ด ์‹œํ€€์Šค๋ฅผ ๊ณ ์ • ๊ธธ์ด์˜ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•˜๊ณ , ์ด๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ๋‹ค์Œ ๋‹จ์–ด์˜ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค. (RNN/LSTM ๊ณ„์—ด ๋ชจ๋ธ)

  • How an NNLM is trained:

    (1) Take a word sequence as input and convert it into fixed-length vectors through a word embedding layer.
    (2) Pass the result through hidden layers and output a probability distribution over the next word.
    (3) Train the model, which optimizes the word embedding vectors in the process.
    
  • Characteristics: an NNLM takes word order into account, so it reflects context, and it learns relationships between words through the next-word prediction task.
  • Example:

    Sentence: "I love machine learning"
    >>> Input: ["I", "love", "machine"]
    >>> Output: "learning"
  • Example code:
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense
    
    # Sample data
    sentences = ["I love machine learning", "Machine learning is fun", "I love deep learning"]
    
    # Build the tokenizer
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentences)
    total_words = len(tokenizer.word_index) + 1
    
    # Build training sequences (every prefix of each sentence)
    input_sequences = []
    for line in sentences:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    
    # Pad sequences to a common length
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
    
    # Split into features and labels
    X, y = input_sequences[:,:-1], input_sequences[:,-1]
    
    # One-hot encode the labels
    y = tf.keras.utils.to_categorical(y, num_classes=total_words)
    
    # Build the model
    model = Sequential()
    model.add(Embedding(total_words, 10, input_length=max_sequence_len-1))
    model.add(LSTM(100))
    model.add(Dense(total_words, activation='softmax'))
    
    # Compile the model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    # Train the model
    model.fit(X, y, epochs=100, verbose=1)
    
    # Predict the next words
    seed_text = "I love"
    next_words = 3
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted = model.predict(token_list, verbose=0)
        output_word = tokenizer.index_word[np.argmax(predicted)]
        seed_text += " " + output_word
    print(seed_text)
    

3.2 Word2Vec

  • ๊ฐœ๋…: ๋‹จ์–ด๋ฅผ ๋ฒกํ„ฐ ๊ณต๊ฐ„์— ๋งคํ•‘ํ•˜์—ฌ ๋‹จ์–ด ๊ฐ„์˜ ์œ ์‚ฌ์„ฑ์„ ๋ณด์กดํ•˜๋Š” ์ž„๋ฒ ๋”ฉ ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค.

    • ์ฃผ์š” ๋ชจ๋ธ๋กœ๋Š” CBOW(Continuous Bag of Words)์™€ Skip-gram์ด ์žˆ์Šต๋‹ˆ๋‹ค.

    • Word2Vec์„ ํ•™์Šตํ•  ๋•Œ๋Š” CBOW์™€ Skip-gram ์ค‘ ํ•˜๋‚˜๋ฅผ ์„ ํƒ์ ์œผ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋‘ ๋ฐฉ๋ฒ•์€ ์„œ๋กœ ๋‹ค๋ฅธ ์ ‘๊ทผ ๋ฐฉ์‹์„ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉฐ, ๊ฐ๊ฐ์˜ ์žฅ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

    • CBOW (Continuous Bag of Words):

      • ์ฃผ๋ณ€ ๋‹จ์–ด๋“ค๋กœ ์ค‘์‹ฌ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
      • ์ผ๋ฐ˜์ ์œผ๋กœ ์ž‘์€ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋” ์ž˜ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.
      • ๋นˆ๋ฒˆํ•œ ๋‹จ์–ด์— ๋Œ€ํ•ด ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค.
    • Skip-gram:

      • ์ค‘์‹ฌ ๋‹จ์–ด๋กœ ์ฃผ๋ณ€ ๋‹จ์–ด๋“ค์„ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
      • ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค.
      • ๋“œ๋ฌธ ๋‹จ์–ด์— ๋Œ€ํ•ด ๋” ๋‚˜์€ ํ‘œํ˜„์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
  • (Method 1) CBOW: ์ฃผ๋ณ€ ๋‹จ์–ด๋“ค๋กœ ์ค‘์‹ฌ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธก.

  • (Method 2) Skip-gram: ์ค‘์‹ฌ ๋‹จ์–ด๋กœ ์ฃผ๋ณ€ ๋‹จ์–ด๋“ค์„ ์˜ˆ์ธก.

  • ํŠน์ง•: ์œ ์‚ฌํ•œ ์˜๋ฏธ์˜ ๋‹จ์–ด๋“ค์ด ๊ฐ€๊นŒ์šด ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„๋ฉ๋‹ˆ๋‹ค. ์‹ ๊ฒฝ๋ง์„ ํ†ตํ•ด ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
  • ์˜ˆ์‹œ:
    Sentence: "The cat sat on the mat"
    Center word: "sat"
    Context words: ["The", "cat", ___, "on", "the", "mat"]
    
    ## CBOW: context words → predict the center word
    >> Input: ["The", "cat", "on", "the", "mat"]
    >> Output: "sat"
    
    ## Skip-gram: center word → predict the context words
    >> Input: "sat"
    >> Output: ["The", "cat", "on", "the", "mat"]
  • Example code:
    from gensim.models import Word2Vec
    
    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "sat", "on", "the", "couch"]]
    
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)  # CBOW
    
    vector = model.wv['cat']
    print(vector)
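
Switching to Skip-gram is a one-parameter change (sg=1), and the trained vectors can be queried for nearest neighbors; a brief sketch continuing from the code above:

    # Skip-gram variant of the same training call
    sg_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
    
    # Nearest neighbors by cosine similarity (results on a toy corpus are not meaningful)
    print(sg_model.wv.most_similar('cat', topn=3))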
    

3.3 GloVe (Global Vectors for Word Representation)

  • ๊ฐœ๋…: ๊ธ€๋กœ๋ธŒ(Global Vectors for Word Representation, GloVe)๋Š” ์นด์šดํŠธ ๊ธฐ๋ฐ˜๊ณผ ์˜ˆ์ธก ๊ธฐ๋ฐ˜์„ ๋ชจ๋‘ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•๋ก ์œผ๋กœ 2014๋…„์— ๋ฏธ๊ตญ ์Šคํƒ ํฌ๋“œ๋Œ€ํ•™์—์„œ ๊ฐœ๋ฐœํ•œ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ ๋ฐฉ๋ฒ•๋ก ์œผ๋กœ, ๋‹จ์–ด์˜ ๋™์‹œ ๋“ฑ์žฅ ํ–‰๋ ฌ์„ ํ™œ์šฉํ•˜์—ฌ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
    • LSA ๊ธฐ๋ฒ•์€ DTM(TDM)์ด๋‚˜ TF-IDF ํ–‰๋ ฌ๊ณผ ๊ฐ™์ด ๊ฐ ๋ฌธ์„œ์—์„œ์˜ ๊ฐ ๋‹จ์–ด์˜ ๋นˆ๋„์ˆ˜๋ฅผ ์นด์šดํŠธ ํ•œ ํ–‰๋ ฌ์ด๋ผ๋Š” ์ „์ฒด์ ์ธ ํ†ต๊ณ„ ์ •๋ณด๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„ ์ฐจ์›์„ ์ถ•์†Œ(Truncated SVD)ํ•˜์—ฌ ์ž ์žฌ๋œ ์˜๋ฏธ๋ฅผ ๋Œ์–ด๋‚ด๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค.
      • (๋‹จ์ ) LSA๋Š” ์นด์šดํŠธ ๊ธฐ๋ฐ˜์œผ๋กœ ์ฝ”ํผ์Šค์˜ ์ „์ฒด์ ์ธ ํ†ต๊ณ„ ์ •๋ณด๋ฅผ ๊ณ ๋ คํ•˜๊ธฐ๋Š” ํ•˜์ง€๋งŒ, King : Man = Queen : ? (Woman) ๋˜๋Š” Korea : Seoul = Japan : ? (Tokyo)์™€ ๊ฐ™์€ ๋‹จ์–ด ์˜๋ฏธ์˜ ์œ ์ถ” ์ž‘์—…(Analogy task)์—๋Š” ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง‘๋‹ˆ๋‹ค.
    • Word2Vec ๊ธฐ๋ฒ•์€ CBOW์™€ Skip-gram์ด๋ผ๋Š” ๋”ฅ๋Ÿฌ๋‹ ํ•™์Šต๋ฐฉ๋ฒ•์„ ํ™œ์šฉํ•˜์—ฌ ์‹ค์ œ๊ฐ’๊ณผ ์˜ˆ์ธก๊ฐ’์— ๋Œ€ํ•œ ์˜ค์ฐจ๋ฅผ ์†์‹ค ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด ์ค„์—ฌ๋‚˜๊ฐ€๋ฉฐ ํ•™์Šตํ•˜๋ฉฐ ์ž ์žฌ๋œ ์˜๋ฏธ๋ฅผ ๋Œ์–ด๋‚ด๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค.
      • (๋‹จ์ ) Word2Vec๋Š” ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ๊ฐ€ ๋‹จ์–ด ๊ฐ„์˜ ์ƒ๊ด€์„ฑ์„ ๊ณ ๋ คํ•˜์—ฌ ํ•™์Šตํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์œ ์ถ” ์ž‘์—…์€ LSA๋ณด๋‹ค ๋›ฐ์–ด๋‚˜์ง€๋งŒ, ์œˆ๋„์šฐ ํฌ๊ธฐ ๋‚ด์—์„œ๋งŒ ์ฃผ๋ณ€ ๋‹จ์–ด๋ฅผ ๊ณ ๋ คํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ฝ”ํผ์Šค์˜ ์ „์ฒด์ ์ธ ํ†ต๊ณ„ ์ •๋ณด๋ฅผ ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค.
    • Glove๋Š” ์ด๋Ÿฌํ•œ ๊ฐ๊ฐ์˜ ๋ชจ๋ธ์˜ ์žฅ์ ์„ ํ•ฉ์ณ์„œ ๊ฐ๊ฐ์˜ ๋‹จ์ ์„ ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ–ˆ์Šต๋‹ˆ๋‹ค.

  • How GloVe is built:
    1. Build a co-occurrence matrix.
      • Definition: a matrix recording how often each pair of words appears together within a given context window across the whole corpus.
      • Structure: a square matrix whose rows and columns are both the words of the vocabulary.
      • Values: each cell X_ij holds the number of times word j appears in the context of word i.
    2. Learn word vectors from the co-occurrence frequencies.
      • Definition: the co-occurrence frequency is the number of times a given word pair appears together within the defined context window.
      • Structure: the context is usually taken to be the words within a fixed distance (the window size) of the center word.
      • Values: these frequencies provide the key statistical information about relationships between words.

  • Characteristics: reflects global statistics by using corpus-wide co-occurrence counts.
  • Example:
Sentences:
1. "The cat sits on the mat"
2. "The dog chases the cat"
3. "A cat and a dog play"

>> Context window size: 2 (up to 2 words on either side of the center word are considered)

>> Co-occurrence matrix:
       the  cat  sits  on   mat  dog  chases  a    and  play
the    0    3    1     1    1    1    1       0    0    0
cat    3    0    1     1    1    2    1       2    1    1
sits   1    1    0     1    1    0    0       0    0    0
on     1    1    1     0    1    0    0       0    0    0
mat    1    1    1     1    0    0    0       0    0    0
dog    1    2    0     0    0    0    1       2    1    1
chases 1    1    0     0    0    1    0       0    0    0
a      0    2    0     0    0    2    0       0    1    1
and    0    1    0     0    0    1    0       1    0    1
play   0    1    0     0    0    1    0       1    1    0
  • Example code:
    • To use GloVe, you can download Stanford's pre-trained vectors.
    • They are available at this link: https://nlp.stanford.edu/projects/glove/

  • The code below loads the pre-trained GloVe embeddings and looks up the embedding of the word 'cat'.
from urllib.request import urlretrieve
import zipfile

# Download the pre-trained GloVe file
urlretrieve("http://nlp.stanford.edu/data/glove.6B.zip", filename="glove.6B.zip")

# Unzip the pre-trained GloVe file
with zipfile.ZipFile('glove.6B.zip', 'r') as zip_ref:
    zip_ref.extractall()

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Path to the GloVe file
glove_input_file = './glove.6B.100d.txt'

# Path for the converted Word2Vec-format file
word2vec_output_file = 'glove.6B.100d.word2vec.txt'

# Convert the GloVe format to Word2Vec format
# (In gensim >= 4.0 this conversion can be skipped: load the GloVe file directly with
#  KeyedVectors.load_word2vec_format(glove_input_file, binary=False, no_header=True))
glove2word2vec(glove_input_file, word2vec_output_file)

# Load the converted Word2Vec-format file
model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)

# Print the vector of a specific word
vector = model['cat']
print(vector)
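
Pre-trained vectors cover the common case, but step (1) of GloVe, building the co-occurrence matrix, is easy to sketch by hand; a minimal, illustrative version with a symmetric window (exact counts depend on tokenization and windowing details):

    from collections import defaultdict

    def cooccurrence_matrix(sentences, window=2):
        # counts[(w1, w2)] = how often w2 appears within `window` words of w1
        counts = defaultdict(int)
        for sent in sentences:
            tokens = sent.lower().split()
            for i, center in enumerate(tokens):
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if i != j:
                        counts[(center, tokens[j])] += 1
        return counts

    sents = ["The cat sits on the mat", "The dog chases the cat", "A cat and a dog play"]
    counts = cooccurrence_matrix(sents)
    print(counts[('cat', 'the')])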
    

3.4 FastText

  • ๊ฐœ๋…: FastText๋Š” Facebook(ํ˜„ Meta)์—์„œ ๊ฐœ๋ฐœํ•œ ์ž„๋ฐฐ๋”ฉ ๊ธฐ๋ฒ•์œผ๋กœ, ๋‹จ์–ด ๋‚ด๋ถ€์˜ ๋ฌธ์ž n-gram์„ ํ™œ์šฉํ•˜์—ฌ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
    • Word2Vec์™€ FastText์™€์˜ ๊ฐ€์žฅ ํฐ ์ฐจ์ด์ ์ด๋ผ๋ฉด Word2Vec๋Š” ๋‹จ์–ด๋ฅผ ๊ธฐ๋ณธ ๋‹จ์œ„๋กœ ์ƒ๊ฐํ•œ๋‹ค๋ฉด, FastText๋Š” ํ•˜๋‚˜์˜ ๋‹จ์–ด ์•ˆ์—๋„ ๋‹จ์œ„๋“ค์ด ์žˆ๋‹ค๊ณ  ๊ฐ„์ฃผํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๋‹จ์–ด์˜ ํ˜•ํƒœ์†Œ ์ •๋ณด๋ฅผ ๋ฐ˜์˜ํ•ฉ๋‹ˆ๋‹ค.

  • How FastText is built:

    ## Subword Model
    (1) Decompose each word into character n-grams.
    (2) Sum the n-gram vectors to produce the word vector.
    
  • Characteristics: character n-grams capture the internal structure of words, so even out-of-vocabulary (OOV) words can be handled effectively.
  • Example:

    Word: "apple"
    Character n-grams (n = 3, with boundary markers < and >): ["<ap", "app", "ppl", "ple", "le>"], plus the whole-word token "<apple>"
    Vector learning: the word vector is produced by summing the vectors of its character n-grams.
    
  • Example code:
    from gensim.models import FastText
    
    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["machine", "learning", "is", "fun"]]
    
    model = FastText(sentences, vector_size=100, window=5, min_count=1)
    vector = model.wv['cat']
    print(vector)
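
Because vectors are composed from character n-grams, gensim's FastText can also return a vector for a word it never saw during training; a quick check continuing from the code above:

    # 'catlike' is not in the training sentences, but its n-grams overlap with seen
    # words, so an (approximate) vector can still be composed for it.
    print('catlike' in model.wv.key_to_index)  # False: not in the vocabulary
    print(model.wv['catlike'][:5])             # ...yet a vector is still returned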
    
  4. Document Embedding

These techniques represent an entire document as a single vector, vectorizing the document's content as a whole.

4.1 Averaging Word Embeddings

  • ๊ฐœ๋…: ๊ฐ€์žฅ ์›์ดˆ์ ์ธ ๋ฐฉ๋ฒ•์œผ๋กœ, ๋ฌธ์„œ ๋‚ด์˜ ๋ชจ๋“  ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ์˜ ํ‰๊ท ์„ ๊ตฌํ•˜์—ฌ ๋ฌธ์„œ ๋ฒกํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.
  • ์ƒ์„ฑ ๋ฐฉ์‹:
1
2
3
    (1) ๋ฌธ์„œ ๋‚ด์˜ ๋ชจ๋“  ๋‹จ์–ด๋ฅผ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
    (2) ๋ณ€ํ™˜๋œ ๋ชจ๋“  ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ์˜ ํ‰๊ท ์„ ๊ตฌํ•ฉ๋‹ˆ๋‹ค.
    
  • ํŠน์ง•: ๊ฐ„๋‹จํ•˜๊ณ  ๊ณ„์‚ฐ์ด ๋น ๋ฅด์ง€๋งŒ, ํ‰๊ท ์„ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ฌธ์„œ ๋‚ด ๋‹จ์–ด ์ˆœ์„œ๋‚˜ ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค.
  • Example code:
    import numpy as np
    from gensim.models import Word2Vec
    
    sentences = [["I", "love", "machine", "learning"], ["Machine", "learning", "is", "fun"]]
    word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    
    def average_word_embeddings(document):
        # Look up each word's vector (skipping OOV words) and average them
        vectors = [word2vec_model.wv[word] for word in document if word in word2vec_model.wv]
        return np.mean(vectors, axis=0)
    
    document = ["I", "love", "machine", "learning"]
    document_vector = average_word_embeddings(document)
    
    print(document_vector)
    

4.2 PV-DM (Distributed Memory Model of Paragraph Vectors)

  • ๊ฐœ๋…: ๋ฌธ์„œ์™€ ๋‹จ์–ด๋ฅผ ๋™์‹œ์— ์ž„๋ฒ ๋”ฉํ•˜์—ฌ ๋ฌธ์„œ ๋ฒกํ„ฐ๋ฅผ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ๋ฌธ๋งฅ ๋‹จ์–ด์™€ ๋ฌธ์„œ ๋ฒกํ„ฐ๋ฅผ ์กฐํ•ฉํ•˜์—ฌ ์ค‘์‹ฌ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.

  • ์ƒ์„ฑ ๋ฐฉ์‹:
1
2
    (1) ๊ฐ ๋ฌธ์„œ์— ๊ณ ์œ ํ•œ ID๋ฅผ ๋ถ€์—ฌํ•ฉ๋‹ˆ๋‹ค.
    (2) ๋ฌธ๋งฅ ๋‹จ์–ด์™€ ๋ฌธ์„œ ID๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„ ์ค‘์‹ฌ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
  • ํŠน์ง•: ๋ฌธ์„œ์˜ ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ๋ฐ˜์˜ํ•˜์—ฌ ๋ฌธ์„œ ๋ฒกํ„ฐ๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.
  • Example code:
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    
    documents = [TaggedDocument(words=["I", "love", "machine", "learning"], tags=['doc1']),
                 TaggedDocument(words=["Machine", "learning", "is", "fun"], tags=['doc2'])]
    
    model = Doc2Vec(documents, vector_size=100, window=2, min_count=1, workers=4, dm=1)  # PV-DM
    vector = model.dv['doc1']
    
    print(vector)
    

4.3 PV-DBOW (Distributed Bag of Words Model of Paragraph Vectors)

  • ๊ฐœ๋…: ๋‹จ์–ด ์ˆœ์„œ๋ฅผ ๊ณ ๋ คํ•˜์ง€ ์•Š๊ณ  ๋ฌธ์„œ ๋ฒกํ„ฐ๋งŒ์œผ๋กœ ์ค‘์‹ฌ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ๋‹จ์–ด์˜ ์ˆœ์„œ๋‚˜ ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ๋ฐ˜์˜ํ•˜์ง€ ์•Š์œผ๋ฉฐ, Skip-gram ๋ชจ๋ธ๊ณผ ์œ ์‚ฌํ•ฉ๋‹ˆ๋‹ค.

  • ์ƒ์„ฑ ๋ฐฉ์‹:
1
2
3
    (1) ๊ฐ ๋ฌธ์„œ์— ๊ณ ์œ ํ•œ ID๋ฅผ ๋ถ€์—ฌํ•ฉ๋‹ˆ๋‹ค.
    (2) ๋ฌธ์„œ ID๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„ ์ค‘์‹ฌ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค.
    
  • ํŠน์ง•: ๋‹จ์–ด์˜ ์ˆœ์„œ๋ฅผ ๊ณ ๋ คํ•˜์ง€ ์•Š์œผ๋ฉฐ, ๋ฌธ์„œ ๋ฒกํ„ฐ๋งŒ์œผ๋กœ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•ฉ๋‹ˆ๋‹ค. ๊ณ„์‚ฐ์ด ๊ฐ„๋‹จํ•˜๊ณ  ๋น ๋ฆ…๋‹ˆ๋‹ค.
  • Example code:
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    
    documents = [TaggedDocument(words=["I", "love", "machine", "learning"], tags=['doc1']),
                 TaggedDocument(words=["Machine", "learning", "is", "fun"], tags=['doc2'])]
    
    model = Doc2Vec(documents, vector_size=100, window=2, min_count=1, workers=4, dm=0)  # PV-DBOW
    vector = model.dv['doc1']
    
    print(vector)
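
For a new document that was not in the training set, gensim's Doc2Vec can infer a vector after training; a short usage sketch continuing from the code above:

    # Infer a vector for an unseen document (works for both PV-DM and PV-DBOW models)
    new_doc = ["deep", "learning", "is", "fun"]
    inferred = model.infer_vector(new_doc)
    print(inferred[:5])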
    
    

This post covered a range of techniques and algorithms for learning natural language embeddings.

  • For converting unstructured data into structured data, we looked at vectorization through the Bag-of-Words (BoW), TF-IDF, and N-grams models, which reflect word frequency and order.
  • Through Word2Vec, GloVe, and FastText, we looked at distributed representation techniques that capture word meaning and similarity.
  • Finally, we covered document embedding techniques that represent an entire document as a single vector.

Thank you for reading this long post 💌


