[๋จธ์‹ ๋Ÿฌ๋‹] ๊ฑฐ๋ฆฌโ€ข๊ตฐ์ง‘โ€ข์„œํฌํŠธ๋ฒกํ„ฐ ๊ธฐ๋ฐ˜ ์ด์ƒํƒ์ง€ ๊ธฐ๋ฒ•

Posted by Euisuk's Dev Log on November 21, 2021

[๋จธ์‹ ๋Ÿฌ๋‹] ๊ฑฐ๋ฆฌโ€ข๊ตฐ์ง‘โ€ข์„œํฌํŠธ๋ฒกํ„ฐ ๊ธฐ๋ฐ˜ ์ด์ƒํƒ์ง€ ๊ธฐ๋ฒ•

์›๋ณธ ๊ฒŒ์‹œ๊ธ€: https://velog.io/@euisuk-chung/๋จธ์‹ ๋Ÿฌ๋‹์ฐจ์›์ถ•์†Œ-์ด์ƒ์น˜-ํƒ์ง€-๊ธฐ๋ฒ•-๊ฑฐ๋ฆฌ๊ตฐ์ง‘์„œํฌํŠธ๋ฒกํ„ฐ-๊ธฐ๋ฐ˜-์ด์ƒ์น˜ํƒ์ง€

๋ณธ ํฌ์ŠคํŠธ๋Š” ๊ณ ๋ ค๋Œ€ํ•™๊ต ๊ฐ•ํ•„์„ฑ ๊ต์ˆ˜๋‹˜์˜ ๊ฐ•์˜๋ฅผ ์ˆ˜๊ฐ• ํ›„ ์ •๋ฆฌ๋ฅผ ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ž‘์„ฑ ๋ฐ ์„ค๋ช…์˜ ํŽธ์˜๋ฅผ ์œ„ํ•ด ์•„๋ž˜ ํฌ์ŠคํŠธ๋Š” ๋ฐ˜๋ง๋กœ ์ž‘์„ฑํ•œ ์  ์–‘ํ•ด๋ถ€ํƒ๋“œ๋ฆฝ๋‹ˆ๋‹ค.

๊ฑฐ๋ฆฌ ๊ธฐ๋ฐ˜ ์ด์ƒ์น˜ํƒ์ง€

K-Nearest Neighbor-based Anomaly Detection

  • ๊ฐ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ Anomaly Score๋ฅผ K๊ฐœ์˜ ๊ทผ์ ‘ ์ด์›ƒ๊นŒ์ง€์˜ ๊ฑฐ๋ฆฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ๊ณ„์‚ฐํ•œ๋‹ค. ์ด์›ƒ๊นŒ์ง€์˜ ๊ฑฐ๋ฆฌ๊ฐ€ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋น„ํ•ด ํฌ๋‹ค๋ฉด, ํ•ด๋‹น ๋ฐ์ดํ„ฐ๋Š” ์ด์ƒ์น˜์ผ ํ™•๋ฅ ์ด ๋†’๋‹ค.
  • ๊ธฐ์กด์—๋Š” ๊ฑฐ๋ฆฌ ๊ณ„์‚ฐ ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” ๋‹ค์Œ 3๊ฐ€์ง€ ๋ฐฉ๋ฒ•๋“ค์„ ์ด์šฉํ•ด์™”์ง€๋งŒ, ํ•ด๋‹น ๊ฑฐ๋ฆฌ ๊ณ„์‚ฐ ๋ฐฉ๋ฒ•์œผ๋ก  ์ฐพ์•„๋‚ด์ง€ ๋ชปํ•˜๋Š” ์ผ€์ด์Šค๋“ค์ด ์กด์žฌํ–ˆ๋‹ค.

    ๊ฑฐ๋ฆฌ ๊ณ„์‚ฐ ๋ฐฉ๋ฒ•

  • ์ด๋ฅผ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์ƒˆ๋กœ์ด convex-hull distance๋ฅผ ๋„์ž…ํ•ดํ•˜๊ณ , ์ด๋ฅผ ํ™œ์šฉํ•˜์—ฌ hybrid distance๋ฅผ ์ •์˜ํ•ด์คŒ์œผ๋กœ์„œ ํƒ์ง€ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค.

    ๊ฑฐ๋ฆฌ ๊ณ„์‚ฐ ๋ฐฉ๋ฒ•-adv

๊ตฐ์ง‘ ๊ธฐ๋ฐ˜ ์ด์ƒ์น˜ํƒ์ง€

K-Means Clustering-based Anomaly Detection

  • ๊ตฐ์ง‘ํ™”๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ตฐ์ง‘์— ํ• ๋‹นํ•˜์ง€ ์•Š๋Š” ๊ฐ์ฒด๋“ค์€ ์ด์ƒ์น˜๋กœ ์ทจ๊ธ‰ํ•œ๋‹ค.
  • Anomaly Score by K-Means Clustering-based Anomaly Detection(KMC)

    โ‘  ์ ˆ๋Œ€์  ๊ฑฐ๋ฆฌ : A anomaly score (a1) = B anomaly score (b1)

    โ‘ก ์ƒ๋Œ€์  ๊ฑฐ๋ฆฌ : A anomaly score (a1/a2) < B anomaly score (b1/b2)

KMC

PCA-based Novelty Detection Method

  • ๊ธฐ๋ณธ PCA์˜ ๋ชฉ์ ์€ ์‚ฌ์˜ ํ›„ ๋ถ„์‚ฐ์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. PCA๊ธฐ๋ฐ˜ ์ด์ƒ์น˜ ํƒ์ง€๋Š” ๊ธฐ์กด์— ์‚ฌ์˜ํ–ˆ๋˜ ๊ฒƒ์„ ๋‹ค์‹œ ๋ณต์›์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ธฐ์กด์˜ ๊ฐ’๊ณผ ์ฐจ์ด๊ฐ€ ๋งŽ์ด ๋‚˜๋ฉด ์ด์ƒ์น˜์ผ ํ™•๋ฅ ์ด ๋†’์€ ๊ฒƒ์œผ๋กœ ๊ฐ„์ฃผํ•œ๋‹ค.

    PCA1

    PCA2

Auto-Encoder

  • Auto Encoder๋Š” ๋ชจ๋ธ์˜ ์ถœ๋ ฅ ๊ฐ’๊ณผ ์ž…๋ ฅ ๊ฐ’์ด ๋น„์Šทํ•ด์ง€๋„๋ก ํ•™์Šต์ด ์ˆ˜ํ–‰๋˜๋ฉฐ, ์•„๋ž˜์™€ ๊ฐ™์ด 4๊ฐœ์˜ Layer๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋‹ค.

    AE

  1. Mapping Layer(Encoder)

    • Encoder์—์„œ๋Š” Input ๋ฐ์ดํ„ฐ๋ฅผ Bottleneck Layer ๋กœ ๋ณด๋‚ด Input ์ •๋ณด๋ฅผ ์ €์ฐจ์›์œผ๋กœ ์••์ถ•ํ•˜๋Š” ์—ญํ• ์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.
  2. Bottleneck Layer
  3. Demapping Layer(Decoder)

    • Decoder ์—์„œ๋Š” ์••์ถ•๋œ ํ˜•ํƒœ์˜ Input ์ •๋ณด๋ฅผ ์›๋ž˜์˜ Input ๋ฐ์ดํ„ฐ๋กœ ๋ณต์›ํ•œ๋‹ค.
  4. Output Layer
  • Auto-Encoder์˜ ๋ชฉํ‘œ๋Š” input ๋ฐ์ดํ„ฐ์™€ ๋™์ผํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์ด๋ฏ€๋กœ Output Layer๋ฅผ ํ†ตํ•ด ๋‚˜์˜ค๋Š” ์˜ˆ์ธก ๊ฐ’๊ณผ ์‹ค์ œ ๊ฐ’์˜ ์ฐจ์ด๋ฅผ Loss Function์œผ๋กœ ์ •์˜ํ•˜๊ณ  ํ•™์Šต์„ ์ˆ˜ํ–‰ํ•˜๋ฉฐ, ํ•ด๋‹น Loss Function์„ Reconstruction Error๋ผ๊ณ  ํ•œ๋‹ค.

    Reconstruction Error

  • ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ(์ •์ƒ ๋ฐ์ดํ„ฐ)์™€ ์ƒ์ดํ•œ ํŠน์ง•์„ ์ง€๋‹Œ ๋น„์ •์ƒ ๋ฐ์ดํ„ฐ์ธ Abnormal Data๋ฅผ ์ž…๋ ฅํ•˜๋ฉด Reconstruction Error๊ฐ€ ์ปค์ง€๊ฒŒ ๋˜์–ด, ์ด๋Ÿฌํ•œ Reconstruction Error๋ฅผ Anomaly Score ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

Support Vector-based Novelty Detection

  • ์ •๊ทœ ๋ฐ ๋น„์ •์ƒ ๊ด€์ธก์น˜๋ฅผ ๊ตฌ๋ถ„ํ•˜๋Š” ํ•จ์ˆ˜๋ฅผ ์ฐพ์•„ ์ง์ ‘ ์ •๊ทœ ์˜์—ญ์˜ ๊ฒฝ๊ณ„ ์ •์˜ํ•ด์ฃผ๋Š” ๋ฐฉ๋ฒ•๋ก 

    1-C-SVM_SVDD

1-C-SVM (One-class SVM)

  • ๋ฐ์ดํ„ฐ๋ฅผ ์ปค๋„์— ํ•ด๋‹นํ•˜๋Š” feature space ์ƒ์— ๋งคํ•‘ํ•˜์—ฌ ์ •์ƒ๋ฐ์ดํ„ฐ๋ฅผ ์›์ ์œผ๋กœ๋ถ€ํ„ฐ ์ตœ๋Œ€ํ•œ ๋ฉ€์–ด์ง€๋„๋ก ๊ตฌ๋ถ„ํ•˜๋Š” ๊ฒƒ์ด ์ฃผ๋ชฉ์ ์ด๋‹ค.

pf1

  • Formulation (Primal, Dual์„ ํ†ตํ•ด ์ฆ๋ช…๋  ์ˆ˜ ์žˆ๋‹ค)

    pf2

    pf3

  • Support Vector of 1-SVM

    Support Vector of 1-SVM

vl์˜ ์˜๋ฏธ(์—ญํ• )

  1. ์šฐ๋ฆฌ๊ฐ€ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋Š” support vector์˜ ์ตœ์†Œ ๊ฐœ์ˆ˜ vl๊ฐœ

  2. margin์„ ๋ฒ—์–ด๋‚˜์„œ ์žˆ์„ ์ˆ˜ ์žˆ๋Š” ์ตœ๋Œ€์˜ support vector ๊ฐœ์ˆ˜ vl๊ฐœ

v์˜ ์˜๋ฏธ(์—ญํ• )

  1. v๋Š” support vector์˜ ํ•˜ํ•œ ๋น„์œจ (lower bound for the fraction of support vector)

  2. v๋Š” ์—๋Ÿฌ์˜ ์ƒํ•œ ๋น„์œจ (upper bound for the fraction of errors)

  3. v๊ฐ€ ํด์ˆ˜๋ก ๋ณต์žกํ•œ decision boundary๊ฐ€ ์ƒ์„ฑ๋œ๋‹ค.

SVDD (Support Vector Data Description)

  • ๋ฐ์ดํ„ฐ๋ฅผ ์ปค๋„์— ํ•ด๋‹นํ•˜๋Š” feature space ์ƒ์— ๋งคํ•‘ํ•˜์—ฌ ์ •์ƒ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ์‹ธ ์•ˆ์„ ์ˆ˜ ์žˆ๋Š” ์ดˆ๊ตฌ(hypershere)๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒƒ์ด ์ฃผ๋ชฉ์ ์ด๋‹ค.
  • formulation

    SVDD

  • Support Vector of SVDD

    SVDD2

  • s์˜ ์—ญํ•  (RBF์ปค๋„ ์‚ฌ์šฉ ์‹œ) : ์ž‘์„์ˆ˜๋ก ๋ณต์žก๋„๊ฐ€ ์ปค์ง€๊ณ , ํด์ˆ˜๋ก ๊ตฌ์— ๊ฐ€๊นŒ์›Œ์ง„๋‹ค.

    SVDD3

Isolation Forest

  • ํŠน์ • ๊ฐ์ฒด๋ฅผ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์™€ ๋ถ„๋ฆฌ(๊ณ ๋ฆฝ)ํ•˜๋Š” ๊ณผ์ •์„ ํ†ตํ•ด ์ •์ƒ ๋ฐ์ดํ„ฐ์™€ ์ด์ƒ ๋ฐ์ดํ„ฐ๋ฅผ ํŒ๋ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.

Motivation

โ‘  ์ด์ƒ์น˜ ๋ฐ์ดํ„ฐ๋Š” ๊ณ ๋ฆฝ์‹œํ‚ค๊ธฐ๊ฐ€ ์ƒ๋Œ€์ ์œผ๋กœ ์‰ฌ์šธ ๊ฒƒ์ด๋‹ค. (= ์ ์€ split์ด ํ•„์š”ํ•  ๊ฒƒ์ด๋‹ค)

โ‘ก ์ •์ƒ ๋ฐ์ดํ„ฐ๋Š” ๊ณ ๋ฆฝ์‹œํ‚ค๊ธฐ๊ฐ€ ์ƒ๋Œ€์ ์œผ๋กœ ์–ด๋ ค์šธ ๊ฒƒ์ด๋‹ค. (= ๋งŽ์€ split์ด ํ•„์š”ํ•  ๊ฒƒ์ด๋‹ค)

IF1

  • ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„๋ฆฌ ์‹œ, ์ž„์˜์˜ ๋ณ€์ˆ˜, ์ž„์˜์˜ ๊ฐ’์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ถ„๋ฆฌ๋ฅผ ์ˆ˜ํ–‰ํ•œ๋‹ค. Isolation Tree์—์„œ ํŠน์ • ๊ฐœ์ฒด์˜ ๊ณ ๋ฆฝ๊นŒ์ง€ ํ•„์š”ํ•œ split์˜ ๊ฐœ์ˆ˜๋“ค์˜ ํ‰๊ท (average path to the terminal node)๋ฅผ Isolation Forest์˜ novelty score๋กœ ์ด์šฉํ•œ๋‹ค.

    IF2

์šฉ์–ด์ •๋ฆฌ

(1) Isolation Tree(iTree)

  • ์ „์ฒด dataset์—์„œ X dataset์„ samplingํ•˜๊ณ , ์ด๋ ‡๊ฒŒ ์„ ํƒ๋œ X๋ฅผ ์ง€์†์ ์œผ๋กœ randomํ•˜๊ฒŒ ์„ ํƒ๋œ variable๊ณผ value๋กœ ๋ถ„ํ• ํ•œ๋‹ค.
  • ์ด๋•Œ, X =1์ด๊ฑฐ๋‚˜, ๊ฐ terminal node์— ์žˆ๋Š” Instance์˜ ๊ฐ’์ด ๋ชจ๋‘ ๊ฐ™๊ฑฐ๋‚˜, hyper parameter๋กœ ์ •ํ•ด ๋‘” height limit์— ๊ฑธ๋ฆฌ๋ฉด ํ•™์Šต์„ ๋ฉˆ์ถ˜๋‹ค.

(2) Path Length

  • Isolation ๋  ๋•Œ๊นŒ์ง€ ์‚ฌ์šฉ๋œ split์˜ ๊ฐœ์ˆ˜๋ฅผ ์˜๋ฏธํ•˜๋ฉฐ, ์˜ค์ผ๋Ÿฌ ์ƒ์ˆ˜๋ฅผ ํ†ตํ•ด ์ •๊ทœํ™”๊ฐ€ ๋  ์ˆ˜ ์žˆ๋‹ค.

    IF3

(3) Novelty Score

  • ์ด์ƒ์น˜ ์ ์ˆ˜๋Š” ์•„๋ž˜ ์‹์œผ๋กœ ์•„๋ž˜์™€ ๊ฐ™์ด ์ •์˜๋˜๋ฉฐ, ํŠน์ • ๊ฐ์ฒด์˜ ํ‰๊ท  path length๊ฐ€ ํฐ ๊ฒฝ์šฐ(์ •์ƒ ๋ฐ์ดํ„ฐ์ผ ํ™•๋ฅ ์ด ํด ๊ฒฝ์šฐ), s(x,n)๋Š” 0์œผ๋กœ ๊ทผ์‚ฌํ•˜๊ณ , ๋ฐ˜๋Œ€๋กœ path length๊ฐ€ ์ž‘์€ ๊ฒฝ์šฐ(์ด์ƒ์น˜ ๋ฐ์ดํ„ฐ์ผ ํ™•๋ฅ ์ด ํด ๊ฒฝ์šฐ)๋Š” 1์— ๊ทผ์‚ฌํ•˜๊ฒŒ ๋œ๋‹ค.

    IF4

(์ฐธ๊ณ ) Extended Isolation Forest

  • Isolation Forest๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ์ˆ˜ํ‰์„  ๋˜๋Š” ์ˆ˜์ง์„ ๋งŒ์„ ์ด์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋ฅผ ๊ณ ๋ฆฝ์‹œํ‚จ๋‹ค. ํ•˜์ง€๋งŒ, ์ด์™€ ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” ๋ช‡ ๊ฐ€์ง€ ์œ ํ˜•์˜ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋Š” ์ด์ƒ์น˜ ๊ตฌ๊ฐ„์„ ์ •์ƒ์˜์—ญ์ด๋ผ๊ณ  ๋ณด๋Š” ๊ฒฝ์šฐ๊ฐ€ ์กด์žฌํ•œ๋‹ค.

    EIF

  • ์ด๋Ÿฌํ•œ ๋‹จ์ ์„ ๊ทน๋ณตํ•˜๊ธฐ ์œ„ํ•ด ์ถ•์— ์ˆ˜์ง์ด ์•„๋‹Œ, slope๋ฅผ ํ†ตํ•œ random cut ๋ฐฉ์‹์„ ์ œ์•ˆํ•œ ๊ฒƒ์ด Extended Isolation Forest์ด๋‹ค.

    EIF2

๊ธด ๊ธ€ ์ฝ์–ด์ฃผ์…”์„œ ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค ^~^



-->