[Tree] Tree-Based ML Algorithms

Posted by Euisuk's Dev Log on June 24, 2024

Original post: https://velog.io/@euisuk-chung/ํŠธ๋ฆฌ-ํŠธ๋ฆฌ-๊ธฐ๋ฐ˜-ML-์•Œ๊ณ ๋ฆฌ์ฆ˜

Tree-Based ML Algorithms

Tree-based machine learning (ML) algorithms build trees from the features of the data to perform regression or classification. Starting from the root node, the algorithm splits the data at each internal node and produces the final prediction at the leaf nodes. Because the resulting structure can be visualized, tree-based models are easy to interpret, and they perform well on a wide range of datasets.

์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์žฅ๋‹จ์ 

์žฅ์ 

  1. ํ•ด์„ ์šฉ์ด์„ฑ (Interpretability): ํŠธ๋ฆฌ ๊ตฌ์กฐ๋Š” ์‹œ๊ฐ์ ์œผ๋กœ ์ดํ•ดํ•˜๊ธฐ ์‰ฝ๊ณ , ๋ชจ๋ธ์˜ ๊ฒฐ์ • ๊ณผ์ •์„ ๋ช…ํ™•ํ•˜๊ฒŒ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด๋Š” ๋ชจ๋ธ์ด ์–ด๋–ป๊ฒŒ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜๋Š”์ง€ ์‰ฝ๊ฒŒ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ฃผ๋ฉฐ, ํŠนํžˆ ์˜์‚ฌ ๊ฒฐ์ •์ด ์ค‘์š”ํ•œ ๋ถ„์•ผ์—์„œ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.
  2. ๋น„์„ ํ˜• ๊ด€๊ณ„ ๋ชจ๋ธ๋ง (Non-linear Relationships): ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋น„์„ ํ˜•์ ์ธ ๋ฐ์ดํ„ฐ ๊ด€๊ณ„๋ฅผ ์ž˜ ํฌ์ฐฉํ•  ์ˆ˜ ์žˆ์–ด ๋ณต์žกํ•œ ๋ฐ์ดํ„ฐ ๊ตฌ์กฐ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  3. ํŠน์ง• ์„ ํƒ (Feature Selection): ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ค‘์š”ํ•œ ํŠน์ง•์„ ์ž๋™์œผ๋กœ ์„ ํƒํ•˜๊ณ  ๋ถˆํ•„์š”ํ•œ ํŠน์ง•์„ ๋ฌด์‹œํ•˜์—ฌ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.
  4. ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ ํƒ€์ž… ์ฒ˜๋ฆฌ (Handling Various Data Types): ์—ฐ์†ํ˜• ๋ฐ ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋‘ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์–ด ๋‹ค์–‘ํ•œ ์œ ํ˜•์˜ ๋ฐ์ดํ„ฐ์…‹์— ์œ ์—ฐํ•˜๊ฒŒ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  5. ๋กœ๋ฒ„์ŠคํŠธํ•จ (Robustness): ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์€ ์ด์ƒ์น˜(outliers)๋‚˜ ์žก์Œ(noise)์— ๊ฐ•ํ•œ ํŠน์„ฑ์„ ๋ณด์ž…๋‹ˆ๋‹ค. ์ด๋Š” ๋ชจ๋ธ์ด ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ ์กฐ๊ฑด์—์„œ๋„ ์•ˆ์ •์ ์œผ๋กœ ์ž‘๋™ํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

Weaknesses

  1. Risk of Overfitting: A single decision tree is prone to overfitting and can fit the training data too closely. Techniques such as pruning are needed to prevent this (see the sketch after this list).
  2. High Computational Cost: Training and prediction can be slow on large datasets. Boosting-family models (e.g., GBM, AdaBoost) are especially expensive because many trees must be trained sequentially.
  3. Data Sensitivity: Tree-based models can react sensitively to small changes in the data, which can reduce model stability.
  4. Complexity of Interactions: As the tree grows deeper, interactions between variables become more complex, which can make the model harder to interpret.
  5. Discontinuity of Decision Boundaries: Tree-based models produce discontinuous decision boundaries, so they may fail to reflect the continuity of the underlying data. This can hurt performance on continuous variables where the boundary matters.
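As a concrete illustration of the pruning point in item 1, scikit-learn's DecisionTreeClassifier exposes cost-complexity pruning through the ccp_alpha parameter. A minimal sketch (the alpha values here are illustrative, not tuned):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# ccp_alpha > 0 prunes branches whose complexity outweighs their benefit;
# the train/test gap typically narrows as alpha grows (values are illustrative).
for alpha in [0.0, 0.01, 0.02]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)
    print(f"alpha={alpha}: train={tree.score(X_train, y_train):.2f}, test={tree.score(X_test, y_test):.2f}")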

The Evolution of Tree-Based ML Algorithms

The history of tree-based regression/classification algorithms occupies an important place in the development of machine learning. Each algorithm was developed to address the limitations of its predecessors and to improve performance.

Below is the development order and key characteristics of the major tree-based algorithms. The dates are based on the publication dates of the papers I found, so they may differ from when the algorithms were actually developed!

1. Decision Tree

  • History: 1960s (paper link)
  • Characteristics: Performs prediction by repeatedly splitting the data according to its features. Includes algorithms such as ID3, C4.5, and CART.
  • Example: Uses entropy or Gini impurity to choose the split criterion.

2. Random Forest

  • History: 2001 (Leo Breiman) (paper link)
  • Characteristics: Ensembles multiple decision trees to improve predictive performance. Each tree is built from a bootstrap sample and random feature selection.
  • Example: Uses bagging to combine the trees' predictions by averaging or majority vote.

3. Gradient Boosting

  • History: 2001 (Jerome Friedman) (paper link)
  • Characteristics: Adds trees sequentially, each one reducing the errors of the previous ones. Widely used for regression, but applicable to classification as well.
  • Example: Trains each tree in the direction that minimizes the loss function.

4. AdaBoost (Adaptive Boosting)

  • History: 1996 (Yoav Freund, Robert Schapire) (paper link)
  • Characteristics: Gives higher weights to misclassified samples when training the next tree, producing a strong learner as the result.
  • Example: Adjusts the weight of each learner to build the final prediction model.

5. Extra Trees (Extremely Randomized Trees)

  • History: 2006 (Pierre Geurts, Damien Ernst, Louis Wehenkel) (paper link)
  • Characteristics: Similar to random forests, but split thresholds are chosen entirely at random. Typically needs more trees, but achieves strong performance.
  • Example: Randomly selects both features and split points when splitting.

6. XGBoost (Extreme Gradient Boosting)

  • History: 2016 (Tianqi Chen) (paper link)
  • Characteristics: A library that improves the efficiency and performance of gradient boosting, using optimizations such as regularization and parallel processing.
  • Example: Adds regularization terms to prevent overfitting.

7. LightGBM (Light Gradient Boosting Machine)

  • History: 2017 (Microsoft) (paper link)
  • Characteristics: Designed to handle large-scale and high-dimensional data efficiently, using leaf-wise tree growth.
  • Example: Uses Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).

8. CatBoost (Categorical Boosting)

  • History: 2018 (Yandex) (paper link)
  • Characteristics: Designed to handle categorical variables effectively, with algorithms optimized for ordered boosting and gradient computation.
  • Example: Encodes categorical variables using target statistics computed in a sequential order.

9. Histogram-based Gradient Boosting (HGBT)

  • History: Hard to pin down. (The efficiency of histogram-based algorithms was demonstrated during the development of LightGBM; HGBT is a refinement of that idea.)
  • Characteristics: Groups continuous variables into histogram buckets to speed up processing.
  • Example: Converts each feature into a histogram to find the optimal split quickly.

์ด ์™ธ์—๋„ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋ณ€ํ˜•๊ณผ ๊ฐœ์„ ์ด ๊ณ„์†ํ•ด์„œ ์ด๋ฃจ์–ด์ง€๊ณ  ์žˆ์œผ๋ฉฐ, ์ตœ์‹  ๊ธฐ์ˆ  ๋™ํ–ฅ์— ๋”ฐ๋ผ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ชฉํ‘œ๋กœ ํ•˜๋Š” ์ƒˆ๋กœ์šด ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๊ฐœ๋ฐœ๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜์— ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ ์ฃผ์š” ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์—ญ์‚ฌ์  ์ˆœ์„œ์™€ ํŠน์ง•์„ ์„ค๋ช…ํ•˜๊ณ , ๊ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ฐœ๋… ๋ฐ ์›๋ฆฌ, ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์ œ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

1. Decision Tree

Source: ์ง์žฅ์ธ ๊ณ ๋‹ˆ์˜ ๋ฐ์ดํ„ฐ ๋ถ„์„

Concept and Principles

A decision tree is a model that predicts by splitting the data based on its features. The tree starts at the root node, splits the data at each internal node according to feature values, and produces the final prediction at the leaf nodes. Its key concepts are:

  1. Split Criterion: The rule for deciding how to split the data; entropy and Gini impurity are the most common choices.
  2. Information Gain: Measures how well a given split separates the data. The tree is split in the direction that maximizes information gain.
  3. Pruning: Limits tree growth or removes unnecessary branches to prevent overfitting. A small sketch of the first two concepts follows this list.
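To make the split criteria concrete, here is a small NumPy sketch that computes Gini impurity, entropy, and the information gain of a candidate split. The helper functions are my own toy definitions, not part of any library:

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum_k p_k * log2(p_k)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Information gain = parent entropy minus the size-weighted child entropy.
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

labels = np.array([0, 0, 0, 1, 1, 1])
left, right = labels[:3], labels[3:]            # a perfect split
print(gini(labels))                             # 0.5
print(information_gain(labels, left, right))    # 1.0: child entropy drops to 0

A perfect split drives the children's entropy to zero, so the information gain equals the parent's entropy (here 1 bit).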

ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์ œ

์•„๋ž˜๋Š” scikit-learn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด๋ฅผ ๊ตฌํ˜„ํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the decision tree model
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Visualize the tree
plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()

Code Explanation

  1. Load and split the data: load_iris() loads the iris dataset, and train_test_split() divides it into training and test sets.
  2. Create and train the model: A DecisionTreeClassifier is created and trained with fit(). Here, Gini impurity is the split criterion and the maximum depth is set to 3.
  3. Predict and evaluate: The trained model predicts the test data and the accuracy is computed.
  4. Visualize the tree: plot_tree() visualizes the trained tree; each node shows the split feature and class information.

2. Random Forest

Source: GeeksforGeeks

Concept and Principles

A random forest ensembles multiple decision trees to improve predictive performance. Each tree is trained independently, and the final prediction combines the predictions of all trees. Its key concepts are listed below, with a from-scratch sketch after the list:

  1. Bagging: Bootstrap sampling creates multiple training sets, and an independent model is trained on each.
  2. Random feature selection: At each node, only a randomly chosen subset of all features is considered for the split.
  3. Ensemble: The final prediction averages the trees' outputs or takes a majority vote.

ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์ œ

์•„๋ž˜๋Š” scikit-learn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ๋ฅผ ๊ตฌํ˜„ํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the random forest model
clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Visualize feature importances
feature_importances = clf.feature_importances_
features = iris.feature_names
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance in Random Forest')
plt.show()

Code Explanation

  1. Load and split the data: load_iris() loads the iris dataset, and train_test_split() divides it into training and test sets.
  2. Create and train the model: A RandomForestClassifier is created and trained with fit(). Here, 100 trees are used, each with a maximum depth of 3.
  3. Predict and evaluate: The trained model predicts the test data and the accuracy is computed.
  4. Visualize feature importances: The feature_importances_ attribute gives each feature's importance, visualized as a bar plot with seaborn.

3. Gradient Boosting

Source: GeeksforGeeks

Concept and Principles

Gradient boosting is an ensemble learning method that adds trees sequentially to improve predictive performance. Each tree is trained to reduce the errors of the previous ones, and trees are added in the direction that minimizes the loss function. Its key concepts are listed below, with a toy residual-fitting sketch after the list:

  1. Sequential learning: Models are trained one after another; at each step a new model is added to reduce the residual errors of the previous ones.
  2. Loss Function: A function that evaluates the prediction error, typically mean squared error (MSE) for regression and log loss for classification.
  3. Gradient computation: At each step, the gradient of the loss is computed on the residuals and used to update the model.

ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์ œ

์•„๋ž˜๋Š” scikit-learn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ทธ๋ž˜๋””์–ธํŠธ ๋ถ€์ŠคํŒ…์„ ๊ตฌํ˜„ํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the gradient boosting model
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Visualize feature importances
feature_importances = clf.feature_importances_
features = iris.feature_names
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance in Gradient Boosting')
plt.show()

Code Explanation

  1. Load and split the data: load_iris() loads the iris dataset, and train_test_split() divides it into training and test sets.
  2. Create and train the model: A GradientBoostingClassifier is created and trained with fit(). Here, 100 trees, a learning rate of 0.1, and a maximum depth of 3 are used.
  3. Predict and evaluate: The trained model predicts the test data and the accuracy is computed.
  4. Visualize feature importances: The feature_importances_ attribute gives each feature's importance, visualized as a bar plot with seaborn.

4. AdaBoost (Adaptive Boosting)

Source: Towards AI

Concept and Principles

AdaBoost (Adaptive Boosting) is an ensemble learning method that combines weak learners into a strong learner. It compensates for the errors of earlier learners by reweighting the data for later learners, gradually improving performance. Its key concepts are listed below, with a compact weight-update sketch after the list:

  1. Weight updates: Data points that earlier learners misclassify receive higher weights, so the next learner focuses on reducing those errors.
  2. Sequential learning: Each learner is trained in sequence to reduce the errors of the previous one.
  3. Ensemble: The final prediction is a weighted vote over all learners.

ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์ œ

์•„๋ž˜๋Š” scikit-learn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ AdaBoost๋ฅผ ๊ตฌํ˜„ํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the AdaBoost model
# (the estimator= keyword replaces base_estimator= as of scikit-learn 1.2)
base_estimator = DecisionTreeClassifier(max_depth=1, random_state=42)
clf = AdaBoostClassifier(estimator=base_estimator, n_estimators=50, learning_rate=1.0, random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Visualize feature importances
feature_importances = clf.feature_importances_
features = iris.feature_names
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance in AdaBoost')
plt.show()

Code Explanation

  1. Load and split the data: load_iris() loads the iris dataset, and train_test_split() divides it into training and test sets.
  2. Create and train the model: An AdaBoostClassifier is created with a DecisionTreeClassifier as the weak learner and trained with fit(). Here, depth-1 decision stumps are the weak learners, and 50 of them are used.
  3. Predict and evaluate: The trained model predicts the test data and the accuracy is computed.
  4. Visualize feature importances: The feature_importances_ attribute gives each feature's importance, visualized as a bar plot with seaborn.

5. Extra Trees (Extremely Randomized Trees)

Source: stackexchange, difference-between-random-forest-and-extremely-randomized-trees

Concept and Principles

Extra Trees (Extremely Randomized Trees) is an ensemble method similar to random forests but with additional randomness in how trees are trained. Where a random forest searches for the optimal split at each node, Extra Trees uses randomly drawn split thresholds. This slightly increases the model's bias but reduces its variance, which helps prevent overfitting. Its key concepts are:

  1. Randomized splits: Each node is split using a randomly drawn split threshold.
  2. Ensembling: Multiple trees are trained independently and their predictions are combined (by default without bootstrap sampling, as noted below).
  3. Random feature selection: The features considered at each node are chosen at random.

๐Ÿ’ก Differences from Random Forest

I personally found the concepts of random forests and extra trees easy to confuse... so I organized them here!

  • The two algorithms are similar in that both are ensembles of decision trees, but there are a few important differences.
  • To understand the differences between Random Forest and Extra Trees clearly, pay close attention to how each algorithm splits nodes and samples the data. A short sketch of the corresponding scikit-learn defaults follows this comparison.

๐ŸŒณ๐ŸŒณ 1. Random Forest ๐ŸŒณ๐ŸŒณ

  • Bootstrap sampling (different):
    • Each tree is trained on a random sample drawn from the original data.
    • Sampling is done with replacement, so some data points may be selected multiple times.
  • Feature randomness (same):
    • When splitting each node, a randomly chosen subset of all features is used.
  • Node sample splits (different):
    • At each node, the selected feature subset is searched for the optimal split.
    • Among the selected features, the best split threshold is found.

๐ŸŒฒ๐ŸŒฒ 2. Extra Trees ๐ŸŒฒ๐ŸŒฒ

  • No bootstrap sampling (different):
    • Bootstrap sampling is not used.
    • Each tree is trained on the full dataset.
  • Feature randomness (same):
    • When splitting each node, a randomly chosen subset of all features is used.
  • Random node splits (different):
    • For each selected feature, a split threshold is drawn at random.
    • The best of these random splits is then chosen.
    • The extra randomness makes training faster and the trees more independent of one another.

ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์ œ

์•„๋ž˜๋Š” scikit-learn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Extra Trees๋ฅผ ๊ตฌํ˜„ํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the Extra Trees model
clf = ExtraTreesClassifier(n_estimators=100, max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Visualize feature importances
feature_importances = clf.feature_importances_
features = iris.feature_names
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance in Extra Trees')
plt.show()

Code Explanation

  1. Load and split the data: load_iris() loads the iris dataset, and train_test_split() divides it into training and test sets.
  2. Create and train the model: An ExtraTreesClassifier is created and trained with fit(). Here, 100 trees with a maximum depth of 3 are used.
  3. Predict and evaluate: The trained model predicts the test data and the accuracy is computed.
  4. Visualize feature importances: The feature_importances_ attribute gives each feature's importance, visualized as a bar plot with seaborn.

6. XGBoost (Extreme Gradient Boosting)

Source: ResearchGate, Flow chart of XGBoost

Concept and Principles

XGBoost (Extreme Gradient Boosting) is a library that extends the gradient boosting algorithm with substantial performance and efficiency improvements. It uses several optimization techniques, such as regularization, parallel processing, and early stopping, to outperform plain gradient boosting. Its key concepts are listed below, with an early-stopping sketch after the list:

  1. Regularization: L1 and L2 regularization control model complexity and prevent overfitting.
  2. Parallel Processing: Multi-threading speeds up training.
  3. Early Stopping: Training stops early when performance on validation data stops improving, preventing overfitting.
  4. Optimized split search: A histogram-based approach finds split points efficiently.
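A hedged sketch of concept 3, early stopping. Note that parameter placement has moved between XGBoost versions; recent releases take early_stopping_rounds in the constructor rather than in fit():

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)

# Training stops once validation mlogloss fails to improve for 10 consecutive rounds,
# so far fewer than the 500 requested trees may actually be kept.
clf = xgb.XGBClassifier(n_estimators=500, learning_rate=0.1, max_depth=3,
                        eval_metric='mlogloss', early_stopping_rounds=10,
                        random_state=42)
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
print(clf.best_iteration)  # the number of boosting rounds actually used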

ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์ œ

์•„๋ž˜๋Š” xgboost ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ XGBoost๋ฅผ ๊ตฌํ˜„ํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the XGBoost model
# (the use_label_encoder flag needed on some older 1.x releases has been
# removed in recent XGBoost versions, so it is dropped here)
clf = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                        random_state=42, eval_metric='mlogloss')
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Visualize feature importances
feature_importances = clf.feature_importances_
features = iris.feature_names
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance in XGBoost')
plt.show()

Code Explanation

  1. Load and split the data: load_iris() loads the iris dataset, and train_test_split() divides it into training and test sets.
  2. Create and train the model: An XGBClassifier is created and trained with fit(). Here, 100 trees, a learning rate of 0.1, and a maximum depth of 3 are used. The eval_metric is set in the constructor (mlogloss, since iris is multiclass); the use_label_encoder flag from older versions is no longer needed and has been dropped.
  3. Predict and evaluate: The trained model predicts the test data and the accuracy is computed.
  4. Visualize feature importances: The feature_importances_ attribute gives each feature's importance, visualized as a bar plot with seaborn.

7. LightGBM (Light Gradient Boosting Machine)

Source: https://www.linkedin.com/pulse/xgboost-vs-lightgbm-ashik-kumar/

Concept and Principles

LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework designed to handle large-scale and high-dimensional data efficiently. It uses leaf-wise tree growth and several other optimizations to speed up training considerably. Its key concepts are listed below, with a short note on the num_leaves parameter after the list:

  1. Leaf-wise Tree Growth: The tree grows by expanding whichever leaf reduces the loss the most, which is more effective than depth-wise growth.
  2. Gradient-based One-Side Sampling (GOSS): Keeps the samples with large gradients and subsamples the rest, shrinking the effective data size.
  3. Exclusive Feature Bundling (EFB): Bundles mutually exclusive features together to reduce the number of features.
  4. Histogram-based Decision Tree: Converts continuous features into histograms to find split points quickly.
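Because growth is leaf-wise, num_leaves rather than max_depth is LightGBM's primary complexity control; a small sketch with the scikit-learn style wrapper (parameter values illustrative):

import lightgbm as lgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Leaf-wise growth expands whichever leaf reduces the loss most, so capacity is
# bounded by num_leaves; max_depth=-1 (the LightGBM default) leaves depth unconstrained.
clf = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1,
                         num_leaves=31, max_depth=-1, random_state=42)
clf.fit(X, y)
print(f"training accuracy: {clf.score(X, y):.2f}")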

ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์ œ

์•„๋ž˜๋Š” lightgbm ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ LightGBM์„ ๊ตฌํ˜„ํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.

import lightgbm as lgb
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create LightGBM datasets
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# Set the LightGBM model parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'num_class': 3,
    'metric': 'multi_logloss',
    'learning_rate': 0.1,
    'max_depth': 3,
    'num_leaves': 31,
    'random_state': 42
}

# Train the model
# (recent LightGBM versions configure early stopping via callbacks rather than
# the old early_stopping_rounds keyword)
clf = lgb.train(params, train_data, num_boost_round=100, valid_sets=[test_data],
                callbacks=[lgb.early_stopping(10)])

# Predict and evaluate
y_pred = clf.predict(X_test, num_iteration=clf.best_iteration)
y_pred = [np.argmax(line) for line in y_pred]
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Visualize feature importances
feature_importances = clf.feature_importance()
features = iris.feature_names
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance in LightGBM')
plt.show()

Code Explanation

  1. Load and split the data: load_iris() loads the iris dataset, and train_test_split() divides it into training and test sets.
  2. Create LightGBM datasets: lgb.Dataset converts the arrays into LightGBM's dataset format.
  3. Set the model parameters: The params dictionary configures the model, here for multiclass classification.
  4. Train the model: lgb.train() trains the model. The early-stopping callback ends training once performance on the validation data stops improving for 10 rounds.
  5. Predict and evaluate: The trained model predicts the test data and the accuracy is computed.
  6. Visualize feature importances: The feature_importance() method gives each feature's importance, visualized as a bar plot with seaborn.

8. CatBoost (Categorical Boosting)

Source: https://www.mdpi.com/sensors/sensors-23-01811/article_deploy/html/images/sensors-23-01811-g003.png

Concept and Principles

CatBoost (Categorical Boosting) is a gradient boosting framework designed to handle categorical variables effectively. Its native categorical-variable handling and ordered boosting improve both predictive performance and training speed. Its key concepts are listed below, with a small categorical-features sketch after the list:

  1. Ordered Boosting: The data order is shuffled randomly, and each boosting step uses a fresh order to prevent overfitting.
  2. Categorical variable handling: Each categorical variable is transformed using target statistics (such as the mean target value).
  3. Symmetric Trees: A balanced tree structure speeds up prediction and optimizes memory use.
  4. Other optimizations: GPU support, early stopping, regularization, and more.
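A minimal sketch of concept 2: raw string columns can be passed through cat_features and CatBoost encodes them internally. The toy data and column names are my own, purely for illustration:

import pandas as pd
from catboost import CatBoostClassifier

# A toy frame with one raw string column; no manual one-hot encoding is needed.
df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green', 'blue', 'green'],
    'size': [1.0, 2.0, 1.5, 3.0, 2.5, 3.5],
})
y = [0, 1, 0, 1, 1, 1]

clf = CatBoostClassifier(iterations=50, depth=2, random_seed=42, verbose=0)
clf.fit(df, y, cat_features=['color'])   # CatBoost encodes 'color' internally
print(clf.predict(df).ravel())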

ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์ œ

์•„๋ž˜๋Š” catboost ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ CatBoost๋ฅผ ๊ตฌํ˜„ํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.

import catboost as cb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the CatBoost model
clf = cb.CatBoostClassifier(iterations=100, learning_rate=0.1, depth=3, random_seed=42, verbose=0)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Visualize feature importances
feature_importances = clf.get_feature_importance()
features = iris.feature_names
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance in CatBoost')
plt.show()

Code Explanation

  1. Load and split the data: load_iris() loads the iris dataset, and train_test_split() divides it into training and test sets.
  2. Create and train the model: A CatBoostClassifier is created and trained with fit(). Here, 100 iterations, a learning rate of 0.1, and a depth of 3 are used.
  3. Predict and evaluate: The trained model predicts the test data and the accuracy is computed.
  4. Visualize feature importances: The get_feature_importance() method gives each feature's importance, visualized as a bar plot with seaborn.

9. Histogram-based Gradient Boosting (HGBT)

Source: https://ars.els-cdn.com/content/image/1-s2.0-S0926580523000274-gr3.jpg

Concept and Principles

Histogram-based gradient boosting is an efficient method that groups continuous features into histogram buckets when searching for split points. This approach greatly improves computation speed, reduces memory usage, and is well suited to large datasets. Its key concepts are listed below, with a small binning sketch after the list:

  1. Histogram transformation: Continuous features are converted into buckets, either at fixed intervals or according to the data distribution.
  2. Fast split finding: The split point that minimizes the loss is found quickly over the buckets.
  3. Large-scale data handling: Combined with parallel processing, the histogram-based approach stays efficient even on very large datasets.
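A small NumPy sketch of concept 1, binning a continuous feature into a fixed number of quantile buckets the way histogram methods do (the max_bins value is illustrative; scikit-learn's implementation uses up to 255 bins by default):

import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=1000)

# Bin a continuous feature into 32 quantile buckets: candidate split points are
# now the 31 bin edges instead of up to 999 distinct sample values.
max_bins = 32
edges = np.quantile(feature, np.linspace(0, 1, max_bins + 1)[1:-1])
binned = np.searchsorted(edges, feature)   # integer bin index per sample
print(binned.min(), binned.max())          # 0 ... 31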

ํŒŒ์ด์ฌ ์ฝ”๋“œ ์˜ˆ์ œ

์•„๋ž˜๋Š” scikit-learn์˜ HistGradientBoostingClassifier๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํžˆ์Šคํ† ๊ทธ๋žจ ๊ธฐ๋ฐ˜ ๊ทธ๋ž˜๋””์–ธํŠธ ๋ถ€์ŠคํŒ…์„ ๊ตฌํ˜„ํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the histogram-based gradient boosting model
# (the experimental enable_hist_gradient_boosting import is no longer needed
# since scikit-learn 1.0, where the estimator became stable)
clf = HistGradientBoostingClassifier(max_iter=100, learning_rate=0.1, max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Visualize feature importances
# (HistGradientBoostingClassifier has no feature_importances_ attribute,
# so permutation importance on the test set is used instead)
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)
features = iris.feature_names
importance_df = pd.DataFrame({'Feature': features, 'Importance': result.importances_mean})

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance in Histogram-based Gradient Boosting')
plt.show()

Code Explanation

  1. Load and split the data: load_iris() loads the iris dataset, and train_test_split() divides it into training and test sets.
  2. Create and train the model: A HistGradientBoostingClassifier is created and trained with fit(). Here, 100 iterations, a learning rate of 0.1, and a maximum depth of 3 are used.
  3. Predict and evaluate: The trained model predicts the test data and the accuracy is computed.
  4. Visualize feature importances: Because this estimator does not expose feature_importances_, permutation_importance() estimates each feature's importance, visualized as a bar plot with seaborn.

10. Summary

The following code trains the tree-based algorithms covered above and compares their validation performance. It uses the iris data and stores each algorithm's performance in a DataFrame.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier, HistGradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb
import catboost as cb
import pandas as pd

# Load and split the data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Dictionary of algorithms
classifiers = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
    "AdaBoost": AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1, random_state=42), n_estimators=50, learning_rate=1.0, random_state=42),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, max_depth=3, random_state=42),
    "XGBoost": xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42, eval_metric='mlogloss'),
    "LightGBM": lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
    "CatBoost": cb.CatBoostClassifier(iterations=100, learning_rate=0.1, depth=3, random_seed=42, verbose=0),
    "HistGradientBoosting": HistGradientBoostingClassifier(max_iter=100, learning_rate=0.1, max_depth=3, random_state=42)
}

# List for collecting the results
results = []

# Evaluate each algorithm
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    train_accuracy = clf.score(X_train, y_train)
    test_accuracy = clf.score(X_test, y_test)
    cross_val_scores = cross_val_score(clf, X, y, cv=5)
    cross_val_mean = cross_val_scores.mean()
    cross_val_std = cross_val_scores.std()

    results.append({
        "Algorithm": name,
        "Train Accuracy": train_accuracy,
        "Test Accuracy": test_accuracy,
        "Cross-Validation Mean": cross_val_mean,
        "Cross-Validation Std": cross_val_std
    })

# Convert the list into a DataFrame
results_df = pd.DataFrame(results)

# Print the results
print(results_df)

์˜ค๋Š˜์€ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ML ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ „๋ฐ˜์— ๋Œ€ํ•ด์„œ ์ •๋ฆฌํ•˜๊ณ , ๊ธฐ๋ณธ์ ์ธ ์ฝ”๋“œ๊นŒ์ง€ ์ ์–ด๋ณด๋Š” ์‹œ๊ฐ„์„ ๊ฐ€์ ธ๋ดค๋Š”๋ฐ์š” ๐Ÿ“Š ์ €๋„ ๊ฐœ์ธ์ ์œผ๋กœ ํ•œ๋ฒˆ์ฏค ์ •๋ฆฌํ•˜๊ณ  ๊ฐ€๊ณ  ์‹ถ์—ˆ๋˜ ๊ฐœ๋…์ด๋ผ ์œ ์ตํ–ˆ๋˜ ์‹œ๊ฐ„์ด์—ˆ๋˜ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค!!

๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค ๐Ÿ™


