[Pandas] Comparing Compression Methods for Pandas Pickle (.pkl) Files

Posted by Euisuk's Dev Log on March 4, 2025


Original post: https://velog.io/@euisuk-chung/Pandas-pickle파일의-압축-방식-비교-및-활용-가이드

One way to save and load data in Pandas is to use .pkl (pickle) files.

A pickle file stores a Python object in serialized form: the object is converted to a binary format, which makes it efficient to store.

(Note) "Data serialization" means converting data into a format that can be written to a storage medium or transmitted over a network.

Source: https://devopedia.org/data-serialization

  • Pickle files preserve the structure of a DataFrame as-is while still saving and loading relatively quickly.

What Is a Pandas Pickle File?

With Pandas' to_pickle() and read_pickle() methods, you can easily save a DataFrame and load it back.

  • Unlike text-based formats such as CSV or JSON, a pickle file keeps the data's structure and types intact and can be written more quickly and with less effort.

    • In general, pickle files are fast to write and read, and they can store the various objects Pandas supports with little effort.
    • However, pickle is a Python-only format, so its compatibility with other languages is poor.
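
To make the "structure and types stay intact" point concrete, here is a minimal sketch that round-trips a small DataFrame through both pickle and CSV. The column names, values, and file names are made up for illustration; the only Pandas calls used are to_pickle(), read_pickle(), to_csv(), and read_csv().

```python
import os
import tempfile

import pandas as pd

# Toy DataFrame (hypothetical data) with dtypes that plain-text formats do not record.
df = pd.DataFrame({
    "when": pd.to_datetime(["2025-03-01", "2025-03-02", "2025-03-03"]),
    "grade": pd.Categorical(["a", "b", "a"]),
    "value": [1.5, 2.5, 3.5],
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "data.pkl")  # hypothetical file names
    csv_path = os.path.join(tmp, "data.csv")

    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)

    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle round trip restores the original dtypes.
print(from_pickle.dtypes)
# Without extra arguments, read_csv does not restore the datetime and categorical dtypes.
print(from_csv.dtypes)
```

With CSV you would have to pass options such as parse_dates or dtype to get the original types back, which is the extra work pickle avoids.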

💡 (Note) Save and load speeds for pickle and Parquet files can vary with the characteristics of the data, the file size, the system environment, and so on.

  • Some sources report that Parquet outperforms pickle, but this can change depending on the structure and content of the data.
  • Pickle is strong at preserving complex data structures, whereas Parquet may be a better fit for processing large volumes of data efficiently.
    • In fact, a talk at Pycon2023 reported that Parquet was more efficient for large volumes of data. (link)

When you work with large volumes of data, pickle files can grow large, and you can apply compression to shrink them.

  • Pandas supports several compression methods, each with its own characteristics and usage.

This post summarizes the characteristics and differences of the main compression methods available for Pandas pickle files (gzip, bz2, zip, xz) and looks at which one to choose in which situation.

📚 The content below is based on the Pandas documentation:

  • https://pandas.pydata.org/docs/user_guide/io.html#pickling (section "Compressed pickle files")
  • The documentation covers the topic only briefly, in the following sentences:

    "read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can read and write compressed pickle files. The compression types of gzip, bz2, xz, zstd are supported for reading and writing. The zip file format only supports reading and must contain only one data file to be read."

  • This post sets out to dig into it in more detail.
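
One detail worth seeing in code: the compression argument of to_pickle() and read_pickle() defaults to "infer", so Pandas picks the method from the file extension. The sketch below assumes a recent Pandas version and uses made-up file names. It also reads the first bytes of each file to confirm that a compressed container was really written.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# File extension -> leading "magic" bytes of the corresponding container format.
magic = {".gz": b"\x1f\x8b", ".bz2": b"BZh", ".zip": b"PK", ".xz": b"\xfd7zXZ"}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for suffix, signature in magic.items():
        path = os.path.join(tmp, "data.pkl" + suffix)  # hypothetical file name
        df.to_pickle(path)  # no compression argument: inferred from the suffix
        with open(path, "rb") as fh:
            is_compressed = fh.read(len(signature)) == signature
        round_trips = pd.read_pickle(path).equals(df)  # also inferred on read
        results[suffix] = (is_compressed, round_trips)

print(results)
```

The examples later in this post pass compression explicitly, which is equally valid and arguably easier to read.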

  1. Comparison of the Main Compression Methods

The characteristics of the main compression methods (gzip, bz2, zip, xz) can be summarized as follows.

Method | Compression ratio | Compression speed | Decompression speed | Use case | Notes
gzip | Medium | Fast | Fast | General-purpose data compression | Supported out of the box on many systems
bz2 | High | Slow | Slow | When a high compression ratio is needed | High CPU usage
zip | Medium to low | Fast | Fast | Bundling various files together | Widely used on Windows
xz | Very high | Very slow | Slow | Long-term archival backups | Decompression can be slow

Shall we take a closer look at each method? 👀


  2. Characteristics and Usage Examples of Each Compression Method

(1) gzip

  • Characteristics:

    • Compression and decompression are fast, and it is also widely used for network transfer.
  • Suitable when:

    • You need fast data saving and loading
    • Speed matters more than file size

Example code:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Save the pickle with gzip compression applied
df.to_pickle('data.pkl.gz', compression='gzip')

# Read the saved pickle file back
df_loaded = pd.read_pickle('data.pkl.gz', compression='gzip')
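
If the default gzip setting is not the trade-off you want, recent Pandas versions also accept a dict for compression: the "method" key selects the algorithm and the remaining keys are forwarded to the underlying compressor (gzip.GzipFile for gzip). The following is a rough sketch under that assumption, with made-up data and file names; check the to_pickle() documentation of your installed version before relying on it.

```python
import os
import tempfile

import pandas as pd

# Repetitive toy data (hypothetical) so that the compression level has something to work with.
n = 100_000
df = pd.DataFrame({"A": [i % 100 for i in range(n)], "B": ["text"] * n})

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    for level in (1, 9):
        path = os.path.join(tmp, f"data_level{level}.pkl.gz")  # hypothetical file name
        # Keys other than "method" are passed on to gzip.GzipFile.
        df.to_pickle(path, compression={"method": "gzip", "compresslevel": level})
        assert pd.read_pickle(path).equals(df)  # both levels are lossless
        sizes[level] = os.path.getsize(path)

# File size in bytes for compresslevel 1 (fastest) and 9 (smallest).
print(sizes)
```

Lower levels favor speed and higher levels favor size, so this gives a second knob besides switching algorithms.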

(2) bz2

  • Characteristics:

    • It offers a high compression ratio, but compression and decompression are slow.
  • Suitable when:

    • Saving storage space is important
    • Slow compression and decompression are acceptable

Example code:

# Apply bz2 compression (reuses df from the gzip example)
df.to_pickle('data.pkl.bz2', compression='bz2')

# Read the compressed pickle back
df_loaded = pd.read_pickle('data.pkl.bz2', compression='bz2')

(3) zip

  • Characteristics:

    • As a format, zip can bundle multiple files into one archive and is widely supported across operating systems. Note, though, that according to the Pandas documentation quoted earlier, a zip archive read as a pickle must contain only one data file.
  • Suitable when:

    • You need to store several files in a single archive
    • Compatibility is important (especially on Windows)

Example code:

# Apply zip compression (reuses df from the gzip example)
df.to_pickle('data.pkl.zip', compression='zip')

# Read the compressed pickle back
df_loaded = pd.read_pickle('data.pkl.zip', compression='zip')
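
The "only one data file" restriction from the documentation is easy to trip over, so here is a small sketch of it. The archives are built with the standard zipfile module and the member names are invented for the example. In the Pandas versions this sketch assumes, the multi-file case fails with a ValueError; treat the exact exception type and message as an assumption.

```python
import os
import pickle
import tempfile
import zipfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
payload = pickle.dumps(df)

with tempfile.TemporaryDirectory() as tmp:
    single = os.path.join(tmp, "single.zip")  # hypothetical file names
    multi = os.path.join(tmp, "multi.zip")

    # Archive with exactly one member: read_pickle can load it.
    with zipfile.ZipFile(single, "w") as zf:
        zf.writestr("data.pkl", payload)
    loaded = pd.read_pickle(single, compression="zip")

    # Archive with two members: Pandas cannot tell which one to load.
    with zipfile.ZipFile(multi, "w") as zf:
        zf.writestr("first.pkl", payload)
        zf.writestr("second.pkl", payload)
    try:
        pd.read_pickle(multi, compression="zip")
        multi_error = None
    except ValueError as exc:
        multi_error = str(exc)

print(loaded.equals(df))
print(multi_error)
```

So if you really need several DataFrames in one archive, you would have to open the archive with zipfile yourself and unpickle each member separately.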

(4) xz

  • Characteristics:

    • It offers a very high compression ratio, but compression and decompression are very slow.
  • Suitable when:

    • Compressing data for long-term archival
    • Storage space must be saved as much as possible

Example code:

# Apply xz compression (reuses df from the gzip example)
df.to_pickle('data.pkl.xz', compression='xz')

# Read the compressed pickle back
df_loaded = pd.read_pickle('data.pkl.xz', compression='xz')
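
A compressed pickle written this way is, as far as I can tell, just an ordinary pickle stream wrapped in the chosen container, so it can also be opened without read_pickle(). The sketch below illustrates this for xz using the standard lzma and pickle modules; the file name is made up, and the same idea should apply to gzip and bz2 with their respective modules.

```python
import lzma
import os
import pickle
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.pkl.xz")  # hypothetical file name
    df.to_pickle(path, compression="xz")

    # Decompress with the standard library, then unpickle the stream.
    with lzma.open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored.equals(df))
```

This can be handy for long-term archives, since the file can still be inspected with standard xz tools.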

  3. When Should You Choose Which Compression Method?

Pulling the characteristics of each method together, here is a summary of which compression method suits which situation.

Situation | Recommended method | Reason
Speed matters most | gzip | Fast compression and decompression
Maximum compression ratio is needed | xz | Offers a very high compression ratio
Compression ratio matters more than speed | bz2 | Offers a higher compression ratio than gzip
Bundling several files together | zip | Good compatibility and can compress multiple files
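
Since the actual ratios and speeds depend on the data, it is worth measuring on your own DataFrame before settling on a method. The following is a minimal benchmarking sketch with hypothetical toy data and made-up file names; the numbers it prints are only meaningful for the machine and data it runs on.

```python
import os
import tempfile
import time

import pandas as pd

# Hypothetical toy data; real results depend heavily on the actual DataFrame.
n = 200_000
df = pd.DataFrame({
    "id": range(n),
    "group": [f"g{i % 50}" for i in range(n)],
    "value": [i * 0.5 for i in range(n)],
})

methods = {"none": None, "gzip": "gzip", "bz2": "bz2", "zip": "zip", "xz": "xz"}
rows = []

with tempfile.TemporaryDirectory() as tmp:
    for name, method in methods.items():
        path = os.path.join(tmp, f"bench_{name}.pkl")

        start = time.perf_counter()
        df.to_pickle(path, compression=method)
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        loaded = pd.read_pickle(path, compression=method)
        read_s = time.perf_counter() - start

        assert loaded.equals(df)  # every method must round-trip losslessly
        rows.append({
            "method": name,
            "size_kb": os.path.getsize(path) / 1024,
            "write_s": write_s,
            "read_s": read_s,
        })

# One row per method, smallest file first.
report = pd.DataFrame(rows).sort_values("size_kb").reset_index(drop=True)
print(report)
```

Swapping in a sample of your real data for df should tell you quickly whether the general tendencies in the table hold in your case.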

  4. Conclusion

Applying compression when saving pickle files in Pandas reduces file size and saves both transfer bandwidth and storage space. Each compression method has its own strengths and weaknesses, however, so it is important to choose one that fits the characteristics of your data and your purpose.

  • gzip: recommended when you need fast compression and decompression
  • bz2: when you want a high compression ratio and speed is not important
  • zip: when bundling several files together, or when compatibility on Windows is a concern
  • xz: when you need the highest possible compression ratio and slow speed is acceptable

Personally, I mostly use gzip at work, but with this post I wanted to organize the characteristics of the various compression methods and look for alternatives.

Which compression method do you usually use? Please share your own experience and use cases! 😊

Thank you for reading today as well 🙇‍♂️


