
[pandas] Optimizing DataFrame's Memory

1. Estimating the amount of memory

The pandas DataFrame.info() method reports the non-null counts, dtypes, and memory usage of a data frame. Passing the memory_usage='deep' keyword reports a more accurate memory figure, because pandas then inspects the actual stored objects instead of only the pointer arrays.

 

import pandas as pd 
df = pd.read_csv('file.csv') 
df.info(memory_usage='deep')

 

1.1 Pandas BlockManager

Pandas's BlockManager class groups columns of the same type and stores them together. The DataFrame behaves like an API over the BlockManager: when we access or modify values, the DataFrame delegates the work to the BlockManager. Each block is stored as a NumPy ndarray, which makes operations fast.

 

print(df._data)  # the BlockManager is stored in the _data attribute

BlockManager
Items: Index(['col1', 'col2', ..., 'coln'], dtype='object') 
Axis 1: RangeIndex(start=0, stop=1432534, step=1) 
FloatBlock: [0, 3, 5, ..., k], k x 1432534, dtype: float64
ObjectBlock: [1, 2, 4, ..., n-k], n-k x 1432534, dtype: object

 

1.2 Float Columns

The float64 data type stores a decimal value in 64 bits (8 bytes). Since the data frame has 1,432,534 rows, each float64 column uses 1,432,534 × 8 = 11,460,272 bytes.
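The arithmetic above can be checked directly; a minimal sketch assuming the 1,432,534-row frame described in this post:

```python
# Each float64 value occupies 8 bytes, so a column's memory is rows * 8.
n_rows = 1_432_534
bytes_per_float64 = 8

column_bytes = n_rows * bytes_per_float64
print(column_bytes)  # 11460272
```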

 

1.3 Object Columns

The object data type stores string data. Python is a high-level, interpreted language, so its objects consume more memory and are slower to process. This is because a Python list stores the address where each value lives, not the value itself.

 

By default, pandas estimates an object column's memory by counting only the stored addresses (8 bytes each) rather than following them to the actual values.
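The shallow-versus-deep difference is easy to observe; a small sketch with a synthetic Series of repeated 100-character strings:

```python
import pandas as pd

# An object column stores 8-byte pointers; the string bodies live elsewhere.
s = pd.Series(['a' * 100] * 1000)

shallow = s.memory_usage(deep=False)  # counts only the pointer array (and index)
deep = s.memory_usage(deep=True)      # follows the pointers to the actual strings

print(shallow, deep)  # deep is far larger than shallow
```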

 

2. Calculate the amount of memory

The DataFrame.size attribute returns the number of values stored in the data frame, i.e. rows × columns.

 

num_entries = df.size 
total_bytes = num_entries * 8 
total_megabytes = total_bytes / (2 ** 20)  # 2**20 bytes per megabyte
print(total_megabytes)

 

The memory size of the data frame including object data can be checked with the info(memory_usage='deep') method. In other words, if the memory size without the keyword is 8.1MB and the size with the keyword is 80.1MB, the size of the string data can be estimated to be about 72MB.
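The same subtraction can be done numerically; a sketch using a synthetic frame (the column names here are illustrative, not from the original file):

```python
import pandas as pd

# A frame with one numeric column and one string (object) column
df = pd.DataFrame({'num': range(1000), 'text': ['some row text'] * 1000})

shallow_mb = df.memory_usage().sum() / (2 ** 20)
deep_mb = df.memory_usage(deep=True).sum() / (2 ** 20)

# The difference approximates the memory held by the string objects themselves
string_mb = deep_mb - shallow_mb
print(f'{string_mb:.3f} MB of string data')
```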

 

To measure the size exactly, measure the memory consumed by only the object-type columns of the data frame.

 

  • DataFrame.select_dtypes(include=['object']) : Get a data frame containing only the columns with the given data type
  • DataFrame.memory_usage(deep=True) : Return the amount of memory each column consumes

 

obj_cols = df.select_dtypes(include=['object']) 
obj_cols_mem = obj_cols.memory_usage(deep=True) 
obj_cols_sum = obj_cols_mem.sum() / (2**20)

 

3. Optimizing DataFrame by data types

3.1 Optimizing Numerical data type

Numeric data can be optimized by choosing the smallest bit width that can hold the values. The numpy.iinfo() function returns the minimum and maximum integers each integer data type can represent.
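A quick look at what numpy.iinfo() reports for each signed integer type:

```python
import numpy as np

# The range each signed integer dtype can represent
for name in ['int8', 'int16', 'int32', 'int64']:
    info = np.iinfo(name)
    print(name, info.min, info.max)
# int8 spans -128..127, int16 spans -32768..32767, and so on
```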

 

Missing values inside the data frame are represented by np.nan. The DataFrame.isnull() method returns a True/False data frame indicating which values are missing. Using the DataFrame.isnull().sum() method, the number of missing values in each column can be checked.
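A minimal sketch of counting missing values per column (the column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 1.0]})

# Sum of the boolean mask gives the missing count per column
print(df.isnull().sum())
# a    1
# b    2
# dtype: int64
```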

 

import numpy as np

def change_to_int(df, col_name): 
    # Get the minimum and maximum values 
    col_max = df[col_name].max() 
    col_min = df[col_name].min() 
    for dtype_name in ['int8', 'int16', 'int32', 'int64']: 
        # Check if this data type can hold all values
        # (iinfo's min and max are attributes, not methods)
        if col_max < np.iinfo(dtype_name).max and col_min > np.iinfo(dtype_name).min: 
            df[col_name] = df[col_name].astype(dtype_name) 
            break 
            
# Optimize float columns (only safe when they hold whole numbers and no NaN)
float_cols = df.select_dtypes(include=['float64']).columns 
for col in float_cols: 
    change_to_int(df, col) 
    
print(df[float_cols].dtypes)

 

The pandas.to_numeric() method not only converts values to a numeric data type, but with the downcast keyword it also converts them to the smallest data type of the given kind. In other words, it is a built-in counterpart of the change_to_int() function above.

 

for col in float_cols: 
    df[col] = pd.to_numeric(df[col], downcast='float') 

print(df[float_cols].dtypes)

 

3.2 Optimizing in datetime

Pandas has a data type for representing date data called datetime64. pd.to_datetime() converts the data in a column to the datetime data type.

 

df['date1'] = pd.to_datetime(df['date1']) 
df['date1'].memory_usage()
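The saving comes from replacing per-row Python string objects with fixed 8-byte datetime64 values; a sketch with synthetic date strings:

```python
import pandas as pd

# Date strings stored as Python objects vs parsed datetime64 values
dates = pd.Series(['2022-01-01'] * 10_000)

as_object = dates.memory_usage(deep=True)
as_datetime = pd.to_datetime(dates).memory_usage(deep=True)

# datetime64 uses a fixed 8 bytes per value, far less than a Python string
print(as_object, as_datetime)
```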

 

3.3 Optimizing in category

Like the enumerated type in SQL databases, pandas has a category data type. The category type maps each distinct value to an integer code, and is very efficient for representing a small set of distinct values.

 

However, if the number of unique values exceeds about 50% of the rows, the category type becomes inefficient, because the category labels themselves must also be stored.

 

# memory usage before changing type 
print(df['cat_col1'].memory_usage(deep=True))

# memory usage after changing type 
df['cat_col1'] = df['cat_col1'].astype('category') 
print(df['cat_col1'].memory_usage(deep=True)) 

# Return the integer codes the category type uses to represent each value
print(df['cat_col1'].cat.codes) 

# Apply it to optimization: only columns whose unique values are under half the rows
cat_cols = [col for col in df.select_dtypes(include=['object'])
            if df[col].nunique() < (len(df) / 2)]

for col in cat_cols: 
    df[col] = df[col].astype('category')

df.info(memory_usage='deep')

 

4. Optimizing DataFrame while reading

If the data frame cannot even be opened because it does not fit in memory, it is necessary to specify data types while loading it. The dtype keyword of the pandas.read_csv() method can optimize a data frame by specifying each column's data type as the file is read.

 

  • dtype : Accepts a dictionary with column names as keys and NumPy type objects as values 
  • parse_dates : Accepts a list of strings containing the names of columns to parse as datetime 
  • usecols : Specifies which columns to include 

 

# Columns listed in parse_dates must also appear in usecols
keep_cols = ['col1', 'col2', 'col3', 'date1', 'date2'] 

df = pd.read_csv('file.csv', parse_dates=['date1', 'date2'], usecols=keep_cols) 
df.head()
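The dtype keyword from the list above can be combined with the same call; a sketch using an in-memory CSV and hypothetical column names and dtypes (adjust to your file's actual schema):

```python
import io
import pandas as pd

# Hypothetical schema: specify compact dtypes up front instead of converting later
csv_data = io.StringIO('col1,col2,col3\n1,2.5,a\n2,3.5,b\n')
col_types = {'col1': 'int32', 'col2': 'float32', 'col3': 'category'}

df = pd.read_csv(csv_data, dtype=col_types)
print(df.dtypes)  # columns arrive already downcast
```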
