
Data Science/Pandas

(6)
[pandas] Optimizing a DataFrame's Memory 1. Estimating memory usage The pandas DataFrame.info() method reports the non-null counts, dtypes, and memory usage of a DataFrame. Passing memory_usage='deep' gives a more accurate figure because it also inspects the values stored in object-dtype columns. import pandas as pd df = pd.read_csv('file.csv') df.info(memory_usage='deep') 1.1 Pandas BlockManager The pandas BlockManager class groups data by type and stores it sepa..
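To make the difference between the default and 'deep' estimates concrete, here is a minimal sketch. It builds a small in-memory DataFrame (the column names and values are made up for illustration, standing in for 'file.csv') and compares the two totals with DataFrame.memory_usage().

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV file in the post.
df = pd.DataFrame({
    "id": range(1000),
    "city": ["Seoul", "Busan"] * 500,
})

# Default estimate: for object columns this counts only the references.
shallow_bytes = df.memory_usage().sum()

# Deep estimate: also inspects the Python objects held in object columns.
deep_bytes = df.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")

# The same deep figure appears in the last line of the info() report.
df.info(memory_usage="deep")
```

The deep figure is never smaller than the default one; how far apart they are depends on how the text column is stored in the installed pandas version.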
[pandas] Introduction to Pandas 1. Importing the pandas library The NumPy library provides useful algebraic operations and has the advantage of fast execution. However, a NumPy ndarray stores elements of a single data type and is geared toward numerical data. The pandas library lets us work with columns of different data types. import pandas as pd 2. Reading a CSV file using Pan..
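The contrast between a single-dtype ndarray and a DataFrame with per-column dtypes can be sketched as follows. The column names and values are illustrative assumptions, not taken from the post.

```python
import numpy as np
import pandas as pd

# An ndarray holds one dtype: mixing ints and a float upcasts everything.
arr = np.array([1, 2.5, 3])
print(arr.dtype)

# A DataFrame keeps a separate dtype for each column.
df = pd.DataFrame({
    "name": ["a", "b"],      # text column
    "score": [1.5, 2.0],     # float column
    "count": [1, 2],         # integer column
})
print(df.dtypes)
```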
[pandas] Basic Data Exploration 1. Basic Exploratory Data Analysis Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. The most important object in the pandas library is the DataFrame, a table made up of rows and columns. import pandas as pd # Save filepath to variable for easier access melbourne_file_path = '../input..
[pandas] Useful personal functions for EDA 1. Check missing records def missing(df): missing_number = df.isnull().sum().sort_values(ascending=False) missing_percent = (df.isnull().sum() / df.isnull().count()).sort_values(ascending=False) missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=['Missing_number', 'Missing_percent']) return missing_values 2. Grouping columns by their features def categorize(df) : Quan..
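As a quick check of the missing() helper from the excerpt, the sketch below runs it on a small made-up DataFrame (the sample data is an assumption for illustration). Note that the Missing_percent column holds a fraction between 0 and 1, not a value out of 100.

```python
import numpy as np
import pandas as pd

def missing(df):
    # Count of missing values per column, largest first.
    missing_number = df.isnull().sum().sort_values(ascending=False)
    # Fraction of rows that are missing per column (0.0 to 1.0).
    missing_percent = (df.isnull().sum() / df.isnull().count()).sort_values(ascending=False)
    # Combine both into one summary table indexed by column name.
    return pd.concat(
        [missing_number, missing_percent],
        axis=1,
        keys=["Missing_number", "Missing_percent"],
    )

# Hypothetical sample data: "a" has 2 gaps, "c" has 1, "b" has none.
sample = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": [1, 2, 3, 4],
    "c": [np.nan, "x", "y", "z"],
})

report = missing(sample)
print(report)
```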
[pandas] Cut rows based on integer values To bin rows by an integer column and convert the result to a category, there are two functions: pd.cut(): bins values using boundaries we set ourselves. pd.qcut(): bins values using boundaries chosen automatically from quantiles. After either function is applied, the resulting column has the categorical dtype. bins = [1, 20, 30, 50, 70, 100] labels = ["Minor", "Young adult", "Middle-aged", "Older adult", "Senior"] titanic['age_cat'] = pd.cut(t..
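The difference between the two functions can be sketched on a small Series. The ages below are made-up sample values, and the labels are English stand-ins chosen for this example; the bin edges are the ones from the excerpt.

```python
import pandas as pd

# Hypothetical ages; the post applies the same idea to a titanic DataFrame.
ages = pd.Series([5, 19, 25, 42, 63, 80])

# pd.cut: explicit bin edges, right-inclusive by default, e.g. (1, 20].
bins = [1, 20, 30, 50, 70, 100]
labels = ["Minor", "Young adult", "Middle-aged", "Older adult", "Senior"]
age_cat = pd.cut(ages, bins=bins, labels=labels)

# pd.qcut: edges are derived from quantiles, here three equal-sized groups.
age_q = pd.qcut(ages, q=3)

print(age_cat)
print(age_q.value_counts())
```

With pd.cut the group sizes depend on where the values fall, while pd.qcut keeps the groups roughly equal in size and lets the edges vary.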
[pandas] Set options Pandas has an options API to configure and customize global behavior related to DataFrame display, date behavior and more. The most commonly used options for DataFrames are below: import pandas as pd # Maximum number of columns to display pd.options.display.max_columns = 999 # Suppress scientific notation pd.set_option('display.float_format', lambda x: '%.5f' % x)
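Because these options are global, it can help to scope or undo them. The following is a minimal sketch, with made-up sample values, that sets the two options from the excerpt, overrides one temporarily with pd.option_context, and then restores the defaults with pd.reset_option.

```python
import pandas as pd

# Global settings, as in the excerpt.
pd.set_option("display.max_columns", 999)
pd.set_option("display.float_format", lambda x: "%.5f" % x)
max_cols_now = pd.get_option("display.max_columns")

# Hypothetical values that would otherwise risk scientific notation.
s = pd.Series([1234567.891, 0.000012345])
five_decimals = s.to_string()

# Temporary override: applies only inside the with-block.
with pd.option_context("display.float_format", "{:.2f}".format):
    two_decimals = s.to_string()

print(five_decimals)
print(two_decimals)

# Restore the library defaults so later output is unaffected.
pd.reset_option("display.max_columns")
pd.reset_option("display.float_format")
```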