1. Importing Pandas library
The NumPy library provides fast array operations for linear algebra and other numerical work. However, a NumPy ndarray stores elements of a single data type and is geared toward numerical data. The Pandas library lets us work with tabular data whose columns have different data types.
import pandas as pd
2. Reading csv format file using Pandas
Pandas provides the pandas.read_csv() function for reading CSV files. Its keyword arguments control how the file is parsed. Two commonly used ones are:
- sep : the field separator (a comma by default)
- encoding : the text encoding of the file, for example 'utf-8'
import pandas as pd
df = pd.read_csv('file.csv')
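The two keywords above can be tried without a real file. The following is a minimal sketch that assumes a small semicolon-separated text held in memory; the content and the names raw_text, df_example and df_from_bytes are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content standing in for a file on disk.
raw_text = "name;score\nkim;90\nlee;85\n"

# sep tells read_csv which character separates the fields.
df_example = pd.read_csv(io.StringIO(raw_text), sep=";")

# encoding matters when the source is bytes; here the same text is UTF-8 encoded.
raw_bytes = raw_text.encode("utf-8")
df_from_bytes = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="utf-8")

print(df_example)
```

With a real file, the first argument would simply be the file path, as in the line above.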
3. The core object of Pandas : DataFrame
Because the Pandas library is built on the NumPy package, its DataFrame object is backed by ndarrays. A DataFrame therefore resembles a 2-dimensional NumPy array and offers many methods and attributes, such as:
- df.shape : an attribute holding a tuple with the number of rows and columns of the dataframe.
- df.head() : returns the first five rows
- df.tail() : returns the last five rows
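As a quick illustration of these three members, here is a small sketch on a made-up frame (df_example and its columns are hypothetical, not part of any real dataset).

```python
import pandas as pd

# Hypothetical frame with 8 rows and 2 columns.
df_example = pd.DataFrame({"col1": range(8), "col2": list("abcdefgh")})

n_rows, n_cols = df_example.shape  # shape is an attribute, so no parentheses
first_rows = df_example.head()     # first five rows by default
last_rows = df_example.tail(3)     # a different row count can be passed

print(n_rows, n_cols)
print(first_rows)
print(last_rows)
```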
3.1 Indexing DataFrame
df.iloc
DataFrame has a df.iloc attribute. iloc stands for integer location: it selects rows and columns by their integer positions. The start:end:step slicing used with ndarray also works.
df.iat
Unlike iloc, the iat attribute does not accept slices. It takes exactly one row position and one column position and returns a single value.
df_odd = df.iloc[1::2, :3]
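fifth_row = df.iloc[4, :]  # iat cannot take a slice, so a whole row needs iloc
fifth_row_first_value = df.iat[4, 0]  # iat returns one scalar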
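The difference between the two accessors may be easier to see on a tiny frame. This is only a sketch; df_example and its values are invented for the demonstration.

```python
import pandas as pd

# Hypothetical 6-row frame used only for this demonstration.
df_example = pd.DataFrame({
    "col1": [10, 20, 30, 40, 50, 60],
    "col2": list("abcdef"),
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# iloc accepts slices: rows at positions 1, 3, 5 and the first two columns.
every_other = df_example.iloc[1::2, :2]

# iloc with one row position and a full column slice returns that row as a Series.
single_row = df_example.iloc[4, :]

# iat needs one row position and one column position and returns a scalar.
single_value = df_example.iat[4, 0]

print(every_other)
print(single_value)
```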
df.loc
Pandas can look up data not only by integer position but also by the labels of the index and the columns. The df.loc attribute retrieves data by row and column labels.
One of the columns can be set as the index in either of the following ways.
- pd.read_csv('file.csv', index_col=n)
- df.set_index('col1', inplace=True)
df.set_index('col1', inplace=True)
df_col1 = df.loc['val1', ['col2', 'col3']]  # 'val1' is a placeholder for a label in the new index
To turn the index back into an ordinary column of the data frame, the following method can be used.
- df.reset_index(inplace=True)
df.reset_index(inplace=True)
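The round trip of setting an index, selecting by label, and resetting the index can be sketched as follows. The frame and the labels 'val1' to 'val3' are hypothetical, and the sketch uses the non-inplace forms, which return new objects instead of modifying df_example.

```python
import pandas as pd

# Hypothetical data; 'val1'..'val3' play the role of row labels.
df_example = pd.DataFrame({
    "col1": ["val1", "val2", "val3"],
    "col2": [1, 2, 3],
    "col3": [0.1, 0.2, 0.3],
})

# Without inplace=True, set_index returns a new frame indexed by col1.
indexed = df_example.set_index("col1")

# Select one row by its label and two columns by their names.
row = indexed.loc["val2", ["col2", "col3"]]

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()

print(row)
print(restored)
```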
To select multiple rows or columns, pass a list of row or column labels, just as specific rows of an ndarray can be selected with a list.
Note that a single column label written on its own (not in a list) returns a Series object, whereas the same label written inside a list returns a DataFrame object.
num_df = df.loc[:, df.select_dtypes(exclude=['object']).columns]
# A bare column label returns a Series
res_series = df.loc[:, 'col1']
# A column label inside a list returns a DataFrame
res_df = df.loc[:, ['col1']]
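The Series versus DataFrame distinction can be checked directly with isinstance. The sketch below uses an invented frame; it also selects numeric columns with select_dtypes(include="number"), which is a variation on the exclude=['object'] form used above.

```python
import pandas as pd

# Hypothetical frame mixing text and numeric columns.
df_example = pd.DataFrame({
    "col1": ["a", "b"],
    "col2": [1, 2],
    "col3": [1.5, 2.5],
})

as_series = df_example.loc[:, "col1"]    # bare label: 1-dimensional result
as_frame = df_example.loc[:, ["col1"]]   # label in a list: 2-dimensional result

# Keep only the numeric columns.
numeric_only = df_example.select_dtypes(include="number")

print(type(as_series).__name__, type(as_frame).__name__)
print(list(numeric_only.columns))
```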
4. Index object of Pandas
By default, the Index object of a DataFrame holds integers starting at 0. To use a different index, a new Index object can be created with the pd.Index() constructor.
index_start_one = pd.Index(range(1, len(df)+1))
df.set_index(index_start_one, inplace=True)
df_100 = df.loc[100]
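The effect of replacing the default index is easiest to verify on a very small frame. In this sketch the three-row df_example is hypothetical; after the change, label 1 refers to the first row.

```python
import pandas as pd

# Hypothetical 3-row frame with the default 0-based index.
df_example = pd.DataFrame({"col1": ["a", "b", "c"]})
default_labels = list(df_example.index)

# Build a 1-based index of the same length and use it as the new index.
one_based = pd.Index(range(1, len(df_example) + 1))
df_example = df_example.set_index(one_based)

# loc now looks rows up by the new labels.
first_row = df_example.loc[1]

print(default_labels, list(df_example.index))
print(first_row)
```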
5. The core object of Pandas : Series
The Series object of Pandas is a data structure that stores 1-dimensional data. A Series has the same structure as a 1-dimensional array and, when taken from a data frame, inherits the data frame's index (its row labels). The following attributes and methods are available:
- Series.values : Return values of Series
- Series.max() : Return the max value of Series
- Series.min() : Return the min value of Series
- Series.mean() : Return the average value of Series
- Series.value_counts() : Return the counts of each value a Series contains
- Series.to_dict() : Convert Series into a dictionary
import pandas as pd
max_col1 = df.col1.max()
min_col1 = df.col1.min()
counts = df.col1.value_counts()
col1_dict = counts.to_dict()
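To see what each of these members returns, here is a sketch on a standalone Series. The values are made up; a column taken from a DataFrame would behave the same way.

```python
import pandas as pd

# Hypothetical Series; in practice this could be df['col1'].
col1 = pd.Series([3, 1, 3, 2, 3], name="col1")

values = col1.values           # underlying array of values
largest = col1.max()           # 3
smallest = col1.min()          # 1
average = col1.mean()          # 12 / 5 = 2.4
counts = col1.value_counts()   # Series mapping each value to its frequency
counts_dict = counts.to_dict() # plain dict: value -> count

print(largest, smallest, average)
print(counts_dict)
```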
6. Boolean Indexing
In NumPy, a boolean mask is used to find the rows and columns that satisfy a condition. Similarly, in Pandas, the rows of a DataFrame that meet a condition can be selected through a boolean mask of True/False values. With df.loc, the rows that satisfy the condition can be extracted together with specific columns.
col1_val1 = df[df['col1'] == 'val1']
non_col1_val1 = df[~(df['col1'] == 'val1')]  # parentheses are needed because ~ binds tighter than ==
complex_conditions = df[(df['col1'] == 'val1') & (df['col2'] > 1000)]
complex_conditions_rows = df.loc[(df['col1'] == 'val1') & (df['col2'] > 1000), ['col3', 'col4']]
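The masks themselves are ordinary boolean Series, which becomes clear when they are stored in a variable first. The following sketch uses an invented frame; the column names and values are placeholders.

```python
import pandas as pd

# Hypothetical data for demonstrating boolean masks.
df_example = pd.DataFrame({
    "col1": ["val1", "val2", "val1", "val3"],
    "col2": [500, 1500, 2000, 3000],
    "col3": list("wxyz"),
})

# The comparison yields one True/False per row.
mask = df_example["col1"] == "val1"

matching = df_example[mask]        # rows where the mask is True
not_matching = df_example[~mask]   # ~ inverts the mask

# Conditions are combined with & (and) or | (or); each one needs parentheses.
both = df_example.loc[mask & (df_example["col2"] > 1000), ["col2", "col3"]]

print(list(mask))
print(both)
```

Storing the mask in a variable also avoids the precedence pitfall of applying ~ to a column before the comparison.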
7. Making new columns
A value calculated from existing columns can be stored in a new column by assigning it to df['new_col']. A Series created beforehand can also be assigned, because a DataFrame is essentially a collection of Series that share one index.
df['new_col'] = df['col2'] / df['col3'] * df['col1']
above_zero = df.loc[df['col3'] > 0, 'col2'].mean()  # a row mask plus a column label requires df.loc
new_col = 135 / above_zero
df['new_col'] = new_col
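Both kinds of assignment, a column computed row by row and a single scalar spread over every row, can be sketched on a small frame. The data and the column names ratio and scaled are hypothetical.

```python
import pandas as pd

# Hypothetical numeric frame.
df_example = pd.DataFrame({
    "col2": [10.0, 20.0, 30.0],
    "col3": [2.0, -1.0, 5.0],
})

# Element-wise arithmetic on two columns produces a Series that becomes a new column.
df_example["ratio"] = df_example["col2"] / df_example["col3"]

# Mean of col2 over the rows where col3 is positive: (10 + 30) / 2 = 20.
mean_col2 = df_example.loc[df_example["col3"] > 0, "col2"].mean()

# Assigning a scalar fills every row of the new column with the same value.
df_example["scaled"] = 100 / mean_col2

print(df_example)
```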