

[pandas] Introduction to Pandas

1. Importing Pandas library

The NumPy library provides fast operations for numerical and algebraic computation. However, a NumPy ndarray is homogeneous: every element must share a single data type, which makes it awkward for tabular data that mixes numbers and text. The Pandas library lets us work with a different data type in each column.

 

import pandas as pd
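
To make the difference concrete, here is a small sketch that puts mixed data into an ndarray and into a DataFrame. The variable names, column names, and values are made up for this illustration.

```python
import numpy as np
import pandas as pd

# An ndarray holds a single dtype, so mixing numbers and text
# forces every element into one common type (here, strings).
arr = np.array([1, 2.5, 'a'])
print(arr.dtype.kind)   # 'U' means a unicode string dtype

# A DataFrame keeps a separate dtype for each column.
mixed = pd.DataFrame({'count': [1, 2, 3],
                      'price': [1.5, 2.5, 3.5],
                      'name': ['a', 'b', 'c']})
print(mixed.dtypes)
```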

 

2. Reading a CSV file using Pandas

Pandas provides the pandas.read_csv() function for reading CSV files. Its keyword arguments control how the file is parsed. Two commonly used ones are as follows.

 

  • sep : the field separator (a comma by default)
  • encoding : the text encoding of the file (for example 'utf-8')

 

import pandas as pd 
df = pd.read_csv('file.csv')
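
The two keywords above can be tried without a file on disk. The sketch below is only an illustration: it reads semicolon-separated bytes from memory, and the sample text, the variable names, and the choice of 'latin-1' are assumptions made for the example.

```python
import io
import pandas as pd

# Stand-in for a file on disk: semicolon-separated text saved as latin-1 bytes.
raw = "name;score\nJosé;90\nLee;85\n"
raw_bytes = raw.encode('latin-1')

# sep tells read_csv which character separates the fields;
# encoding tells it how to decode the bytes into text.
scores = pd.read_csv(io.BytesIO(raw_bytes), sep=';', encoding='latin-1')
print(scores)
```

With a real file, the first argument would be the path, as in the snippet above.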

 

3. The core object of Pandas : DataFrame

Because the Pandas library is built on the NumPy package, its DataFrame object is based on ndarray. A DataFrame therefore resembles a 2-dimensional NumPy array with labeled rows and columns, and it has many methods and attributes.

 

  • df.shape : attribute holding a tuple with the number of rows and columns of the DataFrame
  • df.head() : returns the first five rows (by default)
  • df.tail() : returns the last five rows (by default)
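
A quick sketch of these three on a small made-up DataFrame (the name demo and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical frame with 8 rows and 2 columns.
demo = pd.DataFrame({'x': range(8), 'y': list('abcdefgh')})

print(demo.shape)     # (rows, columns)
print(demo.head())    # first five rows by default
print(demo.tail(3))   # pass n to change how many rows are returned
```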

 

3.1 Indexing DataFrame

df.iloc

A DataFrame has a df.iloc attribute. iloc stands for integer location, and it selects rows and columns by their integer positions. The start:end:step slicing used with ndarray also works here.

 

df.iat

Unlike iloc, the iat attribute does not accept slices. It takes one row position and one column position and returns a single value.

 

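df_odd = df.iloc[1::2, :3]   # every second row starting at position 1, first three columns
sixth_row = df.iloc[5, :]    # a whole row needs iloc, because iat does not accept slices
cell = df.iat[5, 0]          # iat returns the single value at row position 5, column position 0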
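
For a version that can be run as is, the following sketch uses a made-up 6x4 DataFrame filled with 0 to 23 so that each position is easy to follow. The name grid and the column letters are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical 6x4 frame filled with 0..23 so positions are easy to follow.
grid = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list('abcd'))

odd_rows = grid.iloc[1::2, :3]   # rows at positions 1, 3, 5 and the first three columns
last_row = grid.iloc[-1]         # a single row comes back as a Series
cell = grid.iat[2, 1]            # row position 2, column position 1

print(odd_rows)
print(cell)
```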

 

df.loc

Pandas can look up data not only by integer position but also by the names (labels) of the index and columns. The df.loc attribute retrieves data by row and column labels.

 

One of the columns can be set as the index in either of the following ways.

 

  • pd.read_csv('file.csv', index_col=n)
  • df.set_index('col1', inplace=True)

 

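df.set_index('col1', inplace=True)  # the values of col1 become the row labels
# 'val1' is one of the values in col1, now used as a row label
df_col1 = df.loc['val1', ['col2', 'col3']]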
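
As a runnable illustration of label-based lookup, the sketch below sets a column as the index and then selects by label. The data is made up for the example, with 'city' standing in for col1.

```python
import pandas as pd

# Hypothetical data; 'city' plays the role of col1 in the text.
people = pd.DataFrame({'city': ['Seoul', 'Busan', 'Daegu'],
                       'population': [9.4, 3.3, 2.4],
                       'area': [605, 770, 884]})

by_city = people.set_index('city')   # returns a new frame; people is unchanged

# Rows are now addressed by city name, columns by column name.
busan = by_city.loc['Busan', ['population', 'area']]
print(busan)
```

Without inplace=True, set_index returns a new DataFrame and leaves the original untouched, which is convenient when both versions are needed.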

 

To turn the index back into an ordinary column of the DataFrame, use the following method.

 

  • df.reset_index(inplace=True)

 

df.reset_index(inplace=True)
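
Here is a small sketch of what reset_index does to the index and columns. The DataFrame and its names are hypothetical.

```python
import pandas as pd

indexed = pd.DataFrame({'population': [9.4, 3.3]},
                       index=pd.Index(['Seoul', 'Busan'], name='city'))

# reset_index moves the index back into a column named after the index
# and restores the default 0, 1, 2, ... index.
flat = indexed.reset_index()
print(flat)

# drop=True discards the old index instead of keeping it as a column.
dropped = indexed.reset_index(drop=True)
print(dropped)
```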

 

To select multiple rows or columns, pass a list of row or column labels, just as a list of positions selects specific rows in an ndarray.

 

Note that selecting a single column with a bare label returns a Series object, whereas putting that same label inside a list returns a DataFrame object.

 

num_df = df.loc[:, df.select_dtypes(exclude=['object']).columns] 

# The object of this code will be Series 
res_series = df.loc[:, 'col1']

# The object of this code will be DataFrame 
res_df = df.loc[:, ['col1']]
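
The difference is easy to check by printing the type and shape of each result. The sketch below does that on a made-up DataFrame (the name items and its columns are assumptions) and also shows select_dtypes with include='number' as another way to keep only the numeric columns.

```python
import pandas as pd

items = pd.DataFrame({'name': ['a', 'b'], 'price': [10, 20], 'qty': [1.5, 2.0]})

as_series = items.loc[:, 'price']     # bare label -> Series
as_frame = items.loc[:, ['price']]    # list with one label -> DataFrame
print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# select_dtypes(include='number') keeps only the numeric columns.
numeric = items.select_dtypes(include='number')
print(numeric.columns.tolist())
```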

 

4. Index object of Pandas

By default, a DataFrame's index consists of integers starting at 0. To use a different index, create a new Index object with pd.Index() and assign it to the DataFrame.

 

index_start_one = pd.Index(range(1, len(df)+1)) 
df.set_index(index_start_one, inplace=True) 

df_100 = df.loc[100]
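
Once the labels start at 1, loc and iloc no longer agree on what "1" means. A minimal sketch on a made-up three-row DataFrame (the names letters and relabeled are hypothetical):

```python
import pandas as pd

letters = pd.DataFrame({'letter': ['a', 'b', 'c']})
print(letters.index)   # default integer index starting at 0

# Build labels 1..n and use them as the new index.
relabeled = letters.set_index(pd.Index(range(1, len(letters) + 1)))

# loc looks up the label, iloc the position: label 1 is position 0.
print(relabeled.loc[1, 'letter'])
print(relabeled.iloc[0, 0])
```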

 

5. The core object of Pandas : Series

The Series object of Pandas is a data structure that stores 1-dimensional data. A Series has the same structure as a 1-dimensional array, and when it is taken from a DataFrame it keeps that DataFrame's row index. The following attributes and methods are commonly used:

 

  • Series.values : Return values of Series
  • Series.max() : Return the max value of Series
  • Series.min() : Return the min value of Series
  • Series.mean() : Return the average value of Series
  • Series.value_counts() : Return the counts of each value a Series contains
  • Series.to_dict() : Convert Series into a dictionary

 

import pandas as pd 

max_col1 = df.col1.max() 
min_col1 = df.col1.min() 
counts = df.col1.value_counts() 
col1_dict = counts.to_dict()
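
The same calls on standalone Series, as a sketch that can be run without an existing df. The names grades and points and their values are made up.

```python
import pandas as pd

grades = pd.Series(['A', 'B', 'A', 'C', 'A'], name='grade')
points = pd.Series([4.0, 3.0, 4.0, 2.0, 4.0], name='points')

print(points.max(), points.min(), points.mean())
print(points.values)              # underlying NumPy array

counts = grades.value_counts()    # sorted by frequency, most common first
print(counts.to_dict())
```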

 

6. Boolean Indexing

In NumPy, a boolean mask is used to find the rows and columns that satisfy a condition. Similarly, in Pandas, the rows of a DataFrame that meet a condition can be selected with a boolean mask of True/False values. With df.loc, the rows that satisfy the condition can be extracted together with specific columns.

 

col1_val1 = df[df['col1'] == 'val1']
# Parenthesize the comparison so that ~ negates the whole mask
non_col1_val1 = df[~(df['col1'] == 'val1')]

complex_conditions = df[(df['col1'] == 'val1') & (df['col2'] > 1000)]
complex_conditions_rows = df.loc[(df['col1'] == 'val1') & (df['col2'] > 1000), ['col3', 'col4']]
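
It can help to look at the mask itself before using it. The sketch below builds a mask on a made-up orders table (all names and values are assumptions for the example), inverts it, and combines it with a second condition.

```python
import pandas as pd

orders = pd.DataFrame({'status': ['paid', 'open', 'paid', 'open'],
                       'amount': [500, 1500, 2000, 300],
                       'customer': ['kim', 'lee', 'park', 'choi']})

is_paid = orders['status'] == 'paid'   # boolean Series, one value per row
print(is_paid.tolist())

not_paid = orders[~is_paid]            # ~ inverts the whole mask

# Each comparison needs its own parentheses because & and ~ bind
# more tightly than comparison operators such as == and >.
big_paid = orders.loc[is_paid & (orders['amount'] > 1000), ['customer', 'amount']]
print(big_paid)
```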

 

7. Make new columns

A value calculated from existing columns can be assigned to a new column by writing it to df['new_col']. A previously created Series can also be assigned, because a DataFrame is essentially a collection of Series that share an index.

 

df['new_col'] = df['col2'] / df['col3'] * df['col1']

# Use df.loc to combine a row mask with a column label
above_zero = df.loc[df['col3'] > 0, 'col2'].mean()
new_col = 135 / above_zero
df['new_col'] = new_col
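
The three kinds of assignment (a column expression, a single number, and a ready-made Series) behave slightly differently. A minimal sketch on a made-up stock table, where every name and value is an assumption for the example:

```python
import pandas as pd

stock = pd.DataFrame({'price': [10.0, 20.0, 30.0], 'qty': [2, 0, 4]})

# Column arithmetic is element-wise and aligned on the index.
stock['total'] = stock['price'] * stock['qty']

# Dividing a column by a single number applies it to every row.
mean_price_in_stock = stock.loc[stock['qty'] > 0, 'price'].mean()
stock['ratio'] = stock['price'] / mean_price_in_stock

# A Series is aligned by index label; labels missing from it become NaN.
stock['note'] = pd.Series(['low', 'high'], index=[0, 2])
print(stock)
```

Because assignment aligns on the index, a Series whose labels do not match the DataFrame leaves missing values instead of raising an error, so it is worth checking the index first.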

 

 
