[numpy] Processing Datasets, Boolean, and Datatypes in Numpy

1. Datasets in Numpy

1.1 Load csv file into ndarray

The numpy.genfromtxt() function reads numeric data from a text file into an ndarray.

 

import numpy as np
# Parse the CSV file into a 2-D float ndarray
file = np.genfromtxt('file.csv', delimiter=',')
# First five rows, all columns
first_five = file[:5, :]

 

When printed, the resulting ndarray shows its values in scientific notation, along with some nan entries. np.nan stands for "not a number"; here it marks fields in the csv file, such as text, that could not be converted to a number. To display the values in ordinary fixed-point notation instead of scientific notation, call np.set_printoptions(suppress=True).

 

# Print floats in fixed-point notation instead of scientific notation
np.set_printoptions(suppress=True)
print(first_five)
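
To make the nan behaviour concrete, here is a minimal sketch that reads a small in-memory csv instead of 'file.csv'. The csv text and its column names are hypothetical, chosen only for illustration.

```python
import io
import numpy as np

# Hypothetical csv content standing in for 'file.csv'
csv_text = "name,height,weight\nkim,170.5,65\nlee,182.0,80\n"

# genfromtxt accepts a file-like object as well as a file name
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

# The header row and the text column cannot be parsed as floats,
# so every one of those fields becomes nan
nan_count = np.isnan(data).sum()

np.set_printoptions(suppress=True)
print(data)
print(nan_count)
```

The header row contributes three nan values and the name column contributes one per data row.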

 

1.2 Load csv file with name

By default, np.genfromtxt() does not keep column names because it treats every field as numeric data. Passing names=True makes it read the first row as column names and return a structured array whose columns can be accessed by name.

 

file_with_names = np.genfromtxt('file.csv', delimiter=',', names=True)
# 'col' stands for one of the column names in the header row
print(file_with_names['col'])
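
A small self-contained sketch of names=True, again using hypothetical in-memory csv text rather than a real file:

```python
import io
import numpy as np

# Hypothetical csv content with a header row
csv_text = "height,weight\n170.5,65\n182.0,80\n"

table = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True)

# The header row became the field names of a structured 1-D array
print(table.dtype.names)

# A column is selected by name rather than by position
heights = table['height']
print(heights)
```

Note that the result is one-dimensional: each element is a whole row, so positional slicing such as table[:, 0] no longer applies.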

 

2. Boolean Indexing

NumPy can store the boolean values produced by a comparison operator in an ndarray. In arithmetic, True counts as 1 and False as 0, so boolean arrays can be summed and used in other computations. Comparison operators also follow broadcasting rules, so an array can be compared with a scalar or with another array of a compatible shape.

 

subset1 = file[:, 0]
subset2 = file[:, 1]
more_1 = subset1 > subset2
equal = subset1 == subset2 

num_more_1 = more_1.sum() 
num_equal = equal.sum()

 

  • np.sum() : Counts the True values in a boolean array.
  • np.any() : Checks whether at least one value in a boolean array is True.
  • np.all() : Checks whether every value in a boolean array is True.
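
The three functions can be tried on a small mask. The scores array below is a hypothetical example, not data from the csv file.

```python
import numpy as np

# Hypothetical values used only for illustration
scores = np.array([3, 7, 1, 9])

# Broadcasting compares every element with the scalar 2
mask = scores > 2

n_true = mask.sum()    # number of True values
any_true = mask.any()  # is at least one value True?
all_true = mask.all()  # are all values True?

print(mask, n_true, any_true, all_true)
```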

 

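The &, |, and ~ operators perform element-wise logical AND, OR, and NOT on boolean arrays generated by comparison operators. Each comparison must be wrapped in parentheses because these operators bind more tightly than comparison operators.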

 

subset1 = file[:, 0]
subset2 = file[:, 1]
# "Greater than" and "equal" can never both hold, so they are combined with | (OR)
subset3 = (subset1 > subset2) | (subset1 == subset2)
counts = subset3.sum()
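
As a rough illustration of how the three operators behave, consider two short hypothetical arrays:

```python
import numpy as np

# Hypothetical arrays used only for illustration
a = np.array([1, 5, 3, 8])
b = np.array([2, 5, 1, 9])

greater = a > b            # [False, False, True, False]
equal = a == b             # [False, True, False, False]

either = greater | equal   # element-wise OR
both = greater & equal     # element-wise AND, never True for these two conditions
neither = ~either          # element-wise NOT

print(either, both, neither)
```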

 

A useful feature of NumPy is that a boolean array can be used to filter an ndarray. The boolean array must have the same length as the axis it indexes (one value per row in the example below), or the same shape as the whole ndarray.

 

data_with_conditions = file[subset3]
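
A minimal sketch of row filtering, assuming a small hypothetical 2-D array in place of the csv data:

```python
import numpy as np

# Hypothetical 4x2 table used only for illustration
table = np.array([[1, 10],
                  [5, 2],
                  [3, 3],
                  [8, 1]])

# One boolean per row: is column 0 greater than column 1?
mask = table[:, 0] > table[:, 1]

# Indexing with the mask keeps only the rows where it is True
rows = table[mask]
print(rows)
```

The mask has four elements, matching the four rows of the table, which is why it can index the first axis.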

 

3. Numpy Datatypes

A numpy.ndarray stores elements of a single data type. To check the data type of an array, use the ndarray.dtype attribute. To specify the data type when creating an array, pass it through the dtype parameter of np.array(). To convert an array to another data type, use the ndarray.astype() method, which returns a new array.

 

x = np.array([1, 2, 3, 4, 5], dtype=np.float64) 
x[0] = 5.5
print(x)
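
The sketch below contrasts an integer array with its float64 copy. The values are hypothetical, and the default integer dtype depends on the platform, so the code only checks that it is some integer type.

```python
import numpy as np

ints = np.array([1, 2, 3])       # dtype inferred as a platform integer type
floats = ints.astype(np.float64) # astype returns a converted copy

# Assigning 5.5 into the integer array drops the fractional part
ints[0] = 5.5
# The float64 array keeps the value as-is
floats[0] = 5.5

print(ints.dtype, ints)
print(floats.dtype, floats)
```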

 

The reason an ndarray can store only one data type is that NumPy is implemented in C. Unlike Python, C is a statically typed language, so every element of an array must have the same fixed-size type.

 

NumPy chooses the data type of an ndarray as a type that can represent every element. For example, the array [1, 2, True, False] gets an integer data type because the boolean type cannot represent 2. When a narrower type is forced through the dtype parameter, as in the example below, some of the information stored in the array can be lost.

 

import numpy as np
values = [3.14, 6.42, 5.0, 0.5]
# Forcing an integer dtype discards the fractional part of each value
x = np.array(values, dtype=np.int64)
print(x)
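
Both behaviours, automatic promotion and lossy conversion, can be seen in a short sketch with hypothetical values:

```python
import numpy as np

# Booleans are promoted to integers: True -> 1, False -> 0
mixed = np.array([1, 2, True, False])

# Adding a float promotes the whole array to float64
with_float = np.array([1, 2.5, True])

# Converting floats to integers truncates toward zero
truncated = np.array([3.14, 6.42, 5.0, 0.5]).astype(np.int64)

print(mixed.dtype, mixed)
print(with_float.dtype, with_float)
print(truncated)
```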

 

3.1 Fixed-length bit representation

NumPy's numeric types represent negative and positive numbers with a fixed number of bits. A fixed width limits the range of values that can be stored, but it keeps memory use compact and predictable. To check how many bytes an array's data occupies, use the ndarray.nbytes attribute. The trade-off is that a width too small for the data causes overflow or underflow, which silently corrupts the values.

 

x = np.array([-127, -57, -6, 0, 9, 42, 125], dtype=np.int8)
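# int8 holds -128..127, so -127 - 2 underflows and wraps around to 127
print(x - 2)
# 125 + 3 overflows and wraps around to -128
print(x + 3)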
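
The following sketch, using a shortened hypothetical array, shows the memory saving of int8 alongside the wrap-around it risks, and one possible way to avoid it by widening the type first.

```python
import numpy as np

small = np.array([-127, 0, 125], dtype=np.int8)
big = small.astype(np.int64)

# One byte per element versus eight bytes per element
print(small.nbytes, big.nbytes)

# np.iinfo reports the representable range of an integer type
info = np.iinfo(np.int8)
print(info.min, info.max)

wrapped_down = small - 2   # -129 does not fit and wraps to 127
wrapped_up = small + 3     # 128 does not fit and wraps to -128

# Widening to int16 before the arithmetic keeps the true result
safe = small.astype(np.int16) + 3
print(wrapped_down, wrapped_up, safe)
```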
