[chardet] Encoding and Representing Text

1. What is Encoding?

Encoding is a method of converting information from one form into another. Character encoding, specifically, maps each character in a character set to a numeric code. Because a computer stores everything as 0s and 1s, an encoding is required to represent characters. ASCII defines 128 character codes, which include control characters for controlling devices in addition to printable characters.
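As a small illustration of the ASCII table described above, the sketch below uses Python's built-in ord() to look up a few character codes and counts the control codes. The sample characters are arbitrary choices for this example.

```python
# Each ASCII character maps to an integer code in the range 0..127.
for ch in ["A", "a", "0", " "]:
    # Show the character, its code, and the 7-bit binary form of that code.
    print(repr(ch), ord(ch), format(ord(ch), "07b"))

# Codes 0..31 and 127 are control characters rather than printable symbols.
control_codes = [code for code in range(128) if code < 32 or code == 127]
print(len(control_codes))  # 33

# A character outside the 0..127 range is not ASCII.
print("ö".isascii())  # False
```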

 

An encoded string is a sequence of bytes that can be written to storage, and Python represents it as a bytes object. Bytes decoded with a different encoding than the one used to write them produce broken text, even though the decoding itself does not require much processing. That is why it is important to know how a file or document was encoded and to decode it the same way.

 

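text = "The Swedish word for quest is sökande"

# "ö" has no ASCII code, so errors="replace" substitutes "?" for it.
encoded = text.encode(encoding="ascii", errors="replace")

print(encoded)        # b'The Swedish word for quest is s?kande'
print(type(encoded))  # <class 'bytes'>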
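To see what "broken" text looks like in practice, here is a minimal sketch that encodes a word as UTF-8 and then decodes the same bytes two ways. Latin-1 is just one arbitrary choice of a mismatched encoding for this example.

```python
text = "sökande"
utf8_bytes = text.encode("utf-8")

# Decoding with the encoding that produced the bytes restores the text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding the same bytes as Latin-1 raises no error but garbles the text,
# because the two bytes of "ö" are read as two separate characters.
garbled = utf8_bytes.decode("latin-1")

print(utf8_bytes)
print(round_trip)
print(garbled)
```

No exception is raised in the mismatched case, which is why this kind of error is easy to miss until someone reads the output.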

 

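Characters outside the basic Latin alphabet cannot be represented within the single byte that ASCII uses, so other encodings are required. Encodings that emerged to address this include CP949, EUC-KR, UTF-8, and UTF-16. Of these, UTF-8 and UTF-16 are encodings of Unicode, a standard that assigns each character a unique code point in a one-to-one mapping. CP949 and EUC-KR are legacy encodings for Korean text and are not part of Unicode.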
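The sketch below compares how many bytes the same short string takes under each of the encodings named above. The sample string is an assumption made for this example; any text containing Korean characters would behave similarly.

```python
text = "한글 abc"  # two Hangul syllables, a space, and three ASCII letters

# Encode the same text with each encoding and record the byte count.
sizes = {}
for name in ["utf-8", "utf-16", "cp949", "euc-kr"]:
    data = text.encode(name)
    sizes[name] = len(data)
    print(name, len(data), data)

# ASCII has no codes for Hangul, so strict encoding raises an error.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

Note that Python's "utf-16" codec prepends a 2-byte byte order mark, so its count includes those two extra bytes.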

 

2. Open files after checking encoding

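import csv

import chardet

csv_path = "kyto_restaurants.csv"

# Read the raw bytes first so chardet can guess the encoding.
with open(csv_path, mode="rb") as file:
    raw_bytes = file.read()
    detected_encoding = chardet.detect(raw_bytes)["encoding"]

# Reopen in text mode with the detected encoding and parse the rows.
with open(csv_path, mode="r", encoding=detected_encoding, newline="") as file:
    rows = list(csv.reader(file))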
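Detection is a guess, and chardet.detect can report None as the encoding when it cannot decide. One possible way to guard against that is sketched below. The helper decode_with_fallback and its list of fallback encodings are hypothetical and not part of chardet; the sketch uses only the standard library and takes the detected encoding as a plain argument.

```python
def decode_with_fallback(raw_bytes, detected_encoding, fallbacks=("utf-8", "cp949")):
    """Hypothetical helper: try the detected encoding first, then the fallbacks."""
    candidates = [detected_encoding] if detected_encoding else []
    candidates += [enc for enc in fallbacks if enc not in candidates]
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next candidate.
            continue
    # Last resort: decode as UTF-8 and mark undecodable bytes with U+FFFD.
    return raw_bytes.decode("utf-8", errors="replace"), "utf-8"


# Simulate a failed detection (None) on bytes that were written as CP949.
sample = "한글".encode("cp949")
decoded, used = decode_with_fallback(sample, None)
print(decoded, used)
```

The order of the fallbacks matters: UTF-8 is tried first here because invalid UTF-8 usually fails loudly, whereas many legacy encodings accept almost any byte sequence.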

 

 

 

 
