본문 바로가기

DB/PostgreSQL

[psycopg2] Optimizing Data Types of Tables

1. Check data type of table

Inadequate data type schema cause errors when we write queries. The cursor.description attribute of the pyscopg2 package outputs the data type of the database table. The cursor.description property consists of a column object consisting of name and type_code. The name has the name of the column, and the type_code has an integer value corresponding to the data type.

 

import psycopg2
conn = psycopg2.connect("dbname=db user=ur")
cursor = conn.cursor() 

query = "SELECT * FROM table LIMIT 0;"
cursor.execute(query) 
print(cursor.description)

 

There is special built-in table we can use when we want to know type name corresponding to type_code.  pg_catalog.pg_type table consists of a type_code(oid) and a data type name(typename). typename is the name of the data type, and oit represents an integer value of the data type.

 

Name Type Description
oid oid Row identifier
typename name The name of data type

 

Since pg_catalog.pg_type is also table, it is possible to write a query confirming a specific code integer value.

 

query = "SELECT typename FROM pg_catalog.pg_type WHERE oid=nn;"
cursor.execute(query)
type_of_nn = cursor.fetchone()[0]
print(f"The typename of type code {nn} is {type_of_nn}")

 

2. Optimizing data type of table

If we check the data type of the column inside the pg_catalog.pg_type table and find the cause of the error, it is necessary to change it to a data type that does not cause the error. There is also a notification name of the data type inside the pg_catalog.pg_type table. The reason why it is important to change data type appropriately is that it can increase the efficiency of disk space and prevent the speed and error of queries.

 

2.1 Optimizing numerical values

Postgres Internal Name Name Storage Size Description Range
int2 smallint 2 bytes small-range integer - 32768 ~ 32767
int4 integer 4 bytes typical choice for integer - 2147483648 ~
2147483647
int8 bigint 8 bytes large-range integer - 9.223372e+18 ~
9.223372e+18
numeric decimal/numeric variable user-specified precision up to 16383 digits
after the decimal point
float4 real 4 bytes variable-precision 6 decimal digits
float8 double precision 8 bytes variable-precision 15 decimal digits
int4 serial 4 bytes automatic incrementing integer 1 ~ 2147483647
int8 bigserial 8 bytes large autoincrementing integer 1 ~
9.223372e+18

 

Trough ALTER COLUMN ~ TYPE clause, we can change one column's type to a desired data type. The full workflow of checking and changing data type is same as below :

 

# Check type of all columns
import psycpg2
conn = psycopg.connect("dbname=db user=ur")
cursor = conn.cursor()

query = "SELECT * FROM [table_name] LIMIT 0;"
cursor.execute(query)
print(cursor.description)

# Check name of type occurs error
query = "SELECT typename FROM pg_catalog.pg_type WHERE oid = nn;"
cursor.execute(query)
type_of_nn = cursor.fetchone()[0]
print(type_of_nn)

# Change datatype of column
query = "ALTER TABLE table ALTER COLUMN col1 TYPE int8;"
cursor.execute(query)

# Commit changes to database
conn.commit()

# Shut transaction block
conn.close()

 

2.2 Optimizing text values

Name Description
character varying(n), varchar(n) variable-length with limit
character(n), char(n) fixed-length, blanked padded
text variable unlimited length

 

For text data, any length of string may exist inside the table. This flexibility leads to waste of space needed to store data. For space efficiency, it is possible to check the unique values of each column and size the data based on the lngest string to prevent waste. In the case of string data, the caracter size of the data type may be limited by inputting (N) into parentheses.

 

# Check max size of unique value 
import csv
with open([file_name].csv, 'r') as file:
    next(file)
    reader = csv.reader(file)
    unique_words = set()
    for row in reader:
        check_col = row[n]
        unique_words.add(score_phrase)
    
max_len = 0
for unique_word in unique_words :  
    if len(unique_word) >= max_len : 
        max_len = len(unique_word)
print(max_len)

# Change Type of text value 
import psycopg2
conn = psycopg2.connect("dbname=db user=ur")
cursor = conn.cursor()

query = "ALTER TABLE table ALTER COLUMN col1 TYPE varchar(11);"
cur.execute(query)

# Commit changes 
conn.commit()
conn.close()

 

2.3 Optimizing using enumerated datatype

Using enumerated data type is more efficient than checking all unique values to determine the largest length of string. Enumerated type is a data type that forms a set of named values and is an identifier that serves as a constant for the language. Enumerated data are assigned 4 bytes to each index and stored in the pg_enum table. Once stored, the enumerated data allows the Postgres engine to immediately check and store values.

 

import psycopg2
conn = psycopg2.connect("dbname=db user=ur")
cursor = conn.cursor()

query = """
CREATE TYPE enum_name AS ENUM(
    'val1', 'val2', 'val3', ... 
); 

ALTER TABLE table_name 
ALTER COLUMN column TYPE enum_name
USING CAST(col_name AS enum_name);
"""
cursor.excute(query)

conn.commit()
conn.close()

 

For a boolean type that exists as True/False, use the boolean type name. In the case of the boolean type, it is recognized as the enumerated type inside Postgres.

 

ALTER TABLE [table_name]
ALTER COLUMN [col_name] TYPE boolean 
USING [col_name]::boolean

 

2.4 Optimizing datetime values

Like datetime objects inside Python, date data types exists inside SQL. The workflow code that delets the existing divided columns, creates a new data type column, and insert the values is as follows. 

 

import datetime
import psycopg2
import csv

conn = psycopg2.connect("dbname=db user=ur")
cursor = conn.cursor()

query1 = "ALTER TABLE table ADD COLUMN new_datetime;" 
query2 = """
ALTER TABLE table 
DROP COLUMN col1, 
DROP COLUMN col2;
"""

cursor.execute(query1)
cursor.execute(query2) 

with open('file.csv', 'r') as file: 
    next(file) 
    reader = csv.reader(file)
    for row in reader: 
        year = int(row[1]) 
        month = int(row[2]) 
        day = int(row[3]) 
        date = datetime.date(year, month, day)
        row = row[0] + row[4:]
        row.append(date)
        query = "INSERT INTO table VALUES (%s, %s, %s, %s, %s, %s, %s);"
        cursor.execute(query, row) 
        
conn.commit() 
conn.close()