1. Speed problem after deleting all rows in table
In the case of UPDATE and DELETE, one of the representative desctrutive queries, if we upload it back to the table after executing the cluases, we can see that the contents of the data have not chagned, but the upload speed has been slower than before.
import psycopg2
conn = psycopg2.connect(dbname='db', user='ur', password='pw')
ccursor = conn.cursor()
cursor.execute("SELECT * FROM table;")
num_rows_bf = len(cursor.fetchall())
cursor.execute("DELETE FROM table;")
cursor.execute("SELECT * FROM table;")
num_rows_af = len(cursor.fetchall())
print(num_rows_bf, num_rows_af)
When the DELETE clause is used, rows inside the existing data table exist in a dead state. This is the same as deleting files from the Windows operating system, but the files remain in the Trash folder.
Postgres ensures the consistency of other users' data that exists inside the database server. Therefore, it is necessary to guarantee the ACID until it is committed inside the Postgres server.
Postgres uses multi-version control to ensure the cohesion of different users. This multi-version control is used by a user to ensure the cohesiveness of the expected version of the database state during a transaction.
2. Multi-Version Concurrency Control
Multiversion conscurrency control is a concurrency control method commonly used by database management systems that provides concurrent access to databases and implements transactional memory in programming languages.
Without concurrency control, if someone reads the database and someone else tries to record it, the reader is likely to see half-written or inconsistent data.
When the MVCC database needs to update data, it creates a new version of the data item instead of the new data overwritting the existing data item. That is, several versions are stored. The version that each transcation watches depends on the implemented isolation level. The most common isolation level implemented by MVCC is snapshot isolation, where a transcation ovserves the state of the data as it did at the beginning of the transcation.
3. How to vacuum dead rows in table?
3.1 Check the number of dead row in table
Dead rows guarantees cohesion in the database table inside the transcation, but when a SELECT statement is applied to the table, the execution time increase because the size of the entire table increase. To verify that the table has dead rows, check the pg_stat_all_tables table in pg_catalog schema.
import psycopg2
conn = psycopg2.connect(dbname="db", user="ur", password="pw")
cursor = conn.cursor()
cursor.execute("SELECT n_dead_tup FROM pg_catalog.pg_stat_all_tables WHERE relname = 'table';")
table_dead_rows = cursor.fetchone()[0]
print(table_dead_rows)
3.2 Vacuum dead rows in table
The VACUUM clause removes dead rows that exist inside the table. If it is used without the table name option, all users and tables are processed. The main goal of the VACUUM command is to eliminate dead rows that occupy the table, called 'reclain table storage space occupied by dead tuples'.
import psycopg2
conn = psycopg2.connect(dbname="db", user="ur", password="pw")
conn.autocommit = True
cursor = conn.cursor()
cursor.execute("VACUUM VERBOSE table;")
for notice in conn.notices:
print(notice)
The VACUUM clause should not be used inside the transaction block, so connection.autocommit should be set to True. The VERBOSE option also returns the number of removed dead rows and stores them inside the connection.notices attribute.
Name | Description |
nonremovable row versions | the number of rows remaining (or actual data not DELETE) to ensure ACID if another transaction is initiated before dead row is removed. |
xmin | ID of the transaction to insert the row |
xmax | ID of the transaction to remove the row |
Postgres's xmin and xmax are used to determine if a transaction is actually running and can be deleted permanenetly.
4. Advantages of using VACUUM
4.1 ANALYZE option
An advantage of the VACUUM clause is its ability to update table statistics. The main weakness of the EXPLAIN cluase was the inaccuracy of table statistical data. However, the ANALYZE option in the VACUUM clause enhances accuracy by updating the new cost and the number of rows in the pg_stats.
import json
import psycopg2
conn = psycopg2.connect(dbname="db", user="ur", password="pw")
conn.autocommit = True
cursor = conn.cursor()
cursor.execute("EXPLAIN SELECT * FROM table;")
plan_before = cursor.fetchall()
cursor.execute("VACUUM ANALYZE table;")
cursor.execute("EXPLAIN SELECT * FROM table;")
plan_after = cursor.fetchall()
print(plan_before)
print(plan_after)
4.2 FULL option
When a table is vacuummed, the frozen disk space (the space occupied by dead rows) remains intact and is used to allocate new data. The FULL option will free up space for later use for the entire database server. This operation overrids the space for the entire database server, but during vacuumming, it declares an exclusive lock, making it impossible to perform any query in the middle.
The pg_total_relation_size function refers to the disk capacity by the table, and the pg_size_pretty function converts the results from the pg_total_relation_size function into readable values.
import json
import psycopg2
conn = psycopg2.connect(dbname="db", user="ur", password="pw")
conn.autocommit = True
cursor = conn.cursor()
cursor.execute("SELECT * FROM table;")
cursor.execute("SELECT pg_size_pretty(pg_total_relation_size('table'));")
space_before = cursor.fetchone()
cursor.execute("VACUUM FULL table;")
cursor.execute("SELECT pg_size_pretty(pg_total_relation_size('table'));")
space_after = cursor.fetchone()
print(space_before, space_after)
5. Autovacuum
Postgres provides the ability to automatically apply VACUUM clause to tables. This feature is perodically applied to the table with autovacuum to remove dead rows and ensure that the statistics are up to date. Fortunately, the latest version of Postgres has been updated to automatically apply autovacuum.
The criteria for applying VACUUM are as follows :
- Are we conducting general analysis without adding a large number of new rows or deleting rows in the main table?
- Do we want to delete a large amount of data inside the table recently and perform complex command?
- Is there not enough free space on the database server?
These questions are provided as guidelines for whether to apply the vacuum command. In other words, the task of checking the node type of the query with the EXPLAIN cluase and checking the number of deadrows in the n_dead_up columns as a way to improve performance in performing the query is a criterion for applying the VACUUM clause.
Tables to determine when VACUUM has become available in the last_vacuum and last_autovacuum columns in pg_stat_user_table.
import psycopg2
conn = psycopg2.connect(dbname="db", user="ur", password="pw")
conn.autocommit = True
cursor = conn.cursor()
cursor.execute("VACUUM tabke;")
import time
time.sleep(1)
cursor.execute("SELECT last_vacuum, last_autovacuum FROM pg_stat_user_tables WHERE relname = 'table';")
timestamps = cursor.fetchone()
print(timestamps)
Source from :
- https://ko.wikipedia.org/wiki/%EB%8B%A4%EC%A4%91_%EB%B2%84%EC%A0%84_%EB%8F%99%EC%8B%9C%EC%84%B1_%EC%A0%9C%EC%96%B4
- https://postgresql.kr/docs/9.3/routine-vacuuming.html
- https://postgresql.kr/docs/9.3/routine-vacuuming.html
- https://www.dataquest.io/
'DB > PostgreSQL' 카테고리의 다른 글
[Theorem] Differences between conn.commit and conn.autocommit (0) | 2022.10.01 |
---|---|
[Theorem] Difference between VACUUM and VACUUM FULL (0) | 2022.10.01 |
[psycopg2] More efficient Index Scan (0) | 2022.10.01 |
[psycopg2] Efficient query with Index Scan (0) | 2022.10.01 |
[psycopg2] Debugging Postgres Queries (0) | 2022.10.01 |