1. Exection of queries
The steps to perform an SQL query statement are as follows:
- Query parsing : The query statement is divided into correct grammar units and transformed into query trees if there are no errors.
- Query rewrite : If there is a special rule identified in the system catalog, reapply it to the query tree.
- Query planning : Planner and Optimizer optimize query tree and send it to executor to find optimal time query.
- Query execution : To perform a query
2. How to know about execution of query? EXPLAIN clause
2.1 What is EXPLAIN clause?
The EXPLAIN clause returns an internal step in which the Postgres engine performs the query instead of receiving and executing the SQL query. That is, we show the steps of performing the query most quickly and efficiently in the query planning step. Even if the same results are produced, the possible paths are indicated in the order in which the complexity of the query increase, and this is indicated in "Seq Scan...".
2.2 What is Seq Scan?
The Seq Scan method is a method of reading a record while performing a full scan of a table, and is selected when there is no index or when the index exists but has a wide ragne to be read. In the way that all rows of the table are read one by one, the statup cost and the total cost are specified.
2.3 How to use EXPLAIN clause?
query = """
EXPLAIN
[statement to explain]
"""
# result of EXPLAIN command
[('Seq Scan on homeless_by_coc (cost=0.00..2145.29 rows=86529 width=480)',)]
To output the results of an EXPLAIN clause of a given format instead of a basic text result, the output format can be determined by entering (FORMAT [format]) clause after the EXPLAIN cluase.
import psycopg2
conn = psycopg2.conned(dbname="db", user="ur", password="pass")
cursor = conn.cursor()
import json
cursor.execute("EXPLAIN (FORMAT json) SELECT COUNT(*) FROM table WHERE col1 > condition;")
query_plan = cursor.fetchone()
print(json.dumps(query_plan, indent=2))
# Result of query_plan in json format
[
{
"Plan": {
"Node Type": "Aggregate",
"Strategy": "Plain",
"Partial Mode": "Simple",
"Parallel Aware": false,
"Startup Cost": 2433.72,
"Total Cost": 2433.73,
"Plan Rows": 1,
"Plan Width": 8,
"Plans": [
{
"Node Type": "Seq Scan",
"Parent Relationship": "Outer",
"Parallel Aware": false,
"Relation Name": table,
"Alias": "homeless_by_coc",
"Startup Cost": 0.0,
"Total Cost": 2361.61,
"Plan Rows": 28843,
"Plan Width": 0,
"Filter": "(col1 > condition::type)"
}
]
}
}
]
3. Translating Nested Node
3.1 What is Nesting Node?
The result of query_plan written in JSON format was written in dictionary with overlapping built-in structures. The query_plan consists of nodes that mean each setp, and the embedded nodes form a parent-child relationship with dependence on other nodes. For the previous query_plan, the built-in "Seq Scan" run first, followed by the "Aggregate" node.
In other words, nodes embedded deeper can be executed first and express the execution order of complex query statements, and the structure is called query plan tree.
3.2 About Cost
"Startup Cost" and "Total Cost" stored in query_plan are keys divided by cost=2433.72 ... 2443.73 in text notation. Each value is an arbitrary unit of time, not an actual unit of time. Startup Cost includes the time it takes to sort and aggregate rows before they are divided. Total Cost means the time it takes to include the Startup Cost and complete the node plan.
3.3 Run cost of the Sequential Scan
The time cost of sequential scan is as follows.
$$(cpu\_tuple\_cost + cpu\_operator\_cost) \times N_{tuple} + seq\_page\_cost \times N_{page}$$
- cpu_tuple_cost : Estimated cost of processing a row while performing a query statement (default=0.01).
- cpu_operator_cost : Cost to process operators and functions while performing query statements (default=0.0025).
- seg_page_cost : Cost of fetching disk pages (default=1.0).
- \(N_{tuples}\) : Number of tuples inside the table.
- \(N_{page}\) : Number of disk pages used in table.
The number of tuples and pages can be found in the retuples and relpages of pg_class table inside Postgres.
import psycopg2
conn = psycopg2.connect(dbname="db", user="ur", password="pw")
cursor = conn.cursor()
pu_tuple_cost = 0.01
cpu_operator_cost = 0.0025
seq_page_cost = 1.0
cursor.execute("SELECT retuples, relpages FROM pg_class WHERE relname = 'table';")
n_tuples, n_pages = cursor.fetchone()
total_cost = (cpu_tuple_cost + cpu_operator_cost) * n_tuples + seq_page_cost * n_page
print(total_cost)
3.4 Run cost in real time
The query_plan of EXPLAIN clause does not store the actual query execution time, but stores the estimated cost. Considerations for actually performing query statements are working more complexity than expected. For example, "Plan Rows" differs from "Actual Rows". This is the difference that occur in considering an additional table to query for receiving a query statement.
If we want to consider the actual plan node, we can add the ANALYZE option before the FORMAT option to see information about the actual one.
import psycopg2
conn = psycopg2.connect(dbname="db", user="ur", password="pw")
cursor = conn.cursor()
cursor.execute("EXPLAIN (ANALYZE, FORMAT json) SELECT COUNT(*) FROM table WHERE col1 > condition;")
query_plan = cursor.fetchone()
print(json.dumps(query_plan, indent=2))
# Reusult of query_plan in json
[
[
{
"Plan": {
"Node Type": "Aggregate",
"Strategy": "Plain",
"Partial Mode": "Simple",
"Parallel Aware": false,
"Startup Cost": 2433.72,
"Total Cost": 2433.73,
"Plan Rows": 1,
"Plan Width": 8,
"Actual Startup Time": 17.527,
"Actual Total Time": 17.528,
"Actual Rows": 1,
"Actual Loops": 1,
"Plans": [
{
"Node Type": "Seq Scan",
"Parent Relationship": "Outer",
"Parallel Aware": false,
"Relation Name": "table",
"Alias": "table",
"Startup Cost": 0.0,
"Total Cost": 2361.61,
"Plan Rows": 28843,
"Plan Width": 0,
"Actual Startup Time": 3.713,
"Actual Total Time": 13.116,
"Actual Rows": 50589,
"Actual Loops": 1,
"Filter": "(col1 > condition::type)",
"Rows Removed by Filter": 35940
}
]
},
"Planning Time": 0.289,
"Triggers": [],
"Execution Time": 17.642
}
]
]
The ANALZE option not only measures the Seq Scan, but also measures the time it takes to perfrom a general query. Use EXPLAIN ANALZE with FORMAT option removed. In the case of DDL such as DELETE, UPDATE, and ALTER, the table is changed, so in order to measure the execution time of the query statement, conn.rollback() must be used at the end to cancel the query statement.
import psycopg2
conn = psycopg2.connect(dbname="db", user="ur", password="pw")
cursor = conn.cursor()
cursor.execute("EXPLAIN ANALZE ALTER TABLE table ADD COLUMN col1 text")
query_plan = cursor.fetchone()
# Reverting the change, after debugging query plans
conn.rollback()
4. Debuggin JOINs queries
JOINs are computationally expensive to perform and the biggest culprit in delaying execution time. Before a join can occur, a Seq Scan is perfomed on each joined table. These operations can quickly become inefficient as the size of the table increase.
'DB > PostgreSQL' 카테고리의 다른 글
[psycopg2] More efficient Index Scan (0) | 2022.10.01 |
---|---|
[psycopg2] Efficient query with Index Scan (0) | 2022.10.01 |
[psycopg2] Exploring Postgres Database Internals (0) | 2022.09.29 |
[psycopg2] Database Management (0) | 2022.09.29 |
[psycopg2] User Management (0) | 2022.09.29 |