
[psycopg2] More efficient Index Scan

1. Bitmap Index/Heap Scan

Index Scan is usually much faster than Seq Scan. In SQL, there are many cases where a WHERE clause uses a compound condition rather than a single condition.

 

In that case, if one column has an index and the other column does not, the query plan shows a nested structure of a Bitmap Index Scan inside a Bitmap Heap Scan, rather than the Seq Scan or Index Scan we might expect.

 

  • Bitmap Index Scan : Uses the index to find all heap pages that contain rows satisfying the indexed condition.
  • Bitmap Heap Scan : Reads each of those heap pages in full and rechecks the conditions against the actual rows.

 

import json
import psycopg2
conn = psycopg2.connect("dbname=db user=ur password=pwd")
cursor = conn.cursor()

cursor.execute("CREATE INDEX col1_index ON table(col1);")
conn.commit()

cursor.execute("""
    EXPLAIN (FORMAT json) 
    SELECT * 
      FROM table 
     WHERE col1 = 'nn' AND col2 < 2500;
""")

query_plan = cursor.fetchone()
print(json.dumps(query_plan[0], indent=2))

# Result of query_plan in json
[
  {
    "Plan": {
      "Node Type": "Bitmap Heap Scan",
      "Parallel Aware": false,
      "Relation Name": "table",
      "Alias": "tablec",
      "Startup Cost": 11.7,
      "Total Cost": 903.46,
      "Plan Rows": 144,
      "Plan Width": 12,
      "Recheck Cond": "(col1 = 'nn'::bpchar)",
      "Filter": "(col2 < 2500::date)",
      "Plans": [
        {
          "Node Type": "Bitmap Index Scan",
          "Parent Relationship": "Outer",
          "Parallel Aware": false,
          "Index Name": "col1_index",
          "Startup Cost": 0.0,
          "Total Cost": 11.67,
          "Plan Rows": 433,
          "Plan Width": 0,
          "Index Cond": "(col1 = 'nn'::bpchar)"
        }
      ]
    }
  }
]

 

2. How Postgres Stores Its Data

Postgres stores a table's rows in a set of heap pages. When we create an index, Postgres can efficiently find the pages that contain the rows a query needs. In other words, for an Index Scan using a b-tree structure, Postgres asks the b-tree which heap pages hold the matching rows, loads those pages, and then reads the rows from them.

 

Heap pages are stored on disk, and loading a page into memory has a cost. Therefore, when an Index Scan is used for a query that returns a very large number of rows, a page load can be triggered for nearly every row.

 

 

When an Index Scan is used, Postgres loads page2, page7, and page5 separately, once per matching row. An alternative is to first use the index to find the heap pages containing results, and then fetch the results by reading each of those pages in full. The advantage of this method is that each page is loaded only once; the disadvantage is that each selected page must be fully scanned. In other words, it is a compromise.

 

Bitmap Index Scan finds the heap pages that contain matching rows but does not know which rows they are, and Bitmap Heap Scan then finds the rows that satisfy the condition by scanning those pages using the condition shown in the Recheck Cond field.
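The trade-off can be sketched in plain Python. This is a toy model, not real Postgres internals: it only counts page loads for a handful of matching rows spread over the pages mentioned above.

```python
# Toy model: each matching row lives on some heap page.
# Page numbers reused from the example above; duplicates mean
# several matching rows share a page.
matching_row_pages = [2, 7, 5, 7, 2, 2]

# Plain Index Scan: one page load per matching row.
index_scan_loads = len(matching_row_pages)

# Bitmap scan: collect the distinct pages first, then load each page
# once (at the cost of scanning every row on those pages).
bitmap_scan_loads = len(set(matching_row_pages))

print(index_scan_loads)   # 6 page loads
print(bitmap_scan_loads)  # 3 page loads
```

The more matching rows cluster on the same pages, the bigger the win for the bitmap approach.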

 

3. Multi-Column Index Scan

Setting two or more columns as a single index is called a Multi-Column Index. Creating one is the same as creating a single-column index, except that the additional columns are listed inside the parentheses, separated by commas.

 

import psycopg2
import json 
conn = psycopg2.connect(dbname="db", user="ur", password="pw") 
cursor = conn.cursor() 

cursor.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
     SELECT * 
       FROM table 
      WHERE col1 = 'nn' AND col2 < 2500; 
""")

plan_single_index = cursor.fetchone() 
print(plan_single_index[0][0]["Execution Time"])

cursor.execute("CREATE INDEX col1_col2_index ON table(col1, col2);")
conn.commit() 

cursor.execute("""
    EXPLAIN (ANALYZE, FORMAT json)
     SELECT * 
       FROM table 
      WHERE col1 = 'nn' AND col2 < 2500; 
""")

plan_multi_index = cursor.fetchone()
print(plan_multi_index[0][0]["Execution Time"])

 

3.1 Cautions of Multi-Column Index

 

A Multi-Column Index searches rows in a table sorted by multiple index columns, so it can show much faster execution speed. The order of the columns is also important: the index is sorted by the first column first, so queries that filter on the leading column can use the index efficiently, while queries that filter only on a later column cannot.
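The effect of column order can be illustrated with a small sketch: a b-tree multi-column index behaves like a list of key tuples sorted by (col1, col2). The data below is made up for illustration.

```python
import bisect

# Entries of a hypothetical index on (col1, col2), kept sorted.
index_entries = sorted([
    ("aa", 3000), ("nn", 1000), ("nn", 2000), ("nn", 3000), ("zz", 500),
])

# Filtering on the leading column: binary search narrows the range directly.
lo = bisect.bisect_left(index_entries, ("nn",))
hi = bisect.bisect_right(index_entries, ("nn", float("inf")))
nn_entries = index_entries[lo:hi]

# Filtering only on the second column gives no usable prefix:
# every entry must be examined, which is effectively a sequential scan.
lt_2500 = [e for e in index_entries if e[1] < 2500]
```

With the leading column fixed, only the slice `index_entries[lo:hi]` is touched; without it, the whole list is walked.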

 

A multi-column index can even cover every column of a table. However, indexing all columns may not help much in terms of time cost: each column's ordering depends on the columns stored before it, so a query that does not constrain the leading columns leads to an inefficient search.

 

When the leading column is not constrained, the planner eventually falls back to a Seq Scan.

 

import psycopg2
import json
conn = psycopg2.connect("dbname=db user=ur password=pw")
cursor = conn.cursor()

cursor.execute("""CREATE INDEX col1_col2_col3_index 
                      ON table(col1, col2, col3);""")
conn.commit() 

cursor.execute("""
    EXPLAIN (ANALYZE, FORMAT json) 
    SELECT * 
      FROM table
     WHERE col2 > 2500 AND col3 > 5000;
""")

query_plan = cursor.fetchone()
print(json.dumps(query_plan, indent=2))

# Result of query_plan in json 
[
  [
    {
      "Plan": {
        "Node Type": "Seq Scan",
        "Parallel Aware": false,
        "Relation Name": "table",
        "Alias": "table",
        "Startup Cost": 0.0,
        "Total Cost": 2579.95,
        "Plan Rows": 9614,
        "Plan Width": 480,
        "Actual Startup Time": 3.25,
        "Actual Total Time": 11.208,
        "Actual Rows": 251,
        "Actual Loops": 1,
        "Filter": "((col2 > 2500::int8) AND (col3 > 5000))",
        "Rows Removed by Filter": 86279
      },
      "Planning Time": 0.342,
      "Triggers": [],
      "Execution Time": 11.259
    }
  ]
]

 

4. Index after Applying Built-in Functions

The most common way to search string data is case-insensitive, so a function such as LOWER() or UPPER() is applied in the WHERE clause. However, when a built-in function is applied to a column in the condition, Postgres performs a Seq Scan and ignores the index stored on the bare column. Therefore, the index needs to be created on the same function expression that the query uses.

 

import psycopg2
conn = psycopg2.connect(dbname="db", user="ur", password="pw") 
cursor = conn.cursor() 

cursor.execute("CREATE INDEX col1_index ON table(LOWER(col1));")
conn.commit() 

cursor.execute("""
    EXPLAIN (ANALYZE, FORMAT json) 
    SELECT * 
      FROM table
     WHERE LOWER(col1) = 'nn';
""")
query_plan = cursor.fetchone()

 

5. Partial Index

We can also index only a subset of rows that satisfy a condition. This is called a partial index, and a common use is the option WHERE col IS NOT NULL, which indexes only non-missing data.

 

In addition, the indexed rows can be restricted not only by the indexed column itself; the range can also be defined based on other columns.
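As a minimal sketch (the table and column names are the placeholders used throughout this post, not a real schema), the two variants could be written as:

```python
# Hypothetical partial-index statements; table/column names are placeholders.

# Index only rows where col1 is present (non-NULL).
not_null_index_sql = (
    "CREATE INDEX col1_not_null_index ON table(col1) "
    "WHERE col1 IS NOT NULL;"
)

# The WHERE clause can also restrict the index by a range on another column.
range_index_sql = (
    "CREATE INDEX col1_range_index ON table(col1) "
    "WHERE col2 < 2500;"
)

# With a live connection these would be run as, e.g.:
# cursor.execute(not_null_index_sql); conn.commit()
```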

 

The main reason for using a partial index is to avoid indexing frequently occurring values. If a value appears in a very large share of rows, Postgres will not use an Index Scan for it anyway, so there is no need to index those rows. Excluding them also reduces the space required to store the index.

 

Another useful case is accelerating queries where a given column is compared against a constant value. If the index is created with that condition attached, queries that include the condition get the effect of an Index Scan over only the matching part of the table.

 

import psycopg2
conn = psycopg2.connect("dbname=db user=ur password=pw")
cursor = conn.cursor()

cursor.execute("CREATE INDEX col1_partial_index ON table(col1) WHERE col2 = 0;")
conn.commit()

cursor.execute("""
    SELECT *
      FROM table 
     WHERE col1 = 'nn' AND col2 = 0;
""")

nn_zero_col2 = cursor.fetchall()

 

6. Workflow of Applying a Proper Index

import psycopg2
import json
conn = psycopg2.connect(dbname="db", user="ur", password="pw") 
cursor = conn.cursor() 

cursor.execute("CREATE INDEX col1_col2_index ON table(col1, LOWER(col2)) WHERE col3 = 0;") 
conn.commit() 

cursor.execute("""
    EXPLAIN (ANALYZE, FORMAT json) 
    SELECT tb1.col1, tb2.col2
      FROM table1 AS tb1
     INNER JOIN table2 AS tb2
        ON tb1.col1 = tb2.col2
     WHERE tb1.col1 = 'nn' AND tb1.col3 = 0 AND LOWER(tb1.col2) = 'mm';
""") 
print(json.dumps(cursor.fetchone(), indent=2))

# Result of query_plan in json 
[
  [
    {
      "Plan": {
        "Node Type": "Nested Loop",
        "Parallel Aware": false,
        "Join Type": "Inner",
        "Startup Cost": 0.41,
        "Total Cost": 24.84,
        "Plan Rows": 3,
        "Plan Width": 54,
        "Actual Startup Time": 0.041,
        "Actual Total Time": 0.042,
        "Actual Rows": 0,
        "Actual Loops": 1,
        "Inner Unique": false,
        "Plans": [
          {
            "Node Type": "Index Scan",
            "Parent Relationship": "Outer",
            "Parallel Aware": false,
            "Scan Direction": "Forward",
            "Index Name": "col1_col2_index",
            "Relation Name": "table1",
            "Alias": "tb1",
            "Startup Cost": 0.41,
            "Total Cost": 8.43,
            "Plan Rows": 1,
            "Plan Width": 20,
            "Actual Startup Time": 0.04,
            "Actual Total Time": 0.04,
            "Actual Rows": 0,
            "Actual Loops": 1,
            "Index Cond": "((col1 = 'nn'::bpchar) AND (lower((col2)::text) = 'mm'::text))",
            "Rows Removed by Index Recheck": 0
          },
          {
            "Node Type": "Seq Scan",
            "Parent Relationship": "Inner",
            "Parallel Aware": false,
            "Relation Name": "table2",
            "Alias": "tb2",
            "Startup Cost": 0.0,
            "Total Cost": 16.38,
            "Plan Rows": 3,
            "Plan Width": 58,
            "Actual Startup Time": 0.0,
            "Actual Total Time": 0.0,
            "Actual Rows": 0,
            "Actual Loops": 0,
            "Filter": "(col2 = 'mm'::bpchar)",
            "Rows Removed by Filter": 0
          }
        ]
      },
      "Planning Time": 0.594,
      "Triggers": [],
      "Execution Time": 0.073
    }
  ]
]

 

 

 

Source from : https://www.dataquest.io/