본문 바로가기

DB/PostgreSQL

[psycopg2] Efficient query with Index Scan

1. What is Index Scan?

import psycopg2
import json 
conn = psycopg2.connect(dbname="db", user="ur", password="pw") 
cursor = conn.cursor() 

cursor.execute("EXPLAIN (ANALYZE, FORMAT josn) SELECT * FROM table WHRE id = nn;") 

query_plan = cursor.fetchone() 
print(json.dumps(query_plan, indent=2)) 

# Result of query_plan in json 
[
  [
    {
      "Plan": {
        "Node Type": "Index Scan",
        "Parallel Aware": false,
        "Scan Direction": "Forward",
        "Index Name": "table_pkey",
        "Relation Name": "table",
        "Alias": "table",
        "Startup Cost": 0.29,
        "Total Cost": 8.31,
        "Plan Rows": 1,
        "Plan Width": 480,
        "Actual Startup Time": 0.039,
        "Actual Total Time": 0.04,
        "Actual Rows": 1,
        "Actual Loops": 1,
        "Index Cond": "(id = nn)",
        "Rows Removed by Index Recheck": 0
      },
      "Planning Time": 3.051,
      "Triggers": [],
      "Execution Time": 0.081
    }
  ]
]

 

When a query is written using the primary key, it can be seen that a new node type of "Index Scan" is created, not "Seq Scan", which is generated when the EXPLAIN clause was previously used. If we set the default key when table is created, Postgres generates a B-tree index to generate a column that allows efficient queries. With Index Scan, we can do things as follows:

 

  • We can find rows of given valeus according to indexed column.
  • We can find all indexed rows in a given range.

 

In other words, the Seq Scan is a method of reading all pages of a book to find information, and the Index Scan method is a method of finding the desired information through the table of contents. 

 

2. Time Comparison between Seq Scan and Index Scan

import psycopg2
conn = psycopg2.connect(dbname="db", user="ur", password="pw") 
cursor = conn.cursor() 

cursor.excute("""
    EXPLAIN (ANALYZE, FORMAT json) 
    SELECT * 
      FROM table 
     WHERE col = 'specific'
     LIMIT 1; 
""")
col_name_plan = cursor.fetchone() 
print(col_name_plan[0][0]["Execution Time"])

cursor.execute("""
    EXPLAIN (ANALYZE, FORMAT json) 
    SELECT * 
      FROM table 
     WHERE id = nnn;
""")
id_plan = cursor.fetchone() 
print(id_plan[0][0]["Execution Time"])

 

As a result of the measurement, the execution time of col_name_plan is 0.746 milliseconds and the execution time of id_plan is 0.097 milliseconds, which show that the index scan execution time is overwhelmingly fast. In general, the Index Scan method has a time cost of \(log_{2}(N)\), so it is obvious that is faster than the Seq Scan with a time cost of \(N\).

 

3. How to use Index?

3.1 Create Index

However, it is very rare to use a default key in the process of finding data. Therefore, even if we do not use the default key, we need to use Index Scan. Postgres creates an object called INDEX to help us use Index Scan. When specifing an index, the name must not have the same name in the entire database.

 

As Python algorithm's time complexity and space complexity show an inversly proportional relationship, generating an index to use Index Scan is the same as reducing the efficiency of space to increase the efficiency of time.

 

import psycopg2
conn = psycopg2.connect(dbname="db", user="ur", password="pw") 
cusor = conn.cursor()

cursor.execute("CREATE INDEX col_name_index ON table(col);") 
conn.commit() 
conn.close()

 

3.2 Check Index

Name Type References Description
schemaname name pg_namespace.nspname Name of schema containing table and index
tablename name pg_class.relname Name of table the index is for
indexname name pg_calss.relname Name of the index
tablespace name pg_tablespace.spcname Name of tablespace contianing index(null if default for database)
indexdef text   Index definition (a reconstructed CREATE INDEX command)

 

  • schemaname, tablename : The name of the schema and table from which the index was created.
  • indexdef : How the index was created(CREATE UNIQUE INDEX for the primary key and CREATE INDEX for the generated index).

 

Altough the Index Scan method is faster than the Seq Scan method, it reauires data space because the index must be stored inside the database. If too much space is required, the performance speed of Index Scan is almost no different from that of Seq Scan. The information of the generated index sis stored in the pg_indexes table of the system catalog.

 

By default, when an index is generated, Postgres generates a btree index. Postgres provides hash, gist, spgist, gin, and brin in addition to the btree index, and uses btree as default. For the hash index, it is used in the WHERE clause to present the conditions.

 

3.3 Drop Index

If the space occupied by the generated index is very large, the method of using Index Scan is not different from the method of using Seq Scan. Therefore, the DROP INDEX cluase is used to use the index. However, if there is no matching index name, an error is output, and to omit it, IF EXISTS cna be used to prevent the error from being output.

 

import psycopg2
import json
conn = psycopg2.connect(dbname="db", user="ur", password="pw")
cursor = conn.cursor() 

cursor.execute("DROP INDEX IF EXISTS col_name_index;")
conn.commit()
conn.close()

 

4. Applying more efficient Index Scan

import psycopg2
conn = psycopg2.connect(dbname="db", user="ur", password="pw") 
cursor = conn.cursor() 

cursor.execute("""
    EXPLAIN (ANALYZE, FORMAT json) 
    SELECT tb1.col1, tb1.col2, tb2.col1
      FROM table1 AS tb1, table2 AS tb2
     WHERE tb1.col1 = tb2.col1;
""")
no_index_plan = cursor.fetchone()
print(no_index_plan[0][0]['Execution Time']) 

cursor.execute("CREATE INDEX col1_index ON table1(col1);") 
cursor.execute("""
    EXPLAIN (ANALYZE, FORMAT json) 
    SELECT tb1.col1, tb1.col2, tb2.col1
      FROM table1 AS tb1, table2 AS tb2
     WHERE tb1.col1 = tb2.col1;
""")
index_plan = cursor.fetchone() 
print(index_plan[0][0]['Execution Time']) 

conn.commit()

# Result : no_index_plan(58.063) vs index_scan(58.443)

 

Let's say we use index scanning to construct more efficient query. If the generated index is used to find the value of another table, there is no difference from finding the value using Seq Scan. In other words, even if the index is applied, the value to be checked inside the INNER JOIN clause is not significantly different from the number of rows in the entire table, which is the same as applying Seq Scan(this was confirmed in the previous Debugging Postgres Queries).

 

However, an efficient query can be written in the generated index is assigned a prerequisite to enable Indes Scan inside a special condition such as a WHERE cluase.

 

cursor.execute("CREATE INDEX col1_index ON table1(col1);") 
cursor.execute("""
    EXPLAIN (ANALYZE, FORMAT json) 
    SELECT tb1.col1, tb1.col2, tb2.col1
      FROM table1 AS tb1 
     INNER JOIN table2 AS tb2
        ON tb1.col1 = tb2.col1
     WHERE tb1.col1 > 2500;
""")
index_plan = cursor.fetchone() 
print(index_plan[0][0]['Execution Time']) 

conn.commit()

# This way will be much faster! 
# Because Index Scan of tb1.col1 > 2500 will be executed first.