1. SQLite with Pandas
The Pandas library keeps its data in memory, whereas a DBMS such as SQLite processes data on disk. In other words, Pandas can process as much data as fits in available memory, while SQLite can process as much data as fits in available disk space. Taking this further, when processing data on a cloud server, disk space is far cheaper than memory.
In most cases, data is extracted from a table in a SQLite database, and the data is then analyzed, explored, and visualized with Pandas. One way to move CSV file data into the SQLite database is to enter the values manually with an INSERT statement per row. Alternatively, each chunk of the file can be transferred to a table as a data frame using the DataFrame.to_sql() method, as shown in the next section.
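For reference, the manual INSERT route looks roughly like this (a minimal sketch, assuming a two-column table already exists; file.csv and file.db are placeholders, and the placeholder name "table" is a reserved word in SQLite, so it is quoted in queries):
import csv
import sqlite3
conn = sqlite3.connect('file.db')
with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    # One parameterized INSERT per row; executemany runs it for every row
    conn.executemany('INSERT INTO "table" VALUES (?, ?)', reader)
conn.commit()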
2. Load a File into a Database Table
By default, the DataFrame.to_sql() method refuses to load data into a table that already exists in the database. Therefore, the if_exists='append' keyword must be used to load each chunked data frame into the existing table. Additionally, the index option should be set to False so that the data frame's index is not added as a column in the SQLite table.
import sqlite3
import pandas as pd
conn = sqlite3.connect('file.db')
chunk_iter = pd.read_csv('file.csv', chunksize=1000)
for chunk in chunk_iter:
    # Append each chunk to the existing table without writing the index
    chunk.to_sql('table', conn, if_exists='append', index=False)
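As a quick sanity check, the number of rows loaded so far can be counted (assuming the file.db and table names above):
print(pd.read_sql('SELECT COUNT(*) AS n FROM "table";', conn))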
When the DataFrame.to_sql() method is used, the data types stored in the data frame are automatically converted to matching SQLite types and stored.
pandas.read_sql() is the method for running a query against SQLite and returning the result as a data frame. For example, the table schema can be inspected with a PRAGMA statement (the table name is quoted because table is a reserved word):
result_df = pd.read_sql('PRAGMA table_info("table");', conn)  # column names and SQLite types
print(result_df)
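As a rough illustration of the conversion (demo_table is a hypothetical name used only for this example), int64 columns become INTEGER, float64 becomes REAL, and object (string) columns become TEXT:
demo = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})
demo.to_sql('demo_table', conn, if_exists='replace', index=False)
# Expected SQLite types: a -> INTEGER, b -> REAL, c -> TEXT
print(pd.read_sql('PRAGMA table_info("demo_table");', conn))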
3. Optimize data types on a database table
Using Pandas, the data types can be optimized inside the data frame first, and each optimized chunk can then be uploaded to the SQL table.
chunk_iter = pd.read_csv('file.csv', chunksize=1000)
for chunk in chunk_iter:
    # Downcast the column before writing so SQLite stores the smaller type
    chunk['col1'] = chunk['col1'].astype('int32')
    chunk.to_sql('table', conn, if_exists='append', index=False)
df = pd.read_sql('PRAGMA table_info("table");', conn)
print(df)
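To see what the cast saves, the column's memory footprint can be compared before and after (a minimal sketch on a single chunk, assuming col1 is read as int64 by default):
chunk = pd.read_csv('file.csv', nrows=1000)
print(chunk['col1'].memory_usage(index=False))                  # bytes as int64
print(chunk['col1'].astype('int32').memory_usage(index=False))  # half the bytes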
4. Workflows for generating a Pandas DataFrame
The two workflows that create a Pandas DataFrame object are as follows:
- Perform the operation inside SQL, then parse the result into a data frame.
- Extract the raw data from SQL, then perform the calculation with Pandas.
What Pandas has over SQLite is a wide variety of functions, methods, and operators that handle data frames efficiently, along with fast in-memory processing.
# Case 1: Performing the operation inside SQL
import pandas as pd
import sqlite3
conn = sqlite3.connect('file.db')
counts = pd.read_sql("""
    SELECT col1, COUNT(*) AS counts
    FROM "table"
    GROUP BY col1
    ORDER BY counts DESC;""", conn)
counts.head(10)
# Case 2: Performing the operation inside Pandas
import pandas as pd
import sqlite3
conn = sqlite3.connect('file.db')
df = pd.read_sql('SELECT col1 FROM "table";', conn)
# value_counts() produces the same frequency table, sorted in descending order
counts = df['col1'].value_counts()
5. Read a table in chunks from the database
If memory is tight or the table is too large to read at once, the chunksize keyword of read_sql can be used to break the result into data frame chunks and process them one at a time, just as when reading a CSV file in chunks.
query = 'SELECT col1 FROM "table";'
chunk_iter = pd.read_sql(query, conn, chunksize=100)
for chunk in chunk_iter:
    # Process each chunk here
    pass
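For example, per-chunk frequency counts can be accumulated into one overall result (a minimal sketch reusing the query above):
total = pd.Series(dtype='int64')
for chunk in pd.read_sql(query, conn, chunksize=100):
    # Add this chunk's counts to the running total, filling gaps with 0
    total = total.add(chunk['col1'].value_counts(), fill_value=0)
print(total.astype('int64').sort_values(ascending=False).head(10))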