1. SQLite with Pandas
The Pandas library keeps its data in memory, whereas a DBMS such as SQLite processes data on disk. In other words, Pandas can process as much data as fits in available memory, while SQLite can process as much data as fits in available disk space. Taking this further, when processing data on a cloud server, disk space is far cheaper than memory.
In most cases, data is extracted from a table in a SQLite database, and the data is then analyzed, explored, and visualized with Pandas. One way to move CSV file data into the SQLite database is to enter the values manually with an INSERT statement per row. Alternatively, each chunk of the file can be transferred to a table as a data frame using the DataFrame.to_sql() method, as shown in the next section.
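For reference, the manual INSERT route looks roughly like this (a minimal sketch, assuming a two-column table already exists; file.csv and file.db are placeholders, and the placeholder name "table" is a reserved word in SQLite, so it is quoted in queries):
import csv
import sqlite3
conn = sqlite3.connect('file.db')
with open('file.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    # One parameterized INSERT per row; executemany runs it for every row
    conn.executemany('INSERT INTO "table" VALUES (?, ?)', reader)
conn.commit()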
2. Load a File into a Database Table
By default, the DataFrame.to_sql() method refuses to load data into a table that already exists in the database. Therefore, the if_exists='append' keyword must be used to load each chunked data frame into the existing table. Additionally, the index option should be set to False so that the data frame's index is not added as a column in the SQLite table.
import sqlite3
import pandas as pd
conn = sqlite3.connect('file.db')
chunk_iter = pd.read_csv('file.csv', chunksize=1000)
for chunk in chunk_iter:
    # Append each chunk to the existing table without writing the index
    chunk.to_sql('table', conn, if_exists='append', index=False)
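As a quick sanity check, the number of rows loaded so far can be counted (assuming the file.db and table names above):
print(pd.read_sql('SELECT COUNT(*) AS n FROM "table";', conn))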
When the DataFrame.to_sql() method is used, the data types stored in the data frame are automatically converted to matching SQLite types and stored.
pandas.read_sql() is the method for running a query against SQLite and returning the result as a data frame. For example, the table schema can be inspected with a PRAGMA statement (the table name is quoted because table is a reserved word):
result_df = pd.read_sql('PRAGMA table_info("table");', conn)  # column names and SQLite types
print(result_df)
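As a rough illustration of the conversion (demo_table is a hypothetical name used only for this example), int64 columns become INTEGER, float64 becomes REAL, and object (string) columns become TEXT:
demo = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})
demo.to_sql('demo_table', conn, if_exists='replace', index=False)
# Expected SQLite types: a -> INTEGER, b -> REAL, c -> TEXT
print(pd.read_sql('PRAGMA table_info("demo_table");', conn))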
3. Optimize data types on a database table
Using Pandas, the data types can be optimized inside the data frame first, and each optimized chunk can then be uploaded to the SQL table.
chunk_iter = pd.read_csv('file.csv', chunksize=1000)
for chunk in chunk_iter:
    # Downcast the column before writing so SQLite stores the smaller type
    chunk['col1'] = chunk['col1'].astype('int32')
    chunk.to_sql('table', conn, if_exists='append', index=False)
df = pd.read_sql('PRAGMA table_info("table");', conn)
print(df)
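To see what the cast saves, the column's memory footprint can be compared before and after (a minimal sketch on a single chunk, assuming col1 is read as int64 by default):
chunk = pd.read_csv('file.csv', nrows=1000)
print(chunk['col1'].memory_usage(index=False))                  # bytes as int64
print(chunk['col1'].astype('int32').memory_usage(index=False))  # half the bytes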
4. Workflows for generating a Pandas DataFrame
The two workflows that create a Pandas DataFrame object are as follows:
- Perform the operation inside SQL, then parse the result into a data frame.
- Extract the raw data from SQL, then perform the calculation with Pandas.
What Pandas has over SQLite is a wide variety of functions, methods, and operators that handle data frames efficiently, along with fast in-memory processing.
# Case 1: Performing the operation inside SQL
import pandas as pd
import sqlite3
conn = sqlite3.connect('file.db')
counts = pd.read_sql("""
    SELECT col1, COUNT(*) AS counts
    FROM "table"
    GROUP BY col1
    ORDER BY counts DESC;""", conn)
counts.head(10)
# Case 2: Performing the operation inside Pandas
import pandas as pd
import sqlite3
conn = sqlite3.connect('file.db')
df = pd.read_sql('SELECT col1 FROM "table";', conn)
# value_counts() produces the same frequency table, sorted in descending order
counts = df['col1'].value_counts()
5. Read a table in chunks from the database
If memory is tight or the table is too large to read at once, the chunksize keyword of read_sql can be used to break the result into data frame chunks and process them one at a time, just as when reading a CSV file in chunks.
query = 'SELECT col1 FROM "table";'
chunk_iter = pd.read_sql(query, conn, chunksize=100)
for chunk in chunk_iter:
    # Process each chunk here
    pass
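For example, per-chunk frequency counts can be accumulated into one overall result (a minimal sketch reusing the query above):
total = pd.Series(dtype='int64')
for chunk in pd.read_sql(query, conn, chunksize=100):
    # Add this chunk's counts to the running total, filling gaps with 0
    total = total.add(chunk['col1'].value_counts(), fill_value=0)
print(total.astype('int64').sort_values(ascending=False).head(10))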