1. What is Parallel Processing
When a large dataset must be processed with limited memory, the data has to be split into chunks. Parallel processing speeds up the work by processing those chunks simultaneously.
A central processing unit (CPU) is the hardware that executes a computer's operations. Older CPUs could perform only one task at a time on a single core, but modern CPUs are capable of parallel processing across multiple cores.
Python's standard library includes the multiprocessing module, which enables parallel processing:
import multiprocessing
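As a quick check of the multi-core hardware mentioned above, the module can report how many CPU cores are available (cpu_count() is part of the standard multiprocessing API; the exact number printed depends on the machine):

```python
import multiprocessing

# Number of CPU cores available for parallel work on this machine
print(multiprocessing.cpu_count())
```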
2. Process object
2.1 Process.start()
The multiprocessing.Process() constructor creates a Process object for parallel processing. The target parameter receives the function to be run in the new process. The created Process object is then launched with the start() method.
import time
def wait():
    time.sleep(0.5)
    print("Done waiting")
process = multiprocessing.Process(target=wait)
process.start()
print("Finished")
process.join()
2.2 Process.join()
When the program above runs, "Finished" is printed first and "Done waiting" is printed afterwards.
Python executes statements in the order they appear in the code. However, after process.start() is called, the wait() function runs in a separate process and is delayed by time.sleep(), so print("Finished") executes first.
Process.join() makes the calling program wait until the process has finished before continuing. Therefore, the code that produces the desired order is as follows:
import time
def wait():
    time.sleep(0.5)
    print("Done waiting")
process = multiprocessing.Process(target=wait)
process.start()
process.join()
print("Finished")
3. Execution time of using Parallel Processing
In a multiprocessing workflow, the main program waits until all processes are finished and then combines their results into the final output. Because the processes run in parallel, the total execution time is, ideally, divided by the number of processes.
1. Divide each operation by chunk.
2. Perform a process for each chunk.
3. Wait until all processes are finished.
4. Summarize the results.
import time
def wait():
    time.sleep(0.5)
    print("Done waiting")
start = time.time()
wait()
end = time.time()
elapsed1 = end - start
start = time.time()
p1 = multiprocessing.Process(target=wait)
p2 = multiprocessing.Process(target=wait)
p1.start()
p2.start()
p1.join()
p2.join()
end = time.time()
elapsed2 = end - start
# elapsed1 : 0.50099635
# elapsed2 : 0.54812479
As the measurements show, there is little difference between elapsed1 and elapsed2: running two wait() calls in parallel takes about as long as running one, not twice as long.
4. Parallel Processing with arguments
If the function assigned to a process requires input values, Process() can receive arguments for the function through the args parameter in addition to target. args takes a tuple of positional arguments, so a compound object such as a list or tuple can be passed through as a single argument (note the trailing comma needed for a one-element tuple).
def sum3(x, y, z):
    print(x + y + z)

def list_average(values):
    print(sum(values) / len(values))
sum3_process = multiprocessing.Process(target=sum3, args=(3, 2, 5))
list_average_process = multiprocessing.Process(target=list_average, args=((1, 2, 3, 4, 5),))
sum3_process.start()
list_average_process.start()
sum3_process.join()
list_average_process.join()