Two different APIs related to Process Pool in Python
Intro
As a machine learning engineer, I work with vast amounts of data, performing tasks such as data cleaning and information extraction. These tasks are typically data-parallel and do not involve race conditions. Such workloads are usually referred to as CPU-intensive tasks. Typically, these tasks can be formulated as map(fn, data)
.
Python is limited by its Global Interpreter Lock (GIL). As a result, using multithreading could not improve the performance. Instead, multiprocessing should be used. Usually, you would not manually manage all the launched processes but would instead use a process pool. Today, I’d like to discuss two different APIs related to multiprocessing: multiprocessing.Pool
and concurrent.futures.ProcessPoolExecutor
, and provide a simple comparison.