Your CPU has 8, 12, maybe 16 cores. Python uses one of them. One. And if you've ever wondered why your CPU-intensive Python script pegs a single core at 100% while the rest sit idle: welcome to the GIL, Python's most controversial design decision.
But here's the nuance that most "Python is slow" hot takes miss entirely: for I/O-bound work (network requests, database queries, file reads), Python's threading works beautifully. The problem is specifically CPU-bound parallelism. Knowing the difference — and picking the right concurrency tool — is the entire game.
In episodes #40 and #41, we covered asyncio — Python's single-threaded, cooperative concurrency model. This episode covers the other two concurrency models: threading (for I/O-bound concurrency with shared memory) and multiprocessing (for CPU-bound true parallelism). By the end, you'll know exactly when to reach for each one ;-)
Concurrency and parallelism get used interchangeably, but they mean different things:
Concurrency is about managing multiple tasks at once. Tasks may interleave (take turns) on a single core — like a chef preparing three dishes by switching between them.
Parallelism is about executing multiple tasks simultaneously. Tasks run at the same time on different cores — like three chefs each preparing one dish.
Threading in CPython provides concurrency (interleaving). Multiprocessing provides parallelism (simultaneous execution). The distinction matters enormously, because concurrent code may not run any faster (tasks still share a single core), while parallel code can approach linear speedup on multi-core machines.
CPython (the standard Python interpreter — the one you're almost certainly using) has a Global Interpreter Lock: a mutex that prevents multiple threads from executing Python bytecode at the same time.
Even on a 16-core machine, if you spawn 16 Python threads doing computation, they don't run in parallel. They take turns — only one thread holds the GIL and executes bytecode at any given moment. The others wait.
Why on earth would Python do this?
It's a pragmatic engineering tradeoff. CPython's memory management uses reference counting for garbage collection:
import sys
a = [] # refcount = 1
b = a # refcount = 2
print(sys.getrefcount(a)) # 3 (including the getrefcount argument itself)
del b # refcount drops to 2
Every assignment, every function call, every variable read modifies reference counts. Without the GIL, every single one of these operations would need its own fine-grained lock to be thread-safe. That would make all Python code slower (even single-threaded code), and would be a nightmare to implement correctly. The GIL is a single coarse lock that makes the entire interpreter thread-safe in one stroke.
The practical impact:
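- CPU-bound threads get no speedup: only one thread executes Python bytecode at a time, so the threads just take turns.
- I/O-bound threads do benefit, because the GIL is released during blocking I/O (network calls, file reads, time.sleep()). While one thread waits for a response, another thread can run.

A note on Python 3.13+: PEP 703 introduced an experimental "free-threaded" CPython build that removes the GIL entirely. As of early 2026, this is still an optional build and not the default. The concepts in this episode remain the standard approach for production Python.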
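To see the CPU-bound case for yourself, here is a minimal sketch that runs the same computation sequentially and in threads. The helper names (`count_primes`, `run_sequential`, `run_threaded`) and the workload size are my own choices for illustration. On a standard GIL build of CPython, expect the two timings to be roughly the same.

```python
import threading
import time

def count_primes(limit):
    """CPU-bound work: count primes below limit by trial division."""
    count = 0
    for num in range(2, limit):
        if all(num % d for d in range(2, int(num ** 0.5) + 1)):
            count += 1
    return count

def run_sequential(limits):
    return [count_primes(n) for n in limits]

def run_threaded(limits):
    results = [None] * len(limits)

    def work(index, limit):
        # Each thread writes to its own slot, so no lock is needed here
        results[index] = count_primes(limit)

    threads = [
        threading.Thread(target=work, args=(i, n))
        for i, n in enumerate(limits)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

limits = [20_000] * 4  # hypothetical workload size

start = time.perf_counter()
seq_results = run_sequential(limits)
seq_time = time.perf_counter() - start

start = time.perf_counter()
thr_results = run_threaded(limits)
thr_time = time.perf_counter() - start

print(f"Sequential: {seq_time:.2f}s")
print(f"Threaded:   {thr_time:.2f}s")
```

Both runs produce identical results. Only the scheduling differs, and with the GIL in place that scheduling buys you nothing for pure Python computation.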
When your program spends most of its time waiting — for network responses, for disk reads, for database queries — threading provides real speedup because threads release the GIL during these waits:
import threading
import time
import urllib.request

def fetch_url(url):
    """Fetch a URL and print response size."""
    start = time.perf_counter()
    with urllib.request.urlopen(url) as response:
        data = response.read()
    elapsed = time.perf_counter() - start
    print(f" {url}: {len(data):,} bytes in {elapsed:.2f}s")

urls = [
    'https://www.python.org',
    'https://docs.python.org/3/',
    'https://pypi.org',
    'https://peps.python.org',
    'https://wiki.python.org',
]

# Sequential: each request waits for the previous one
print("Sequential:")
start = time.perf_counter()
for url in urls:
    fetch_url(url)
print(f"Total: {time.perf_counter() - start:.2f}s\n")

# Threaded: requests happen concurrently
print("Threaded:")
start = time.perf_counter()
threads = []
for url in urls:
    t = threading.Thread(target=fetch_url, args=(url,))
    t.start()
    threads.append(t)
for t in threads:
    t.join()  # Wait for all threads to finish
print(f"Total: {time.perf_counter() - start:.2f}s")
On a reasonable connection, the threaded version typically finishes about 4x faster, with minimal code change. During each urlopen() call, the thread releases the GIL, allowing other threads to make their requests simultaneously. The threads aren't truly parallel (they share one Python interpreter), but they overlap their waiting time, which is what matters for I/O-bound work.
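If you want to reproduce the overlap effect without a network connection, a rough sketch like the following works. It swaps the real requests for `time.sleep()`, which also releases the GIL. The `fake_request` helper and the delays are made up for illustration.

```python
import threading
import time

def fake_request(name, delay, results):
    """Stand-in for a network call: sleeping releases the GIL like real I/O."""
    time.sleep(delay)
    results[name] = delay

# Hypothetical "response times" for five requests
delays = {"a": 0.2, "b": 0.2, "c": 0.2, "d": 0.2, "e": 0.2}
results = {}

start = time.perf_counter()
threads = [
    threading.Thread(target=fake_request, args=(name, delay, results))
    for name, delay in delays.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded_time = time.perf_counter() - start

# Run one after another, the waits would simply add up
sequential_estimate = sum(delays.values())
print(f"Threaded: {threaded_time:.2f}s, "
      f"sequential would take about {sequential_estimate:.1f}s")
```

The threaded total lands near the longest single wait rather than the sum of all waits, which is the whole point of threading for I/O.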
Threads share memory. That's both their advantage (easy data sharing) and their curse (easy data corruption). Here's the classic race condition:
import threading

counter = 0

def increment():
    global counter
    for _ in range(100_000):
        counter += 1  # This is NOT atomic!

threads = [threading.Thread(target=increment) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("Expected: 1,000,000")
# Can come out lower, e.g. 734,219 (how often depends on the CPython version)
print(f"Actual: {counter:,}")
Why? counter += 1 looks atomic but compiles to multiple bytecode instructions:
1. Load counter from the global scope
2. Load the constant 1
3. Add the two values
4. Store the result back into counter

The GIL can be released between these steps. So Thread A loads counter = 42, then Thread B loads counter = 42 (same value!), both add 1, and both store 43. Two increments, but the counter only went up by 1. This is called a lost update.
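You can inspect those steps yourself with the standard-library `dis` module. This is a small sketch; the function name `increment_once` is mine, and the exact opcode names differ between CPython versions, so treat the printed listing as illustrative rather than fixed.

```python
import dis

counter = 0

def increment_once():
    global counter
    counter += 1

# Collect the bytecode instructions for the function
instructions = list(dis.get_instructions(increment_once))
for ins in instructions:
    print(f"{ins.opname:<20} {ins.argrepr}")

# The load and the store are separate instructions with work in between
opnames = [ins.opname for ins in instructions]
load_index = opnames.index("LOAD_GLOBAL")
store_index = opnames.index("STORE_GLOBAL")
print(f"Instructions between load and store: {store_index - load_index - 1}")
```

Whatever the version prints, the load and the store are distinct instructions, and that gap is where another thread can sneak in.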
A threading.Lock ensures mutual exclusion:
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100_000):
        with lock:  # Only one thread at a time
            counter += 1

threads = [threading.Thread(target=increment) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Result: {counter}")  # Exactly 1,000,000
The with lock: context manager acquires the lock, executes the body, and releases it. Other threads trying to acquire the same lock will block until it's released.
But be careful — overusing locks negates the benefit of threading. If every operation acquires a lock, you've effectively serialized your code. The art is locking only the critical sections where shared state is modified.
Python's threading module provides several tools beyond basic locks:
import threading
import time

# RLock: Re-entrant lock (same thread can acquire multiple times)
rlock = threading.RLock()
with rlock:
    with rlock:  # Does not deadlock: RLock allows re-entry
        pass

# Semaphore: Allow up to N concurrent accesses
semaphore = threading.Semaphore(3)  # Max 3 threads at once
with semaphore:
    pass  # Only 3 threads can be here simultaneously

# Event: Signal between threads
event = threading.Event()

def waiter():
    print("Waiting for signal...")
    event.wait()  # Blocks until event is set
    print("Got the signal!")

def signaler():
    time.sleep(1)
    event.set()  # Wake up all waiters

threading.Thread(target=waiter).start()
threading.Thread(target=signaler).start()

# Condition: Wait for a specific condition
condition = threading.Condition()
data_ready = False

def producer():
    global data_ready
    with condition:
        data_ready = True
        condition.notify_all()  # Wake up consumers

def consumer():
    with condition:
        condition.wait_for(lambda: data_ready)
        print("Data is ready!")
Semaphore is particularly useful for rate-limiting: for example, capping the number of concurrent API requests so you stay under a provider's rate limit.
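Here is a minimal sketch of that rate-limiting idea. The limit of 3, the `limited_call` helper, and the sleep standing in for an API request are all assumptions for illustration. The extra bookkeeping (`active`, `peak`) exists only to show that the cap holds.

```python
import threading
import time

MAX_CONCURRENT = 3  # hypothetical limit imposed by the API
semaphore = threading.Semaphore(MAX_CONCURRENT)
stats_lock = threading.Lock()
active = 0  # calls currently inside the limited section
peak = 0    # highest value 'active' ever reached

def limited_call(task_id):
    global active, peak
    with semaphore:  # at most MAX_CONCURRENT threads get past this line
        with stats_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)  # stand-in for the actual API request
        with stats_lock:
            active -= 1

threads = [threading.Thread(target=limited_call, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Peak concurrency: {peak} (limit {MAX_CONCURRENT})")
```

Ten threads are started at once, but the semaphore keeps the number of in-flight "requests" at or below the limit.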
For computation that genuinely needs parallel execution, you need separate processes. Each process gets its own Python interpreter with its own GIL:
import multiprocessing
import time

def compute_primes(limit):
    """Find primes up to limit using trial division."""
    primes = []
    for num in range(2, limit):
        if all(num % p != 0 for p in primes if p * p <= num):
            primes.append(num)
    return len(primes)

if __name__ == '__main__':
    ranges = [200_000, 200_000, 200_000, 200_000]

    # Sequential
    start = time.perf_counter()
    results_seq = [compute_primes(r) for r in ranges]
    seq_time = time.perf_counter() - start
    print(f"Sequential: {seq_time:.2f}s, results: {results_seq}")

    # Parallel with Pool
    start = time.perf_counter()
    with multiprocessing.Pool(processes=4) as pool:
        results_par = pool.map(compute_primes, ranges)
    par_time = time.perf_counter() - start
    print(f"Parallel: {par_time:.2f}s, results: {results_par}")
    print(f"Speedup: {seq_time / par_time:.1f}x")
On a 4-core machine, you'll see close to a 4x speedup. Each process has its own GIL, its own memory space, and runs on its own core. True parallelism.
Processes don't share memory by default (unlike threads). This is actually a feature — isolated memory means no race conditions by design. But when you need to exchange data, Python provides several mechanisms:
multiprocessing.Queue is the most common IPC (inter-process communication) tool:
from multiprocessing import Process, Queue

def worker(task_queue, result_queue, worker_id):
    """Process tasks from queue, put results in result queue."""
    while True:
        task = task_queue.get()
        if task is None:
            break  # Poison pill: shut down
        # Simulate expensive computation
        result = sum(i**2 for i in range(task))
        result_queue.put((worker_id, task, result))

if __name__ == '__main__':
    # Create queues
    tasks = Queue()
    results = Queue()

    # Start 4 workers
    workers = []
    for i in range(4):
        p = Process(target=worker, args=(tasks, results, i))
        p.start()
        workers.append(p)

    # Send tasks
    for n in [100_000, 200_000, 150_000, 300_000, 250_000, 180_000]:
        tasks.put(n)

    # Send poison pills (one per worker)
    for _ in workers:
        tasks.put(None)

    # Collect results
    for _ in range(6):
        worker_id, task, result = results.get()
        print(f" Worker {worker_id}: computed sum of squares up to {task:,}")

    # Clean up
    for p in workers:
        p.join()
The poison pill pattern (sending None to signal shutdown) is a clean way to terminate worker processes. Each worker pulls tasks from the queue, processes them, and pushes results to a result queue. This is essentially a home-built task pool.
Value and Array

When you genuinely need shared state between processes (use sparingly!):
from multiprocessing import Process, Value, Lock

def increment(shared_counter, lock, n):
    for _ in range(n):
        with lock:
            shared_counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)  # 'i' = signed int, initial value 0
    lock = Lock()

    processes = [
        Process(target=increment, args=(counter, lock, 100_000))
        for _ in range(4)
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print(f"Counter: {counter.value}")  # 400,000
Value('i', 0) creates an integer in shared memory that every process can access. The 'i' is a one-character type code of the kind used by the array module: 'i' for a signed int, 'd' for a double, 'f' for a float. You still need a lock to prevent race conditions, just like with threads.
concurrent.futures: the high-level interface

The concurrent.futures module provides ThreadPoolExecutor and ProcessPoolExecutor: two classes with the exact same API but different execution models. This makes switching between threads and processes trivial:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from concurrent.futures import as_completed
import urllib.request

def fetch_size(url):
    """Return (url, size) tuple."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

def compute_factorial(n):
    """CPU-heavy: compute factorial iteratively."""
    result = 1
    for i in range(2, n + 1):
        result *= i
    return n, len(str(result))  # Return n and digit count

if __name__ == '__main__':
    # I/O-bound: use ThreadPoolExecutor
    urls = [
        'https://www.python.org',
        'https://docs.python.org/3/',
        'https://pypi.org',
    ]
    print("Fetching URLs (threaded):")
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(fetch_size, url): url for url in urls}
        for future in as_completed(futures):
            url, size = future.result()
            print(f" {url}: {size:,} bytes")

    # CPU-bound: use ProcessPoolExecutor
    numbers = [50_000, 60_000, 70_000, 80_000]
    print("\nComputing factorials (parallel processes):")
    with ProcessPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(compute_factorial, n): n for n in numbers}
        for future in as_completed(futures):
            n, digits = future.result()
            print(f" {n}! has {digits:,} digits")
The beauty of this API is as_completed() — it yields futures as they finish, regardless of submission order. First result back? You see it first. No waiting for slow tasks to unblock fast ones.
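To make the ordering difference concrete, here is a small sketch that contrasts `as_completed()` with `executor.map()`. The `slow_echo` helper and the delays are invented for the demo; each task just sleeps and returns its own delay.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def slow_echo(delay):
    """Sleep for 'delay' seconds, then return the delay."""
    time.sleep(delay)
    return delay

delays = [0.3, 0.1, 0.2]  # deliberately submitted slowest-first

# as_completed: results arrive in the order tasks finish
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(slow_echo, d) for d in delays]
    completion_order = [f.result() for f in as_completed(futures)]

# map: results come back in the order tasks were submitted
with ThreadPoolExecutor(max_workers=3) as executor:
    submission_order = list(executor.map(slow_echo, delays))

print("as_completed:", completion_order)
print("map:         ", submission_order)
```

Use `map()` when you need results lined up with inputs, and `as_completed()` when you want to act on each result the moment it's ready.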
Futures capture exceptions cleanly:
from concurrent.futures import ThreadPoolExecutor
# Before Python 3.11 this is a separate class from the built-in TimeoutError
from concurrent.futures import TimeoutError as FutureTimeoutError

def risky_fetch(url):
    if 'badurl' in url:
        raise ConnectionError(f"Cannot connect to {url}")
    return f"Success: {url}"

urls = ['https://python.org', 'https://badurl.invalid', 'https://pypi.org']

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = {executor.submit(risky_fetch, url): url for url in urls}
    for future in futures:
        try:
            result = future.result(timeout=5)
            print(f" OK: {result}")
        except ConnectionError as e:
            print(f" FAILED: {e}")
        except FutureTimeoutError:
            print(f" TIMEOUT: {futures[future]}")
Exceptions raised inside a worker are stored in the Future object and re-raised when you call .result(). This is much cleaner than the low-level threading approach, where an uncaught exception in a thread only prints a traceback to stderr and never reaches the code that started the thread.
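The contrast is easy to demonstrate. In this sketch (the `fail` and `record_hook` names are mine), a plain thread's exception is only visible through `threading.excepthook`, available since Python 3.8, while the Future hands the exception object straight back to the caller.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def fail():
    raise ValueError("boom")

# --- Plain thread: the exception never reaches the caller ---
captured = []

def record_hook(args):
    # args carries exc_type, exc_value, exc_traceback and thread
    captured.append(args.exc_type.__name__)

original_hook = threading.excepthook
threading.excepthook = record_hook  # temporarily replace the default hook
t = threading.Thread(target=fail)
t.start()
t.join()  # returns normally, nothing is raised here
threading.excepthook = original_hook

# --- Future: the exception is stored and handed back ---
with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(fail)
error = future.exception()  # returns the exception instead of raising it

print(f"Thread hook saw: {captured}")
print(f"Future stored:   {error!r}")
```

With a raw thread you have to install a hook (or wrap the target yourself) to learn about the failure. With a Future, the failure is just part of the result.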
A deadlock occurs when two or more threads (or processes) each hold a resource the other needs, and neither will release theirs first:
import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def thread_1():
    with lock_a:
        print("Thread 1: holding lock_a, waiting for lock_b...")
        time.sleep(0.1)  # Gives thread_2 time to acquire lock_b
        with lock_b:
            print("Thread 1: got both locks")

def thread_2():
    with lock_b:
        print("Thread 2: holding lock_b, waiting for lock_a...")
        time.sleep(0.1)  # Gives thread_1 time to acquire lock_a
        with lock_a:
            print("Thread 2: got both locks")

t1 = threading.Thread(target=thread_1)
t2 = threading.Thread(target=thread_2)
t1.start()
t2.start()
# Both threads hang forever: deadlock!
Thread 1 holds lock_a and waits for lock_b. Thread 2 holds lock_b and waits for lock_a. Neither can proceed. Your program hangs silently — no exception, no error message, just... nothing.
Prevention strategies:
Always acquire locks in the same order. If both threads acquire lock_a first, then lock_b, no deadlock can occur.
Use timeouts:
acquired = lock_b.acquire(timeout=2.0)
if not acquired:
    print("Couldn't get lock_b, backing off")
    # Release lock_a, retry later, or fail gracefully
Avoid holding multiple locks when possible. Restructure your code so each critical section needs only one lock.
Use higher-level abstractions. Queue, concurrent.futures, and multiprocessing.Pool handle synchronization internally — you don't manage locks yourself.
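The first strategy, consistent lock ordering, can be sketched as follows. The helper `in_global_order` and the choice to sort by `id()` are my own illustration of one way to get every thread to agree on an order; any stable ordering would do.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
finished = []

def in_global_order(*locks):
    """Sort locks by id() so every thread acquires them in the same order."""
    return sorted(locks, key=id)

def use_both(name, first, second):
    # The caller's preferred order is ignored; the global order wins
    outer, inner = in_global_order(first, second)
    with outer:
        with inner:
            finished.append(name)

# The two threads ask for the locks in opposite orders, as in the deadlock demo
t1 = threading.Thread(target=use_both, args=("thread-1", lock_a, lock_b))
t2 = threading.Thread(target=use_both, args=("thread-2", lock_b, lock_a))
t1.start()
t2.start()
t1.join(timeout=5)
t2.join(timeout=5)

print(f"Finished: {sorted(finished)}")
```

Because both threads end up taking the same lock first, one simply waits for the other instead of each holding half of what the other needs.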
After years of writing concurrent Python, here's my practical decision tree:
| Scenario | Tool | Why |
|---|---|---|
| Many HTTP requests | ThreadPoolExecutor or asyncio | I/O-bound, GIL not a factor |
| Database queries in parallel | ThreadPoolExecutor | I/O-bound, shared connection pool |
| Processing large images | ProcessPoolExecutor | CPU-bound, need true parallelism |
| Number crunching (no NumPy) | ProcessPoolExecutor | CPU-bound Python code |
| Number crunching (with NumPy) | NumPy/threading | NumPy releases GIL internally |
| 10,000+ concurrent connections | asyncio | Lowest overhead per connection |
| Simple background task | threading.Thread | Low overhead, easy to set up |
| Periodic background work | threading.Timer (re-armed after each run) | Built-in delayed call; each Timer fires once |
A few important nuances:
NumPy releases the GIL. If your CPU-bound work is NumPy operations, threading actually works — NumPy's C extensions release the GIL during computation. Same goes for many other C-extension libraries (Pandas, scikit-learn, etc.).
Process startup is expensive. Creating a process can take anywhere from a few milliseconds to ~100ms depending on the platform and start method, and each one carries its own interpreter. Don't create processes for tiny tasks. Use a Pool or ProcessPoolExecutor to reuse processes across many tasks.
Async is not faster for CPU work. We covered asyncio in episodes #40 and #41 — it's excellent for I/O concurrency, but it's single-threaded. Don't reach for asyncio when your bottleneck is computation.
Sometimes you need both. When a task mixes I/O and CPU work, the simplest approach is to run the whole task in a process pool, so each worker does its own fetching and its own computation:
from concurrent.futures import ProcessPoolExecutor
import hashlib
import time
import urllib.request

def fetch_and_process(url):
    """Fetch a page (I/O) then do computation (CPU)."""
    # I/O-bound: fetch the page
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = resp.read()
    # CPU-bound: process the data (simulate with hash computation)
    for _ in range(1000):
        data = hashlib.sha256(data).digest()
    return url, len(data)

if __name__ == '__main__':
    urls = [
        'https://www.python.org',
        'https://docs.python.org/3/',
        'https://pypi.org',
        'https://peps.python.org',
    ] * 3  # 12 URLs total

    # Use ProcessPoolExecutor for the combined I/O + CPU work
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(fetch_and_process, urls))
    elapsed = time.perf_counter() - start

    print(f"Processed {len(results)} URLs in {elapsed:.2f}s")
    for url, size in results[:4]:
        print(f" {url}: {size} bytes (after hashing)")
Each process handles both the I/O and CPU work for its assigned URLs. Since processes have their own GIL, the CPU work runs in true parallel across cores.
The if __name__ == '__main__' guard

One gotcha that catches every Python beginner with multiprocessing:
from multiprocessing import Process

def worker():
    print("Worker running")

# This MUST be inside the guard on Windows/macOS:
if __name__ == '__main__':
    p = Process(target=worker)
    p.start()
    p.join()
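On Windows and macOS (which use the "spawn" start method), each new process imports the main module to set itself up. Without the guard, that import re-runs the process-creating code in every child; recent Python versions detect this and raise a RuntimeError instead of spawning forever. Always protect multiprocessing code with if __name__ == '__main__'. On Linux, Python versions before 3.14 default to "fork", where the code works without the guard; Python 3.14 changed the default to "forkserver", which needs the guard as well. It's good practice everywhere.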
Let's put it all together with a practical example — processing a directory of text files in parallel:
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path
import re
from collections import Counter

def analyze_file(filepath):
    """Analyze a single text file: word count, line count, top words."""
    path = Path(filepath)  # Accept both str and Path arguments
    text = path.read_text(encoding='utf-8', errors='ignore')
    words = re.findall(r'\b[a-z]+\b', text.lower())
    return {
        'file': path.name,
        'lines': text.count('\n'),
        'words': len(words),
        'unique_words': len(set(words)),
        'top_5': Counter(words).most_common(5),
    }

def analyze_directory(directory, pattern='*.txt', max_workers=4):
    """Analyze all matching files in parallel."""
    files = list(Path(directory).glob(pattern))
    if not files:
        print(f"No {pattern} files found in {directory}")
        return
    print(f"Analyzing {len(files)} files using {max_workers} processes...\n")
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {
            executor.submit(analyze_file, f): f for f in files
        }
        for future in as_completed(future_to_file):
            filepath = future_to_file[future]
            try:
                result = future.result()
                results.append(result)
                print(f" Done: {result['file']} "
                      f"({result['words']:,} words, "
                      f"{result['unique_words']:,} unique)")
            except Exception as e:
                print(f" Error processing {filepath.name}: {e}")
    # Summary
    total_words = sum(r['words'] for r in results)
    total_lines = sum(r['lines'] for r in results)
    print(f"\nTotal: {total_words:,} words, "
          f"{total_lines:,} lines across {len(results)} files")

if __name__ == '__main__':
    analyze_directory('/path/to/text/files', '*.txt')
Each file is processed by a separate process — true parallel execution on multiple cores. Results stream back via as_completed(), so you see progress as files finish. Error handling is built in via the Future pattern. And the if __name__ == '__main__' guard ensures clean process spawning.
In this episode, we explored concurrency and parallelism in Python:
- Protect shared state between threads with Lock, RLock, Semaphore, or Event
- Processes exchange data through Queue or shared Value/Array objects
- concurrent.futures provides ThreadPoolExecutor and ProcessPoolExecutor with the same clean API
- as_completed() yields results as they finish, so fast tasks don't wait for slow ones
- Always use the if __name__ == '__main__' guard with multiprocessing

The golden rule: profile first, then choose your concurrency model based on whether the bottleneck is I/O (threads or async) or CPU (processes). Don't guess, measure ;-)