Learn Python Series (#48) - Concurrency - Threading vs Multiprocessing

Repository

https://github.com/realScipio/learn-python-series

What will I learn

You will learn the fundamental difference between threading and multiprocessing in Python;
what the Global Interpreter Lock (GIL) is, why it exists, and why it matters for performance;
when to use threads vs. processes vs. async (we covered async in episodes #40 and #41);
how to safely share data between threads and between processes;
common concurrency pitfalls — race conditions, deadlocks — and how to avoid them;
the high-level concurrent.futures interface that unifies both models.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution, such as (for example) the Anaconda Distribution;
The ambition to learn Python programming.

Difficulty

Intermediate, advanced

GitHub Account

https://github.com/realScipio

Your CPU has 8, 12, maybe 16 cores. Python uses one of them. One. And if you've ever wondered why your CPU-intensive Python script pegs a single core at 100% while the rest sit idle — welcome to the GIL, Python's most controversial design decision.

But here's the nuance that most "Python is slow" hot takes miss entirely: for I/O-bound work (network requests, database queries, file reads), Python's threading works beautifully. The problem is specifically CPU-bound parallelism. Knowing the difference — and picking the right concurrency tool — is the entire game.

In episodes #40 and #41, we covered asyncio — Python's single-threaded, cooperative concurrency model. This episode covers the other two concurrency models: threading (for I/O-bound concurrency with shared memory) and multiprocessing (for CPU-bound true parallelism). By the end, you'll know exactly when to reach for each one ;-)

The mental model: concurrency vs. parallelism

These two words get used interchangeably, but they mean different things:

Concurrency is about managing multiple tasks at once. Tasks may interleave (take turns) on a single core — like a chef preparing three dishes by switching between them.

Parallelism is about executing multiple tasks simultaneously. Tasks run at the same time on different cores — like three chefs each preparing one dish.

Threading in CPython provides concurrency (interleaving). Multiprocessing provides parallelism (simultaneous execution). The distinction matters enormously, because concurrent code may not run any faster (tasks still share a single core), while parallel code can approach linear speedup on multi-core machines when the work splits cleanly.

The Global Interpreter Lock (GIL): what it is and why it exists

CPython (the standard Python interpreter — the one you're almost certainly using) has a Global Interpreter Lock: a mutex that prevents multiple threads from executing Python bytecode at the same time.

Even on a 16-core machine, if you spawn 16 Python threads doing computation, they don't run in parallel. They take turns — only one thread holds the GIL and executes bytecode at any given moment. The others wait.

Why on earth would Python do this?

It's a pragmatic engineering tradeoff. CPython's memory management uses reference counting for garbage collection:

import sys

a = []          # refcount = 1
b = a           # refcount = 2
print(sys.getrefcount(a))  # typically 3 (the call's own argument adds a temporary reference)
del b           # refcount drops back to 1

Every assignment, every function call, every variable read modifies reference counts. Without the GIL, every single one of these operations would need its own fine-grained lock to be thread-safe. That would make all Python code slower (even single-threaded code), and would be a nightmare to implement correctly. The GIL is a single coarse lock that makes the entire interpreter thread-safe in one stroke.

The practical impact:

I/O-bound tasks: Threading works great — threads release the GIL during I/O operations (network calls, file reads, time.sleep()). While one thread waits for a response, another thread can run.
CPU-bound tasks: Threading provides zero speedup — the GIL prevents parallel execution of Python bytecode. You need separate processes (each with its own interpreter and its own GIL) for true parallelism.
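
If you want to check the CPU-bound claim on your own machine, a minimal sketch like the one below times the same pure-Python loop run back to back and then in two threads. The function names (count_down, timed_sequential, timed_threaded) and the loop size are made up for illustration. On a standard GIL build the two timings should come out roughly equal; a free-threaded build may behave differently.

```python
import threading
import time

N = 2_000_000  # Arbitrary loop size, chosen only so the run takes a moment


def count_down(n):
    """Pure-Python busy loop: CPU-bound, never waits on I/O."""
    while n > 0:
        n -= 1


def timed_sequential(jobs=2):
    """Run the loop `jobs` times back to back; return elapsed seconds."""
    start = time.perf_counter()
    for _ in range(jobs):
        count_down(N)
    return time.perf_counter() - start


def timed_threaded(jobs=2):
    """Run the loop in `jobs` threads at once; return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(jobs)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"Sequential: {timed_sequential():.2f}s")
    print(f"Threaded:   {timed_threaded():.2f}s")
```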

A note on Python 3.13+: PEP 703 introduced a "free-threaded" CPython build that removes the GIL entirely. It shipped as experimental in 3.13 and became officially supported (but still optional) in 3.14 under PEP 779. As of early 2026 it is not the default build, so the concepts in this episode remain the standard approach for production Python.

Threading for I/O-bound work

When your program spends most of its time waiting — for network responses, for disk reads, for database queries — threading provides real speedup because threads release the GIL during these waits:

import threading
import time
import urllib.request

def fetch_url(url):
    """Fetch a URL and print response size."""
    start = time.perf_counter()
    with urllib.request.urlopen(url) as response:
        data = response.read()
    elapsed = time.perf_counter() - start
    print(f"  {url}: {len(data):,} bytes in {elapsed:.2f}s")

urls = [
    'https://www.python.org',
    'https://docs.python.org/3/',
    'https://pypi.org',
    'https://peps.python.org',
    'https://wiki.python.org',
]

# Sequential: each request waits for the previous one
print("Sequential:")
start = time.perf_counter()
for url in urls:
    fetch_url(url)
print(f"Total: {time.perf_counter() - start:.2f}s\n")

# Threaded: requests happen concurrently
print("Threaded:")
start = time.perf_counter()
threads = []
for url in urls:
    t = threading.Thread(target=fetch_url, args=(url,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()  # Wait for all threads to finish

print(f"Total: {time.perf_counter() - start:.2f}s")

Typical results on a reasonable connection:

Sequential: ~2.5 seconds (each request waits for the previous one)
Threaded: ~0.6 seconds (all requests happen concurrently)

That's a ~4x speedup with minimal code change. During each urlopen() call, the thread releases the GIL, allowing other threads to make their requests simultaneously. The threads aren't truly parallel (they share one Python interpreter), but they overlap their waiting time — which is what matters for I/O-bound work.
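
To reproduce the effect without touching the network, time.sleep() can stand in for the waiting, since it also releases the GIL. The sketch below is only an offline approximation of the example above; fake_request and the delay values are invented for illustration.

```python
import threading
import time


def fake_request(delay):
    """Stand-in for a network call: sleeping releases the GIL like real I/O."""
    time.sleep(delay)


def run_sequential(delays):
    """Wait for each fake request in turn; return elapsed seconds."""
    start = time.perf_counter()
    for d in delays:
        fake_request(d)
    return time.perf_counter() - start


def run_threaded(delays):
    """Start one thread per fake request; return elapsed seconds."""
    threads = [threading.Thread(target=fake_request, args=(d,)) for d in delays]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.2] * 5
    print(f"Sequential: {run_sequential(delays):.2f}s")  # roughly the sum of the delays
    print(f"Threaded:   {run_threaded(delays):.2f}s")    # roughly the longest delay
```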

Thread safety and race conditions

Threads share memory. That's both their advantage (easy data sharing) and their curse (easy data corruption). Here's the classic race condition:

import threading

counter = 0

def increment():
    global counter
    for _ in range(100_000):
        counter += 1  # This is NOT atomic!

threads = [threading.Thread(target=increment) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("Expected: 1,000,000")
print(f"Actual:   {counter:,}")  # May come out lower, e.g. 734,219; how often depends on the CPython version

Why? counter += 1 looks atomic but compiles to multiple bytecode instructions:

LOAD counter from global scope
LOAD constant 1
ADD them together
STORE result back to counter

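A thread switch can land between these steps. So Thread A loads counter = 42, then Thread B loads counter = 42 (same value!), both add 1, and both store 43. Two increments, but the counter only went up by 1. This is called a lost update. How often it actually bites depends on the interpreter: recent CPython versions only switch threads at certain points (such as loop jumps and function calls), so this exact loop may happen to print the right total. That is an implementation detail, not a guarantee, and the code is still not thread-safe.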
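
You can look at the real instructions with the standard dis module. This is just a small inspection sketch; increment_once and opnames are helper names made up for this example, and the exact instruction names differ between Python versions.

```python
import dis

counter = 0


def increment_once():
    global counter
    counter += 1


def opnames(func):
    """Return the names of the bytecode instructions in `func`."""
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    # Prints the full disassembly; look for a load, an add, and a store
    dis.dis(increment_once)
```

The load and the store are separate instructions, which is exactly the gap a second thread can slip into.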

Fixing it with locks

A threading.Lock ensures mutual exclusion:

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100_000):
        with lock:           # Only one thread at a time
            counter += 1

threads = [threading.Thread(target=increment) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Result: {counter}")  # Exactly 1,000,000

The with lock: context manager acquires the lock, executes the body, and releases it. Other threads trying to acquire the same lock will block until it's released.

But be careful — overusing locks negates the benefit of threading. If every operation acquires a lock, you've effectively serialized your code. The art is locking only the critical sections where shared state is modified.
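
One way to keep the critical section small is to do the expensive part on thread-local data and take the lock only for the final shared update. The following is a sketch of that idea, not a universal recipe; add_squares, run, and the chunking are assumptions made for the example.

```python
import threading

total = 0
total_lock = threading.Lock()


def add_squares(numbers):
    """Do the heavy part without the lock; lock only for the shared update."""
    global total
    partial = sum(n * n for n in numbers)  # Thread-local work, no lock needed
    with total_lock:                       # Short critical section
        total += partial


def run(chunks):
    """Process each chunk in its own thread and return the combined total."""
    global total
    total = 0
    threads = [threading.Thread(target=add_squares, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(run([range(0, 50_000), range(50_000, 100_000)]))
```

Each thread now acquires the lock once instead of once per item, so the threads spend almost no time waiting on each other.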

Other synchronization primitives

Python's threading module provides several tools beyond basic locks:

import threading

# RLock: Re-entrant lock (same thread can acquire multiple times)
rlock = threading.RLock()
with rlock:
    with rlock:  # Wouldn't deadlock! RLock allows re-entry
        pass

# Semaphore: Allow up to N concurrent accesses
semaphore = threading.Semaphore(3)  # Max 3 threads at once
with semaphore:
    pass  # Only 3 threads can be here simultaneously

# Event: Signal between threads
event = threading.Event()

def waiter():
    print("Waiting for signal...")
    event.wait()  # Blocks until event is set
    print("Got the signal!")

def signaler():
    import time
    time.sleep(1)
    event.set()  # Wake up all waiters

threading.Thread(target=waiter).start()
threading.Thread(target=signaler).start()

# Condition: Wait for a specific condition
condition = threading.Condition()
data_ready = False

def producer():
    global data_ready
    with condition:
        data_ready = True
        condition.notify_all()  # Wake up consumers

def consumer():
    with condition:
        condition.wait_for(lambda: data_ready)
        print("Data is ready!")

Semaphore is particularly useful for rate-limiting, for example capping the number of concurrent API requests so you stay under a provider's limit. ThreadPoolExecutor (covered later in this episode) gives you a similar cap through its max_workers argument.
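
As a rough illustration of the rate-limiting idea, the sketch below starts ten threads but lets at most three of them into the "request" section at a time. MAX_CONCURRENT, limited_call, and the bookkeeping variables are hypothetical names, and time.sleep() stands in for the real request.

```python
import threading
import time

MAX_CONCURRENT = 3  # Hypothetical limit, e.g. imposed by an API provider
slots = threading.Semaphore(MAX_CONCURRENT)

state_lock = threading.Lock()
active = 0  # Calls currently inside the limited section
peak = 0    # Highest value `active` ever reached


def limited_call(i):
    global active, peak
    with slots:                  # Blocks while MAX_CONCURRENT calls are in flight
        with state_lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)         # Stand-in for the real request
        with state_lock:
            active -= 1


def run(n_calls=10):
    """Start n_calls threads and return the peak concurrency observed."""
    threads = [threading.Thread(target=limited_call, args=(i,)) for i in range(n_calls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    print(f"Peak concurrency: {run()}")
```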

Multiprocessing for CPU-bound work

For computation that genuinely needs parallel execution, you need separate processes. Each process gets its own Python interpreter with its own GIL:

import multiprocessing
import time

def compute_primes(limit):
    """Find primes up to limit using trial division."""
    primes = []
    for num in range(2, limit):
        if all(num % p != 0 for p in primes if p * p <= num):
            primes.append(num)
    return len(primes)

if __name__ == '__main__':  # Required when child processes re-import this module
    ranges = [200_000, 200_000, 200_000, 200_000]

    # Sequential
    start = time.perf_counter()
    results_seq = [compute_primes(r) for r in ranges]
    seq_time = time.perf_counter() - start
    print(f"Sequential: {seq_time:.2f}s, results: {results_seq}")

    # Parallel with Pool
    start = time.perf_counter()
    with multiprocessing.Pool(processes=4) as pool:
        results_par = pool.map(compute_primes, ranges)
    par_time = time.perf_counter() - start
    print(f"Parallel:   {par_time:.2f}s, results: {results_par}")
    print(f"Speedup:    {seq_time / par_time:.1f}x")

On a 4-core machine, you should see a speedup approaching 4x, minus the cost of starting the processes and passing arguments and results between them. Each process has its own GIL, its own memory space, and runs on its own core. True parallelism.
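
The example above repeats the same job four times. When you have one big job instead, a common approach is to split it into chunks and let the pool work through them. Here is a minimal sketch under that assumption; is_prime, count_primes_in, split_range, and the choice of four chunks per worker are illustrative, not tuned values.

```python
import math
import multiprocessing
import os


def is_prime(n):
    """Trial division up to the integer square root."""
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True


def count_primes_in(bounds):
    """Count primes in the half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(1 for n in range(lo, hi) if is_prime(n))


def split_range(limit, parts):
    """Split [0, limit) into at most `parts` consecutive (lo, hi) chunks."""
    step = math.ceil(limit / parts)
    return [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]


def main():
    workers = os.cpu_count() or 1
    # More chunks than workers, so a slow chunk does not leave cores idle
    chunks = split_range(200_000, workers * 4)
    with multiprocessing.Pool(processes=workers) as pool:
        total = sum(pool.map(count_primes_in, chunks))
    print(f"Primes below 200,000: {total:,}")


if __name__ == "__main__":
    main()
```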

Process communication

Processes don't share memory by default (unlike threads). This is actually a feature — isolated memory means no race conditions by design. But when you need to exchange data, Python provides several mechanisms:

Queues

multiprocessing.Queue is the most common IPC (inter-process communication) tool:

from multiprocessing import Process, Queue
import time

def worker(task_queue, result_queue, worker_id):
    """Process tasks from queue, put results in result queue."""
    while True:
        task = task_queue.get()
        if task is None:
            break  # Poison pill — shut down
        
        # Simulate expensive computation
        result = sum(i**2 for i in range(task))
        result_queue.put((worker_id, task, result))

if __name__ == '__main__':  # Needed so newly started workers don't re-run this setup
    # Create queues
    tasks = Queue()
    results = Queue()

    # Start 4 workers
    workers = []
    for i in range(4):
        p = Process(target=worker, args=(tasks, results, i))
        p.start()
        workers.append(p)

    # Send tasks
    for n in [100_000, 200_000, 150_000, 300_000, 250_000, 180_000]:
        tasks.put(n)

    # Send poison pills (one per worker)
    for _ in workers:
        tasks.put(None)

    # Collect results
    for _ in range(6):
        worker_id, task, result = results.get()
        print(f"  Worker {worker_id}: computed sum of squares up to {task:,}")

    # Clean up
    for p in workers:
        p.join()

The poison pill pattern (sending None to signal shutdown) is a clean way to terminate worker processes. Each worker pulls tasks from the queue, processes them, and pushes results to a result queue. This is essentially a home-built task pool.

Shared memory with `Value` and `Array`

When you genuinely need shared state between processes (use sparingly!):

from multiprocessing import Process, Value, Lock

def increment(shared_counter, lock, n):
    for _ in range(n):
        with lock:
            shared_counter.value += 1

if __name__ == '__main__':  # Needed so child processes don't re-run this setup
    counter = Value('i', 0)  # 'i' = signed int, initial value 0
    lock = Lock()

    processes = [
        Process(target=increment, args=(counter, lock, 100_000))
        for _ in range(4)
    ]

    for p in processes:
        p.start()
    for p in processes:
        p.join()

    print(f"Counter: {counter.value}")  # 400,000

Value('i', 0) creates a shared integer in memory mapped between processes. The 'i' is a ctypes type code — 'i' for int, 'd' for double, 'f' for float. You still need a lock to prevent race conditions, just like with threads.
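
A Value created with the default lock=True also carries its own lock, reachable through get_lock(), so a separate Lock object is not strictly required. The sketch below shows that variant; add_many and the iteration counts are made up for the example.

```python
from multiprocessing import Process, Value


def add_many(shared_counter, n):
    """Increment the shared counter n times using its built-in lock."""
    for _ in range(n):
        with shared_counter.get_lock():
            shared_counter.value += 1


def main():
    counter = Value('i', 0)  # lock=True is the default, so get_lock() is available
    processes = [Process(target=add_many, args=(counter, 10_000)) for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(f"Counter: {counter.value}")


if __name__ == '__main__':
    main()
```

Whether you pass an explicit Lock or use get_lock() is mostly a matter of taste; the important part is that the read-modify-write on .value happens while holding one of them.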

`concurrent.futures`: the high-level interface

The concurrent.futures module provides ThreadPoolExecutor and ProcessPoolExecutor — two classes with the exact same API but different execution models. This makes switching between threads and processes trivial:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from concurrent.futures import as_completed
import time
import urllib.request

def fetch_size(url):
    """Return (url, size) tuple."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, len(resp.read())

def compute_factorial(n):
    """CPU-heavy: compute factorial iteratively."""
    result = 1
    for i in range(2, n + 1):
        result *= i
    return n, len(str(result))  # Return n and digit count

if __name__ == '__main__':  # Needed for ProcessPoolExecutor when workers re-import this module
    # I/O-bound: use ThreadPoolExecutor
    urls = [
        'https://www.python.org',
        'https://docs.python.org/3/',
        'https://pypi.org',
    ]

    print("Fetching URLs (threaded):")
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = {executor.submit(fetch_size, url): url for url in urls}
        for future in as_completed(futures):
            url, size = future.result()
            print(f"  {url}: {size:,} bytes")

    # CPU-bound: use ProcessPoolExecutor
    numbers = [50_000, 60_000, 70_000, 80_000]

    print("\nComputing factorials (parallel processes):")
    with ProcessPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(compute_factorial, n): n for n in numbers}
        for future in as_completed(futures):
            n, digits = future.result()
            print(f"  {n}! has {digits:,} digits")

The beauty of this API is as_completed() — it yields futures as they finish, regardless of submission order. First result back? You see it first. No waiting for slow tasks to unblock fast ones.
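
To make the ordering difference concrete, here is a small sketch that runs the same tasks through executor.map() (results in input order) and through as_completed() (results in finish order). slow_echo and the delays are invented for the demonstration, with time.sleep() standing in for real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_echo(delay):
    """Sleep for `delay` seconds, then return it."""
    time.sleep(delay)
    return delay


def in_submission_order(delays):
    """executor.map() yields results in the order the inputs were given."""
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        return list(executor.map(slow_echo, delays))


def in_completion_order(delays):
    """as_completed() yields each future as soon as it finishes."""
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        futures = [executor.submit(slow_echo, d) for d in delays]
        return [f.result() for f in as_completed(futures)]


if __name__ == "__main__":
    delays = [0.3, 0.05, 0.15]
    print("map():         ", in_submission_order(delays))
    print("as_completed():", in_completion_order(delays))
```

Use map() when you need results lined up with the inputs, and as_completed() when you would rather act on each result the moment it arrives.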

Error handling with futures

Futures capture exceptions cleanly:

from concurrent.futures import ThreadPoolExecutor

def risky_fetch(url):
    if 'badurl' in url:
        raise ConnectionError(f"Cannot connect to {url}")
    return f"Success: {url}"

urls = ['https://python.org', 'https://badurl.invalid', 'https://pypi.org']

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = {executor.submit(risky_fetch, url): url for url in urls}
    
    for future in futures:
        try:
            result = future.result(timeout=5)
            print(f"  OK: {result}")
        except ConnectionError as e:
            print(f"  FAILED: {e}")
        except TimeoutError:
            print(f"  TIMEOUT: {futures[future]}")

Exceptions raised inside a worker are stored in the Future object and re-raised when you call .result(). This is much cleaner than the low-level threading approach, where an uncaught exception in a thread is only printed to stderr (via threading.excepthook) and never reaches the code that started the thread.
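
The contrast is easy to see side by side. In this sketch (fail, run_in_plain_thread, and run_in_executor are names made up for the example), the same failing function is run in a bare thread and through an executor. Besides .result(), a Future also offers .exception(), which returns the stored exception instead of raising it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fail():
    raise ValueError("boom")


def run_in_plain_thread():
    """The exception never reaches the caller; join() returns normally."""
    t = threading.Thread(target=fail)
    t.start()
    t.join()  # No exception here; the traceback only goes to stderr
    return "caller carried on"


def run_in_executor():
    """The Future keeps the exception so the caller can inspect it."""
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(fail)
        return future.exception()  # Waits for the task; returns the exception or None


if __name__ == "__main__":
    print(run_in_plain_thread())
    print(repr(run_in_executor()))
```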

Deadlocks: the silent killer

A deadlock occurs when two or more threads (or processes) each hold a resource the other needs, and neither will release theirs first:

import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def thread_1():
    with lock_a:
        print("Thread 1: holding lock_a, waiting for lock_b...")
        time.sleep(0.1)  # Gives thread_2 time to acquire lock_b
        with lock_b:
            print("Thread 1: got both locks")

def thread_2():
    with lock_b:
        print("Thread 2: holding lock_b, waiting for lock_a...")
        time.sleep(0.1)  # Gives thread_1 time to acquire lock_a
        with lock_a:
            print("Thread 2: got both locks")

t1 = threading.Thread(target=thread_1)
t2 = threading.Thread(target=thread_2)
t1.start()
t2.start()
# Both threads hang forever — deadlock!

Thread 1 holds lock_a and waits for lock_b. Thread 2 holds lock_b and waits for lock_a. Neither can proceed. Your program hangs silently — no exception, no error message, just... nothing.

Prevention strategies:

Always acquire locks in the same order. If both threads acquire lock_a first, then lock_b, no deadlock can occur.
Use timeouts:

if lock_b.acquire(timeout=2.0):
    try:
        ...  # Critical section that needs lock_b
    finally:
        lock_b.release()
else:
    print("Couldn't get lock_b, backing off")
    # Release lock_a, retry later, or fail gracefully

Avoid holding multiple locks when possible. Restructure your code so each critical section needs only one lock.
Use higher-level abstractions. Queue, concurrent.futures, and multiprocessing.Pool handle synchronization internally — you don't manage locks yourself.
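
The first strategy, a fixed lock order, can be sketched as follows. The Account class and the idea of ordering by account name are assumptions made for this example; any stable, global ordering would do. Two threads transfer in opposite directions, which would risk a deadlock if each simply locked "its" source account first.

```python
import threading


class Account:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Lock both accounts in a fixed global order (here: by name).

    Assumes src and dst are different accounts with distinct names.
    """
    first, second = sorted((src, dst), key=lambda acc: acc.name)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def repeat_transfer(src, dst, amount, times):
    for _ in range(times):
        transfer(src, dst, amount)


def run():
    """Transfer in both directions at once and return the final balances."""
    a, b = Account("a", 1000), Account("b", 1000)
    t1 = threading.Thread(target=repeat_transfer, args=(a, b, 1, 1000))
    t2 = threading.Thread(target=repeat_transfer, args=(b, a, 1, 400))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return a.balance, b.balance


if __name__ == "__main__":
    print(run())
```

Because both threads always take the lock of account "a" before the lock of account "b", neither can end up holding one lock while waiting for the other.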

The decision matrix: which tool when?

After years of writing concurrent Python, here's my practical decision matrix:

Scenario	Tool	Why
Many HTTP requests	`ThreadPoolExecutor` or `asyncio`	I/O-bound, GIL not a factor
Database queries in parallel	`ThreadPoolExecutor`	I/O-bound, shared connection pool
Processing large images	`ProcessPoolExecutor`	CPU-bound, need true parallelism
Number crunching (no NumPy)	`ProcessPoolExecutor`	CPU-bound Python code
Number crunching (with NumPy)	NumPy/threading	NumPy releases GIL internally
10,000+ concurrent connections	`asyncio`	Lowest overhead per connection
Simple background task	`threading.Thread`	Low overhead, easy to set up
Delayed one-shot task	`threading.Timer`	Runs a callable once after a delay; re-arm it for periodic work

A few important nuances:

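NumPy releases the GIL. If your CPU-bound work is dominated by large NumPy array operations, threading can actually work, because many of NumPy's C routines release the GIL during computation. The same goes for parts of other C-extension libraries (Pandas, scikit-learn, etc.), though not for every operation, so measure before relying on it.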

Process startup is expensive. Creating a process can take tens of milliseconds or more, depending on the platform and start method, and each process carries its own interpreter. Don't create processes for tiny tasks. Use a Pool or ProcessPoolExecutor to reuse processes across many tasks.

Async is not faster for CPU work. We covered asyncio in episodes #40 and #41 — it's excellent for I/O concurrency, but it's single-threaded. Don't reach for asyncio when your bottleneck is computation.

Combining threading and multiprocessing

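Sometimes you need both. A common pattern is multiprocessing for CPU-bound work with threading inside each process for I/O. The simplest variant, shown here, skips the inner threads and just runs both the I/O phase and the CPU phase inside each worker process:

from concurrent.futures import ProcessPoolExecutor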
import urllib.request
import time

def fetch_and_process(url):
    """Fetch a page (I/O) then do computation (CPU)."""
    # I/O-bound: fetch the page
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = resp.read()
    
    # CPU-bound: process the data (simulate with hash computation)
    import hashlib
    for _ in range(1000):
        data = hashlib.sha256(data).digest()
    
    return url, len(data)

if __name__ == '__main__':  # Needed for ProcessPoolExecutor when workers re-import this module
    urls = [
        'https://www.python.org',
        'https://docs.python.org/3/',
        'https://pypi.org',
        'https://peps.python.org',
    ] * 3  # 12 URLs total

    # Use ProcessPoolExecutor for the combined I/O + CPU work
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(fetch_and_process, urls))
    elapsed = time.perf_counter() - start

    print(f"Processed {len(results)} URLs in {elapsed:.2f}s")
    for url, size in results[:4]:
        print(f"  {url}: {size} bytes (after hashing)")

Each process handles both the I/O and CPU work for its assigned URLs. Since processes have their own GIL, the CPU work runs in true parallel across cores.
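
If the I/O and CPU phases can be separated, another option is to use a thread pool for the waiting and a process pool for the computing. The sketch below illustrates that split under simplifying assumptions: fake_fetch sleeps instead of downloading, the URLs are placeholders, and hash_rounds mirrors the hashing loop used above.

```python
import hashlib
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for a download: sleeps, then returns some bytes."""
    time.sleep(0.05)
    return url.encode("utf-8")


def hash_rounds(data, rounds=1000):
    """CPU-bound step: hash the data repeatedly, return the hex digest."""
    for _ in range(rounds):
        data = hashlib.sha256(data).digest()
    return data.hex()


def main():
    # Placeholder URLs; nothing is actually requested
    urls = [f"https://example.invalid/page/{i}" for i in range(12)]

    # Phase 1: overlap the waiting with threads
    with ThreadPoolExecutor(max_workers=8) as threads:
        payloads = list(threads.map(fake_fetch, urls))

    # Phase 2: spread the hashing across cores with processes
    with ProcessPoolExecutor() as processes:
        digests = list(processes.map(hash_rounds, payloads))

    print(f"Hashed {len(digests)} payloads")


if __name__ == "__main__":
    main()
```

The trade-off is that every payload has to be pickled and sent to a worker process, so this split pays off mainly when the CPU phase is heavy compared to the size of the data.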

The `if __name__ == '__main__'` guard

One gotcha that catches every Python beginner with multiprocessing:

from multiprocessing import Process

def worker():
    print("Worker running")

# This MUST be inside the guard on Windows/macOS:
if __name__ == '__main__':
    p = Process(target=worker)
    p.start()
    p.join()

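On Windows and macOS (with the "spawn" start method), new processes import the main module to set up. Without the guard, every child would try to start children of its own during that import; current Python versions stop this with a RuntimeError rather than spawning forever. Always protect multiprocessing code with if __name__ == '__main__'. On Linux, Python versions before 3.14 default to "fork", where the code works without the guard; from 3.14 on the default is "forkserver", which needs the guard as well. So it's good practice everywhere.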
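
If you're not sure which start method your platform uses, the multiprocessing module can tell you. This is a small inspection sketch; describe_start_methods is a helper name invented here.

```python
import multiprocessing as mp


def describe_start_methods():
    """Return (current default, all methods available on this platform)."""
    return mp.get_start_method(), mp.get_all_start_methods()


if __name__ == "__main__":
    current, available = describe_start_methods()
    print(f"Default start method: {current}")
    print(f"Available here:       {available}")
```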

Real-world example: parallel file processing

Let's put it all together with a practical example — processing a directory of text files in parallel:

from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path
import re
from collections import Counter

def analyze_file(filepath):
    """Analyze a single text file: word count, line count, top words."""
    path = Path(filepath)  # Accept both str and Path arguments
    text = path.read_text(encoding='utf-8', errors='ignore')
    words = re.findall(r'\b[a-z]+\b', text.lower())
    
    return {
        'file': path.name,
        'lines': text.count('\n'),
        'words': len(words),
        'unique_words': len(set(words)),
        'top_5': Counter(words).most_common(5),
    }

def analyze_directory(directory, pattern='*.txt', max_workers=4):
    """Analyze all matching files in parallel."""
    files = list(Path(directory).glob(pattern))
    
    if not files:
        print(f"No {pattern} files found in {directory}")
        return
    
    print(f"Analyzing {len(files)} files using {max_workers} processes...\n")
    
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {
            executor.submit(analyze_file, f): f for f in files
        }
        
        for future in as_completed(future_to_file):
            filepath = future_to_file[future]
            try:
                result = future.result()
                results.append(result)
                print(f"  Done: {result['file']} "
                      f"({result['words']:,} words, "
                      f"{result['unique_words']:,} unique)")
            except Exception as e:
                print(f"  Error processing {filepath.name}: {e}")
    
    # Summary
    total_words = sum(r['words'] for r in results)
    total_lines = sum(r['lines'] for r in results)
    print(f"\nTotal: {total_words:,} words, "
          f"{total_lines:,} lines across {len(results)} files")

if __name__ == '__main__':
    analyze_directory('/path/to/text/files', '*.txt')

Each file is processed by a separate process — true parallel execution on multiple cores. Results stream back via as_completed(), so you see progress as files finish. Error handling is built in via the Future pattern. And the if __name__ == '__main__' guard ensures clean process spawning.

Okay, to summarize

In this episode, we explored concurrency and parallelism in Python:

Concurrency (threading) manages multiple tasks by interleaving; parallelism (multiprocessing) executes them simultaneously
The GIL prevents threads from running Python bytecode in parallel — it exists to protect CPython's reference counting
Threading works great for I/O-bound tasks because I/O operations release the GIL
Multiprocessing provides true parallelism via separate processes, each with its own GIL
Race conditions occur when threads share mutable state without synchronization — use Lock, RLock, Semaphore, or Event
Processes communicate via Queues, Pipes, or shared Value/Array objects
concurrent.futures provides ThreadPoolExecutor and ProcessPoolExecutor with the same clean API
as_completed() yields results as they finish — no waiting for slow tasks
Deadlocks happen when locks are acquired in inconsistent order — always acquire in the same order, or use timeouts
NumPy (and similar C-extensions) release the GIL, so threading works for NumPy-heavy computation
Always use the if __name__ == '__main__' guard with multiprocessing

The golden rule: profile first, then choose your concurrency model based on whether the bottleneck is I/O (threads or async) or CPU (processes). Don't guess — measure ;-)