gh-115634: Fix ProcessPoolExecutor deadlock with max_tasks_per_child#140900

Merged
gpshead merged 1 commit into python:main from gpshead:fix-gh-115634-process-pool-hang
Jul 3, 2026
Conversation

@gpshead gpshead commented Nov 2, 2025

Member

Summary

Fix a deadlock in ProcessPoolExecutor when using max_tasks_per_child.
The executor stopped scheduling queued tasks after a worker process exited
upon reaching its task limit.

from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    with ProcessPoolExecutor(1, max_tasks_per_child=2) as exe:
        futs = [exe.submit(print, i) for i in range(10)]

Prints 0 and 1, then hangs forever.

The bug

The idle worker semaphore counts task completions, not idle workers: a
token is released on every non-final task completion, but only submit()
consumes tokens. When a backlog is queued, workers take their next task
directly from the call queue, so a token can outlive the idle period it
was released for -- and outlive the worker itself once the worker exits
at its max_tasks_per_child limit. The worker-replacement path then
acquired such a stale token, concluded an idle worker existed, and never
spawned a replacement. With no workers left, the queued tasks deadlocked.
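
A toy, single-threaded model of the bookkeeping described above may make the sequence easier to follow. It is not the CPython implementation: `ToyPool`, `idle_tokens`, and `_on_worker_exit` are hypothetical names, and real workers run concurrently rather than in a loop.

```python
class ToyPool:
    """Single-threaded model of the pre-fix bookkeeping (hypothetical names)."""

    def __init__(self, max_workers, max_tasks_per_child):
        self.max_workers = max_workers
        self.max_tasks_per_child = max_tasks_per_child
        self.idle_tokens = 0   # stands in for the idle worker semaphore
        self.workers = []      # remaining task budget of each live worker
        self.queue = []        # stands in for the backlog of queued calls
        self.done = []

    def submit(self, task):
        self.queue.append(task)
        if self.idle_tokens > 0:
            self.idle_tokens -= 1          # submit() consumes a token if one exists
        elif len(self.workers) < self.max_workers:
            self.workers.append(self.max_tasks_per_child)

    def _on_worker_exit(self):
        # Pre-fix replacement path: a leftover token is read as "an idle
        # worker exists", so no replacement is spawned.
        if self.idle_tokens > 0:
            self.idle_tokens -= 1
        elif len(self.workers) < self.max_workers:
            self.workers.append(self.max_tasks_per_child)

    def run(self):
        while self.queue and self.workers:
            budget = self.workers.pop(0) - 1
            self.done.append(self.queue.pop(0))
            if budget > 0:
                self.idle_tokens += 1      # released on each non-final completion
                self.workers.append(budget)
            else:
                self._on_worker_exit()     # worker hit max_tasks_per_child
        return self.done


pool = ToyPool(max_workers=1, max_tasks_per_child=2)
for i in range(10):
    pool.submit(i)
completed = pool.run()
print(completed, "live workers:", len(pool.workers), "still queued:", len(pool.queue))
```

In this model the run stops after tasks 0 and 1 with eight tasks still queued and no live workers, which mirrors the hang in the reproducer.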

Present since max_tasks_per_child was introduced in Python 3.11
(GH-27373). Affects 3.11 through 3.15.0beta3.

The fix

Worker replacement after a process exit no longer consults the semaphore:
a new _replace_dead_worker() spawns a replacement whenever the pool is
below max_workers, using len(self._processes) as the source of truth.
The submit() path is unchanged, preserving on-demand spawning and idle
worker reuse (bpo-39207). This is the semantics suggested by @pitrou in
the GH-115642 review; the approach was proposed and production-tested by
@tabrezm.

Replacement is skipped when the executor is shutting down with no
pending work, so a worker reaching its task limit during shutdown does
not spawn a process that shutdown immediately reaps. The previously
disabled gh-90622 assertion in _adjust_process_count() is re-enabled:
with worker replacement split out, that method is reachable only from
submit() with a non-fork start method, so the assertion is valid now.
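
The replacement decision described in the two paragraphs above can be sketched as a pure function. This is an illustration only, not the body of `_replace_dead_worker()`: the function name and its parameters are hypothetical, with `num_processes` standing in for `len(self._processes)`.

```python
def should_replace_dead_worker(num_processes, max_workers,
                               shutting_down, has_pending_work):
    """Decide whether to spawn a replacement after a worker exits.

    Hypothetical helper for illustration; the semaphore is deliberately
    not consulted.
    """
    if shutting_down and not has_pending_work:
        return False   # shutdown would reap the new process immediately
    return num_processes < max_workers   # the process count is the source of truth


# The only worker just exited and a backlog remains: replace it.
print(should_replace_dead_worker(0, 1, shutting_down=False, has_pending_work=True))
# Shutting down with nothing left to run: skip the replacement.
print(should_replace_dead_worker(0, 1, shutting_down=True, has_pending_work=False))
```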

The semaphore still drifts above the true idle count while a backlog
exists. After this change the drift is harmless: it can only briefly
suppress pool growth below max_workers (each suppressed spawn drains
one token), never lose a worker. Whether the semaphore should be
redesigned or removed is left as a follow-up (see GH-115642 discussion).
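
To illustrate why the remaining drift only delays growth, here is a toy model of the submit() path starting with stale tokens. The function name and its tuple return value are hypothetical, and the model assumes each submit() either consumes one token or spawns at most one worker.

```python
def submit_with_drift(idle_tokens, num_processes, max_workers):
    """Model one submit() call (hypothetical helper).

    Returns (idle_tokens, num_processes, spawned).
    """
    if idle_tokens > 0:
        # A stale token suppresses the spawn but is drained in the process.
        return idle_tokens - 1, num_processes, False
    if num_processes < max_workers:
        return idle_tokens, num_processes + 1, True
    return idle_tokens, num_processes, False


# Three stale tokens, one live worker, room to grow to four workers.
tokens, procs = 3, 1
history = []
for _ in range(5):
    tokens, procs, spawned = submit_with_drift(tokens, procs, max_workers=4)
    history.append((tokens, procs, spawned))
print(history)
```

In this model the first three submissions each drain one token without spawning, and the pool resumes growing on the fourth. The worker count never drops.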

The documentation warning added in GH-140897 is replaced with a
versionchanged:: next entry, so each branch's docs will name the
release that fixed it; backports should likewise replace the notes
added by GH-143302 / GH-143303.


Related Issues

Acknowledgments

Claude Sonnet 4.5 helped do the original work on this PR last year. Claude Fable 5 helped tighten it up and validate everything.

@gpshead

This comment was marked as outdated.

github-actions Bot commented May 1, 2026

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions Bot added the stale label (Stale PR or inactive for long period of time.) May 1, 2026
…child

The idle worker semaphore counts task completions, not idle workers, so
it can hold a stale token released by a worker that later exited upon
reaching its max_tasks_per_child limit. The worker replacement path
consumed such tokens and skipped spawning a replacement, deadlocking
the remaining queued tasks once no workers were left.

Replace dead workers based on len(self._processes) without consulting
the semaphore. The submit() path is unchanged, preserving on-demand
spawning and idle worker reuse.

Replace the documentation note added in GH-140897 with a versionchanged
entry now that the bug is fixed.

Based on a fix proposed by Tabrez Mohammed.
@gpshead gpshead force-pushed the fix-gh-115634-process-pool-hang branch from c8dd63c to a0d0402 Compare July 3, 2026 05:39
@read-the-docs-community

Documentation build overview

📚 cpython-previews | 🛠️ Build #33422489 | 📁 Comparing a0d0402 against main (0a13efc)

  🔍 Preview build  

2 files changed
± library/concurrent.futures.html
± whatsnew/changelog.html

@gpshead gpshead self-assigned this Jul 3, 2026
@gpshead gpshead added the labels needs backport to 3.13 (bugs and security fixes), needs backport to 3.14 (bugs and security fixes), and needs backport to 3.15 (pre-release feature fixes, bugs and security fixes), and removed the stale label (Stale PR or inactive for long period of time.) Jul 3, 2026
@gpshead gpshead marked this pull request as ready for review July 3, 2026 05:46
@gpshead gpshead merged commit b706767 into python:main Jul 3, 2026
61 checks passed
@miss-islington-app

Thanks @gpshead for the PR 🌮🎉.. I'm working now to backport this PR to: 3.13, 3.14, 3.15.
🐍🍒⛏🤖

bedevere-app Bot commented Jul 3, 2026

GH-152926 is a backport of this pull request to the 3.15 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.15 label (pre-release feature fixes, bugs and security fixes) Jul 3, 2026
@miss-islington-app

Sorry, @gpshead, I could not cleanly backport this to 3.13 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker b706767d8fd7d21afc3f156fb9c173bc99855e0e 3.13

bedevere-app Bot commented Jul 3, 2026

GH-152927 is a backport of this pull request to the 3.14 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.14 label (bugs and security fixes) Jul 3, 2026
gpshead added a commit that referenced this pull request Jul 3, 2026
…_child (GH-140900) (#152927)

gh-115634: Fix ProcessPoolExecutor deadlock with max_tasks_per_child (GH-140900)

(cherry picked from commit b706767)

Co-authored-by: Gregory P. Smith <68491+gpshead@users.noreply.github.com>
gpshead added a commit that referenced this pull request Jul 3, 2026
…_child (GH-140900) (#152926)

gh-115634: Fix ProcessPoolExecutor deadlock with max_tasks_per_child (GH-140900)

(cherry picked from commit b706767)

Co-authored-by: Gregory P. Smith <68491+gpshead@users.noreply.github.com>
bedevere-app Bot commented Jul 3, 2026

GH-152928 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.13 label (bugs and security fixes) Jul 3, 2026
gpshead added a commit that referenced this pull request Jul 3, 2026
…_child (GH-140900) (#152928)

(cherry picked from commit b706767)
2 participants