Resource Guardian - Process Budget Protection

Canopy — What Happened and What We Fixed

Incident — June 3, 2026

Process budget exhausted: 200/200 processes consumed. All operations halted for ~30 minutes.

A burst of concurrent background workers plus lingering deploy processes hit the container's hard limit. No automated system caught it because the safeguard that should have blocked new spawns existed in documentation but was never wired into the actual spawn path.

Key Insight

The guard existed. It just was never connected. The pre-spawn gate was documented as "the gate before any heavy subprocess" — but no caller actually invoked it before launching background workers. Aspirational documentation, not enforced code.

What broke: Too many concurrent background workers (each consuming 15–30 processes) launched on a single container, while lingering deploy processes piled up. The container's 200-process hard cap was hit in seconds — faster than any existing 15-minute health check could detect.

Why nothing caught it: Every safeguard monitored the wrong chokepoint. The health system watched daemons and known leak patterns. The deploy tool had its own budget check. But the actual agent-launch path — the one that creates the most processes — had zero resource check. The hook that fired on every agent launch only verified organizational routing ("is this going to the right team lead?"), not resource budget.

What we fixed: The pre-spawn gate is now wired directly into the agent-launch path. Before any new background worker spawns, the system checks process budget, thread count, and context usage inline (under 1ms, no overhead). If budget hits 85%, the spawn is blocked with guidance. A fast budget check after every action gives seconds-scale visibility. Deploy processes now clean up automatically on success. And heavy work overflows to the twin civilization (Secondary node) automatically — roughly doubling throughput while preventing exhaustion on either side.

Can it silently recur? No. The guard is in the spawn path itself. Every background worker launch must pass through it. There is no alternate path.

What Changed (Summary)

Pre-Spawn Gate

Budget check wired into the actual agent-launch path. Blocks at 85% process usage. No bypass possible.

Fast Budget Visibility

Lightweight inline check after every action. Warns at 70%, critical at 85%. Seconds-scale, not 15-minute cycles.

Deploy Cleanup

Successful deploys now reap their own child processes immediately. Previously only cleaned up on timeout.

Twin-System Load Balancing

3-worker cap per container. Overflow dispatches to Secondary node automatically. Local queue fallback if twin is unreachable. Work is never dropped.

Understory — How the Guard Flows

The Pre-Spawn Gate (every background worker launch)

Background Worker Requested

Conductor (or any process) requests a new background agent. The pre-tool hook intercepts the request before it executes.

Read Process Budget (inline, <1ms)

Direct file reads from the container's process tracking — no subprocess spawned. Gets current count and hard maximum.

Reads: pids.current + pids.max from cgroup

Check Context Window

Also reads context-usage state. Each new worker adds to the primary's context pressure — spawning agents when context is bloated is dangerous.

Blocks if context ≥ 80%

Budget OK (<70%) → Approve

Spawn proceeds normally. No delay, no overhead.

Budget Warning (70–84%) → Warn + Approve

Spawn proceeds, but operator gets a visible warning: "Consider offloading to twin civilization."

140–169 out of 200 processes

Budget Critical (≥85%) → Auto-Reap, Then Block or Dispatch

First attempts to free processes by reaping known leaks (stale deploy processes, orphans). If budget drops below 85% after reap, spawn proceeds. If still critical: blocks the spawn with explicit guidance ("Reap leaks or offload to Secondary node").

≥170 out of 200 processes = hard block

Twin-System Load Balancing (Parallel dispatch)

Local Cap: 3 Concurrent Background Workers

Hard-enforced limit per container. Each background worker consumes 15–30 processes (runtime + interpreter children + threads). Three is the safe maximum at a 200-process budget.

4th Worker Requested → Check Twin

SSH health check to Secondary node (5-second timeout). Reads Secondary node's process budget. If Secondary node has capacity, the work is dispatched there.

Fallback: Local Queue

If Secondary node is unreachable or also at capacity, the work enters a local queue. When a running worker finishes and frees budget, the queued work starts. Work is never dropped.

Fast Budget Check (every action)

Post-Action Hook Fires

After every tool use, the post-action hook reads the process budget inline (two file reads, under 1ms). No subprocess spawned — this is critical because spawning a subprocess to check process budget would worsen the very problem it monitors.

Threshold Check

At 70%: visible warning. At 85%: critical alert with reap instructions. This gives seconds-scale visibility versus the previous 15-minute health cycle — a 100x improvement in detection speed.

Background Reap Cycle

Between the 15-minute health checks, a lightweight 2-minute reap cycle targets known process leaks: stale deploy tool processes (now 120-second max age, down from 600), orphaned interpreter processes, and accumulated agent-related processes. This fills the gap between the per-action fast check (warns only) and the full health cycle (reaps + reports).

Root Level — How It's Built

Implementation details, thresholds, file-level changes, and the full post-mortem. Collapsed by default — expand what you need.

Why did the container hit 200/200 processes?

Too many concurrent processes: 3–4 background workers (each 15–30 processes), multiple lingering deploy tool processes, SSH sessions, hook subprocesses, and 4 always-on daemons.

Why were so many processes running simultaneously?

The Conductor launched 3+ background agents plus triggered multiple page deploys (each spawning a deploy tool child process), with no concurrency limiter or budget check on agent spawn.

Why was there no concurrency limiter on agent spawns?

The preflight gate existed but was only called from the deploy tool path, not the agent-launch path. The hook that fires on agent launches only checked organizational routing (CEO rule), not resource budget.

Why did existing monitors not catch the spike?

Health checks run every 15 minutes. A spawn-burst reaches 200 processes in seconds — ~100x faster than the polling interval. The resource governor only reaps known-pattern leaks; if the top consumers are "legitimate" agents, it sees nothing to reap.

Why did deploy processes pile up?

The deploy tool has a 5-minute timeout but did NOT kill child processes after a successful deploy — only on timeout. The resource governor only reaps deploy processes older than 10 minutes. Multiple 3–4 minute deploys accumulated "legally."

Safeguard	Existed?	Why It Missed
Resource Governor	Yes	Not on spawn path. Only called by the deploy tool and health check. A burst of 3 agents in 30 seconds never passes through any gate.
Core system Preflight	Yes	Same gap: not on spawn path. Comment at line 39 says "the GATE before any heavy subprocess" but this was aspirational, not enforced. No caller invokes it before agent launch.
Health Check (Health monitor)	Yes	15-minute cadence. Exhaustion happens in seconds. Also: Health monitor only reaps known-pattern leaks. If top consumers are "legitimate" agents, Health monitor sees nothing to reap.
Partner Health	Yes	Wrong scope: monitors the other container (Secondary node), not Primary node's own budget.
Post-Action Hook	Yes	Monitors context tokens, not process budget. Context was at 75% — well below threshold. Zero awareness of process count.
Pre-Tool Gate	Yes	Fires on agent launches but checks organizational routing (CEO rule), not resources. An agent passes the CEO check but consumes 15–30 processes.
Context Manager	Yes	Wrong metric entirely. Monitors token consumption, not process budget.

Pattern

Seven safeguards existed. All seven watched the wrong chokepoint. The system monitored what it already monitored (daemons, known leaks, context tokens) but had zero gate on the path that creates the most processes: background agent launches.

Parameter	Value	Purpose
Process budget warning	70% (140/200)	Visible warning, spawn still proceeds
Process budget critical	85% (170/200)	Auto-reap attempted, then block if still critical
Context window block	80%	Block new spawns when context is bloated
Max concurrent workers	3 per container	Hard cap, overflow to twin or queue
Thread count warning	>400 threads	cgroup counts threads as PIDs; 2 agents x 30 threads = 60+ against the 200 limit
Deploy process max age	120s (was 600s)	Deploy should complete in 2 min; older = stuck
Twin SSH timeout	5s	Don't block on unreachable twin
Background reap interval	2 min	Fills gap between per-action warn and 15-min full cycle

File	Change
`unified_pretool_gate` (pre-tool hook)	Added process budget + context check to the agent-launch branch. Inline file reads, no subprocess. Blocks at 85%, warns at 70%.
`post_tool_use` (post-action hook)	Added inline process budget read after every action. Two file reads (<1ms). Warns at 70%, critical at 85%.
`deploy_pipeline` (deploy tool)	Post-deploy cleanup: kills lingering deploy child processes on SUCCESS, not just on timeout.
`resource_governor` (resource management)	Deploy process max age lowered from 600s to 120s. Thread count surfaced in status report. Per-process thread breakdown added.

Time (CT)	Event
~12:30	Conductor running 3+ background agents (holly-fields, event-lifecycle, persistence-skill, infra)
~12:45	Multiple deploys triggered (Holly tracker regens, dashboard, page updates)
~12:50	Deploy tool processes from earlier deploys still alive (no post-deploy reap)
~12:55	Process count approaching 200. Hook processes (each forks an interpreter) compound pressure
~13:00	200/200 hit. "Cannot fork" errors. New agents fail, SSH fails, tools fail
~13:05	Manual kill of deploy/interpreter orphans frees budget. Conductor restarted
~13:16	Fresh session launched. Budget at 93–95/200 post-recovery