Resource Guardian

How the orchestration runtime protects itself from process-budget exhaustion — and balances load across both civilizations
Resolved — guard live, twin-balanced

Canopy — What Happened and What We Fixed

Incident — June 3, 2026
Process budget exhausted: 200/200 processes consumed. All operations halted for ~30 minutes.
A burst of concurrent background workers plus lingering deploy processes hit the container's hard limit. No automated system caught it because the safeguard that should have blocked new spawns existed in documentation but was never wired into the actual spawn path.
Key Insight
The guard existed. It just was never connected. The pre-spawn gate was documented as "the gate before any heavy subprocess" — but no caller actually invoked it before launching background workers. Aspirational documentation, not enforced code.

What broke: Too many concurrent background workers (each consuming 15–30 processes) launched on a single container, while lingering deploy processes piled up. The container's 200-process hard cap was hit in seconds — faster than any existing 15-minute health check could detect.

Why nothing caught it: Every safeguard monitored the wrong chokepoint. The health system watched daemons and known leak patterns. The deploy tool had its own budget check. But the actual agent-launch path — the one that creates the most processes — had zero resource check. The hook that fired on every agent launch only verified organizational routing ("is this going to the right team lead?"), not resource budget.

What we fixed: The pre-spawn gate is now wired directly into the agent-launch path. Before any new background worker spawns, the system checks process budget, thread count, and context usage inline (under 1 ms, with no subprocess spawned). If process usage reaches 85%, the gate first tries to reap known leaks and, if usage is still critical, blocks the spawn with guidance. A fast budget check after every action gives seconds-scale visibility. Deploy processes now clean up automatically on success. Heavy work overflows to the twin civilization (Secondary node) automatically, which roughly doubles throughput while preventing exhaustion on either side.

Can it silently recur? No. The guard is in the spawn path itself. Every background worker launch must pass through it. There is no alternate path.

What Changed (Summary)

Pre-Spawn Gate
Budget check wired into the actual agent-launch path. Blocks at 85% process usage. No bypass possible.
Fast Budget Visibility
Lightweight inline check after every action. Warns at 70%, critical at 85%. Seconds-scale, not 15-minute cycles.
Deploy Cleanup
Successful deploys now reap their own child processes immediately. Previously only cleaned up on timeout.
Twin-System Load Balancing
3-worker cap per container. Overflow dispatches to Secondary node automatically. Local queue fallback if twin is unreachable. Work is never dropped.

Understory — How the Guard Flows

The Pre-Spawn Gate (every background worker launch)

1. Background Worker Requested
Conductor (or any process) requests a new background agent. The pre-tool hook intercepts the request before it executes.
2. Read Process Budget (inline, <1ms)
Direct file reads from the container's process tracking, with no subprocess spawned. Gets current count and hard maximum.
Reads: pids.current + pids.max from cgroup
3. Check Context Window
Also reads context-usage state. Each new worker adds to the primary's context pressure, so spawning agents when context is bloated is dangerous.
Blocks if context ≥ 80%
4a. Budget OK (<70%) → Approve
Spawn proceeds normally. No delay, no overhead.
4b. Budget Warning (70–84%) → Warn + Approve
Spawn proceeds, but operator gets a visible warning: "Consider offloading to twin civilization."
140–169 out of 200 processes
4c. Budget Critical (≥85%) → Auto-Reap, Then Block or Dispatch
First attempts to free processes by reaping known leaks (stale deploy processes, orphans). If budget drops below 85% after reap, spawn proceeds. If still critical: blocks the spawn with explicit guidance ("Reap leaks or offload to Secondary node").
≥170 out of 200 processes = hard block
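The gate steps above can be sketched as a small decision function. This is a minimal illustration, not the actual hook code: the names `pre_spawn_gate`, `GateDecision`, `read_budget`, and `reap_leaks` are hypothetical, and the budget reader and reaper are passed in as callables so the logic stays testable. Only the thresholds (70%, 85%, 80% context) and the quoted messages come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

WARN_PCT = 70            # warn and approve at or above this process usage
CRITICAL_PCT = 85        # auto-reap, then block, at or above this usage
CONTEXT_BLOCK_PCT = 80   # block spawns when context usage reaches this


@dataclass
class GateDecision:
    allowed: bool
    level: str     # "ok", "warning", "critical", or "context"
    message: str


def _usage_pct(budget: Tuple[int, int]) -> float:
    current, maximum = budget
    return 100.0 * current / maximum


def pre_spawn_gate(
    read_budget: Callable[[], Tuple[int, int]],  # hypothetical: returns (current, max)
    context_pct: float,
    reap_leaks: Callable[[], None],              # hypothetical: frees known leaks
) -> GateDecision:
    """Decide whether a new background worker may spawn."""
    pct = _usage_pct(read_budget())

    # Step 3: a bloated context blocks the spawn regardless of process budget.
    if context_pct >= CONTEXT_BLOCK_PCT:
        return GateDecision(False, "context",
                            "Context usage too high to spawn a new worker.")

    # Step 4c: at critical usage, try one reap and re-read before blocking.
    if pct >= CRITICAL_PCT:
        reap_leaks()
        pct = _usage_pct(read_budget())
        if pct >= CRITICAL_PCT:
            return GateDecision(False, "critical",
                                "Reap leaks or offload to Secondary node")

    # Step 4b: warn but approve.
    if pct >= WARN_PCT:
        return GateDecision(True, "warning",
                            "Consider offloading to twin civilization.")

    # Step 4a: approve silently.
    return GateDecision(True, "ok", "")
```

In this sketch a successful reap that drops usage below 85% still produces a warning if usage stays at or above 70%, which matches the "warn + approve" band.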

Twin-System Load Balancing (Parallel dispatch)

1. Local Cap: 3 Concurrent Background Workers
Hard-enforced limit per container. Each background worker consumes 15–30 processes (runtime + interpreter children + threads). Three is the safe maximum at a 200-process budget.
2. 4th Worker Requested → Check Twin
SSH health check to Secondary node (5-second timeout). Reads Secondary node's process budget. If Secondary node has capacity, the work is dispatched there.
3. Fallback: Local Queue
If Secondary node is unreachable or also at capacity, the work enters a local queue. When a running worker finishes and frees budget, the queued work starts. Work is never dropped.
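The dispatch order described above (local first, then twin, then queue) could look roughly like the following. This is a sketch under assumptions: `WorkerDispatcher` and `twin_has_capacity` are hypothetical names, and the SSH health check with its 5-second timeout is represented only by a callable that returns a boolean or raises when the twin is unreachable.

```python
from collections import deque
from typing import Callable, Deque, Optional, Set

MAX_LOCAL_WORKERS = 3  # per-container cap from the text above


class WorkerDispatcher:
    """Routes each job to 'local', 'twin', or 'queued'; never drops work."""

    def __init__(self, twin_has_capacity: Callable[[], bool]) -> None:
        # Hypothetical probe: stands in for the SSH health check plus
        # remote budget read. It may raise if the twin is unreachable.
        self.twin_has_capacity = twin_has_capacity
        self.running: Set[str] = set()
        self.queue: Deque[str] = deque()

    def submit(self, job_id: str) -> str:
        # 1. Run locally while under the cap.
        if len(self.running) < MAX_LOCAL_WORKERS:
            self.running.add(job_id)
            return "local"
        # 2. Otherwise ask the twin; any failure counts as "unreachable".
        try:
            twin_ok = self.twin_has_capacity()
        except Exception:
            twin_ok = False
        if twin_ok:
            return "twin"
        # 3. Fall back to the local queue.
        self.queue.append(job_id)
        return "queued"

    def finish(self, job_id: str) -> Optional[str]:
        """Mark a local job done; start and return the next queued job, if any."""
        self.running.discard(job_id)
        if self.queue and len(self.running) < MAX_LOCAL_WORKERS:
            next_job = self.queue.popleft()
            self.running.add(next_job)
            return next_job
        return None
```

Treating every probe failure as "no capacity" keeps the spawn path from stalling on the twin, at the cost of queueing work that the twin might have accepted a moment later.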

Fast Budget Check (every action)

1. Post-Action Hook Fires
After every tool use, the post-action hook reads the process budget inline (two file reads, under 1ms). No subprocess is spawned. This is critical because spawning a subprocess to check process budget would worsen the very problem it monitors.
2. Threshold Check
At 70%: visible warning. At 85%: critical alert with reap instructions. This gives seconds-scale visibility versus the previous 15-minute health cycle, roughly a 100x improvement in detection speed.
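As an illustration of the "two file reads, no subprocess" approach, the sketch below reads `pids.current` and `pids.max` directly. The directory `/sys/fs/cgroup` is an assumption (the usual location inside a cgroup v2 container; cgroup v1 typically uses a `pids/` subdirectory), and the function names and alert wording are hypothetical. On cgroup v2, `pids.max` may contain the literal string `max` when no limit is set, which the sketch treats as "no budget to check".

```python
from pathlib import Path
from typing import Optional, Tuple


def read_process_budget(cgroup_dir: str = "/sys/fs/cgroup") -> Tuple[int, Optional[int]]:
    """Return (current, maximum) using two plain file reads, no subprocess."""
    base = Path(cgroup_dir)  # assumed location; adjust for the actual container
    current = int((base / "pids.current").read_text().strip())
    raw_max = (base / "pids.max").read_text().strip()
    maximum = None if raw_max == "max" else int(raw_max)
    return current, maximum


def budget_alert(current: int, maximum: Optional[int]) -> Optional[str]:
    """Map usage to the 70% warning and 85% critical levels described above."""
    if maximum is None:
        return None  # unlimited cgroup: nothing to compare against
    pct = 100.0 * current / maximum
    if pct >= 85:
        return f"CRITICAL: {current}/{maximum} processes; reap leaks now"
    if pct >= 70:
        return f"WARNING: {current}/{maximum} processes"
    return None
```

A post-action hook could call `budget_alert(*read_process_budget())` and print the result only when it is not `None`.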

Background Reap Cycle

Between the 15-minute health checks, a lightweight 2-minute reap cycle targets known process leaks: stale deploy tool processes (now 120-second max age, down from 600), orphaned interpreter processes, and accumulated agent-related processes. This fills the gap between the per-action fast check (warns only) and the full health cycle (reaps + reports).
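The selection step of such a reap cycle might be sketched as below. Everything here is illustrative: `ProcInfo`, `select_reap_targets`, and the process names `deploy-tool` and `python` are hypothetical placeholders, and the sketch covers only the first two leak patterns (stale deploy processes and orphaned interpreters). It returns PIDs rather than killing anything, so the policy can be checked separately from the signal delivery.

```python
from dataclasses import dataclass
from typing import Iterable, List

DEPLOY_MAX_AGE_S = 120  # lowered from 600 s, per the text above


@dataclass
class ProcInfo:
    pid: int
    name: str           # hypothetical: short command name
    age_s: float        # seconds since the process started
    parent_alive: bool  # False if the original parent has exited


def select_reap_targets(
    procs: Iterable[ProcInfo],
    deploy_name: str = "deploy-tool",   # placeholder name, an assumption
    interpreter_name: str = "python",   # placeholder name, an assumption
) -> List[int]:
    """Return PIDs matching the known leak patterns; everything else is left alone."""
    targets = []
    for p in procs:
        stale_deploy = p.name == deploy_name and p.age_s > DEPLOY_MAX_AGE_S
        orphan_interpreter = p.name == interpreter_name and not p.parent_alive
        if stale_deploy or orphan_interpreter:
            targets.append(p.pid)
    return targets
```

A 2-minute timer would call this with a fresh process snapshot and signal only the returned PIDs.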

Root Level — How It's Built

Implementation details, thresholds, file-level changes, and the full post-mortem. Collapsed by default — expand what you need.

Why did the container hit 200/200 processes?
Too many concurrent processes: 3–4 background workers (each 15–30 processes), multiple lingering deploy tool processes, SSH sessions, hook subprocesses, and 4 always-on daemons.
Why were so many processes running simultaneously?
The Conductor launched 3+ background agents plus triggered multiple page deploys (each spawning a deploy tool child process), with no concurrency limiter or budget check on agent spawn.
Why was there no concurrency limiter on agent spawns?
The preflight gate existed but was only called from the deploy tool path, not the agent-launch path. The hook that fires on agent launches only checked organizational routing (CEO rule), not resource budget.
Why did existing monitors not catch the spike?
Health checks run every 15 minutes. A spawn-burst reaches 200 processes in seconds — ~100x faster than the polling interval. The resource governor only reaps known-pattern leaks; if the top consumers are "legitimate" agents, it sees nothing to reap.
Why did deploy processes pile up?
The deploy tool has a 5-minute timeout but did NOT kill child processes after a successful deploy — only on timeout. The resource governor only reaps deploy processes older than 10 minutes. Multiple 3–4 minute deploys accumulated "legally."
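One common way to make cleanup happen on success as well as on timeout is to run the deploy command in its own process group and kill that group in a `finally` block. The sketch below is an assumption about how this could be done on a POSIX system, not the deploy tool's actual code; `run_deploy` is a hypothetical name, and the 300-second default mirrors the 5-minute timeout mentioned above.

```python
import os
import signal
import subprocess
from typing import Sequence


def run_deploy(cmd: Sequence[str], timeout_s: float = 300) -> int:
    """Run a deploy command, then kill any children it left behind.

    Cleanup runs on success, failure, and timeout alike.
    Raises subprocess.TimeoutExpired if the command exceeds timeout_s.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its process group ID equals its PID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    finally:
        try:
            # Kill every process still in the group, including lingering children.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # group already empty: nothing was left behind
        proc.wait()  # reap the leader if the timeout path left it unreaped
```

Children that move themselves into a different session or process group would escape this cleanup, so an age-based reaper remains useful as a backstop.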
Safeguard | Existed? | Why It Missed
Resource Governor | Yes | Not on spawn path. Only called by the deploy tool and health check. A burst of 3 agents in 30 seconds never passes through any gate.
Core system Preflight | Yes | Same gap: not on spawn path. Comment at line 39 says "the GATE before any heavy subprocess" but this was aspirational, not enforced. No caller invokes it before agent launch.
Health Check (Health monitor) | Yes | 15-minute cadence. Exhaustion happens in seconds. Also: Health monitor only reaps known-pattern leaks. If top consumers are "legitimate" agents, Health monitor sees nothing to reap.
Partner Health | Yes | Wrong scope: monitors the other container (Secondary node), not Primary node's own budget.
Post-Action Hook | Yes | Monitors context tokens, not process budget. Context was at 75%, well below threshold. Zero awareness of process count.
Pre-Tool Gate | Yes | Fires on agent launches but checks organizational routing (CEO rule), not resources. An agent passes the CEO check but consumes 15–30 processes.
Context Manager | Yes | Wrong metric entirely. Monitors token consumption, not process budget.
Pattern
Seven safeguards existed. All seven watched the wrong chokepoint. The system monitored what it already monitored (daemons, known leaks, context tokens) but had zero gate on the path that creates the most processes: background agent launches.
Parameter | Value | Purpose
Process budget warning | 70% (140/200) | Visible warning, spawn still proceeds
Process budget critical | 85% (170/200) | Auto-reap attempted, then block if still critical
Context window block | 80% | Block new spawns when context is bloated
Max concurrent workers | 3 per container | Hard cap, overflow to twin or queue
Thread count warning | >400 threads | cgroup counts threads as PIDs; 2 agents x 30 threads = 60+ against the 200 limit
Deploy process max age | 120s (was 600s) | Deploy should complete in 2 min; older = stuck
Twin SSH timeout | 5s | Don't block on unreachable twin
Background reap interval | 2 min | Fills gap between per-action warn and 15-min full cycle
File | Change
unified_pretool_gate (pre-tool hook) | Added process budget + context check to the agent-launch branch. Inline file reads, no subprocess. Blocks at 85%, warns at 70%.
post_tool_use (post-action hook) | Added inline process budget read after every action. Two file reads (<1ms). Warns at 70%, critical at 85%.
deploy_pipeline (deploy tool) | Post-deploy cleanup: kills lingering deploy child processes on SUCCESS, not just on timeout.
resource_governor (resource management) | Deploy process max age lowered from 600s to 120s. Thread count surfaced in status report. Per-process thread breakdown added.
Time (CT) | Event
~12:30 | Conductor running 3+ background agents (holly-fields, event-lifecycle, persistence-skill, infra)
~12:45 | Multiple deploys triggered (Holly tracker regens, dashboard, page updates)
~12:50 | Deploy tool processes from earlier deploys still alive (no post-deploy reap)
~12:55 | Process count approaching 200. Hook processes (each forks an interpreter) compound pressure
~13:00 | 200/200 hit. "Cannot fork" errors. New agents fail, SSH fails, tools fail
~13:05 | Manual kill of deploy/interpreter orphans frees budget. Conductor restarted
~13:16 | Fresh session launched. Budget at 93–95/200 post-recovery