What broke: Too many concurrent background workers (each consuming 15–30 processes) launched on a single container, while lingering deploy processes piled up. The container's 200-process hard cap was hit in seconds — faster than any existing 15-minute health check could detect.
Why nothing caught it: Every safeguard monitored the wrong chokepoint. The health system watched daemons and known leak patterns. The deploy tool had its own budget check. But the actual agent-launch path — the one that creates the most processes — had zero resource check. The hook that fired on every agent launch only verified organizational routing ("is this going to the right team lead?"), not resource budget.
What we fixed: The pre-spawn gate is now wired directly into the agent-launch path. Before any new background worker spawns, the system checks process budget, thread count, and context usage inline (under 1ms, no overhead). If budget hits 85%, the spawn is blocked with guidance. A fast budget check after every action gives seconds-scale visibility. Deploy processes now clean up automatically on success. And heavy work overflows to the twin civilization (Secondary node) automatically — roughly doubling throughput while preventing exhaustion on either side.
Can it silently recur? No. The guard is in the spawn path itself. Every background worker launch must pass through it. There is no alternate path.
Between the 15-minute health checks, a lightweight 2-minute reap cycle targets known process leaks: stale deploy tool processes (now 120-second max age, down from 600), orphaned interpreter processes, and accumulated agent-related processes. This fills the gap between the per-action fast check (warns only) and the full health cycle (reaps + reports).
Implementation details, thresholds, file-level changes, and the full post-mortem. Collapsed by default — expand what you need.
| Safeguard | Existed? | Why It Missed |
|---|---|---|
| Resource Governor | Yes | Not on spawn path. Only called by the deploy tool and health check. A burst of 3 agents in 30 seconds never passes through any gate. |
| Core system Preflight | Yes | Same gap: not on spawn path. Comment at line 39 says "the GATE before any heavy subprocess" but this was aspirational, not enforced. No caller invokes it before agent launch. |
| Health Check (Health monitor) | Yes | 15-minute cadence. Exhaustion happens in seconds. Also: Health monitor only reaps known-pattern leaks. If top consumers are "legitimate" agents, Health monitor sees nothing to reap. |
| Partner Health | Yes | Wrong scope: monitors the other container (Secondary node), not Primary node's own budget. |
| Post-Action Hook | Yes | Monitors context tokens, not process budget. Context was at 75% — well below threshold. Zero awareness of process count. |
| Pre-Tool Gate | Yes | Fires on agent launches but checks organizational routing (CEO rule), not resources. An agent passes the CEO check but consumes 15–30 processes. |
| Context Manager | Yes | Wrong metric entirely. Monitors token consumption, not process budget. |
| Parameter | Value | Purpose |
|---|---|---|
| Process budget warning | 70% (140/200) | Visible warning, spawn still proceeds |
| Process budget critical | 85% (170/200) | Auto-reap attempted, then block if still critical |
| Context window block | 80% | Block new spawns when context is bloated |
| Max concurrent workers | 3 per container | Hard cap, overflow to twin or queue |
| Thread count warning | >400 threads | cgroup counts threads as PIDs; 2 agents x 30 threads = 60+ against the 200 limit |
| Deploy process max age | 120s (was 600s) | Deploy should complete in 2 min; older = stuck |
| Twin SSH timeout | 5s | Don't block on unreachable twin |
| Background reap interval | 2 min | Fills gap between per-action warn and 15-min full cycle |
| File | Change |
|---|---|
unified_pretool_gate (pre-tool hook) |
Added process budget + context check to the agent-launch branch. Inline file reads, no subprocess. Blocks at 85%, warns at 70%. |
post_tool_use (post-action hook) |
Added inline process budget read after every action. Two file reads (<1ms). Warns at 70%, critical at 85%. |
deploy_pipeline (deploy tool) |
Post-deploy cleanup: kills lingering deploy child processes on SUCCESS, not just on timeout. |
resource_governor (resource management) |
Deploy process max age lowered from 600s to 120s. Thread count surfaced in status report. Per-process thread breakdown added. |
| Time (CT) | Event |
|---|---|
| ~12:30 | Conductor running 3+ background agents (holly-fields, event-lifecycle, persistence-skill, infra) |
| ~12:45 | Multiple deploys triggered (Holly tracker regens, dashboard, page updates) |
| ~12:50 | Deploy tool processes from earlier deploys still alive (no post-deploy reap) |
| ~12:55 | Process count approaching 200. Hook processes (each forks an interpreter) compound pressure |
| ~13:00 | 200/200 hit. "Cannot fork" errors. New agents fail, SSH fails, tools fail |
| ~13:05 | Manual kill of deploy/interpreter orphans frees budget. Conductor restarted |
| ~13:16 | Fresh session launched. Budget at 93–95/200 post-recovery |