Real software. Real tests (just ran). Real live evidence. No claims.
A priority queue that decides which jobs run first. Health checks go before background work. Every recurring task (cleanup, snapshots, credential sync) gets submitted here and runs in priority order.
tools/agent_rm.py — 767 lines of Python
def submit_task(self, task_id, priority, payload, max_runtime_s=30.0): """Submit a task to the MLFQ. priority: 0 (interactive), 1 (standard), 2 (background)""" if priority not in (0, 1, 2): raise ValueError(f"Invalid priority {priority}") if self._find_task(task_id) is not None: raise ValueError(f"Task {task_id} already exists") task = Task(task_id=task_id, priority=priority, original_priority=priority, payload=payload, max_runtime_s=max_runtime_s, ...) self.queues[priority].append(task) self._persist() return {"task_id": task_id, "queue": priority, ...}
A background watchdog that checks every running process, detects ones that are stuck or exceeded their time limit, and kills them cleanly. Prevents the system from accumulating zombie processes that waste resources.
tools/zombie_reaper.py — 793 lines of Python
def check_all(self) -> List[ZombieReport]: """Check all registered PIDs for zombie status. 1. Check if process still exists 2. Check if PID was recycled (cmdline changed) 3. Check if runtime exceeds max_runtime_s""" self._total_checked += 1 for task_id in list(self._registry.keys()): proc = self._registry.get(task_id) pid = proc.pid runtime_s = now - proc.registered_at if not _process_exists(pid): self.unregister_pid(task_id) # already dead elif current_cmdline != proc.cmdline_snapshot: self.unregister_pid(task_id) # PID recycled elif runtime_s > proc.max_runtime_s: self._reap_process(task_id, pid) # SIGTERM then SIGKILL
Manages the AI's conversation memory. Archives old conversations, compresses them into searchable knowledge, and flushes context when it gets too full. Prevents the AI from running out of thinking space mid-task.
tools/context_lifecycle_manager.py — 868 lines of Python
"""Context Lifecycle Manager (CLM) Intelligence layer on top of ctx_watchdog (sensing layer). Manages the full context lifecycle: Tier 2 (Cold) : JSONL archives of raw conversation messages Compression : Rule-based semantic compression Tier 1 (Warm) : SQLite database of compressed knowledge Flush Protocol : Orchestrates archive -> compress -> store """ # Commands: # status - show current memory usage # archive - archive a session's messages # query - search compressed knowledge # flush - orchestrate full archive->compress->store # check - read watchdog status and act if needed
Takes a high-level objective (like "research competitors, then analyze, then report") and breaks it into a dependency graph of sub-tasks. Classifies each sub-task as light/medium/heavy and figures out which can run in parallel. Then submits the plan to the Task Scheduler (#1) for execution.
tools/ingestion_planner.py — 1,110 lines of Python
class IngestionPlanner: """Ingest user objectives, build execution plan DAGs, and submit to AgentRM for scheduling.""" def ingest_objective(self, objective): # 1. Split objective into sub-tasks # 2. Detect dependencies between them # 3. Classify each as light/medium/heavy # 4. Build an ExecutionPlan (DAG) return plan def submit_plan(self, plan_id): # Submit ready tasks to AgentRM queue return result
Two AI systems (Secondary node and Primary node) share the same infrastructure. This module decides which one is "in charge" at any moment. It uses a lock (lease) with a heartbeat — if the leader stops heartbeating, the other one can take over. Prevents both AIs from stepping on each other's work.
tools/lease_manager.py — 998 lines of Python
def try_acquire_lease(force=False): """Attempt to claim leadership. Checks if existing lease is expired (heartbeat + ttl < now). If expired or no lease, claims it with a new term number. Uses file locking to prevent race conditions.""" lock_fd = open(lock_path, "w") fcntl.flock(lock_fd.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB) lease = _load_lease() if lease["holder"] is not None and not force: if not _is_lease_expired(lease): return {"acquired": True, ...} # already ours # Expired or empty -- claim it new_term = lease.get("term", 0) + 1 _write_lease({"holder": CIV_NAME, "term": new_term, ...})
Creates point-in-time snapshots of system state every 2 hours. If the leader fails over (#5), this module loads the latest snapshot, checks its integrity (SHA-256 hash), replays any events that happened after the snapshot, and re-queues unfinished work. It's the disaster recovery system.
tools/state_recovery.py — 1,193 lines of Python
def create_checkpoint(self, checkpoint_id=None, extra_state=None): """Create a validated state checkpoint. Captures task summary and any extra state. Stores as JSON with SHA-256 integrity hash.""" state_snapshot = {"system": CIV_NAME, "checkpoint_id": checkpoint_id} task_summary = self._get_task_summary() state_snapshot["task_summary"] = task_summary term = self._get_current_term() integrity_hash = _compute_hash(state_snapshot) checkpoint = Checkpoint( checkpoint_id=checkpoint_id, state_snapshot=state_snapshot, integrity_hash=integrity_hash, term=term, ...)
A 3-tier security system that controls who can do what. Tier 1 (Trusted Core) gets full access. Tier 2 (Department Leads) gets schema-locked access. Tier 3 (External) gets read-only, air-gapped access. Also scans for prompt injection attacks and logs every access attempt.
tools/security_layer.py — 975 lines of Python
class SecurityLayer: """Unified facade for the MCP/JSON-RPC security layer. Combines entity registry, JSON-RPC validation, injection scanning, access control, and audit logging.""" def __init__(self, registry_path, audit_dir, risk_threshold): self.audit = AuditLogger(log_dir=audit_dir) self.registry = EntityRegistry(path=registry_path) self.validator = JSONRPCValidator() self.scanner = InjectionScanner(threshold=risk_threshold) self.access = AccessController(self.registry, self.audit) def validate_message(self, message, sender_id): """Validate a JSON-RPC message from sender. Resolves sender's tier, then validates.""" entity = self.registry.get_entity(sender_id) if entity is None: return ValidationResult(valid=False, errors=[f"Unknown sender: '{sender_id}'"])
Not a service — this is the test suite that proves modules 1–7 work correctly, individually and together. 578 tests across all modules. Tests things like: "does the scheduler correctly submit to the priority queue?", "does the reaper actually kill stuck processes?", "does recovery correctly restore from a checkpoint?"
| Task Scheduler (agent_rm) | 43 passed |
| Stuck-Process Cleaner (zombie_reaper) | 56 passed |
| Memory Manager (clm) | 58 passed |
| Leader Lock (lease_manager) | 57 passed |
| Work Planner (ingestion_planner) | 129 passed |
| Backup & Recovery (state_recovery) | 88 passed |
| Security Gate (security_layer) | 101 passed |
| Integration (cross-module) | 46 passed |
| Total | 578 passed |
This is the glue. It runs 24/7 in the background and submits every recurring job (zombie cleanup, credential sync, snapshots, health checks, checkpoint creation) through the Task Scheduler (#1) on a timed schedule. It's the reason modules 1, 2, 5, and 6 are doing real work — the scheduler daemon is what keeps calling them.
tools/scheduler_daemon.py — 395 lines of Python