Mux-Swarm Case Study

Autonomous GPU/HPC
Model Specialization

A single goal file triggered a multi-agent swarm that researched a domain, generated its own training dataset via frontier model distillation, self-recovered through four consecutive code failures, fine-tuned a 14B parameter model on rented GPU hardware, and iterated to improve — all without human intervention.

<$10 Orchestration Cost

14B Parameters

4 Self-Recoveries

0 Human Interventions

The Premise

Can a multi-agent runtime autonomously build a domain-specialized LLM from scratch — including researching the domain, generating training data, handling failures, and iterating to improve?

Mux-Swarm is a CLI-native agent runtime that coordinates specialized agents — an orchestrator for planning, a code agent for execution, a research agent for information gathering, a monitor agent for system health, and a memory agent for knowledge persistence — all driven by configuration, not code.

The test: write a single markdown goal file describing a GPU/HPC model specialization pipeline, point it at a rented 96GB GPU, and walk away. The swarm handles everything else.

Hardware

NVIDIA RTX PRO 6000

96GB GDDR7 · Blackwell · Vast.ai @ $0.80/hr

Base Model

Qwen3-14B

QLoRA · 4-bit NF4 · LoRA rank 32 · Apache 2.0

Target Domain

GPU/HPC Operations

CUDA · ROCm · Debugging · Distributed Training

Orchestration Models

5 LLMs Routed per Role

GPT-5.4 · Grok 4.20 · Gemini 3.1 · Mercury 2

The Pipeline

Eight phases executed across multiple continuous iterations, with Phase 0 state recovery running at the start of every cycle.

Phase 0 — Every Iteration

State Recovery & Orientation

Before any work, the orchestrator queries the knowledge graph, scans the filesystem for existing artifacts, checks for running processes, and classifies the pipeline state. This enables crash recovery, cross-session continuity, and prevents duplicate work.

Phase 1 — Domain Research

GPU/HPC Knowledge Mapping

The ResearchAgent loaded custom skills, then systematically mapped the GPU/HPC domain across 8 topic areas: CUDA programming, ROCm/HIP, GPU debugging, workload optimization, multi-GPU distributed training, container/cloud GPU operations, driver/system management, and HPC patterns. Produced a comprehensive research report with a data sourcing plan.

Phase 2 — Source Data Collection

22+ Reference Documents Assembled

CodeAgent and ResearchAgent collaborated to collect source material organized into 7 topic directories: CUDA kernels, ROCm porting guides, error code databases, optimization patterns, NCCL distributed training docs, Docker GPU configurations, and nvidia-smi operational guides. Included automatically generated error scenario databases — 50 common CUDA errors, 50 GPU OOM scenarios, 50 multi-GPU communication failures.

Phase 3 — Synthetic Data Generation

Frontier Model Distillation

This is where the swarm exhibited its most unconventional behavior. The CodeAgent needed to call a frontier LLM to generate training data from the source material. Rather than requesting a new tool or API access, it reasoned that API keys must already exist — because it is running inside Mux-Swarm, which connects to LLM providers. It listed environment variables, found OPENROUTER_API_KEY, then wrote a standalone Python script that called the OpenRouter API directly to generate instruction/response pairs from each source document using multiple prompt templates (Q&A, troubleshooting, code generation, comparison). The swarm created its own data generation pipeline by leveraging the infrastructure it was already running on.

Phase 4 — Quality Filtering

377 Examples Filtered & Validated

Automated quality pipeline: structural validation, minimum output length, refusal pattern detection, exact deduplication via shingle Jaccard similarity, and chat template formatting for Qwen3. Produced a 95/5 train/test split with full dataset card and statistics.

Phase 5 — Training

QLoRA Fine-Tune with 4 Self-Recoveries

The training script was written, validated with py_compile, and launched — then failed. Four times. Each time, the swarm diagnosed the error from its own logs, patched its own code, and restarted. See the self-healing section below for the full recovery chain. Final training: 69 steps, 4.3 minutes, final loss 0.833.

Phase 6 — Monitoring

GPU Health Tracking Throughout

MonitorAgent queried nvidia-smi and training metrics logs throughout execution. Detected CRITICAL status (process not found) after each crash, triggering the orchestrator to delegate repair to CodeAgent. Reported HEALTHY once training stabilized at 34.58GB VRAM, 67°C, 77% utilization.

Phase 7 — Evaluation

Base vs Specialist Head-to-Head

Both base Qwen3-14B and the fine-tuned specialist were loaded simultaneously on the 96GB GPU (65GB + 30GB). 10 domain-specific prompts evaluated with LLM-as-judge scoring. Aggregate: base 3.87 vs fine-tuned 4.00. Strongest gain on GPU monitoring scripts (+1.67). Regression identified on Dockerfile generation (-0.67).

Phase 8 — Reporting & Persistence

Self-Written Analysis & Recommendations

The swarm wrote its own comprehensive report analyzing training metrics, loss curves, evaluation results, per-prompt breakdowns, and seven specific recommendations for improvement. All findings persisted to the knowledge graph for future session recall.

The Self-Healing Loop

Four consecutive failures. Four autonomous recoveries. Zero human intervention. Every fix persisted to the knowledge graph so the swarm never makes the same mistake twice.

Crash #1 — TrainingArguments API Mismatch

TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'overwrite_output_dir'

Diagnose

MonitorAgent detected CRITICAL (process not found). Orchestrator delegated to CodeAgent. CodeAgent read stdout.log, identified the bad argument, checked installed transformers version.

→

Fix

Removed overwrite_output_dir. Also corrected torch_dtype → dtype for compatibility. Validated with py_compile. MemoryAgent persisted the fix.

Crash #2 — Trainer Tokenizer API Change

TypeError: Trainer.__init__() got an unexpected keyword argument 'tokenizer'

→

Fix

Changed tokenizer=tokenizer → processing_class=tokenizer in Trainer init. Persisted to knowledge graph.

Crash #3 — SFTTrainer Sequence Length Config

TypeError: SFTTrainer.__init__() got an unexpected keyword argument 'max_seq_length'

→

Fix

Runtime-introspected the installed TRL library, confirmed SFTTrainer doesn't accept max_seq_length. Switched to SFTConfig with max_length=2048. Persisted.

Crash #4 — Data Collation Mismatch

ValueError: Unable to create tensor — features have excessive nesting

→

Fix

Removed custom DataCollatorForLanguageModeling. Let SFTTrainer handle tokenization and collation automatically with flattened chat examples via dataset_text_field='text'. Persisted.

✓

Training Starts Successfully

Step 10 — loss 1.7488, lr 9.85e-05, VRAM 23.37GB. All metrics flowing. MonitorAgent confirms HEALTHY. Training completes in 4.3 minutes.

Emergent Behavior

Behaviors the swarm exhibited that were not explicitly programmed — emergent from the combination of tools, prompts, and autonomy.

Self-Discovered API Access for Data Generation

When the CodeAgent needed to call a frontier model to generate training data, it wasn't given explicit instructions on how to authenticate. It reasoned that since Mux-Swarm itself connects to LLM providers, API credentials must exist in the environment. It listed environment variables, found OPENROUTER_API_KEY, and wrote a standalone Python data generation script that called the OpenRouter API directly — creating its own distillation pipeline from infrastructure it was already running on.

Autonomous Environment Diagnosis

The CodeAgent discovered that python after venv activation resolved to a Mux-Swarm shim interpreter that lacked GPU packages. Rather than failing repeatedly, it diagnosed the issue by checking which Python binary was being invoked, identified the shim, and switched to calling /venv/main/bin/python explicitly. It then persisted this finding to the knowledge graph so all future training runs would use the correct path.

Cross-Agent Knowledge Transfer

When CodeAgent encountered each crash, it delegated to MemoryAgent to check if the error had been seen before. MemoryAgent searched the knowledge graph, returned prior context (or confirmed no prior record), and CodeAgent used that context to avoid repeating failed approaches. After each fix, the pattern reversed — CodeAgent delegated to MemoryAgent to persist the solution. This created an accumulating institutional memory across the session.

Model Upgrade During Self-Improvement

During the autonomous improvement loop (run 2), the swarm decided to switch from GPT-4o to GPT-5.4 for data generation — upgrading its own distillation source to produce higher-quality training examples. This decision was not specified in the goal file; the swarm made it based on its assessment that better generation quality was the most impactful improvement action.

Evaluation Results

Run 1: Base Qwen3-14B vs GPU/HPC Specialist on 10 domain-specific prompts, scored by LLM-as-judge.

#	Prompt Topic	Base	Specialist	Delta
1	CUDA illegal memory access debugging	4.33	4.33	+0.00
2	CUDA parallel reduction kernel	3.00	3.33	+0.33
3	CUDA vs ROCm/HIP comparison	3.67	3.67	+0.00
4	NCCL all-reduce bottleneck diagnosis	4.33	4.33	+0.00
5	GPU monitoring script + anomaly detection	3.33	5.00	+1.67
6	PyTorch training OOM diagnosis	4.33	4.33	+0.00
7	NVIDIA MIG setup on A100	3.33	3.33	+0.00
8	Warp divergence explanation + refactor	4.33	4.33	+0.00
9	CUDA memory allocation API comparison	4.33	4.33	+0.00
10	Multi-GPU CUDA Dockerfile	3.67	3.00	-0.67
	Aggregate	3.87	4.00	+0.13

The result is directional, not definitive. A 3.4% improvement on 377 training examples is a signal, not a claim. The deliverable is the autonomous pipeline — not the model. Scale the dataset to 5,000+ examples and the gap widens. The swarm's own report identified this and began iterating autonomously.

Operational Resilience

Crash recovery at every layer — from code errors to process death to infrastructure moves.

Code-Level Recovery

4 / 4 Self-Healed

MonitorAgent detects crash → Orchestrator delegates → CodeAgent reads logs → patches code → validates → relaunches → MemoryAgent persists fix

Process-Level Recovery

Watchdog Restart

Separate watchdog process monitors heartbeat. On process death, restarts mux-swarm with same arguments. Verified live by killing the process mid-run — watchdog restarted within 90 seconds.

Session-Level Recovery

Phase 0 State Reorientation

On every iteration, the orchestrator scans filesystem artifacts and knowledge graph to determine exactly where the pipeline left off. No work is duplicated.

Infrastructure-Level Recovery

Cross-Instance Portability

Entire agent runtime tarred, moved to a different Vast.ai GPU instance (different machine, different region), extracted, and resumed — Phase 0 picked up from filesystem state.

The Self-Improvement Loop

After the first pipeline completed, the swarm didn't stop. It read its own report, identified weaknesses, and began iterating.

Orchestrator → MemoryAgent : retrieve all training-run entities, eval scores, and recommendations
MemoryAgent → returned: run1 score 4.00/5.00, Dockerfile regression identified, dataset size flagged as limitation

Orchestrator → CodeAgent : generate 200+ targeted examples for Dockerfile generation, MIG operations, and code-heavy GPU tasks
CodeAgent → upgraded distillation model from GPT-4o to GPT-5.4 autonomously

Orchestrator → CodeAgent : retrain as gpu-hpc-specialist-run2 with expanded dataset
MonitorAgent → training healthy: loss 1.72 → 0.74 by step 40, VRAM 64GB stable

Run 2 training showed a steeper loss curve than run 1 (0.74 at step 40 vs 0.83 at completion), indicating the improved dataset quality was producing better training signal. The self-improvement loop continues indefinitely until manually stopped.

Architecture That Made This Possible

The design decisions that enabled autonomous operation.

Two-Config Separation

Config.json defines infrastructure (providers, MCP servers, filesystem boundaries). Swarm.json defines behavior (agent roles, models, delegation permissions, tool scoping). This separation means you can move the swarm to different hardware by changing one file, or redesign the agent topology by changing the other — independently.

MCP-Native Tool Integration

Every tool the agents use — filesystem, Python REPL, web fetch, memory, vector store — is an MCP server. Tools are discovered at runtime, scoped per-agent, and validated on startup. This means the swarm's capabilities are extensible without code changes.

Skills System

Agents load operational playbooks at runtime via list_skills and read_skill. Custom skills for training pipelines, dataset preparation, evaluation, and GPU monitoring were loaded before each phase — keeping agent prompts lean while giving them domain-specific guidance when needed.

Per-Agent Model Routing

Each agent runs a different LLM optimized for its role. GPT-5.4 for code generation (needs the best coding). Grok 4.20 for research (lowest hallucination rate). Mercury 2 for monitoring (1000+ tokens/sec, cheapest). Gemini Flash Lite for memory operations (high efficiency). This isn't one model doing everything — it's a team.

Filesystem as Message Bus

Agents communicate through files. Training scripts, metrics logs, evaluation results, and reports are all written to disk. This eliminates context drift, reduces token burn, makes every intermediate output inspectable, and enables crash recovery — the filesystem is the source of truth, not agent memory.

Cost Breakdown

GPU Rental

~$4.00

RTX PRO 6000 @ $0.80/hr × ~5 hours active

LLM Orchestration

~$5.00

Agent reasoning via OpenRouter across all phases

Data Generation

~$2.00

Frontier model distillation calls for training data

Total Pipeline Cost

~$11.00

Domain research → data gen → training → eval → report

Implications

Hardware-Agnostic Autonomous Workloads

The swarm doesn't care what GPU it's running on. It discovers hardware via nvidia-smi, adapts VRAM budgets, and proceeds. The same goal file that ran on an NVIDIA RTX PRO 6000 could run on AMD Instinct MI300X with ROCm — the only change is the config file. This is the pattern GPU vendors need: autonomous AI workloads that treat hardware as a commodity.

Autonomous ML Operations

The traditional ML pipeline requires a human at every step — data collection, preprocessing, training script development, debugging, evaluation, iteration. This POC compressed that entire workflow into a single goal file. The human defines what to build; the swarm figures out how.

Self-Accumulating Knowledge

Every training run, every error, every fix, every evaluation result is persisted to a knowledge graph. The swarm gets better over time — not through retraining its own agents, but through accumulating operational knowledge that informs future decisions. Run 3 benefits from everything learned in runs 1 and 2.

Embeddable Agent Runtime

Mux-Swarm isn't a notebook or a chat interface — it's a standalone binary with stdio, scoped configs, and a watchdog. It embeds into backend services, CI/CD pipelines, and products. The same architecture that trained a model autonomously can power any multi-step AI workflow that requires coordination, execution, and resilience.