A single goal file triggered a multi-agent swarm that researched a domain, generated its own training dataset via frontier model distillation, self-recovered through four consecutive code failures, fine-tuned a 14B parameter model on rented GPU hardware, and iterated to improve — all without human intervention.
Can a multi-agent runtime autonomously build a domain-specialized LLM from scratch — including researching the domain, generating training data, handling failures, and iterating to improve?
Mux-Swarm is a CLI-native agent runtime that coordinates specialized agents — an orchestrator for planning, a code agent for execution, a research agent for information gathering, a monitor agent for system health, and a memory agent for knowledge persistence — all driven by configuration, not code.
The test: write a single markdown goal file describing a GPU/HPC model specialization pipeline, point it at a rented 96GB GPU, and walk away. The swarm handles everything else.
96GB GDDR7 · Blackwell · Vast.ai @ $0.80/hr
QLoRA · 4-bit NF4 · LoRA rank 32 · Apache 2.0
CUDA · ROCm · Debugging · Distributed Training
GPT-5.4 · Grok 4.20 · Gemini 3.1 · Mercury 2
Eight phases executed across multiple continuous iterations, with Phase 0 state recovery running at the start of every cycle.
OPENROUTER_API_KEY, then wrote a standalone Python script that called the OpenRouter API directly to generate instruction/response pairs from each source document using multiple prompt templates (Q&A, troubleshooting, code generation, comparison). The swarm created its own data generation pipeline by leveraging the infrastructure it was already running on.Four consecutive failures. Four autonomous recoveries. Zero human intervention. Every fix persisted to the knowledge graph so the swarm never makes the same mistake twice.
TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'overwrite_output_dir'overwrite_output_dir. Also corrected torch_dtype → dtype for compatibility. Validated with py_compile. MemoryAgent persisted the fix.TypeError: Trainer.__init__() got an unexpected keyword argument 'tokenizer'tokenizer=tokenizer → processing_class=tokenizer in Trainer init. Persisted to knowledge graph.TypeError: SFTTrainer.__init__() got an unexpected keyword argument 'max_seq_length'max_seq_length. Switched to SFTConfig with max_length=2048. Persisted.ValueError: Unable to create tensor — features have excessive nestingDataCollatorForLanguageModeling. Let SFTTrainer handle tokenization and collation automatically with flattened chat examples via dataset_text_field='text'. Persisted.Behaviors the swarm exhibited that were not explicitly programmed — emergent from the combination of tools, prompts, and autonomy.
When the CodeAgent needed to call a frontier model to generate training data, it wasn't given explicit instructions on how to authenticate. It reasoned that since Mux-Swarm itself connects to LLM providers, API credentials must exist in the environment. It listed environment variables, found OPENROUTER_API_KEY, and wrote a standalone Python data generation script that called the OpenRouter API directly — creating its own distillation pipeline from infrastructure it was already running on.
The CodeAgent discovered that python after venv activation resolved to a Mux-Swarm shim interpreter that lacked GPU packages. Rather than failing repeatedly, it diagnosed the issue by checking which Python binary was being invoked, identified the shim, and switched to calling /venv/main/bin/python explicitly. It then persisted this finding to the knowledge graph so all future training runs would use the correct path.
When CodeAgent encountered each crash, it delegated to MemoryAgent to check if the error had been seen before. MemoryAgent searched the knowledge graph, returned prior context (or confirmed no prior record), and CodeAgent used that context to avoid repeating failed approaches. After each fix, the pattern reversed — CodeAgent delegated to MemoryAgent to persist the solution. This created an accumulating institutional memory across the session.
During the autonomous improvement loop (run 2), the swarm decided to switch from GPT-4o to GPT-5.4 for data generation — upgrading its own distillation source to produce higher-quality training examples. This decision was not specified in the goal file; the swarm made it based on its assessment that better generation quality was the most impactful improvement action.
Run 1: Base Qwen3-14B vs GPU/HPC Specialist on 10 domain-specific prompts, scored by LLM-as-judge.
| # | Prompt Topic | Base | Specialist | Delta |
|---|---|---|---|---|
| 1 | CUDA illegal memory access debugging | 4.33 | 4.33 | +0.00 |
| 2 | CUDA parallel reduction kernel | 3.00 | 3.33 | +0.33 |
| 3 | CUDA vs ROCm/HIP comparison | 3.67 | 3.67 | +0.00 |
| 4 | NCCL all-reduce bottleneck diagnosis | 4.33 | 4.33 | +0.00 |
| 5 | GPU monitoring script + anomaly detection | 3.33 | 5.00 | +1.67 |
| 6 | PyTorch training OOM diagnosis | 4.33 | 4.33 | +0.00 |
| 7 | NVIDIA MIG setup on A100 | 3.33 | 3.33 | +0.00 |
| 8 | Warp divergence explanation + refactor | 4.33 | 4.33 | +0.00 |
| 9 | CUDA memory allocation API comparison | 4.33 | 4.33 | +0.00 |
| 10 | Multi-GPU CUDA Dockerfile | 3.67 | 3.00 | -0.67 |
| Aggregate | 3.87 | 4.00 | +0.13 |
The result is directional, not definitive. A 3.4% improvement on 377 training examples is a signal, not a claim. The deliverable is the autonomous pipeline — not the model. Scale the dataset to 5,000+ examples and the gap widens. The swarm's own report identified this and began iterating autonomously.
Crash recovery at every layer — from code errors to process death to infrastructure moves.
MonitorAgent detects crash → Orchestrator delegates → CodeAgent reads logs → patches code → validates → relaunches → MemoryAgent persists fix
Separate watchdog process monitors heartbeat. On process death, restarts mux-swarm with same arguments. Verified live by killing the process mid-run — watchdog restarted within 90 seconds.
On every iteration, the orchestrator scans filesystem artifacts and knowledge graph to determine exactly where the pipeline left off. No work is duplicated.
Entire agent runtime tarred, moved to a different Vast.ai GPU instance (different machine, different region), extracted, and resumed — Phase 0 picked up from filesystem state.
After the first pipeline completed, the swarm didn't stop. It read its own report, identified weaknesses, and began iterating.
Run 2 training showed a steeper loss curve than run 1 (0.74 at step 40 vs 0.83 at completion), indicating the improved dataset quality was producing better training signal. The self-improvement loop continues indefinitely until manually stopped.
The design decisions that enabled autonomous operation.
Config.json defines infrastructure (providers, MCP servers, filesystem boundaries). Swarm.json defines behavior (agent roles, models, delegation permissions, tool scoping). This separation means you can move the swarm to different hardware by changing one file, or redesign the agent topology by changing the other — independently.
Every tool the agents use — filesystem, Python REPL, web fetch, memory, vector store — is an MCP server. Tools are discovered at runtime, scoped per-agent, and validated on startup. This means the swarm's capabilities are extensible without code changes.
Agents load operational playbooks at runtime via list_skills and read_skill. Custom skills for training pipelines, dataset preparation, evaluation, and GPU monitoring were loaded before each phase — keeping agent prompts lean while giving them domain-specific guidance when needed.
Each agent runs a different LLM optimized for its role. GPT-5.4 for code generation (needs the best coding). Grok 4.20 for research (lowest hallucination rate). Mercury 2 for monitoring (1000+ tokens/sec, cheapest). Gemini Flash Lite for memory operations (high efficiency). This isn't one model doing everything — it's a team.
Agents communicate through files. Training scripts, metrics logs, evaluation results, and reports are all written to disk. This eliminates context drift, reduces token burn, makes every intermediate output inspectable, and enables crash recovery — the filesystem is the source of truth, not agent memory.
RTX PRO 6000 @ $0.80/hr × ~5 hours active
Agent reasoning via OpenRouter across all phases
Frontier model distillation calls for training data
Domain research → data gen → training → eval → report
The swarm doesn't care what GPU it's running on. It discovers hardware via nvidia-smi, adapts VRAM budgets, and proceeds. The same goal file that ran on an NVIDIA RTX PRO 6000 could run on AMD Instinct MI300X with ROCm — the only change is the config file. This is the pattern GPU vendors need: autonomous AI workloads that treat hardware as a commodity.
The traditional ML pipeline requires a human at every step — data collection, preprocessing, training script development, debugging, evaluation, iteration. This POC compressed that entire workflow into a single goal file. The human defines what to build; the swarm figures out how.
Every training run, every error, every fix, every evaluation result is persisted to a knowledge graph. The swarm gets better over time — not through retraining its own agents, but through accumulating operational knowledge that informs future decisions. Run 3 benefits from everything learned in runs 1 and 2.
Mux-Swarm isn't a notebook or a chat interface — it's a standalone binary with stdio, scoped configs, and a watchdog. It embeds into backend services, CI/CD pipelines, and products. The same architecture that trained a model autonomously can power any multi-step AI workflow that requires coordination, execution, and resilience.