Give an agent your LLM training code. Let it hypothesize, train for 5 minutes, evaluate, keep what works. Wake up to 100 experiments and a better model.
Three files. One metric. The agent touches exactly one Python file. Everything else is fixed. This keeps diffs reviewable and experiments comparable across all runs.
The loop:

1. **Read.** The agent reads `program.md` and the current `train.py` to understand the research direction and current state.
2. **Edit & run.** It edits `train.py` and runs it. Training runs for exactly 5 wall-clock minutes, regardless of what changed.
3. **Score.** It parses `val_bpb` from the output, with `evaluate_bpb()` as the ground-truth metric. Lower is better. Vocab-size-independent, so architectural changes are fairly compared.

Requirements: single NVIDIA GPU (tested on H100) · Python 3.10+ · `uv` · ~45 GB VRAM for defaults.
val_bpb — validation bits per byte. Lower is better. Vocabulary-size-independent, so every architectural change is a fair comparison. The fixed 5-minute time budget makes all experiments directly comparable.
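Bits per byte is just summed cross-entropy renormalized by raw byte count, which is exactly why it is vocabulary-size-independent: a tokenizer change alters nats per token but not nats per byte. A minimal sketch (function name and calling convention are mine, not the repo's):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats, over the validation set)
    into bits per raw UTF-8 byte. Dividing by bytes rather than tokens
    makes the metric independent of vocabulary size."""
    return total_loss_nats / (math.log(2) * total_bytes)
```

For example, a model averaging 0.8 nats per token on text averaging 4 bytes per token scores about 0.29 bpb regardless of how the text was tokenized.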
Simplicity is baked into the evaluation. A 0.001 bpb improvement that adds 20 lines of hacky code? Not worth it. Equal performance with less code? Always keep it. Complexity has a cost.
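That preference can be made explicit as an acceptance rule. The threshold and the line-count tiebreak below are illustrative choices of mine, not the repo's actual policy:

```python
def should_keep(new_bpb: float, best_bpb: float,
                new_lines: int, best_lines: int,
                min_gain: float = 0.002) -> bool:
    """Illustrative acceptance rule: a bpb gain must clear a threshold
    to pay for added code; equal performance wins only if code shrank."""
    if new_bpb < best_bpb - min_gain:
        return True  # clear win on the metric
    if new_bpb <= best_bpb and new_lines < best_lines:
        return True  # same performance, less complexity
    return False
```

Under this rule a tiny improvement bought with extra code is rejected, while a pure deletion that holds the score is accepted.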
Guardrails:

- **One file.** The agent edits only `train.py`. Scope is manageable, every diff is reviewable by a human, and there are no surprise edits to infra.
- **Frozen dependencies.** `pyproject.toml` is off-limits: no new packages allowed.
- **Shared direction.** The research program lives in `program.md`. The agent experiments; you review diffs. Collaborative intelligence, not fully autonomous.

The community has already ported autoresearch to smaller compute platforms. The original requires an NVIDIA GPU; these forks bring it everywhere.
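The one-file guardrail can also be enforced mechanically before a human ever reviews a diff. A sketch using `git diff --name-only` (whether the repo actually ships such a check is an assumption; the allow-list is mine):

```python
import subprocess

ALLOWED = {"train.py"}  # the only file the agent may touch

def in_scope(changed: set[str], allowed: set[str] = ALLOWED) -> bool:
    """True iff every changed file is on the allow-list."""
    return changed <= allowed

def diff_is_in_scope(ref: str = "HEAD") -> bool:
    """Check the working-tree diff against the one-file guardrail.
    Sketch: assumes experiments run inside the git checkout."""
    out = subprocess.run(
        ["git", "diff", "--name-only", ref],
        capture_output=True, text=True, check=True,
    ).stdout
    changed = {line.strip() for line in out.splitlines() if line.strip()}
    return in_scope(changed)
```

An experiment whose diff strays outside `train.py` can then be discarded automatically instead of relying on review to catch it.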
Clone the repo, run four commands, point your agent at it. Come back in the morning with a log of experiments and a model that's measurably better than the one you started with.