AutoResearch,
explained.
A minimal framework by Andrej Karpathy that lets an AI agent run hundreds of machine learning experiments overnight — autonomously — while you sleep. You wake up to a model that is measurably better than the one you started with.
AutoResearch gives an AI agent (Claude, Codex, etc.) your neural network training code, and lets it hypothesize improvements, train for 5 minutes, evaluate the result, keep what worked, and repeat — 100 times overnight, completely on its own.
The core idea in plain English
You are a researcher. Normally, you'd spend your day tweaking a learning rate, running a training job, waiting 30 minutes, checking if it improved, tweaking something else, waiting again. That's slow, repetitive, and limited to your waking hours.
AutoResearch replaces that cycle with an AI agent that does the same thing — but 12 experiments per hour, 24 hours a day. The agent reads your training code, hypothesizes a change, trains for exactly 5 minutes, checks if the model got better, keeps the change if yes, reverts if no, and starts over. When you wake up, it has run ~100 experiments and found the best version of your model it could discover in that time.
Karpathy's vision is larger: this is the primitive step toward a SETI@home-style distributed research network where thousands of agents collaborate like an entire research community — running in parallel, across many machines, sharing discoveries.
The problem it solves
ML research is bottlenecked by human time and attention, not compute.
When you train a neural network, there are hundreds of decisions that interact in non-obvious ways: learning rate, model depth, attention window pattern, optimizer parameters, batch size, tokenizer vocabulary size. Nobody knows in advance which combination works best for your specific setup. You have to try things.
The traditional research loop looks like this:
A. Hypothesize. "What if I increase the learning rate from 3e-4 to 5e-4?" Takes 30 seconds to think of.
B. Edit. Opens the file, makes the change, saves. 2 minutes. Then launches the training job.
C. Wait. Training runs. Could be 5 minutes, could be 6 hours. The human goes to do something else, forgets, comes back later.
D. Evaluate. Reads the metric. Decides keep or revert. Forms the next hypothesis. Back to step A.
At best, a researcher runs 8–12 experiments per day. A GPU sits idle every night, every weekend. The bottleneck is human availability, not compute.
AutoResearch removes the human from steps A–D entirely. The agent runs the loop 12x per hour, through the night. By morning you have as many experiments as a human researcher runs in 2–3 weeks of focused work.
AutoResearch doesn't generate new research ideas from scratch or design novel architectures. The agent works within the space of changes that make sense for train.py — hyperparameter tuning, architectural tweaks, optimizer modifications. You, the human, steer the direction via program.md.
How it works
The entire system is three files. This is not an accident — minimalism is the design.
**prepare.py** — Downloads the training dataset, trains a BPE tokenizer (8192 vocab), provides the dataloader, and defines the ground-truth evaluation function evaluate_bpb(). This file is read-only. The agent cannot touch it, you cannot touch it. It is the fixed environment.
**train.py** — The agent's canvas. Contains the full GPT model architecture, Muon + AdamW optimizer, and training loop (~630 lines). Everything in this file is fair game: model depth, attention pattern, optimizer hyperparameters, batch size, model size — the agent can change anything. Each training run is fixed at exactly 5 wall-clock minutes.
**program.md** — Your research direction written in Markdown. The agent reads this before forming hypotheses. It tells the agent what approach to take, what areas to explore or avoid, and provides context about the architecture. This is the only file you edit. Think of it as programming a research team instead of writing code.
The experiment loop (detailed)
1. Agent reads program.md (your direction), train.py (current state), and results.tsv (history of all previous experiments and their val_bpb scores).
2. Agent reasons about what to change and why. One targeted change per experiment — not random mutations. The agent looks at what worked before, what patterns emerge, and proposes the next most promising experiment.
3. Agent edits train.py in place. One change, cleanly made. Since the diff is always reviewable by a human, changes are kept minimal and readable.
4. Agent runs uv run train.py. The training loop runs for exactly 5 wall-clock minutes, then prints a summary block with val_bpb, training_seconds, total_seconds, and peak_vram_mb.
5. Lower val_bpb? Keep the change. Higher? Revert train.py to the previous version. Equal but simpler code? Keep the simplification. Record the result in results.tsv and loop.
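The keep/revert mechanics of this loop can be sketched in a few lines of Python. This is an illustration, not the repo's actual driver: the summary-block regex, backup file name, and TSV row format are assumptions, and the agent's edit to train.py is presumed to have already happened.

```python
import re
import shutil
import subprocess

def parse_val_bpb(summary: str) -> float:
    """Pull val_bpb out of the summary block train.py prints."""
    m = re.search(r"val_bpb:\s*([0-9.]+)", summary)
    return float(m.group(1)) if m else float("inf")

def keep_or_revert(val_bpb: float, best_bpb: float) -> str:
    """Lower val_bpb wins. This sketch reverts ties; the real agent
    keeps a tie only when the code also got simpler."""
    return "KEPT" if val_bpb < best_bpb else "REVERTED"

def run_one_experiment(best_bpb: float) -> float:
    """One loop iteration, assuming the agent already edited train.py."""
    shutil.copy("train.py", "train.py.bak")               # snapshot for revert
    out = subprocess.run(["uv", "run", "train.py"],
                         capture_output=True, text=True).stdout
    val_bpb = parse_val_bpb(out)
    status = keep_or_revert(val_bpb, best_bpb)
    if status == "REVERTED":
        shutil.move("train.py.bak", "train.py")           # restore previous version
    with open("results.tsv", "a") as f:
        f.write(f"{val_bpb:.4f}\t{status}\n")             # append to history
    return min(val_bpb, best_bpb)
```

A crashed run produces no parsable summary, so parse_val_bpb returns infinity and the change is reverted automatically — failures fail fast.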
The metric: val_bpb
val_bpb stands for validation bits per byte. It measures how many bits, on average, the model needs to encode one byte of text it hasn't seen during training. Lower is better — a model that has better learned the patterns in language needs fewer bits to describe each byte.
The stock train.py achieves roughly 0.998 val_bpb after 5 minutes of training on one H100; that score is the baseline every experiment is measured against.

Why this metric specifically?
Vocabulary-size independent. Standard perplexity depends on the size of your vocabulary — if you change from 8192 to 4096 tokens, perplexity becomes incomparable. val_bpb normalizes to bytes, so architectural changes (including tokenizer changes) are fairly compared against each other.
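The byte normalization is easy to see as a formula. Here is the conversion as a sketch, assuming the training loss is mean cross-entropy in nats per token (the repo's real evaluate_bpb() lives in prepare.py and may differ in detail):

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) into bits per byte.

    Total nats over the validation split = mean_loss_nats * num_tokens;
    dividing by ln(2) converts nats to bits, and dividing by the raw
    byte count removes any dependence on the tokenizer's vocabulary.
    """
    total_bits = mean_loss_nats * num_tokens / math.log(2)
    return total_bits / num_bytes

# A tokenizer with a bigger vocabulary covers the same bytes in fewer
# tokens, so a higher per-token loss can still yield the same bits per byte.
```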
Fixed time budget makes experiments comparable. The 5-minute constraint means a model that achieves 0.97 val_bpb is strictly better than one that achieves 0.98 val_bpb — because they both had the same 5 minutes of compute. A bigger model that only achieves 0.99 in 5 minutes is worse than a smaller model that achieves 0.97 in the same time.
A 0.001 val_bpb improvement that adds 20 lines of complex code is not worth it. An equal val_bpb score achieved by deleting code is always a win. The agent is explicitly instructed to weight simplicity — this prevents the code from becoming an unmaintainable pile of hacks after 100 experiments.
Setup & requirements
The default configuration requires a single NVIDIA GPU. It was developed and tested on an H100. The default model uses ~45 GB VRAM. If you're on smaller hardware, see the "Running on smaller GPUs" section below.
What you need
| Requirement | Detail | Notes |
|---|---|---|
| NVIDIA GPU | Single GPU, CUDA 12.8 | H100 tested. RTX 4090 works with reduced settings. |
| Python 3.10+ | System or pyenv | Do not use conda — use uv instead. |
| uv | Python package manager | Required. Much faster than pip. One-liner install. |
| Disk space | ~5–10 GB | Dataset download + tokenizer + model checkpoints. |
| AI agent | Claude, Codex, or similar | AutoResearch doesn't ship an agent — you bring your own. |
Installation
After prepare.py runs, you'll have a tokenized dataset and a trained BPE tokenizer cached on disk. You only do this once — subsequent runs load from cache.
Your first run
Before handing control to an agent, run a baseline yourself to confirm everything works.
That val_bpb of ~0.9979 is your baseline. Every experiment the agent runs will be compared against the current best score. Improvements get kept. Regressions get reverted.
Now create results.tsv with a header row — the agent will fill it in:
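For example, in Python — the column names here are an assumption, modeled on what results.tsv is later described to contain (hypothesis, val_bpb, delta, and kept/reverted status):

```python
# Write the header row the agent will append experiment rows under.
# Column names are illustrative; match whatever your agent is prompted to log.
columns = ["experiment", "hypothesis", "val_bpb", "delta", "status"]

with open("results.tsv", "w") as f:
    f.write("\t".join(columns) + "\n")
```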
The first run is slow (~25 seconds) because PyTorch compiles CUDA kernels. Subsequent runs are faster — the kernels are cached. The 5-minute training budget starts after compilation, so compilation time doesn't eat into your experiment budget.
Going autonomous
AutoResearch doesn't ship an AI agent — it is designed to be driven by any coding agent you have access to. The recommended approach is Claude (via Claude Code or Claude API) or OpenAI Codex.
With Claude Code (recommended)
With Claude API (programmatic)
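One way to drive the loop programmatically is to build one Messages API request per experiment. The sketch below only constructs the request payload; the endpoint is Anthropic's documented Messages API, but the model name and prompt wording are placeholders, and sending the request, parsing the reply, and applying the patch to train.py are left to your own driver:

```python
API_URL = "https://api.anthropic.com/v1/messages"  # Anthropic Messages API

def build_request(train_py: str, results_tsv: str, program_md: str) -> dict:
    """Assemble one request asking the model for the next experiment.

    POST the returned dict as JSON to API_URL with your x-api-key and
    anthropic-version headers; applying the model's edit and running
    the training job is your driver's job.
    """
    prompt = (
        "You are running an AutoResearch experiment loop.\n\n"
        f"Research direction (program.md):\n{program_md}\n\n"
        f"Experiment history (results.tsv):\n{results_tsv}\n\n"
        f"Current train.py:\n{train_py}\n\n"
        "Propose ONE targeted change and output the full edited file."
    )
    return {
        "model": "claude-sonnet-4-5",  # placeholder; use a model you have access to
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }
```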
The agent needs to read and write train.py and results.tsv. It should NOT have permission to modify prepare.py. In Claude Code, disable prompts for file edits in the project directory, or the agent will interrupt every 5 minutes to ask for approval — defeating the purpose.
Writing a good program.md
program.md is how you guide the agent's research direction. It is the most important human contribution to the loop. The default in the repo is intentionally bare-bones — you are expected to evolve it over time based on results.
What goes into program.md
- Research focus. Specific hypotheses you want the agent to test, or areas to focus on. "Explore different WINDOW_PATTERN values (try 'LL', 'SL', 'SSLL')" or "Focus on Muon optimizer parameters — current β=0.95 may not be optimal."
- Exclusions. Known dead ends from previous runs, or areas that are intentionally out of scope. "Do not increase DEPTH beyond 12 — VRAM cost too high." "Do not change the data pipeline."
- Accumulated knowledge. As your overnight runs accumulate, paste in the summary of what experiments succeeded. "Run mar12 found: lr=5e-4 ▼0.006, DEPTH=10 ▲reverted, batch=2^20 ▼0.004." This lets the agent build on prior knowledge.
- Constraints. Hardware limits and scope boundaries. "peak_vram_mb must stay under 48000." "All changes must be reversible single-variable modifications."
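Putting those ingredients together, a program.md might read like this (all values are illustrative, drawn from the examples above rather than from the repo's default file):

```markdown
# Research direction

## Focus
- Explore different WINDOW_PATTERN values (try 'LL', 'SL', 'SSLL').
- Tune Muon optimizer parameters; current β=0.95 may not be optimal.

## Out of scope
- Do not increase DEPTH beyond 12 (VRAM cost too high).
- Do not change the data pipeline.

## Prior results
- Run mar12: lr=5e-4 ▼0.006, DEPTH=10 ▲reverted, batch=2^20 ▼0.004.

## Constraints
- peak_vram_mb must stay under 48000.
- All changes must be reversible single-variable modifications.
```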
Real use cases
Here is who this is actually useful for and what problems it concretely solves.
**Hyperparameter optimization without a framework.** Traditional hyperparameter search (grid search, Bayesian optimization) requires you to define a search space upfront. AutoResearch doesn't — the agent reasons about what to try next based on what worked, like a human researcher would. It's more flexible and doesn't require a special tuning framework.

**Overnight ablation studies.** Want to know whether banded attention helps? Whether a specific depth works better? Whether you need that skip connection? Frame each as an experiment in program.md, let the agent run 100 ablations overnight, wake up to a clear answer. What would take a researcher 2 weeks takes one overnight run.

**Hardware-specific model tuning.** AutoResearch finds the best model for your specific hardware. The 5-minute budget means a model that fits well in your GPU's memory and compute profile wins — not a model that someone else benchmarked on different hardware. This is uniquely valuable when you have a specific deployment target.

**Controlled method comparisons.** Is Muon actually better than AdamW for your use case? What about Lion? AutoResearch lets you run a controlled comparison: same architecture, same data, same 5-minute budget — only the optimizer changes. The fixed time budget ensures fair comparison between methods with different compute costs per step.

**Building intuition.** AutoResearch is an incredible learning tool. You can use it to build intuition about neural network training: "Does learning rate really matter this much?" "Is weight decay helping?" "What does changing depth actually do?" The agent runs the experiments; you read the results and build intuition you can't get from textbooks.

**Nightly runs for teams.** For teams training models regularly, AutoResearch can run every night on the latest training code. When engineers arrive in the morning, they have a log of discovered improvements to review and optionally merge. The human reviews diffs — the agent does the search. This is the collaborative future Karpathy describes in the repo's framing.
What the agent typically experiments with
Based on the repo design and community results, here's the space of changes a well-prompted agent explores:
| Category | Specific changes | Typical impact |
|---|---|---|
| Learning rate | Base LR, warmup steps, decay schedule, Muon vs AdamW LR ratio | ±0.003–0.01 val_bpb |
| Batch size | TOTAL_BATCH_SIZE (powers of 2), gradient accumulation steps | ±0.002–0.008 val_bpb |
| Architecture depth | DEPTH (controls all other dims), head count derived from depth | ±0–0.015 val_bpb (VRAM cost) |
| Attention pattern | WINDOW_PATTERN ('L', 'LL', 'SSL', 'SSSL', 'SSLL') | ±0.001–0.005 val_bpb |
| Optimizer params | Muon β (momentum), weight decay, AdamW ε | ±0.001–0.006 val_bpb |
| Regularization | Dropout, weight decay, gradient clipping value | ±0.001–0.003 val_bpb |
| Simplifications | Removing unused features, cleaning initialization code | 0 val_bpb — but wins on simplicity criterion |
Running on smaller GPUs
The default settings require ~45 GB VRAM (H100 class). For consumer hardware (RTX 3090/4090, A100 40GB), you need to scale down. The community has already done this — there are forks for Apple Silicon (MLX) and Windows RTX. If you want to run the original codebase on smaller hardware, here are the knobs to turn:
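As an illustration, the scaled-down edits near the top of train.py for a ~24 GB card might look like this. The constant names follow the ones mentioned in this guide (DEPTH, TOTAL_BATCH_SIZE, WINDOW_PATTERN), but the actual knobs and defaults in train.py may differ:

```python
# Illustrative scaled-down settings for a ~24 GB GPU (e.g. RTX 4090).
DEPTH = 8                  # fewer layers: smaller model, less VRAM
TOTAL_BATCH_SIZE = 2**18   # fewer tokens per step to fit activation memory
WINDOW_PATTERN = "SL"      # more short-window attention layers, cheaper KV memory
# If training still OOMs, reduce DEPTH further or shorten the context length.
```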
With small models (DEPTH ≤ 4), the default dataset has too much entropy — the model can't learn meaningful patterns in 5 minutes. The TinyStories dataset is narrower in scope, so small models see real improvement. The community-maintained autoresearch-mlx and autoresearch-win-rtx forks handle these defaults automatically.
FAQ
**Do I have to supervise the run?**
No — that's the point. You set it up, give the agent broad permissions to edit train.py and read/write results.tsv, prompt it to start, and leave. Come back in the morning. The only time you intervene is if the agent crashes or gets stuck in an error loop, which is rare with a well-configured setup.
**Can the agent permanently break my training code?**
No. The agent is instructed to revert train.py to the previous version when an experiment fails or produces a regression. By design, each experiment is one targeted change — not a rewrite. And because the repo uses git branches (you create autoresearch/run-name before starting), you can always reset to any previous state. The fixed 5-minute budget also means a broken experiment that crashes early just fails fast and gets reverted.
**Which AI agent should I use?**
Karpathy's repo recommends Claude or Codex. In practice, Claude Code (Claude Sonnet or Opus) works very well because it can read files, make targeted edits, run shell commands, and parse output — all the capabilities needed for the loop. OpenAI's Codex CLI also works. The agent needs: file read/write, shell execution, and the ability to run in a long-running loop without human confirmation for each step.
**What does an overnight run cost?**
Two costs: GPU compute and AI agent API calls. GPU compute depends on your cloud provider — an H100 on Lambda Labs is ~$2–3/hour, so an 8-hour run is ~$16–24. Agent API costs depend on your provider; ~100 experiments means ~100 API calls, each reading ~10K tokens and writing a small patch. With Claude Sonnet, this is roughly $5–15 for the agent side. Total: ~$20–40 for an overnight run that previously would require 2–3 weeks of researcher time.
**Will results differ across GPUs?**
Yes — by design. The fixed 5-minute budget means the optimal model is the one that fits best in your hardware's VRAM and memory bandwidth. A larger model might be best on an H100 but OOM on a 4090. This makes results non-comparable across platforms — but it means AutoResearch finds the genuinely optimal model for your specific machine, which is what you actually want for deployment.
**Can the agent add packages or change the dataset?**
The original repo explicitly disallows adding packages — the agent can only use what's in pyproject.toml. This is a deliberate constraint to keep experiments reproducible and the diff reviewable. For dataset changes, you would need to fork the repo and modify prepare.py — which is fixed in the original. The community forks (autoresearch-mlx, autoresearch-win-rtx, autoresearch-tinystories) handle these customizations.
**How do I review what the agent found?**
Check results.tsv — it has every experiment with hypothesis, val_bpb, delta, and KEPT/REVERTED status. Sort by delta to see the biggest improvements. The analysis.ipynb notebook in the repo helps visualize the improvement trajectory across the run. The final train.py in the branch contains all kept changes — that's your improved model.
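For a quick look without opening the notebook, a few lines of Python over the TSV suffice. The column names and sample rows below are invented for illustration; substitute a real results.tsv for the inline string:

```python
import csv
import io

# Inline sample standing in for a real results.tsv (tab-separated).
sample = (
    "experiment\thypothesis\tval_bpb\tdelta\tstatus\n"
    "1\tlr 3e-4 -> 5e-4\t0.9919\t-0.0060\tKEPT\n"
    "2\tDEPTH 8 -> 10\t1.0030\t+0.0111\tREVERTED\n"
    "3\tbatch 2**19 -> 2**20\t0.9879\t-0.0040\tKEPT\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
kept = [r for r in rows if r["status"] == "KEPT"]
# Biggest improvements first: delta is "new minus old", so most negative wins.
kept.sort(key=lambda r: float(r["delta"]))
best = kept[0]
print(best["hypothesis"], best["delta"])
```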