AutoResearch,
explained.
A minimal framework by Andrej Karpathy that lets an AI agent run hundreds of machine learning experiments overnight — autonomously — while you sleep. You wake up to a model that is measurably better than the one you started with.
AutoResearch gives an AI agent (Claude, Codex, etc.) your neural network training code, and lets it hypothesize improvements, train for 5 minutes, evaluate the result, keep what worked, and repeat — 100 times overnight, completely on its own.
The core idea in plain English
You are a researcher. Normally, you'd spend your day tweaking a learning rate, running a training job, waiting 30 minutes, checking if it improved, tweaking something else, waiting again. That's slow, repetitive, and limited to your waking hours.
AutoResearch replaces that cycle with an AI agent that does the same thing — but 12 experiments per hour, 24 hours a day. The agent reads your training code, hypothesizes a change, trains for exactly 5 minutes, checks if the model got better, keeps the change if yes, reverts if no, and starts over. When you wake up, it has run ~100 experiments and found the best version of your model it could discover in that time.
Karpathy's vision is larger: this is the primitive step toward a SETI@home-style distributed research network where thousands of agents collaborate like an entire research community — running in parallel, across many machines, sharing discoveries.
The problem it solves
ML research is bottlenecked by human time and attention, not compute.
When you train a neural network, there are hundreds of decisions that interact in non-obvious ways: learning rate, model depth, attention window pattern, optimizer parameters, batch size, tokenizer vocabulary size. Nobody knows in advance which combination works best for your specific setup. You have to try things.
The traditional research loop looks like this:
A. Hypothesize. "What if I increase the learning rate from 3e-4 to 5e-4?" Takes 30 seconds to think of.
B. Edit. Opens the file, makes the change, saves. 2 minutes. Then launches the training job.
C. Wait. Training runs. Could be 5 minutes, could be 6 hours. The human goes to do something else, forgets, comes back later.
D. Evaluate. Reads the metric. Decides keep or revert. Forms the next hypothesis. Back to step A.
At best, a researcher runs 8–12 experiments per day. A GPU sits idle every night, every weekend. The bottleneck is human availability, not compute.
AutoResearch removes the human from steps A–D entirely. The agent runs the loop 12x per hour, through the night. By morning you have as many experiments as a human researcher runs in 2–3 weeks of focused work.
AutoResearch doesn't generate new research ideas from scratch or design novel architectures. The agent works within the space of changes that make sense for train.py — hyperparameter tuning, architectural tweaks, optimizer modifications. You, the human, steer the direction via program.md.
How it works
The entire system is three files. This is not an accident — minimalism is the design.
**prepare.py** — Downloads the training dataset, trains a BPE tokenizer (8192 vocab), provides the dataloader, and defines the ground-truth evaluation function evaluate_bpb(). This file is read-only. The agent cannot touch it, you cannot touch it. It is the fixed environment.
**train.py** — The agent's canvas. Contains the full GPT model architecture, Muon + AdamW optimizer, and training loop (~630 lines). Everything in this file is fair game: model depth, attention pattern, optimizer hyperparameters, batch size, model size — the agent can change anything. Each training run is fixed at exactly 5 wall-clock minutes.
**program.md** — Your research direction written in Markdown. The agent reads this before forming hypotheses. It tells the agent what approach to take, what areas to explore or avoid, and provides context about the architecture. This is the only file you edit. Think of it as programming a research team instead of writing code.
The experiment loop (detailed)
1. Agent reads program.md (your direction), train.py (current state), and results.tsv (history of all previous experiments and their val_bpb scores).
2. Agent reasons about what to change and why. One targeted change per experiment — not random mutations. The agent looks at what worked before, what patterns emerge, and proposes the next most promising experiment.
3. Agent edits train.py in place. One change, cleanly made. Since the diff is always reviewable by a human, changes are kept minimal and readable.
4. Agent runs uv run train.py. The training loop runs for exactly 5 wall-clock minutes, then prints a summary block with val_bpb, training_seconds, total_seconds, and peak_vram_mb.
5. Lower val_bpb? Keep the change. Higher? Revert train.py to the previous version. Equal but simpler code? Keep the simplification. Record the result in results.tsv and loop.
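The keep/revert mechanics of this loop can be sketched in a few lines of Python. This is an illustration, not the repo's actual driver: the summary-block regex, backup file name, and TSV row format are assumptions, and the agent's edit to train.py is presumed to have already happened.

```python
import re
import shutil
import subprocess

def parse_val_bpb(summary: str) -> float:
    """Pull val_bpb out of the summary block train.py prints."""
    m = re.search(r"val_bpb:\s*([0-9.]+)", summary)
    return float(m.group(1)) if m else float("inf")

def keep_or_revert(val_bpb: float, best_bpb: float) -> str:
    """Lower val_bpb wins. This sketch reverts ties; the real agent
    keeps a tie only when the code also got simpler."""
    return "KEPT" if val_bpb < best_bpb else "REVERTED"

def run_one_experiment(best_bpb: float) -> float:
    """One loop iteration, assuming the agent already edited train.py."""
    shutil.copy("train.py", "train.py.bak")               # snapshot for revert
    out = subprocess.run(["uv", "run", "train.py"],
                         capture_output=True, text=True).stdout
    val_bpb = parse_val_bpb(out)
    status = keep_or_revert(val_bpb, best_bpb)
    if status == "REVERTED":
        shutil.move("train.py.bak", "train.py")           # restore previous version
    with open("results.tsv", "a") as f:
        f.write(f"{val_bpb:.4f}\t{status}\n")             # append to history
    return min(val_bpb, best_bpb)
```

A crashed run produces no parsable summary, so parse_val_bpb returns infinity and the change is reverted automatically — failures fail fast.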
The metric: val_bpb
val_bpb stands for validation bits per byte. It measures how many bits, on average, the model needs to encode one byte of text it hasn't seen during training. Lower is better — a model that has better learned the patterns in language needs fewer bits to describe each byte.
The stock train.py achieves roughly 0.998 val_bpb after 5 minutes of training on one H100; that score is the baseline every experiment is measured against.

Why this metric specifically?
Vocabulary-size independent. Standard perplexity depends on the size of your vocabulary — if you change from 8192 to 4096 tokens, perplexity becomes incomparable. val_bpb normalizes to bytes, so architectural changes (including tokenizer changes) are fairly compared against each other.
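The byte normalization is easy to see as a formula. Here is the conversion as a sketch, assuming the training loss is mean cross-entropy in nats per token (the repo's real evaluate_bpb() lives in prepare.py and may differ in detail):

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) into bits per byte.

    Total nats over the validation split = mean_loss_nats * num_tokens;
    dividing by ln(2) converts nats to bits, and dividing by the raw
    byte count removes any dependence on the tokenizer's vocabulary.
    """
    total_bits = mean_loss_nats * num_tokens / math.log(2)
    return total_bits / num_bytes

# A tokenizer with a bigger vocabulary covers the same bytes in fewer
# tokens, so a higher per-token loss can still yield the same bits per byte.
```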
Fixed time budget makes experiments comparable. The 5-minute constraint means a model that achieves 0.97 val_bpb is strictly better than one that achieves 0.98 val_bpb — because they both had the same 5 minutes of compute. A bigger model that only achieves 0.99 in 5 minutes is worse than a smaller model that achieves 0.97 in the same time.
A 0.001 val_bpb improvement that adds 20 lines of complex code is not worth it. An equal val_bpb score achieved by deleting code is always a win. The agent is explicitly instructed to weight simplicity — this prevents the code from becoming an unmaintainable pile of hacks after 100 experiments.
Setup & requirements
The default configuration requires a single NVIDIA GPU. It was developed and tested on an H100. The default model uses ~45 GB VRAM. If you're on smaller hardware, see the "Running on smaller GPUs" section below.
What you need
| Requirement | Detail | Notes |
|---|---|---|
| NVIDIA GPU | Single GPU, CUDA 12.8 | H100 tested. RTX 4090 works with reduced settings. |
| Python 3.10+ | System or pyenv | Do not use conda — use uv instead. |
| uv | Python package manager | Required. Much faster than pip. One-liner install. |
| Disk space | ~5–10 GB | Dataset download + tokenizer + model checkpoints. |
| AI agent | Claude, Codex, or similar | AutoResearch doesn't ship an agent — you bring your own. |
Installation
After prepare.py runs, you'll have a tokenized dataset and a trained BPE tokenizer cached on disk. You only do this once — subsequent runs load from cache.
Your first run
Before handing control to an agent, run a baseline yourself to confirm everything works.
That val_bpb of ~0.9979 is your baseline. Every experiment the agent runs will be compared against the current best score. Improvements get kept. Regressions get reverted.
Now create results.tsv with a header row — the agent will fill it in:
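For example, in Python — the column names here are an assumption, modeled on what results.tsv is later described to contain (hypothesis, val_bpb, delta, and kept/reverted status):

```python
# Write the header row the agent will append experiment rows under.
# Column names are illustrative; match whatever your agent is prompted to log.
columns = ["experiment", "hypothesis", "val_bpb", "delta", "status"]

with open("results.tsv", "w") as f:
    f.write("\t".join(columns) + "\n")
```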
The first run is slow (~25 seconds) because PyTorch compiles CUDA kernels. Subsequent runs are faster — the kernels are cached. The 5-minute training budget starts after compilation, so compilation time doesn't eat into your experiment budget.
Going autonomous
AutoResearch doesn't ship an AI agent — it is designed to be driven by any coding agent you have access to. The recommended approach is Claude (via Claude Code or Claude API) or OpenAI Codex.
With Claude Code (recommended)
With Claude API (programmatic)
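One way to drive the loop programmatically is to build one Messages API request per experiment. The sketch below only constructs the request payload; the endpoint is Anthropic's documented Messages API, but the model name and prompt wording are placeholders, and sending the request, parsing the reply, and applying the patch to train.py are left to your own driver:

```python
API_URL = "https://api.anthropic.com/v1/messages"  # Anthropic Messages API

def build_request(train_py: str, results_tsv: str, program_md: str) -> dict:
    """Assemble one request asking the model for the next experiment.

    POST the returned dict as JSON to API_URL with your x-api-key and
    anthropic-version headers; applying the model's edit and running
    the training job is your driver's job.
    """
    prompt = (
        "You are running an AutoResearch experiment loop.\n\n"
        f"Research direction (program.md):\n{program_md}\n\n"
        f"Experiment history (results.tsv):\n{results_tsv}\n\n"
        f"Current train.py:\n{train_py}\n\n"
        "Propose ONE targeted change and output the full edited file."
    )
    return {
        "model": "claude-sonnet-4-5",  # placeholder; use a model you have access to
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }
```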
The agent needs to read and write train.py and results.tsv. It should NOT have permission to modify prepare.py. In Claude Code, disable prompts for file edits in the project directory, or the agent will interrupt every 5 minutes to ask for approval — defeating the purpose.
Writing a good program.md
program.md is how you guide the agent's research direction. It is the most important human contribution to the loop. The default in the repo is intentionally bare-bones — you are expected to evolve it over time based on results.
What goes into program.md
- Research focus. Specific hypotheses you want the agent to test, or areas to focus on. "Explore different WINDOW_PATTERN values (try 'LL', 'SL', 'SSLL')" or "Focus on Muon optimizer parameters — current β=0.95 may not be optimal."
- Exclusions. Known dead ends from previous runs, or areas that are intentionally out of scope. "Do not increase DEPTH beyond 12 — VRAM cost too high." "Do not change the data pipeline."
- Accumulated knowledge. As your overnight runs accumulate, paste in the summary of what experiments succeeded. "Run mar12 found: lr=5e-4 ▼0.006, DEPTH=10 ▲reverted, batch=2^20 ▼0.004." This lets the agent build on prior knowledge.
- Constraints. Hardware limits and scope boundaries. "peak_vram_mb must stay under 48000." "All changes must be reversible single-variable modifications."
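Putting those ingredients together, a program.md might read like this (all values are illustrative, drawn from the examples above rather than from the repo's default file):

```markdown
# Research direction

## Focus
- Explore different WINDOW_PATTERN values (try 'LL', 'SL', 'SSLL').
- Tune Muon optimizer parameters; current β=0.95 may not be optimal.

## Out of scope
- Do not increase DEPTH beyond 12 (VRAM cost too high).
- Do not change the data pipeline.

## Prior results
- Run mar12: lr=5e-4 ▼0.006, DEPTH=10 ▲reverted, batch=2^20 ▼0.004.

## Constraints
- peak_vram_mb must stay under 48000.
- All changes must be reversible single-variable modifications.
```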
Real use cases
Here is who this is actually useful for and what problems it concretely solves.
**Hyperparameter optimization without a framework.** Traditional hyperparameter search (grid search, Bayesian optimization) requires you to define a search space upfront. AutoResearch doesn't — the agent reasons about what to try next based on what worked, like a human researcher would. It's more flexible and doesn't require a special tuning framework.

**Overnight ablation studies.** Want to know whether banded attention helps? Whether a specific depth works better? Whether you need that skip connection? Frame each as an experiment in program.md, let the agent run 100 ablations overnight, wake up to a clear answer. What would take a researcher 2 weeks takes one overnight run.

**Hardware-specific model tuning.** AutoResearch finds the best model for your specific hardware. The 5-minute budget means a model that fits well in your GPU's memory and compute profile wins — not a model that someone else benchmarked on different hardware. This is uniquely valuable when you have a specific deployment target.

**Controlled method comparisons.** Is Muon actually better than AdamW for your use case? What about Lion? AutoResearch lets you run a controlled comparison: same architecture, same data, same 5-minute budget — only the optimizer changes. The fixed time budget ensures fair comparison between methods with different compute costs per step.

**Building intuition.** AutoResearch is an incredible learning tool. You can use it to build intuition about neural network training: "Does learning rate really matter this much?" "Is weight decay helping?" "What does changing depth actually do?" The agent runs the experiments; you read the results and build intuition you can't get from textbooks.

**Nightly runs for teams.** For teams training models regularly, AutoResearch can run every night on the latest training code. When engineers arrive in the morning, they have a log of discovered improvements to review and optionally merge. The human reviews diffs — the agent does the search. This is the collaborative future Karpathy describes in the repo's framing.
What the agent typically experiments with
Based on the repo design and community results, here's the space of changes a well-prompted agent explores:
| Category | Specific changes | Typical impact |
|---|---|---|
| Learning rate | Base LR, warmup steps, decay schedule, Muon vs AdamW LR ratio | ±0.003–0.01 val_bpb |
| Batch size | TOTAL_BATCH_SIZE (powers of 2), gradient accumulation steps | ±0.002–0.008 val_bpb |
| Architecture depth | DEPTH (controls all other dims), head count derived from depth | ±0–0.015 val_bpb (VRAM cost) |
| Attention pattern | WINDOW_PATTERN ('L', 'LL', 'SSL', 'SSSL', 'SSLL') | ±0.001–0.005 val_bpb |
| Optimizer params | Muon β (momentum), weight decay, AdamW ε | ±0.001–0.006 val_bpb |
| Regularization | Dropout, weight decay, gradient clipping value | ±0.001–0.003 val_bpb |
| Simplifications | Removing unused features, cleaning initialization code | 0 val_bpb — but wins on simplicity criterion |
Running on smaller GPUs
The default settings require ~45 GB VRAM (H100 class). For consumer hardware (RTX 3090/4090, A100 40GB), you need to scale down. The community has already done this — there are forks for Apple Silicon (MLX) and Windows RTX. If you want to run the original codebase on smaller hardware, here are the knobs to turn:
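As an illustration, the scaled-down edits near the top of train.py for a ~24 GB card might look like this. The constant names follow the ones mentioned in this guide (DEPTH, TOTAL_BATCH_SIZE, WINDOW_PATTERN), but the actual knobs and defaults in train.py may differ:

```python
# Illustrative scaled-down settings for a ~24 GB GPU (e.g. RTX 4090).
DEPTH = 8                  # fewer layers: smaller model, less VRAM
TOTAL_BATCH_SIZE = 2**18   # fewer tokens per step to fit activation memory
WINDOW_PATTERN = "SL"      # more short-window attention layers, cheaper KV memory
# If training still OOMs, reduce DEPTH further or shorten the context length.
```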
With small models (DEPTH ≤ 4), the default dataset has too much entropy — the model can't learn meaningful patterns in 5 minutes. The TinyStories dataset is narrower in scope, so small models see real improvement. The community-maintained autoresearch-mlx and autoresearch-win-rtx forks handle these defaults automatically.
FAQ
**Do I have to supervise the run?**
No — that's the point. You set it up, give the agent broad permissions to edit train.py and read/write results.tsv, prompt it to start, and leave. Come back in the morning. The only time you intervene is if the agent crashes or gets stuck in an error loop, which is rare with a well-configured setup.
**Can the agent permanently break my training code?**
No. The agent is instructed to revert train.py to the previous version when an experiment fails or produces a regression. By design, each experiment is one targeted change — not a rewrite. And because the repo uses git branches (you create autoresearch/run-name before starting), you can always reset to any previous state. The fixed 5-minute budget also means a broken experiment that crashes early just fails fast and gets reverted.
**Which AI agent should I use?**
Karpathy's repo recommends Claude or Codex. In practice, Claude Code (Claude Sonnet or Opus) works very well because it can read files, make targeted edits, run shell commands, and parse output — all the capabilities needed for the loop. OpenAI's Codex CLI also works. The agent needs: file read/write, shell execution, and the ability to run in a long-running loop without human confirmation for each step.
**What does an overnight run cost?**
Two costs: GPU compute and AI agent API calls. GPU compute depends on your cloud provider — an H100 on Lambda Labs is ~$2–3/hour, so an 8-hour run is ~$16–24. Agent API costs depend on your provider; ~100 experiments means ~100 API calls, each reading ~10K tokens and writing a small patch. With Claude Sonnet, this is roughly $5–15 for the agent side. Total: ~$20–40 for an overnight run that previously would require 2–3 weeks of researcher time.
**Will results differ across GPUs?**
Yes — by design. The fixed 5-minute budget means the optimal model is the one that fits best in your hardware's VRAM and memory bandwidth. A larger model might be best on an H100 but OOM on a 4090. This makes results non-comparable across platforms — but it means AutoResearch finds the genuinely optimal model for your specific machine, which is what you actually want for deployment.
**Can the agent add packages or change the dataset?**
The original repo explicitly disallows adding packages — the agent can only use what's in pyproject.toml. This is a deliberate constraint to keep experiments reproducible and the diff reviewable. For dataset changes, you would need to fork the repo and modify prepare.py — which is fixed in the original. The community forks (autoresearch-mlx, autoresearch-win-rtx, autoresearch-tinystories) handle these customizations.
**How do I review what the agent found?**
Check results.tsv — it has every experiment with hypothesis, val_bpb, delta, and KEPT/REVERTED status. Sort by delta to see the biggest improvements. The analysis.ipynb notebook in the repo helps visualize the improvement trajectory across the run. The final train.py in the branch contains all kept changes — that's your improved model.
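For a quick look without opening the notebook, a few lines of Python over the TSV suffice. The column names and sample rows below are invented for illustration; substitute a real results.tsv for the inline string:

```python
import csv
import io

# Inline sample standing in for a real results.tsv (tab-separated).
sample = (
    "experiment\thypothesis\tval_bpb\tdelta\tstatus\n"
    "1\tlr 3e-4 -> 5e-4\t0.9919\t-0.0060\tKEPT\n"
    "2\tDEPTH 8 -> 10\t1.0030\t+0.0111\tREVERTED\n"
    "3\tbatch 2**19 -> 2**20\t0.9879\t-0.0040\tKEPT\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
kept = [r for r in rows if r["status"] == "KEPT"]
# Biggest improvements first: delta is "new minus old", so most negative wins.
kept.sort(key=lambda r: float(r["delta"]))
best = kept[0]
print(best["hypothesis"], best["delta"])
```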