By Andrej Karpathy · March 2026

AI that
researches
overnight

Give an agent your LLM training code. Let it hypothesize, train for 5 minutes, evaluate, keep what works. Wake up to 100 experiments and a better model.

autoresearch · overnight run
$ uv run train.py # experiment 001 — baseline val_bpb: 0.997900 # experiment 002 — lr=4e-4 val_bpb: 0.991200 ▼ -0.0067 KEPT # experiment 003 — depth=10 val_bpb: 0.994800 ▲ +0.0036 REVERTED # experiment 004 — muon β=0.98 val_bpb: 0.988100 ▼ -0.0031 KEPT # experiment 005 — batch=2**20 val_bpb: 0.984300 ▼ -0.0038 KEPT # experiment 097 — 8h later val_bpb: 0.9697 # total improvement: -0.0282 (2.8%) $
5
Minutes per experiment
~100
Experiments per overnight run
0K+
GitHub stars in first week
3
Files that matter
Architecture

Deliberately
minimal.

Three files. One metric. The agent touches exactly one Python file. Everything else is fixed. This keeps diffs reviewable and experiments comparable across all runs.

prepare.py
Fixed constants. One-time data prep — downloads training data, trains a BPE tokenizer (8192 vocab). Provides dataloaders and the evaluate_bpb() ground truth metric.
Fixed — Do Not Modify
train.py
The agent's canvas. Full GPT model architecture, Muon + AdamW optimizer, training loop. Everything is fair game — architecture, hyperparameters, batch size, optimizer. ~630 lines, all editable.
Agent Iterates This
program.md
Your research direction in Markdown. Guides the agent's hypotheses. This is the only file you — the human — are expected to edit. Think of it as programming your research organization.
Human Programs This
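A program.md can be as short as a few bullet points. A hypothetical sketch of what one might look like (the headings and hypotheses below are illustrative, not taken from the repo):

```markdown
# Research direction

Goal: minimize val_bpb within the fixed 5-minute training budget.

## Hypotheses to explore
- Learning-rate schedule: vary warmup length and peak LR around the current values.
- Architecture: trade depth against width at roughly constant parameter count.
- Optimizer: sweep Muon momentum and AdamW betas.

## Ground rules
- One change per experiment; revert anything that regresses val_bpb.
- Prefer deletions: equal val_bpb with less code always wins.
```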
The Experiment Loop
01
Read context
Agent reads program.md and current train.py to understand the research direction and current state.
02
Form hypothesis
Agent proposes one change: a different learning rate, deeper architecture, modified optimizer parameters, different batch size — one thing at a time.
03
Edit & train
Agent edits train.py and runs it. Training runs for exactly 5 wall-clock minutes, regardless of what changed.
uv run train.py
04
Evaluate
Reads val_bpb from output. Lower is better. Vocab-size-independent, so architectural changes are fairly compared.
05
Keep or revert
Improvement? Keep. Regression? Revert. Simplification with equal performance? Always keep. Repeat ~12 times per hour, ~100 times overnight.
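The five steps above can be sketched as a simple driver script. This is an illustrative outline, not the repo's actual agent harness; it assumes train.py prints a `val_bpb: <float>` line when it finishes:

```python
import re
import shutil
import subprocess

def parse_val_bpb(output: str) -> float:
    """Extract val_bpb from training output (assumes a 'val_bpb: <float>' line)."""
    return float(re.search(r"val_bpb:\s*([0-9.]+)", output).group(1))

def run_experiment() -> float:
    """Run one 5-minute training run and return its val_bpb."""
    out = subprocess.run(["uv", "run", "train.py"],
                         capture_output=True, text=True, check=True).stdout
    return parse_val_bpb(out)

def experiment_loop(propose_edit, n_experiments: int = 100) -> float:
    """Keep-or-revert loop: propose_edit mutates train.py in place."""
    best = run_experiment()                      # experiment 001: baseline
    for _ in range(n_experiments):
        shutil.copy("train.py", "train.py.bak")  # snapshot before editing
        propose_edit()                           # agent makes ONE change
        score = run_experiment()
        if score < best:                         # lower val_bpb is better: keep
            best = score
        else:                                    # regression: revert the edit
            shutil.move("train.py.bak", "train.py")
    return best
```

The snapshot-then-revert structure is what makes one-change-at-a-time experiments safe: a bad hypothesis costs one 5-minute run and nothing else.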
Quick Start

Running in
four commands.

01
Install uv
The fast Python project manager. Required — do not use pip. One-liner install.
02
Install dependencies
PyTorch 2.9.1 (CUDA 12.8), kernels, rustbpe, and supporting packages. Resolved from lock file.
03
Prepare data (once)
Downloads training dataset, trains BPE tokenizer, prepares shards. Takes ~2 minutes. One-time.
04
Train & go
Run a baseline experiment. Verify your setup works. Then point your agent at the repo and sleep.
bash
# 1. Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies
uv sync

# 3. Download data + train tokenizer (~2 min, one-time)
uv run prepare.py

# 4. Run a baseline experiment (~5 min)
uv run train.py
prompt — Claude / Codex
# Paste this into your agent to kick off autonomous research:
"Read program.md and train.py, then let's kick off a new experiment. Set up the run, establish the baseline, and begin iterating autonomously."

Requirements: Single NVIDIA GPU (tested on H100) · Python 3.10+ · uv · ~45 GB VRAM for defaults

The Metric

One number
that matters.

val_bpb — validation bits per byte. Lower is better. Vocabulary-size-independent, so every architectural change is a fair comparison. The fixed 5-minute time budget makes all experiments directly comparable.

Simplicity is baked into the evaluation. A 0.001 improvement that adds 20 lines of hacky code? Not worth it. Equal performance from deleted code? Always keep. Complexity has a cost.
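Bits per byte falls out of the ordinary cross-entropy loss. A sketch of the conversion, assuming the loss is mean cross-entropy in nats per token (the function name and signature here are illustrative, not the repo's `evaluate_bpb()`):

```python
import math

def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits/byte.

    Nats become bits via division by ln(2); tokens become bytes via the
    tokenizer's compression ratio. That second factor is what makes the
    metric vocab-size-independent: a bigger vocab packs more bytes per
    token (raising per-token loss), but the per-byte figure stays comparable.
    """
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * n_tokens / n_bytes
```

For intuition: a loss of ln(2) nats per token on a tokenizer that emits one token per byte is exactly 1.0 bits per byte.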

Lower is better
In every experiment, the goal is to push val_bpb down. While you sleep, the agent does the same.
Fixed budget
Always 5 wall-clock minutes. Bigger model, smaller model, any change — they all get the same time.
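One way to enforce a wall-clock budget inside a training loop (an illustrative pattern, not the repo's actual code; note the timer starts after setup, since the 5 minutes exclude startup):

```python
import time

TRAIN_SECONDS = 5 * 60  # fixed wall-clock budget, started after setup/compile

def train_for_budget(step_fn, budget_s: float = TRAIN_SECONDS) -> int:
    """Run training steps until the wall-clock budget is spent; return step count."""
    start = time.monotonic()  # monotonic clock: immune to system clock changes
    steps = 0
    while time.monotonic() - start < budget_s:
        step_fn()  # one optimizer step, whatever the current model/batch size
        steps += 1
    return steps
```

A time budget (rather than a step budget) is what makes architectural changes comparable: a bigger model simply gets fewer steps in the same window, so the agent is implicitly trading step count against per-step quality.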
Example Overnight Run — 126 Experiments
Start: 0.9979 → End: 0.9697 after 8 hours
Experiments kept: 74 / 126
#001 0.9979 BASELINE
#012 0.9912 ▼ −0.0067 KEPT
#031 0.9841 ▼ −0.0071 KEPT
#058 0.9783 ▼ −0.0058 KEPT
#097 0.9697 ▼ −0.0086 KEPT
Design Choices

Constraints are
features.

Single file to modify
The agent only touches train.py. Scope is manageable. Every diff is reviewable by a human. No surprise edits to infra.
Fixed 5-minute budget
Wall-clock time, excluding startup. Bigger models, smaller batches, architectural changes — all get 5 minutes. The agent optimizes for your platform, not some benchmark machine.
Simplicity criterion
Improvement that adds 20 hacky lines? Skip it. Equal performance from removing code? Always keep. Complexity has a cost that compounds across 100 experiments.
One metric
val_bpb. Not perplexity, not accuracy on some eval suite — bits per byte. Vocab-size-independent so every change is directly comparable, no matter what the agent touched.
Self-contained
No distributed training. No complex configs. One GPU, one file, one metric. Everything the agent needs to run is in pyproject.toml — no new packages allowed.
Human in the loop
You program the research direction via program.md. The agent experiments. You review diffs. Collaborative intelligence — not fully autonomous.
Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. This repo is the story of how it all began.
Andrej Karpathy, March 2026
Community

Running on every
platform.

The community has already ported autoresearch to smaller compute platforms. The original requires NVIDIA GPU — these forks bring it everywhere.

karpathy/autoresearch
Original. Single NVIDIA GPU, tested on H100. PyTorch 2.9.1 + CUDA 12.8. The ground truth.
autoresearch-mlx
Apple Silicon port. Uses MLX instead of PyTorch. No CUDA required. Runs natively on M1/M2/M3 Mac.
autoresearch-win-rtx
Windows + consumer RTX GPU support. Lower VRAM defaults — runs on RTX 3080/4090 class hardware.
autoresearch-tinystories
Tuned for small compute. Uses TinyStories dataset, smaller models, lower VRAM. Great for learning.
Get Started

Your agent is ready
to research tonight.

Clone the repo, run four commands, and point your agent at it. Come back in the morning to a log of experiments and a model that's measurably better than the one you started with.

Clone on GitHub Read the docs →