Give an agent your LLM training code. Let it hypothesize, train for 5 minutes, evaluate, keep what works. Wake up to 100 experiments and a better model.
Three files. One metric. The agent touches exactly one Python file. Everything else is fixed. This keeps diffs reviewable and experiments comparable across all runs.
The loop:

1. **Read.** The agent reads `program.md` and the current `train.py` to understand the research direction and current state.
2. **Edit & run.** It edits `train.py` and runs it. Training runs for exactly 5 wall-clock minutes, regardless of what changed.
3. **Score.** It parses `val_bpb` from the output, with `evaluate_bpb()` as the ground-truth metric. Lower is better. Vocab-size-independent, so architectural changes are fairly compared.

Requirements: single NVIDIA GPU (tested on H100) · Python 3.10+ · `uv` · ~45 GB VRAM for defaults.
val_bpb — validation bits per byte. Lower is better. Vocabulary-size-independent, so every architectural change is a fair comparison. The fixed 5-minute time budget makes all experiments directly comparable.
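Bits per byte is just summed cross-entropy renormalized by raw byte count, which is exactly why it is vocabulary-size-independent: a tokenizer change alters nats per token but not nats per byte. A minimal sketch (function name and calling convention are mine, not the repo's):

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats, over the validation set)
    into bits per raw UTF-8 byte. Dividing by bytes rather than tokens
    makes the metric independent of vocabulary size."""
    return total_loss_nats / (math.log(2) * total_bytes)
```

For example, a model averaging 0.8 nats per token on text averaging 4 bytes per token scores about 0.29 bpb regardless of how the text was tokenized.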
Simplicity is baked into the evaluation. A 0.001 bpb improvement that adds 20 lines of hacky code? Not worth it. Equal performance with less code? Always keep it. Complexity has a cost.
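That preference can be made explicit as an acceptance rule. The threshold and the line-count tiebreak below are illustrative choices of mine, not the repo's actual policy:

```python
def should_keep(new_bpb: float, best_bpb: float,
                new_lines: int, best_lines: int,
                min_gain: float = 0.002) -> bool:
    """Illustrative acceptance rule: a bpb gain must clear a threshold
    to pay for added code; equal performance wins only if code shrank."""
    if new_bpb < best_bpb - min_gain:
        return True  # clear win on the metric
    if new_bpb <= best_bpb and new_lines < best_lines:
        return True  # same performance, less complexity
    return False
```

Under this rule a tiny improvement bought with extra code is rejected, while a pure deletion that holds the score is accepted.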
Guardrails:

- **One file.** The agent edits only `train.py`. Scope is manageable, every diff is reviewable by a human, and there are no surprise edits to infra.
- **Frozen dependencies.** `pyproject.toml` is off-limits: no new packages allowed.
- **Shared direction.** The research program lives in `program.md`. The agent experiments; you review diffs. Collaborative intelligence, not fully autonomous.

The community has already ported autoresearch to smaller compute platforms. The original requires an NVIDIA GPU; these forks bring it everywhere.
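The one-file guardrail can also be enforced mechanically before a human ever reviews a diff. A sketch using `git diff --name-only` (whether the repo actually ships such a check is an assumption; the allow-list is mine):

```python
import subprocess

ALLOWED = {"train.py"}  # the only file the agent may touch

def in_scope(changed: set[str], allowed: set[str] = ALLOWED) -> bool:
    """True iff every changed file is on the allow-list."""
    return changed <= allowed

def diff_is_in_scope(ref: str = "HEAD") -> bool:
    """Check the working-tree diff against the one-file guardrail.
    Sketch: assumes experiments run inside the git checkout."""
    out = subprocess.run(
        ["git", "diff", "--name-only", ref],
        capture_output=True, text=True, check=True,
    ).stdout
    changed = {line.strip() for line in out.splitlines() if line.strip()}
    return in_scope(changed)
```

An experiment whose diff strays outside `train.py` can then be discarded automatically instead of relying on review to catch it.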
Clone the repo, run four commands, point your agent at it. Come back in the morning with a log of experiments and a model that's measurably better than the one you started with.