Autoresearch: the night shift that changes everything
The most important experiments in machine learning history were run by humans who had a hypothesis, wrote some code, waited for the results, and adjusted. That loop — hypothesise, implement, evaluate, iterate — is the scientific method compressed into a GPU cycle. Andrej Karpathy just automated it.
Autoresearch is 630 lines of Python. It gives an AI agent a training script, a single GPU, and a five-minute wall-clock budget. The agent reads the code, forms a hypothesis, modifies the training script, runs the experiment, checks the metric, and decides whether to keep or discard the change. Then it does it again. Twelve times an hour. A hundred times while you sleep.
That is not a productivity hack. That is a phase transition in how research works.
The loop
The mechanics are deliberately simple. Three files: prepare.py for data, train.py for the model, and program.md for instructions. The agent is only allowed to modify train.py. The metric is validation bits per byte — vocabulary-size-independent, lower is better, directly comparable across every experiment regardless of architectural changes.
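To make the metric concrete: validation bits per byte falls out of the ordinary cross-entropy loss by converting nats to bits and dividing by the raw byte count of the validation text. The helper below is a hypothetical illustration, not code from the repository:

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert a mean cross-entropy loss (nats per token) into bits per byte.

    Dividing by the raw byte count of the validation text is what makes
    the number vocabulary-size-independent: a model with a larger
    vocabulary emits fewer tokens, but pays more bits per token, and the
    two effects cancel at the byte level.
    """
    total_bits = mean_loss_nats * num_tokens / math.log(2)
    return total_bits / num_bytes

# Example: 1.2 nats/token over 1,000 tokens covering 4,000 bytes
print(round(bits_per_byte(1.2, 1000, 4000), 4))  # → 0.4328
```

Because the denominator is bytes rather than tokens, an agent that swaps the tokeniser cannot flatter itself: the score stays comparable across every experiment.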
Two design decisions deserve attention.
First, the fixed five-minute budget. It does not matter whether the agent changes a learning rate or restructures the attention mechanism — every experiment gets exactly the same compute. This makes results directly comparable without a human normalising anything. Good experimental design, enforced by a timer.
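A timer-enforced budget might look like the sketch below. The names `train_step` and `evaluate` are stand-ins for whatever the training script defines, not autoresearch's actual code; the point is that the clock, not a step count, ends the run:

```python
import time

BUDGET_SECONDS = 300.0  # the fixed five-minute wall clock

def run_experiment(train_step, evaluate, budget_s=BUDGET_SECONDS):
    """Train until the wall-clock budget expires, then score the run.

    Because the timer ends training rather than an epoch or step count,
    a change to batch size or architecture stays directly comparable to
    every other experiment: all of them spend the same compute.
    """
    deadline = time.monotonic() + budget_s
    steps = 0
    while time.monotonic() < deadline:
        train_step()
        steps += 1
    return evaluate(), steps
```

Note the side effect: a cheaper step automatically buys more steps, which is precisely the trade-off the agent later discovered when it halved the batch size.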
Second, git as memory. The agent creates a branch, commits every modification, and resets to the last good state when an experiment fails. No vector database. No external knowledge store. The commit history is the experiment log. This is not a limitation — it is a strength. Every experiment is reproducible, diffable, and auditable by default.
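One plausible shape for the keep-or-discard step, sketched with hypothetical function names (the repository's actual logic may differ):

```python
import subprocess

def git(*args: str) -> str:
    """Run a git command and return its stdout."""
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

def record(best_metric: float, new_metric: float, message: str) -> float:
    """Keep the working-tree change if it improved the metric, else revert.

    The commit log doubles as the experiment log: every kept change is a
    commit whose message records the metric, and every failed experiment
    is wiped by restoring the last good state.
    """
    if new_metric < best_metric:          # lower bits per byte is better
        git("commit", "-am", f"{message} (val bpb {new_metric:.4f})")
        return new_metric
    git("checkout", "--", ".")            # discard the failed modification
    return best_metric
```

Everything the loop "remembers" is recoverable with `git log` and `git diff`, which is what makes the experiment history auditable by default.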
In traditional research, half the labour goes into making experiments comparable. Autoresearch gets that for free.
A critical distinction: this is not hyperparameter tuning. The agent does not search a predefined grid of values. It modifies source code arbitrarily — restructuring functions, changing algorithms, deleting dead code, fixing bugs. As Karpathy noted: “Agents can modify code arbitrarily, the notion of a ‘hyperparameter’ dissolves.” This is closer to hiring a tireless junior researcher than to running a grid search.
The human’s role shifts from writing Python to writing program.md — a markdown file that tells the agent what to explore, what to avoid, and what counts as progress. Karpathy calls this an “autonomous research org”. I would call it something more precise: it is the separation of research direction from research execution.
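The real program.md ships with the repository; a hypothetical file in the same spirit might read:

```markdown
# Goal
Minimise validation bits per byte within the five-minute budget.

# Explore
- Optimiser hyperparameters, schedules, initialisation scales
- Architectural changes that keep parameter count roughly constant

# Avoid
- Changing the random seed to harvest noise
- Touching the evaluation code or the data pipeline

# What counts as progress
An improvement must survive a re-run, and must not add complexity
out of proportion to the gain.
```

The file is short, but every line is a research-direction decision: what to explore, what is off-limits, and what a result has to survive before it counts.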
The results
Karpathy left autoresearch running for two days on a depth-12 model. It found about 20 changes that improved validation loss — all of them additive, all of them transferring cleanly to larger depth-24 models. Stacked together, they dropped the “Time to GPT-2” leaderboard entry by roughly 11 percent. His comment: “I didn’t touch anything — incredible.”
~20 additive improvements. All transferred to larger models. Two days. Zero human intervention.
What makes this interesting is not the aggregate numbers but what the agent actually found. It halved the batch size to increase iteration speed — recognising that under a fixed time budget, more iterations beat larger batches. It discovered narrow optimisation sweet spots (initialisation at 0.68x works, 0.66x fails) that no human would test manually. And it identified actual bugs — a missing multiplier causing model diffusion, suboptimal optimiser settings — that had survived human review. These were not parameter tweaks. They were genuine insights that transferred to larger models.
The point is not that the agent is smarter than a researcher. The point is that it is tireless. Research has always been bottlenecked by the number of hypotheses a human can test in a day. Autoresearch removes that bottleneck for a specific class of problems — the kind where the search space is large, the evaluation metric is clear, and the experiments are fast enough to run in volume.
Why this matters beyond ML training
It is tempting to file autoresearch under “neat ML trick” and move on. The community decided otherwise. Within eight days, the repository hit 32,000 stars and the pattern escaped ML training entirely. Autokernel applied it to GPU kernel optimisation at 40 experiments per hour. An MLX port ran it on Apple Silicon — no PyTorch, no CUDA. Agent-factory used the loop to build autonomous agents that solve real problems scraped from Reddit and Hacker News. The core loop — one mutable file, one metric, git as memory — proved portable to any domain with measurable outcomes.
What Karpathy built is not a tool. It is a primitive: give an agent a constrained environment, a clear metric, a modification surface, and a time budget. The agent explores the space autonomously. The human sets direction. Any problem you can express as “modify this file, check this number” fits the loop.
Configuration optimisation. Infrastructure parameters, database tuning, caching strategies — anywhere the search space is large and the evaluation function is automated.
API design. Generate interface variations, test them against a suite of consumer scenarios, keep the ones that reduce error rates or improve latency.
Prompt engineering at scale. Instead of manually iterating on prompts, define the evaluation criteria and let an agent run hundreds of variations overnight.
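Stripped of ML specifics, the primitive can be sketched domain-agnostically. Every name below is a stand-in for whatever the domain supplies: `propose_patch` for an agent call, `measure` for any automated evaluation (latency, error rate, bits per byte):

```python
def campaign(propose_patch, apply_patch, measure, revert, rounds=100):
    """Generic 'modify this file, check this number' loop.

    The loop itself knows nothing about ML: it proposes a change,
    applies it to the one mutable surface, measures, and keeps or
    reverts. The domain lives entirely in the callbacks.
    """
    best = measure()
    history = [best]
    for _ in range(rounds):
        patch = propose_patch(history)   # hypothesis, informed by past results
        apply_patch(patch)               # modify the one mutable file
        score = measure()                # evaluate within the fixed budget
        if score < best:                 # lower is better
            best = score
        else:
            revert()                     # reset to the last good state
        history.append(score)
    return best, history
```

Swap the callbacks and the same skeleton tunes a cache policy, an API surface, or a prompt; the constraint is only that `measure` is automated and cheap enough to run in volume.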
The researcher who writes the best program.md will outperform the researcher who writes the best train.py. That sentence would have been meaningless two years ago.
The philosophical shift
For most of the history of science, the unit of research progress has been the individual experiment — conceived, executed, and interpreted by a human mind. Autoresearch changes the unit of progress from a single experiment to an experimental campaign. The human designs the campaign. The agent runs it.
This is not “AI replacing scientists”. That framing misses the structure entirely. The human contribution moves up one level of abstraction — from “what should I change in line 47?” to “what region of the design space should we explore this week?” The thinking does not disappear. It becomes more strategic.
Autoresearch does not replace the researcher. It promotes them — from lab technician to research director.
At Interlusion, this is exactly the trajectory we see across every domain where agents enter the workflow. The human does not do less. The human does different — and the “different” is almost always more interesting than the “less”.
The honest limitations
A good researcher names what the method cannot do, not only what it can.
Autoresearch works when the metric is clear, the feedback loop is fast, and the modification surface is small enough for an agent to reason about. Language model training on a single GPU fits this perfectly. Many real research problems do not. Problems where the evaluation requires human judgement, where experiments take days instead of minutes, or where the search space has discontinuities that confuse gradient-free exploration — these remain human territory for now.
Then there is Goodhart’s Law — when a measure becomes a target, it ceases to be a good measure. Late-stage autoresearch runs show this clearly: the agent starts gaming the metric, changing random seeds for marginal gains without scientific value, micro-adjusting parameters in ways that improve the number but not the model. The human writing program.md must anticipate this. Constraints are not just about what to explore — they are about what counts as a meaningful improvement. A tiny gain from adding 20 lines of hacky code is not the same as a tiny gain from deleting dead weight.
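One hedged way a program.md author might operationalise "what counts as a meaningful improvement" is a simple acceptance rule. The thresholds below are illustrative, not taken from autoresearch:

```python
def meaningful(gain: float, lines_added: int,
               noise_floor: float = 0.0005,
               cost_per_line: float = 0.0001) -> bool:
    """Reject metric-gaming 'improvements'.

    A change must beat the run-to-run noise floor, and a larger diff
    must pay for itself: a tiny gain from 20 lines of hacky code fails
    the bar, while the same gain from deleting dead weight passes.
    Deletions (negative lines_added) carry no complexity cost.
    """
    complexity_cost = cost_per_line * max(lines_added, 0)
    return gain > noise_floor + complexity_cost
```

A rule like this does not eliminate Goodhart effects, but it shifts the agent's incentives from harvesting noise toward changes that are both real and cheap to maintain.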
The “for now” is doing real work in that earlier sentence. The constraints will relax. Compute gets cheaper. Agents get better at long-horizon reasoning. Evaluation functions get richer. But today, autoresearch is a proof of concept for a very specific shape of problem. Honest about the shape, excited about the trajectory.
What comes next
Karpathy built autoresearch for ML training. The pattern generalises. At Interlusion’s research lab, we are watching this space with the attention it deserves — not because we want to replicate the specific tool, but because the paradigm of autonomous experimental campaigns applies to problems we care about deeply: visual representation, compression, and the structure of perception itself.
The night shift has started. The interesting question is not whether agents can run experiments. They can. The interesting question is whether humans can learn to write good program.md files — to direct research at the right level of abstraction, with the right constraints, toward the right metrics.
That, as it turns out, is still very much a human skill.
Interested in how autonomous agents can accelerate your research and development? Let’s talk.