Karpathy Ships MicroGPT: A Complete LLM in 200 Lines of Pure Python
Andrej Karpathy published MicroGPT on February 12, 2026: a single Python file containing every algorithmic component required to train and run a GPT-class language model. The script is 200 lines long, requires no external libraries, and implements tokenization, automatic differentiation, a GPT-2-style transformer, the Adam optimizer, and both training and inference loops from scratch.
The model uses 4,192 parameters. For comparison, the largest GPT-2 model shipped with roughly 1.5 billion. Karpathy describes MicroGPT as the endpoint of a multi-year reduction effort spanning several prior projects (micrograd, makemore, and nanoGPT), each of which isolated a different piece of the LLM stack.
What’s in the File
MicroGPT packs the entire LLM pipeline into a linear script with no classes beyond a single Value type for automatic differentiation. Here’s what each section covers:
| Component | Implementation | Notes |
|---|---|---|
| Dataset | 32,000 names from names.txt | Each name treated as a separate document |
| Tokenizer | Character-level, 27 tokens | 26 lowercase letters + BOS delimiter |
| Autograd | Custom Value class with 6 ops | add, mul, pow, log, exp, relu |
| Architecture | GPT-2 variant | RMSNorm (not LayerNorm), ReLU (not GeLU), no biases |
| Optimizer | Adam | Standard implementation |
| Parameters | 4,192 total | 16-dim embeddings, 4 attention heads, 1 layer, 16-token context |
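The tokenizer row above is simple enough to sketch in full. The snippet below is an illustrative character-level tokenizer with the same 27-token vocabulary (26 lowercase letters plus a BOS delimiter); the token ids and function names are assumptions for illustration, not the script's exact code:

```python
# Illustrative character-level tokenizer in the spirit of MicroGPT's
# 27-token vocabulary: 26 lowercase letters plus a BOS delimiter.
BOS = 0  # delimiter token marking document (name) boundaries
chars = "abcdefghijklmnopqrstuvwxyz"
stoi = {ch: i + 1 for i, ch in enumerate(chars)}  # 'a' -> 1 ... 'z' -> 26
itos = {i: ch for ch, i in stoi.items()}

def encode(name: str) -> list[int]:
    """Wrap a name in BOS delimiters so each name is its own document."""
    return [BOS] + [stoi[c] for c in name] + [BOS]

def decode(tokens: list[int]) -> str:
    """Drop delimiters and map ids back to characters."""
    return "".join(itos[t] for t in tokens if t != BOS)

print(encode("emma"))          # [0, 5, 13, 13, 1, 0]
print(decode(encode("emma")))  # emma
```

Treating each name as a delimiter-bounded document is what lets a 16-token context window cover nearly every training example.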
Architecture Details
The transformer configuration is deliberately minimal: embedding dimension of 16, a single layer, 4 attention heads with head dimension of 4, and a maximum sequence length of 16 tokens. The architecture follows GPT-2’s structure but swaps in RMSNorm for LayerNorm and ReLU for GeLU, eliminating bias terms entirely. These substitutions reduce code complexity without fundamentally changing the model’s behavior at this scale.
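The RMSNorm-for-LayerNorm swap is easy to see in code. This is a hedged sketch on plain Python lists (function names are mine, not MicroGPT's): RMSNorm skips the mean subtraction and carries no learned bias, which is exactly the complexity reduction the article describes.

```python
def rms_norm(x, eps=1e-5):
    """RMSNorm: scale x by the reciprocal of its root-mean-square."""
    ms = sum(v * v for v in x) / len(x)   # mean of squares
    scale = (ms + eps) ** -0.5
    return [v * scale for v in x]

def layer_norm(x, eps=1e-5):
    """LayerNorm for comparison: center on the mean, then scale."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    scale = (var + eps) ** -0.5
    return [(v - mu) * scale for v in x]

print(rms_norm([1.0, 2.0, 3.0, 4.0]))
print(layer_norm([1.0, 2.0, 3.0, 4.0]))
```

Both normalize the vector's scale; only LayerNorm also forces the mean to zero, and in practice the difference is small at this model size.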
The autograd engine implements scalar-level backpropagation: algorithmically identical to what PyTorch runs on tensors, but operating on individual floating point values. This makes it orders of magnitude slower than production frameworks but transparent enough to step through with a debugger.
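The scalar autograd idea can be sketched in a few dozen lines, in the micrograd style that MicroGPT builds on: each Value records its inputs and a local backward rule, and `backward()` applies the chain rule in reverse topological order. This is an illustration of the concept (two of the six ops), not the MicroGPT source itself.

```python
class Value:
    """A scalar that tracks how it was computed, for backpropagation."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None  # local chain-rule step

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():  # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():  # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological sort, then propagate gradients from the output
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a   # d(loss)/da = b + 1 = 4, d(loss)/db = a = 2
loss.backward()
print(a.grad, b.grad)  # 4.0 2.0
```

The full script adds `pow`, `log`, `exp`, and `relu` the same way; everything else, including attention and Adam, is built from these scalar ops.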
What It Actually Produces
Trained on the names dataset, the model generates plausible synthetic names, sequences like “kamon,” “karai,” and “vialan” that follow English phonotactic patterns learned from the training data. The outputs are statistically consistent with the input distribution without reproducing specific entries.
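The generation loop behind those names is plain autoregressive sampling: at each step the model yields a probability distribution over the 27 tokens, and sampling continues until the delimiter reappears. A hedged sketch, with a uniform stand-in distribution where MicroGPT would call its transformer:

```python
import random

def sample_name(next_token_probs, max_len=16, seed=None):
    """Sample tokens until the BOS/EOS delimiter (id 0) recurs."""
    rng = random.Random(seed)
    tokens = [0]  # start from the delimiter
    while len(tokens) < max_len:
        probs = next_token_probs(tokens)  # distribution over 27 tokens
        nxt = rng.choices(range(27), weights=probs)[0]
        if nxt == 0:  # delimiter again: the name is finished
            break
        tokens.append(nxt)
    return "".join(chr(ord("a") + t - 1) for t in tokens[1:])

# stand-in "model": uniform over letters, small chance of stopping
uniform = lambda toks: [0.2] + [0.8 / 26] * 26
print(sample_name(uniform, seed=42))
```

With the trained model supplying `next_token_probs`, the same loop produces the phonotactically plausible outputs described above.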
Karpathy frames this explicitly: from the model’s perspective, a ChatGPT conversation is structurally the same as a name, just a longer document. The response is statistical completion, not comprehension.
Practical Implications
MicroGPT isn’t a tool for production inference. Its value is pedagogical and architectural. For anyone building on or debugging transformer systems, this is now the most compact reference implementation available: everything that matters, nothing that doesn’t.
The 200-line constraint also serves as an implicit specification. If you can’t explain what a component does in terms of MicroGPT’s primitives, you’re dealing with an optimization detail, not an algorithmic requirement. That distinction matters when evaluating new model architectures or deciding what to strip from a deployment.
The code is available as a GitHub Gist, a standalone web page at karpathy.ai, and a runnable Google Colab notebook [1].
References
1. Karpathy, A. “microgpt.” February 12, 2026. http://karpathy.github.io/2026/02/12/microgpt/
2. MicroGPT source code (GitHub Gist): https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95