Karpathy Ships MicroGPT: A Complete LLM in 200 Lines of Pure Python
Andrej Karpathy published MicroGPT on February 12, 2026: a single Python file containing every algorithmic component required to train and run a GPT-class language model. The script is 200 lines long, requires no external libraries, and implements tokenization, automatic differentiation, a GPT-2-style transformer, the Adam optimizer, and both training and inference loops from scratch.
The model uses 4,192 parameters. For comparison, the largest GPT-2 model shipped with roughly 1.5 billion. Karpathy describes MicroGPT as the endpoint of a multi-year reduction effort spanning several prior projects (micrograd, makemore, and nanoGPT), each of which isolated a different piece of the LLM stack.
What’s in the File
MicroGPT packs the entire LLM pipeline into a linear script with no classes beyond a single Value type for automatic differentiation. Here’s what each section covers:
| Component | Implementation | Notes |
|---|---|---|
| Dataset | 32,000 names from names.txt | Each name treated as a separate document |
| Tokenizer | Character-level, 27 tokens | 26 lowercase letters + BOS delimiter |
| Autograd | Custom Value class with 6 ops | add, mul, pow, log, exp, relu |
| Architecture | GPT-2 variant | RMSNorm (not LayerNorm), ReLU (not GeLU), no biases |
| Optimizer | Adam | Standard implementation |
| Parameters | 4,192 total | 16-dim embeddings, 4 attention heads, 1 layer, 16-token context |
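The tokenizer row above is simple enough to sketch in full. The snippet below is an illustrative character-level tokenizer with the same 27-token vocabulary (26 lowercase letters plus a BOS delimiter); the token ids and function names are assumptions for illustration, not the script's exact code:

```python
# Illustrative character-level tokenizer in the spirit of MicroGPT's
# 27-token vocabulary: 26 lowercase letters plus a BOS delimiter.
BOS = 0  # delimiter token marking document (name) boundaries
chars = "abcdefghijklmnopqrstuvwxyz"
stoi = {ch: i + 1 for i, ch in enumerate(chars)}  # 'a' -> 1 ... 'z' -> 26
itos = {i: ch for ch, i in stoi.items()}

def encode(name: str) -> list[int]:
    """Wrap a name in BOS delimiters so each name is its own document."""
    return [BOS] + [stoi[c] for c in name] + [BOS]

def decode(tokens: list[int]) -> str:
    """Drop delimiters and map ids back to characters."""
    return "".join(itos[t] for t in tokens if t != BOS)

print(encode("emma"))          # [0, 5, 13, 13, 1, 0]
print(decode(encode("emma")))  # emma
```

Treating each name as a delimiter-bounded document is what lets a 16-token context window cover nearly every training example.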
Architecture Details
The transformer configuration is deliberately minimal: embedding dimension of 16, a single layer, 4 attention heads with head dimension of 4, and a maximum sequence length of 16 tokens. The architecture follows GPT-2’s structure but swaps in RMSNorm for LayerNorm and ReLU for GeLU, eliminating bias terms entirely. These substitutions reduce code complexity without fundamentally changing the model’s behavior at this scale.
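The RMSNorm-for-LayerNorm swap is easy to see in code. This is a hedged sketch on plain Python lists (function names are mine, not MicroGPT's): RMSNorm skips the mean subtraction and carries no learned bias, which is exactly the complexity reduction the article describes.

```python
def rms_norm(x, eps=1e-5):
    """RMSNorm: scale x by the reciprocal of its root-mean-square."""
    ms = sum(v * v for v in x) / len(x)   # mean of squares
    scale = (ms + eps) ** -0.5
    return [v * scale for v in x]

def layer_norm(x, eps=1e-5):
    """LayerNorm for comparison: center on the mean, then scale."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    scale = (var + eps) ** -0.5
    return [(v - mu) * scale for v in x]

print(rms_norm([1.0, 2.0, 3.0, 4.0]))
print(layer_norm([1.0, 2.0, 3.0, 4.0]))
```

Both normalize the vector's scale; only LayerNorm also forces the mean to zero, and in practice the difference is small at this model size.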
The autograd engine implements scalar-level backpropagation: algorithmically identical to what PyTorch runs on tensors, but operating on individual floating point values. This makes it orders of magnitude slower than production frameworks but transparent enough to step through with a debugger.
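The scalar autograd idea can be sketched in a few dozen lines, in the micrograd style that MicroGPT builds on: each Value records its inputs and a local backward rule, and `backward()` applies the chain rule in reverse topological order. This is an illustration of the concept (two of the six ops), not the MicroGPT source itself.

```python
class Value:
    """A scalar that tracks how it was computed, for backpropagation."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None  # local chain-rule step

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():  # d(a+b)/da = d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():  # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological sort, then propagate gradients from the output
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a   # d(loss)/da = b + 1 = 4, d(loss)/db = a = 2
loss.backward()
print(a.grad, b.grad)  # 4.0 2.0
```

The full script adds `pow`, `log`, `exp`, and `relu` the same way; everything else, including attention and Adam, is built from these scalar ops.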
What It Actually Produces
Trained on the names dataset, the model generates plausible synthetic names, sequences like “kamon,” “karai,” and “vialan” that follow English phonotactic patterns learned from the training data. The outputs are statistically consistent with the input distribution without reproducing specific entries.
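The generation loop behind those names is plain autoregressive sampling: at each step the model yields a probability distribution over the 27 tokens, and sampling continues until the delimiter reappears. A hedged sketch, with a uniform stand-in distribution where MicroGPT would call its transformer:

```python
import random

def sample_name(next_token_probs, max_len=16, seed=None):
    """Sample tokens until the BOS/EOS delimiter (id 0) recurs."""
    rng = random.Random(seed)
    tokens = [0]  # start from the delimiter
    while len(tokens) < max_len:
        probs = next_token_probs(tokens)  # distribution over 27 tokens
        nxt = rng.choices(range(27), weights=probs)[0]
        if nxt == 0:  # delimiter again: the name is finished
            break
        tokens.append(nxt)
    return "".join(chr(ord("a") + t - 1) for t in tokens[1:])

# stand-in "model": uniform over letters, small chance of stopping
uniform = lambda toks: [0.2] + [0.8 / 26] * 26
print(sample_name(uniform, seed=42))
```

With the trained model supplying `next_token_probs`, the same loop produces the phonotactically plausible outputs described above.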
Karpathy frames this explicitly: from the model’s perspective, a ChatGPT conversation is structurally the same as a name, just a longer document. The response is statistical completion, not comprehension.
Practical Implications
MicroGPT isn’t a tool for production inference. Its value is pedagogical and architectural. For anyone building on or debugging transformer systems, this is now the most compact reference implementation available: everything that matters, nothing that doesn’t.
The 200-line constraint also serves as an implicit specification. If you can’t explain what a component does in terms of MicroGPT’s primitives, you’re dealing with an optimization detail, not an algorithmic requirement. That distinction matters when evaluating new model architectures or deciding what to strip from a deployment.
The code is available as a GitHub Gist, a standalone web page at karpathy.ai, and a runnable Google Colab notebook [1].
References
1. Karpathy, A. “microgpt.” February 12, 2026. http://karpathy.github.io/2026/02/12/microgpt/
2. MicroGPT source code (GitHub Gist): https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95