How a Frontier Model Is Trained

Written by Claude (Anthropic's Claude Opus 4.8). Miguel Jackson was the editor — he set the scope, checked the claims and citations, and approved publication. This is a high-level overview, not a lab manual; the goal is an accurate mental model, not every detail.

The history page traces the architecture — how we got to the Transformer. This page is about the training: what actually happens to turn that architecture into a model that can write, reason, and code. We’ll keep it high-level but honest, and cite the sources as we go.

Everything below leans on one word the other pages use freely — a model “learns.” So let’s start there.

What is a parameter?

Before the Transformer, the workhorse was the LSTM — a recurrent cell that read text one token at a time.¹ Whether it’s an LSTM cell or a Transformer layer, the model is, underneath, a giant pile of numbers arranged into multiplications. The numbers that get adjusted during training are called parameters.

A parameter is a single learned number inside the model — usually a weight (how strongly one value influences another) or a bias (a constant offset). Training is nothing more than finding good values for all of them.

One unit: each input xᵢ is multiplied by a weight wᵢ, the products are summed together with a bias b, and the total is passed on to the next layer, usually through a simple nonlinear function. The weights and bias are the parameters. GPT‑3 had 175 billion of them; frontier models today have far more.

That single number — a parameter — is the atom of the whole enterprise. A frontier model is hundreds of billions of them — GPT-3 alone had 175 billion² — and “training” is the search for values that make the model good at one deceptively simple task: predicting the next token.
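The single unit described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any real framework stores its parameters; the function names and the example values are assumptions made up for this sketch, and the nonlinearity is left out to keep the weighted sum visible.

```python
def unit(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias: sum(w_i * x_i) + b."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def count_parameters(weights, bias):
    """Each weight is one parameter, and the bias is one more."""
    return len(weights) + 1


# Illustrative values only; in a trained model these would have been learned.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.1

output = unit(x, w, b)             # 0.5 - 2.0 + 0.75 + 0.1
n_params = count_parameters(w, b)  # three weights and one bias
print(output, n_params)
```

This unit has four parameters. A frontier model is the same idea repeated until the count reaches the hundreds of billions.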

What is a “frontier model”?

The phrase is recent. It was introduced in a 2023 policy paper, which defined it precisely:

“Frontier AI models [are] highly capable foundation models that could possess dangerous capabilities sufficient to pose severe risks to public safety.” — Anderljung et al., Frontier AI Regulation (2023)³

In plainer terms: a frontier model is one of the largest, most capable, most general models in existence at a given moment. These are the models at the leading edge of what’s possible, such as the GPT‑4-class models, the Claude family, or Claude Mythos. The same year, leading labs formed the Frontier Model Forum to coordinate on the safety of such models, cementing the term.

The training loop

Pretraining a frontier model is a single idea repeated an astronomical number of times: show it text, ask it to predict the next token, and nudge every parameter to make that prediction a little better. The size and direction of each nudge come from gradients computed by an algorithm called backpropagation.⁴

Pretraining in one diagram. The model guesses the next token; the guess is scored against the actual next token (the loss); backpropagation computes how to change each parameter to lower that loss, and an optimizer nudges all of them. Repeat this over trillions of tokens and a model that started as random numbers becomes one that writes fluent prose and working code.
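The loop can be made concrete with a toy far smaller than any real model. The sketch below assumes a character-level "bigram" model whose only parameters are one score (logit) per pair of previous and next character, and it uses plain gradient descent as the optimizer. Because the model has a single layer, the gradient that backpropagation would compute has a simple closed form (predicted probabilities minus the one-hot target), which is written out directly. The text, learning rate, and number of passes are arbitrary choices for illustration, and the parameters start at zero rather than at random values to keep the run reproducible.

```python
import math

text = "abababababababab"
vocab = sorted(set(text))
index = {ch: i for i, ch in enumerate(vocab)}
tokens = [index[ch] for ch in text]
V = len(vocab)

# Parameters: one logit per (previous token, next token) pair.
logits = [[0.0] * V for _ in range(V)]


def softmax(row):
    """Turn a row of logits into probabilities that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def average_loss():
    """Mean cross-entropy of the model's next-token predictions over the text."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        probs = softmax(logits[prev])
        losses.append(-math.log(probs[nxt]))
    return sum(losses) / len(losses)


learning_rate = 0.5
loss_before = average_loss()

for _ in range(100):  # passes over the tiny dataset
    for prev, nxt in zip(tokens, tokens[1:]):
        probs = softmax(logits[prev])  # the model's guess for the next token
        for j in range(V):
            # Gradient of the cross-entropy loss with respect to logit j.
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            # Optimizer step: move the parameter against its gradient.
            logits[prev][j] -= learning_rate * grad

loss_after = average_loss()
print(loss_before, loss_after)
```

The loss starts at ln 2 (a coin flip between two characters) and falls toward zero as the model learns that "a" is followed by "b" and vice versa. A real run differs in scale and architecture, not in the shape of the loop.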

Two things make this work that weren’t obvious in advance:

The task is free to supervise. Nobody hand-labels the data: the “right answer” is just the token that actually came next in the text. That’s why models can train on a large fraction of the public internet.
Capability is a byproduct. To predict the next token well across billions of documents, a model is forced to internalize grammar, facts, reasoning patterns, and code. Competence emerges as a side effect of getting good at prediction.
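The first point, that the labels come for free, can be shown directly. The sketch below turns a sentence into (context, next token) training pairs with no human labeling. Splitting on whitespace is a stand-in assumption for a real tokenizer, and `next_token_pairs` and `context_size` are hypothetical names chosen for this example.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) pairs where the target is simply the next token."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


# Whitespace splitting is an illustrative stand-in for a real tokenizer.
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=3)

for context, target in pairs:
    print(context, "->", target)
```

Six tokens yield five training examples, and every target was already sitting in the text.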

After pretraining: alignment

A freshly pretrained model is a brilliant autocomplete, not a helpful assistant. A second, much smaller stage — instruction tuning and reinforcement learning from human feedback (RLHF) — teaches it to follow instructions and behave helpfully and harmlessly.⁵ This is the step that turns a raw next-token predictor into something you’d actually want to chat with.
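One ingredient of RLHF as described in the cited InstructGPT paper is a reward model trained on human comparisons between two responses. The sketch below shows only the pairwise loss used for that step, -log(sigmoid(r_chosen - r_rejected)); the reward values passed in are made-up numbers, and the function name is an assumption for this example. The rest of the pipeline (collecting comparisons, then optimizing the model against the learned reward) is not shown.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(margin)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    if margin > 0:
        # log(1 + exp(-margin)) is safe when margin is positive.
        return math.log1p(math.exp(-margin))
    # Equivalent form that avoids overflow when margin is very negative.
    return -margin + math.log1p(math.exp(margin))


# Made-up reward scores for illustration.
agrees_with_human = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_human = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees_with_human, disagrees_with_human)
```

The loss is small when the reward model scores the human-preferred response higher and large when it gets the order wrong, which is the signal that pushes it toward human judgments.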

Why the GPU matters so much

Every step of that loop is, concretely, enormous matrix multiplication — the same small arithmetic (multiply, add) repeated billions of times over. A CPU has a handful of powerful cores that mostly work in sequence. A GPU has thousands of small cores that do that arithmetic all at once — which happens to be the exact shape of Transformer math.

A picture worth a thousand words: training is mostly the same tiny operation repeated at massive scale. The GPU's thousands of parallel cores match that shape exactly — which is why a frontier run uses thousands of GPUs for weeks, and why GPUs, not CPUs, are the engine of modern AI.
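To see why the work parallelizes so well, it helps to write matrix multiplication out by hand. The pure-Python sketch below is deliberately naive and sequential, which is the opposite of how a GPU runs it; the point is that every output cell is an independent string of multiply-adds, so nothing prevents thousands of cores from computing different cells at the same time. The counter is an addition made for this illustration.

```python
def matmul(a, b):
    """Naive matrix product that also counts the multiply-add operations."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    out = [[0.0] * cols for _ in range(rows)]
    multiply_adds = 0
    for i in range(rows):
        for j in range(cols):
            # Cell (i, j) depends only on row i of a and column j of b,
            # so each cell could be computed by a different core.
            for k in range(inner):
                out[i][j] += a[i][k] * b[k][j]
                multiply_adds += 1
    return out, multiply_adds


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product, ops = matmul(a, b)
print(product, ops)
```

A 2×2 product needs 8 multiply-adds. The count grows as rows × inner × columns, so the matrices inside a large model require billions of them per step.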

Stack that up and the numbers get staggering: a frontier training run can involve thousands of GPUs running for weeks, consuming megawatts of power and costing tens to hundreds of millions of dollars. How far this brute-force approach can keep paying off — and whether the economics and physics eventually bite — is the subject of the companion page.

The reason “just add compute” produces predictably better models (up to a point) is captured by the scaling laws: loss falls as a smooth power law in model size, data, and compute.⁶ The 2022 Chinchilla result refined the recipe, showing that for a fixed compute budget, training data and model size should grow together in roughly equal proportion.⁷ Those same curves are why the next question is unavoidable.
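A little arithmetic makes these relationships tangible. The sketch below uses two widely quoted approximations: training compute of about 6 floating-point operations per parameter per training token, and the rule of thumb of roughly 20 training tokens per parameter often drawn from the Chinchilla result. Both are approximations, not exact laws. The `scale` and `exponent` values passed to the power-law function are made-up placeholders, not fitted constants from either paper.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb token budget; the ratio of 20 is an approximation."""
    return n_params * tokens_per_param


def power_law_loss(n_params, scale, exponent):
    """Power-law form: loss = (scale / n_params) ** exponent."""
    return (scale / n_params) ** exponent


# Chinchilla itself: 70 billion parameters.
n_params = 70_000_000_000
n_tokens = compute_optimal_tokens(n_params)
flops = training_flops(n_params, n_tokens)
print(n_tokens, flops)

# Placeholder constants, chosen only to show the shape of the curve.
small = power_law_loss(1e9, scale=1e14, exponent=0.08)
large = power_law_loss(2e9, scale=1e14, exponent=0.08)
print(small, large, large / small)
```

Under the placeholder exponent, doubling the model size multiplies the loss by a constant factor of 2 to the power of -0.08, about 0.946, no matter where on the curve you start. That steady but shrinking payoff is what "smooth power law" means in practice.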

Continue → Is There a Ceiling?

References

1. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. bioinf.jku.at
2. Brown, T. B., et al. (2020). Language Models are Few-Shot Learners (GPT-3, 175B parameters). arXiv:2005.14165
3. Anderljung, M., Barnhart, J., Korinek, A., et al. (2023). Frontier AI Regulation: Managing Emerging Risks to Public Safety. arXiv:2307.03718. The paper introduced the term and proposed the definition quoted above; the same year, Anthropic, Google, Microsoft, and OpenAI launched the Frontier Model Forum.
4. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536. The algorithm behind how each parameter is adjusted.
5. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback (InstructGPT / RLHF). arXiv:2203.02155
6. Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361
7. Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556
