The Physics of Large Models
1. Predictable Progress.
In the early days of deep learning, scaling was an exercise in trial and error. Today, it is a branch of predictive physics. Scaling Laws are empirical relationships that describe how the test loss of a Transformer model decreases as you increase three variables: the number of parameters (N), the amount of training data (D), and the total compute used for training (C).
The power of these laws lies in their predictability. By training a series of small "pilot" models (e.g., from 10M to 1B parameters), engineers can extrapolate, often with useful accuracy, how a 175B or 1T parameter model will perform before investing tens of millions of dollars in GPU time. This predictability is what allows the AI industry to sustain its relentless pace of advancement. Without these laws, the risk of a training run failing to meet performance targets would be too great for even the largest tech giants.
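As a rough illustration of the pilot-model workflow described above, the sketch below fits a pure power law to a handful of small runs and extrapolates it. The pilot data is synthetic (generated from a known law with illustrative constants), and the function names are hypothetical; real loss curves also include an irreducible term that this simple form ignores.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss = A * compute**(-alpha) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)  # (A, alpha)

def extrapolate(A, alpha, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return A * compute ** (-alpha)

# Synthetic pilot runs drawn from a known law (illustrative numbers only).
pilot_compute = np.array([1e17, 1e18, 1e19, 1e20])  # FLOPs
pilot_loss = 400.0 * pilot_compute ** -0.05

A, alpha = fit_power_law(pilot_compute, pilot_loss)
print(f"Fitted A={A:.1f}, alpha={alpha:.3f}")
print(f"Extrapolated loss at 1e24 FLOPs: {extrapolate(A, alpha, 1e24):.3f}")
```

Fitting in log-log space turns the power law into a straight line, which is why a few well-spaced pilot runs can be enough to estimate the exponent.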
We observe that scaling is not merely about "bigger is better," but about the harmony of resources. If you scale parameters without scaling data, the model becomes a "memorizer" with poor generalization. If you scale data without parameters, the model hits a "capacity ceiling" where it cannot absorb new information regardless of how many tokens it sees. The scaling laws provide the exact ratio for this harmony.
The Formalism of Loss.
The relationship between compute, data, and parameters is not merely linear; it follows a power-law distribution. The generalized loss function can be expressed as:
The Unified Scaling Law for Transformer Loss
Kaplan's original findings (2020) suggested that, for a fixed compute budget, increasing model size (N) was significantly more effective at reducing loss than increasing data (D). However, this "Model-First" philosophy led to models like GPT-3 being severely under-trained, as they possessed more "capacity" than they had "experience" (tokens).
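For orientation, one widely cited parametric form is the one fitted by Hoffmann et al. (2022). It is offered here as a reference point and may differ from the exact expression this section has in mind:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Here $E$ is the irreducible loss, and $A$, $B$, $\alpha$, $\beta$ are constants fitted to experimental runs. The relative size of $\alpha$ and $\beta$ determines whether extra parameters or extra tokens reduce loss faster.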
Chinchilla vs. Kaplan: The Battle of Formulas.
The history of LLM scaling is divided into two eras: the Kaplan Era (2020) and the Chinchilla Era (2022). Kaplan et al. originally posited that model performance was most sensitive to parameter count (N), leading to the creation of massive dense models like GPT-3 that were significantly under-trained relative to their potential.
DeepMind's Hoffmann et al. (Chinchilla) overturned this by showing empirically that for a given compute budget (C), the model size (N) and the number of training tokens (D) should be scaled in equal proportions. The results were startling:
- Kaplan (2020): Predicted that roughly 70% of additional compute should go to parameters.
- Chinchilla (2022): 50/50 split: each doubling of compute should be divided equally between parameters and tokens.
The Chinchilla paper derived that the Data-to-Parameter Ratio should be roughly 20:1. For every 1 parameter, you need at least 20 tokens of training data to reach compute optimality.
Optimal Allocation Logic
Following this logic, a 175B model requires 3.5 Trillion tokens to be compute-optimal. GPT-3 was trained on only 300 Billion tokens, under 10% of that compute-optimal token count, so it was substantially under-trained for its size.
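The arithmetic above can be sketched in a few lines. This assumes the 20:1 rule of thumb quoted in this section together with the C ≈ 6ND approximation for training FLOPs covered in the next section; the function names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0       # Chinchilla rule of thumb quoted above
FLOPS_PER_PARAM_TOKEN = 6.0   # 6ND approximation for training compute

def optimal_tokens(n_params):
    """Compute-optimal token count for a given parameter count."""
    return TOKENS_PER_PARAM * n_params

def compute_optimal_allocation(compute_flops):
    """Split a FLOP budget into (N, D) with D = 20 N and C = 6 N D = 120 N^2."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

gpt3_params, gpt3_tokens = 175e9, 300e9
needed = optimal_tokens(gpt3_params)
print(f"Compute-optimal tokens for 175B params: {needed:.2e}")
print(f"GPT-3 tokens as a fraction of that: {gpt3_tokens / needed:.1%}")

# Re-spend GPT-3's approximate training budget under the 20:1 rule.
n_opt, d_opt = compute_optimal_allocation(6 * gpt3_params * gpt3_tokens)
print(f"Same budget, compute-optimal: {n_opt:.2e} params, {d_opt:.2e} tokens")
```

Under these assumptions, the same budget would favor a model of roughly 51B parameters trained on roughly 1T tokens.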
4. The 6ND Approximation.
When calculating the compute cost (C) of training a transformer, engineers use the 6ND rule of thumb. This approximation states that the total number of floating-point operations (FLOPs) required to train a model is roughly 6 times the product of the number of parameters (N) and the number of training tokens (D).
Total Training Compute (FLOPs): $C \approx 6ND$
Where does the 6 come from? It is the sum of the forward and backward passes. For each token, the forward pass requires approximately 2N operations (one multiply-add per parameter). The backward pass is twice as expensive as the forward pass (4N ops), as it requires calculating both the gradients with respect to the activations and the gradients with respect to the weights.
This 6ND rule assumes no overhead from activation recomputation (gradient checkpointing). In reality, to save VRAM on massive models, engineers often recompute the forward pass during the backward pass. This increases the constant from 6 to 8 (roughly 8ND), because the forward pass (2ND) is executed a second time. This tradeoff between compute and memory is a central decision in large-scale cluster orchestration.
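A minimal sketch of this accounting is shown below. It assumes full recomputation of the forward pass (selective recomputation would land between the two constants), and the 70B / 1.4T example configuration is purely illustrative.

```python
def training_flops(n_params, n_tokens, recompute=False):
    """Approximate training FLOPs: 2ND forward + 4ND backward (+2ND if recomputing)."""
    forward = 2.0 * n_params * n_tokens
    backward = 4.0 * n_params * n_tokens
    extra_forward = forward if recompute else 0.0  # second forward pass
    return forward + backward + extra_forward

n_params, n_tokens = 70e9, 1.4e12  # illustrative configuration
base = training_flops(n_params, n_tokens)
ckpt = training_flops(n_params, n_tokens, recompute=True)
print(f"6ND estimate: {base:.3e} FLOPs")
print(f"8ND estimate: {ckpt:.3e} FLOPs ({ckpt / base - 1:.0%} more compute)")
```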
Hardware-Aware Realities.
Pure scaling laws assume "perfect" hardware where every FLOP is useful. In reality, scaling is bounded by the Model FLOP Utilization (MFU). As models scale, the ratio of communication (inter-node latency) to computation (intra-GPU math) increases, often causing the "observed" scaling law to flatten prematurely.
Memory Bound: Scaling N (parameters) is limited by HBM3e capacity. If the model state no longer fits in GPU memory, throughput collapses due to off-chip swaps.
IO Bound: Large D (data) scaling requires massive disk-to-GPU bandwidth. The pipeline stalls if the storage path (e.g., GPUDirect Storage, GDS) cannot feed the GPUs fast enough.
Power Bound: The "Gigawatt Frontier": Scaling compute (C) eventually hits the limits of the data center's power delivery and cooling infrastructure.
Precision Scaling.
Most scaling laws assume training in FP16 or BF16. However, the industry is rapidly transitioning to FP8 and even INT4 for massive training runs. Precision scaling describes how reducing the number of bits per parameter affects the scaling coefficient.
Experimental data shows that while lower precision introduces "quantization noise," it allows for a larger number of parameters (N) to fit within the same memory and compute budget. For models beyond 500B parameters, the efficiency gains of FP8 often outweigh the slight increase in loss, effectively "stretching" the scaling law along the compute axis.
Bits/Parameter vs. Efficiency (H100/B200 Baseline)
- BF16 (Reference): 1.0x Density / 1.0x TFLOPS / ~100% Signal
- FP8 (Performance): 2.0x Density / 2.0x TFLOPS / ~99.8% Signal
- FP4 (Next-Gen): 4.0x Density / 4.0x TFLOPS / ~98.5% Signal
- INT4 (Extreme): 4.0x Density / 8.0x TFLOPS (Specialized) / ~95% Signal
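The "Density" column follows directly from bits per parameter. The sketch below reproduces it and estimates weight storage only; it deliberately ignores optimizer state, gradients, activations, and any scaling-factor metadata, all of which add to the real footprint. The 500B example size is illustrative.

```python
BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "FP4": 4, "INT4": 4}

def weight_gib(n_params, fmt):
    """Storage for the weights alone, in GiB, at the given precision."""
    return n_params * BITS_PER_PARAM[fmt] / 8 / 2**30

def density_vs_bf16(fmt):
    """How many parameters fit in the space of one BF16 parameter."""
    return BITS_PER_PARAM["BF16"] / BITS_PER_PARAM[fmt]

n_params = 500e9  # illustrative model size
for fmt in BITS_PER_PARAM:
    print(f"{fmt:>5}: {weight_gib(n_params, fmt):8.1f} GiB weights, "
          f"{density_vs_bf16(fmt):.1f}x density")
```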
Quality as a Multiplier.
Chinchilla optimality assumes constant data quality. However, not all tokens are created equal. High-quality tokens (textbooks, verified code, logic puzzles) have a significantly higher Loss-Reduction Efficiency than low-quality web scrapes.
By filtering for quality, researchers can effectively shift the scaling curve downward. Training on a "Pure" dataset of 100B tokens can sometimes result in the same test loss as training on 500B tokens of "Noisy" data.
Curriculum and Annealing.
The order in which data is presented matters—a concept known as Curriculum Learning. Modern scaling recipes often involve "Data Annealing," where the model is trained on a massive, diverse dataset for 95% of the run, followed by a high-intensity "cool down" phase on ultra-high-quality data.
This final stage effectively "resolves" the model's knowledge, collapsing the final 5% of loss much faster than the initial power law would predict. Analyzing the scaling laws during the annealing phase reveals a "Breakthrough Regime" where intelligence gains become super-linear for a short period as the model synthesizes its latent representations.
Scaling Mixture of Experts (MoE).
The scaling laws for Mixture of Experts (MoE) follow a different trajectory than dense models. In an MoE architecture, the model has a large total parameter count ($N_{\text{total}}$), but only a fraction of those parameters ($N_{\text{active}}$) is active for any single token.
While dense models follow a strict 6ND compute cost, MoE models allow for Knowledge Scaling without Compute Scaling. By increasing the total number of experts, we can increase the model's "capacity" (memory) without increasing the FLOPs required for inference.
The Sparse Compute Advantage
However, this introduces a Routing Penalty where the scaling efficiency drops slightly as the expert count increases due to load-balancing overhead and communication latency across GPU nodes (the "All-to-All" bottleneck).
MoE Scaling Characteristics
Sub-Linear Compute: Intelligence scales with total parameters ($N_{\text{total}}$), but marginal cost scales with active parameters ($N_{\text{active}}$).
Expert Diversity: MoE requires higher data diversity (D) to ensure all experts (E) are conditioned correctly.
Communication Bound: Scaling from 8 to 64 experts can increase training time by roughly 2x due to All-to-All communication latency, despite identical FLOPs.
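The gap between total and active parameters can be made concrete with a small sketch. It assumes training compute is approximated by 6 × (active parameters) × D, and it ignores router FLOPs and All-to-All communication time. The layer sizes and top-2 routing are hypothetical, chosen only to show the shape of the tradeoff.

```python
def moe_param_counts(shared_params, params_per_expert, n_experts, top_k):
    """Return (total, active-per-token) parameter counts for a simple MoE."""
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

def moe_training_flops(active_params, n_tokens):
    """6ND-style estimate using only the parameters each token touches."""
    return 6.0 * active_params * n_tokens

n_tokens = 1e12  # illustrative token budget
for n_experts in (8, 64):
    total, active = moe_param_counts(10e9, 5e9, n_experts, top_k=2)
    flops = moe_training_flops(active, n_tokens)
    print(f"{n_experts:>2} experts: total={total:.2e}, active={active:.2e}, "
          f"train FLOPs={flops:.2e}")
```

Going from 8 to 64 experts multiplies total capacity several times over while the FLOP estimate stays flat, which is the "identical FLOPs" case in the list above; the wall-clock penalty comes from communication, which this sketch does not model.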
Distributed Scaling Overheads.
As models grow beyond the memory of a single GPU, we must use 3D Parallelism (Data, Tensor, and Pipeline). Each of these introduces a "scaling tax" that bends the 6ND compute curve.
The Sources of Scaling Inefficiency
- 01. Pipeline Bubbles: When using Pipeline Parallelism (PP), certain GPUs sit idle while waiting for the activations from previous stages. This idle time (the "bubble") increases with the number of stages, reducing effective compute efficiency.
- 02. Communication Quantization: Synchronizing gradients across 10,000+ GPUs requires massive bandwidth. If the network is slow, the GPUs must wait for "All-Reduce" operations, turning a compute-bound problem into a network-bound one.
- 03. Memory Fragmentation: Zero Redundancy Optimizer (ZeRO) stages reduce memory pressure but add communication steps. The "Scaling-Efficiency Coefficient" represents the ratio of delivered TFLOPS to peak theoretical TFLOPS.
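The pipeline bubble in particular has a simple closed form for a GPipe-style schedule: with p stages and m micro-batches per step, the idle fraction is (p - 1) / (m + p - 1). The sketch below assumes equal-cost stages and no interleaving; interleaved schedules reduce the bubble further.

```python
def pipeline_bubble_fraction(n_stages, n_microbatches):
    """Idle fraction of a GPipe-style pipeline with equal-cost stages."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)

for n_stages in (4, 8, 16):
    for n_microbatches in (8, 64):
        frac = pipeline_bubble_fraction(n_stages, n_microbatches)
        print(f"stages={n_stages:>2}, microbatches={n_microbatches:>2}: "
              f"bubble={frac:.1%}")
```

More stages grow the bubble, while more micro-batches shrink it, which is why deep pipelines are usually paired with many small micro-batches.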
Implementation: Loss Extrapolation.
```python
def predict_test_loss(C, alpha=0.05, A=400):
    """
    Extrapolates test loss based on compute budget using the Kaplan power law.
    L(C) = A * C^(-alpha)
    """
    # alpha and A are illustrative defaults; in practice they are fitted to pilot runs.
    return A * (C ** (-alpha))

# Example: Extrapolating from a 1B model to 100B
small_model_compute = 1e20   # FLOPs
target_model_compute = 1e24  # FLOPs
current_loss = predict_test_loss(small_model_compute)
predicted_loss = predict_test_loss(target_model_compute)
print(f"Predicted Loss Improvement: {current_loss - predicted_loss:.4f}")
```
Data-Constrained Regimes.
What happens when a model's optimal data requirement exceeds the total volume of high-quality human text on the internet? This is the Data Ceiling. In a data-constrained regime, scaling laws begin to "bend."
Recent research suggests that training for multiple epochs (roughly 4 to 10 passes) over the same dataset can still yield improvements, though with diminishing returns compared to fresh tokens. The scaling coefficient effectively decays, necessitating a move toward Multi-Modal Data Scaling (training on video/audio) and Synthetic Data Augmentation.
Epoch Scaling Efficiency: Diminishing Returns of Multi-Epoch Training
Inference-Time Scaling (System 2).
The next frontier of scaling is not in training, but in Inference. Traditional transformers are "System 1" thinkers—they produce tokens in a single forward pass with a fixed compute-per-token cost. Inference Scaling allows a model to "think harder" by using more compute at the moment of generation.
This paradigm shift, popularised by models like OpenAI o1, introduces a new scaling law where accuracy scales as a function of Test-Time Compute. This involves techniques such as:
- Training a separate "Reward Model" or "Outcome Verifier" to judge 1,000+ candidate responses. Accuracy increases log-linearly with the number of samples.
- Using Monte Carlo Tree Search (MCTS) or Chain-of-Thought (CoT) to explore deep reasoning paths. For mathematical and coding tasks, this "System 2" compute can bridge the gap between an 8B and a 400B model.
The Inference Scaling Law
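To give a feel for the sampling-based approach, the sketch below uses an idealized model: every sample is independent, each is correct with probability p, and the verifier always picks a correct sample when one exists. Real verifiers are imperfect, so this is an upper bound rather than the empirical law described above. Inference cost is approximated as 2N FLOPs per generated token, and all numbers are illustrative.

```python
def best_of_n_success(p_single, n_samples):
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** n_samples

def sampling_flops(n_params, tokens_per_sample, n_samples):
    """Approximate inference FLOPs: one forward pass (2N) per generated token."""
    return 2.0 * n_params * tokens_per_sample * n_samples

p_single = 0.05  # hypothetical single-sample accuracy
for n_samples in (1, 10, 100, 1000):
    print(f"n={n_samples:>4}: success<={best_of_n_success(p_single, n_samples):.3f}, "
          f"FLOPs={sampling_flops(8e9, 1000, n_samples):.2e}")
```

One consequence of the 2N estimate is that an 8B model drawing 50 samples spends about the same FLOPs as a 400B model drawing one, which is the budget comparison behind the 8B versus 400B remark above.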
The Thermodynamic Limit.
Eventually, scaling hits a physical wall: Power Density. As we scale clusters toward the "Gigawatt Era," the cost of cooling and power delivery begins to exceed the cost of the GPUs themselves.
Future scaling laws will likely need to incorporate Energy Efficiency (Tokens per Joule) as a primary variable. We are moving from a regime of "Intelligence at any cost" to one of "Intelligence within a fixed energy envelope," driving the development of specialized optical interconnects and advanced liquid-cooling manifolds to maintain the power-law trajectory.
Grokking: The Phase Transition.
Scaling laws are usually smooth power curves. However, certain capabilities—like mathematical reasoning or symbolic logic—exhibit Grokking, a phenomenon where a model's performance on a specific task suddenly jumps from 0% to nearly 100% after a critical training duration.
This suggests that the model is initially memorizing samples but eventually "discovers" the underlying algorithmic structure. Scaling the compute beyond the point of standard "convergence" can sometimes trigger these latent phase transitions, turning a mediocre model into a specialist.
Multi-Modal Scaling Laws.
Scaling Vision Transformers (ViT) and Video models (Sora/Lumina) follows a similar power law, but with different exponents. For vision, the scaling is often bound by Patch Size and Resolution.
Research into VLMs (Vision-Language Models) suggests that the optimal parameter allocation shifts toward a 1:1 ratio between vision encoders and language decoders for high-reasoning tasks. However, video scaling is uniquely constrained by Temporal Resolution: with naive self-attention, the compute cost scales quadratically with the number of spatio-temporal tokens, so it grows rapidly with the length of the video sequence, necessitating the shift toward Diffusion Transformers (DiT).
Philosophy: Bits vs. Intelligence.
At its core, scaling laws treat intelligence as a compression problem. If a model can predict the next token with high accuracy, it has successfully "compressed" the knowledge of the training set. The more bits of information we feed the model (D) and the more bits of capacity it has (N), the more intelligence we can extract.
This led to the "Scale is All You Need" philosophy, which argues that we don't need "better" algorithms, just bigger ones. While true for foundational capabilities, the industry is now moving toward Context Scaling (long-context windows) and Action Scaling (training on real-world trajectories) to push the frontier further.
The Alignment Scaling Law.
Can safety be scaled? Research into RLHF (Reinforcement Learning from Human Feedback) suggests that model "helpfulness" and "harmlessness" also follow predictable scaling patterns. However, alignment scaling often encounters a Safety-Utility Tradeoff, where pushing the model too hard toward strict alignment can cause a slight decay in its raw reasoning capabilities (the "Alignment Tax").
The future of scaling laws involves Constitutional AI, where a small set of human principles is used to supervise the scaling of a much larger child model, ensuring that as compute grows, the model's value alignment remains robustly locked to human intent.
The era of guessing the behavior of large neural networks is over. Transformer Scaling Laws have provided the industry with a predictive toolkit that rivals the precision of mechanical engineering. As we move toward larger, more complex systems—incorporating sparse architectures, inference-time reasoning, and multi-modal integration—these laws will continue to evolve, but the core principle remains: Intelligence is predictable if you know the math of your compute.
For the network architect, this means that the network is no longer a "support" system; it is the fundamental constraint on the scaling of intelligence. Bridging the gap between the GPU's memory bandwidth and the fabric's collective communication speed is the only way to stay on the power-law curve.
19. Scaling for Long Context.
Standard scaling laws assume a fixed context window (e.g., 2k or 4k tokens). However, as context windows scale to 1M+ tokens, a new scaling dimension emerges. The compute cost of standard self-attention scales quadratically ($O(n^2)$) with context length $n$.
This means that for ultra-long context models, the compute budget shifts from "learning new weights" to "processing the prompt." Techniques like FlashAttention-3 and Ring Attention attempt to tame these costs (for example, by keeping attention memory linear in sequence length or by sharding the sequence across devices), but the fundamental "Memory Wall" of the KV-Cache remains. Scaling context effectively requires scaling the HBM3e capacity of the cluster faster than the TFLOPS.
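Both effects can be estimated with back-of-the-envelope formulas. The sketch below assumes a standard decoder with grouped-query attention, 2-byte cache entries, and non-causal FLOP counting (causal masking roughly halves the attention figure). The layer and head counts are a hypothetical configuration, not a specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """KV-cache size: keys and values (factor 2) for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

def attention_flops(n_layers, n_heads, head_dim, seq_len):
    """FLOPs for QK^T plus attention-weighted V: 4 * n^2 * d per head per layer."""
    return n_layers * n_heads * 4 * seq_len**2 * head_dim

# Hypothetical configuration: 32 layers, 32 query heads, 8 KV heads, head_dim 128.
for seq_len in (4_096, 131_072, 1_000_000):
    cache_gib = kv_cache_bytes(32, 8, 128, seq_len) / 2**30
    flops = attention_flops(32, 32, 128, seq_len)
    print(f"n={seq_len:>9,}: KV cache={cache_gib:8.1f} GiB, "
          f"attention FLOPs={flops:.2e}")
```

The cache grows linearly with context while attention compute grows quadratically, so at 1M tokens a single sequence can already exceed the memory of one accelerator in this configuration.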
20. Inference-Optimal Scaling.
Chinchilla optimality minimizes *training cost*. However, if you plan to serve a model to 100 million users, the *inference cost* becomes the dominant expense. In this regime, it is actually "Optimal" to over-train a small model.
By training a 7B or 8B parameter model on 15 Trillion tokens (far beyond the Chinchilla point), we create a model that is vastly more capable than its size suggests. This "small but over-trained" model saves billions in inference TCO because it requires fewer GPUs to serve and has lower latency, even if the training run was "inefficient" by raw FLOP standards.
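A rough way to see where over-training pays off is to add training and serving compute together. The sketch below assumes 6ND for training and 2N FLOPs per served token, and compares a small over-trained model with a larger Chinchilla-style one. Whether the two reach comparable quality is an assumption of the comparison, not something the arithmetic shows; the 70B / 1.4T configuration is illustrative.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training (6ND) plus inference (2N per served token) FLOPs."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * served_tokens

def break_even_served_tokens(small, large):
    """Served-token volume at which the small model's lifetime cost matches the large one's."""
    (n_s, d_s), (n_l, d_l) = small, large
    extra_training = 6.0 * (n_s * d_s - n_l * d_l)
    saving_per_token = 2.0 * (n_l - n_s)
    return extra_training / saving_per_token

small = (8e9, 15e12)    # 8B parameters, 15T training tokens
large = (70e9, 1.4e12)  # 70B parameters, 1.4T training tokens (20:1)

tokens = break_even_served_tokens(small, large)
print(f"Break-even at about {tokens:.2e} served tokens")
for served in (1e11, 1e12, 1e13):
    s = lifetime_flops(*small, served)
    l = lifetime_flops(*large, served)
    print(f"served={served:.0e}: small={s:.2e}, large={l:.2e}")
```

Past the break-even point every additional served token favors the smaller model, which is why high-traffic deployments tolerate a "FLOP-inefficient" training run.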
21. The Mystery of Multi-Epochs.
Conventional wisdom (Kaplan) said you should never see a token twice. But in a data-scarce world, we are seeing models grok and generalize over as many as 10 epochs.
The scaling law for multi-epoch training shows that the first 4 epochs provide ~80% of the value of fresh data. This suggests that the model isn't just memorizing; it is finding more efficient internal "circuits" to represent the same information. This is critical for high-value reasoning data where fresh samples are impossible to find.
22. Scaling Reasoning (CoT).
Training on Chain-of-Thought (CoT) data changes the scaling exponents. Reasoning tokens are "Denser" than descriptive tokens.
Models trained on reasoning steps show a steeper scaling curve (the scaling exponent is higher). This implies that as we move from "surface-level" text to "logical" text, we get more intelligence-per-FLOP. This is the foundation of the "Reasoning-First" training paradigm where we prioritize synthetic logical chains over raw web scrapes.
23. Scaling the Alignment Tax.
Alignment (making the model follow instructions and provide safe answers) is not a "Fixed Cost." It scales with model size. As N increases, the model becomes more "Injectable" and has more "Latent Knowledge" that can be misaligned.
The scaling law for alignment suggests that the amount of RLHF (Reinforcement Learning from Human Feedback) data required to align a model scales sub-linearly with N. A 175B model is inherently "Easier" to align than a 7B model because it understands the *intent* of the instructions better. However, the "Utility Loss" (the drop in raw reasoning performance) increases as we push for more extreme safety, creating a Pareto front of safety vs. capability.
24. The Parallelism Efficiency Tax.
In a 100,000 GPU cluster, we use 3D Parallelism: Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP). These are not free. The "Scaling Efficiency" can be modeled as:
The Communication Bottleneck
As we scale to larger models, T_comm increases because more data must move between nodes. To keep efficiency > 0.6, engineers must use high-bandwidth fabrics (InfiniBand/NDR800). If the network bandwidth doesn't scale as fast as the TFLOPS, the scaling law essentially "stalls": you are adding GPUs but not decreasing loss.
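One simple model consistent with the T_comm discussion above, assumed here for illustration rather than taken from the source, treats efficiency as the share of each step spent computing: T_comp / (T_comp + exposed T_comm). The `overlap` parameter is a hypothetical knob for the portion of communication hidden behind computation.

```python
def scaling_efficiency(t_comp, t_comm, overlap=0.0):
    """Fraction of step time spent on useful compute.

    overlap is the share of communication hidden behind compute (0 = none).
    """
    exposed_comm = t_comm * (1.0 - overlap)
    return t_comp / (t_comp + exposed_comm)

def max_comm_time(t_comp, target_efficiency):
    """Largest exposed communication time that still meets the target efficiency."""
    return t_comp * (1.0 / target_efficiency - 1.0)

t_comp = 1.0  # seconds of compute per step (illustrative)
for t_comm in (0.2, 0.5, 1.0):
    print(f"t_comm={t_comm:.1f}s: efficiency={scaling_efficiency(t_comp, t_comm):.2f}, "
          f"with 50% overlap={scaling_efficiency(t_comp, t_comm, overlap=0.5):.2f}")
print(f"Budget for 0.6 efficiency: {max_comm_time(t_comp, 0.6):.2f}s of exposed comm")
```

Under this model, holding efficiency above 0.6 means exposed communication can take at most about two thirds as long as the computation it accompanies.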
25. Fault Tolerance Scaling.
At the "Zetta-FLOP" scale, hardware failure is a statistical certainty. A cluster of 100,000 GPUs will experience a failure every few hours. Standard Checkpoint/Restart (C/R) becomes a bottleneck because the time to save the model to disk starts to exceed the time between failures.
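The checkpoint bottleneck described above can be quantified with Young's classic approximation for the checkpoint interval, sqrt(2 × checkpoint cost × MTBF). The sketch assumes independent failures at a constant rate and a checkpoint cost much smaller than the time between failures. The per-GPU MTBF and checkpoint durations are hypothetical values chosen to match "a failure every few hours."

```python
import math

def cluster_mtbf_hours(gpu_mtbf_hours, n_gpus):
    """Mean time between failures for the whole cluster (independent failures)."""
    return gpu_mtbf_hours / n_gpus

def young_checkpoint_interval(checkpoint_cost_hours, mtbf_hours):
    """Young's approximation for the interval that minimizes wasted time."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)

def overhead_fraction(checkpoint_cost_hours, interval_hours, mtbf_hours):
    """First-order waste: time writing checkpoints plus expected rework after a failure."""
    return checkpoint_cost_hours / interval_hours + interval_hours / (2.0 * mtbf_hours)

mtbf = cluster_mtbf_hours(gpu_mtbf_hours=300_000, n_gpus=100_000)  # 3 hours
for label, cost in (("disk, 5 min", 5 / 60), ("in-memory, 10 s", 10 / 3600)):
    interval = young_checkpoint_interval(cost, mtbf)
    waste = overhead_fraction(cost, interval, mtbf)
    print(f"{label:>16}: checkpoint every {interval * 60:5.1f} min, "
          f"waste of about {waste:.1%}")
```

Shrinking the checkpoint cost lets the job checkpoint more often and lose less work per failure, which is the motivation for the techniques discussed next.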
Scaling must therefore incorporate In-Memory Checkpointing and Predictive Fault Tolerance. If the system can detect an impending GPU failure via telemetry and migrate its workload before the crash, it effectively "Stretches" the compute budget, allowing the scaling law to continue indefinitely despite imperfect hardware.
