Transformer Scaling Laws: The Compute-Optimal Frontier

1. Predictable Progress.

In the early days of deep learning, scaling was an exercise in trial and error. Today, it is a branch of predictive physics. Scaling Laws are empirical relationships that describe how the test loss of a Transformer model decreases as you increase three variables: the number of parameters (N), the amount of training data (D), and the total compute used for training (C).

The power of these laws lies in their predictability. By training a series of small "pilot" models (e.g., from 10M to 1B parameters), engineers can extrapolate, within a measurable margin of error, how a 175B or 1T parameter model will perform before investing tens of millions of dollars in GPU time. This predictability is what allows the AI industry to sustain its relentless pace of advancement. Without these laws, the risk of a $100M training run failing to meet performance targets would be too great for even the largest tech giants.

We observe that scaling is not merely about "bigger is better," but about the harmony of resources. If you scale parameters without scaling data, the model becomes a "memorizer" with poor generalization. If you scale data without parameters, the model hits a "capacity ceiling" where it cannot absorb new information regardless of how many tokens it sees. The scaling laws quantify the ratio needed for this harmony.

The Formalism of Loss.

The relationship between compute, data, and parameters is not linear; loss falls as a power law in each resource. The generalized loss function $L(N, D)$ can be expressed as:

L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}

The Unified Scaling Law for Transformer Loss

L(N, D): Expected test loss (cross-entropy)

E: Irreducible loss (Bayes error / noise floor)

A, B: Scaling coefficients (task-dependent constants)

\alpha, \beta: Scaling exponents (model architecture properties)
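To make the formula concrete, here is a minimal sketch that evaluates $L(N, D)$ directly. The default constants are an assumption: they are the values commonly quoted from the Chinchilla parametric fit (Hoffmann et al., 2022) and should be refitted for any other dataset or tokenizer. The function name is hypothetical.

```python
def parametric_loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7,
                    alpha=0.34, beta=0.28):
    """Evaluate L(N, D) = E + A / N^alpha + B / D^beta.

    Defaults are assumed values from the published Chinchilla fit,
    used here only for illustration.
    """
    return E + A / n_params**alpha + B / n_tokens**beta


if __name__ == "__main__":
    # (parameters, tokens) pairs chosen for illustration only
    for n, d in [(1e9, 20e9), (70e9, 1.4e12), (70e9, 14e12)]:
        print(f"N={n:.1e}  D={d:.1e}  L={parametric_loss(n, d):.3f}")
```

Holding N fixed and growing D (or the reverse) shows the loss flattening toward the term that is not being scaled, which is the "capacity ceiling" behaviour described earlier.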

Kaplan et al. (2020) fitted a closely related power-law form and reported shallow exponents of roughly 0.076 for parameters and 0.095 for data, with a compute exponent near 0.05. Their compute-optimal analysis concluded that most of any additional compute should go to increasing model size (N) rather than data (D). However, this "Model-First" philosophy led to models like GPT-3 being severely under-trained, as they possessed more "capacity" than they had "experience" (tokens).

Chinchilla vs. Kaplan: The Battle of Formulas.

The history of LLM scaling is divided into two eras: the Kaplan Era (2020) and the Chinchilla Era (2022). Kaplan et al. originally posited that model performance was most sensitive to parameter count (N), leading to the creation of massive dense models like GPT-3 that were significantly under-trained relative to their potential.

DeepMind's Hoffmann et al. (Chinchilla) overturned this by showing empirically that for a given compute budget (C), the model size (N) and the number of training tokens (D) should be scaled in equal proportions. The results were startling:

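Kaplan Formula (2020)

N ∝ C^0.73, D ∝ C^0.27

Predicted that most extra compute should go to parameters: a 10x larger budget implies roughly 5.4x more parameters but only about 1.9x more tokens.

Chinchilla Formula (2022)

N ∝ C^0.5, D ∝ C^0.5

Equal scaling: doubling compute multiplies both parameters and tokens by $\sqrt{2}$.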

The Chinchilla paper derived that the Data-to-Parameter Ratio should be roughly 20:1. For every 1 parameter, you need at least 20 tokens of training data to reach compute optimality.

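N^*(C) = G \left(\frac{C}{6}\right)^a, \quad D^*(C) = \frac{1}{G} \left(\frac{C}{6}\right)^b

Optimal Allocation Logic

N^*: Optimal parameters

D^*: Optimal tokens

G: Constant set by the fitted loss coefficients (G ≈ 0.22 reproduces the 20:1 ratio when a = b = 0.5)

a, b: Exponents ~0.5, with a + b = 1 so that 6 N^* D^* = C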

Following this logic, a 175B model requires about 3.5 trillion tokens (20 x 175B) to be compute-optimal. GPT-3 was trained on only about 300 billion tokens, under 10% of that figure, so much of its parameter budget was effectively under-used.
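The allocation rule can be sketched in a few lines. This is an illustrative calculation, not the paper's fitting procedure: it assumes a fixed tokens-per-parameter ratio (20 by default) and the C ≈ 6ND cost approximation used in this article, then solves for N and D. The function name and defaults are hypothetical.

```python
import math


def compute_optimal_allocation(compute_flops, tokens_per_param=20.0,
                               flops_per_param_token=6.0):
    """Return (N, D) such that D = r * N and C = k * N * D.

    r (tokens_per_param) and k (flops_per_param_token) are assumptions:
    the ~20:1 rule of thumb and the 6ND approximation.
    """
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    budget = 6 * 175e9 * 3.5e12  # compute of a 175B model on 3.5T tokens
    n, d = compute_optimal_allocation(budget)
    print(f"N = {n:.3e} parameters, D = {d:.3e} tokens")
```

Because both outputs depend on the square root of the budget, doubling the compute multiplies N and D by the same factor of $\sqrt{2}$.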

4. The 6ND Approximation.

When calculating the compute cost (C) of training a transformer, engineers use the 6ND rule of thumb. This approximation states that the total number of floating-point operations (FLOPs) required to train a model is roughly 6 times the product of the number of parameters (N) and the number of training tokens (D).

C \approx 6 \times N \times D

Total Training Compute (FLOPs)

C: Total compute budget

N: Number of parameters

D: Number of training tokens

Where does the 6 come from? It is the sum of the forward and backward passes. For each token, the forward pass requires approximately 2N operations (one multiply-add per parameter). The backward pass is twice as expensive as the forward pass (4N ops), as it requires calculating both the gradients with respect to the activations and the gradients with respect to the weights.

This 6ND rule assumes no overhead from activation recomputation (gradient checkpointing). In reality, to save VRAM on massive models, engineers often recompute the forward pass during the backward pass. Full recomputation adds one extra forward pass (2ND), raising the total from 6ND to roughly $8ND$. This tradeoff between compute and memory is a central decision in large-scale cluster orchestration.
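The accounting above can be written out as a short sketch. It follows the forward (2N), backward (4N), and optional recomputation (another 2N) terms described in this section; the function name is hypothetical, and the GPT-3 figures are the ones quoted earlier in this article.

```python
def training_flops(n_params, n_tokens, full_recompute=False):
    """Estimate training FLOPs from the per-token cost terms.

    forward  : ~2 FLOPs per parameter per token (one multiply-add)
    backward : ~4 FLOPs per parameter per token
    recompute: one extra forward pass if activations are fully recomputed
    """
    forward = 2 * n_params * n_tokens
    backward = 4 * n_params * n_tokens
    recompute = forward if full_recompute else 0
    return forward + backward + recompute


if __name__ == "__main__":
    base = training_flops(175e9, 300e9)
    with_ckpt = training_flops(175e9, 300e9, full_recompute=True)
    print(f"6ND estimate: {base:.3e} FLOPs")
    print(f"8ND estimate: {with_ckpt:.3e} FLOPs ({with_ckpt / base:.2f}x)")
```

Selective recomputation schemes land somewhere between the two estimates, since they replay only part of the forward pass.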

Hardware-Aware Realities.

Pure scaling laws assume "perfect" hardware where every FLOP is useful. In reality, scaling is bounded by the Model FLOP Utilization (MFU). As models scale, the ratio of communication (inter-node latency) to computation (intra-GPU math) increases, often causing the "observed" scaling law to flatten prematurely.

Memory Bound

Scaling N (parameters) is limited by HBM capacity (HBM3/HBM3e on current GPUs). If the weights, optimizer state, and activations no longer fit in GPU memory, throughput collapses due to off-chip swaps.

IO Bound

Large D (data) scaling requires massive storage-to-GPU bandwidth. The pipeline stalls if the storage path (e.g., GPUDirect Storage, GDS) cannot feed the GPUs fast enough.

Power Bound

The "Gigawatt Frontier": Scaling compute (C) eventually hits the power-delivery and cooling limits of the data center.

Precision Scaling.

Most scaling laws assume training in FP16 or BF16. However, the industry is rapidly adopting FP8 for massive training runs and experimenting with 4-bit formats such as FP4 and INT4. Precision scaling describes how reducing the number of bits per parameter affects the scaling coefficient.

Experimental data shows that while lower precision introduces "quantization noise," it allows for a larger number of parameters (N) to fit within the same memory and compute budget. For models beyond 500B parameters, the efficiency gains of FP8 often outweigh the slight increase in loss, effectively "stretching" the scaling law along the compute axis.

Bits/Parameter vs. Efficiency (H100/B200 Baseline)

Format | Density | Throughput | Signal retained
BF16 (Reference) | 1.0x | 1.0x TFLOPS | ~100%
FP8 (Performance) | 2.0x | 2.0x TFLOPS | ~99.8%
FP4 (Next-Gen) | 4.0x | 4.0x TFLOPS | ~98.5%
INT4 (Extreme) | 4.0x | 8.0x TFLOPS (specialized) | ~95%
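The density column follows directly from the bit width. As a small sketch (weights only, ignoring optimizer state, activations, and any per-block scaling factors a real low-precision format would carry), the memory needed to hold a model's weights is:

```python
def weight_memory_gb(n_params, bits_per_param):
    """Weight storage in decimal gigabytes: params * bits / 8 bytes."""
    return n_params * bits_per_param / 8 / 1e9


# Bit widths for the formats in the table above
FORMATS = {"BF16": 16, "FP8": 8, "FP4": 4, "INT4": 4}

if __name__ == "__main__":
    n_params = 500e9  # illustrative model size
    for name, bits in FORMATS.items():
        print(f"{name}: {weight_memory_gb(n_params, bits):,.0f} GB")
```

The 500B parameter count is only an example value taken from the threshold mentioned above.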

Quality as a Multiplier.

Chinchilla optimality assumes a constant data quality ( $Q$ ). However, not all tokens are created equal. High-quality tokens (textbooks, verified code, logic puzzles) have a significantly higher Loss-Reduction Efficiency than low-quality web scrapes.

By filtering for quality, researchers can effectively shift the scaling curve downward. Training on a "Pure" dataset of 100B tokens can sometimes result in the same test loss as training on 500B tokens of "Noisy" data.

Curriculum and Annealing.

The order in which data is presented matters—a concept known as Curriculum Learning. Modern scaling recipes often involve "Data Annealing," where the model is trained on a massive, diverse dataset for 95% of the run, followed by a high-intensity "cool down" phase on ultra-high-quality data.

This final stage sharpens the model's knowledge, and loss typically falls faster during it than the preceding power-law trend would predict. Some practitioners describe this as a "Breakthrough Regime" in which gains are briefly steeper than the earlier curve as the model consolidates its latent representations.

Scaling Mixture of Experts (MoE).

The scaling laws for Mixture of Experts (MoE) follow a different trajectory than dense models. In an MoE architecture, the model consists of total parameters ( $N_{total}$ ), but only a fraction are active ( $N_{active}$ ) for any single token.

While dense models follow a strict 6ND compute cost, MoE models allow for Knowledge Scaling without Compute Scaling. By increasing the total number of experts, we can increase the model's "capacity" (memory) without increasing the FLOPs required for inference.

C_{\text{MoE}} \approx \underbrace{2 N_{\text{active}}}_{\text{forward}} + \underbrace{4 N_{\text{active}}}_{\text{backward}} = 6 N_{\text{active}} \quad \text{(FLOPs per training token)}

The Sparse Compute Advantage

C_{\text{MoE}}: Compute per training token for an MoE model

N_{\text{active}}: Parameters active for a given token

N_{\text{total}}: Not present in the compute equation (only impacts VRAM)

However, this introduces a Routing Penalty where the scaling efficiency drops slightly as the expert count increases due to load-balancing overhead and communication latency across GPU nodes (the "All-to-All" bottleneck).

MoE Scaling Characteristics

Sub-Linear Compute: Capacity scales with $N_{\text{total}}$, but marginal compute cost scales with $N_{\text{active}}$.
Expert Diversity: MoE requires higher data diversity (D) to ensure all experts are trained on enough tokens.
Communication Bound: Scaling from 8 to 64 experts can increase training time substantially (on the order of 2x in unfavorable configurations) due to All-to-All communication latency, despite identical FLOPs.
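The split between compute and memory can be illustrated with a short sketch. The two model configurations below are hypothetical and exist only to show that per-token FLOPs track active parameters while weight memory tracks total parameters; routing and All-to-All costs are deliberately left out.

```python
def training_flops_per_token(n_active_params):
    """Approximate training cost per token: 2N forward + 4N backward."""
    return 6 * n_active_params


def weight_memory_bytes(n_total_params, bytes_per_param=2):
    """Memory to hold all weights (2 bytes per parameter assumes BF16)."""
    return n_total_params * bytes_per_param


# Hypothetical configurations, for illustration only
dense = {"total": 50e9, "active": 50e9}
moe = {"total": 400e9, "active": 50e9}

if __name__ == "__main__":
    for name, cfg in [("dense", dense), ("moe", moe)]:
        flops = training_flops_per_token(cfg["active"])
        mem_gb = weight_memory_bytes(cfg["total"]) / 1e9
        print(f"{name}: {flops:.2e} FLOPs/token, {mem_gb:,.0f} GB of weights")
```

Both models cost the same per token in this estimate, but the MoE needs eight times the weight memory.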

Distributed Scaling Overheads.

As models grow beyond the memory of a single GPU, we must use 3D Parallelism (Data, Tensor, and Pipeline). Each of these introduces a "scaling tax" that bends the 6ND compute curve.

The Sources of Scaling Inefficiency

01. Pipeline Bubbles
When using Pipeline Parallelism (PP), certain GPUs sit idle while waiting for the activations from previous stages. This idle time (the "bubble") increases with the number of stages, reducing effective compute efficiency.
02. Communication Quantization
Synchronizing gradients across 10,000+ GPUs requires massive bandwidth. If the network is slow, the GPUs must wait for "All-Reduce" operations, turning a compute-bound problem into a network-bound one.
03. Memory Fragmentation
Zero Redundancy Optimizer (ZeRO) stages reduce memory pressure but add communication steps. The combined effect of these taxes is summarized by the "Scaling-Efficiency Coefficient" ( $\eta$ ), the ratio of delivered TFLOPS to peak theoretical TFLOPS.
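For the pipeline bubble specifically, a commonly used estimate can be sketched as follows. It assumes a simple synchronous schedule (GPipe-style or 1F1B) with equal-cost stages and micro-batches, which is an assumption not stated in this article; interleaved schedules reduce the bubble further.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Fraction of step time spent idle in a synchronous pipeline.

    With p stages and m micro-batches, a step takes (m + p - 1) time
    slots per stage, of which (p - 1) are idle.
    """
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    for stages in (2, 8, 32):
        for microbatches in (8, 64):
            frac = pipeline_bubble_fraction(stages, microbatches)
            print(f"stages={stages:2d} microbatches={microbatches:2d} "
                  f"bubble={frac:.1%}")
```

Increasing the number of micro-batches per step is the usual lever for shrinking the bubble, at the cost of more activation memory in flight.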

Implementation: Loss Extrapolation.

scaling_predictor.py

def predict_test_loss(C, alpha=0.05, A=400):
    """
    Extrapolates test loss from a compute budget using a Kaplan-style power law.
    L(C) = A * C^(-alpha)

    C     : training compute in FLOPs
    alpha : compute scaling exponent (illustrative default)
    A     : scale constant (illustrative default, not a fitted value)
    """
    return A * (C ** (-alpha))

# Example: extrapolating from a small pilot run to a much larger budget
small_model_compute = 1e20   # FLOPs
target_model_compute = 1e24  # FLOPs

current_loss = predict_test_loss(small_model_compute)
predicted_loss = predict_test_loss(target_model_compute)

print(f"Predicted Loss Improvement: {current_loss - predicted_loss:.4f}")
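In practice the constants A and alpha are not known in advance; they are fitted to pilot runs. The sketch below shows one simple way to do that, assuming the pure power law L(C) = A * C^(-alpha) with no irreducible-loss term: take logarithms and fit a straight line. The pilot data here is synthetic and exists only to check that the fit recovers known constants.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit L = A * C^(-alpha) by least squares in log-log space.

    log L = log A - alpha * log C, so the slope is -alpha and the
    intercept is log A. Returns (A, alpha).
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic pilot runs generated from known constants (A=400, alpha=0.05)
pilot_compute = np.array([1e18, 1e19, 1e20, 1e21])
pilot_loss = 400.0 * pilot_compute ** -0.05

A_fit, alpha_fit = fit_power_law(pilot_compute, pilot_loss)

if __name__ == "__main__":
    print(f"A = {A_fit:.2f}, alpha = {alpha_fit:.4f}")
```

With real measurements the points scatter around the line, and an irreducible-loss term usually has to be fitted as well, which calls for a nonlinear optimizer rather than a straight-line fit.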

Data-Constrained Regimes.

What happens when a model's optimal data requirement exceeds the total volume of high-quality human text on the internet? This is the Data Ceiling. In a data-constrained regime, scaling laws begin to "bend."

Recent research suggests that training for multiple epochs (up to 4-10 passes) over the same dataset can still yield improvements, though with diminishing returns compared to fresh tokens. The scaling coefficient ( $\alpha$ ) effectively decays, necessitating a move toward Multi-Modal Data Scaling (training on video/audio) and Synthetic Data Augmentation.

Epoch Scaling Efficiency

\Delta L \approx \eta \times \log(E)

Diminishing Returns of Multi-Epoch Training

\Delta L: Improvement in test loss

E: Number of training epochs

\eta: Efficiency coefficient (decreases as E increases)

Inference-Time Scaling (System 2).

The next frontier of scaling is not in training, but in Inference. Traditional transformers are "System 1" thinkers—they produce tokens in a single forward pass with a fixed compute-per-token cost. Inference Scaling allows a model to "think harder" by using more compute at the moment of generation.

This paradigm shift, popularised by models like OpenAI o1, introduces a new scaling law where accuracy scales as a function of Test-Time Compute. This involves techniques such as:

Process 1: Verifier Scaling

Training a separate "Reward Model" or "Outcome Verifier" to judge 1,000+ candidate responses. Accuracy increases log-linearly with the number of samples ( $N_{samples}$ ).

Process 2: Search Scaling

Using Monte Carlo Tree Search (MCTS) or Chain-of-Thought (CoT) to explore deep reasoning paths. For mathematical and coding tasks, this "System 2" compute can bridge the gap between an 8B and a 400B model.

\Delta \text{Accuracy} \propto \log(C_{\text{inference}})

The Inference Scaling Law

\Delta \text{Accuracy}: Improvement in reasoning benchmark scores

C_{\text{inference}}: Total tokens generated (including discarded chains)
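A toy model helps show why returns from sampling diminish. The sketch below assumes independent samples and a perfect verifier that always recognizes a correct answer; both are idealizations, and real verifiers are noisy. Under those assumptions the chance that at least one of n samples is correct is 1 - (1 - p)^n.

```python
def best_of_n_success(p_single, n_samples):
    """Probability that at least one of n independent samples is correct,
    assuming a perfect verifier picks it out."""
    return 1.0 - (1.0 - p_single) ** n_samples


if __name__ == "__main__":
    p = 0.05  # hypothetical single-sample accuracy
    for n in (1, 4, 16, 64, 256, 1024):
        print(f"samples={n:5d}  success={best_of_n_success(p, n):.3f}")
```

Each additional doubling of the sample count buys less once the curve approaches saturation, which is why inference compute is usually plotted on a logarithmic axis.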

The Thermodynamic Limit.

Eventually, scaling hits a physical wall: Power Density. As we scale clusters toward the "Gigawatt Era," the cost of cooling and power delivery begins to exceed the cost of the GPUs themselves.

Future scaling laws will likely need to incorporate Energy Efficiency (Tokens per Joule) as a primary variable. We are moving from a regime of "Intelligence at any cost" to one of "Intelligence within a fixed energy envelope," driving the development of specialized optical interconnects and advanced liquid-cooling manifolds to maintain the power-law trajectory.

Grokking: The Phase Transition.

Scaling laws are usually smooth power curves. However, certain capabilities—like mathematical reasoning or symbolic logic—exhibit Grokking, a phenomenon where a model's performance on a specific task suddenly jumps from 0% to nearly 100% after a critical training duration.

This suggests that the model is initially memorizing samples but eventually "discovers" the underlying algorithmic structure. Scaling the compute beyond the point of standard "convergence" can sometimes trigger these latent phase transitions, turning a mediocre model into a specialist.

"Grokking is the point where the validation loss, long after the training loss has flattened near zero, suddenly collapses, signaling the transition from lookup-table memorization to circuit-level logic."

Multi-Modal Scaling Laws.

Scaling Vision Transformers (ViT) and Video models (Sora/Lumina) follows a similar power law, but with different exponents. For vision, the scaling is often bound by Patch Size and Resolution.

Research into VLMs (Vision-Language Models) suggests that the optimal parameter allocation shifts toward a 1:1 ratio between vision encoders and language decoders for high-reasoning tasks. However, video scaling is uniquely constrained by Temporal Resolution: with naive self-attention, the compute cost scales quadratically with the number of spatio-temporal tokens (and therefore with video length at fixed resolution), which motivates more efficient attention schemes in architectures such as Diffusion Transformers (DiT).

Philosophy: Bits vs. Intelligence.

At its core, scaling laws treat intelligence as a compression problem. If a model can predict the next token with high accuracy, it has successfully "compressed" the knowledge of the training set. The more bits of information we feed the model (D) and the more bits of capacity it has (N), the more intelligence we can extract.

This led to the "Scale is All You Need" philosophy, which argues that we don't need "better" algorithms, just bigger ones. While true for foundational capabilities, the industry is now moving toward Context Scaling (long-context windows) and Action Scaling (training on real-world trajectories) to push the frontier further.

The Alignment Scaling Law.

Can safety be scaled? Research into RLHF (Reinforcement Learning from Human Feedback) suggests that model "helpfulness" and "harmlessness" also follow predictable scaling patterns. However, alignment scaling often encounters a Safety-Utility Tradeoff, where pushing the model too hard toward strict alignment can cause a slight decay in its raw reasoning capabilities (the "Alignment Tax").

The future of scaling laws involves Constitutional AI, where a small set of human principles is used to supervise the scaling of a much larger child model, ensuring that as compute grows, the model's value alignment remains robustly locked to human intent.

The era of guessing the behavior of large neural networks is over. Transformer Scaling Laws have provided the industry with a predictive toolkit that rivals the precision of mechanical engineering. As we move toward larger, more complex systems—incorporating sparse architectures, inference-time reasoning, and multi-modal integration—these laws will continue to evolve, but the core principle remains: Intelligence is predictable if you know the math of your compute.

For the network architect, this means that the network is no longer a "support" system; it is the fundamental constraint on the scaling of intelligence. Bridging the gap between the GPU's memory bandwidth and the fabric's collective communication speed is the only way to stay on the power-law curve.

19. Scaling for Long Context.

Standard scaling laws assume a fixed context window (e.g., 2k or 4k tokens). However, as context windows scale to 1M+ tokens, a new scaling dimension emerges. The compute cost of standard self-attention scales quadratically ( $O(L^2)$ ) with context length $L$ .

This means for ultra-long context models, the compute budget shifts from "learning new weights" to "processing the prompt." Techniques like FlashAttention-3 and Ring Attention cut memory traffic and spread long sequences across devices, but they compute exact attention, so the FLOPs remain quadratic and the fundamental "Memory Wall" of the KV-Cache remains. Scaling context effectively requires scaling the HBM capacity of the cluster faster than the TFLOPS.
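Both pressures can be estimated with back-of-the-envelope formulas. The sketch below uses standard accounting (keys and values stored per layer and per KV head; two L x L matrix products per attention layer) and a hypothetical model shape chosen only for illustration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Size of the KV cache: one key and one value vector per token,
    per layer, per KV head (2 bytes per value assumes FP16/BF16)."""
    return (2 * batch_size * seq_len * n_layers
            * n_kv_heads * head_dim * bytes_per_value)


def attention_matmul_flops(seq_len, n_layers, d_model):
    """FLOPs for QK^T and the attention-weighted sum over V, each
    about 2 * L^2 * d_model per layer, ignoring causal masking."""
    return 4 * n_layers * seq_len**2 * d_model


if __name__ == "__main__":
    # Hypothetical shape: 80 layers, 8 KV heads of dim 128, d_model 8192
    for seq_len in (4_096, 131_072, 1_000_000):
        cache_gb = kv_cache_bytes(seq_len, 80, 8, 128) / 1e9
        flops = attention_matmul_flops(seq_len, 80, 8192)
        print(f"L={seq_len:>9,}  KV cache={cache_gb:8.1f} GB  "
              f"attention FLOPs={flops:.2e}")
```

The cache grows linearly with context length while the attention FLOPs grow quadratically, so the two limits bite at different points depending on the hardware.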

20. Inference-Optimal Scaling.

Chinchilla optimality minimizes *training cost*. However, if you plan to serve a model to 100 million users, the *inference cost* becomes the dominant expense. In this regime, it is actually "Optimal" to over-train a small model.

By training a 7B or 8B parameter model on 15 Trillion tokens (far beyond the Chinchilla point), we create a model that is vastly more capable than its size suggests. This "small but over-trained" model saves billions in inference TCO because it requires fewer GPUs to serve and has lower latency, even if the training run was "inefficient" by raw FLOP standards.
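A rough way to see the trade-off is to add training and serving compute together. The sketch below assumes about 6ND FLOPs for training and about 2N FLOPs per served token, and it compares raw FLOPs only; it does not claim the two models reach the same quality. The model sizes and token counts are illustrative.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training (~6ND) plus inference (~2N per served token)."""
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens


def break_even_served_tokens(small, large):
    """Served-token count above which the small model has lower
    lifetime FLOPs. Each argument is (n_params, train_tokens).
    A negative result means the small model is cheaper from the start."""
    (n_s, d_s), (n_l, d_l) = small, large
    extra_training = 6 * (n_s * d_s - n_l * d_l)
    saving_per_token = 2 * (n_l - n_s)
    return extra_training / saving_per_token


SMALL = (8e9, 15e12)    # over-trained small model
LARGE = (70e9, 1.4e12)  # model trained at about 20 tokens per parameter

if __name__ == "__main__":
    tokens = break_even_served_tokens(SMALL, LARGE)
    print(f"Break-even at about {tokens:.2e} served tokens")
```

Past the break-even point every additional served token widens the gap in favour of the smaller model.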

21. The Mystery of Multi-Epochs.

Conventional wisdom (Kaplan) said you should never see a token twice. But in a data-scarce world, we are seeing models Grok and generalize over up to 10 epochs.

The scaling law for multi-epoch training shows that the first 4 epochs provide ~80% of the value of fresh data. This suggests that the model isn't just memorizing; it is finding more efficient internal "circuits" to represent the same information. This is critical for high-value reasoning data where fresh samples are impossible to find.

22. Scaling Reasoning (CoT).

Training on Chain-of-Thought (CoT) data changes the scaling exponents. Reasoning tokens are "Denser" than descriptive tokens.

Models trained on reasoning steps show a steeper scaling curve ( $\alpha$ is higher). This implies that as we move from "surface-level" text to "logical" text, we get more intelligence-per-FLOP. This is the foundation of the "Reasoning-First" training paradigm where we prioritize synthetic logical chains over raw web scrapes.

23. Scaling the Alignment Tax.

Alignment—making the model follow instructions and provide safe answers—is not a "Fixed Cost." It scales with model size. As $N$ increases, the model becomes more "Injectable" and has more "Latent Knowledge" that can be misaligned.

The scaling law for alignment suggests that the amount of RLHF (Reinforcement Learning from Human Feedback) data required to align a model scales sub-linearly with $N$. A 175B model is inherently "Easier" to align than a 7B model because it understands the *intent* of the instructions better. However, the "Utility Loss" (the drop in raw reasoning performance) increases as we push for more extreme safety, creating a Pareto front of safety vs. capability.

24. The Parallelism Efficiency Tax.

In a 100,000 GPU cluster, we use 3D Parallelism: Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP). These are not free. The "Scaling Efficiency" $\eta$ can be modeled as:

\eta = \frac{1}{1 + \frac{T_{comm}}{T_{comp}}}

The Communication Bottleneck

T_{comm}: Time spent on network synchronization

T_{comp}: Time spent on GPU math

As we scale to larger models, $T_{comm}$ increases because more data must move between nodes. To keep efficiency above 0.6 (that is, $T_{comm}$ below about two-thirds of $T_{comp}$), engineers must use high-bandwidth fabrics (e.g., 400G or 800G InfiniBand). If the network bandwidth doesn't scale as fast as the TFLOPS, the scaling law essentially "stalls": you are adding GPUs but not decreasing loss.
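The efficiency formula is simple enough to turn into a planning helper. This sketch assumes communication is not overlapped with computation, which is the case the formula describes; the function names are hypothetical.

```python
def scaling_efficiency(t_comm, t_comp):
    """eta = 1 / (1 + T_comm / T_comp), with no compute/comm overlap."""
    return 1.0 / (1.0 + t_comm / t_comp)


def max_comm_time(t_comp, target_eta):
    """Largest T_comm that still meets a target efficiency."""
    return t_comp * (1.0 / target_eta - 1.0)


if __name__ == "__main__":
    t_comp = 3.0  # seconds of GPU math per step (illustrative)
    for t_comm in (0.5, 1.0, 2.0, 4.0):
        eta = scaling_efficiency(t_comm, t_comp)
        print(f"T_comm={t_comm:.1f}s  eta={eta:.2f}")
    print(f"Budget for eta >= 0.6: {max_comm_time(t_comp, 0.6):.2f}s")
```

Overlapping gradient synchronization with the backward pass hides part of T_comm, so measured efficiency can be better than this estimate.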

25. Fault Tolerance Scaling.

At the "Zetta-FLOP" scale, hardware failure is a statistical certainty. A cluster of 100,000 GPUs will experience a failure every few hours. Standard Checkpoint/Restart (C/R) becomes a bottleneck because the time to save the model to disk starts to exceed the time between failures.

Scaling must therefore incorporate In-Memory Checkpointing and Predictive Fault Tolerance. If the system can detect an impending GPU failure via telemetry and migrate its workload before the crash, it effectively "stretches" the compute budget, allowing training to keep tracking the scaling law despite imperfect hardware.
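The tension between checkpoint cost and failure rate can be sketched with two classic approximations. Both are assumptions layered on top of this article: failures are treated as independent, so cluster MTBF is the per-GPU MTBF divided by the GPU count, and the checkpoint interval uses Young's first-order formula. The per-GPU MTBF and checkpoint cost below are hypothetical values.

```python
import math


def cluster_mtbf_hours(gpu_mtbf_hours, n_gpus):
    """Mean time between failures for the whole cluster, assuming
    independent failures and that any single failure stops the job."""
    return gpu_mtbf_hours / n_gpus


def young_checkpoint_interval(checkpoint_cost_hours, mtbf_hours):
    """Young's approximation for the checkpoint interval that minimizes
    lost work: sqrt(2 * checkpoint_cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    mtbf = cluster_mtbf_hours(gpu_mtbf_hours=300_000, n_gpus=100_000)
    interval = young_checkpoint_interval(checkpoint_cost_hours=0.02,
                                         mtbf_hours=mtbf)
    print(f"Cluster MTBF: {mtbf:.1f} h")
    print(f"Suggested checkpoint interval: {interval * 60:.0f} min")
```

Cutting the checkpoint cost, for example by writing to memory instead of disk, shortens the optimal interval and reduces the work lost per failure.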

The Compute-Optimal Boundary: Chinchilla in Practice

The Chinchilla scaling law states that for a given compute budget, the optimal model size and training data size should be scaled equally. In mathematical terms: L(N, D) = E + A / N^alpha + B / D^beta, where N is model parameters, D is data tokens, and the exponents alpha and beta determine the allocation. The original DeepMind paper fitted alpha ≈ 0.34 and beta ≈ 0.28; because the two are close, the resulting allocation exponents are both near 0.5 (N and D each grow roughly as C^0.5), meaning extra compute should be split about equally between parameters and data. However, this finding has been significantly refined in the post-Llama era.

Experience from training Llama 3 405B and its smaller siblings suggests that the best allocation depends on the **inference budget** as well as the training budget. A model trained to the Chinchilla optimum (taken here as 15.5 tokens per parameter, somewhat below the roughly 20:1 rule of thumb quoted earlier) achieves the lowest training loss for a given FLOP budget. However, if the model will be heavily used for inference, it is beneficial to overtrain (increase tokens beyond the Chinchilla optimum) because a smaller model that achieves the same loss through more training data is cheaper to serve. This trade-off is captured by the **Compute-Optimal Inference Frontier** (COIF), which models the total cost as C_total = C_train + C_infer * N_infer, where C_infer is the compute per served token and N_infer is the number of tokens served over the model's lifetime.

As an illustrative estimate, for a 70B-parameter model serving 1 trillion tokens per month, a 10% reduction in parameter count (from 70B to 63B) saves on the order of $3.2M per year in inference GPU costs, while compensating with 20% more training tokens adds only about $400K to the training budget. The Chinchilla-derived exponents alpha and beta are not universal constants: they depend on the model architecture (dense vs. MoE), the vocabulary size, and the precision (FP16 vs. FP8). The dense-transformer fit in the Chinchilla paper gave alpha ≈ 0.34; MoE models, whose exponents must be fitted separately, are generally reported to benefit more from scaling total parameters than dense transformers do.

The practical takeaway for AI infrastructure engineers is that the "compute-optimal" point is a moving target. As hardware costs evolve (H100 -> H200 -> B200, 3.35 TB/s -> 4.8 TB/s), the optimal parameter-to-token ratio shifts because the cost per training FLOP drops faster than the cost per inference FLOP. A cluster designed for Chinchilla-optimal training today may be misprovisioned for inference dominance tomorrow.

Infrastructure Scaling Law Implications for Cluster Design

Scaling laws do not only govern model architecture; they have profound implications for the physical infrastructure of AI clusters. The Chinchilla-optimal training of a 1T-parameter model requires 15.5T training tokens, which under C ≈ 6ND comes to approximately 9.3 x 10^25 FLOPs. With 16,384 H100 GPUs rated here at 400 TFLOPS each (FP16) and operating at 60% utilization, aggregate throughput is about 3.9 x 10^18 FLOP/s, so the training wall-clock time is roughly 274 days; finishing in about a month would take on the order of 140,000 such GPUs. Over those 274 days, the 16,384-GPU cluster consumes 16,384 GPUs x 700 W x 24 hours x 274 days ≈ 75 GWh of energy for the GPUs alone, roughly the annual electricity use of 7,000 average US homes (assuming about 10.7 MWh per home per year). The scaling law tells us the compute requirement; the infrastructure engineer must translate this into power, cooling, and networking capacity.

The critical infrastructure insight from scaling laws is the **Bandwidth-to-Compute Ratio** (BCR): the ratio of All-Reduce network bandwidth required per GPU relative to its compute throughput. For a Chinchilla-optimal training run at scale, a lower bound on the All-Reduce bandwidth is B_min = (8 * M * P) / (T * N), where M is the model size in bytes per parameter, P is the number of parameters, T is the training step time, and N is the number of GPUs. For a 1T-parameter model on 16,384 GPUs with a 5-second step time and 6 bytes per parameter (FP16 weights + FP32 optimizer), this evaluates to 8 * 6 * 10^12 / (5 * 16384) ≈ 586 Mbps per GPU. That figure is a floor: it assumes the synchronized state is perfectly sharded across all GPUs and that communication overlaps fully with compute. A ring All-Reduce over unsharded gradients moves roughly twice the gradient volume through every GPU each step, independent of N, so real requirements are orders of magnitude higher and depend on how data, tensor, and pipeline parallelism partition the model. This is why H100 nodes are typically provisioned with eight 400G NICs (3.2 Tbps per node, 400 Gbps per GPU).

The scaling law also determines the **Memory Capacity Gradient**, the amount of HBM required per parameter. A 1T-parameter model in FP16 requires 2 TB of GPU memory for the weights alone. The optimizer state (FP32 momentum + FP32 variance) adds 8 bytes per parameter (8 TB), and the activations add another 4 bytes per parameter (4 TB) for a standard transformer with sequence length 8K. The total is 14 TB of GPU memory across 16,384 GPUs, or 896 MB per GPU (using binary units), easily accommodated in H100's 80 GB HBM3. Scaling to 10T parameters would require 896 MB x 10 = 8.96 GB per GPU, still only about 11% of the 80 GB HBM and leaving ample headroom for intermediates. At 100T parameters on the same GPU count, however, the requirement reaches roughly 90 GB per GPU and exceeds HBM capacity, forcing **Memory-Aware Scaling** where the model is split across more GPUs than compute requires, purely to satisfy memory capacity.

The practical implication for cluster architects is that **Scaling Law-Aware Provisioning** must front-run model requirements by 24 months. A cluster designed in 2026 must accommodate models up to 10T parameters trained on roughly 20T tokens by 2028 projections (a 2:1 token-to-parameter ratio, well short of the Chinchilla-optimal 20:1). This requires 100,000 GPUs with 400 Gbps interconnects and 100 MW of power capacity, a $2B infrastructure investment that must be guided by scaling law projections rather than current generation model requirements.
