TL;DR

In this third part of our 4-part series on building language models from scratch, I explore the complete training infrastructure that transforms our clean historical data and custom tokenizer into working language models.

  • Part 1: How to build a Large Language Model from Scratch - covered using the published model.
  • Part 2: Building LLMs from Scratch - Part 2: Data Collection & Custom Tokenizers - detailed data collection and custom tokenizer development.

Here, we build the complete training pipeline from a custom GPT architecture through deployment-ready checkpoints.

This post demonstrates how to design custom model architectures, optimize GPU utilization, and implement comprehensive training pipelines that transform our 500M+ character historical corpus into two working language models.

⚠️ Educational Purpose: This is a learning project designed to teach LLM development concepts. For production-scale LLMs, you’ll need significantly larger datasets, more sophisticated infrastructure, and additional considerations that are not covered in this post.

As outlined in Part 1 , both the SLM (117M parameters) and the regular Model (354M parameters) use the same training code and pipeline (04_training/train_model_slm.py and 04_training/train_model.py) with different configurations defined in config.py. The training infrastructure, GPU optimization, checkpointing, and WandB integration are identical - only the model architecture parameters differ.

Both PyTorch checkpoint inference and Hugging Face model inference are fully working and available. Both the SLM and the Regular model are published on Hugging Face Hub . Local PyTorch checkpoints can be used directly for inference with the inference_pytorch.py script.

🔗 GitHub Repository: github.com/bahree/helloLondon - Complete training infrastructure (04_training/), model architecture (config.py), and GPU configuration (08_documentation/GPU_TUNING.md)

🤗 Published Models: SLM Model | Regular Model - Ready-to-use historical language models on Hugging Face

📚 Book Reference: Generative AI in Action - For deeper understanding of core LLM concepts.

1. The Training Challenge: From Data to Working Models

Now that we have our clean historical corpus and custom tokenizer from Part 2 , we need to transform this data into working language models. This isn’t just about running training scripts – it’s about designing an architecture that can learn from historical text, optimizing for the unique patterns of 1500-1850 English, and building infrastructure to handle the computational demands of language model training.

The challenge with historical language modeling isn’t just having enough data - it’s having the right architecture and training process that can learn from the complex linguistic patterns in historical texts. Unlike modern text, historical English contains archaic vocabulary, period-specific terminology, and cultural references that require specialized attention mechanisms and training strategies.

1.1 High-Level Training Process Overview

The model training pipeline transforms our clean historical data and custom tokenizer into working language models through several key stages:

  1. Model Architecture Design - a custom GPT implementation optimized for historical text patterns
  2. GPU Configuration - multi-GPU training with precision optimization and memory management
  3. Training Infrastructure - distributed training, checkpointing, and experiment tracking
  4. Performance Optimization - mixed precision, compilation, and hardware-specific tuning
  5. Model Validation - testing and evaluation of trained models

Figure 1 below illustrates this complete training pipeline:

graph TD
    A[📚 Clean Historical Corpus<br/>500M+ characters] --> B[🔤 Custom Tokenizer<br/>30K vocab + 150+ special tokens]
    B --> C[🏗️ Model Architecture<br/>Custom GPT for Historical Text]
    
    C --> D[⚙️ GPU Configuration<br/>Multi-GPU + Precision Optimization]
    D --> D1[Mixed Precision<br/>bf16/fp16]
    D --> D2[Torch Compile<br/>JIT optimization]
    D --> D3[Memory Management<br/>Gradient checkpointing]
    
    D1 --> E[🏋️ Training Process<br/>60K iterations, checkpointing]
    D2 --> E
    D3 --> E
    
    E --> E1[SLM: 117M params<br/>7-8 hours training]
    E --> E2[Regular: 354M params<br/>28-32 hours training]
    
    E --> E3[WandB Integration<br/>Experiment tracking]
    E --> E4[Checkpointing<br/>Resume capability]
    E --> E5[Multi-GPU Support<br/>Distributed training]
    
    E1 --> F[📊 Model Evaluation<br/>Historical accuracy testing]
    E2 --> F
    E3 --> F
    E4 --> F
    E5 --> F
    
    F --> G{Quality OK?}
    G -->|Yes| H[🚀 Deployment<br/>Hugging Face + Local Inference]
    G -->|No| I[🔄 Retrain/Adjust]
    I --> E
    H --> J[💬 Text Generation<br/>Historical language output]
    
    style A fill:#e1f5fe
    style C fill:#f3e5f5
    style H fill:#e8f5e8
    style J fill:#fff3e0
Figure 1: Complete Training Pipeline

We will explore each of these components in detail, starting with the model architecture design, but first, let’s discuss why I chose PyTorch as the framework for this project.

1.2 Using PyTorch

I chose PyTorch for this project based on three key factors: educational accessibility, integration with the research ecosystem, and practical convenience. PyTorch provides many components out of the box - transformer blocks, attention layers, feed-forward networks, training loops, and CUDA support - which makes it much easier for learners building their first language model.

From a technical perspective, PyTorch’s memory management and GPU optimization features - including automatic mixed precision, gradient checkpointing, and efficient attention implementations - are well-suited for the resource-intensive task of training language models on historical text.

PyTorch’s recent developments, such as torch.compile, FlashAttention kernels, and SDPA operator (scaled dot-product attention), provide significant performance improvements, making training more efficient. These improvements enhance both speed and memory efficiency, which are critical for scaling LLMs. Of course, in our case, we’re building a working toy example rather than scaling to production levels, and these optimizations help keep training times reasonable on available hardware.

What about other Frameworks?

I also considered TensorFlow and JAX, but neither seemed right for helloLondon; TensorFlow’s API felt too complex, specifically from a beginner’s perspective. JAX has excellent performance and a clean, functional approach, but it’s more research-focused and has a smaller ecosystem, which would make it harder to follow along and experiment with.

2. Model Architecture Overview

2.1 Understanding the GPT Architecture

Our custom GPT (Generative Pre-trained Transformer) is a decoder-only transformer model designed for autoregressive language modeling on historical text. The architecture consists of four core components, each serving a distinct purpose in the sequence-to-sequence prediction pipeline. These are: token embeddings, position embeddings, causal self-attention mechanisms, and the language modeling head. Let us double-click into each component to understand its role and implementation.

2.1.1 Token Embeddings

Token embeddings convert discrete token IDs from our 30,000-token historical vocabulary into dense, continuous vector representations. Each token (whether it’s a word, subword unit, or special token) is mapped to a point in a high-dimensional space (768 dimensions for SLM, 1024 for the regular model).

This is implemented as a simple lookup table - wte = torch.nn.Embedding(config.vocab_size, config.n_embd). When processing the token sequence, we look up the corresponding vector for each token ID. These embeddings are learned during training - the model learns which tokens should be close together in this vector space based on their co-occurrence patterns in historical text.

For historical language models, this is particularly valuable because rare historical terms (like “yeoman” or “guildhall”) get their own representations that can capture contextual relationships with related terms from that era.
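
A tiny sketch of that lookup (sizes taken from the SLM configuration; the token IDs are made up for illustration):

import torch

vocab_size, n_embd = 30000, 768                # SLM values
wte = torch.nn.Embedding(vocab_size, n_embd)   # the learned lookup table

token_ids = torch.tensor([[12, 4051, 17, 12, 5620]])  # hypothetical IDs for a 5-token phrase
token_emb = wte(token_ids)                      # shape: (batch=1, seq_len=5, n_embd=768)

# The embedding really is a table lookup: row i of the weight matrix is token i's vector
assert torch.equal(token_emb[0, 0], wte.weight[12])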

2.1.2 Position Embeddings

Position embeddings encode each token’s absolute position within the sequence. This is crucial because, unlike recurrent models, transformer architectures have no inherent notion of temporal order or sequence position - they process all tokens in parallel. Let us double-click into why this is a problem.

Think of it like reading words without any sense of order. The words “The cat chased the mouse” would be indistinguishable from “Mouse the chased cat the” - you’d see the same words but lose all meaning because you don’t know which word came first, second, or third. Transformers face exactly this problem because they process all words simultaneously rather than sequentially, unlike older RNN models.

To help the model understand word order, we add position embeddings to the token embeddings. This way, each token’s representation includes information about both “what” the token is and “where” it appears in the sequence. We use learned position embeddings (as opposed to fixed sinusoidal patterns): wpe = torch.nn.Embedding(config.block_size, config.n_embd). For the SLM with a 512-token context window, we learn 512 different position vectors (one for each possible position). Similarly, the regular model with a 1024-token context learns 1024 position vectors.

Position embeddings work like giving each word a “timestamp” or “address” that tells the model where it sits in the sequence:

  1. Token embedding says: “This is the word ‘cat’” → converts to a vector like [0.2, -0.5, 0.8, ...]
  2. Position embedding says: “This word is at position 3” → adds another vector like [0.1, 0.3, -0.2, ...]
  3. Combined: The model sees both “what” the word is AND “where” it appears

The embedding vectors are combined element-wise: x = token_emb + position_emb. This allows the model to understand both what each token is (via the token embedding) and where it appears in the sequence (via the position embedding).
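
Sketched in code (SLM-sized values; variable names are illustrative rather than the exact ones used in train_model_slm.py):

import torch

vocab_size, block_size, n_embd = 30000, 512, 768       # SLM values
wte = torch.nn.Embedding(vocab_size, n_embd)            # "what" each token is
wpe = torch.nn.Embedding(block_size, n_embd)            # "where" it sits in the sequence

token_ids = torch.randint(0, vocab_size, (1, block_size))  # (batch=1, seq_len=512)
positions = torch.arange(block_size).unsqueeze(0)           # 0, 1, ..., 511
x = wte(token_ids) + wpe(positions)                         # element-wise sum: (1, 512, 768)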

Our model uses learned position embeddings, meaning during training the model discovers that:

  • Position 1 tends to be capitalized (start of sentence)
  • Position 512 might be mid-sentence (needs different handling)
  • Certain positions in historical documents have patterns (formal openings, closings, etc.)

This is different from fixed sinusoidal embeddings (used in the original Transformer paper), which use a mathematical formula to encode positions. Learned embeddings are generally better because they adapt to specific patterns in the training data.

In historical texts, word order is crucial for understanding meaning. Consider “The King granted the land” versus “The land granted the King” - same words, completely different meanings. Historical legal documents and Victorian-era writings often have precise word order that changes legal or semantic meaning. Position embeddings ensure the model can distinguish between these critical variations.

2.1.3 Causal Self-Attention

Causal self-attention is the mechanism that allows each position in the sequence to selectively attend to previous positions. The “causal” constraint ensures the model can only look at past tokens (not future ones), which is essential for autoregressive generation.

When you read a sentence, you naturally use context from earlier words to understand later ones. If you see “The King granted the land to his loyal…”, you can predict that “servant,” “knight,” or “subject” might come next because you remember what came before. The model needs to do the same thing - use previous words to predict the next word.

However, there’s a crucial constraint: during training, when predicting word 7, the model must only see words 1-6, never word 8 or beyond. This “causal” (cause-and-effect) constraint ensures the model learns realistic patterns - in the real world, you can’t use future information to predict the present.

How Attention Works

Think of attention as a sophisticated “relevance detector”. When the model is processing the word “loyal” in our example above, it needs to look back and ask: “Which previous words are most relevant for understanding this context?” The attention mechanism computes a weighted sum of all previous token representations, where the weights are determined by how relevant each previous token is to the current one.

This is done through three learned linear projections that create different “views” of each word:

  • Query (Q): “What am I looking for?” - The current word asks a question
  • Key (K): “What do I have to offer?” - Previous words advertise their content
  • Value (V): “What information should I contribute?” - The actual information to pass forward

Let us see a practical example to help us grok the concept. Consider the historical phrase: “The alderman of Cheapside, having served the city faithfully, was granted...”

When processing “granted”, the attention mechanism:

  1. Creates a Query from “granted” asking “what context do I need?”
  2. Compares this Query against Keys from all previous words
  3. Finds high relevance with “alderman” (who is being granted something) and “faithfully” (why the grant is happening)
  4. Uses these attention weights to pull relevant Values from those words
  5. Combines this information to better understand “granted” in context

The attention score between token i and token j is computed as:

$$\text{Attention}(Q_i, K_j) = \text{softmax}\left(\frac{Q_i K_j^T}{\sqrt{d_k}}\right)V_j$$

Breaking this down:

  • $Q_i K_j^T$ computes how well the query from token i matches the key from token j (higher = more relevant)
  • $1/\sqrt{d_k}$ is a scaling factor that prevents scores from getting too large (which would make softmax too “sharp”)
  • $\text{softmax}$ converts scores into probabilities that sum to 1 (so each word gets a weighted “vote”)
  • Finally, we use these weights to combine the Values from all previous tokens

The $1/\sqrt{d_k}$ scaling factor (where $d_k = 64$ for our SLM, so $\sqrt{64} = 8$) prevents the dot products from growing too large with high-dimensional embeddings, ensuring stable gradients during training. The softmax ensures the weights sum to 1, creating a proper probability distribution over the previous tokens.
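
A toy, single-head version of this computation (random values, no causal mask, purely to show the scaling and the softmax; not the model’s actual attention code, which appears later in Listing 4):

import torch
import torch.nn.functional as F

d_k, seq_len = 64, 6
q = torch.randn(1, d_k)              # query for the current token (e.g., "granted")
K = torch.randn(seq_len, d_k)        # keys for the current and previous tokens
V = torch.randn(seq_len, d_k)        # values for those tokens

scores = (q @ K.T) / (d_k ** 0.5)    # scale by sqrt(d_k) = 8
weights = F.softmax(scores, dim=-1)  # attention weights over the tokens
context = weights @ V                # weighted combination of the values
print(weights.sum())                 # ~1.0 - a proper probability distribution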

Why This Matters for Historical Text?

Historical documents present unique challenges that make attention particularly valuable. Consider a legal document from 1750: “John Smith, yeoman of the parish of St. Giles, being of sound mind and body, doth hereby bequeath…”

The attention mechanism enables the model to:

  • Connect “doth bequeath” back to “John Smith” across multiple clauses
  • Understand that “yeoman” modifies “John Smith” even though they’re separated
  • Learn that “doth” (archaic) and “does” (modern) serve similar grammatical functions
  • Recognize that formal legal phrasing follows specific patterns

For our historical language models, this attention mechanism learns which historical terms and phrases co-occur and relate to one another contextually - crucial for understanding historical documents where terminology and phrasing differ from modern English. The model learns to attend to relevant historical context, enabling it to generate coherent text that maintains period-appropriate language patterns and references.

2.1.4 Language Modeling Head

The language modeling head (also called the “output projection” or lm_head) is the final translator that turns the rich internal representation (after all the attention + MLP refinements) back into a decision: “Given everything I’ve seen so far, what is the most likely next token?” It does this by mapping each hidden vector at every position into a vector of length equal to the vocabulary size (30,000 in our historical tokenizer). Each element of that output vector is a logit - an unnormalized score indicating how likely the model thinks the token is to be the next one.

Implementation is intentionally simple: lm_head = torch.nn.Linear(n_embd, vocab_size). We don’t put an activation function after it because we want raw, unconstrained scores. Those scores then flow into:

  • Inference: Apply softmax -> probabilities -> sample or greedy pick
  • Training: Feed logits + target token IDs into cross-entropy loss -> gradients flow backward

You can think of logits as evidence totals. The softmax transforms those evidence values into a normalized probability distribution that the model can sample from. High logit = more supporting evidence; low logit = less.

Step-by-step (Inference vs Training):

  1. Hidden state at last position (e.g., index 511) enters lm_head.
  2. Linear projection produces a 30,000-dimensional logit vector.
  3. In inference: probs = softmax(logits / temperature); optionally apply top-k/top-p filtering.
  4. Sample (or argmax) a token → append to sequence → repeat.
  5. In training: Cross-entropy compares logits to the true next token; loss scalar backpropagates through the head into all prior layers.
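
To make the inference path (steps 3-4 above) concrete, here is a minimal sampling sketch. The function name and defaults are illustrative and not the exact code in inference_pytorch.py:

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50):
    """Minimal temperature + top-k sampling over a (vocab_size,) logit vector."""
    logits = logits / temperature                   # soften (>1) or sharpen (<1) the distribution
    if top_k is not None:
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[-1]] = -float("inf")      # mask everything outside the top-k scores
    probs = F.softmax(logits, dim=-1)               # convert scores to probabilities
    return torch.multinomial(probs, num_samples=1)  # draw one token ID

# Usage (illustrative): next_id = sample_next_token(model_logits[0, -1, :])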

Because our vocabulary mixes common function words (“the”, “and”, “of”) with rare era-specific tokens (“yeoman”, “guildhall”, “paternoster”, “quoth”), the head must reliably distinguish both frequent and infrequent patterns. Rare historical tokens need consistent representations from embedding -> transformer -> head so they are not forgotten. If their logits remained perpetually low, the model would never learn to generate them in authentic contexts.

Logits (not probabilities) inside the model - We retain logits (raw, unnormalized scores) instead of immediately converting to probabilities because they yield numerically stable loss computation (PyTorch efficiently fuses log_softmax with negative log-likelihood), allow cleaner gradient flow before any normalization (we only invoke softmax when we actually need a distribution), and enable flexible post-processing (temperature scaling, top-k or top-p filtering, repetition penalties) directly in score space without an extra probability recomputation step.

We reuse the input embedding matrix for the output projection to keep input and output semantics aligned and reduce parameter count and memory traffic. This concept is called weight tying, which we will cover in detail in Section 2.2.2 - Weight Tying.

We share the embedding and output projection weights (self.transformer.wte.weight = self.lm_head.weight) so input token interpretation and next-token scoring occur in the same semantic space.

Using the shared embedding matrix $E$ (shape $(V,d)$), the logits are computed with $\text{logits} = h \cdot E^T$, reusing the same rows used for token lookup. This saves parameters (~23.0M SLM, ~30.7M Regular), keeps gradients for rare historical tokens coupled, and reduces memory traffic (Press & Wolf, 2017; Inan et al., 2016). See Section 2.2.2 for detailed mechanics, benefits, and the historian/scribe analogy.

In short, the lm_head converts rich contextual understanding into next-token scores; with weight tying (details in Section 2.2.2) it stays efficient and semantically consistent.

2.1.5 The Complete Flow

The complete forward pass through our GPT model works as follows:

  1. Input: A sequence of token IDs (batch × sequence_length, e.g., 512 tokens)
  2. Token Embedding: Convert each token ID to a dense vector (768 or 1024 dimensions)
  3. Position Embedding: Add position information to each token
  4. Transformer Blocks: Pass through n_layer transformer blocks (12 for SLM, 24 for regular model), each containing:
    • Layer normalization
    • Causal self-attention (with multiple heads)
    • Residual connection
    • Layer normalization
    • Feed-forward MLP
    • Residual connection
  5. Final Layer Norm: Normalize the final hidden states
  6. Language Head: Project to vocabulary logits (30,000 dimensions)
  7. Output: Probability distribution over next token

This architecture design is conventional and follows the GPT-style pattern established by OpenAI’s GPT models. The traditional design is intentional - it keeps the implementation clear and easy to learn from, while being configured to work seamlessly with our historical tokenizer from Part 2.

Figure 2 below illustrates the complete architecture for the SLM:

graph TD
    A[Input Tokens<br/>512 tokens] --> B[Token Embedding<br/>30K vocab → 768 dim]
    A --> C[Position Embedding<br/>512 pos → 768 dim]
    B --> D[Add Embeddings]
    C --> D
    D --> E[Layer Norm]
    E --> F[Transformer Block 1<br/>12 heads, 768 dim]
    F --> G[Transformer Block 2<br/>12 heads, 768 dim]
    G --> H[...]
    H --> I[Transformer Block 12<br/>12 heads, 768 dim]
    I --> J[Final Layer Norm]
    J --> K[Language Head<br/>768 → 30K vocab]
    K --> L[Output Logits<br/>30K probabilities]
    
    subgraph "Key Specifications"
        M[Layers: 12<br/>Heads: 12<br/>Embedding: 768<br/>Context: 512<br/>Parameters: 117M<br/>Training: 7-8 hours<br/>MFU: 8-9%]
    end
    
    style A fill:#e1f5fe
    style L fill:#e8f5e8
    style M fill:#fff3e0
Figure 2: SLM Architecture (117M Parameters)

The Regular model, as shown in Figure 3 below, follows the same architectural pattern as the SLM but with increased capacity: 24 transformer layers instead of 12, 16 attention heads instead of 12, and 1024-dimensional embeddings instead of 768.

This represents a ~3x increase in parameters (354M vs 117M), 2x the number of layers, a third more attention heads, and ~33% larger embedding dimensions, providing significantly more computational power for learning complex historical language patterns.

graph TD
    A[Input Tokens<br/>1024 tokens] --> B[Token Embedding<br/>30K vocab → 1024 dim]
    A --> C[Position Embedding<br/>1024 pos → 1024 dim]
    B --> D[Add Embeddings]
    C --> D
    D --> E[Layer Norm]
    E --> F[Transformer Block 1<br/>16 heads, 1024 dim]
    F --> G[Transformer Block 2<br/>16 heads, 1024 dim]
    G --> H[...]
    H --> I[Transformer Block 24<br/>16 heads, 1024 dim]
    I --> J[Final Layer Norm]
    J --> K[Language Head<br/>1024 → 30K vocab]
    K --> L[Output Logits<br/>30K probabilities]
    
    subgraph "Key Specifications"
        M[Layers: 24<br/>Heads: 16<br/>Embedding: 1024<br/>Context: 1024<br/>Parameters: 354M<br/>Training: 28-32 hours<br/>MFU: 15-20%]
    end
    
    style A fill:#e1f5fe
    style L fill:#e8f5e8
    style M fill:#fff3e0
Figure 3: Regular Model Architecture (354M Parameters)

2.2 SimpleGPT Class

Now that we’ve covered the theory behind the GPT architecture, let’s examine the actual implementation. The SimpleGPT class is the heart of our implementation - it brings together all the components discussed in section 2.1 into a working language model. The class inherits from PyTorch’s torch.nn.Module, which is the base class for all neural network components in PyTorch. This gives us access to automatic differentiation, GPU support, and other PyTorch features.

2.2.1 The __init__ method

The __init__ method is the constructor that assembles our entire language model from individual components. First, it stores all hyperparameters (such as vocabulary size, embedding dimensions, and the number of layers) in a configuration object that the rest of the model can reference. Next, it creates the embedding layers - one that converts our 30,000 historical tokens into dense vectors, and another that encodes position information, so the model knows where each word appears in the sequence.

Then it builds the transformer blocks - the core processing units that do the heavy lifting. Each block contains self-attention mechanisms and feed-forward networks that learn to understand relationships between words. The method also initializes the language modeling head, the final layer that converts all internal processing back into probabilities for which word should come next.

Finally, it sets up proper weight initialization to ensure the model starts with good random weights (not too big, not too small), and implements weight tying between the input embeddings and the output layer. This clever technique reduces the number of parameters while improving training efficiency by sharing weights between the first and last layers.

This is important because if the weights are too large, the model’s gradients can explode during training, leading to unstable learning. If they start too small, gradients can vanish, rendering the model unable to learn. Our initialization ensures the model begins in the “Goldilocks zone” - just right for effective learning. Without this, even a perfectly designed architecture might fail to train properly.

Now, let me show you the actual implementation. The code in Listing 1 demonstrates how we implement the core GPT architecture:

class SimpleGPT(torch.nn.Module):
    """Simple GPT model based on nanoGPT
    
    This class implements a decoder-only transformer model optimized for 
    historical text generation. It inherits from PyTorch's Module class
    to get automatic differentiation and GPU support.
    """
    
    def __init__(self, config):
        super().__init__()  # Initialize the parent PyTorch Module class
        self.config = config  # Store all model hyperparameters
        
        # Create the main transformer components using ModuleDict
        # ModuleDict allows us to organize related layers together
        self.transformer = torch.nn.ModuleDict(dict(
            # Token Embedding Layer (wte = "word token embedding")
            # Converts each token ID to a high-dimensional vector
            # Input: token IDs (integers 0 to vocab_size-1)
            # Output: dense vectors of size n_embd (e.g., 768 dimensions)
            wte = torch.nn.Embedding(config.vocab_size, config.n_embd),
            
            # Position Embedding Layer (wpe = "word position embedding") 
            # Encodes where each token appears in the sequence
            # Input: position indices (0 to block_size-1)
            # Output: dense vectors of size n_embd
            wpe = torch.nn.Embedding(config.block_size, config.n_embd),
            
            # Dropout Layer for regularization
            # Randomly sets some inputs to zero during training to prevent overfitting
            drop = torch.nn.Dropout(config.dropout),
            
            # Stack of Transformer Blocks (h = "hidden layers")
            # Each SimpleBlock contains self-attention and feed-forward layers
            # We create n_layer blocks (e.g., 12 for SLM, 24 for regular model)
            h = torch.nn.ModuleList([SimpleBlock(config) for _ in range(config.n_layer)]),
            
            # Final Layer Normalization
            # Normalizes the output before the language modeling head
            ln_f = torch.nn.LayerNorm(config.n_embd, bias=config.bias),
        ))
        
        # Language Modeling Head
        # Converts the final hidden states back to vocabulary space
        # Input: hidden states of size n_embd
        # Output: logits for each token in vocabulary (vocab_size logits)
        self.lm_head = torch.nn.Linear(config.n_embd, config.vocab_size, bias=False)
        
        # Initialize all weights using our custom initialization method
        # This ensures the model starts with good random weights
        self.apply(self._init_weights)
        
        # Weight Tying: Share weights between input embeddings and output layer
        # This technique improves training efficiency and model performance
        # by ensuring the same representation space is used for input and output
        self.transformer.wte.weight = self.lm_head.weight
Listing 1: SimpleGPT Model Architecture
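
Listing 1 shows only the constructor. For completeness, here is a rough sketch of the forward pass that implements the flow from Section 2.1.5 (a simplified, hypothetical version of SimpleGPT.forward - the actual method in 04_training/ may differ in details such as loss masking and generation handling):

    def forward(self, idx, targets=None):
        # idx: token IDs of shape (batch, seq_len), with seq_len <= config.block_size
        B, T = idx.size()
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)

        # Steps 1-3: token embeddings + position embeddings, then dropout
        tok_emb = self.transformer.wte(idx)            # (B, T, n_embd)
        pos_emb = self.transformer.wpe(pos)            # (T, n_embd), broadcast over the batch
        x = self.transformer.drop(tok_emb + pos_emb)

        # Step 4: pass through the stack of transformer blocks
        for block in self.transformer.h:
            x = block(x)

        # Steps 5-6: final layer norm and projection to vocabulary logits
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)                       # (B, T, vocab_size)

        # Step 7: during training, compare logits against the next-token targets
        loss = None
        if targets is not None:
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss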

2.2.2 Weight Tying

The tied weights between the embedding layer and language modeling head (self.transformer.wte.weight = self.lm_head.weight) are a crucial optimization for our historical language model. In a typical neural network, you’d have two separate weight matrices - one for converting input tokens to embeddings, and another for converting hidden states back to vocabulary probabilities. Weight tying means we use the same weight matrix for both operations.

Think of it like this: instead of having two different dictionaries (one for reading, one for writing), we use the same dictionary for both. The same table that maps “alderman” → [0.2, -0.5, 0.8, …] is used whether the model is reading “alderman” as input or trying to generate “alderman” as output.

Without weight tying, the model could learn that “alderman” means one thing when it sees it as input, but something slightly different when it tries to generate it as output. For rare historical terms, this inconsistency can cause the model to “forget” how to properly use words it has seen before.

Historical vocabulary contains many rare terms such as “quoth”, “alderman”, and “paternoster” that appear infrequently in our training data. Without weight tying, the model might learn different representations for the same word when it sees it as input versus when it generates it as output. This inconsistency can cause the model to struggle with rare historical terms.

When the model sees “alderman” in the input, it learns a specific representation of it. Later, when it needs to generate “alderman” in the output, it uses that same learned representation, ensuring consistency and improving the model’s ability to generate coherent historical language with proper terminology.

Mechanics (matrix reuse): A single matrix $E \in \mathbb{R}^{V \times d}$ serves both roles: row lookup for input embeddings and column interaction for output scoring. The language head reuses it to compute logits via:

$$\text{logits} = h \cdot E^T$$

where $h$ is the hidden state at each position. This keeps the input interpretation and output prediction within the same semantic geometry - no second projection to drift or disagree.

Why it helps: Parameter savings (~23.0M for the SLM, ~30.7M for the Regular model) lower the memory footprint and bandwidth. Gradients for predicting a rare token (e.g., yeoman, guildhall, paternoster) directly refine the very rows used to embed it on future inputs - improving both recall and generation. Shared weights also mildly regularize against the two spaces drifting apart and empirically improve perplexity for mid-scale autoregressive models (Press & Wolf, 2017; Inan et al., 2016).
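
The quoted savings are simply the size of the shared matrix $E$ - one $V \times d$ matrix instead of two:

# Weight tying saves one full (vocab_size x n_embd) projection matrix
vocab_size = 30_000
print(vocab_size * 768)    # 23040000 -> ~23.0M parameters saved for the SLM
print(vocab_size * 1024)   # 30720000 -> ~30.7M parameters saved for the Regular model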

Analogy: If the transformer stack is a panel of historians debating context, the language modeling head is the scribe choosing the next historically plausible word. Weight tying means the scribe and historians consult the same dictionary - no translation mismatch between how words are read and how they’re proposed.

Practical notes: Avoid inflating vocabulary unnecessarily (cost scales with $V$); tied weights do not remove the need for careful rare-token coverage in the corpus; and if you later add adapters or LoRA heads, remember that tying interacts with how those layers inject low-rank updates.

Now that we understand how the model efficiently handles vocabulary, let’s examine the core processing units that transform these embeddings into meaningful representations.

2.3 Transformer Block Design

Each transformer block implements the standard attention and feed-forward pattern, but with optimizations for historical text processing. Let us look at the code real quick and then get into a little more detail.

class SimpleBlock(torch.nn.Module):
    """Simple transformer block"""
    
    def __init__(self, config):
        super().__init__()
        self.ln_1 = torch.nn.LayerNorm(config.n_embd, bias=config.bias)
        self.attn = SimpleCausalSelfAttention(config)
        self.ln_2 = torch.nn.LayerNorm(config.n_embd, bias=config.bias)
        self.mlp = SimpleMLP(config)
    
    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
Listing 2: Transformer Block Implementation

This code implements a single transformer block, which, as we know, is the fundamental building unit of our GPT model.

2.3.1 Self-Attention Step

The self-attention step (x = x + self.attn(self.ln_1(x))) first normalizes the input with LayerNorm, then applies self-attention to understand relationships between words. The + creates a “residual connection” that helps information flow through the network.

As we discussed in section 2.1.3, self-attention is the “magic” of transformers, allowing each token to decide how much attention to pay to every other token in the sequence. Our implementation uses multiple attention heads (12 for SLM, 16 for Regular) that operate in parallel, with each head learning to focus on different types of relationships - syntactic, semantic, or positional. Causal masking ensures that during training, the model learns to predict the next token based solely on the preceding context, which is essential for coherent text generation.

The residual connection (+) is crucial, as it allows the model to preserve the original token representation while adding contextual information from the attention. The pre-normalization approach (LayerNorm before attention) provides more stable training than post-normalization, especially important when working with the varied linguistic patterns found in historical text.

2.3.2 Feed-Forward Step

After attention, we have the feed-forward step (x = x + self.mlp(self.ln_2(x))), which first normalizes the attended information with LayerNorm, then passes it through a multi-layer perceptron (MLP) that transforms and processes it.

The MLP typically consists of two linear layers with a non-linear activation function (like GELU) between them, allowing the model to learn complex non-linear transformations of the attended features. This step is crucial because attention can only perform linear transformations on the input representations; the feed-forward network adds the necessary non-linearity, enabling the model to learn complex patterns and relationships in the historical text. Another residual connection preserves the original information, ensuring that the model can always fall back to the pre-attention representation if needed.

2.3.3 Understanding the Feed-Forward (MLP) Sublayer

Directly beneath the SimpleBlock code above, you see the line self.mlp = SimpleMLP(config). After attention has mixed information across positions, the model passes each token embedding through a position-wise feed-forward network (the MLP). Unlike attention, it does not look at other tokens; it refines the representation of each token independently, given the contextualized features attention just produced. In practice, this is where raw contextual patterns are distilled into richer semantic, stylistic, and morphological signals.

Figure 4 below visualizes how a single transformer block routes data through normalization, attention, and the feed-forward expansion/contraction before returning an upgraded representation via the residual path:

graph TB
    A[Input Embeddings<br/>batch, seq, emb] --> LN1[LayerNorm 1]
    LN1 --> ATTN[Multi-Head Attention<br/>query, key, value]
    ATTN --> DROPA[Dropout]
    DROPA --> RES1[Residual Add<br/>x + attn_out]
    RES1 --> LN2[LayerNorm 2]
    LN2 --> EXPAND[Linear Expand<br/>emb → 4*emb]
    EXPAND --> GELU[GELU Activation]
    GELU --> PROJECT[Linear Project<br/>4*emb → emb]
    PROJECT --> DROPM[Dropout]
    DROPM --> RES2[Residual Add<br/>res1 + mlp_out]
    RES2 --> OUT[Block Output<br/>Updated embeddings]
    style A fill:#e1f5fe
    style ATTN fill:#f3e5f5
    style EXPAND fill:#fff3e0
    style PROJECT fill:#fff3e0
    style RES2 fill:#e8f5e8
Figure 4: Internal Flow of a Transformer Block

Conceptually, the MLP is a two-step projection: first, an expansion into a higher-dimensional “workspace” with a non-linear activation, then a projection back down so the residual can safely merge with the original stream.

For our SLM, 768 dimensions expand to 3072 and then contract back to 768; for the larger model, 1024 dimensions expand to 4096. This temporary widening allows the network to express combinations of features that a purely linear transform could not capture. It is the difference between merely routing information and actually transforming it.

Here is the representative structure shown in Listing 3:

class SimpleMLP(torch.nn.Module):
    def __init__(self, config):
        super().__init__()
        self.fc_in  = torch.nn.Linear(config.n_embd, 4 * config.n_embd)
        self.act    = torch.nn.GELU()
        self.fc_out = torch.nn.Linear(4 * config.n_embd, config.n_embd)
        self.drop   = torch.nn.Dropout(config.dropout)
    def forward(self, x):
        return self.drop(self.fc_out(self.act(self.fc_in(x))))
Listing 3: Feed-Forward (MLP) Sublayer Implementation

Why expand then shrink?

The widened hidden space allows the model to form intermediate feature bundles (e.g., tense, register, archaic morphology) that do not cleanly live in the original lower-dimensional basis. The contraction enforces a stable interface for the residual path and keeps the total parameter count manageable. Removing the expansion would noticeably degrade expressiveness; removing the contraction would balloon memory use and break architectural symmetry.

In our context, the historical model internalizes regularities like mapping “hath” and “doth” into modern tense abstractions while still preserving period flavor; it encodes stylistic shifts between court proceedings, religious prose, and narrative storytelling; it stabilizes inconsistent orthography and variant spellings so downstream layers predict coherent continuations instead of brittle echoes. Attention tells the model where to look; the MLP decides how to reinterpret what it saw.

Focusing only on attention gives an incomplete mental model of transformers. More than half of the parameters and a large fraction of the FLOPs sit in these feed-forward layers. Under-sized MLPs lead to shallow pattern memorization - models that can repeat phrases but cannot generalize style or adapt archaic forms to new contexts. Properly scaled MLP width (the common ×4 expansion) is a proven sweet spot: smaller factors underfit; much larger ones give diminishing returns at this scale (see scaling law discussions in Kaplan et al. 2020 ).

A useful mental analogy: attention is the lively debate in a hall; the MLP is each participant stepping aside to integrate what was heard into their own refined understanding before the next round of discussion. When you see x = x + self.mlp(self.ln_2(x)), that addition represents the moment a token’s contextual representation is upgraded. Without this transformation, the model would “hear” context but fail to internalize it, producing shallow, literal continuations rather than fluent, period-authentic prose.

In our helloLondon models, the MLP is therefore essential for converting raw multi-head attention patterns into durable historical linguistic competence - one of the quiet reasons the generated text feels coherent rather than stitched together.

Each block in our model (12 for SLM, 24 for Regular) applies this same pattern, allowing the model to build an increasingly sophisticated understanding of historical language patterns as text flows through the layers.

Each transformer block applies layer normalization before both the self-attention mechanism and the feed-forward network, followed by residual connections. This pre-normalization approach (as opposed to post-normalization) has been shown to provide more stable training, especially important when working with the varied linguistic patterns found in historical text.

2.3.4 Activation choice matters

The activation function determines how the neural network processes information at each layer. Think of it as a “decision maker” that decides how much of each input signal to pass through to the next layer. The most common activation functions are ReLU (Rectified Linear Unit) and GELU (Gaussian Error Linear Unit).

ReLU is simple and fast: it passes positive values unchanged and sets negative values to zero (f(x) = max(0, x)). However, ReLU can be “harsh” - it completely cuts off negative signals, leading to “dead neurons” that never activate again. GELU is smoother and more sophisticated: it uses a Gaussian distribution to determine how much of each input to pass through (f(x) = x * Φ(x), where Φ is the cumulative distribution function of a standard normal distribution). This creates a smooth, differentiable function that allows for more nuanced information processing.

GELU offers smoother gradients and better calibration for language than plain ReLU. The smoother nature of GELU helps the model learn more subtle patterns in historical text, where the relationships between words and phrases can be complex and nuanced. Alternatives like SwiGLU can yield marginal gains in perplexity but increase implementation complexity - valuable in frontier systems, optional in educational builds like helloLondon. Modest dropout in the MLP further improves generalization on a corpus that, while sizable, is still modest relative to billion-token modern pretraining regimes.
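
A quick way to see the difference in code (the GELU outputs below are approximate and depend on the exact formulation PyTorch uses):

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(F.relu(x))  # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000]) - negatives cut to zero
print(F.gelu(x))  # ~tensor([-0.0455, -0.1543, 0.0000, 0.3457, 1.9545]) - negatives pass through smoothly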

2.3.5 Pre vs Post-normalization

In pre-normalization, we normalize the input before processing it (like we do here). In post-normalization, we’d process first, then normalize the output. Pre-normalization is like checking that your ingredients are properly prepared before cooking, while post-normalization is like seasoning after cooking - both work, but pre-normalization tends to yield more consistent results.

This matters because historical texts contain complex syntactic structures and long-range dependencies that require sophisticated attention mechanisms. The residual connections ensure that information can flow directly through the network, helping the model learn to preserve important historical context across long sequences while still allowing the attention mechanism to focus on relevant historical details.

2.4 Causal Self-Attention for Historical Sequences

The attention mechanism is crucial for understanding the complex relationships in historical text. Our implementation is based on the original transformer architecture from “Attention Is All You Need” (Vaswani et al., 2017), but optimized for historical language patterns.

Understanding Multi-Head Attention

Multi-head attention runs several attention “heads” in parallel, allowing the model to focus on different aspects of a sequence simultaneously (syntax, semantics, and position). Compared to a single head, this parallelism yields richer representations—think multiple specialists examining the same text. In our setup, the SLM uses 12 heads and the Regular model 16, scaling capacity with model size. Empirically, heads tend to specialize (e.g., subject–verb agreement, word relations, word order), as observed by Clark et al. (2019) - “What Does BERT Look At?” .

Research by Kaplan et al. (2020) Scaling Laws for Neural Language Models shows that the optimal number of attention heads scales with model size. For our 117M-parameter SLM, 12 heads provide sufficient parallel processing capacity, while our 354M-parameter Regular model benefits from 16 heads to capture more complex attention patterns.

The attention mechanism has $O(n^2)$ complexity with respect to sequence length. This means that doubling our sequence length from 512 to 1024 tokens is a quadratic jump and requires 4x more memory for attention computations. This is why we carefully balance sequence length with available GPU memory and why techniques like FlashAttention (Dao et al., 2022) are so important for memory efficiency.

How the attention mechanism works in practice:

The code in Listing 4 shows how we implement the attention mechanism that we’ve been discussing. Here’s what happens step by step:

  1. Input Processing: The model receives a batch of sequences (B = batch size, T = sequence length, C = embedding dimension). For example, with our SLM: B=4, T=512, C=768.

  2. Query, Key, Value Generation: The input embeddings are transformed into three different representations - Query (Q), Key (K), and Value (V) - using a single linear layer that outputs 3×768 dimensions, then splits them.

  3. Multi-Head Reshaping: Each of Q, K, V is reshaped to separate the 12 attention heads, so each head gets its own 64-dimensional subspace (768 ÷ 12 = 64).

  4. Attention Computation: The scaled dot-product attention is computed, where each word “looks at” all previous words (causal masking) and decides how much attention to pay to each.

  5. Output Assembly: All attention head outputs are combined back into a single representation and projected through a final linear layer.

This implementation is optimized for historical text processing, using PyTorch’s efficient scaled_dot_product_attention function with causal masking to ensure the model can only attend to previous tokens, not future ones.

class SimpleCausalSelfAttention(torch.nn.Module):
    """Simple causal self-attention"""
    
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # Key, query, value projections for all heads, but in a batch
        self.c_attn = torch.nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # Output projection
        self.c_proj = torch.nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        # Regularization
        self.attn_dropout = torch.nn.Dropout(config.dropout)
        self.resid_dropout = torch.nn.Dropout(config.dropout)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout
    
    def forward(self, x):
        # Batch size, sequence length, embedding dimensionality (n_embd)
        B, T, C = x.size()
        
        # Calculate query, key, values for all heads in batch 
        # and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        
        # Causal self-attention; Self-attend:
        # (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        y = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # Re-assemble all head outputs side by side
        
        # Output projection
        y = self.resid_dropout(self.c_proj(y))
        return y
Listing 4: Causal Self-Attention Implementation

The attention mechanism computes attention as shown below:

$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively.

In our case, with 768-dimensional embeddings and 12 heads, each head operates on 64-dimensional subspaces ($d_k = 768 / 12 = 64$), providing sufficient representational capacity for each type of historical relationship while maintaining computational efficiency.

In addition, the $\sqrt{d_k}$ scaling factor ($\sqrt{64} = 8$) prevents the dot products from becoming too large, ensuring stable gradient flow during training.

In plain English, please!

Think of attention like a spotlight that can shine on different parts of a sentence. When the model is trying to understand the word “he” in a historical document, it needs to look back through the text to find who “he” refers to. The attention mechanism is like having multiple spotlights (our 12 or 16 attention heads) that can each focus on different aspects - one might look for people’s names, another for relationships, and another for locations.

The mathematical formula we showed above is how the model calculates the amount of attention to pay to each word. The scaling factor ($\sqrt{64} = 8$) is like adjusting the brightness of the spotlight – it prevents the model from being “blinded” by very bright spots and helps it focus on the right amount of information.

Does this matter for historical text?

Historical documents are particularly challenging because they often feature complex sentence structures and references spanning long distances. For example, in a court record, you might see “The defendant, John Smith, was accused of theft. He claimed innocence throughout the trial.” The model needs to understand that “He” refers to “John Smith,” even though there are several words between them. The attention mechanism enables the model to make these connections, generating coherent text that maintains proper historical context and references.

This is certainly required for language modeling, given the complex structures in which later words reference earlier ones, and understanding the full context is essential for proper interpretation. The attention mechanism enables the model to learn long-range dependencies, allowing it to generate coherent text across extended sequences. For historical texts specifically, this becomes even more important because archaic language patterns and historical references often span longer distances than those in modern texts.

2.5 Model Configuration

The model architecture uses a centralized configuration, where each parameter is selected based on research findings and practical constraints for historical text processing. The SLM architecture uses five key parameters, each representing a design choice with specific trade-offs between computational efficiency and learning capacity.

| Parameter | Value | Purpose | Trade-off |
|---|---|---|---|
| n_layer | 12 | Number of transformer blocks (model depth) | More layers = better learning, but slower training |
| n_head | 12 | Number of attention heads (parallel processing) | More heads = better attention, but more computation |
| n_embd | 768 | Embedding dimension (token representation) | Larger = richer representations, but more memory |
| max_length | 512 | Context window size (sequence length) | Longer = more context, but quadratic memory growth |
| vocab_size | 30K | Vocabulary size (tokenizer compatibility) | Larger = more words, but more parameters |

These parameters work together to create a model that effectively processes historical text while remaining computationally manageable.
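
To make this concrete, here is a hedged sketch of what such a centralized configuration might look like (attribute names follow the listings above; the actual config.py may organize these values differently):

from dataclasses import dataclass

@dataclass
class SLMConfig:
    n_layer: int = 12         # transformer blocks (depth)
    n_head: int = 12          # attention heads
    n_embd: int = 768         # embedding dimension
    block_size: int = 512     # context window (max_length in the table above)
    vocab_size: int = 30000   # custom historical tokenizer vocabulary
    dropout: float = 0.1      # regularization (illustrative default)
    bias: bool = True         # whether Linear/LayerNorm layers use bias (illustrative default)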

Layer Count (n_layer: 12)

The 12-layer architecture balances representational capacity with computational efficiency for historical text processing. Shallow layers (1-3) learn basic token patterns and grammatical structures, middle layers (4-8) capture complex syntactic relationships and historical language patterns, and deep layers (9-12) understand high-level semantic relationships and historical context.

This depth follows GPT-2 Small’s 12-layer architecture, which delivers strong performance while remaining computationally manageable on available hardware.

Attention Heads (n_head: 12)

Multi-head attention allows the model to attend to different types of relationships simultaneously – for example, temporal (chronological order), social (class hierarchies), geographical (London landmarks), and linguistic (archaic patterns). The 12-head architecture balances parallel processing capability with computational efficiency for historical text understanding.

Embedding Dimension (n_embd: 768)

The 768-dimensional embedding space can represent complex historical concepts, such as archaic terms (“yeoman”, “paternoster row”, “hath”), while maintaining computational efficiency. This dimension is commonly used in transformer architectures, including BERT-base and GPT-2 Medium.

Why 768 became standard: As a side note, in case you are seeing a lot of 768 lately, there are a good set of reasons for this. Beyond its divisibility ($768 ÷ 12 = 64$ per attention head), 768 aligns with GPU memory architecture - it’s a multiple of 256 (3 × 256), which matches common GPU memory bus widths and cache line sizes. This makes matrix operations more efficient on modern GPUs, as the hardware can process data in optimal chunks. Additionally, 768 provides sufficient representational capacity without the memory overhead of larger dimensions like 1024, making it practical for training on consumer hardware while still capturing complex linguistic relationships.

Context Window (n_positions: 512)

We use a 512-token context window as a practical balance between historical coherence and available compute for a learning-focused setup. While many of our working snippets (e.g., diary passages, sections of legal records, or literary excerpts) comfortably fit within 512 tokens, full historical documents can be much longer. The 512 window keeps attention costs manageable (quadratic in sequence length) while covering typical training segments we use.

Both models use the same 30K vocabulary from our custom historical tokenizer, ensuring consistent tokenization across model variants.

3. GPU Configuration and Perf. Optimization

The training system is designed to maximize GPU utilization while maintaining training stability. Understanding GPU architecture and memory management is crucial for efficient language model training, especially when working with historical text that requires significant computational resources.

3.1 GPU Architecture and Memory Management for Language Model Training

Training on historical text benefits from sensible GPU settings even for a small, learning-focused model. We keep to practical, low-risk optimizations (precision choice, batch/sequence trade-offs, memory-aware attention) and accept some trial and error—reserving heavier systems engineering for larger setups.

The main universal factors are:

  1. Attention scales quadratically with sequence length, so longer contexts get expensive fast.
  2. Natural language variability (syntax, vocabulary, style) demands sufficient model capacity and stable optimization.
  3. Real‑world data quality (formatting, noise) can destabilize training, requiring robust error handling and memory management.

For historical text specifically, archaic vocabulary, period terminology, and cultural references introduce patterns absent from modern corpora. OCR artifacts and uneven formatting in digitized sources add noise beyond what’s typical in contemporary datasets.

3.1.1 GPU memory hierarchy and optimization strategies

Modern GPUs use a hierarchical memory system that significantly impacts training performance: fast but tiny registers and shared memory sit closest to the compute; a larger L2 cache buffers traffic; and global memory holds parameters and activations. Attention often ends up memory-bound, so moving less data (via AMP, Flash/SDPA kernels, and sensible sequence/batch sizes) is as important as raw FLOPs.

For language model training, the key optimization is managing the memory bandwidth bottleneck. Attention operations are often memory-bound rather than compute-bound, meaning performance is limited by how quickly data can be moved between memory levels rather than by computational power. If we are not careful, it is quite easy to run into memory issues, as shown in Figure 5 below.

Out of memory error Screenshot
Figure 5: OOM error

And it is not restricted to training only; even on checkpoints that are saved, we can also encounter memory issues, as shown in Figure 6.

Out of memory error - checkpoint evals Screenshot
Figure 6: OOM checkpoint eval

Mixed precision training and memory optimization

Training large language models requires careful memory management, especially when working with limited GPU resources. Our training system uses several optimization techniques to maximize memory efficiency while maintaining training stability.

GPU detection and basic configuration:

The training system needs to work across different hardware setups, from single consumer GPUs to multi-GPU servers. Our approach uses a centralized configuration system that automatically adapts to available hardware.

The actual GPU detection in train_model_slm.py is quite straightforward - it checks for distributed training environment variables (RANK, LOCAL_RANK, WORLD_SIZE) and sets up basic multi-GPU support if available. The system also detects GPU capabilities, such as bfloat16 support, and enables appropriate optimizations. This allows the same training script to work across different hardware setups, though the real complexity comes from the trial-and-error process of stabilizing training.

# GPU configuration (from config.py)
self.gpu_config = {
    "auto_detect": True,  # Automatically detect available GPUs
    "max_gpus": 0,  # Maximum number of GPUs to use (0 = no limit, use all available)
    "min_gpu_memory_gb": 8,  # Minimum GPU memory required (GB)
    "preferred_gpu_types": ["A30", "A40", "A100", "V100", "RTX4090", "RTX4080"],
    "fallback_to_cpu": True,  # Fall back to CPU if no suitable GPUs found
    "force_single_gpu": False,  # Force single GPU even if multiple available
    "force_multi_gpu": False,  # Force multi-GPU even if only one available
    "gpu_memory_fraction": 0.9,  # Fraction of GPU memory to use (0.0-1.0)
    "allow_growth": True,  # Allow GPU memory growth
    "log_device_placement": False  # Log device placement for debugging
}
Listing 5: GPU Configuration and Detection System

The configuration in Listing 5 is defined in our centralized config.py file and provides settings for automatic GPU detection, memory management, and fallback options. While this looks comprehensive, the actual implementation is simpler - the training code primarily relies on PyTorch’s built-in distributed training detection and basic device selection.

The reality of training: Nearly 100 runs and many failures

WandB Screenshot
Figure 7: helloLondon training runs - WandB

Figure 7 shows the actual training experience: 99 total runs with 24 completions. The failures were largely data-driven - OCR and encoding issues, uneven sequence lengths, and sensitivity to learning-rate warmup - and a few were plain memory pressure from early, less conservative settings. The code stabilized early; the data and knobs took time.

This iterative process is typical in language model development - the “sophisticated” system shown here is the result of learning from these failures and gradually improving the training pipeline. The successful runs exhibit stable loss curves and appropriate learning rate schedules, demonstrating that the final configuration performs well on historical text processing tasks.

Most importantly, this experience reinforces a fundamental truth in machine learning: data quality is still king. No amount of sophisticated architecture, GPU optimization, or training infrastructure can overcome poor data quality. The “garbage in, garbage out” principle remains as true for language models as it was for the earliest machine learning systems. Our 75% failure rate was primarily due to data issues – such as inconsistent formatting, OCR errors, and encoding problems - not technical limitations. This is why Part 2’s focus on data cleaning and tokenization was so crucial to our success.

3.2 Precision and Performance Configuration

The system includes precision and performance configuration options that can be tuned based on available hardware. Mixed-precision training uses lower-precision (fp16/bf16) for most operations while keeping full precision for critical computations, providing significant memory savings and speed improvements with minimal impact on quality.

Understanding fp16 and bf16: The Precision Trade-off

To understand why precision matters for language model training, we need to look at how computers represent numbers. Standard floating-point numbers use 32 bits (float32), but we can use fewer bits to save memory and increase speed:

  • fp16 (Half Precision): Uses 16 bits to represent numbers, cutting memory usage in half and enabling faster computation. However, it has a smaller range of representable numbers, which can cause “overflow” (numbers too large) or “underflow” (numbers too small) during training.

  • bf16 (Brain Float 16): Also uses 16 bits, but with a different bit layout that matches float32’s exponent range. This means it can represent the same range of large and small numbers as float32, but with less precision for very small decimal values.

Why bf16 is better for language models:

bf16 provides better numerical stability than fp16, especially for large language models, reducing the likelihood of overflow and underflow that can cause training instability. The key difference is that bf16 can represent the same range of numbers as float32 (from very small to very large), while fp16 has a much smaller range. This is crucial for language models because:

  1. Gradient magnitudes vary widely - Some gradients are very small (close to zero) while others are large
  2. Attention weights - The softmax operations in attention can produce very small numbers that FP16 might round to zero
  3. Learning rate scaling - Modern optimizers like AdamW work with gradients of varying magnitudes

When gradients become too small and are rounded to zero (underflow), the model stops learning effectively. When they become too large (overflow), training becomes unstable. bf16’s wider range helps prevent both issues.
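
A quick way to see this trade-off for yourself (the values below are chosen purely for illustration):

import torch

big = torch.tensor(70000.0)    # larger than fp16's max (~65504)
tiny = torch.tensor(1e-8)      # smaller than fp16's smallest subnormal (~6e-8)

print(big.to(torch.float16))   # inf  -> overflow
print(tiny.to(torch.float16))  # 0.0  -> underflow; the signal is lost

print(big.to(torch.bfloat16))  # finite (coarsely rounded, but usable)
print(tiny.to(torch.bfloat16)) # ~1e-08 (still nonzero, so learning continues)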

Understanding precision and performance settings:

The configuration in Listing 6 toggles the levers that matter on consumer hardware: TF32 for faster matmuls, AMP (prefer bf16) for stability and memory cuts, torch.compile for an extra boost after warmup, and sequence/batch sizes sized to your VRAM. Used together, these commonly halve activation memory and yield 2-3x speedups versus full-precision baselines.

# Runtime/precision knobs (A30 optimized)
"enable_tf32": True,
"enable_amp": True,
"amp_dtype": "bf16",  # bf16 on Ampere; fallback to fp16 if unsupported
"enable_compile": True,  # torch.compile; set False to reduce memory usage
# Conservative baseline (for broad hardware) - uncomment to use:
# "enable_tf32": False,
# "enable_amp": True,
# "amp_dtype": "fp16",
# Sequence/batch control
"max_length": 1024,  # increase tokens per step when VRAM allows
"batch_size": 20,    # per-GPU batch; raise if VRAM allows
# Conservative sequence/batch - uncomment to use:
# "max_length": 768,
# "batch_size": 8,
Listing 6: Precision and Performance Configuration

Key GPU Configuration Settings

A few switches move the needle the most: enable TF32 on Ampere-class GPUs for a quick matrix-mul speedup; use AMP (bf16 where supported, fp16 otherwise) to halve activation memory; and turn on torch.compile if you can afford the warmup to get another 1.2-1.5x after a few hundred steps. Keep the sequence length in line with VRAM (~512 tokens for 8GB, ~1024 for 16GB+), and scale the per-GPU batch size accordingly (think hundreds of MB per batch at these widths). The repo includes sensible presets so you can start conservative and dial up.

3.2.1 Real-World Performance Results

On 2x A30s, the SLM lands around mid-20s MFU with ~210 ms/iter and ~18 GB per GPU, converging from ~10.4 loss to the mid-3s over the full run. The clean BPE tokenizer and precision stack keep math efficient, and DDP delivers the expected speedup over a single device.

Automatic Precision Detection and Memory Optimization:

The system also includes automatic precision detection and memory optimization during model initialization. The code snippet below shows how the system automatically selects the optimal precision format based on available hardware capabilities:

# Precision / TF32 knobs from config
tf32 = self.slm_config.get("enable_tf32", True)
torch.backends.cuda.matmul.allow_tf32 = bool(tf32)
torch.backends.cudnn.allow_tf32 = bool(tf32)
try:
    torch.set_float32_matmul_precision('high' if tf32 else 'medium')
except Exception:
    pass
use_amp = self.slm_config.get("enable_amp", True)
amp_dtype_cfg = self.slm_config.get("amp_dtype", "bf16").lower()
bf16_ok = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
if use_amp:
    if amp_dtype_cfg == 'bf16' and bf16_ok:
        self.dtype = 'bfloat16'
    else:
        self.dtype = 'float16'
else:
    self.dtype = 'float32'
Listing 7: Precision Detection and Memory Optimization

The TF32 configuration optimizes matrix operations for Ampere+ GPUs, delivering significant speedups while maintaining training stability.
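
To connect this back to the training loop, here is a minimal sketch - our assumption of the typical pattern, not a verbatim copy of the repo code - of how a dtype string such as 'bfloat16' becomes an autocast context, and why fp16 additionally needs a gradient scaler:

import torch
from contextlib import nullcontext

def make_precision_context(dtype_name: str, device_type: str = "cuda"):
    """Map 'float32'/'bfloat16'/'float16' to an autocast context plus a GradScaler."""
    ptdtype = {"float32": torch.float32,
               "bfloat16": torch.bfloat16,
               "float16": torch.float16}[dtype_name]
    # float32 runs without autocast; bf16/fp16 wrap the forward pass in autocast
    ctx = (nullcontext() if dtype_name == "float32"
           else torch.amp.autocast(device_type=device_type, dtype=ptdtype))
    # Only fp16 needs loss scaling to keep small gradients from underflowing
    scaler = torch.cuda.amp.GradScaler(enabled=(dtype_name == "float16"))
    return ctx, scaler

With bf16 the scaler is effectively a no-op; with fp16 you would call scaler.scale(loss).backward() and scaler.step(optimizer) instead of the plain calls shown later in Listing 9.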

3.3 Multi-GPU Training with Distributed Data Parallel

The system supports multi-GPU training using PyTorch’s DistributedDataParallel (DDP) - each GPU hosts a full model replica, processes different batches in parallel, and synchronizes gradients automatically. PyTorch handles the inter‑GPU communication, so on two GPUs, you typically see near‑linear speedup (~2×) for these model sizes.

Multi-GPU training improves throughput and shortens wall‑clock time by splitting work across devices. On our 2× A30 setup, we process 36 sequences in parallel (18 per GPU) instead of 18 on a single card, cutting Regular model training from ~56 hours to ~28–32 hours. It also offers operational flexibility: scale up or down based on the number of GPUs available.

However, multi-GPU training introduces several challenges that can limit performance gains. The primary bottleneck is inter-GPU communication - after each backward pass, gradients must be synchronized across all GPUs, which requires transferring large amounts of data. This communication overhead can become significant, especially with larger models and more GPUs.

The performance of multi-GPU training heavily depends on the interconnect between GPUs. On NVIDIA systems, NVLink (found on high-end GPUs such as the A100 and H100) provides direct GPU-to-GPU connections with very high bandwidth, making it ideal for 2-8 GPU setups within a single node. For scaling across multiple nodes, InfiniBand networking provides the high bandwidth and low latency needed for near-linear scaling across many GPUs. PCIe connections are slower but more common in consumer and workstation systems.

In AMD systems, Infinity Fabric serves a role similar to NVLink, providing high-bandwidth interconnects between GPUs. AMD’s MI200 and MI300 series GPUs include Infinity Fabric links that enable efficient multi-GPU communication.

In practice, scaling efficiency depends on the ratio of computation to communication. Our historical language models have relatively modest parameter counts (117M-354M), so communication overhead can be significant compared to computation time. This is why we see good scaling with 2 GPUs but diminishing returns with more GPUs - the communication overhead starts to dominate.

DDP is more efficient than naive data parallelism (such as nn.DataParallel) because it reduces communication overhead and enables larger effective batch sizes. Listing 8 below shows how the training code detects and configures the distributed environment.

# DDP setup (process group already initialized in main())
self.ddp = int(os.environ.get('RANK', -1)) != -1
if self.ddp:
    self.ddp_rank = int(os.environ['RANK'])
    self.ddp_local_rank = int(os.environ['LOCAL_RANK'])
    self.ddp_world_size = int(os.environ['WORLD_SIZE'])
    self.device = f'cuda:{self.ddp_local_rank}'
    torch.cuda.set_device(self.device)
    self.master_process = self.ddp_rank == 0
    self.seed_offset = self.ddp_rank
else:
    self.master_process = True
    self.seed_offset = 0
    self.ddp_world_size = 1
Listing 8: Multi-GPU Training Setup

What is “rank” and why does it matter?

In distributed training, each GPU process gets a unique “rank.” Rank 0 acts as the coordinator (handles logging, checkpointing, and WandB), while the remaining ranks focus purely on computation. This avoids collisions - only one process touches files and dashboards - while every device contributes gradients.

This division of labor is crucial because it prevents conflicts. Without it, all processes would try to save checkpoints simultaneously, log to WandB at the same time, or write to the same files, causing errors and corruption.
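
In code, this division of labor typically reduces to a simple guard around anything that touches disk or dashboards (a minimal sketch; the actual script wires this through self.master_process, as in Listing 16 later):

import os

# RANK is set by torchrun; default to 0 for single-GPU runs
rank = int(os.environ.get("RANK", 0))
master_process = (rank == 0)

if master_process:
    # Only rank 0 writes checkpoints, logs to WandB, and prints summaries
    print("rank 0: handling logging and checkpointing")
# Every rank, including rank 0, still runs forward/backward and contributes gradients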

The key to scaling efficiency is that each GPU works independently on different data batches, then synchronizes only the essential information (gradients). Here’s how it works:

  1. Parallel computation: Each GPU processes a different batch of data simultaneously
  2. Gradient synchronization: After each backward pass, gradients are averaged across all GPUs
  3. Independent updates: Each GPU updates its model copy with the averaged gradients

This means that if you have 2 GPUs, you can process 2x the data in the same time, giving you roughly 2x the speed. With 4 GPUs, you get approximately 4x speedup. The “near-linear” part acknowledges that there’s always some overhead from communication and synchronization, so you might get a 1.9x speedup instead of exactly 2x. Still, it’s close enough to be very effective.

However, there’s a practical limit to this approach. Beyond 8-16 GPUs, the communication overhead becomes so significant that you need more robust hardware (such as InfiniBand networks) and advanced systems engineering techniques (gradient compression, pipeline parallelism, model parallelism) to maintain efficiency. For truly large-scale training with hundreds of GPUs, you need specialized infrastructure and techniques that go far beyond what we’re doing here.

This combination of distributed training and memory optimization enables us to train our historical language models efficiently, even on consumer hardware. The distributed setup provides fault tolerance and near-linear speedup, while the precision optimizations enable larger models and longer sequences on the same hardware.

4. Training Infrastructure: Making It All Work Together

As noted earlier, the two model variants share the same training stack (scheduler, checkpointing, WandB, DDP). See Part 1 for the high‑level comparison; here are the training‑relevant differences only:

  • SLM (117M): per‑GPU batch 18 → effective 36 on 2 GPUs; sequence length 512; ~7–8h on 2×A30
  • Regular (354M): per‑GPU batch 12 → effective 24 on 2 GPUs; sequence length 1024; ~28–32h on 2×A30

4.1 The Training Loop

The core training happens in the train() method, which implements a standard language model training loop with several key phases - outlined below.

4.1.1 Data Loading and Preparation

The training loop starts by loading tokenized data using get_batch('train'), which reads from pre-processed binary files created during data preparation. This includes both training and validation data, with the tokenizer from Part 2: Data Collection & Custom Tokenizers handling the conversion between text and tokens.

Main Training Loop Structure:

def train(self):
    """Main training loop (structure only - see train_model_slm.py for the full version)"""
    iter_num = 0
    X, Y = self.get_batch('train')  # get the initial batch

    while True:
        # 1. Learning rate scheduling
        lr = self.get_lr(iter_num)
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr

        # 2. Evaluation and checkpointing (every eval_interval steps)
        if iter_num % self.eval_interval == 0:
            losses = self.estimate_loss()
            # ... save a checkpoint if validation loss improved

        # 3. Forward pass with mixed precision (dtype comes from the precision config: bf16/fp16)
        with torch.amp.autocast(device_type='cuda'):
            logits, loss = self.model(X, Y)

        # 4. Backward pass and optimization
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
        self.optimizer.step()
        self.optimizer.zero_grad(set_to_none=True)

        # 5. Get the next batch
        X, Y = self.get_batch('train')

        # 6. Logging and monitoring
        if iter_num % self.log_interval == 0:
            pass  # ... log to WandB and console

        iter_num += 1
        if iter_num > self.max_iters:
            break
Listing 9: Core Training Loop Structure

Training Process Flow:

Understanding how the training actually works requires seeing both the high-level flow and the technical details of each phase. Figure 8 shows the complete training process flow.

graph TD
    A[Start Training] --> B[Load Tokenized Data]
    B --> C[Initialize Model & Optimizer]
    C --> D[Training Loop Start]
    D --> E[Update Learning Rate]
    E --> F{Evaluation Time?}
    F -->|Yes| G[Run Validation]
    F -->|No| H[Forward Pass]
    G --> H
    H --> I[Compute Loss]
    I --> J[Backward Pass]
    J --> K[Gradient Clipping]
    K --> L[Update Weights]
    L --> M[Zero Gradients]
    M --> N[Log Metrics]
    N --> O{Checkpoint?}
    O -->|Yes| P[Save Model State]
    O -->|No| Q[Load Next Batch]
    P --> Q
    Q --> R{Max Iterations?}
    R -->|No| D
    R -->|Yes| S[Save Final Model]
    S --> T[End Training]
Figure 8: Training Process Flow

Now that we have a high-level overview of the training process, let us dig deeper into each phase and see how it works in practice.

4.1.2 Data Loading

Data loading reads pre-tokenized sequences from binary files (.bin) using np.memmap for memory efficiency. The initial tokenization process can take quite a long time on our 500M+ character corpus, but this is done only once and saved to disk. This optimization was crucial during our development process – given nearly 100 training runs and many failures, re-tokenizing the entire corpus each time would have been prohibitively slow. The system handles the train/val split (90/10) with random sampling per batch and uses pin_memory() and non_blocking=True for faster GPU transfers.
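
For reference, here is a minimal sketch of this memmap-based batching pattern, assuming tokens are stored as uint16 (file names and defaults are illustrative; the actual get_batch lives in train_model_slm.py):

import numpy as np
import torch

def get_batch(data_dir, split, block_size=512, batch_size=18, device="cuda"):
    """Sample a random batch of (input, target) token sequences from a .bin file."""
    # memmap keeps the corpus on disk instead of loading it all into RAM
    data = np.memmap(f"{data_dir}/{split}.bin", dtype=np.uint16, mode="r")
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    if device.startswith("cuda"):
        # Pinned memory plus non_blocking=True overlaps the host-to-GPU copy with compute
        x = x.pin_memory().to(device, non_blocking=True)
        y = y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y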

When we run this for the first time, it takes a long time to load and tokenize the training data corpus. We see this just starting in Figure 9 below.

Tokenizer training data Screenshot
Figure 9: Tokenizer training data

Batch sizes are optimized for our 2x A30 GPU setup: 18 per GPU for the SLM model (36 effective batch size) and 12 per GPU for the Regular model (24 effective batch size). These numbers balance memory usage with training stability – the SLM can handle larger batches thanks to its smaller 117M parameter count. In comparison, the Regular model’s 354M parameters require smaller batches to fit in GPU memory.

Figure 10 below shows the dual GPU setup used for one of the training sessions for the Regular model.

GPU detail Screenshot
Figure 10: GPU detail

4.1.3 Learning Rate Scheduling

Learning Rate Scheduling uses cosine decay with warmup, a two-phase approach that helps prevent training instability. The warmup phase gradually increases the learning rate from 0 to the target value over 500 steps (SLM) or 1000 steps (Regular model), preventing the model from making large, destabilizing updates early in training.

After warmup, cosine decay smoothly reduces the learning rate following a cosine curve to 10% of the initial rate by the end of training. In case you are not familiar with Cosine decay, it is a scheduling strategy where the learning rate follows the shape of a cosine wave: starting at the maximum value after warmup, it decreases slowly at first, then more rapidly in the middle of training, and finally levels off gently near the minimum value.

Mathematically, this follows the curve lr = min_lr + (max_lr - min_lr) × 0.5 × (1 + cos(π × progress)), where progress goes from 0 (start of decay) to 1 (end of training). Unlike linear decay (which drops at a constant rate) or step decay (which drops abruptly at fixed intervals), cosine decay provides a smooth, natural reduction that helps the model explore the loss landscape more effectively early on, then refine its parameters more precisely as training progresses.

The initial learning rates are chosen based on model size: 3e-4 (0.0003) for the SLM model and 3e-5 (0.00003) for the Regular model. The 10x difference reflects the Regular model’s larger parameter count (354M vs 117M) - larger models typically need smaller learning rates to prevent gradient explosion. The cosine decay ensures the model converges smoothly rather than oscillating around the minimum, which is crucial for the complex patterns in historical text.

def get_lr(self, it):
    """Learning rate schedule"""
    warmup_iters = 500
    lr_decay_iters = self.max_iters
    min_lr = self.learning_rate * 0.1
    
    if it < warmup_iters:
        return self.learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (self.learning_rate - min_lr)
Listing 10: Learning Rate Scheduling Function

The code in Listing 10 shows how the learning rate schedule is implemented. The function takes the current iteration number and returns the appropriate learning rate based on whether we’re in the warmup phase (linear increase) or decay phase (cosine curve). The warmup_iters parameter controls the warmup duration, while min_lr sets the final learning rate to 10% of the initial value.

As a side note, in case you are curious why cosine decay specifically makes sense, read on. The cosine curve starts with a gentle slope, steepens through the middle of training, and flattens out near the end. This shape lets the model make larger, exploratory updates while the learning rate is still high - helping it escape poor regions of the loss landscape early on - and then settle into precise fine-tuning as the rate approaches its minimum.

The smooth, continuous nature of cosine decay prevents the learning rate from changing too abruptly, which could destabilize training. Given the historical text’s complex linguistic patterns, this gradual, adaptive approach helps the model learn both general language structures and specific historical vocabulary without getting stuck in suboptimal solutions.

4.1.4 Evaluation

We run evaluations at regular intervals, controlled by the eval_interval setting (500 steps for the SLM, 1000 for the Regular model), computing loss on both the train and validation sets with the estimate_loss() method. The different intervals reflect the models’ training complexity: the SLM trains faster and benefits from more frequent checks, while the Regular model’s longer runs can use less frequent evaluation.

The estimate_loss() function monitors training progress without disrupting the learning process. To ensure consistent measurements, it temporarily switches the model to evaluation mode (model.eval()). In this mode, dropout layers stop randomly dropping neurons (using the full network capacity), so the same input produces the same output every time, unlike in training mode, where dropout introduces randomness for regularization. (Our GPT uses LayerNorm rather than batch normalization, and LayerNorm behaves identically in both modes, so the switch matters mainly for dropout here.)

Rather than computing loss on the entire dataset (which would be too slow), estimate_loss() samples eval_iters random batches (default 100) from both training and validation sets. It computes the loss for each batch and returns the average, providing a representative estimate of model performance while remaining computationally efficient.

The evaluation process uses torch.no_grad() to disable gradient computation during validation. Gradients are the partial derivatives that tell us how to adjust each model parameter to reduce loss - they’re computed during the backward pass and stored for the optimizer. During evaluation, we don’t need gradients because we’re not updating weights; we’re just measuring performance.

Disabling gradient computation serves two critical purposes. First, it prevents memory leaks by not storing gradients for validation computations - without this, GPU memory would gradually increase during evaluation and eventually cause out-of-memory errors. Second, it ensures accurate loss measurement by preventing any accidental gradient updates during the evaluation phase. The no_grad() context manager is essential for maintaining training stability and memory efficiency.
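
Putting those pieces together, the estimate_loss pattern looks roughly like this (an illustrative sketch rather than the exact repo code):

import torch

@torch.no_grad()
def estimate_loss(model, get_batch, eval_iters=100, splits=("train", "val")):
    """Average loss over a few random batches per split, without touching gradients."""
    out = {}
    model.eval()   # deterministic forward passes: dropout disabled
    for split in splits:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean().item()
    model.train()  # switch back to training mode before resuming
    return out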

4.1.5 Forward Pass

A forward pass is when the model processes input data through its layers to produce a prediction. Think of it like asking the model a question: given a sequence of historical text tokens, “what word should come next?” The model flows the input forward through 12 transformer blocks (SLM) or 24 blocks (Regular), each applying self-attention (to understand relationships between words) and feed-forward operations (to transform and refine the representations). At the end, the model outputs a probability distribution over all possible next tokens.

The forward pass uses mixed precision training with torch.amp.autocast and bf16/fp16 data types, reducing memory usage by ~50% while maintaining training stability. Cross-entropy loss is computed by comparing the model’s predicted probabilities with the actual next tokens in the training data; it measures how “wrong” the model’s predictions are. The loss function handles variable sequence lengths by appropriately padding sequences. The mixed precision approach is particularly important for our historical text corpus, which contains long sequences that would otherwise exceed GPU memory limits.
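
As a toy illustration of the loss computation (the vocabulary size and shapes below are made up for the example):

import torch
import torch.nn.functional as F

vocab_size = 30000                              # illustrative, not our tokenizer's size
logits = torch.randn(2, 8, vocab_size)          # model scores: 2 sequences x 8 positions
targets = torch.randint(0, vocab_size, (2, 8))  # the actual next tokens

# Flatten positions and compare predicted distributions against the true next tokens
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(loss)  # ≈ ln(vocab_size) ≈ 10.3 for random predictions, which is why training starts near 10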

4.1.6 Backward Pass

After the forward pass tells us how wrong the model is (via the loss), the backward pass figures out how to fix it. Using loss.backward(), PyTorch computes gradients for every parameter in the model - these gradients tell us the direction and magnitude of changes needed to reduce the loss. It’s like having a GPS telling you which way to move and how far, but for 117 million (SLM) or 354 million (Regular) parameters simultaneously.

The system applies gradient clipping with torch.nn.utils.clip_grad_norm_ using a maximum norm of 1.0. Sometimes gradients can become extremely large, especially when processing complex or unusual historical text patterns. Without clipping, these huge gradients would cause the model parameters to jump wildly, potentially making the model unstable or causing it to “forget” what it learned. Clipping acts like a safety valve, limiting the maximum size of parameter updates to keep training stable. Put simply, gradient clipping caps the overall (global) gradient norm at a threshold; if it exceeds the limit, gradients are rescaled so the update stays bounded. In our early runs, omitting clipping occasionally produced NaN losses; keeping max_norm=1.0 eliminated those spikes.
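
Here is a tiny, self-contained demonstration of what clip_grad_norm_ does (toy numbers):

import torch

p = torch.nn.Parameter(torch.ones(4))
p.grad = torch.full((4,), 10.0)   # global gradient norm = sqrt(4 * 10^2) = 20

pre_clip_norm = torch.nn.utils.clip_grad_norm_([p], max_norm=1.0)
print(pre_clip_norm)   # tensor(20.) - the norm before clipping
print(p.grad.norm())   # ~1.0 - gradients rescaled so the global norm fits the cap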

After computing gradients, the system updates the model weights using the AdamW optimizer, which applies the gradients with momentum and adaptive learning rates for each parameter. The optimizer decouples weight decay (a regularization technique to prevent overfitting) from gradient updates, improving generalization. Finally, gradients are zeroed with optimizer.zero_grad(set_to_none=True) - this clears the gradient buffers before the next iteration, preventing them from accumulating across batches. The set_to_none=True option releases memory immediately rather than waiting for GC, improving memory efficiency.

4.1.7 Checkpointing

Checkpointing saves model state, optimizer state, iteration number, and best validation loss whenever validation performance improves, rather than at every evaluation. This selective saving strategy provides multiple benefits: it conserves disk space (our 354M-parameter model checkpoints are ~1.4GB each), reduces I/O overhead that can slow down training, and improves overall training time by 5-10% by eliminating redundant disk writes. The system maintains only the last 5 checkpoints (as configured in config.py), with PyTorch’s torch.save() using compression to ensure efficient storage while preserving all necessary training state for resuming. We’ll dive deeper into checkpointing strategies and implementation details in Section 6.

The training loop implements standard optimization practices, including dynamic learning rate scheduling, regular evaluation and checkpointing, and comprehensive logging to WandB (as detailed in Section 5). The system automatically saves checkpoints when validation loss improves, ensuring that the best model is always preserved. The learning rate schedule uses cosine decay with warmup, which is standard practice for transformer training.

📁 Full Implementation: See the training scripts in 04_training/ in the GitHub repository.

4.2 Model Initialization: Setting Up the Training Foundation

Before the training loop can begin, the system must properly initialize the model, optimizer, and training infrastructure. The init_model() method handles this setup, ensuring everything is configured correctly for efficient training.

4.2.1 Model Configuration and Creation

The initialization process starts by loading metadata from the tokenized data to ensure the model architecture matches the training data. The system reads vocabulary size, block size, and other parameters from the meta.pkl file created during data preparation, ensuring consistency between the model and the data it will be trained on.

The model configuration is built from the SLM parameters defined in config.py, including the number of layers (12), attention heads (12), embedding dimensions (768), and other architectural choices. This configuration is then used to create the SimpleGPT model instance, which inherits from PyTorch’s nn.Module and provides all the functionality we discussed in the architecture section.

4.2.2 Optimizer Setup and Configuration

The optimizer is the algorithm that actually updates the model’s parameters (weights and biases) during training. After the backward pass computes gradients (which tell us how to adjust each parameter), the optimizer applies those gradients to update the parameters and improve the model.

The system uses AdamW (Adam with Weight Decay), which is a popular optimizer for training transformers. AdamW combines the best of two approaches: Adam (which adapts the learning rate for each parameter individually, helping with convergence) and weight decay (a form of regularization that prevents overfitting by discouraging large parameter values).

However, not all parameters should be regularized the same way. The optimizer splits parameters into two groups for different weight decay:

  • 2D parameters (weight matrices): These are the main “learnable” parts of the model - the connections between neurons in different layers. These receive weight decay (value 0.1) to prevent them from growing too large, which helps prevent overfitting.
  • 1D parameters (biases): These are additive constants that help shift the model’s predictions. They don’t receive weight decay (value 0.0) because regularizing biases doesn’t help with overfitting and can actually hurt performance.

This two-group approach follows standard practices for transformer training and ensures the model generalizes well to unseen historical text.

Modern PyTorch supports “fused” optimizer operations, which combine multiple steps into a single, faster GPU kernel. Instead of executing separate operations (unscale gradients, update parameters, update optimizer state), fused AdamW performs all three in a single optimized GPU operation. This can provide 10-20% speedup on modern GPUs. The system automatically detects whether your PyTorch version supports fused operations and uses them when available, falling back to the standard implementation otherwise.

Concretely, we use AdamW with the following settings for this project: betas=(0.9, 0.95), weight_decay=0.1, and the learning rate provided by the scheduler (warmup + cosine decay). The AdamW eps parameter is left at the PyTorch default unless you change it in code. When available, the fused AdamW kernel is enabled automatically. See Listing 11 for the exact call in init_model().
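
For readers who want to see what that split looks like in code, here is a hedged sketch of a configure_optimizers-style helper; the exact implementation in SimpleGPT may differ in details:

import inspect
import torch

def configure_optimizers(model, weight_decay=0.1, learning_rate=3e-4, betas=(0.9, 0.95)):
    """Split parameters into decay (2D+) and no-decay (1D) groups, then build AdamW."""
    params = [p for p in model.parameters() if p.requires_grad]
    decay_params = [p for p in params if p.dim() >= 2]    # weight matrices, embeddings
    nodecay_params = [p for p in params if p.dim() < 2]   # biases, LayerNorm gains
    optim_groups = [
        {"params": decay_params, "weight_decay": weight_decay},
        {"params": nodecay_params, "weight_decay": 0.0},
    ]
    # Use the fused AdamW kernel when this PyTorch build exposes it (CUDA only)
    fused_ok = "fused" in inspect.signature(torch.optim.AdamW).parameters
    extra = {"fused": True} if (fused_ok and torch.cuda.is_available()) else {}
    return torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra)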

4.2.3 Model Compilation and Multi-GPU Setup

Model compilation with PyTorch’s torch.compile is similar to traditional code compilation, but with important differences. When you compile Python code (like using gcc for C), the compiler transforms the source code into optimized machine code once, which then runs faster. Similarly, torch.compile takes your model’s computation graph and optimizes it, but it does this at runtime rather than ahead of time.

The compilation process analyzes your model’s operations (matrix multiplications, attention layers, etc.) and generates optimized kernels tuned to your hardware. This includes operator fusion (combining multiple operations into single GPU kernels), memory layout optimization (arranging data for better cache usage), and kernel selection (choosing the fastest implementation for your specific GPU). The result is often 1.2-1.5x speedier training, but with an initial “warmup” cost: the first few forward/backward passes are slower while PyTorch analyzes the model and generates optimized code.

This differs from traditional compilation because the optimization happens dynamically based on actual input shapes and hardware capabilities, rather than being pre-computed. It’s more like a JIT compiler that specializes your model’s operations for the exact conditions it encounters during training.

For multi-GPU training, the model is wrapped with DistributedDataParallel (DDP), which enables parallel training across multiple GPUs. The DDP wrapper handles gradient synchronization and ensures that all GPUs work with identical model parameters throughout training.

def init_model(self):
    """Initialize the model"""
    logger.info("Initializing model...")
    
    # Load metadata from tokenized data
    meta_path = self.data_dir / "meta.pkl"
    meta_vocab_size = None
    if meta_path.exists():
        with open(meta_path, 'rb') as f:
            meta = pickle.load(f)
        meta_vocab_size = meta['vocab_size']
        logger.info(f"Found vocab_size = {meta_vocab_size}")
    
    # Create model configuration
    model_args = dict(
        n_layer=self.n_layer,        # 12 for SLM
        n_head=self.n_head,          # 12 for SLM  
        n_embd=self.n_embd,          # 768 for SLM
        block_size=self.block_size,  # 512 for SLM
        bias=self.bias,              # False
        vocab_size=meta_vocab_size,  # From tokenized data
        dropout=self.dropout         # 0.1
    )
    
    # Create and configure model
    gptconf = SimpleGPTConfig(**model_args)
    self.model = SimpleGPT(gptconf)
    self.model.to(self.device)
    
    # Initialize optimizer with proper parameter groups
    self.optimizer = self.model.configure_optimizers(
        weight_decay=0.1,
        learning_rate=self.learning_rate,
        betas=(0.9, 0.95),
        device_type='cuda' if 'cuda' in self.device else 'cpu'
    )
    
    # Compile model for performance
    if torch.cuda.is_available() and self.slm_config.get("enable_compile", True):
        logger.info("Compiling model...")
        self.model = torch.compile(self.model, mode='reduce-overhead')
    
    # Wrap with DDP for multi-GPU training
    if self.ddp:
        self.model = DDP(self.model, device_ids=[self.ddp_local_rank])
        param_count = self.model.module.get_num_params()
    else:
        param_count = self.model.get_num_params()
    
    logger.info(f"Model initialized with {param_count:,} parameters")
Listing 11: Model Initialization Process

While our model is a relatively simple toy example focused on a single domain (historical London text), proper initialization remains important to avoid common training issues. The vocabulary size must match our custom historical tokenizer, the sequence length needs to work with our tokenized data, and the model architecture should be appropriate for the text patterns we’re learning.

The initialization process ensures these basic requirements are met before training begins, preventing issues such as vocabulary mismatches or memory allocation problems that could lead to training failures. This careful setup was helpful during our development process, where we ran nearly 100 training experiments. Proper initialization helped us avoid some basic configuration errors and focus on the actual training challenges.

Reproducibility and random seeds: To make runs repeatable on the same hardware, we set a deterministic seed per process using torch.manual_seed(1337 + seed_offset), where seed_offset is the DDP rank (0 for single‑GPU). This gives consistent data shuffling and initialization across restarts while keeping each process distinct under DDP. Note that some CUDA kernels (and AMP/bf16) can introduce non‑determinism; for strict determinism, you may also configure PyTorch’s deterministic flags at the cost of performance.
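
For completeness, the relevant knobs look like this; the strict-determinism flags are optional and not enabled in our runs:

import torch

seed_offset = 0                        # DDP rank of this process; 0 on a single GPU
torch.manual_seed(1337 + seed_offset)

# Optional strict determinism - slower, and some ops may raise if they lack
# deterministic CUDA kernels, so we leave these off by default.
# torch.backends.cudnn.deterministic = True
# torch.backends.cudnn.benchmark = False
# torch.use_deterministic_algorithms(True)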

5. WandB Integration

Weights & Biases (WandB) is an experiment tracking and monitoring platform designed specifically for machine learning projects. Think of it as a “black box” for your training runs - it automatically records everything that happens during training so you can understand what worked, what didn’t, and why.

Training a language model is a long‑running experiment. Without live telemetry, we are flying blind and can’t tell whether learning is stable, whether hardware is saturated, or whether runs are comparable. WandB gives real‑time visibility, remote monitoring, and reproducibility. It records loss, learning rate, and perplexity over time; captures GPU utilization and iteration latency; logs configuration and artifacts; and lets you compare runs side‑by‑side to understand which settings worked.

The system includes WandB integration for experiment tracking and monitoring, with automatic configuration logging, real-time metric tracking (including loss, perplexity, and learning rate), model checkpoint integration, experiment comparison across different training runs, and resource monitoring (GPU utilization and memory usage). This integration helps track and compare different training runs, identify better configurations, and reproduce successful experiments.

Understanding WandB integration:

Listing 12 logs the signals you need to fly by instruments: loss and perplexity trends for learning, the LR schedule to confirm warmup/decay, and hardware utilization and iteration timing for throughput and stability. It’s not just logging - it’s how you compare runs and catch issues early.

This real-time monitoring lets us spot problems early, compare different training approaches, and ensure our historical language model is learning properly over days or weeks.

# Log to WandB - loss first for better mobile UI
if self.use_wandb:
    wandb.log({
        "train/loss": lossf,
        "train/lr": lr,
        "train/iter": iter_num,
        "train/mfu": running_mfu * 100 if running_mfu > 0 else 0,
        "train/dt_ms": dt * 1000,
    })
Listing 12: WandB Integration and Logging

The system logs training loss, learning rate, iteration number, model flops utilization (MFU), and training time per iteration. These metrics provide comprehensive insight into training progress, efficiency, and potential issues.

The most useful dials to watch are training loss (should steadily trend from ~8–10 toward ~2–4), MFU (a proxy for GPU efficiency - untuned runs often sit in the single digits, while mid‑20s is achievable with good tuning), the learning‑rate curve (warmup then cosine decay), and iteration time (a practical signal for throughput and stalls).

Both SLM and Regular model training runs complete 60,000 iterations, providing consistent training depth across both model variants. Figure 11 below shows the complete training experience for our Regular model (354M parameters), demonstrating both the console output and WandB’s comprehensive monitoring capabilities.

Complete training run output showing console logs and WandB summary
Figure 11: Complete training run output showing console logs and WandB monitoring for Regular model (354M parameters)

Whilst it might be obvious, the screenshot in Figure 11 captures the final moments of a successful 60,000-iteration training run, showing both the real-time console output and WandB’s comprehensive run summary. In this run, the logs reveal the training progression through the final iterations (59,850 to 60,000), with training loss steadily decreasing from 3.0575 to 2.7063, demonstrating healthy convergence.

The WandB run summary provides the complete picture: a final training loss of 2.70315, a validation loss of 3.61921, and a validation perplexity of 37.31, all indicating successful model training. The system automatically saved the final checkpoint and cleaned up old checkpoints, while WandB captured the entire training journey with detailed metrics tracking. This comprehensive monitoring approach ensures we can both track progress in real time and analyze the full training history afterward.
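
The perplexity figure follows directly from the validation loss: perplexity is simply the exponential of the cross-entropy loss, so you can sanity-check it yourself:

import math

val_loss = 3.61921           # from the WandB run summary in Figure 11
print(math.exp(val_loss))    # ≈ 37.31, matching the reported validation perplexity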

6. Checkpointing and Model Persistence

Checkpointing is one of the most critical aspects of training large language models, especially for historical text, where training can take days or weeks. A robust checkpointing system ensures that training progress is never lost due to hardware failures, power outages, or other interruptions. In this section, we’ll explore the comprehensive checkpointing system built for the helloLondon project, covering everything from basic checkpoint creation to advanced resume functionality.

6.1 Checkpoint System

The training system implements a practical checkpointing system that preserves all aspects of training state, ensuring that training can be resumed from exactly where it left off. This is particularly important for any complex model, where training can take a long time.

Each checkpoint packages four essentials: the model weights (so learning is preserved), the optimizer state (so momentum and adaptive stats resume cleanly), the current iteration (so schedules pick up in the right place), and the best validation loss to date (so we only promote genuinely better models). Together, these let you stop and restart without losing training dynamics.

The code in Listing 13 shows how these components are saved when validation loss improves:

if losses['val'] < best_val_loss:
    best_val_loss = losses['val']
    if iter_num > 0:
        checkpoint = {
            'model': raw_model.state_dict(),
            'optimizer': self.optimizer.state_dict(),
            'iter_num': iter_num,
            'best_val_loss': best_val_loss,
        }
        checkpoint_path = self.output_dir / f'checkpoint-{iter_num}.pt'
        logger.info(f"Saving checkpoint to {checkpoint_path}")
        torch.save(checkpoint, checkpoint_path)
        
        # Clean up old checkpoints - keep only the last 5 (see cleanup_old_checkpoints)
        self.cleanup_old_checkpoints()
Listing 13: Checkpointing and Model Persistence

The checkpointing system uses a simple yet effective approach: it saves checkpoints only when the validation loss improves, rather than at every evaluation. This approach serves multiple purposes. First, it ensures we’re always keeping the best-performing model, not just the most recent one. Second, it significantly reduces I/O overhead during training, as checkpoint saves can be expensive operations (our 354M parameter model checkpoints are ~1.4GB each). Third, it prevents disk space issues by avoiding the accumulation of suboptimal checkpoints. This selective checkpointing approach can improve overall training time by 5-10% by eliminating redundant disk writes.

6.2 Checkpoint Management and Cleanup

Since our 354M parameter model checkpoints are ~1.4GB each, we need to clean up old checkpoints to avoid running out of disk space. The system automatically keeps only the last 5 checkpoints and deletes older ones (as defined in config.py). The cleanup function in Listing 14 finds all checkpoint files, sorts them by modification time (newest first), and deletes everything except the most recent 5. Only the master process handles cleanup to avoid race conditions in multi-GPU setups.

def cleanup_old_checkpoints(self, keep_last=5):
    """Clean up old checkpoints, keeping only the last N"""
    if not self.master_process:
        return  # Only the master process should clean up
        
    try:
        # Find all checkpoint files
        checkpoint_files = list(self.output_dir.glob("checkpoint-*.pt"))
        
        if len(checkpoint_files) <= keep_last:
            return  # Not enough checkpoints to clean up
        
        # Sort by modification time (newest first)
        checkpoint_files.sort(key=lambda x: x.stat().st_mtime, reverse=True)
        
        # Keep the newest ones, delete the rest
        files_to_delete = checkpoint_files[keep_last:]
        
        for file_path in files_to_delete:
            try:
                file_path.unlink()
                logger.info(f"Deleted old checkpoint: {file_path.name}")
            except Exception as e:
                logger.warning(f"Failed to delete checkpoint {file_path.name}: {e}")
                
    except Exception as e:
        logger.warning(f"Checkpoint cleanup failed: {e}")
Listing 14: Checkpoint Cleanup and Management

6.3 Resume Training Functionality

The ability to resume training from any checkpoint is useful when training gets interrupted. This functionality lets you pick up where you left off, whether the interruption was a few minutes or longer.

The resume functionality loads a checkpoint file and restores the training state: the model weights, optimizer state, current iteration number, and best validation loss. If checkpoint loading fails, the code falls back to starting from scratch.

When loading checkpoints, the code handles two practical considerations. First, the map_location=self.device parameter ensures the checkpoint loads onto the correct device (CPU or GPU), which matters if you’re resuming on different hardware or after a restart. Second, for multi-GPU setups using DistributedDataParallel, the model is wrapped in a .module attribute, so the code uses raw_model = self.model.module if self.ddp else self.model to access the actual model underneath.

def resume_from_checkpoint_file(self):
    """Resume training from a checkpoint file"""
    if not self.resume_from_checkpoint:
        return
        
    checkpoint_path = Path(self.resume_from_checkpoint)
    if not checkpoint_path.exists():
        logger.error(f"Checkpoint file not found: {checkpoint_path}")
        return
        
    logger.info(f"Resuming from checkpoint: {checkpoint_path}")
    
    try:
        # Load checkpoint
        checkpoint = torch.load(checkpoint_path, map_location=self.device)
        
        # Load model state
        raw_model = self.model.module if self.ddp else self.model
        raw_model.load_state_dict(checkpoint['model'])
        logger.info("Model state loaded successfully")
        
        # Load optimizer state
        self.optimizer.load_state_dict(checkpoint['optimizer'])
        logger.info("Optimizer state loaded successfully")
        
        # Get iteration number and best validation loss
        self.start_iter = checkpoint.get('iter_num', 0)
        self.best_val_loss = checkpoint.get('best_val_loss', 1e9)
        
        logger.info(f"Resuming from iteration: {self.start_iter}")
        logger.info(f"Best validation loss so far: {self.best_val_loss:.4f}")
        
    except Exception as e:
        logger.error(f"Failed to load checkpoint: {e}")
        logger.info("Starting training from scratch...")
        self.start_iter = 0
        self.best_val_loss = 1e9
Listing 15: Resume Training from Checkpoint

The function loads the model weights, optimizer state, iteration number, and best validation loss from the checkpoint file, then continues training from where it left off. If the checkpoint file doesn’t exist or can’t be loaded, it logs an error and starts training from scratch. Since our Regular model takes 28-32 hours to train, resuming from a checkpoint saves significant time when training is interrupted by power outages, crashes, or manual stops.

7. Training Launch and Management

7.1 Multi-GPU Training with torchrun

For a single GPU, you can run the training script directly — a single Python process will use that device. To use multiple GPUs, launch training with torchrun, which spawns one worker process per GPU and lets the code initialize DistributedDataParallel (DDP). This enables larger effective batch sizes and faster wall‑clock training while keeping weights synchronized across devices; set --nproc_per_node to the number of GPUs you want to use (for example, --nproc_per_node=2).

torchrun is PyTorch’s recommended launcher for distributed training: it initializes the distributed backend and sets environment variables (RANK, LOCAL_RANK, WORLD_SIZE) to keep workers in sync. With torchrun --nproc_per_node=N (where N is the number of GPUs to use — it can be less than the total GPUs available), batches are sharded across the chosen GPUs and gradients are synchronized after each backward pass, which often gives near‑linear speedups on a small multi‑GPU node.

# Single GPU (even with multiple available)
python train_model_slm.py

# Multi-GPU with near-linear speedup
torchrun --nproc_per_node=2 train_model_slm.py

The training script (train_model_slm.py) handles DDP (DistributedDataParallel) setup for gradient synchronization and batch distribution across GPUs. Figure 12 below shows an example where we have dual GPUs and both are being used.

Multiple GPU used for training Screenshot
Figure 12: Multiple GPU used for training

Note that if you run python train_model_slm.py on a multi‑GPU machine, only one GPU is used; the others remain idle. To use more than one GPU, we must use torchrun.

7.2 Training Monitoring

Training is monitored locally via structured console logs and remotely via WandB. The snippet in Listing 16 records loss, learning rate, timing, and MFU at a configurable interval and, when enabled, streams the same metrics to WandB for side‑by‑side run comparison.

# Timing and logging
t1 = time.time()
dt = t1 - t0
t0 = t1
if iter_num % self.log_interval == 0 and self.master_process:
    lossf = loss.item()
    if local_iter_num >= 5:
        mfu = raw_model.estimate_mfu(self.batch_size, dt)
        running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
    logger.info(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
    
    # Log to WandB - loss first for better mobile UI
    if self.use_wandb:
        wandb.log({
            "train/loss": lossf,
            "train/lr": lr,
            "train/iter": iter_num,
            "train/mfu": running_mfu * 100 if running_mfu > 0 else 0,
            "train/dt_ms": dt * 1000,
        })
Listing 16: Training Monitoring and Logging

Together, console logs and WandB provide real‑time visibility and reproducible experiment tracking; Figure 13 below shows an example of the console logs; see Section 5 for setup and dashboards.

Console training logs showing iteration, loss, step time, and MFU with checkpoint saves
Figure 13: Console training logs: iteration, loss, step time, MFU, and checkpoint saves

8. Model File Formats and Conversion

Training produces PyTorch checkpoint files (.pt) that contain model weights, optimizer state, and training metadata — everything needed to resume training. These checkpoints are covered in detail in Section 6 .

For sharing models and standard deployment workflows, we convert PyTorch checkpoints into the Hugging Face repository format. This conversion creates a portable, standardized model package that can be loaded with standard Hugging Face APIs.

8.1 Converting PyTorch Checkpoints to Hugging Face Format

The Hugging Face repository format is a standardized directory structure containing:

  • config.json: Architecture definition (layers, heads, embedding dimensions, vocabulary size, sequence length). Allows AutoModelForCausalLM to reconstruct the model architecture without custom code.
  • model.safetensors: Model weights in SafeTensors format (memory-mapped, secure loading). Contains only model parameters, no optimizer state — suitable for inference workloads.
  • generation_config.json: Default text generation parameters (max_new_tokens, temperature, top_p, repetition_penalty). Can be overridden at runtime.
  • Tokenizer files (tokenizer.json, vocab.json, merges.txt, special_tokens_map.json, tokenizer_config.json): Serialized tokenizer with vocabulary, merge rules, normalization, and special tokens matching the training configuration.

The conversion code in Listing 17 loads a PyTorch checkpoint, extracts model weights and config, handles torch.compile naming prefixes if present, and saves the model and tokenizer in Hugging Face format.

import torch
from transformers import GPT2LMHeadModel

def convert_pytorch_to_huggingface(pytorch_checkpoint_path, output_dir, tokenizer):
    """Convert PyTorch checkpoint to Hugging Face format"""

    # Load PyTorch checkpoint; 'config' is the architecture config used to rebuild the model
    checkpoint = torch.load(pytorch_checkpoint_path, map_location='cpu')
    model_state = checkpoint['model']
    config = checkpoint['config']

    # Handle torch.compile prefixes: strip '_orig_mod.' only where it is present
    prefix = '_orig_mod.'
    if any(key.startswith(prefix) for key in model_state):
        model_state = {
            (key[len(prefix):] if key.startswith(prefix) else key): value
            for key, value in model_state.items()
        }

    # Convert to Hugging Face format
    hf_model = GPT2LMHeadModel(config)
    hf_model.load_state_dict(model_state)

    # Save in Hugging Face format (config.json, model.safetensors, tokenizer files)
    hf_model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
Listing 17: PyTorch → Hugging Face Conversion (essentials)

The conversion handles a few practical details. If the model was compiled with torch.compile, parameter names are prefixed with _orig_mod., which the code strips to match Hugging Face module names. GPT2LMHeadModel(config) instantiates a GPT-2-style architecture that matches the checkpoint’s layer structure, and load_state_dict() loads the weights with automatic shape validation. The save_pretrained() method writes all required files to disk.

File sizes: PyTorch checkpoints are ~450MB (SLM) and ~1.4GB (Regular model); the Hugging Face format reduces this slightly by excluding the optimizer state. The tokenizer adds ~15MB to the repository.

9. Inference Options

Inference can run directly from PyTorch checkpoints or from Hugging Face models. PyTorch checkpoints are convenient during development since you can test any training checkpoint without conversion. Hugging Face models use standard from_pretrained() APIs and are better suited for sharing and deployment workflows.

# Option 1: PyTorch checkpoint inference (direct from training)
python 06_inference/inference_pytorch.py \
  --checkpoint 09_models/checkpoints/slm/checkpoint-60001.pt \
  --prompt "In the year 1834, I walked through the streets of London and witnessed"

# Option 2: Hugging Face model inference (published models)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bahree/london-historical-slm")
model = AutoModelForCausalLM.from_pretrained("bahree/london-historical-slm")

inputs = tokenizer("In the year 1834, I walked through the streets...", return_tensors="pt")
outputs = model.generate(inputs['input_ids'], max_new_tokens=50, do_sample=True)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
Listing 18: Inference Options

Both methods load in seconds and generate ~50–100 tokens/sec on typical consumer GPUs (2–4GB VRAM for SLM, 6–8GB for the Regular model). Use PyTorch checkpoints for development and training comparisons; use Hugging Face models for production deployment and sharing. For interactive testing with published models, see Part 1 .

10. Summary

We built a training‑ready GPT pipeline for historical text, end‑to‑end: a clear decoder‑only architecture, pragmatic GPU/precision tuning, DDP for scale, resilient checkpointing/resume, WandB tracking, and clean hand‑off of artifacts (PyTorch checkpoints → Hugging Face export).

Outcome: two working models on the Part 2 corpus - 117M (SLM) and 354M (Regular) - ready for inference now and for evaluation/deployment in Part 4.

🔗 GitHub Repository: github.com/bahree/helloLondon - Complete training infrastructure (04_training/), model architecture (config.py), and GPU configuration (08_documentation/GPU_TUNING.md)

🤗 Published Models: SLM Model | Regular Model - Ready-to-use historical language models on HuggingFace

📚 Book Reference: Generative AI in Action - For deeper understanding of core LLM concepts.


Ready for Part 4? Part 4 covers model evaluation, testing, and deployment strategies that turn your trained models into working systems ready for real-world use.

References
  1. Vaswani et al. (2017) – Attention Is All You Need: https://arxiv.org/abs/1706.03762
  2. Radford et al. (2019) – Language Models are Unsupervised Multitask Learners: https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe
  3. Brown et al. (2020) – Language Models are Few-Shot Learners: https://arxiv.org/abs/2005.14165
  4. Kaplan et al. (2020) – Scaling Laws for Neural Language Models: https://arxiv.org/abs/2001.08361
  5. Hoffmann et al. (2022) – Training Compute-Optimal LLMs (Chinchilla): https://arxiv.org/abs/2203.15556
  6. Chowdhery et al. (2022) – PaLM: Scaling Language Modeling with Pathways: https://arxiv.org/abs/2204.02311
  7. Clark et al. (2019) – What Does BERT Look At?: https://arxiv.org/abs/1906.04341
  8. Voita et al. (2019) – Analyzing Multi‑Head Self‑Attention: https://arxiv.org/abs/1905.09418
  9. Dao et al. (2022) – FlashAttention: https://arxiv.org/abs/2205.14135
  10. Micikevicius et al. (2018) – Mixed Precision Training: https://arxiv.org/abs/1710.03740
  11. Rajbhandari et al. (2020) – ZeRO: https://arxiv.org/abs/1910.02054
  12. Paszke et al. (2019) – PyTorch: https://arxiv.org/abs/1912.01703
  13. Kingma & Ba (2014) – Adam: A Method for Stochastic Optimization: https://arxiv.org/abs/1412.6980
  14. Loshchilov & Hutter (2017) – AdamW: Decoupled Weight Decay Regularization: https://arxiv.org/abs/1711.05101
  15. Smith & Topin (2017) – Super‑Convergence: Very Fast Training of Neural Networks Using Large Learning Rates: https://arxiv.org/abs/1708.07120
  16. Goyal et al. (2017) – Accurate, Large Minibatch SGD: https://arxiv.org/abs/1706.02677
  17. Sergeev & Del Balso (2018) – Horovod: https://arxiv.org/abs/1802.05799
  18. Pope et al. (2022) – Efficiently Scaling Transformer Inference: https://arxiv.org/abs/2211.05102
  19. Jawahar et al. (2019) – What does BERT learn about the structure of language?: https://aclanthology.org/P19-1356.pdf
  20. Mikolov et al. (2013) – Word2vec: https://arxiv.org/abs/1301.3781
  21. Pennington et al. (2014) – GloVe: https://aclanthology.org/D14-1162/
  22. Devlin et al. (2018) – BERT: https://arxiv.org/abs/1810.04805
  23. Press & Wolf (2017) – Using the Output Embedding to Improve Language Models: https://arxiv.org/abs/1608.05859
  24. Inan et al. (2016) – Tying Word Vectors and Word Classifiers: https://arxiv.org/abs/1611.01462