<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>LLM on Amit Bahree&#39;s (useless?) insight!</title>
    <link>/tags/llm/</link>
    <description>Recent content in LLM on Amit Bahree&#39;s (useless?) insight!</description>
    <generator>Hugo -- 0.151.0</generator>
    <language>en-us</language>
    <lastBuildDate>Fri, 02 Jan 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="/tags/llm/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Building LLMs from Scratch - Part 4: Evaluation &amp; Deployment</title>
      <link>/post/2026/01/building-llm-from-scratch-part4-evaluation-deployment/</link>
      <pubDate>Fri, 02 Jan 2026 00:00:00 +0000</pubDate>
      <guid>/post/2026/01/building-llm-from-scratch-part4-evaluation-deployment/</guid>
      <description>Complete evaluation, testing, and deployment pipeline for historical language models. From model validation to Hugging Face publishing. Final part of 4-part series.</description>
      <content:encoded><![CDATA[<p><strong>TL;DR</strong></p>
<p>In this final part of our 4-part series on building language models from scratch, we explore the evaluation, testing, and deployment pipeline that transforms our trained historical language models into working systems. <a
	
		href = "/post/2025/09/building-llm-from-scratch-part1/"
	

	

	>
	
	<span>
		Part 1
	</span>
</a> showed you how to use the published models, <a
	
		href = "/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	>
	
	<span>
		Part 2
	</span>
</a> covered data collection and custom tokenization, and <a
	
		href = "/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/"
	

	

	>
	
	<span>
		Part 3
	</span>
</a> detailed the model architecture and training infrastructure. Here, we complete the journey with evaluation frameworks, testing infrastructure, and deployment to Hugging Face Hub.</p>
<blockquote>
<p><strong>⚠️ Educational Purpose</strong>: This is a learning project designed to teach LLM development concepts. For production-scale LLMs, you&rsquo;ll need much larger datasets, more sophisticated infrastructure, and additional considerations not covered here.</p></blockquote>
<p>As outlined in <a
	
		href = "/post/2025/09/building-llm-from-scratch-part1/"
	

	

	>
	
	<span>
		Part 1
	</span>
</a>, both the SLM (117M parameters) and the Regular Model (354M parameters) use the same training code and infrastructure with different configurations defined in <strong><code>config.py</code></strong>. The evaluation and deployment infrastructure is also identical - only the model architecture parameters differ.</p>
<p>Both PyTorch checkpoint inference and Hugging Face model inference are fully working and available. Both the SLM and the Regular model are published on <a
	
		href = "https://huggingface.co/bahree"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Hugging Face Hub
	</span>
</a>. Local PyTorch checkpoints can be used directly for inference with the script <strong><code>inference_pytorch.py</code></strong>.</p>
<blockquote>
<p><strong>🔗 GitHub Repository</strong>: <a
	
		href = "https://github.com/bahree/helloLondon"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		github.com/bahree/helloLondon
	</span>
</a> - Complete evaluation and deployment infrastructure (<strong><code>05_evaluation/</code></strong>, <strong><code>06_inference/</code></strong>, <strong><code>10_scripts/</code></strong>) plus guides (<strong><code>08_documentation/EVALUATION_GUIDE.md</code></strong>, <strong><code>08_documentation/HUGGINGFACE_PUBLISHING.md</code></strong>, <strong><code>08_documentation/DEPLOYMENT_GUIDE.md</code></strong>)</p>
<p><strong>🟥 Series Posts</strong>: <a
	
		href = "/post/2025/09/building-llm-from-scratch-part1/"
	

	

	>
	
	<span>
		Part 1 - Using the Published Historical Models
	</span>
</a> | <a
	
		href = "/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	>
	
	<span>
		Part 2 - Data Collection &amp; Custom Tokenizer
	</span>
</a> | <a
	
		href = "/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/"
	

	

	>
	
	<span>
		Part 3 - Training Architecture &amp; GPU Optimization
	</span>
</a> | Part 4 (this post)</p>
<p><strong>🟧 Published Models</strong>: <a
	
		href = "https://huggingface.co/bahree/london-historical-slm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		SLM Model
	</span>
</a> | <a
	
		href = "https://huggingface.co/bahree/london-historical-llm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Regular Model
	</span>
</a> - Ready-to-use historical language models on Hugging Face</p>
<p><strong>📗 Book Reference</strong>: <a
	
		href = "https://a.co/d/gr87rem"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Generative AI in Action
	</span>
</a> - For deeper understanding of core LLM concepts</p></blockquote>
<h2 id="1-the-evaluation-challenge-measuring-what-matters-for-historical-language-models">1. The Evaluation Challenge: Measuring What Matters for Historical Language Models</h2>
<p>Now that we have trained models from <a
	
		href = "/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/"
	

	

	>
	
	<span>
		Part 3
	</span>
</a>, we face a critical question: <em>How do we know if our models actually work?</em> This isn&rsquo;t just about checking if the code runs - it&rsquo;s about validating that the models can generate historically accurate, linguistically appropriate text that captures the essence of 1500-1850 London English.</p>
<p>The challenge with evaluating historical language models goes far beyond standard LLM metrics. Standard evaluation approaches like perplexity and BLEU scores (we explain these and other metrics in <a
	
		href = "#industry-standard-metrics"
	

	

	>
	
	<span>
		Section 2.1
	</span>
</a>) tell us whether the model generates fluent text, but they don&rsquo;t answer the questions that matter for historical applications: <em>Does the model avoid anachronisms? Can it distinguish between Tudor and Victorian language patterns? Does it understand London geography and historical context?</em></p>
<p>Consider a simple example: if we prompt the model with <em>&ldquo;In the year 1600, I traveled to London by railway&rdquo;</em>, a standard language model might generate this without flagging the obvious problem - railways didn&rsquo;t exist in 1600. The evaluation framework needs to catch these <strong>temporal inconsistencies</strong>, <strong>period-inappropriate language</strong>, and <strong>historical inaccuracies</strong> that standard metrics miss.</p>
<p>This evaluation challenge requires building a specialized assessment pipeline that understands historical context, temporal boundaries, and period-specific linguistic patterns. We need metrics that can distinguish between a model that generates fluent modern English and one that produces authentic historical text - two very different capabilities.</p>
<h3 id="11-high-level-evaluation-strategy">1.1 High-Level Evaluation Strategy</h3>
<p>Our evaluation framework provides two complementary approaches that work with both PyTorch checkpoints and Hugging Face models, as illustrated in <a href="#fig1" class="figure-ref">Figure 1</a> below.</p>
<figure class="align-center " id="fig1">
    <pre class="mermaid">graph TD
    A[🤖 Trained Models&lt;br/&gt;SLM 117M / Regular 354M] --&gt; B{Evaluation Type}
    
    B --&gt;|Quick| C[⚡ Quick Evaluation&lt;br/&gt;Historical accuracy, language quality, coherence]
    B --&gt;|Comprehensive| D[🔬 Comprehensive Evaluation&lt;br/&gt;Benchmarks, G-Eval, groundedness]
    
    C --&gt; E[📊 Evaluation Results&lt;br/&gt;Historical accuracy scores, metrics]
    D --&gt; E
    
    E --&gt; F{Quality OK?}
    F --&gt;|Yes| G[🚀 Deployment Options]
    F --&gt;|No| H[🔄 Retrain/Adjust]
    H --&gt; A
    
    G --&gt; I[📦 PyTorch Checkpoints&lt;br/&gt;Direct inference]
    G --&gt; J[🤗 Hugging Face Hub&lt;br/&gt;Published models]
    G --&gt; K[💻 Local Deployment&lt;br/&gt;API, CLI, notebooks]
    
    I --&gt; L[✅ Working Models&lt;br/&gt;Ready for use]
    J --&gt; L
    K --&gt; L
    
    style A fill:#e1f5fe
    style E fill:#f3e5f5
    style L fill:#e8f5e8
    style H fill:#fff3e0</pre>
    <figcaption>Figure 1: Complete Evaluation and Deployment Pipeline</figcaption>
</figure>
<p><strong>Quick Evaluation</strong> (<strong><code>quick_eval.py</code></strong>): Rapid validation that tests historical accuracy on key events (e.g., the 1665 plague and the 1666 fire), language quality metrics (vocabulary diversity, historical pattern detection, readability), and coherence (ROUGE scores). It runs in minutes without external APIs.</p>
<p><strong>Comprehensive Evaluation</strong> (<strong><code>comprehensive_evaluator.py</code></strong>): Extends quick evaluation with benchmark datasets (small <strong>MMLU</strong> and <strong>HellaSwag</strong> subsets), groundedness/fluency metrics, and optional LLM-as-a-judge scoring via <strong>G-Eval</strong> (using an external GPT model). Produces detailed reports with generation samples.</p>
<p>Both evaluators test across historical periods (such as Tudor, Stuart, and Georgian), language patterns (archaic pronouns and verb forms), and London-specific knowledge (geography and landmarks). The framework goes beyond standard LM metrics to assess period-appropriate language, temporal consistency, and historical accuracy.</p>
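<p>To make the keyword-based historical check concrete, here is a minimal sketch of how a prompt&rsquo;s output could be scored against a list of expected keywords. The function name <code>keyword_coverage</code> and the example keywords are assumptions made for illustration; they are not taken from <code>quick_eval.py</code>.</p>

```python
def keyword_coverage(generated_text, expected_keywords):
    """Fraction of expected keywords found in the text (case-insensitive).

    Hypothetical helper for illustration; not the project's implementation.
    """
    if not expected_keywords:
        return 0.0
    text = generated_text.lower()
    # Plain substring matching: simple, but "fire" would also match "fireplace".
    hits = [kw for kw in expected_keywords if kw.lower() in text]
    return len(hits) / len(expected_keywords)


if __name__ == "__main__":
    output = "The plague spread through London in 1665."
    print(keyword_coverage(output, ["plague", "1665", "fire"]))
```

<p>Substring matching keeps the check fast and dependency-free, which suits a quick evaluation, but it can over-count partial matches. Word-boundary matching would be a stricter alternative.</p>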
<h2 id="2-model-evaluation-framework">2. Model Evaluation Framework</h2>
<p>Now that we&rsquo;ve outlined the evaluation challenge, let&rsquo;s dive into the implementation. Both evaluators introduced in Section 1.1 work with PyTorch checkpoints and Hugging Face models, and the framework is designed to be practical for a learning project while still providing meaningful insights into model performance.</p>
<h3 id="21-historical-linguistic-and-category-specific-evaluation">2.1 Historical, Linguistic, and Category-Specific Evaluation</h3>
<p>To make the evaluation concrete, we look at the model from three complementary aspects that together capture how well it understands the period, writes fluent text, and handles the different slices of the corpus. This multi-dimensional approach ensures we catch various types of failures - a model might generate grammatically perfect text but fail historically, or vice versa.</p>
<ul>
<li><strong>Historical assessments</strong>: Quick evaluation uses targeted prompts around key events (e.g., 1665 plague, 1666 fire, Old Bailey trials) and checks for expected keywords and phrases. Comprehensive evaluation adds temporal consistency checks (forbidden/required terms per period), date-range sanity checks, and historical benchmarks (custom historical questions and the MMLU subset).</li>
<li><strong>Linguistic assessments</strong>: We measure surface quality (chars/words/sentences per sample, words per sentence), vocabulary diversity (unique/total tokens), readability (Flesch-style scores), and presence of historical patterns (archaic verb forms like <em>hath, doth</em>, pronouns like <em>thou, thee</em>, conjunctions and interjections). This shows whether the model writes in a historically flavored yet readable style.</li>
<li><strong>Category-specific benchmarks</strong>: Evaluations are grouped by period (Tudor, Stuart, Georgian), by linguistic phenomena (archaic forms, dialogue patterns), and by London knowledge (Thames, Westminster, Old Bailey, etc.). The comprehensive evaluator further probes general reasoning using HellaSwag and MMLU subsets to assess the model&rsquo;s performance across broader benchmarks.</li>
</ul>
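<p>Two of the linguistic measurements above, vocabulary diversity (unique/total tokens) and historical pattern detection, can be sketched in a few lines. This is an illustrative sketch only: the helper names and the short <code>ARCHAIC_PATTERNS</code> lists are assumptions built from the examples in the list above, and the project&rsquo;s own lists and tokenization may differ.</p>

```python
import re

# Hypothetical pattern lists based on the examples in the text;
# the project's own lists may be longer or organized differently.
ARCHAIC_PATTERNS = {
    "verb_forms": ["hath", "doth"],
    "pronouns": ["thou", "thee"],
}


def tokenize_words(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_diversity(text):
    """Return unique tokens divided by total tokens (0.0 for empty text)."""
    tokens = tokenize_words(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def count_historical_patterns(text):
    """Count how often each category of archaic word appears in the text."""
    tokens = tokenize_words(text)
    return {
        category: sum(tokens.count(word) for word in words)
        for category, words in ARCHAIC_PATTERNS.items()
    }


if __name__ == "__main__":
    sample = "Thou hath seen what thou doth fear"
    print(vocabulary_diversity(sample))
    print(count_historical_patterns(sample))
```

<p>Counting whole tokens rather than substrings avoids false hits such as <em>thee</em> inside <em>theełse</em>-style run-ons or <em>thou</em> inside <em>thousand</em>.</p>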
<blockquote>
<p><a id="industry-standard-metrics"></a><strong>Industry-Standard Evaluation Metrics and Benchmarks</strong></p>
<p>Our evaluation framework uses several standard metrics and benchmarks from LLM research. Here&rsquo;s what each one measures and why we include it:</p>
<ul>
<li><strong>Perplexity</strong>: How surprised the model is by the reference text; lower is better because it means the model assigns higher probability to what actually happened in the corpus.</li>
<li><strong>BLEU / ROUGE</strong>: N-gram overlap between generated and reference text, giving a rough sense of literal similarity and how closely the model &ldquo;sticks&rdquo; to the reference phrasing. We use <strong>ROUGE-L</strong> (longest common subsequence) as a rough proxy for coherence and narrative flow.</li>
<li><strong>MMLU</strong> (<em>Massive Multitask Language Understanding</em>): A large multiple-choice exam covering many academic subjects. Here, we use a tiny subset as a sanity check for general knowledge and reasoning, not as a primary goal.</li>
<li><strong>HellaSwag</strong>: A commonsense inference benchmark where the model must pick a plausible continuation for a short story-like context. We use it to see whether the model&rsquo;s basic reasoning looks sensible.</li>
<li><strong>G-Eval</strong>: An <em>LLM-as-a-judge</em> pattern where a stronger reference model (for example, GPT) scores generated text along dimensions like coherence or groundedness. In this project, it is optional and requires an external API key.</li>
<li><strong>Groundedness</strong>: Asks: <em>does the model stick to the provided context / known facts, or hallucinate?</em> Our implementation approximates this by comparing generations against reference answers and historical constraints.</li>
</ul>
<p>For a deeper treatment of evaluation benchmarks (including MMLU, HellaSwag, and LLM-as-a-judge methods like G-Eval), see <strong>Chapter 12 - Evaluating and Monitoring Generative Systems</strong> in the book 📘 <em><a
	
		href = "https://a.co/d/gr87rem"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Generative AI in Action
	</span>
</a></em>.</p></blockquote>
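<p>Two of these metrics reduce to short formulas. Perplexity is the exponential of the negative mean per-token log-probability, and ROUGE-L is an F-score built on the longest common subsequence (LCS) between generated and reference tokens. The sketch below shows both in plain Python. It assumes the per-token natural-log probabilities have already been obtained from the model, uses simple whitespace tokenization with no stemming, and uses hypothetical function names; it is not the project&rsquo;s <code>calculate_perplexity</code> or <code>calculate_rouge</code>.</p>

```python
import math


def perplexity_from_log_probs(token_log_probs):
    """Perplexity = exp of the negative mean per-token natural-log probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(generated, reference):
    """ROUGE-L F1 over lowercase whitespace tokens (no stemming)."""
    gen, ref = generated.lower().split(), reference.lower().split()
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(gen)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has perplexity 4.
    print(perplexity_from_log_probs([math.log(0.25)] * 4))
    print(rouge_l_f1("the fire burned london", "the great fire of london"))
```

<p>Because the LCS only requires tokens to appear in the same order, not contiguously, ROUGE-L rewards output that follows the reference&rsquo;s overall sequence even when individual words are inserted or dropped.</p>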
<h3 id="22-automated-evaluation-pipeline">2.2 Automated Evaluation Pipeline</h3>
<p>The <code>run_comprehensive_evaluation</code> function in <strong><code>05_evaluation/comprehensive_evaluator.py</code></strong> orchestrates the entire evaluation process. <a href="#listing1" class="listing-ref">Listing 1</a> shows how it works: we iterate over the test sets, generate text with the model, compute the metrics defined above, and aggregate them into a results dictionary for analysis.</p>
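<figure id="listing1"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">import</span> logging
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">import</span> numpy <span style="color:#c6a0f6">as</span> np
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Assumed setup: the metric helpers (generate_text, calculate_perplexity, ...) live elsewhere in this module</span>
</span></span><span style="display:flex;"><span>logger <span style="color:#91d7e3;font-weight:bold">=</span> logging<span style="color:#91d7e3;font-weight:bold">.</span>getLogger(__name__)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">run_comprehensive_evaluation</span>(model, tokenizer, test_data, device<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;cuda&#39;</span>):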
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Run comprehensive evaluation on historical language model&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Initialize evaluation metrics</span>
</span></span><span style="display:flex;"><span>    metrics <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;perplexity&#39;</span>: [],
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;bleu_scores&#39;</span>: [],
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;rouge_scores&#39;</span>: [],
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;historical_accuracy&#39;</span>: [],
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;linguistic_quality&#39;</span>: [],
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;coherence_scores&#39;</span>: [],
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;temporal_consistency&#39;</span>: []
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Evaluate on different text types</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> text_type, samples <span style="color:#91d7e3;font-weight:bold">in</span> test_data<span style="color:#91d7e3;font-weight:bold">.</span>items():
</span></span><span style="display:flex;"><span>        logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Evaluating on </span><span style="color:#a6da95">{</span>text_type<span style="color:#a6da95">}</span><span style="color:#a6da95"> samples...&#34;</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">for</span> sample <span style="color:#91d7e3;font-weight:bold">in</span> samples:
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Generate text</span>
</span></span><span style="display:flex;"><span>            generated <span style="color:#91d7e3;font-weight:bold">=</span> generate_text(model, tokenizer, sample[<span style="color:#a6da95">&#39;prompt&#39;</span>], device)
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Calculate metrics</span>
</span></span><span style="display:flex;"><span>            perplexity <span style="color:#91d7e3;font-weight:bold">=</span> calculate_perplexity(model, tokenizer, sample[<span style="color:#a6da95">&#39;text&#39;</span>], device)
</span></span><span style="display:flex;"><span>            bleu <span style="color:#91d7e3;font-weight:bold">=</span> calculate_bleu(generated, sample[<span style="color:#a6da95">&#39;reference&#39;</span>])
</span></span><span style="display:flex;"><span>            rouge <span style="color:#91d7e3;font-weight:bold">=</span> calculate_rouge(generated, sample[<span style="color:#a6da95">&#39;reference&#39;</span>])
</span></span><span style="display:flex;"><span>            hist_acc <span style="color:#91d7e3;font-weight:bold">=</span> assess_historical_accuracy(generated, sample[<span style="color:#a6da95">&#39;context&#39;</span>])
</span></span><span style="display:flex;"><span>            ling_qual <span style="color:#91d7e3;font-weight:bold">=</span> assess_linguistic_quality(generated)
</span></span><span style="display:flex;"><span>            coherence <span style="color:#91d7e3;font-weight:bold">=</span> assess_coherence(generated)
</span></span><span style="display:flex;"><span>            temp_cons <span style="color:#91d7e3;font-weight:bold">=</span> assess_temporal_consistency(generated, sample[<span style="color:#a6da95">&#39;time_period&#39;</span>])
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Store metrics</span>
</span></span><span style="display:flex;"><span>            metrics[<span style="color:#a6da95">&#39;perplexity&#39;</span>]<span style="color:#91d7e3;font-weight:bold">.</span>append(perplexity)
</span></span><span style="display:flex;"><span>            metrics[<span style="color:#a6da95">&#39;bleu_scores&#39;</span>]<span style="color:#91d7e3;font-weight:bold">.</span>append(bleu)
</span></span><span style="display:flex;"><span>            metrics[<span style="color:#a6da95">&#39;rouge_scores&#39;</span>]<span style="color:#91d7e3;font-weight:bold">.</span>append(rouge)
</span></span><span style="display:flex;"><span>            metrics[<span style="color:#a6da95">&#39;historical_accuracy&#39;</span>]<span style="color:#91d7e3;font-weight:bold">.</span>append(hist_acc)
</span></span><span style="display:flex;"><span>            metrics[<span style="color:#a6da95">&#39;linguistic_quality&#39;</span>]<span style="color:#91d7e3;font-weight:bold">.</span>append(ling_qual)
</span></span><span style="display:flex;"><span>            metrics[<span style="color:#a6da95">&#39;coherence_scores&#39;</span>]<span style="color:#91d7e3;font-weight:bold">.</span>append(coherence)
</span></span><span style="display:flex;"><span>            metrics[<span style="color:#a6da95">&#39;temporal_consistency&#39;</span>]<span style="color:#91d7e3;font-weight:bold">.</span>append(temp_cons)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Calculate aggregate metrics</span>
</span></span><span style="display:flex;"><span>    results <span style="color:#91d7e3;font-weight:bold">=</span> {}
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> metric_name, values <span style="color:#91d7e3;font-weight:bold">in</span> metrics<span style="color:#91d7e3;font-weight:bold">.</span>items():
</span></span><span style="display:flex;"><span>        results[metric_name] <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;mean&#39;</span>: np<span style="color:#91d7e3;font-weight:bold">.</span>mean(values),
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;std&#39;</span>: np<span style="color:#91d7e3;font-weight:bold">.</span>std(values),
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;min&#39;</span>: np<span style="color:#91d7e3;font-weight:bold">.</span>min(values),
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;max&#39;</span>: np<span style="color:#91d7e3;font-weight:bold">.</span>max(values),
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;median&#39;</span>: np<span style="color:#91d7e3;font-weight:bold">.</span>median(values)
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> results</span></span></code></pre></div><figcaption>
        <strong>Listing 1: Comprehensive Evaluation Pipeline</strong>
    </figcaption>
</figure>
<p>The pipeline computes all the metrics we outlined above: standard LM metrics such as perplexity and BLEU/ROUGE, plus our history-specific assessments of accuracy, linguistic quality, coherence, and temporal consistency. Each metric provides a different lens on model performance: perplexity measures how well the model predicts the sample text it is scored on, BLEU/ROUGE measure literal similarity to the reference text, and the custom metrics assess historical authenticity and linguistic appropriateness.</p>
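<p>Listing 1 reads several keys from every sample (<code>prompt</code>, <code>text</code>, <code>reference</code>, <code>context</code>, <code>time_period</code>), and the historical accuracy check reads further keys from <code>context</code> (<code>time_period</code>, <code>facts</code>, <code>location</code>, <code>social_class</code>). A small validation pass before the evaluation loop can surface a malformed sample early instead of failing with a <code>KeyError</code> midway through a run. The sketch below is an assumption about how that could look: the helper <code>missing_keys</code> is hypothetical, and the sample values are invented purely for illustration.</p>

```python
# Keys read from each sample in the evaluation loop.
REQUIRED_SAMPLE_KEYS = ("prompt", "text", "reference", "context", "time_period")
# Keys read from sample["context"] by the historical accuracy check.
REQUIRED_CONTEXT_KEYS = ("time_period", "facts", "location", "social_class")


def missing_keys(sample):
    """Return the keys a sample lacks; context keys are prefixed with 'context.'."""
    missing = [key for key in REQUIRED_SAMPLE_KEYS if key not in sample]
    context = sample.get("context", {})
    missing += [f"context.{key}" for key in REQUIRED_CONTEXT_KEYS if key not in context]
    return missing


# Illustrative sample only; values and the "chronicle" text type are invented.
sample = {
    "prompt": "In the year 1666, a great fire",
    "text": "In the year 1666, a great fire began in Pudding Lane.",
    "reference": "a great fire began in Pudding Lane.",
    "time_period": "1600-1700",
    "context": {
        "time_period": "1600-1700",
        "facts": ["fire of 1666"],
        "location": "London",
        "social_class": "merchant",
    },
}
test_data = {"chronicle": [sample]}

if __name__ == "__main__":
    for text_type, samples in test_data.items():
        for index, item in enumerate(samples):
            print(text_type, index, missing_keys(item))
```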
<p><strong>Why This Multi-Metric Approach Matters</strong></p>
<p>Standard language model evaluation often focuses on perplexity and n-gram overlap metrics, which measure general language quality but miss domain-specific requirements. For historical language models, we need to know not just whether the text is fluent, but whether it&rsquo;s historically accurate, temporally consistent, and linguistically appropriate for the target period. This multi-metric approach ensures we catch different types of failures - a model might generate grammatically perfect text but fail historically, or produce historically accurate content with poor linguistic quality.</p>
<p>The aggregation step (computing <code>mean</code>, <code>std</code>, <code>min</code>, <code>max</code>, and <code>median</code>) provides a statistical summary of model performance across different test cases. This summary helps identify whether the model performs consistently or has high variance, whether certain types of prompts cause failures, and how the model compares across different historical periods and linguistic phenomena.</p>
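<p>Comparing prompt types or periods requires keeping scores separated by group rather than pooled into one list per metric. The sketch below shows one way to do that using only the standard library. The helpers <code>summarize</code> and <code>summarize_by_group</code> are hypothetical names, and the scores are made-up numbers for illustration. It also returns <code>None</code> for an empty group instead of raising or producing <code>nan</code>.</p>

```python
import statistics


def summarize(values):
    """Return mean/std/min/max/median, or None when there are no values."""
    if not values:
        return None
    return {
        "mean": statistics.fmean(values),
        # Population standard deviation, matching np.std's default (ddof=0).
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
    }


def summarize_by_group(scores_by_group):
    """Summarize each group's scores separately so groups can be compared."""
    return {group: summarize(values) for group, values in scores_by_group.items()}


if __name__ == "__main__":
    # Made-up scores keyed by period, purely for illustration.
    scores = {"Tudor": [0.5, 1.0], "Stuart": [0.75], "Georgian": []}
    for group, summary in summarize_by_group(scores).items():
        print(group, summary)
```

<p>A group with a high <code>std</code> or a low <code>min</code> relative to its <code>mean</code> is a natural place to start inspecting individual generations.</p>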
<h3 id="23-historical-accuracy-assessment">2.3 Historical Accuracy Assessment</h3>
<p>Standard LLM evaluation metrics (perplexity, BLEU, ROUGE) measure general language quality, but they don&rsquo;t tell us whether the model generates historically accurate text for London between 1500 and 1850. To address this, we built custom evaluation tools that check period-appropriate language, temporal consistency, London-specific geography and landmarks, and historical fact accuracy. These tools are implemented in <strong><code>05_evaluation/comprehensive_evaluator.py</code></strong>, as shown in <a href="#listing2" class="listing-ref">Listing 2</a>:</p>
<figure id="listing2"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">assess_historical_accuracy</span>(generated_text, historical_context):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Assess the historical accuracy of generated text&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    accuracy_score <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0.0</span>
</span></span><span style="display:flex;"><span>    total_checks <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check temporal consistency</span>
</span></span><span style="display:flex;"><span>    temporal_score <span style="color:#91d7e3;font-weight:bold">=</span> check_temporal_consistency(generated_text, historical_context[<span style="color:#a6da95">&#39;time_period&#39;</span>])
</span></span><span style="display:flex;"><span>    accuracy_score <span style="color:#91d7e3;font-weight:bold">+=</span> temporal_score
</span></span><span style="display:flex;"><span>    total_checks <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check historical facts</span>
</span></span><span style="display:flex;"><span>    fact_score <span style="color:#91d7e3;font-weight:bold">=</span> check_historical_facts(generated_text, historical_context[<span style="color:#a6da95">&#39;facts&#39;</span>])
</span></span><span style="display:flex;"><span>    accuracy_score <span style="color:#91d7e3;font-weight:bold">+=</span> fact_score
</span></span><span style="display:flex;"><span>    total_checks <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check period-appropriate language</span>
</span></span><span style="display:flex;"><span>    language_score <span style="color:#91d7e3;font-weight:bold">=</span> check_period_language(generated_text, historical_context[<span style="color:#a6da95">&#39;time_period&#39;</span>])
</span></span><span style="display:flex;"><span>    accuracy_score <span style="color:#91d7e3;font-weight:bold">+=</span> language_score
</span></span><span style="display:flex;"><span>    total_checks <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check geographical accuracy</span>
</span></span><span style="display:flex;"><span>    geo_score <span style="color:#91d7e3;font-weight:bold">=</span> check_geographical_accuracy(generated_text, historical_context[<span style="color:#a6da95">&#39;location&#39;</span>])
</span></span><span style="display:flex;"><span>    accuracy_score <span style="color:#91d7e3;font-weight:bold">+=</span> geo_score
</span></span><span style="display:flex;"><span>    total_checks <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check social context accuracy</span>
</span></span><span style="display:flex;"><span>    social_score <span style="color:#91d7e3;font-weight:bold">=</span> check_social_context(generated_text, historical_context[<span style="color:#a6da95">&#39;social_class&#39;</span>])
</span></span><span style="display:flex;"><span>    accuracy_score <span style="color:#91d7e3;font-weight:bold">+=</span> social_score
</span></span><span style="display:flex;"><span>    total_checks <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> accuracy_score <span style="color:#91d7e3;font-weight:bold">/</span> total_checks
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">check_temporal_consistency</span>(text, time_period):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Check if text maintains temporal consistency with the specified period&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Define period-specific constraints</span>
</span></span><span style="display:flex;"><span>    period_constraints <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;1500-1600&#39;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;forbidden_terms&#39;</span>: [<span style="color:#a6da95">&#39;electricity&#39;</span>, <span style="color:#a6da95">&#39;steam engine&#39;</span>, <span style="color:#a6da95">&#39;railway&#39;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;required_terms&#39;</span>: [<span style="color:#a6da95">&#39;ye&#39;</span>, <span style="color:#a6da95">&#39;hath&#39;</span>, <span style="color:#a6da95">&#39;doth&#39;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;date_range&#39;</span>: (<span style="color:#f5a97f">1500</span>, <span style="color:#f5a97f">1600</span>)
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;1600-1700&#39;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;forbidden_terms&#39;</span>: [<span style="color:#a6da95">&#39;railway&#39;</span>, <span style="color:#a6da95">&#39;telegraph&#39;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;required_terms&#39;</span>: [<span style="color:#a6da95">&#39;hath&#39;</span>, <span style="color:#a6da95">&#39;doth&#39;</span>, <span style="color:#a6da95">&#39;verily&#39;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;date_range&#39;</span>: (<span style="color:#f5a97f">1600</span>, <span style="color:#f5a97f">1700</span>)
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;1700-1800&#39;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;forbidden_terms&#39;</span>: [<span style="color:#a6da95">&#39;telegraph&#39;</span>, <span style="color:#a6da95">&#39;telephone&#39;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;required_terms&#39;</span>: [<span style="color:#a6da95">&#39;hath&#39;</span>, <span style="color:#a6da95">&#39;doth&#39;</span>, <span style="color:#a6da95">&#39;indeed&#39;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;date_range&#39;</span>: (<span style="color:#f5a97f">1700</span>, <span style="color:#f5a97f">1800</span>)
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;1800-1850&#39;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;forbidden_terms&#39;</span>: [<span style="color:#a6da95">&#39;telephone&#39;</span>, <span style="color:#a6da95">&#39;automobile&#39;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;required_terms&#39;</span>: [<span style="color:#a6da95">&#39;indeed&#39;</span>, <span style="color:#a6da95">&#39;verily&#39;</span>, <span style="color:#a6da95">&#39;pray&#39;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;date_range&#39;</span>: (<span style="color:#f5a97f">1800</span>, <span style="color:#f5a97f">1850</span>)
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> time_period <span style="color:#91d7e3;font-weight:bold">not</span> <span style="color:#91d7e3;font-weight:bold">in</span> period_constraints:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> <span style="color:#f5a97f">0.5</span>  <span style="color:#6e738d;font-style:italic"># Neutral score for unknown periods</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    constraints <span style="color:#91d7e3;font-weight:bold">=</span> period_constraints[time_period]
</span></span><span style="display:flex;"><span>    score <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">1.0</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check for forbidden terms (anachronisms)</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> term <span style="color:#91d7e3;font-weight:bold">in</span> constraints[<span style="color:#a6da95">&#39;forbidden_terms&#39;</span>]:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> term<span style="color:#91d7e3;font-weight:bold">.</span>lower() <span style="color:#91d7e3;font-weight:bold">in</span> text<span style="color:#91d7e3;font-weight:bold">.</span>lower():
</span></span><span style="display:flex;"><span>            score <span style="color:#91d7e3;font-weight:bold">-=</span> <span style="color:#f5a97f">0.2</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check for required period-appropriate terms</span>
</span></span><span style="display:flex;"><span>    period_terms_found <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> term <span style="color:#91d7e3;font-weight:bold">in</span> constraints[<span style="color:#a6da95">&#39;required_terms&#39;</span>]:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> term<span style="color:#91d7e3;font-weight:bold">.</span>lower() <span style="color:#91d7e3;font-weight:bold">in</span> text<span style="color:#91d7e3;font-weight:bold">.</span>lower():
</span></span><span style="display:flex;"><span>            period_terms_found <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> constraints[<span style="color:#a6da95">&#39;required_terms&#39;</span>]:
</span></span><span style="display:flex;"><span>        score <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">0.3</span> <span style="color:#91d7e3;font-weight:bold">*</span> (period_terms_found <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#91d7e3">len</span>(constraints[<span style="color:#a6da95">&#39;required_terms&#39;</span>]))
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check date references</span>
</span></span><span style="display:flex;"><span>    date_score <span style="color:#91d7e3;font-weight:bold">=</span> check_date_references(text, constraints[<span style="color:#a6da95">&#39;date_range&#39;</span>])
</span></span><span style="display:flex;"><span>    score <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">0.2</span> <span style="color:#91d7e3;font-weight:bold">*</span> date_score
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> <span style="color:#91d7e3">max</span>(<span style="color:#f5a97f">0.0</span>, <span style="color:#91d7e3">min</span>(<span style="color:#f5a97f">1.0</span>, score))</span></span></code></pre></div><figcaption>
        <strong>Listing 2: Historical Accuracy Assessment</strong>
    </figcaption>
</figure>
<p>The forbidden terms (like &ldquo;electricity&rdquo; for 1500-1600, &ldquo;railway&rdquo; for 1600-1700) are anachronisms - technologies or concepts that didn&rsquo;t exist in those periods. We selected them based on historical timelines: electricity wasn&rsquo;t harnessed until the late 1700s, railways didn&rsquo;t appear until the early 1800s, and the electric telegraph followed in the 1830s and 1840s. The required terms (such as &ldquo;hath&rdquo;, &ldquo;doth&rdquo;, and &ldquo;verily&rdquo;) are archaic words we observed frequently in the training corpus for each period. Despite the name, they are not mandatory: their presence earns a bonus, and their absence carries no penalty.</p>
<p>We analyzed the corpus to identify which linguistic markers were most characteristic of each era, then selected a small set that would catch obvious anachronisms without being overly restrictive. This is a practical heuristic rather than an exhaustive historical grammar - we focus on high-impact anachronisms and common period markers that are easy to detect automatically.</p>
<p><strong>How the scoring works</strong></p>
<p>The <strong><code>check_temporal_consistency()</code></strong> function starts with a score of <code>1.0</code> and applies penalties and bonuses: each forbidden term found subtracts <code>0.2</code> (so finding &ldquo;railway&rdquo; in 1600-1700 text drops the score), required period-appropriate terms add up to <code>0.3</code> in proportion to how many are present, and date references within the period add up to <code>0.2</code>. The result is clamped to the range <code>0.0</code> to <code>1.0</code>. Because the score starts at the ceiling, the bonuses can only offset penalties; they never lift a clean text above <code>1.0</code>. Periods that are not in the table return a neutral <code>0.5</code>.</p>
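<p>To see how the penalties, bonuses, and clamping interact, the arithmetic can be restated on plain counts. This is a minimal sketch, not code from the repository: <code>temporal_score</code> is a hypothetical helper, and it assumes the date check returns a value between 0 and 1.</p>

```python
def temporal_score(forbidden_hits, required_found, required_total, date_score):
    """Restate the arithmetic of check_temporal_consistency() on plain counts.

    forbidden_hits: number of distinct forbidden terms found in the text
    date_score: result of the date check, assumed here to lie in [0, 1]
    """
    score = 1.0
    score -= 0.2 * forbidden_hits  # penalty per anachronism
    if required_total:
        score += 0.3 * (required_found / required_total)  # period-marker bonus
    score += 0.2 * date_score  # date-reference bonus
    return max(0.0, min(1.0, score))


# No anachronisms: the score is already at the ceiling, so bonuses change nothing.
clean = temporal_score(0, 3, 3, 1.0)
# Two anachronisms, no markers, no dates: 1.0 - 0.4
two_hits = temporal_score(2, 0, 3, 0.0)
# Two anachronisms offset by all markers and in-range dates, then clamped to 1.0
offset = temporal_score(2, 3, 3, 1.0)
print(clean, round(two_hits, 2), round(offset, 2))
```

<p>The third case is worth noticing when tuning the weights: enough period markers and in-range dates can fully cancel two anachronisms.</p>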
<p>The overall <strong><code>assess_historical_accuracy()</code></strong> function then averages the five component scores (temporal consistency, historical facts, period-appropriate language, geographical accuracy, and social context) to produce a single score between 0 and 1, with higher values indicating better historical accuracy. In practice (and yes, we are generalizing), scores above <code>0.7</code> indicate good historical consistency, while scores below <code>0.5</code> suggest significant anachronisms or factual errors.</p>
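<p>Listing 2 calls <code>check_date_references()</code> without showing its body. One plausible shape for such a helper is sketched below; this is a hypothetical stand-in, and the version in the repository may work differently. It assumes dates appear as four-digit years and that text with no years should earn no bonus.</p>

```python
import re


def check_date_references(text, date_range):
    """Hypothetical stand-in: share of four-digit years that fall inside date_range.

    Returns 0.0 when the text mentions no years, so undated text earns no bonus.
    """
    start, end = date_range
    # Match standalone years from 1000 to 2099.
    years = [int(y) for y in re.findall(r"\b(1[0-9]{3}|20[0-9]{2})\b", text)]
    if not years:
        return 0.0
    in_range = sum(1 for year in years if year in range(start, end + 1))
    return in_range / len(years)


sample = "London burned in 1666 and again, some say, in 1905."
print(check_date_references(sample, (1600, 1700)))
```

<p>Returning a fraction rather than a pass/fail flag keeps the result in the same 0 to 1 range as the other checks, so it can be weighted the same way.</p>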
<h3 id="24-linguistic-quality-evaluation">2.4 Linguistic Quality Evaluation</h3>
<p>While historical accuracy checks whether the model gets facts and period-appropriate terms right, linguistic quality measures how well the model writes - grammar, coherence, vocabulary diversity, sentence structure, and the presence of historical language patterns.</p>
<p>Standard metrics like BLEU and ROUGE don&rsquo;t capture whether the text reads naturally or uses appropriate archaic forms. We built custom checks that assess these dimensions, implemented in <strong><code>05_evaluation/comprehensive_evaluator.py</code></strong> and shown in <a href="#listing3" class="listing-ref">Listing 3</a>.</p>
<p>To make this easier to read, it helps to view the code as a scoring <em>scaffold</em> rather than a complete NLP system. Each <strong><code>check_*</code></strong> function is expected to return a normalized score in the range [0, 1] (higher is better), and <strong><code>assess_linguistic_quality()</code></strong> simply averages those components so you can track one headline number over time.</p>
<p>This mirrors patterns from earlier in the series: in <a
	
		href = "/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	>
	
	<span>
		Part 2
	</span>
</a> we used lightweight, automatable checks to validate data quality, and in <a
	
		href = "/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/"
	

	

	>
	
	<span>
		Part 3
	</span>
</a> we relied on simple, repeatable metrics to judge training health. Here, we do the same for generation quality: start with cheap checks that run everywhere, then iterate toward richer evaluators as needed.</p>
<p>Also note that the exact weights (0.3/0.2, etc.) are tunable. The main benefit is splitting &ldquo;linguistic quality&rdquo; into components you can inspect individually, so when output is bad, you can tell <em>why</em> (grammar-ish structure vs coherence vs vocabulary vs historically flavored patterns).</p>
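<figure id="listing3"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">import</span> nltk  <span style="color:#6e738d;font-style:italic"># sent_tokenize needs NLTK&#39;s Punkt tokenizer data</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">assess_linguistic_quality</span>(generated_text):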
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Assess the linguistic quality of generated historical text&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    quality_score <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0.0</span>
</span></span><span style="display:flex;"><span>    total_checks <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check grammatical correctness</span>
</span></span><span style="display:flex;"><span>    grammar_score <span style="color:#91d7e3;font-weight:bold">=</span> check_grammatical_correctness(generated_text)
</span></span><span style="display:flex;"><span>    quality_score <span style="color:#91d7e3;font-weight:bold">+=</span> grammar_score
</span></span><span style="display:flex;"><span>    total_checks <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check coherence and flow</span>
</span></span><span style="display:flex;"><span>    coherence_score <span style="color:#91d7e3;font-weight:bold">=</span> check_text_coherence(generated_text)
</span></span><span style="display:flex;"><span>    quality_score <span style="color:#91d7e3;font-weight:bold">+=</span> coherence_score
</span></span><span style="display:flex;"><span>    total_checks <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check vocabulary appropriateness</span>
</span></span><span style="display:flex;"><span>    vocab_score <span style="color:#91d7e3;font-weight:bold">=</span> check_vocabulary_appropriateness(generated_text)
</span></span><span style="display:flex;"><span>    quality_score <span style="color:#91d7e3;font-weight:bold">+=</span> vocab_score
</span></span><span style="display:flex;"><span>    total_checks <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check sentence structure variety</span>
</span></span><span style="display:flex;"><span>    structure_score <span style="color:#91d7e3;font-weight:bold">=</span> check_sentence_structure_variety(generated_text)
</span></span><span style="display:flex;"><span>    quality_score <span style="color:#91d7e3;font-weight:bold">+=</span> structure_score
</span></span><span style="display:flex;"><span>    total_checks <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check historical language patterns</span>
</span></span><span style="display:flex;"><span>    pattern_score <span style="color:#91d7e3;font-weight:bold">=</span> check_historical_language_patterns(generated_text)
</span></span><span style="display:flex;"><span>    quality_score <span style="color:#91d7e3;font-weight:bold">+=</span> pattern_score
</span></span><span style="display:flex;"><span>    total_checks <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> quality_score <span style="color:#91d7e3;font-weight:bold">/</span> total_checks
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">check_grammatical_correctness</span>(text):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Check grammatical correctness of generated text&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Parse text into sentences</span>
</span></span><span style="display:flex;"><span>    sentences <span style="color:#91d7e3;font-weight:bold">=</span> nltk<span style="color:#91d7e3;font-weight:bold">.</span>sent_tokenize(text)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3;font-weight:bold">not</span> sentences:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> <span style="color:#f5a97f">0.0</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    correct_sentences <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> sentence <span style="color:#91d7e3;font-weight:bold">in</span> sentences:
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Check for basic grammatical patterns</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> check_sentence_grammar(sentence):
</span></span><span style="display:flex;"><span>            correct_sentences <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> correct_sentences <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#91d7e3">len</span>(sentences)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">check_historical_language_patterns</span>(text):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Check if text follows appropriate historical language patterns&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    score <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0.0</span>
</span></span><span style="display:flex;"><span>    total_patterns <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check for appropriate use of historical verb forms</span>
</span></span><span style="display:flex;"><span>    historical_verbs <span style="color:#91d7e3;font-weight:bold">=</span> [<span style="color:#a6da95">&#39;hath&#39;</span>, <span style="color:#a6da95">&#39;doth&#39;</span>, <span style="color:#a6da95">&#39;dost&#39;</span>, <span style="color:#a6da95">&#39;art&#39;</span>, <span style="color:#a6da95">&#39;wilt&#39;</span>, <span style="color:#a6da95">&#39;shalt&#39;</span>]
</span></span><span style="display:flex;"><span>    verb_score <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> verb <span style="color:#91d7e3;font-weight:bold">in</span> historical_verbs:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> verb <span style="color:#91d7e3;font-weight:bold">in</span> text<span style="color:#91d7e3;font-weight:bold">.</span>lower():
</span></span><span style="display:flex;"><span>            verb_score <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> historical_verbs:
</span></span><span style="display:flex;"><span>        score <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">0.3</span> <span style="color:#91d7e3;font-weight:bold">*</span> (verb_score <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#91d7e3">len</span>(historical_verbs))
</span></span><span style="display:flex;"><span>    total_patterns <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check for appropriate use of historical pronouns</span>
</span></span><span style="display:flex;"><span>    historical_pronouns <span style="color:#91d7e3;font-weight:bold">=</span> [<span style="color:#a6da95">&#39;thou&#39;</span>, <span style="color:#a6da95">&#39;thee&#39;</span>, <span style="color:#a6da95">&#39;thy&#39;</span>, <span style="color:#a6da95">&#39;thine&#39;</span>, <span style="color:#a6da95">&#39;ye&#39;</span>]
</span></span><span style="display:flex;"><span>    pronoun_score <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> pronoun <span style="color:#91d7e3;font-weight:bold">in</span> historical_pronouns:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> pronoun <span style="color:#91d7e3;font-weight:bold">in</span> text<span style="color:#91d7e3;font-weight:bold">.</span>lower():
</span></span><span style="display:flex;"><span>            pronoun_score <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> historical_pronouns:
</span></span><span style="display:flex;"><span>        score <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">0.3</span> <span style="color:#91d7e3;font-weight:bold">*</span> (pronoun_score <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#91d7e3">len</span>(historical_pronouns))
</span></span><span style="display:flex;"><span>    total_patterns <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check for appropriate use of historical conjunctions</span>
</span></span><span style="display:flex;"><span>    historical_conjunctions <span style="color:#91d7e3;font-weight:bold">=</span> [<span style="color:#a6da95">&#39;whilst&#39;</span>, <span style="color:#a6da95">&#39;betwixt&#39;</span>, <span style="color:#a6da95">&#39;amongst&#39;</span>, <span style="color:#a6da95">&#39;ere&#39;</span>, <span style="color:#a6da95">&#39;anon&#39;</span>]
</span></span><span style="display:flex;"><span>    conj_score <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> conj <span style="color:#91d7e3;font-weight:bold">in</span> historical_conjunctions:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> conj <span style="color:#91d7e3;font-weight:bold">in</span> text<span style="color:#91d7e3;font-weight:bold">.</span>lower():
</span></span><span style="display:flex;"><span>            conj_score <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> historical_conjunctions:
</span></span><span style="display:flex;"><span>        score <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">0.2</span> <span style="color:#91d7e3;font-weight:bold">*</span> (conj_score <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#91d7e3">len</span>(historical_conjunctions))
</span></span><span style="display:flex;"><span>    total_patterns <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check for appropriate use of historical interjections</span>
</span></span><span style="display:flex;"><span>    historical_interjections <span style="color:#91d7e3;font-weight:bold">=</span> [<span style="color:#a6da95">&#39;verily&#39;</span>, <span style="color:#a6da95">&#39;indeed&#39;</span>, <span style="color:#a6da95">&#39;forsooth&#39;</span>, <span style="color:#a6da95">&#39;prithee&#39;</span>, <span style="color:#a6da95">&#39;marry&#39;</span>]
</span></span><span style="display:flex;"><span>    interj_score <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> interj <span style="color:#91d7e3;font-weight:bold">in</span> historical_interjections:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> interj <span style="color:#91d7e3;font-weight:bold">in</span> text<span style="color:#91d7e3;font-weight:bold">.</span>lower():
</span></span><span style="display:flex;"><span>            interj_score <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> historical_interjections:
</span></span><span style="display:flex;"><span>        score <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">0.2</span> <span style="color:#91d7e3;font-weight:bold">*</span> (interj_score <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#91d7e3">len</span>(historical_interjections))
</span></span><span style="display:flex;"><span>    total_patterns <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># The category weights (0.3 + 0.3 + 0.2 + 0.2) already sum to 1.0, so the weighted</span>
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># sum is already in [0, 1]; dividing by total_patterns would cap the score at 0.25</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> score</span></span></code></pre></div><figcaption>
        <strong>Listing 3: Linguistic Quality Evaluation</strong>
    </figcaption>
</figure>
<p><strong>About NLTK:</strong> We use <strong>NLTK</strong> (Natural Language Toolkit), a standard Python library for natural language processing, to handle text tokenization. If you followed <a
	
		href = "/post/2025/09/building-llm-from-scratch-part1/"
	

	

	>
	
	<span>
		Part 1
	</span>
</a>&rsquo;s setup instructions, NLTK was already installed as part of the data processing dependencies. In <code>check_grammatical_correctness()</code>, we use <code>nltk.sent_tokenize()</code> to split text into sentences so we can evaluate grammar sentence-by-sentence. NLTK also provides word tokenization (<code>word_tokenize</code>) and BLEU score calculation (<code>sentence_bleu</code>), which are used elsewhere in the evaluation pipeline.</p>
<p>We chose NLTK because it&rsquo;s well-established, handles edge cases (like abbreviations and historical punctuation), and provides reliable sentence boundaries even with archaic English patterns. The same qualities made it useful during data collection and cleaning (covered in <a
	
		href = "/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	>
	
	<span>
		Part 2
	</span>
</a>).</p>
<p>The historical language patterns we check (verbs like <em><strong>hath, doth</strong></em>, pronouns like <em><strong>thou, thee</strong></em>, conjunctions like <em><strong>whilst, betwixt</strong></em>, and interjections like <em><strong>verily, forsooth</strong></em>) are the same archaic forms we identified during corpus analysis for temporal consistency. The difference here is that we&rsquo;re measuring their presence as a positive signal of historical authenticity, rather than using them as required/forbidden constraints. Each pattern category (verbs, pronouns, conjunctions, interjections) contributes proportionally to the score based on how many patterns from that category appear in the text.</p>
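<p>One caveat with the checks in Listing 3: they test membership with <code>verb in text.lower()</code>, which is a substring match, so short markers such as <code>art</code>, <code>ye</code>, or <code>ere</code> also match inside modern words like &ldquo;party&rdquo;, &ldquo;yearly&rdquo;, or &ldquo;there&rdquo;. A minimal sketch of a whole-word variant follows. The names <code>words_in</code> and <code>weighted_pattern_score</code> are hypothetical, and the word splitting is deliberately crude (lowercase ASCII letters only).</p>

```python
import re

# Marker lists and weights copied from Listing 3.
CATEGORIES = {
    "verbs": (0.3, ["hath", "doth", "dost", "art", "wilt", "shalt"]),
    "pronouns": (0.3, ["thou", "thee", "thy", "thine", "ye"]),
    "conjunctions": (0.2, ["whilst", "betwixt", "amongst", "ere", "anon"]),
    "interjections": (0.2, ["verily", "indeed", "forsooth", "prithee", "marry"]),
}


def words_in(text):
    """Return the set of lowercase alphabetic words in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def weighted_pattern_score(text):
    """Hypothetical whole-word variant of check_historical_language_patterns()."""
    present = words_in(text)
    score = 0.0
    for weight, markers in CATEGORIES.values():
        found = sum(1 for marker in markers if marker in present)
        score += weight * (found / len(markers))
    return score


modern = "There were yearly parties at the start of the party season."
archaic = "Verily, thou hast spoken; he hath seen thee ere the dawn."
print(weighted_pattern_score(modern))   # 0.0: no marker appears as a whole word
print(weighted_pattern_score(archaic))  # roughly 0.25
```

<p>With substring matching, the modern sentence would be credited for <code>art</code>, <code>ye</code>, and <code>ere</code> even though it contains no archaic forms. Building the word set once also avoids lowercasing the text on every loop iteration.</p>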
<p><strong>How the scoring works</strong></p>
<p>The <strong><code>assess_linguistic_quality()</code></strong> function averages five component scores (<code>grammar</code>, <code>coherence</code>, <code>vocabulary appropriateness</code>, <code>sentence structure variety</code>, and <code>historical language patterns</code>) to produce a single score between <code>0</code> and <code>1</code>. Each component is evaluated independently and returns a score in the range <code>[0, 1]</code>.</p>
<p>For example, <strong><code>check_grammatical_correctness()</code></strong> computes the proportion of grammatically correct sentences, while <strong><code>check_historical_language_patterns()</code></strong> weights the presence of archaic verb forms (30%), pronouns (30%), conjunctions (20%), and interjections (20%) to produce a pattern score. The final linguistic quality score is the simple average of all five components. In practice, scores above <code>0.75</code> indicate strong linguistic quality with good grammar and historical flavor, while scores below <code>0.6</code> suggest the model struggles with either basic grammar or historical language patterns.</p>
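<p>The averaging step and the rough thresholds above can be sketched as follows. The component values are made up for illustration, <code>summarize_quality</code> is a hypothetical helper, and the &ldquo;borderline&rdquo; label for the gap between the two thresholds is an assumption of this sketch.</p>

```python
from statistics import mean


def summarize_quality(components):
    """Average component scores in [0, 1]; return (overall, label, weakest component)."""
    overall = mean(components.values())
    if overall > 0.75:
        label = "strong"
    elif overall >= 0.6:
        label = "borderline"  # assumed label for the range between the two thresholds
    else:
        label = "weak"
    weakest = min(components, key=components.get)
    return overall, label, weakest


# Illustrative scores only, not measured results.
scores = {
    "grammar": 0.9,
    "coherence": 0.8,
    "vocabulary": 0.7,
    "structure": 0.8,
    "patterns": 0.3,
}
overall, label, weakest = summarize_quality(scores)
print(round(overall, 2), label, weakest)
```

<p>Reporting the weakest component alongside the headline number makes it easier to see why a score is low, here a lack of historical language patterns rather than a grammar problem.</p>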
<h3 id="25-running-evaluations">2.5 Running Evaluations</h3>
<p>You can run the evaluators directly from the command line. The framework defaults to CPU for safety (so you can evaluate during training without GPU conflicts), but you can use <code>--device gpu</code> when the GPU is free for faster evaluation.</p>
<p><strong>Quick example:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Quick evaluation (runs in minutes, no external APIs)</span>
</span></span><span style="display:flex;"><span>python 05_evaluation/run_evaluation.py --mode quick --device cpu
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Comprehensive evaluation (includes benchmarks, optional G-Eval)</span>
</span></span><span style="display:flex;"><span>python 05_evaluation/run_evaluation.py --mode comprehensive --device cpu</span></span></code></pre></div>
<p>The unified launcher (<strong><code>run_evaluation.py</code></strong>) supports multiple modes: <code>setup</code> (install dependencies), <code>quick</code> (fast validation), <code>comprehensive</code> (full suite with benchmarks), <code>dataset</code> (generate test cases), and <code>all</code> (complete evaluation). You can also call <strong><code>quick_eval.py</code></strong> or <strong><code>comprehensive_evaluator.py</code></strong> directly if you need more control.</p>
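<p>If you want to script the launcher, for example from a notebook or a training callback, a small wrapper can validate the arguments before shelling out. This is a sketch under assumptions: <code>build_eval_command</code> is a hypothetical helper, the mode and device names are taken from the text above, and it assumes every mode accepts <code>--device</code>.</p>

```python
import sys

MODES = ("setup", "quick", "comprehensive", "dataset", "all")
DEVICES = ("cpu", "gpu")


def build_eval_command(mode="quick", device="cpu"):
    """Build the argv list for run_evaluation.py; raise ValueError on unknown values."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if device not in DEVICES:
        raise ValueError(f"unknown device: {device!r}")
    return [
        sys.executable,
        "05_evaluation/run_evaluation.py",
        "--mode", mode,
        "--device", device,
    ]


# Show the command without the interpreter path.
print(" ".join(build_eval_command("comprehensive", "cpu")[1:]))

# To actually run it from the repository root:
# import subprocess
# subprocess.run(build_eval_command("quick"), check=True)
```

<p>Passing the command as a list avoids shell quoting issues, and <code>check=True</code> turns a failed evaluation into an exception instead of a silent non-zero exit code.</p>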
<p><strong>Practical Evaluation Workflow:</strong></p>
<p>Our typical evaluation workflow follows this pattern:</p>
<ol>
<li><strong>After Training</strong>: Run a quick evaluation to get immediate feedback on model performance</li>
<li><strong>Before Publishing</strong>: Run a comprehensive evaluation to ensure the model meets quality standards</li>
<li><strong>During Development</strong>: Use interactive testing to explore model behavior on specific prompts</li>
<li><strong>For Research</strong>: Generate custom test datasets and run targeted evaluations</li>
</ol>
<p>Because evaluation runs on CPU by default, you can assess checkpoints continuously during training without competing for the GPU resources that training needs.</p>
<p>For complete usage examples, command-line options, and troubleshooting, see the <a
	
		href = "https://github.com/bahree/helloLondon/blob/main/08_documentation/EVALUATION_GUIDE.md"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Evaluation Guide
	</span>
</a> in the repository.</p>
<h2 id="3-comprehensive-testing-pipeline">3. Comprehensive Testing Pipeline</h2>
<h3 id="31-automated-testing-framework">3.1 Automated Testing Framework</h3>
<p>The <strong><code>06_testing</code></strong> package contains a parallel set of tests that double-check the full system. <a href="#listing4" class="listing-ref">Listing 4</a> captures the idea behind <strong><code>run_comprehensive_tests</code></strong>.</p>
<p>We group tests into basic functionality, historical accuracy, linguistic quality, performance, edge cases, and integration, then run them as a batch and emit a structured report. This mirrors how you would build a real CI test suite, but at a scale appropriate for this learning project.</p>
<figure id="listing4"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">run_comprehensive_tests</span>(model, tokenizer, device<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;cuda&#39;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Run comprehensive tests on historical language model&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    test_results <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;basic_functionality&#39;</span>: test_basic_functionality(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;historical_accuracy&#39;</span>: test_historical_accuracy(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;linguistic_quality&#39;</span>: test_linguistic_quality(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;performance_metrics&#39;</span>: test_performance_metrics(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;edge_cases&#39;</span>: test_edge_cases(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;integration_tests&#39;</span>: test_integration(model, tokenizer, device)
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Generate test report</span>
</span></span><span style="display:flex;"><span>    generate_test_report(test_results)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> test_results
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">test_basic_functionality</span>(model, tokenizer, device):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Test basic model functionality&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    tests <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;text_generation&#39;</span>: test_text_generation(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;tokenization&#39;</span>: test_tokenization(tokenizer),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;model_loading&#39;</span>: test_model_loading(model, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;memory_usage&#39;</span>: test_memory_usage(model, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;inference_speed&#39;</span>: test_inference_speed(model, tokenizer, device)
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> tests
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">test_historical_accuracy</span>(model, tokenizer, device):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Test historical accuracy of generated text&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    tests <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;temporal_consistency&#39;</span>: test_temporal_consistency(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;factual_accuracy&#39;</span>: test_factual_accuracy(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;period_appropriate_language&#39;</span>: test_period_language(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;geographical_accuracy&#39;</span>: test_geographical_accuracy(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;social_context_accuracy&#39;</span>: test_social_context(model, tokenizer, device)
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> tests</span></span></code></pre></div><figcaption>
        <strong>Listing 4: Comprehensive Testing Framework</strong>
    </figcaption>
</figure>
<p>Automated tests cover basics, historical accuracy, linguistic quality, performance, edge cases, and integration.</p>
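<p>Listing 4 calls <code>generate_test_report</code> without showing it. As a rough illustration of what a structured report could look like, the sketch below rolls the nested results up into per-category pass counts. It assumes every leaf in <code>test_results</code> is a plain boolean, which the real tests may not satisfy, and <code>summarize_results</code> and <code>build_test_report</code> are hypothetical names rather than functions from the repository.</p>

```python
import json


def summarize_results(test_results):
    """Collapse {category: {test_name: passed}} into per-category counts."""
    summary = {}
    for category, tests in test_results.items():
        failed = sorted(name for name, ok in tests.items() if not ok)
        summary[category] = {
            "passed": len(tests) - len(failed),
            "total": len(tests),
            "failed": failed,
        }
    return summary


def build_test_report(test_results):
    """Return a JSON report with overall totals plus the per-category summary."""
    summary = summarize_results(test_results)
    report = {
        "passed": sum(item["passed"] for item in summary.values()),
        "total": sum(item["total"] for item in summary.values()),
        "categories": summary,
    }
    return json.dumps(report, indent=2)
```

<p>Keeping the summary step separate from serialization means the same counts can feed a console printout, a JSON file, or a CI gate.</p>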
<h3 id="32-interactive-testing-and-validation">3.2 Interactive Testing and Validation</h3>
<p>For manual exploration, the interactive testing interface (conceptually similar to the CLI flows in <strong><code>06_inference/inference_unified.py</code></strong>) lets you type prompts, trigger specific test groups, and immediately inspect analysis for each generation. <a href="#listing5" class="listing-ref">Listing 5</a> shows a simple REPL loop that dispatches to the same evaluation helpers used in the automated tests.</p>
<figure id="listing5"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">nltk</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">interactive_testing</span>(model, tokenizer, device<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;cuda&#39;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Interactive testing interface for historical language model&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;Interactive Testing Mode&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;=&#34;</span> <span style="color:#91d7e3;font-weight:bold">*</span> <span style="color:#f5a97f">50</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;Enter prompts to test the model. Type &#39;quit&#39; to exit.&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;Available commands:&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;  - Enter any text prompt to generate continuation&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;  - &#39;test_historical&#39; - Run historical accuracy tests&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;  - &#39;test_linguistic&#39; - Run linguistic quality tests&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;  - &#39;test_performance&#39; - Run performance tests&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;  - &#39;quit&#39; - Exit testing mode&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>()
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">while</span> <span style="color:#f5a97f">True</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">try</span>:
</span></span><span style="display:flex;"><span>            prompt <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">input</span>(<span style="color:#a6da95">&#34;Enter prompt: &#34;</span>)<span style="color:#91d7e3;font-weight:bold">.</span>strip()
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">if</span> prompt<span style="color:#91d7e3;font-weight:bold">.</span>lower() <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#a6da95">&#39;quit&#39;</span>:
</span></span><span style="display:flex;"><span>                <span style="color:#c6a0f6">break</span>
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">elif</span> prompt<span style="color:#91d7e3;font-weight:bold">.</span>lower() <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#a6da95">&#39;test_historical&#39;</span>:
</span></span><span style="display:flex;"><span>                run_historical_tests(model, tokenizer, device)
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">elif</span> prompt<span style="color:#91d7e3;font-weight:bold">.</span>lower() <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#a6da95">&#39;test_linguistic&#39;</span>:
</span></span><span style="display:flex;"><span>                run_linguistic_tests(model, tokenizer, device)
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">elif</span> prompt<span style="color:#91d7e3;font-weight:bold">.</span>lower() <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#a6da95">&#39;test_performance&#39;</span>:
</span></span><span style="display:flex;"><span>                run_performance_tests(model, tokenizer, device)
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">elif</span> prompt:
</span></span><span style="display:flex;"><span>                <span style="color:#6e738d;font-style:italic"># Generate text</span>
</span></span><span style="display:flex;"><span>                generated <span style="color:#91d7e3;font-weight:bold">=</span> generate_text(model, tokenizer, prompt, device)
</span></span><span style="display:flex;"><span>                <span style="color:#91d7e3">print</span>(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Generated: </span><span style="color:#a6da95">{</span>generated<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>                <span style="color:#91d7e3">print</span>()
</span></span><span style="display:flex;"><span>                
</span></span><span style="display:flex;"><span>                <span style="color:#6e738d;font-style:italic"># Analyze generated text</span>
</span></span><span style="display:flex;"><span>                analysis <span style="color:#91d7e3;font-weight:bold">=</span> analyze_generated_text(generated, prompt)
</span></span><span style="display:flex;"><span>                <span style="color:#91d7e3">print</span>(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Analysis: </span><span style="color:#a6da95">{</span>analysis<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>                <span style="color:#91d7e3">print</span>()
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>                <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;Please enter a valid prompt or command.&#34;</span>)
</span></span><span style="display:flex;"><span>                
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">KeyboardInterrupt</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;</span><span style="color:#8aadf4">\n</span><span style="color:#a6da95">Exiting interactive testing mode...&#34;</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">break</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">Exception</span> <span style="color:#c6a0f6">as</span> e:
</span></span><span style="display:flex;"><span>            <span style="color:#91d7e3">print</span>(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Error: </span><span style="color:#a6da95">{</span>e<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;Please try again.&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">analyze_generated_text</span>(text, prompt):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Analyze generated text for quality and accuracy&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    analysis <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;length&#39;</span>: <span style="color:#91d7e3">len</span>(text),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;sentences&#39;</span>: <span style="color:#91d7e3">len</span>(nltk<span style="color:#91d7e3;font-weight:bold">.</span>sent_tokenize(text)),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;historical_accuracy&#39;</span>: assess_historical_accuracy(text, {}),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;linguistic_quality&#39;</span>: assess_linguistic_quality(text),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;coherence&#39;</span>: assess_coherence(text),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;relevance&#39;</span>: assess_relevance(text, prompt)
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> analysis</span></span></code></pre></div><figcaption>
        <strong>Listing 5: Interactive Testing Interface</strong>
    </figcaption>
</figure>
<p>Interactive mode lets you try prompts, run quick tests, and see immediate analysis.</p>
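<p>As the number of commands grows, the <code>if</code>/<code>elif</code> chain in Listing 5 can be replaced with a lookup table. The sketch below shows one way to do that. <code>make_dispatcher</code> is a hypothetical helper, and the lambdas are stand-ins for the real test runners and for <code>generate_text</code>, so that the example runs without a model.</p>

```python
def make_dispatcher(commands, generate):
    """Build a function that routes one line of REPL input.

    commands: dict mapping a command word to a zero-argument callable.
    generate: callable taking a prompt string, used for any other input.
    """
    def dispatch(line):
        text = line.strip()
        if not text:
            return "Please enter a valid prompt or command."
        handler = commands.get(text.lower())
        if handler is not None:
            return handler()
        # Anything that is not a known command is treated as a prompt.
        return generate(text)

    return dispatch


# Stand-in callables so the sketch runs without a model.
dispatch = make_dispatcher(
    commands={
        "test_historical": lambda: "ran historical tests",
        "test_linguistic": lambda: "ran linguistic tests",
        "test_performance": lambda: "ran performance tests",
    },
    generate=lambda prompt: f"Generated: {prompt} ...",
)
```

<p>Because <code>dispatch</code> takes a string and returns a string, it can be unit-tested without a terminal, and the loop around it is reduced to reading input and printing the result.</p>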
<h3 id="33-performance-benchmarking">3.3 Performance Benchmarking</h3>
<p>Performance benchmarking follows the same pattern: generate controlled workloads and measure speed and resource usage. <a href="#listing6" class="listing-ref">Listing 6</a> illustrates how we vary sequence length, measure average latency, and compute tokens-per-second, alongside separate helpers for memory, batch throughput, long-sequence handling, and basic concurrency.</p>
<figure id="listing6"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">time</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">benchmark_model_performance</span>(model, tokenizer, device<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;cuda&#39;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Benchmark model performance across different scenarios&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    benchmarks <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;inference_speed&#39;</span>: benchmark_inference_speed(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;memory_usage&#39;</span>: benchmark_memory_usage(model, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;batch_processing&#39;</span>: benchmark_batch_processing(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;long_sequence_handling&#39;</span>: benchmark_long_sequences(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;concurrent_requests&#39;</span>: benchmark_concurrent_requests(model, tokenizer, device)
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> benchmarks
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">benchmark_inference_speed</span>(model, tokenizer, device):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Benchmark inference speed for different sequence lengths&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    sequence_lengths <span style="color:#91d7e3;font-weight:bold">=</span> [<span style="color:#f5a97f">50</span>, <span style="color:#f5a97f">100</span>, <span style="color:#f5a97f">200</span>, <span style="color:#f5a97f">500</span>, <span style="color:#f5a97f">1000</span>]
</span></span><span style="display:flex;"><span>    results <span style="color:#91d7e3;font-weight:bold">=</span> {}
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> length <span style="color:#91d7e3;font-weight:bold">in</span> sequence_lengths:
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Generate test prompts of different lengths</span>
</span></span><span style="display:flex;"><span>        prompts <span style="color:#91d7e3;font-weight:bold">=</span> generate_test_prompts(length, num_prompts<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">100</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Measure inference time with a monotonic, high-resolution clock</span>
</span></span><span style="display:flex;"><span>        start_time <span style="color:#91d7e3;font-weight:bold">=</span> time<span style="color:#91d7e3;font-weight:bold">.</span>perf_counter()
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">for</span> prompt <span style="color:#91d7e3;font-weight:bold">in</span> prompts:
</span></span><span style="display:flex;"><span>            generate_text(model, tokenizer, prompt, device)
</span></span><span style="display:flex;"><span>        end_time <span style="color:#91d7e3;font-weight:bold">=</span> time<span style="color:#91d7e3;font-weight:bold">.</span>perf_counter()
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        total_time <span style="color:#91d7e3;font-weight:bold">=</span> end_time <span style="color:#91d7e3;font-weight:bold">-</span> start_time
</span></span><span style="display:flex;"><span>        avg_time_per_prompt <span style="color:#91d7e3;font-weight:bold">=</span> total_time <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#91d7e3">len</span>(prompts)
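</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Approximation: uses the prompt length; count generated tokens to measure true decode throughput</span>
</span></span><span style="display:flex;"><span>        tokens_per_second <span style="color:#91d7e3;font-weight:bold">=</span> length <span style="color:#91d7e3;font-weight:bold">/</span> avg_time_per_prompt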
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        results[length] <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;avg_time_per_prompt&#39;</span>: avg_time_per_prompt,
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;tokens_per_second&#39;</span>: tokens_per_second,
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;total_time&#39;</span>: total_time
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> results</span></span></code></pre></div><figcaption>
        <strong>Listing 6: Performance Benchmarking</strong>
    </figcaption>
</figure>
<p>Benchmarks capture inference speed, memory, batch throughput, long-sequence handling, and simple concurrency.</p>
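<p>Listing 6 derives tokens-per-second from the prompt length. To report decode throughput instead, one option is to count the tokens each call actually produces. The sketch below does that with injected callables so it runs without a model. <code>measure_throughput</code>, <code>generate_fn</code>, and <code>count_tokens</code> are hypothetical names; in practice <code>count_tokens</code> would wrap the tokenizer.</p>

```python
import time


def measure_throughput(generate_fn, prompts, count_tokens):
    """Time generate_fn over prompts and report latency and tokens per second.

    generate_fn: callable(prompt) returning only the newly generated text.
    count_tokens: callable(text) returning the number of tokens in that text.
    """
    if not prompts:
        raise ValueError("prompts must not be empty")

    generated_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        output = generate_fn(prompt)
        generated_tokens += count_tokens(output)
    elapsed = time.perf_counter() - start

    return {
        "avg_time_per_prompt": elapsed / len(prompts),
        # Guard against a zero reading on very fast stub functions.
        "tokens_per_second": generated_tokens / elapsed if elapsed > 0 else float("inf"),
        "generated_tokens": generated_tokens,
        "total_time": elapsed,
    }
```

<p>Note that token counting happens inside the timed region here, so a slow <code>count_tokens</code> will inflate the latency. For stricter measurements, collect the outputs first and count them after the clock stops.</p>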
<h2 id="4-model-deployment-and-publishing">4. Model Deployment and Publishing</h2>
<p>With evaluation and testing complete, we&rsquo;re ready to make our models available for use. This section covers the two deployment paths we support: direct inference from PyTorch checkpoints (useful during development and for maximum control) and publishing to Hugging Face Hub (for easy sharing and community access).</p>
<p>As called out in <a
	
		href = "/post/2025/09/building-llm-from-scratch-part1/"
	

	

	>
	
	<span>
		Part 1
	</span>
</a>, both the SLM (117M parameters) and the Regular Model (354M parameters) are fully trained and available. The SLM has already been published on <a
	
		href = "https://huggingface.co/bahree/london-historical-slm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Hugging Face Hub
	</span>
</a>, while the Regular Model is ready for publication. Both can also be run directly from local PyTorch checkpoints.</p>
<h3 id="41-two-paths-to-inference">4.1 Two Paths to Inference</h3>
<p>We provide two complementary ways to run inference, each suited to different use cases.</p>
<p><strong>PyTorch Checkpoint Inference</strong> gives you direct access to the trained model weights without any conversion overhead. This is ideal during development, when you want to test a freshly trained checkpoint, or when you need maximum control over the inference process. The checkpoints live in <strong><code>09_models/checkpoints/</code></strong>: the SLM at <strong><code>slm/checkpoint-4000.pt</code></strong> (117M parameters) and the Regular Model at <strong><code>checkpoint-60001.pt</code></strong> (354M parameters). The <strong><code>inference_pytorch.py</code></strong> script loads these directly, as shown in <a href="#listing7" class="listing-ref">Listing 7</a>.</p>
<figure id="listing7"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># SLM inference from checkpoint</span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_pytorch.py <span style="color:#8aadf4">\
</span></span></span><span style="display:flex;"><span><span style="color:#8aadf4"></span>  --checkpoint 09_models/checkpoints/slm/checkpoint-4000.pt <span style="color:#8aadf4">\
</span></span></span><span style="display:flex;"><span><span style="color:#8aadf4"></span>  --prompt <span style="color:#a6da95">&#34;In the year 1834, I walked through the streets of London and witnessed&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Regular model inference from checkpoint</span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_pytorch.py <span style="color:#8aadf4">\
</span></span></span><span style="display:flex;"><span><span style="color:#8aadf4"></span>  --checkpoint 09_models/checkpoints/checkpoint-60001.pt <span style="color:#8aadf4">\
</span></span></span><span style="display:flex;"><span><span style="color:#8aadf4"></span>  --prompt <span style="color:#a6da95">&#34;In the year 1834, I walked through the streets of London and witnessed&#34;</span></span></span></code></pre></div><figcaption>
        <strong>Listing 7: Running Inference from PyTorch Checkpoints</strong>
    </figcaption>
</figure>
<p><strong>Hugging Face Model Inference</strong> uses the published models on Hugging Face Hub, which means anyone can load and use them with just a few lines of code, with no need to download checkpoints or set up the full training environment. The <strong><code>inference_unified.py</code></strong> script provides a consistent interface for both published models and local checkpoints, as shown in <a href="#listing8" class="listing-ref">Listing 8</a>.</p>
<figure id="listing8"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Published model inference (downloads from Hugging Face Hub)</span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_unified.py <span style="color:#8aadf4">\
</span></span></span><span style="display:flex;"><span><span style="color:#8aadf4"></span>  --published <span style="color:#8aadf4">\
</span></span></span><span style="display:flex;"><span><span style="color:#8aadf4"></span>  --model_name bahree/london-historical-slm <span style="color:#8aadf4">\
</span></span></span><span style="display:flex;"><span><span style="color:#8aadf4"></span>  --prompt <span style="color:#a6da95">&#34;In the year 1834, I walked through the streets of London and witnessed&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Interactive mode for exploration</span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_unified.py --published --model_type slm --interactive
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Demo mode with curated historical prompts</span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_unified.py --published --model_type slm --demo</span></span></code></pre></div><figcaption>
        <strong>Listing 8: Hugging Face Model Inference</strong>
    </figcaption>
</figure>
<p>We&rsquo;ve tested both paths extensively. The published SLM loads in about 9 seconds on a GPU, generates text in under 6 seconds, and passes all 10 automated validation tests. The unified inference script provides clean logging, proper model detection, and accurate parameter counts - small details that make a big difference when debugging or demonstrating the models.</p>
<h3 id="42-publishing-to-hugging-face-hub">4.2 Publishing to Hugging Face Hub</h3>
<p>Publishing to Hugging Face Hub makes our models accessible to the broader community without requiring anyone to clone our repository or set up a training environment. The process involves converting our PyTorch checkpoints to the Hugging Face format, creating a model card with documentation, and uploading everything to the Hub.</p>
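<p>On the Hub, a model card is the repository&rsquo;s <code>README.md</code>: a YAML metadata block followed by Markdown. As a minimal sketch of that step, the function below assembles such a file from a name, a description, and a list of tags. <code>build_model_card</code> is a hypothetical helper rather than the repository&rsquo;s implementation, and the <code>library_name</code> and <code>pipeline_tag</code> values are assumptions about how these models are meant to be used.</p>

```python
def build_model_card(model_name, description, tags):
    """Return README.md text: a YAML metadata block followed by Markdown."""
    lines = [
        "---",
        # Assumed metadata values; adjust to match the actual model.
        "library_name: transformers",
        "pipeline_tag: text-generation",
    ]
    if tags:
        lines.append("tags:")
        lines.extend(f"- {tag}" for tag in tags)
    lines.extend(["---", "", f"# {model_name}", "", description.strip(), ""])
    return "\n".join(lines)
```

<p>The returned string can be written to <code>README.md</code> inside the folder that gets uploaded, so the card and the weights land in the same commit.</p>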
<p>The publishing workflow is handled by scripts in <strong><code>10_scripts/</code></strong> - specifically <strong><code>publish_slm_to_huggingface.py</code></strong> for the SLM and <strong><code>publish_to_huggingface.py</code></strong> for the Regular Model. <a href="#listing9" class="listing-ref">Listing 9</a> shows the core publishing flow: authenticate with the Hub, create (or reuse) a repository, save the model and tokenizer locally in Hugging Face format, upload the folder, and generate a model card.</p>
<figure id="listing9"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">publish_to_huggingface</span>(model, tokenizer, model_name, description, tags):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Publish model to Hugging Face Hub&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">from</span> <span style="color:#f5a97f">huggingface_hub</span> <span style="color:#8bd5ca">import</span> HfApi
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    api <span style="color:#91d7e3;font-weight:bold">=</span> HfApi()
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Create model repository</span>
</span></span><span style="display:flex;"><span>    repo_id <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;bahree/</span><span style="color:#a6da95">{</span>model_name<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>
</span></span><span style="display:flex;"><span>    api<span style="color:#91d7e3;font-weight:bold">.</span>create_repo(repo_id<span style="color:#91d7e3;font-weight:bold">=</span>repo_id, exist_ok<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Save model and tokenizer locally</span>
</span></span><span style="display:flex;"><span>    model<span style="color:#91d7e3;font-weight:bold">.</span>save_pretrained(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;./models/</span><span style="color:#a6da95">{</span>model_name<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>    tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>save_pretrained(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;./models/</span><span style="color:#a6da95">{</span>model_name<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Upload to Hub</span>
</span></span><span style="display:flex;"><span>    api<span style="color:#91d7e3;font-weight:bold">.</span>upload_folder(
</span></span><span style="display:flex;"><span>        folder_path<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;./models/</span><span style="color:#a6da95">{</span>model_name<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>,
</span></span><span style="display:flex;"><span>        repo_id<span style="color:#91d7e3;font-weight:bold">=</span>repo_id,
</span></span><span style="display:flex;"><span>        commit_message<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;Initial model upload&#34;</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    
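</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Generate and upload model card (README.md).</span>
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Note: upload_file reads a str as a local file path; pass bytes to upload in-memory text.</span>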
</span></span><span style="display:flex;"><span>    model_card <span style="color:#91d7e3;font-weight:bold">=</span> generate_model_card(model_name, description, tags)
</span></span><span style="display:flex;"><span>    api<span style="color:#91d7e3;font-weight:bold">.</span>upload_file(
</span></span><span style="display:flex;"><span>        path_or_fileobj<span style="color:#91d7e3;font-weight:bold">=</span>model_card,
</span></span><span style="display:flex;"><span>        path_in_repo<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;README.md&#34;</span>,
</span></span><span style="display:flex;"><span>        repo_id<span style="color:#91d7e3;font-weight:bold">=</span>repo_id
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> repo_id</span></span></code></pre></div><figcaption>
        <strong>Listing 9: Hugging Face Publishing</strong>
    </figcaption>
</figure>
<p>The <code>generate_model_card()</code> function creates the <strong><code>README.md</code></strong> that appears on the Hugging Face model page. This includes model description, architecture details, training data sources, usage examples, and limitations. You can see the live model cards at <a
	
		href = "https://huggingface.co/bahree/london-historical-slm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		bahree/london-historical-slm
	</span>
</a> and <a
	
		href = "https://huggingface.co/bahree/london-historical-llm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		bahree/london-historical-llm
	</span>
</a>.</p>
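<p>The repository&rsquo;s own <code>generate_model_card()</code> is not reproduced in this post, so the following is only a minimal sketch of what such a function might look like. The name <code>build_model_card</code>, the section headings, and the placeholder text are assumptions for illustration. The one firm convention it relies on is that the Hub reads metadata such as tags from a YAML block at the top of <code>README.md</code>.</p>

```python
def build_model_card(model_name, description, tags):
    """Return README.md text with YAML front matter (hypothetical sketch)."""
    # Metadata block that the Hub parses from the top of README.md.
    front_matter = ["---", "library_name: transformers", "tags:"]
    front_matter += [f"- {tag}" for tag in tags]
    front_matter.append("---")

    # Human-readable sections; the headings here are placeholders.
    body = [
        f"# {model_name}",
        "",
        description,
        "",
        "## Usage",
        "",
        "Load with `AutoTokenizer.from_pretrained` and "
        "`AutoModelForCausalLM.from_pretrained`.",
        "",
        "## Limitations",
        "",
        "Describe known limitations here.",
    ]
    return "\n".join(front_matter + body) + "\n"


card = build_model_card(
    "london-historical-slm",
    "A small language model trained on historical London texts.",
    ["historical", "text-generation"],
)
# Encoding to bytes lets upload_file treat the card as content, not as a path.
readme_bytes = card.encode("utf-8")
```

<p>Returning text and encoding it at the call site keeps the function easy to test without touching the file system; writing the card to disk and passing the path would work equally well.</p>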
<p><a href="#listing10" class="listing-ref">Listing 10</a> shows how to load and use the published models:</p>
<figure id="listing10"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#8bd5ca">from</span> <span style="color:#f5a97f">transformers</span> <span style="color:#8bd5ca">import</span> AutoTokenizer, AutoModelForCausalLM
</span></span><span style="display:flex;"><span><span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">torch</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Load the published model</span>
</span></span><span style="display:flex;"><span>model_name <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#34;bahree/london-historical-slm&#34;</span>  <span style="color:#6e738d;font-style:italic"># or &#34;bahree/london-historical-llm&#34;</span>
</span></span><span style="display:flex;"><span>tokenizer <span style="color:#91d7e3;font-weight:bold">=</span> AutoTokenizer<span style="color:#91d7e3;font-weight:bold">.</span>from_pretrained(model_name)
</span></span><span style="display:flex;"><span>model <span style="color:#91d7e3;font-weight:bold">=</span> AutoModelForCausalLM<span style="color:#91d7e3;font-weight:bold">.</span>from_pretrained(model_name)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Move to GPU if available</span>
</span></span><span style="display:flex;"><span>device <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#34;cuda&#34;</span> <span style="color:#c6a0f6">if</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>cuda<span style="color:#91d7e3;font-weight:bold">.</span>is_available() <span style="color:#c6a0f6">else</span> <span style="color:#a6da95">&#34;cpu&#34;</span>
</span></span><span style="display:flex;"><span>model <span style="color:#91d7e3;font-weight:bold">=</span> model<span style="color:#91d7e3;font-weight:bold">.</span>to(device)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Generate historical text</span>
</span></span><span style="display:flex;"><span>prompt <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#34;In the year 1834, I walked through the streets of London and witnessed&#34;</span>
</span></span><span style="display:flex;"><span>inputs <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer(prompt, return_tensors<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;pt&#34;</span>)<span style="color:#91d7e3;font-weight:bold">.</span>to(device)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>outputs <span style="color:#91d7e3;font-weight:bold">=</span> model<span style="color:#91d7e3;font-weight:bold">.</span>generate(
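</span></span><span style="display:flex;"><span>    inputs[<span style="color:#a6da95">&#34;input_ids&#34;</span>],
</span></span><span style="display:flex;"><span>    attention_mask<span style="color:#91d7e3;font-weight:bold">=</span>inputs[<span style="color:#a6da95">&#34;attention_mask&#34;</span>],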
</span></span><span style="display:flex;"><span>    max_new_tokens<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">50</span>,
</span></span><span style="display:flex;"><span>    do_sample<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>,
</span></span><span style="display:flex;"><span>    temperature<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">0.8</span>,
</span></span><span style="display:flex;"><span>    top_p<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">0.95</span>,
</span></span><span style="display:flex;"><span>    repetition_penalty<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">1.2</span>,
</span></span><span style="display:flex;"><span>    pad_token_id<span style="color:#91d7e3;font-weight:bold">=</span>tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>eos_token_id
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#91d7e3">print</span>(tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>decode(outputs[<span style="color:#f5a97f">0</span>], skip_special_tokens<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>))</span></span></code></pre></div><figcaption>
        <strong>Listing 10: Loading Models from Hugging Face Hub</strong>
    </figcaption>
</figure>
<h3 id="43-publishing-workflow">4.3 Publishing Workflow</h3>
<p>If you want to publish your own trained model to Hugging Face Hub, here&rsquo;s the workflow we followed:</p>
<ol>
<li>
<p><strong>Set up authentication</strong>: Install <code>huggingface_hub</code> and authenticate with a token that has Write permissions. You can generate tokens at <a
	
		href = "https://huggingface.co/settings/tokens"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		huggingface.co/settings/tokens
	</span>
</a>.</p>
</li>
<li>
<p><strong>Convert the checkpoint</strong>: PyTorch training checkpoints include optimizer states and training metadata that aren&rsquo;t needed for inference. The conversion scripts extract just the model weights and translate them to Hugging Face&rsquo;s naming conventions (covered in detail in Section 5).</p>
</li>
<li>
<p><strong>Prepare the tokenizer</strong>: Save the tokenizer files alongside the model. Our custom tokenizer with 30,000 tokens and 150+ historical special tokens needs to be converted to the <code>transformers</code> library format.</p>
</li>
<li>
<p><strong>Generate a model card</strong>: The <strong><code>README.md</code></strong> on your Hugging Face model page serves as documentation. Include model architecture details, training data sources, usage examples, evaluation results, and limitations. The scripts generate this automatically, but you should review and customize it.</p>
</li>
<li>
<p><strong>Upload and validate</strong>: Push everything to the Hub, then immediately test with <code>from_pretrained()</code> to ensure the published model loads and generates correctly.</p>
</li>
</ol>
<blockquote>
<p><strong>📝 Full documentation</strong>: See <strong><code>08_documentation/HUGGINGFACE_PUBLISHING.md</code></strong> and <strong><code>08_documentation/DEPLOYMENT_GUIDE.md</code></strong> in the repository for the complete step-by-step workflow with troubleshooting guidance.</p></blockquote>
<h2 id="5-pytorch-to-hugging-face-format-conversion">5. PyTorch to Hugging Face Format Conversion</h2>
<h3 id="51-why-format-conversion-is-necessary">5.1 Why Format Conversion is Necessary</h3>
<p>During training, our models are saved in PyTorch&rsquo;s native <code>.pt</code> format. These checkpoints include everything needed to resume training: model weights, optimizer states, learning rate schedules, and training metadata. However, for deployment and sharing, we need a leaner, inference-optimized format compatible with the broader machine learning ecosystem.</p>
<p>Think of it like the difference between a development environment and a production deployment: training checkpoints are like a developer&rsquo;s workspace with all the tools and intermediate files, while Hugging Face format is like a clean, standardized package that anyone can use without understanding the internal training details.</p>
<p>The Hugging Face Hub expects models to follow specific file structures, naming conventions, and metadata requirements. The conversion process extracts just the model weights (discarding optimizer states and training metadata), translates weight names to match Hugging Face conventions, creates proper configuration files, and ensures the tokenizer is compatible with the <code>transformers</code> library.</p>
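<p>To make the &ldquo;extract just the model weights&rdquo; step concrete, here is a minimal sketch. It assumes the training checkpoint is a dictionary with the weights stored under a <code>"model"</code> key next to optimizer state and metadata; that layout, the key names, and the function name are illustrative assumptions rather than the repository&rsquo;s actual format. A plain dictionary stands in for what <code>torch.load()</code> would return, so the example runs without PyTorch.</p>

```python
def extract_inference_weights(checkpoint, weights_key="model"):
    """Keep only the model weights from a training checkpoint dictionary."""
    if weights_key not in checkpoint:
        raise KeyError(
            f"checkpoint has no '{weights_key}' entry; found {sorted(checkpoint)}"
        )
    # Shallow copy so the caller's checkpoint is left untouched.
    return dict(checkpoint[weights_key])


# Stand-in for a loaded checkpoint; all values are placeholders.
checkpoint = {
    "model": {"transformer.wte.weight": [0.1, 0.2]},
    "optimizer": {"state": {}, "param_groups": []},
    "iter_num": 1000,
    "config": {"n_layer": 12},
}

weights = extract_inference_weights(checkpoint)
print(sorted(weights))  # only parameter names remain
```

<p>Everything outside the weights entry (optimizer state, iteration counters, training config) is simply left behind, which is why the published files are smaller than the training checkpoints.</p>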
<h3 id="52-the-conversion-process">5.2 The Conversion Process</h3>
<p>The conversion handles several transformations to bridge PyTorch and Hugging Face formats:</p>
<ul>
<li><strong>Weight name mapping</strong>: Checkpoint parameter names are mapped to the names Hugging Face&rsquo;s GPT-2 implementation expects, such as <code>transformer.h.0.attn.c_attn.weight</code>. For GPT-2 the names are mostly identical, so the mapping mainly has to handle edge cases carefully</li>
<li><strong>Automatic torch.compile handling</strong>: If you used <code>torch.compile()</code> during training, weights get prefixed with <code>_orig_mod.</code> - the conversion strips these prefixes</li>
<li><strong>Configuration translation</strong>: Model hyperparameters (n_layer, n_head, n_embd, etc.) are mapped to Hugging Face&rsquo;s <code>config.json</code> format</li>
<li><strong>Tokenizer conversion</strong>: Our custom 30,000-token vocabulary with 150+ historical special tokens is converted to <code>transformers</code> library format</li>
<li><strong>Validation</strong>: After conversion, we verify that the model loads correctly and produces expected outputs</li>
</ul>
<blockquote>
<p><strong>💻 Full Implementation</strong>: See <a
	
		href = "https://github.com/bahree/helloLondon/blob/main/10_scripts/publish_slm_to_huggingface.py"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		<strong><code>10_scripts/publish_slm_to_huggingface.py</code></strong>
	</span>
</a> for the complete conversion pipeline with error handling, validation, and model card generation.</p></blockquote>
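<p>Two of the transformations listed above are simple enough to sketch in a few lines: stripping the <code>_orig_mod.</code> prefix and translating hyperparameters into <code>config.json</code> fields. This is an illustration, not the code from the publishing script. The function names and the <code>block_size</code> key for context length are assumptions; the output field names follow the GPT-2 configuration in <code>transformers</code>.</p>

```python
COMPILE_PREFIX = "_orig_mod."


def strip_compile_prefix(state_dict):
    """Remove the prefix that torch.compile() adds to parameter names."""
    cleaned = {}
    for name, tensor in state_dict.items():
        if name.startswith(COMPILE_PREFIX):
            name = name[len(COMPILE_PREFIX):]
        cleaned[name] = tensor
    return cleaned


def to_hf_config(train_config):
    """Map training hyperparameters to GPT-2 style config.json fields."""
    return {
        "model_type": "gpt2",
        "architectures": ["GPT2LMHeadModel"],
        "vocab_size": train_config["vocab_size"],
        "n_positions": train_config["block_size"],  # assumed key for context length
        "n_layer": train_config["n_layer"],
        "n_head": train_config["n_head"],
        "n_embd": train_config["n_embd"],
    }


# Placeholder inputs; only the 30,000-token vocabulary comes from the article.
state = {
    "_orig_mod.transformer.wte.weight": "tensor-a",
    "transformer.ln_f.weight": "tensor-b",
}
train_config = {
    "vocab_size": 30000,
    "block_size": 512,
    "n_layer": 8,
    "n_head": 8,
    "n_embd": 512,
}

clean_state = strip_compile_prefix(state)
hf_config = to_hf_config(train_config)
print(sorted(clean_state))
print(hf_config["n_positions"])
```

<p>Names without the prefix pass through unchanged, so the same function is safe to run on checkpoints from both compiled and uncompiled training runs.</p>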
<h3 id="53-dependencies-for-hugging-face-integration">5.3 Dependencies for Hugging Face Integration</h3>
<p><a href="#listing11" class="listing-ref">Listing 11</a> lists the dependencies required for Hugging Face integration and shows the standard pattern for loading a published model and generating text with it.</p>
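<figure id="listing11"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">torch</span>
</span></span><span style="display:flex;"><span><span style="color:#8bd5ca">from</span> <span style="color:#f5a97f">transformers</span> <span style="color:#8bd5ca">import</span> AutoTokenizer, AutoModelForCausalLM
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Required dependencies for Hugging Face integration</span>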
</span></span><span style="display:flex;"><span>huggingface_dependencies <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;transformers&#34;</span>: <span style="color:#a6da95">&#34;&gt;=4.21.0&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;torch&#34;</span>: <span style="color:#a6da95">&#34;&gt;=1.12.0&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;tokenizers&#34;</span>: <span style="color:#a6da95">&#34;&gt;=0.12.0&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;safetensors&#34;</span>: <span style="color:#a6da95">&#34;&gt;=0.3.0&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;accelerate&#34;</span>: <span style="color:#a6da95">&#34;&gt;=0.20.0&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;huggingface_hub&#34;</span>: <span style="color:#a6da95">&#34;&gt;=0.10.0&#34;</span>
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Model loading and usage example</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">load_published_model</span>(model_name<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;bahree/london-historical-slm&#34;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Load published model from Hugging Face Hub&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Suppress warnings for cleaner output</span>
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">os</span>
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">warnings</span>
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">logging</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    os<span style="color:#91d7e3;font-weight:bold">.</span>environ[<span style="color:#a6da95">&#39;TRANSFORMERS_VERBOSITY&#39;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#39;error&#39;</span>
</span></span><span style="display:flex;"><span>    warnings<span style="color:#91d7e3;font-weight:bold">.</span>filterwarnings(<span style="color:#a6da95">&#39;ignore&#39;</span>)
</span></span><span style="display:flex;"><span>    logging<span style="color:#91d7e3;font-weight:bold">.</span>getLogger(<span style="color:#a6da95">&#34;transformers&#34;</span>)<span style="color:#91d7e3;font-weight:bold">.</span>setLevel(logging<span style="color:#91d7e3;font-weight:bold">.</span>ERROR)
</span></span><span style="display:flex;"><span>    os<span style="color:#91d7e3;font-weight:bold">.</span>environ[<span style="color:#a6da95">&#39;TOKENIZERS_PARALLELISM&#39;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#39;false&#39;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Load model and tokenizer</span>
</span></span><span style="display:flex;"><span>    tokenizer <span style="color:#91d7e3;font-weight:bold">=</span> AutoTokenizer<span style="color:#91d7e3;font-weight:bold">.</span>from_pretrained(model_name)
</span></span><span style="display:flex;"><span>    model <span style="color:#91d7e3;font-weight:bold">=</span> AutoModelForCausalLM<span style="color:#91d7e3;font-weight:bold">.</span>from_pretrained(model_name)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Set pad token if not set</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>pad_token <span style="color:#91d7e3;font-weight:bold">is</span> <span style="color:#f5a97f">None</span>:
</span></span><span style="display:flex;"><span>        tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>pad_token <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>eos_token
</span></span><span style="display:flex;"><span>        tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>pad_token_id <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>eos_token_id
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> model, tokenizer
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">generate_historical_text</span>(model, tokenizer, prompt, max_length<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">50</span>, temperature<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">0.3</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Generate historical text using the published model&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Tokenize input</span>
</span></span><span style="display:flex;"><span>    inputs <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>encode(prompt, return_tensors<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;pt&#34;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Generate text</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">with</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>no_grad():
</span></span><span style="display:flex;"><span>        outputs <span style="color:#91d7e3;font-weight:bold">=</span> model<span style="color:#91d7e3;font-weight:bold">.</span>generate(
</span></span><span style="display:flex;"><span>            inputs,
</span></span><span style="display:flex;"><span>            max_new_tokens<span style="color:#91d7e3;font-weight:bold">=</span>max_length,
</span></span><span style="display:flex;"><span>            do_sample<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>,
</span></span><span style="display:flex;"><span>            temperature<span style="color:#91d7e3;font-weight:bold">=</span>temperature,
</span></span><span style="display:flex;"><span>            top_p<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">0.9</span>,
</span></span><span style="display:flex;"><span>            top_k<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">20</span>,
</span></span><span style="display:flex;"><span>            repetition_penalty<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">1.2</span>,
</span></span><span style="display:flex;"><span>            no_repeat_ngram_size<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">3</span>,
</span></span><span style="display:flex;"><span>            pad_token_id<span style="color:#91d7e3;font-weight:bold">=</span>tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>pad_token_id,
</span></span><span style="display:flex;"><span>            eos_token_id<span style="color:#91d7e3;font-weight:bold">=</span>tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>eos_token_id,  <span style="color:#6e738d;font-style:italic"># early_stopping omitted: it only applies to beam search</span>
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Decode output</span>
</span></span><span style="display:flex;"><span>    generated_text <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>decode(outputs[<span style="color:#f5a97f">0</span>], skip_special_tokens<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> generated_text
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Example usage</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">if</span> <span style="color:#f4dbd6">__name__</span> <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#a6da95">&#34;__main__&#34;</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Load the published model</span>
</span></span><span style="display:flex;"><span>    model, tokenizer <span style="color:#91d7e3;font-weight:bold">=</span> load_published_model()
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test prompts</span>
</span></span><span style="display:flex;"><span>    test_prompts <span style="color:#91d7e3;font-weight:bold">=</span> [
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;In the year 1834, I walked through the streets of London and witnessed&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;The gentleman from the country said, &#39;we have never seen such a sight&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;The Thames flowed dark and mysterious through the heart&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;Parliament sat in Westminster Hall&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;The Great Fire of 1666 had destroyed&#34;</span>
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Generate text for each prompt</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> prompt <span style="color:#91d7e3;font-weight:bold">in</span> test_prompts:
</span></span><span style="display:flex;"><span>        generated <span style="color:#91d7e3;font-weight:bold">=</span> generate_historical_text(model, tokenizer, prompt)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">print</span>(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Prompt: </span><span style="color:#a6da95">{</span>prompt<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">print</span>(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Generated: </span><span style="color:#a6da95">{</span>generated<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;-&#34;</span> <span style="color:#91d7e3;font-weight:bold">*</span> <span style="color:#f5a97f">80</span>)</span></span></code></pre></div><figcaption>
        <strong>Listing 11: Hugging Face Dependencies and Model Usage</strong>
    </figcaption>
</figure>
<p>Hugging Face integration provides standard <code>from_pretrained()</code> loading and generation with minimal setup, making the models easy to share and reuse.</p>
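<p>The minimum versions in Listing 11 are only data until something checks them. As a rough sketch, the standard library&rsquo;s <code>importlib.metadata</code> can compare installed versions against those minimums. The helper names here are hypothetical, and the version parsing is deliberately simplified: it keeps the first three numeric components and ignores pre-release ordering, which the <code>packaging</code> library handles more rigorously.</p>

```python
import re
from importlib import metadata

# Minimum versions copied from Listing 11.
MINIMUM_VERSIONS = {
    "transformers": "4.21.0",
    "torch": "1.12.0",
    "tokenizers": "0.12.0",
    "safetensors": "0.3.0",
    "accelerate": "0.20.0",
    "huggingface_hub": "0.10.0",
}


def parse_version(text):
    """Turn '2.1.0+cu118' into (2, 1, 0), padding short versions with zeros."""
    parts = re.findall(r"\d+", text.split("+")[0])
    parts = (parts + ["0", "0", "0"])[:3]
    return tuple(int(p) for p in parts)


def check_dependencies(minimums):
    """Return {package: problem} for packages that are missing or too old."""
    problems = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if parse_version(installed) < parse_version(minimum):
            problems[package] = f"found {installed}, need >= {minimum}"
    return problems


for package, problem in check_dependencies(MINIMUM_VERSIONS).items():
    print(f"{package}: {problem}")
```

<p>Running a check like this before publishing or loading a model turns a confusing import error into a clear message about which package needs upgrading.</p>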
<h3 id="54-comprehensive-testing-and-validation-framework">5.4 Comprehensive Testing and Validation Framework</h3>
<p>Once a model is on the Hub, <strong><code>06_inference/test_published_models.py</code></strong> provides a concrete implementation of the testing pattern in <a href="#listing12" class="listing-ref">Listing 12</a>. It loads the model via <code>from_pretrained</code>, runs functional, historical, linguistic, and performance checks, and prints a human-readable summary so you can verify the published artefact behaves like your local checkpoints.</p>
<figure id="listing12"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">test_published_model</span>(model_name<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;bahree/london-historical-slm&#34;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Comprehensive testing of published model&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Testing published model: </span><span style="color:#a6da95">{</span>model_name<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Load model</span>
</span></span><span style="display:flex;"><span>    model, tokenizer <span style="color:#91d7e3;font-weight:bold">=</span> load_published_model(model_name)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test basic functionality</span>
</span></span><span style="display:flex;"><span>    basic_tests <span style="color:#91d7e3;font-weight:bold">=</span> test_basic_functionality(model, tokenizer)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test historical accuracy</span>
</span></span><span style="display:flex;"><span>    historical_tests <span style="color:#91d7e3;font-weight:bold">=</span> test_historical_accuracy(model, tokenizer)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test linguistic quality</span>
</span></span><span style="display:flex;"><span>    linguistic_tests <span style="color:#91d7e3;font-weight:bold">=</span> test_linguistic_quality(model, tokenizer)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test performance metrics</span>
</span></span><span style="display:flex;"><span>    performance_tests <span style="color:#91d7e3;font-weight:bold">=</span> test_performance_metrics(model, tokenizer)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Compile results</span>
</span></span><span style="display:flex;"><span>    results <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;basic_functionality&#34;</span>: basic_tests,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;historical_accuracy&#34;</span>: historical_tests,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;linguistic_quality&#34;</span>: linguistic_tests,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;performance_metrics&#34;</span>: performance_tests
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Print summary</span>
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;</span><span style="color:#8aadf4">\n</span><span style="color:#a6da95">Test Results Summary:&#34;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;=&#34;</span> <span style="color:#91d7e3;font-weight:bold">*</span> <span style="color:#f5a97f">50</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> category, tests <span style="color:#91d7e3;font-weight:bold">in</span> results<span style="color:#91d7e3;font-weight:bold">.</span>items():
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">print</span>(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;</span><span style="color:#8aadf4">\n</span><span style="color:#a6da95">{</span>category<span style="color:#91d7e3;font-weight:bold">.</span>replace(<span style="color:#a6da95">&#39;_&#39;</span>, <span style="color:#a6da95">&#39; &#39;</span>)<span style="color:#91d7e3;font-weight:bold">.</span>title()<span style="color:#a6da95">}</span><span style="color:#a6da95">:&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">for</span> test_name, result <span style="color:#91d7e3;font-weight:bold">in</span> tests<span style="color:#91d7e3;font-weight:bold">.</span>items():
</span></span><span style="display:flex;"><span>            status <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#34;PASS&#34;</span> <span style="color:#c6a0f6">if</span> result <span style="color:#c6a0f6">else</span> <span style="color:#a6da95">&#34;FAIL&#34;</span>
</span></span><span style="display:flex;"><span>            <span style="color:#91d7e3">print</span>(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;  </span><span style="color:#a6da95">{</span>test_name<span style="color:#a6da95">}</span><span style="color:#a6da95">: </span><span style="color:#a6da95">{</span>status<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> results
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">test_basic_functionality</span>(model, tokenizer):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Test basic model functionality&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    tests <span style="color:#91d7e3;font-weight:bold">=</span> {}
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test model loading</span>
</span></span><span style="display:flex;"><span>    tests[<span style="color:#a6da95">&#34;model_loading&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> model <span style="color:#91d7e3;font-weight:bold">is</span> <span style="color:#91d7e3;font-weight:bold">not</span> <span style="color:#f5a97f">None</span> <span style="color:#91d7e3;font-weight:bold">and</span> tokenizer <span style="color:#91d7e3;font-weight:bold">is</span> <span style="color:#91d7e3;font-weight:bold">not</span> <span style="color:#f5a97f">None</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test tokenizer functionality</span>
</span></span><span style="display:flex;"><span>    test_text <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#34;In the year 1834, London was&#34;</span>
</span></span><span style="display:flex;"><span>    tokens <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>encode(test_text)
</span></span><span style="display:flex;"><span>    decoded <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>decode(tokens)
</span></span><span style="display:flex;"><span>    tests[<span style="color:#a6da95">&#34;tokenizer_encode_decode&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> test_text <span style="color:#91d7e3;font-weight:bold">in</span> decoded
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test model generation</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">try</span>:
</span></span><span style="display:flex;"><span>        inputs <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>encode(test_text, return_tensors<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;pt&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">with</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>no_grad():
</span></span><span style="display:flex;"><span>            outputs <span style="color:#91d7e3;font-weight:bold">=</span> model<span style="color:#91d7e3;font-weight:bold">.</span>generate(inputs, max_new_tokens<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">10</span>, do_sample<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">False</span>)
</span></span><span style="display:flex;"><span>        generated <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>decode(outputs[<span style="color:#f5a97f">0</span>], skip_special_tokens<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>)
</span></span><span style="display:flex;"><span>        tests[<span style="color:#a6da95">&#34;model_generation&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">len</span>(generated) <span style="color:#91d7e3;font-weight:bold">&gt;</span> <span style="color:#91d7e3">len</span>(test_text)
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">Exception</span> <span style="color:#c6a0f6">as</span> e:
</span></span><span style="display:flex;"><span>        tests[<span style="color:#a6da95">&#34;model_generation&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">False</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test special tokens</span>
</span></span><span style="display:flex;"><span>    special_tokens <span style="color:#91d7e3;font-weight:bold">=</span> [<span style="color:#a6da95">&#34;&lt;|london|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|thou|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|hath|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|doth|&gt;&#34;</span>]
</span></span><span style="display:flex;"><span>    special_token_tests <span style="color:#91d7e3;font-weight:bold">=</span> []
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> token <span style="color:#91d7e3;font-weight:bold">in</span> special_tokens:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> token <span style="color:#91d7e3;font-weight:bold">in</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>get_vocab():
</span></span><span style="display:flex;"><span>            special_token_tests<span style="color:#91d7e3;font-weight:bold">.</span>append(<span style="color:#f5a97f">True</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>            special_token_tests<span style="color:#91d7e3;font-weight:bold">.</span>append(<span style="color:#f5a97f">False</span>)
</span></span><span style="display:flex;"><span>    tests[<span style="color:#a6da95">&#34;special_tokens&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">any</span>(special_token_tests)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> tests
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">test_historical_accuracy</span>(model, tokenizer):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Test historical accuracy of generated text&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    tests <span style="color:#91d7e3;font-weight:bold">=</span> {}
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test prompts for different historical periods</span>
</span></span><span style="display:flex;"><span>    period_prompts <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;1500-1600&#34;</span>: <span style="color:#a6da95">&#34;In the year 1550, the gentleman said&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;1600-1700&#34;</span>: <span style="color:#a6da95">&#34;In the year 1650, the gentleman said&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;1700-1800&#34;</span>: <span style="color:#a6da95">&#34;In the year 1750, the gentleman said&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;1800-1850&#34;</span>: <span style="color:#a6da95">&#34;In the year 1834, the gentleman said&#34;</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> period, prompt <span style="color:#91d7e3;font-weight:bold">in</span> period_prompts<span style="color:#91d7e3;font-weight:bold">.</span>items():
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">try</span>:
</span></span><span style="display:flex;"><span>            generated <span style="color:#91d7e3;font-weight:bold">=</span> generate_historical_text(model, tokenizer, prompt, max_length<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">30</span>)
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Check for period-appropriate language</span>
</span></span><span style="display:flex;"><span>            period_terms <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>                <span style="color:#a6da95">&#34;1500-1600&#34;</span>: [<span style="color:#a6da95">&#34;ye&#34;</span>, <span style="color:#a6da95">&#34;hath&#34;</span>, <span style="color:#a6da95">&#34;doth&#34;</span>, <span style="color:#a6da95">&#34;thou&#34;</span>, <span style="color:#a6da95">&#34;thee&#34;</span>],
</span></span><span style="display:flex;"><span>                <span style="color:#a6da95">&#34;1600-1700&#34;</span>: [<span style="color:#a6da95">&#34;hath&#34;</span>, <span style="color:#a6da95">&#34;doth&#34;</span>, <span style="color:#a6da95">&#34;thou&#34;</span>, <span style="color:#a6da95">&#34;thee&#34;</span>, <span style="color:#a6da95">&#34;verily&#34;</span>],
</span></span><span style="display:flex;"><span>                <span style="color:#a6da95">&#34;1700-1800&#34;</span>: [<span style="color:#a6da95">&#34;hath&#34;</span>, <span style="color:#a6da95">&#34;doth&#34;</span>, <span style="color:#a6da95">&#34;thou&#34;</span>, <span style="color:#a6da95">&#34;thee&#34;</span>, <span style="color:#a6da95">&#34;indeed&#34;</span>],
</span></span><span style="display:flex;"><span>                <span style="color:#a6da95">&#34;1800-1850&#34;</span>: [<span style="color:#a6da95">&#34;indeed&#34;</span>, <span style="color:#a6da95">&#34;verily&#34;</span>, <span style="color:#a6da95">&#34;whilst&#34;</span>, <span style="color:#a6da95">&#34;pray&#34;</span>]
</span></span><span style="display:flex;"><span>            }
</span></span><span style="display:flex;"><span>            
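</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Match whole words so that &#34;ye&#34; does not match the &#34;year&#34; in the prompt</span>
</span></span><span style="display:flex;"><span>            words <span style="color:#91d7e3;font-weight:bold">=</span> {w<span style="color:#91d7e3;font-weight:bold">.</span>strip(<span style="color:#a6da95">&#34;.,;:!?&#34;</span>) <span style="color:#c6a0f6">for</span> w <span style="color:#91d7e3;font-weight:bold">in</span> generated<span style="color:#91d7e3;font-weight:bold">.</span>lower()<span style="color:#91d7e3;font-weight:bold">.</span>split()}
</span></span><span style="display:flex;"><span>            found_terms <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">sum</span>(<span style="color:#f5a97f">1</span> <span style="color:#c6a0f6">for</span> term <span style="color:#91d7e3;font-weight:bold">in</span> period_terms[period] <span style="color:#c6a0f6">if</span> term <span style="color:#91d7e3;font-weight:bold">in</span> words)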
</span></span><span style="display:flex;"><span>            tests[<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;period_</span><span style="color:#a6da95">{</span>period<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> found_terms <span style="color:#91d7e3;font-weight:bold">&gt;</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">Exception</span> <span style="color:#c6a0f6">as</span> e:
</span></span><span style="display:flex;"><span>            tests[<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;period_</span><span style="color:#a6da95">{</span>period<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">False</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test London-specific knowledge</span>
</span></span><span style="display:flex;"><span>    london_prompts <span style="color:#91d7e3;font-weight:bold">=</span> [
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;The Thames flowed through&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;Westminster Hall was&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;The Tower of London&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;Cheapside was filled with&#34;</span>
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    london_tests <span style="color:#91d7e3;font-weight:bold">=</span> []
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> prompt <span style="color:#91d7e3;font-weight:bold">in</span> london_prompts:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">try</span>:
</span></span><span style="display:flex;"><span>            generated <span style="color:#91d7e3;font-weight:bold">=</span> generate_historical_text(model, tokenizer, prompt, max_length<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">20</span>)
</span></span><span style="display:flex;"><span>            london_tests<span style="color:#91d7e3;font-weight:bold">.</span>append(<span style="color:#91d7e3">len</span>(generated) <span style="color:#91d7e3;font-weight:bold">&gt;</span> <span style="color:#91d7e3">len</span>(prompt))
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">Exception</span> <span style="color:#c6a0f6">as</span> e:
</span></span><span style="display:flex;"><span>            london_tests<span style="color:#91d7e3;font-weight:bold">.</span>append(<span style="color:#f5a97f">False</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    tests[<span style="color:#a6da95">&#34;london_knowledge&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">any</span>(london_tests)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> tests
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">test_linguistic_quality</span>(model, tokenizer):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Test linguistic quality of generated text&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    tests <span style="color:#91d7e3;font-weight:bold">=</span> {}
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test prompts for linguistic quality</span>
</span></span><span style="display:flex;"><span>    quality_prompts <span style="color:#91d7e3;font-weight:bold">=</span> [
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;The gentleman walked through the garden&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;In the morning, the sun rose&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;The old man sat by the fire&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;The young woman read her book&#34;</span>
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    quality_tests <span style="color:#91d7e3;font-weight:bold">=</span> []
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> prompt <span style="color:#91d7e3;font-weight:bold">in</span> quality_prompts:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">try</span>:
</span></span><span style="display:flex;"><span>            generated <span style="color:#91d7e3;font-weight:bold">=</span> generate_historical_text(model, tokenizer, prompt, max_length<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">30</span>)
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Check for basic linguistic quality</span>
</span></span><span style="display:flex;"><span>            sentences <span style="color:#91d7e3;font-weight:bold">=</span> generated<span style="color:#91d7e3;font-weight:bold">.</span>split(<span style="color:#a6da95">&#39;.&#39;</span>)
</span></span><span style="display:flex;"><span>            quality_tests<span style="color:#91d7e3;font-weight:bold">.</span>append(<span style="color:#91d7e3">len</span>(sentences) <span style="color:#91d7e3;font-weight:bold">&gt;</span> <span style="color:#f5a97f">1</span>)
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">Exception</span> <span style="color:#c6a0f6">as</span> e:
</span></span><span style="display:flex;"><span>            quality_tests<span style="color:#91d7e3;font-weight:bold">.</span>append(<span style="color:#f5a97f">False</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    tests[<span style="color:#a6da95">&#34;linguistic_quality&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">any</span>(quality_tests)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test coherence</span>
</span></span><span style="display:flex;"><span>    coherence_prompt <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#34;The gentleman walked through the garden and&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">try</span>:
</span></span><span style="display:flex;"><span>        generated <span style="color:#91d7e3;font-weight:bold">=</span> generate_historical_text(model, tokenizer, coherence_prompt, max_length<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">50</span>)
</span></span><span style="display:flex;"><span>        tests[<span style="color:#a6da95">&#34;coherence&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">len</span>(generated) <span style="color:#91d7e3;font-weight:bold">&gt;</span> <span style="color:#91d7e3">len</span>(coherence_prompt)
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">Exception</span> <span style="color:#c6a0f6">as</span> e:
</span></span><span style="display:flex;"><span>        tests[<span style="color:#a6da95">&#34;coherence&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">False</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> tests
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">test_performance_metrics</span>(model, tokenizer):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Test performance metrics of the model&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    tests <span style="color:#91d7e3;font-weight:bold">=</span> {}
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test inference speed</span>
</span></span><span style="display:flex;"><span>    test_prompt <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#34;In the year 1834, London was&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">try</span>:
</span></span><span style="display:flex;"><span>        start_time <span style="color:#91d7e3;font-weight:bold">=</span> time<span style="color:#91d7e3;font-weight:bold">.</span>perf_counter()  <span style="color:#6e738d;font-style:italic"># monotonic clock suited to measuring durations</span>
</span></span><span style="display:flex;"><span>        generated <span style="color:#91d7e3;font-weight:bold">=</span> generate_historical_text(model, tokenizer, test_prompt, max_length<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">50</span>)
</span></span><span style="display:flex;"><span>        end_time <span style="color:#91d7e3;font-weight:bold">=</span> time<span style="color:#91d7e3;font-weight:bold">.</span>perf_counter()
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        inference_time <span style="color:#91d7e3;font-weight:bold">=</span> end_time <span style="color:#91d7e3;font-weight:bold">-</span> start_time
</span></span><span style="display:flex;"><span>        tests[<span style="color:#a6da95">&#34;inference_speed&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> inference_time <span style="color:#91d7e3;font-weight:bold">&lt;</span> <span style="color:#f5a97f">5.0</span>  <span style="color:#6e738d;font-style:italic"># Should complete within 5 seconds</span>
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">Exception</span> <span style="color:#c6a0f6">as</span> e:
</span></span><span style="display:flex;"><span>        tests[<span style="color:#a6da95">&#34;inference_speed&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">False</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Test memory usage</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">try</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">psutil</span>
</span></span><span style="display:flex;"><span>        process <span style="color:#91d7e3;font-weight:bold">=</span> psutil<span style="color:#91d7e3;font-weight:bold">.</span>Process()
</span></span><span style="display:flex;"><span>        memory_usage <span style="color:#91d7e3;font-weight:bold">=</span> process<span style="color:#91d7e3;font-weight:bold">.</span>memory_info()<span style="color:#91d7e3;font-weight:bold">.</span>rss <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#f5a97f">1024</span> <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#f5a97f">1024</span>  <span style="color:#6e738d;font-style:italic"># MB</span>
</span></span><span style="display:flex;"><span>        tests[<span style="color:#a6da95">&#34;memory_usage&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> memory_usage <span style="color:#91d7e3;font-weight:bold">&lt;</span> <span style="color:#f5a97f">8000</span>  <span style="color:#6e738d;font-style:italic"># Should use less than 8GB</span>
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">Exception</span> <span style="color:#c6a0f6">as</span> e:
</span></span><span style="display:flex;"><span>        tests[<span style="color:#a6da95">&#34;memory_usage&#34;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">True</span>  <span style="color:#6e738d;font-style:italic"># Skip if psutil not available</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> tests</span></span></code></pre></div><figcaption>
        <strong>Listing 12: Testing Published Models</strong>
    </figcaption>
</figure>
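<p>These published-model tests are smoke tests: they confirm that the model loads, generates text, produces period-appropriate vocabulary, and stays within basic latency and memory thresholds. Run them both before and after publication. The pass/fail checks are heuristics (keyword presence, output length), not measures of historical correctness.</p>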
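<p>With keyword heuristics like the period checks in Listing 12, it helps to match whole words rather than substrings, since a plain substring test would count <code>ye</code> inside the word <code>year</code> that appears in every period prompt. The snippet below is a minimal sketch of that idea; the helper name <code>count_period_terms</code> is hypothetical and not part of the repository.</p>

```python
import re


def count_period_terms(text, terms):
    """Count how many marker terms occur in text as whole words.

    Hypothetical helper for illustration; not taken from the helloLondon repo.
    """
    # Split the lowercased text into alphabetic words so punctuation is ignored
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sum(1 for term in terms if term in words)


sample = "In the year 1550, the gentleman said that he hath seen thee"
# "hath" and "thee" match; "ye" does not match "year", "thee" does not match "the"
print(count_period_terms(sample, ["ye", "hath", "doth", "thou", "thee"]))  # prints 2
```

<p>A set lookup keeps the check cheap, and the same helper could be reused for each period's term list.</p>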
<h3 id="55-model-card-generation">5.5 Model Card Generation</h3>
<p>The model card serves as the primary documentation on Hugging Face Hub, making it the first thing users see when they discover your model. A well-crafted model card helps users understand what the model does, how to use it, and its limitations. The <code>generate_comprehensive_model_card()</code> function in <strong><code>10_scripts/publish_slm_to_huggingface.py</code></strong> creates this documentation automatically.</p>
<p><strong>What Makes an Effective Model Card:</strong></p>
<p>The model card for our historical language models includes several key sections that provide users with everything they need to get started. At a minimum, include:</p>
<ol>
<li>
<p><strong>Model Description &amp; Key Features</strong>: A clear statement that the model was trained from scratch (not fine-tuned), covering the 117M-parameter SLM variant and the 354M-parameter Regular Model, with details about the custom 30,000-token vocabulary and 150+ historical special tokens.</p>
</li>
<li>
<p><strong>Setup Instructions</strong>: Platform-specific guidance for creating virtual environments (Linux/macOS/Windows), installing dependencies (<code>transformers</code>, <code>torch</code>, <code>accelerate</code>), and handling different accelerators (CPU, NVIDIA CUDA, AMD ROCm).</p>
</li>
<li>
<p><strong>Quick Start Code</strong>: Auto-device detection that works across CPU, CUDA, and ROCm with sensible generation parameters (<code>temperature=0.8</code>, <code>top_p=0.95</code>, <code>repetition_penalty=1.2</code>).</p>
</li>
<li>
<p><strong>Training Details</strong>: Architecture specifics (GPT-2 Small/Medium), training infrastructure (2x GPU with Distributed Data Parallel), performance metrics (training loss, Model FLOPs Utilization (MFU)), and data sources (218+ historical sources spanning 1500-1850).</p>
</li>
<li>
<p><strong>Example Prompts</strong>: Period-specific prompts demonstrating different historical eras (Tudor, Stuart, Georgian, Victorian) and London-specific contexts (Thames, Westminster, Parliament).</p>
</li>
<li>
<p><strong>Testing &amp; Validation</strong>: Instructions for running the automated test suite (<strong><code>test_published_models.py</code></strong>) and interactive testing with custom prompts.</p>
</li>
<li>
<p><strong>Troubleshooting</strong>: Common issues and solutions for PyTorch installation, GPU detection, and memory constraints.</p>
</li>
<li>
<p><strong>Citation &amp; License</strong>: BibTeX citation format and MIT license information.</p>
</li>
</ol>
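<p>The auto-device detection from the Quick Start item can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the published model card: <code>pick_device</code> and <code>GENERATION_KWARGS</code> are hypothetical names, and <code>do_sample</code> and <code>max_new_tokens</code> are assumed values that the text above does not specify. The sampling settings themselves are the ones quoted in the list.</p>

```python
def pick_device():
    """Return "cuda" when a GPU backend is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so fall back to CPU
    # ROCm builds of PyTorch expose AMD GPUs through the same torch.cuda API
    return "cuda" if torch.cuda.is_available() else "cpu"


# Sampling settings quoted in the Quick Start item above
GENERATION_KWARGS = {
    "do_sample": True,       # assumption: sampling must be on for temperature/top_p to apply
    "temperature": 0.8,
    "top_p": 0.95,
    "repetition_penalty": 1.2,
    "max_new_tokens": 50,    # assumption: not specified in the text
}

device = pick_device()
print(f"Using device: {device}")

# With transformers installed, usage would look roughly like this, assuming the
# published checkpoint loads through AutoModelForCausalLM:
#   model = AutoModelForCausalLM.from_pretrained("bahree/london-historical-slm").to(device)
#   inputs = tokenizer(prompt, return_tensors="pt").to(device)
#   outputs = model.generate(**inputs, **GENERATION_KWARGS)
```

<p>Because ROCm reuses the <code>torch.cuda</code> namespace, a single availability check is enough to cover both NVIDIA and AMD accelerators.</p>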
<p><strong>Key Implementation Details:</strong></p>
<p>The model card generation follows Hugging Face conventions with YAML frontmatter specifying license, library, pipeline type, language, and tags. The script emphasizes that models were <strong>trained from scratch</strong> (not fine-tuned) and provides device-agnostic code examples that run on CPU, CUDA, and ROCm.</p>
<p>The card also includes model selection guidance comparing the SLM (faster, lower memory) with the Regular Model (higher quality, more parameters), helping users choose the right model for their use case, whether that&rsquo;s quick experimentation, educational purposes, or production deployment.</p>
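<p>To make the YAML frontmatter described above concrete, here is a small sketch that renders the metadata block at the top of a model card. The function name <code>build_front_matter</code> and the example tags are hypothetical; the keys (<code>license</code>, <code>library_name</code>, <code>pipeline_tag</code>, <code>language</code>, <code>tags</code>) follow the Hugging Face model card metadata convention, and the repository script may build the block differently.</p>

```python
def build_front_matter(license_id, library_name, pipeline_tag, language, tags):
    """Render Hugging Face model card metadata as a YAML front matter block."""
    lines = [
        "---",
        f"license: {license_id}",
        f"library_name: {library_name}",
        f"pipeline_tag: {pipeline_tag}",
        f"language: {language}",
        "tags:",
    ]
    # Each tag becomes one YAML list item
    lines.extend(f"- {tag}" for tag in tags)
    lines.append("---")
    return "\n".join(lines) + "\n"


front_matter = build_front_matter(
    license_id="mit",
    library_name="transformers",
    pipeline_tag="text-generation",
    language="en",
    tags=["historical-text", "london"],  # illustrative tags only
)
print(front_matter)
```

<p>The Markdown body of the card (description, setup, examples) would then be appended after the closing <code>---</code> line.</p>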
<blockquote>
<p><strong>💻 Complete Implementation</strong>: See <a
	
		href = "https://github.com/bahree/helloLondon/blob/main/10_scripts/publish_slm_to_huggingface.py"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		<strong><code>10_scripts/publish_slm_to_huggingface.py</code></strong>
	</span>
</a> and <a
	
		href = "https://github.com/bahree/helloLondon/blob/main/10_scripts/publish_to_huggingface.py"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		<strong><code>10_scripts/publish_to_huggingface.py</code></strong>
	</span>
</a> for the full model card generation implementation.</p></blockquote>
<blockquote>
<p><strong>👀 Live Model Cards</strong>: View the published cards at <a
	
		href = "https://huggingface.co/bahree/london-historical-slm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		bahree/london-historical-slm
	</span>
</a> and <a
	
		href = "https://huggingface.co/bahree/london-historical-llm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		bahree/london-historical-llm
	</span>
</a>.</p></blockquote>
<blockquote>
<p><strong>📝 Documentation</strong>: See <a
	
		href = "https://github.com/bahree/helloLondon/blob/main/08_documentation/HUGGINGFACE_PUBLISHING.md"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		HUGGINGFACE_PUBLISHING.md
	</span>
</a> and <a
	
		href = "https://github.com/bahree/helloLondon/blob/main/08_documentation/DEPLOYMENT_GUIDE.md"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		DEPLOYMENT_GUIDE.md
	</span>
</a> for complete publishing and deployment workflows.</p></blockquote>
<h3 id="56-local-deployment-options">5.6 Local Deployment Options</h3>
<p>Finally, <a href="#listing13" class="listing-ref">Listing 13</a> sketches how you might wrap a trained model into a simple REST API or CLI. These patterns are intentionally minimal, meant to help you connect the dots between the inference utilities in <strong><code>06_inference/</code></strong> and real applications (dashboards, notebooks, small services).</p>
<figure id="listing13"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">setup_local_deployment</span>(model, tokenizer, deployment_type<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;api&#39;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Set up local deployment for historical language model&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> deployment_type <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#a6da95">&#39;api&#39;</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> setup_api_deployment(model, tokenizer)
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">elif</span> deployment_type <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#a6da95">&#39;cli&#39;</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> setup_cli_deployment(model, tokenizer)
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">elif</span> deployment_type <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#a6da95">&#39;notebook&#39;</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> setup_notebook_deployment(model, tokenizer)
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">raise</span> <span style="color:#f5a97f">ValueError</span>(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Unknown deployment type: </span><span style="color:#a6da95">{</span>deployment_type<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">setup_api_deployment</span>(model, tokenizer):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Set up REST API deployment&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">from</span> <span style="color:#f5a97f">flask</span> <span style="color:#8bd5ca">import</span> Flask, request, jsonify
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">torch</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    app <span style="color:#91d7e3;font-weight:bold">=</span> Flask(<span style="color:#f4dbd6">__name__</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#8aadf4;font-weight:bold">@app.route</span>(<span style="color:#a6da95">&#39;/generate&#39;</span>, methods<span style="color:#91d7e3;font-weight:bold">=</span>[<span style="color:#a6da95">&#39;POST&#39;</span>])
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">generate_text</span>():
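</span></span><span style="display:flex;"><span>        data <span style="color:#91d7e3;font-weight:bold">=</span> request<span style="color:#91d7e3;font-weight:bold">.</span>get_json(silent<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>) <span style="color:#91d7e3;font-weight:bold">or</span> {}  <span style="color:#6e738d;font-style:italic"># tolerate a missing or non-JSON body</span>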
</span></span><span style="display:flex;"><span>        prompt <span style="color:#91d7e3;font-weight:bold">=</span> data<span style="color:#91d7e3;font-weight:bold">.</span>get(<span style="color:#a6da95">&#39;prompt&#39;</span>, <span style="color:#a6da95">&#39;&#39;</span>)
</span></span><span style="display:flex;"><span>        max_length <span style="color:#91d7e3;font-weight:bold">=</span> data<span style="color:#91d7e3;font-weight:bold">.</span>get(<span style="color:#a6da95">&#39;max_length&#39;</span>, <span style="color:#f5a97f">100</span>)
</span></span><span style="display:flex;"><span>        temperature <span style="color:#91d7e3;font-weight:bold">=</span> data<span style="color:#91d7e3;font-weight:bold">.</span>get(<span style="color:#a6da95">&#39;temperature&#39;</span>, <span style="color:#f5a97f">0.3</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Generate text</span>
</span></span><span style="display:flex;"><span>        inputs <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>encode(prompt, return_tensors<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;pt&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">with</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>no_grad():
</span></span><span style="display:flex;"><span>            outputs <span style="color:#91d7e3;font-weight:bold">=</span> model<span style="color:#91d7e3;font-weight:bold">.</span>generate(
</span></span><span style="display:flex;"><span>                inputs,
</span></span><span style="display:flex;"><span>                max_length<span style="color:#91d7e3;font-weight:bold">=</span>max_length,
</span></span><span style="display:flex;"><span>                temperature<span style="color:#91d7e3;font-weight:bold">=</span>temperature,
</span></span><span style="display:flex;"><span>                do_sample<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>,
</span></span><span style="display:flex;"><span>                pad_token_id<span style="color:#91d7e3;font-weight:bold">=</span>tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>eos_token_id
</span></span><span style="display:flex;"><span>            )
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        generated_text <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>decode(outputs[<span style="color:#f5a97f">0</span>], skip_special_tokens<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> jsonify({
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;generated_text&#39;</span>: generated_text,
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;prompt&#39;</span>: prompt,
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;parameters&#39;</span>: {
</span></span><span style="display:flex;"><span>                <span style="color:#a6da95">&#39;max_length&#39;</span>: max_length,
</span></span><span style="display:flex;"><span>                <span style="color:#a6da95">&#39;temperature&#39;</span>: temperature
</span></span><span style="display:flex;"><span>            }
</span></span><span style="display:flex;"><span>        })
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#8aadf4;font-weight:bold">@app.route</span>(<span style="color:#a6da95">&#39;/health&#39;</span>, methods<span style="color:#91d7e3;font-weight:bold">=</span>[<span style="color:#a6da95">&#39;GET&#39;</span>])
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">health_check</span>():
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> jsonify({<span style="color:#a6da95">&#39;status&#39;</span>: <span style="color:#a6da95">&#39;healthy&#39;</span>, <span style="color:#a6da95">&#39;model_loaded&#39;</span>: <span style="color:#f5a97f">True</span>})
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> app
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">setup_cli_deployment</span>(model, tokenizer):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Set up command-line interface deployment&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">argparse</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">main</span>():
</span></span><span style="display:flex;"><span>        parser <span style="color:#91d7e3;font-weight:bold">=</span> argparse<span style="color:#91d7e3;font-weight:bold">.</span>ArgumentParser(description<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;Historical Language Model CLI&#39;</span>)
</span></span><span style="display:flex;"><span>        parser<span style="color:#91d7e3;font-weight:bold">.</span>add_argument(<span style="color:#a6da95">&#39;--prompt&#39;</span>, <span style="color:#91d7e3">type</span><span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">str</span>, default<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">None</span>, help<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;Text prompt (required unless --interactive)&#39;</span>)
</span></span><span style="display:flex;"><span>        parser<span style="color:#91d7e3;font-weight:bold">.</span>add_argument(<span style="color:#a6da95">&#39;--max_length&#39;</span>, <span style="color:#91d7e3">type</span><span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">int</span>, default<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">100</span>, help<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;Maximum length&#39;</span>)
</span></span><span style="display:flex;"><span>        parser<span style="color:#91d7e3;font-weight:bold">.</span>add_argument(<span style="color:#a6da95">&#39;--temperature&#39;</span>, <span style="color:#91d7e3">type</span><span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">float</span>, default<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">0.3</span>, help<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;Temperature&#39;</span>)
</span></span><span style="display:flex;"><span>        parser<span style="color:#91d7e3;font-weight:bold">.</span>add_argument(<span style="color:#a6da95">&#39;--interactive&#39;</span>, action<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;store_true&#39;</span>, help<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;Interactive mode&#39;</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        args <span style="color:#91d7e3;font-weight:bold">=</span> parser<span style="color:#91d7e3;font-weight:bold">.</span>parse_args()
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> args<span style="color:#91d7e3;font-weight:bold">.</span>interactive:
</span></span><span style="display:flex;"><span>            run_interactive_mode(model, tokenizer)
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">elif</span> args<span style="color:#91d7e3;font-weight:bold">.</span>prompt <span style="color:#91d7e3;font-weight:bold">is</span> <span style="color:#f5a97f">None</span>:
</span></span><span style="display:flex;"><span>            parser<span style="color:#91d7e3;font-weight:bold">.</span>error(<span style="color:#a6da95">&#39;--prompt is required unless --interactive is set&#39;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>            generate_and_print(model, tokenizer, args<span style="color:#91d7e3;font-weight:bold">.</span>prompt, args<span style="color:#91d7e3;font-weight:bold">.</span>max_length, args<span style="color:#91d7e3;font-weight:bold">.</span>temperature)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> main</span></span></code></pre></div><figcaption>
        <strong>Listing 13: Local Deployment Setup</strong>
    </figcaption>
</figure>
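<p>Listing 13 routes to one of three local deployment options: a REST API, a CLI, or notebook integration. The helpers <strong><code>setup_notebook_deployment</code></strong>, <strong><code>run_interactive_mode</code></strong>, and <strong><code>generate_and_print</code></strong> are referenced but not shown in the listing.</p>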
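<p>The <strong><code>/generate</code></strong> handler in Listing 13 passes request values straight to <strong><code>model.generate</code></strong>. One way to guard that boundary is to validate the payload first. The sketch below is a minimal, framework-free example: the function name <strong><code>parse_generation_request</code></strong> and the <strong><code>max_allowed_length</code></strong> cap of 512 are assumptions for illustration, not part of the helloLondon repository. The defaults of 100 and 0.3 mirror the listing.</p>

```python
def parse_generation_request(data, max_allowed_length=512):
    """Validate a JSON-like request body and return clean generation parameters.

    Hypothetical helper: raises ValueError with a readable message when a
    field is unusable. The 512 cap is an assumed limit, not a repo setting.
    """
    if not isinstance(data, dict):
        raise ValueError("request body must be a JSON object")

    prompt = data.get("prompt", "")
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("'prompt' must be a non-empty string")

    # bool is a subclass of int in Python, so reject it explicitly.
    max_length = data.get("max_length", 100)
    if isinstance(max_length, bool) or not isinstance(max_length, int):
        raise ValueError("'max_length' must be an integer")
    if max_length not in range(1, max_allowed_length + 1):
        raise ValueError(f"'max_length' must be between 1 and {max_allowed_length}")

    # Sampling needs a strictly positive temperature.
    temperature = data.get("temperature", 0.3)
    if (
        isinstance(temperature, bool)
        or not isinstance(temperature, (int, float))
        or not temperature > 0
    ):
        raise ValueError("'temperature' must be a positive number")

    return {
        "prompt": prompt,
        "max_length": max_length,
        "temperature": float(temperature),
    }
```

<p>Inside the Flask route, you could call this helper on the parsed body and, on <strong><code>ValueError</code></strong>, return <strong><code>jsonify({&#39;error&#39;: str(exc)}), 400</code></strong> rather than letting a bad value fail inside generation.</p>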
<h2 id="6-quality-assurance-and-validation">6. Quality Assurance and Validation</h2>
<p>Before wrapping up, let&rsquo;s look at the quality assurance systems that ensure the models behave reliably across different scenarios.</p>
<h3 id="61-automated-quality-checks">6.1 Automated Quality Checks</h3>
<p>The system includes automated quality checks that validate model performance and reliability, as outlined in <a href="#listing14" class="listing-ref">Listing 14</a>.</p>
<figure id="listing14"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">run_quality_checks</span>(model, tokenizer, device<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;cuda&#39;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Run quality checks on historical language model&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    quality_checks <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;model_integrity&#39;</span>: check_model_integrity(model),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;tokenizer_consistency&#39;</span>: check_tokenizer_consistency(tokenizer),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;generation_quality&#39;</span>: check_generation_quality(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;historical_accuracy&#39;</span>: check_historical_accuracy(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;performance_metrics&#39;</span>: check_performance_metrics(model, tokenizer, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;memory_usage&#39;</span>: check_memory_usage(model, device),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;error_handling&#39;</span>: check_error_handling(model, tokenizer, device)
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Generate quality report</span>
</span></span><span style="display:flex;"><span>    quality_report <span style="color:#91d7e3;font-weight:bold">=</span> generate_quality_report(quality_checks)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> quality_checks, quality_report
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">check_model_integrity</span>(model):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Check model integrity and consistency&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    checks <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;parameter_count&#39;</span>: check_parameter_count(model),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;weight_distribution&#39;</span>: check_weight_distribution(model),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;gradient_flow&#39;</span>: check_gradient_flow(model),
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;activation_patterns&#39;</span>: check_activation_patterns(model)
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> checks
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">check_generation_quality</span>(model, tokenizer, device):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Check quality of generated text&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    test_prompts <span style="color:#91d7e3;font-weight:bold">=</span> [
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;In the year of our Lord 1750, London was&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;The Thames flowed through the heart of&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;Merchants and tradesmen plied their wares&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;The Great Fire of 1666 had changed&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;Parliament sat in Westminster, making laws&#34;</span>
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    quality_scores <span style="color:#91d7e3;font-weight:bold">=</span> []
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> prompt <span style="color:#91d7e3;font-weight:bold">in</span> test_prompts:
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Generate text</span>
</span></span><span style="display:flex;"><span>        generated <span style="color:#91d7e3;font-weight:bold">=</span> generate_text(model, tokenizer, prompt, device)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Check quality metrics</span>
</span></span><span style="display:flex;"><span>        quality_score <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;coherence&#39;</span>: assess_coherence(generated),
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;grammatical_correctness&#39;</span>: assess_grammatical_correctness(generated),
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;historical_accuracy&#39;</span>: assess_historical_accuracy(generated, {}),
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;linguistic_quality&#39;</span>: assess_linguistic_quality(generated),
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;relevance&#39;</span>: assess_relevance(generated, prompt)
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        quality_scores<span style="color:#91d7e3;font-weight:bold">.</span>append(quality_score)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> quality_scores</span></span></code></pre></div><figcaption>
        <strong>Listing 14: Quality Assurance Checks</strong>
    </figcaption>
</figure>
<p>Quality checks cover model integrity, tokenizer consistency, generation quality, historical accuracy, performance, memory usage, and error handling, which helps confirm that the models behave reliably across different scenarios.</p>
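<p>Listing 14 calls <strong><code>generate_quality_report</code></strong> without showing its body. Here is a minimal sketch of what such a function could look like, assuming each check returns booleans or numbers in the 0 to 1 range, possibly nested inside dicts and lists. The <strong><code>flatten_scores</code></strong> helper and the 0.7 pass threshold are hypothetical, and the actual repository implementation may differ.</p>

```python
from statistics import mean


def flatten_scores(value, prefix=""):
    """Yield (dotted_name, score) pairs from nested dicts and lists.

    Assumption: leaf values are booleans or numbers; other types are skipped.
    """
    if isinstance(value, dict):
        for key, item in value.items():
            name = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten_scores(item, name)
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            yield from flatten_scores(item, f"{prefix}[{index}]")
    elif isinstance(value, (bool, int, float)):
        yield prefix, float(value)  # True becomes 1.0, False becomes 0.0


def generate_quality_report(quality_checks, threshold=0.7):
    """Summarize each top-level check as a mean score and a pass/fail flag."""
    checks = {}
    for name, result in quality_checks.items():
        scores = [score for _, score in flatten_scores(result)]
        average = mean(scores) if scores else None
        checks[name] = {
            "mean_score": average,
            "num_scores": len(scores),
            # A check with no usable scores is treated as failed.
            "passed": average is not None and average >= threshold,
        }
    overall = all(entry["passed"] for entry in checks.values())
    return {"checks": checks, "passed": overall}
```

<p>Averaging hides individual failures, so a stricter variant could require every flattened score to clear the threshold instead of the mean.</p>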
<h3 id="62-continuous-integration-and-testing">6.2 Continuous Integration and Testing</h3>
<p>If you want to wire this into a lightweight CI gate, keep it simple and CPU-friendly. The goal is not to re-run full benchmarks in CI - it&rsquo;s to catch obvious regressions (can the model load, can it generate, do the evaluators still run).</p>
<p><strong>Minimal CI smoke checks (suggested):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># 1) Run a fast, local evaluation pass (no external APIs)</span>
</span></span><span style="display:flex;"><span>python 05_evaluation/run_evaluation.py --mode quick --device cpu
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># 2) Run a local inference smoke test from a checkpoint (replace with your path)</span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_pytorch.py --checkpoint &lt;path-to-checkpoint.pt&gt; --prompt <span style="color:#a6da95">&#34;In the year 1834, London was&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># 3) Optional: test the published model (requires downloading from Hugging Face)</span>
</span></span><span style="display:flex;"><span>python 06_inference/test_published_models.py --model_name bahree/london-historical-slm</span></span></code></pre></div>
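<p>The smoke checks above can also be driven from a small Python runner, so that a CI job gets a single exit code. This is a sketch under assumptions: the names <strong><code>run_smoke_checks</code></strong>, <strong><code>exit_code</code></strong>, and <strong><code>QUICK_EVALUATION</code></strong> are hypothetical, the 600-second timeout is arbitrary, and only the first command is wired in because the checkpoint path in step 2 depends on your setup.</p>

```python
import subprocess
import sys


def run_smoke_checks(commands, timeout_seconds=600):
    """Run each command in order and return a list of (command, passed) pairs."""
    results = []
    for command in commands:
        try:
            completed = subprocess.run(command, timeout=timeout_seconds, check=False)
            passed = completed.returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            # A hang or a missing executable counts as a failed check.
            passed = False
        results.append((command, passed))
    return results


def exit_code(results):
    """Return 0 when every check passed, 1 otherwise (suitable for sys.exit)."""
    return 0 if all(passed for _, passed in results) else 1


# Step 1 from the commands above; assumes the working directory is the repo root.
QUICK_EVALUATION = [
    sys.executable, "05_evaluation/run_evaluation.py",
    "--mode", "quick", "--device", "cpu",
]
```

<p>A CI entry point could then end with <strong><code>sys.exit(exit_code(run_smoke_checks([QUICK_EVALUATION])))</code></strong>, which fails the job as soon as any check returns a non-zero status.</p>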
<h2 id="7-summary">7. Summary</h2>
<p>We&rsquo;ve now completed the full cycle of building language models from scratch. This final part has shown how to transform trained models into working systems that can be evaluated, tested, and deployed for real-world use. The journey that began in <a
	
		href = "/post/2025/09/building-llm-from-scratch-part1/"
	

	

	>
	
	<span>
		Part 1
	</span>
</a> with using published models, continued through <a
	
		href = "/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	>
	
	<span>
		Part 2
	</span>
</a>&rsquo;s data collection and tokenization, and <a
	
		href = "/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/"
	

	

	>
	
	<span>
		Part 3
	</span>
</a>&rsquo;s training architecture, now concludes with evaluation and deployment - the critical final steps that make models usable.</p>
<p><strong>What we&rsquo;ve built:</strong></p>
<p>The evaluation, testing, and deployment pipeline provides a practical approach for bringing historical language models from research to deployment. We&rsquo;ve created specialized assessment metrics that go beyond standard LLM evaluation to catch historical inaccuracies, temporal inconsistencies, and period-inappropriate language. The testing infrastructure ensures reliability across different scenarios, while multiple deployment options make the models accessible to researchers, educators, and developers worldwide.</p>
<p><strong>Current Deployment Status:</strong></p>
<ul>
<li><strong>PyTorch Checkpoint Inference</strong>: Fully working with both SLM and Regular models</li>
<li><strong>Hugging Face Model Inference</strong>: SLM published and available, Regular model ready</li>
<li><strong>Local Testing</strong>: Both inference methods tested and validated on a remote Ubuntu machine</li>
<li><strong>Documentation</strong>: Complete guides and examples for all inference methods</li>
<li><strong>Performance</strong>: Clean logging, proper model detection, accurate parameter counts</li>
</ul>
<p><strong>The Complete Pipeline:</strong></p>
<p>This four-part series has demonstrated the complete LLM development lifecycle:</p>
<ol>
<li>
<p><strong>Data Collection</strong> (<a
	
		href = "/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	>
	
	<span>
		Part 2
	</span>
</a>): We gathered 218+ historical sources spanning 1500-1850, processed them through a sophisticated cleaning pipeline, and created a 500M+ character corpus of authentic historical English.</p>
</li>
<li>
<p><strong>Custom Tokenization</strong> (<a
	
		href = "/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	>
	
	<span>
		Part 2
	</span>
</a>): We built a specialized BPE tokenizer with 30,000 vocabulary tokens and 150+ special tokens that capture historical language patterns, London geography, and period-specific terminology.</p>
</li>
<li>
<p><strong>Model Training</strong> (<a
	
		href = "/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/"
	

	

	>
	
	<span>
		Part 3
	</span>
</a>): We implemented custom GPT architectures, optimized for multi-GPU training, and successfully trained two models - an SLM (117M parameters) and a Regular model (354M parameters) - both capable of generating authentic historical text.</p>
</li>
<li>
<p><strong>Evaluation &amp; Deployment</strong> (This Part): We built comprehensive evaluation frameworks that assess historical accuracy, linguistic quality, and temporal consistency. We created a testing infrastructure for reliability and deployed models to the Hugging Face Hub for community access.</p>
</li>
</ol>
<p><strong>The Learning Journey:</strong></p>
<p>What started as a learning project has become a complete, working system that walks through each stage of LLM development, from raw data collection through model deployment. The principles and techniques we&rsquo;ve covered carry over from this 500M-character corpus to larger systems, and the evaluation frameworks we&rsquo;ve built can be adapted to other domain-specific language modeling tasks.</p>
<p>Whether you&rsquo;re a researcher exploring historical linguistics, an educator teaching AI concepts, or a developer building specialized language models, this series provides the complete toolkit for understanding and implementing LLM development from scratch. The models are published, the code is available, and the journey from data to deployment is complete.</p>
<blockquote>
<p><strong>🔗 GitHub Repository</strong>: <a
	
		href = "https://github.com/bahree/helloLondon"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		github.com/bahree/helloLondon
	</span>
</a> - Complete training infrastructure (<strong><code>04_training/</code></strong>), model architecture (<strong><code>config.py</code></strong>), and evaluation/deployment (<strong><code>05_evaluation/</code></strong>, <strong><code>06_inference/</code></strong>, <strong><code>10_scripts/</code></strong>)</p>
<p><strong>🟥 Series Posts</strong>: <a
	
		href = "/post/2025/09/building-llm-from-scratch-part1/"
	

	

	>
	
	<span>
		Part 1 - Using the Published Historical Models
	</span>
</a> | <a
	
		href = "/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	>
	
	<span>
		Part 2 - Data Collection &amp; Custom Tokenizer
	</span>
</a> | <a
	
		href = "/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/"
	

	

	>
	
	<span>
		Part 3 - Training Architecture &amp; GPU Optimization
	</span>
</a> | Part 4 (this post)</p>
<p><strong>🟧 Published Models</strong>: <a
	
		href = "https://huggingface.co/bahree/london-historical-slm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		SLM Model
	</span>
</a> | <a
	
		href = "https://huggingface.co/bahree/london-historical-llm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Regular Model
	</span>
</a> - Ready-to-use historical language models on Hugging Face</p>
<p><strong>📗 Book Reference</strong>: <a
	
		href = "https://a.co/d/gr87rem"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Generative AI in Action
	</span>
</a> - For deeper understanding of core LLM concepts</p></blockquote>
<h2 id="8-resources">8. Resources</h2>
<p>If you want to reproduce the full pipeline (or adapt it to your own domain), these are the most useful starting points:</p>
<ul>
<li><strong>Public GitHub repo</strong>: <a
	
		href = "https://github.com/bahree/helloLondon"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		github.com/bahree/helloLondon
	</span>
</a></li>
<li><strong>Evaluation guide</strong>: <a
	
		href = "https://github.com/bahree/helloLondon/blob/main/08_documentation/EVALUATION_GUIDE.md"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		https://github.com/bahree/helloLondon/blob/main/08_documentation/EVALUATION_GUIDE.md
	</span>
</a></li>
<li><strong>Hugging Face publishing guide</strong>: <a
	
		href = "https://github.com/bahree/helloLondon/blob/main/08_documentation/HUGGINGFACE_PUBLISHING.md"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		https://github.com/bahree/helloLondon/blob/main/08_documentation/HUGGINGFACE_PUBLISHING.md
	</span>
</a></li>
<li><strong>Deployment guide</strong>: <a
	
		href = "https://github.com/bahree/helloLondon/blob/main/08_documentation/DEPLOYMENT_GUIDE.md"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		https://github.com/bahree/helloLondon/blob/main/08_documentation/DEPLOYMENT_GUIDE.md
	</span>
</a></li>
<li><strong>Published models</strong>: <a
	
		href = "https://huggingface.co/bahree/london-historical-slm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		bahree/london-historical-slm
	</span>
</a> | <a
	
		href = "https://huggingface.co/bahree/london-historical-llm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		bahree/london-historical-llm
	</span>
</a></li>
</ul>
<h2 id="references">References</h2>
<ol>
<li>Vaswani et al. (2017) - Attention Is All You Need: <a
	
		href = "https://arxiv.org/abs/1706.03762"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		https://arxiv.org/abs/1706.03762
	</span>
</a></li>
<li>Radford et al. (2019) - Language Models are Unsupervised Multitask Learners: <a
	
		href = "https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe
	</span>
</a></li>
<li>Lin (2004) - ROUGE: A Package for Automatic Evaluation of Summaries: <a
	
		href = "https://aclanthology.org/W04-1013/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		https://aclanthology.org/W04-1013/
	</span>
</a></li>
<li>Papineni et al. (2002) - BLEU: A Method for Automatic Evaluation of Machine Translation: <a
	
		href = "https://aclanthology.org/P02-1040/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		https://aclanthology.org/P02-1040/
	</span>
</a></li>
<li>Hendrycks et al. (2021) - Measuring Massive Multitask Language Understanding (MMLU): <a
	
		href = "https://arxiv.org/abs/2009.03300"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		https://arxiv.org/abs/2009.03300
	</span>
</a></li>
<li>Zellers et al. (2019) - HellaSwag: Can a Machine Really Finish Your Sentence?: <a
	
		href = "https://arxiv.org/abs/1905.07830"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		https://arxiv.org/abs/1905.07830
	</span>
</a></li>
<li>Liu et al. (2023) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment: <a
	
		href = "https://arxiv.org/abs/2303.16634"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		https://arxiv.org/abs/2303.16634
	</span>
</a></li>
</ol>
<h2 id="acknowledgments">Acknowledgments</h2>
<p>This project builds upon the excellent work of the open-source community. Special thanks to <a
	
		href = "https://github.com/haykgrigo3/TimeCapsuleLLM"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		haykgrigo3&rsquo;s TimeCapsuleLLM
	</span>
</a> for the initial inspiration and framework for historical language model training, and to <a
	
		href = "https://github.com/karpathy/nanoGPT"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Andrej Karpathy&rsquo;s nanoGPT
	</span>
</a> for the foundational GPT architecture and training methodology. The project extends these foundations with specialized adaptations for historical text, including custom tokenizers, advanced data filtering, evaluation frameworks, and educational deployment infrastructure.</p>
]]></content:encoded>
    </item>
    <item>
      <title>🏛️Building LLMs from Scratch - Part 3: Training Architecture &amp; GPU Optimization</title>
      <link>/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/</link>
      <pubDate>Sat, 01 Nov 2025 00:00:00 +0000</pubDate>
      <guid>/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/</guid>
      <description>Complete guide to training custom GPT models with multi-GPU setup, checkpointing, and performance monitoring. Learn by building with working code and real training results.</description>
      <content:encoded><![CDATA[<p><strong>TL;DR</strong></p>
<p>In this third part of our 4-part series on building language models from scratch, I explore the complete training infrastructure that transforms our clean historical data and custom tokenizer into working language models.</p>
<ul>
<li><a
	
		href = "https://blog.desigeek.com/post/2025/09/building-llm-from-scratch-part1/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		<strong>Part 1</strong>
	</span>
</a> How to build a Large Language Model from Scratch - covered using the published model</li>
<li><a
	
		href = "https://blog.desigeek.com/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		<strong>Part 2</strong>
	</span>
</a> Building LLMs from Scratch - Part 2: Data Collection &amp; Custom Tokenizers - detailed data collection and custom tokenizer development.</li>
</ul>
<p>Here, we build the complete training pipeline from a custom GPT architecture through deployment-ready checkpoints.</p>
<p>This post demonstrates how to design custom model architectures, optimize GPU utilization, and implement comprehensive training pipelines that transform our 500M+ character historical corpus into two working language models.</p>
<blockquote>
<p><strong>⚠️ Educational Purpose</strong>: This is a learning project designed to teach LLM development concepts. For production-scale LLMs, you&rsquo;ll need significantly larger datasets, more sophisticated infrastructure, and additional considerations that are not covered in this post.</p></blockquote>
<p>As outlined in <a
	
		href = "https://blog.desigeek.com/post/2025/09/building-llm-from-scratch-part1/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Part 1
	</span>
</a>, both the SLM (117M parameters) and the regular Model (354M parameters) use the same training code and pipeline (<code>04_training/train_model_slm.py</code> and <code>04_training/train_model.py</code>) with different configurations defined in <code>config.py</code>. The training infrastructure, GPU optimization, checkpointing, and WandB integration are identical - only the model architecture parameters differ.</p>
<p>Both PyTorch checkpoint inference and Hugging Face model inference are fully working and available. Both the SLM and the Regular model are published on <a
	
		href = "https://huggingface.co/bahree"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Hugging Face Hub
	</span>
</a>. Local PyTorch checkpoints can be used directly for inference with the <code>inference_pytorch.py</code> script.</p>
<blockquote>
<p><strong>🔗 GitHub Repository</strong>: <a
	
		href = "https://github.com/bahree/helloLondon"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		github.com/bahree/helloLondon
	</span>
</a> - Complete training infrastructure (<code>04_training/</code>), model architecture (<code>config.py</code>), and GPU configuration (<code>08_documentation/GPU_TUNING.md</code>)</p>
<p><strong>🧱 Series Posts</strong>: <a
	
		href = "https://blog.desigeek.com/post/2025/09/building-llm-from-scratch-part1/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Part 1 – Using the Published Historical Models
	</span>
</a> | <a
	
		href = "https://blog.desigeek.com/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Part 2 – Data Collection &amp; Custom Tokenizer
	</span>
</a> | Part 3 (this post) | <a
	
		href = "https://blog.desigeek.com/post/2026/01/building-llm-from-scratch-part4-evaluation-deployment/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Part 4 – Evaluation &amp; Deployment
	</span>
</a></p></blockquote>
<blockquote>
<p><strong>🤗 Published Models</strong>: <a
	
		href = "https://huggingface.co/bahree/london-historical-slm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		SLM Model
	</span>
</a> | <a
	
		href = "https://huggingface.co/bahree/london-historical-llm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Regular Model
	</span>
</a> - Ready-to-use historical language models on Hugging Face</p></blockquote>
<blockquote>
<p><strong>📚 Book Reference</strong>: <a
	
		href = "https://a.co/d/ffzkJ7T"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Generative AI in Action
	</span>
</a> - For deeper understanding of core LLM concepts.</p></blockquote>
<h2 id="1-the-training-challenge-from-data-to-working-models">1. The Training Challenge: From Data to Working Models</h2>
<p>Now that we have our clean historical corpus and custom tokenizer from <a
	
		href = "https://blog.desigeek.com/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Part 2
	</span>
</a>, we need to transform this data into working language models. This isn&rsquo;t just about running training scripts – it&rsquo;s about designing an architecture that can learn from historical text, optimizing for the unique patterns of 1500-1850 English, and building infrastructure to handle the computational demands of language model training.</p>
<p>The challenge with historical language modeling isn&rsquo;t just having enough data - it&rsquo;s having the right architecture and training process that can learn from the complex linguistic patterns in historical texts. Unlike modern text, historical English contains archaic vocabulary, period-specific terminology, and cultural references that require specialized attention mechanisms and training strategies.</p>
<h3 id="11-high-level-training-process-overview">1.1 High-Level Training Process Overview</h3>
<p>The model training pipeline transforms our clean historical data and custom tokenizer into working language models through several key stages:</p>
<ol>
<li><em>Model Architecture Design</em> - involves a custom GPT implementation optimized for historical text patterns</li>
<li><em>GPU Configuration</em> - covers multi-GPU training with precision optimization and memory management</li>
<li><em>Training Infrastructure</em> - includes distributed training, checkpointing, and experiment tracking</li>
<li><em>Performance Optimization</em> - encompasses mixed precision, compilation, and hardware-specific tuning</li>
<li><em>Model Validation</em> - covers testing and evaluation of trained models.</li>
</ol>
<p><a href="#fig1" class="figure-ref">Figure 1</a> below illustrates this complete training pipeline:</p>
<figure class="align-center " id="fig1">
    <pre class="mermaid">graph TD
    A[📚 Clean Historical Corpus&lt;br/&gt;500M+ characters] --&gt; B[🔤 Custom Tokenizer&lt;br/&gt;30K vocab + 150+ special tokens]
    B --&gt; C[🏗️ Model Architecture&lt;br/&gt;Custom GPT for Historical Text]
    
    C --&gt; D[⚙️ GPU Configuration&lt;br/&gt;Multi-GPU + Precision Optimization]
    D --&gt; D1[Mixed Precision&lt;br/&gt;bf16/fp16]
    D --&gt; D2[Torch Compile&lt;br/&gt;JIT optimization]
    D --&gt; D3[Memory Management&lt;br/&gt;Gradient checkpointing]
    
    D1 --&gt; E[🏋️ Training Process&lt;br/&gt;60K iterations, checkpointing]
    D2 --&gt; E
    D3 --&gt; E
    
    E --&gt; E1[SLM: 117M params&lt;br/&gt;7-8 hours training]
    E --&gt; E2[Regular: 354M params&lt;br/&gt;28-32 hours training]
    
    E --&gt; E3[WandB Integration&lt;br/&gt;Experiment tracking]
    E --&gt; E4[Checkpointing&lt;br/&gt;Resume capability]
    E --&gt; E5[Multi-GPU Support&lt;br/&gt;Distributed training]
    
    E1 --&gt; F[📊 Model Evaluation&lt;br/&gt;Historical accuracy testing]
    E2 --&gt; F
    E3 --&gt; F
    E4 --&gt; F
    E5 --&gt; F
    
    F --&gt; G{Quality OK?}
    G --&gt;|Yes| H[🚀 Deployment&lt;br/&gt;Hugging Face + Local Inference]
    G --&gt;|No| I[🔄 Retrain/Adjust]
    I --&gt; E
    H --&gt; J[💬 Text Generation&lt;br/&gt;Historical language output]
    
    style A fill:#e1f5fe
    style C fill:#f3e5f5
    style H fill:#e8f5e8
    style J fill:#fff3e0</pre>
    <figcaption>Figure 1: Complete Training Pipeline</figcaption>
</figure>
<p>We will explore each of these components in detail, starting with the model architecture design, but first, let&rsquo;s discuss why I chose PyTorch as the framework for this project.</p>
<h3 id="12-using-pytorch">1.2 Using PyTorch</h3>
<p>I chose PyTorch for this project based on three key factors: educational accessibility, integration with the research ecosystem, and practical convenience. PyTorch provides many components out of the box - transformer blocks, attention layers, feed-forward networks, training loops, and CUDA support - which makes it much easier for learners building their first language model.</p>
<p>From a technical perspective, PyTorch&rsquo;s memory management and GPU optimization features, including automatic mixed precision, gradient checkpointing, and efficient attention implementations, are well suited to the resource-intensive task of training language models on historical text.</p>
<p>PyTorch&rsquo;s recent developments, such as <strong><code>torch.compile</code></strong>, <strong><code>FlashAttention</code></strong> kernels, and the <strong><code>SDPA</code></strong> operator (scaled dot-product attention), improve both speed and memory efficiency, which are critical for scaling LLMs. In our case, we&rsquo;re building a working toy example rather than scaling to production levels, but these optimizations still help keep training times reasonable on available hardware.</p>
<p><strong>What about other Frameworks?</strong></p>
<p>I also considered TensorFlow and JAX, but neither seemed right for <strong><code>helloLondon</code></strong>. TensorFlow&rsquo;s API felt too complex, especially from a beginner&rsquo;s perspective. JAX has excellent performance and a clean, functional approach, but it&rsquo;s more research-focused and has a smaller ecosystem, which would make it harder to follow along and experiment with.</p>
<h2 id="2-model-architecture-overview">2. Model Architecture Overview</h2>
<h3 id="21-understanding-the-gpt-architecture">2.1 Understanding the GPT Architecture</h3>
<p>Our custom GPT (Generative Pre-trained Transformer) is a decoder-only transformer model designed for autoregressive language modeling on historical text. The architecture consists of four core components, each serving a distinct purpose in the next-token prediction pipeline: token embeddings, position embeddings, causal self-attention mechanisms, and the language modeling head. Let us double-click into each component to understand its role and implementation.</p>
<h4 id="211-token-embeddings">2.1.1 Token Embeddings</h4>
<p>Token embeddings convert discrete token IDs from our 30,000-token historical vocabulary into dense, continuous vector representations. Each token (whether it&rsquo;s a word, subword unit, or special token) is mapped to a point in a high-dimensional space (768 dimensions for SLM, 1024 for the regular model).</p>
<p>This is implemented as a simple lookup table - <strong><code>wte = torch.nn.Embedding(config.vocab_size, config.n_embd)</code></strong>. When processing the token sequence, we look up the corresponding vector for each token ID. These embeddings are learned during training - the model learns which tokens should be close together in this vector space based on their co-occurrence patterns in historical text.</p>
<p>For historical language models, this is particularly valuable because rare historical terms (like &ldquo;yeoman&rdquo; or &ldquo;guildhall&rdquo;) get their own representations that can capture contextual relationships with related terms from that era.</p>
<h4 id="212-position-embeddings">2.1.2 Position Embeddings</h4>
<p>Position embeddings encode each token&rsquo;s absolute position within the sequence. This is crucial because, unlike recurrent models, transformer architectures have no inherent notion of temporal order or sequence position - they process all tokens in parallel. Let us double-click into why this is a problem.</p>
<p>Think of it like reading words without any sense of order. The words &ldquo;<em>The cat chased the mouse</em>&rdquo; would be indistinguishable from &ldquo;<em>Mouse the chased cat the</em>&rdquo; - you&rsquo;d see the same words but lose all meaning because you don&rsquo;t know which word came first, second, or third. Transformers face exactly this problem because they process all words simultaneously rather than sequentially, unlike older RNN models.</p>
<p>To help the model understand word order, we add position embeddings to the token embeddings. This way, each token&rsquo;s representation includes information about both &ldquo;what&rdquo; the token is and &ldquo;where&rdquo; it appears in the sequence. We use learned position embeddings (as opposed to fixed sinusoidal patterns): <strong><code>wpe = torch.nn.Embedding(config.block_size, config.n_embd)</code></strong>. For the SLM with a 512-token context window, we learn 512 different position vectors (one for each possible position). Similarly, the regular model with a 1024-token context learns 1024 position vectors.</p>
<p>Position embeddings work like giving each word a &ldquo;timestamp&rdquo; or &ldquo;address&rdquo; that tells the model where it sits in the sequence:</p>
<ol>
<li><strong>Token embedding</strong> says: &ldquo;This is the word &lsquo;cat&rsquo;&rdquo; → converts to a vector like <code>[0.2, -0.5, 0.8, ...]</code></li>
<li><strong>Position embedding</strong> says: &ldquo;This word is at position 3&rdquo; → adds another vector like <code>[0.1, 0.3, -0.2, ...]</code></li>
<li><strong>Combined</strong>: The model sees both &ldquo;what&rdquo; the word is AND &ldquo;where&rdquo; it appears</li>
</ol>
<p>The embedding vectors are combined element-wise: <strong><code>x = token_emb + position_emb</code></strong>. This allows the model to understand both what each token is (via the token embedding) and where it appears in the sequence (via the position embedding).</p>
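<p>The lookup-and-add step can be sketched in a few lines. This is a minimal illustration in plain Python rather than the project&rsquo;s PyTorch code: the toy sizes, the <code>make_table</code> helper, and the <code>embed</code> function are assumptions made for the example, with nested lists standing in for the <code>torch.nn.Embedding</code> weight tables.</p>

```python
import random

random.seed(0)

# Toy sizes for illustration only; the post uses a 30,000-token vocabulary,
# 768 dimensions and a 512-token context for the SLM.
VOCAB_SIZE, BLOCK_SIZE, N_EMBD = 10, 8, 4


def make_table(rows, cols):
    """Build a small random lookup table, standing in for nn.Embedding weights."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


wte = make_table(VOCAB_SIZE, N_EMBD)   # token embedding table
wpe = make_table(BLOCK_SIZE, N_EMBD)   # learned position embedding table


def embed(token_ids):
    """Return one vector per token: token embedding plus position embedding."""
    if len(token_ids) > BLOCK_SIZE:
        raise ValueError("sequence is longer than the context window")
    return [
        [t + p for t, p in zip(wte[tok], wpe[pos])]
        for pos, tok in enumerate(token_ids)
    ]


# The same token (id 3) at two positions gets two different input vectors.
x = embed([3, 7, 3])
print(len(x), len(x[0]))   # 3 vectors of 4 dimensions each
print(x[0] != x[2])        # True: position changes the representation
```

<p>The length check mirrors a real constraint: a learned position table has exactly <code>block_size</code> rows, so a longer sequence has no position vector to look up.</p>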
<p>Our model uses <strong>learned</strong> position embeddings, meaning during training the model discovers that:</p>
<ul>
<li>Position 1 tends to be capitalized (start of sentence)</li>
<li>Position 512 might be mid-sentence (needs different handling)</li>
<li>Certain positions in historical documents have patterns (formal openings, closings, etc.)</li>
</ul>
<p>This is different from <strong>fixed</strong> sinusoidal embeddings (used in the original Transformer paper), which use a mathematical formula to encode positions. Learned embeddings can adapt to specific patterns in the training data, which suits a fixed context window like ours; the trade-off is that they cannot represent positions beyond the trained context length.</p>
<p>In historical texts, word order is crucial for understanding meaning. Consider &ldquo;The King granted the land&rdquo; versus &ldquo;The land granted the King&rdquo; - same words, completely different meanings. Historical legal documents and Victorian-era writings often have precise word order that changes legal or semantic meaning. Position embeddings ensure the model can distinguish between these critical variations.</p>
<h4 id="213-causal-self-attention">2.1.3 Causal Self-Attention</h4>
<p>Causal self-attention is the mechanism that allows each position in the sequence to selectively attend to previous positions. The <em>&ldquo;causal&rdquo;</em> constraint ensures the model can only look at past tokens (not future ones), which is essential for autoregressive generation.</p>
<p>When you read a sentence, you naturally use context from earlier words to understand later ones. If you see &ldquo;The King granted the land to his loyal&hellip;&rdquo;, you can predict that &ldquo;servant,&rdquo; &ldquo;knight,&rdquo; or &ldquo;subject&rdquo; might come next because you remember what came before. The model needs to do the same thing - use previous words to predict the next word.</p>
<p>However, there&rsquo;s a crucial constraint: during training, when predicting word 7, the model must <em>only</em> see words 1-6, never word 7 itself or anything after it. This &ldquo;causal&rdquo; (cause-and-effect) constraint ensures the model learns realistic patterns - in the real world, you can&rsquo;t use future information to predict the present.</p>
<h5 id="how-attention-works">How Attention Works</h5>
<p>Think of attention as a sophisticated &ldquo;relevance detector&rdquo;. When the model is processing the word &ldquo;loyal&rdquo; in our example above, it needs to look back and ask: &ldquo;<em>Which previous words are most relevant for understanding this context?</em>&rdquo; The attention mechanism computes a weighted sum of all previous token representations, where the weights are determined by how relevant each previous token is to the current one.</p>
<p>This is done through three learned linear projections that create different &ldquo;views&rdquo; of each word:</p>
<ul>
<li><strong>Query (Q)</strong>: &ldquo;What am I looking for?&rdquo; - The current word asks a question</li>
<li><strong>Key (K)</strong>: &ldquo;What do I have to offer?&rdquo; - Previous words advertise their content</li>
<li><strong>Value (V)</strong>: &ldquo;What information should I contribute?&rdquo; - The actual information to pass forward</li>
</ul>
<p>Let us see a practical example to help us grok the concept. Consider the historical phrase: &ldquo;<em>The alderman of Cheapside, having served the city faithfully, was <strong>granted</strong>&hellip;</em>&rdquo;</p>
<p>When processing &ldquo;granted&rdquo; the attention mechanism:</p>
<ol>
<li>Creates a Query from &ldquo;granted&rdquo; asking &ldquo;what context do I need?&rdquo;</li>
<li>Compares this Query against Keys from all previous words</li>
<li>Finds high relevance with &ldquo;alderman&rdquo; (who is being granted something) and &ldquo;faithfully&rdquo; (why the grant is happening)</li>
<li>Uses these attention weights to pull relevant Values from those words</li>
<li>Combines this information to better understand &ldquo;granted&rdquo; in context</li>
</ol>
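<p>The attention output for token <em><strong>i</strong></em> is a weighted sum over every visible token <em><strong>j</strong></em> (positions up to and including <em><strong>i</strong></em>):</p>
<p>$$\text{Attention}(Q_i, K, V) = \sum_{j \le i} \text{softmax}_j\left(\frac{Q_i K_j^T}{\sqrt{d_k}}\right)V_j$$</p>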
<p>Breaking this down:</p>
<ul>
<li>$Q_i K_j^T$ computes how well the query from token <em><strong>i</strong></em> matches the key from token <em><strong>j</strong></em> (higher = more relevant)</li>
<li>$1/\sqrt{d_k}$ is a scaling factor that prevents scores from getting too large (which would make softmax too &ldquo;sharp&rdquo;)</li>
<li>$\text{softmax}$ converts scores into probabilities that sum to <strong>1</strong> (so each word gets a weighted &ldquo;vote&rdquo;)</li>
<li>Finally, we use these weights to combine the Values from all previous tokens</li>
</ul>
<p>The $1/\sqrt{d_k}$ scaling factor (where $d_k = 64$ for our SLM, so $\sqrt{64} = 8$) prevents the dot products from growing too large with high-dimensional embeddings, ensuring stable gradients during training. The softmax ensures the weights sum to <strong>1</strong>, creating a proper probability distribution over the previous tokens.</p>
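<p>To make the computation concrete, here is a minimal single-head sketch in plain Python. It is not the project&rsquo;s implementation: real code computes all positions at once with matrix multiplications and a mask, while this version loops over tokens for readability. The <code>causal_attention</code> function and the tiny Q, K, V values are made up for illustration.</p>

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head causal attention over lists of d_k-dimensional vectors.

    Token i attends only to tokens 0..i, so future positions never
    contribute to its output.
    """
    d_k = len(Q[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Scaled dot-product scores against the visible keys only.
        scores = [scale * sum(a * b for a, b in zip(q, K[j])) for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * V[j][d] for j, w in enumerate(weights))
            for d in range(len(V[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three tokens with made-up 2-dimensional projections.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = causal_attention(Q, K, V)
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

<p>Each printed row is one token&rsquo;s attention weights. Row <em>i</em> has only <em>i + 1</em> entries because later tokens are never scored, and every row sums to 1.</p>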
<h5 id="why-this-matters-for-historical-text">Why This Matters for Historical Text?</h5>
<p>Historical documents present unique challenges that make attention particularly valuable. Consider a legal document from 1750: <em>&ldquo;John Smith, yeoman of the parish of St. Giles, being of sound mind and body, doth hereby bequeath&hellip;&rdquo;</em></p>
<p>The attention mechanism enables the model to:</p>
<ul>
<li>Connect &ldquo;doth bequeath&rdquo; back to &ldquo;John Smith&rdquo; across multiple clauses</li>
<li>Understand that &ldquo;yeoman&rdquo; modifies &ldquo;John Smith&rdquo; even though they&rsquo;re separated</li>
<li>Learn that &ldquo;doth&rdquo; (archaic) and &ldquo;does&rdquo; (modern) serve similar grammatical functions</li>
<li>Recognize that formal legal phrasing follows specific patterns</li>
</ul>
<p>For our historical language models, this attention mechanism learns which historical terms and phrases co-occur and relate to one another contextually - crucial for understanding historical documents where terminology and phrasing differ from modern English. The model learns to attend to relevant historical context, enabling it to generate coherent text that maintains period-appropriate language patterns and references.</p>
<h4 id="214-language-modeling-head">2.1.4 Language Modeling Head</h4>
<p>The language modeling head (also called the &ldquo;output projection&rdquo; or <strong>lm_head</strong>) is the final translator that turns the rich internal representation (after all the attention + MLP refinements) back into a decision: &ldquo;Given everything I&rsquo;ve seen so far, what is the most likely next token?&rdquo; It does this by mapping each hidden vector at every position into a vector of length equal to the vocabulary size (30,000 in our historical tokenizer). Each element of that output vector is a <em>logit</em> - an unnormalized score indicating how likely the model thinks the token is to be the next one.</p>
<p>Implementation is intentionally simple: <strong><code>lm_head = torch.nn.Linear(n_embd, vocab_size)</code></strong>. We don&rsquo;t put an activation function after it because we want raw, unconstrained scores. Those scores then flow into:</p>
<ul>
<li><strong>Inference:</strong> Apply softmax -&gt; probabilities -&gt; sample or greedy pick</li>
<li><strong>Training:</strong> Feed logits + target token IDs into cross-entropy loss -&gt; gradients flow backward</li>
</ul>
<p>You can think of logits as <em>evidence totals</em>. The softmax transforms those evidence values into a normalized probability distribution that the model can sample from. High logit = more supporting evidence; low logit = less.</p>
<p><strong>Step-by-step (Inference vs Training):</strong></p>
<ol>
<li>Hidden state at last position (e.g., index 511) enters <code>lm_head</code>.</li>
<li>Linear projection produces a 30,000-dimensional logit vector.</li>
<li>In inference: <code>probs = softmax(logits / temperature)</code>; optionally apply <code>top-k</code>/<code>top-p</code> filtering.</li>
<li>Sample (or argmax) a token → append to sequence → repeat.</li>
<li>In training: Cross-entropy compares logits to the true next token; loss scalar backpropagates through the head into all prior layers.</li>
</ol>
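<p>The inference-side steps (scale by temperature, optionally filter, then sample) can be sketched as follows. This is an illustrative plain-Python version under stated assumptions, not the project&rsquo;s generation code: the <code>next_token_probs</code> and <code>sample</code> helpers and the five made-up logits are hypothetical, and only top-k filtering is shown.</p>

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn raw logits into a sampling distribution.

    temperature below 1 sharpens the distribution, above 1 flattens it.
    top_k, when set, keeps only the k highest-scoring tokens.
    """
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        # Tokens below the cutoff get -inf, which softmax maps to probability 0.
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    return softmax(scaled)


def sample(probs, rng=random):
    """Draw one token id according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical 5-token vocabulary with made-up logits.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
print([round(p, 3) for p in next_token_probs(logits)])
print([round(p, 3) for p in next_token_probs(logits, temperature=0.5)])
print([round(p, 3) for p in next_token_probs(logits, top_k=2)])
print(sample(next_token_probs(logits, temperature=0.8, top_k=3)))
```

<p>Note that the filtering happens on the scores before softmax, so the surviving tokens are automatically renormalized. If several tokens tie at the cutoff score, this simple version keeps all of them.</p>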
<p>Because our vocabulary mixes common function words (&ldquo;the&rdquo;, &ldquo;and&rdquo;, &ldquo;of&rdquo;) with rare era-specific tokens (&ldquo;yeoman&rdquo;, &ldquo;guildhall&rdquo;, &ldquo;paternoster&rdquo;, &ldquo;quoth&rdquo;), the head must reliably distinguish both frequent and infrequent patterns. Rare historical tokens need <em>consistent</em> representations from embedding -&gt; transformer -&gt; head so they are not forgotten. If their logits remained perpetually low, the model would never learn to generate them in authentic contexts.</p>
<p><strong>Logits (not probabilities) inside the model</strong>: We retain logits (raw, unnormalized scores) instead of immediately converting to probabilities for three reasons. First, they yield numerically stable loss computation, because PyTorch&rsquo;s cross-entropy loss fuses <code>log_softmax</code> with negative log-likelihood. Second, they allow cleaner gradient flow, since softmax is invoked only when a distribution is actually needed. Third, they enable flexible post-processing (temperature scaling, top-k or top-p filtering, repetition penalties) directly in score space without an extra probability recomputation step.</p>
<p>We reuse the input embedding matrix for the output projection to keep input and output semantics aligned and reduce parameter and memory traffic. This concept is called weight tying, which we will cover in detail in <a
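<p>The numerical-stability point is easy to demonstrate. The sketch below is plain Python written for this illustration, not PyTorch&rsquo;s actual kernel: it compares a loss computed directly from logits via a stable log-softmax against a naive &ldquo;softmax first, log second&rdquo; version. The function names and logit values are made up.</p>

```python
import math


def log_softmax(logits):
    """Stable log-softmax: subtract the max before exponentiating."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return [l - log_total for l in logits]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target token, computed from logits."""
    return -log_softmax(logits)[target]


def naive_cross_entropy(logits, target):
    """Softmax first, log second: overflows once logits get large."""
    exps = [math.exp(l) for l in logits]
    return -math.log(exps[target] / sum(exps))


small = [2.0, 1.0, 0.1]
print(round(cross_entropy(small, 0), 4), round(naive_cross_entropy(small, 0), 4))

large = [1000.0, 10.0, -5.0]
print(cross_entropy(large, 0))          # finite, very close to 0
try:
    naive_cross_entropy(large, 0)
except OverflowError as err:
    print("naive version failed:", err)
```

<p>For small logits the two versions agree. For large logits only the version that works in log space survives, which is why the loss is fed logits rather than probabilities.</p>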
	
		href = "#-222-weight-tying"
	

	

	>
	
	<span>
		Section 2.2.2  -  Weight Tying
	</span>
</a>.</p>
<p>We share the embedding and output projection weights (<code>self.transformer.wte.weight = self.lm_head.weight</code>) so input token interpretation and next-token scoring occur in the <em>same semantic space</em>.</p>
<p>Using the shared embedding matrix $E$ (shape $(V,d)$), the logits are computed with $\text{logits} = h \cdot E^T$, reusing the same rows used for token lookup. This saves parameters (~23.0M SLM, ~30.7M Regular), keeps gradients for rare historical tokens coupled, and reduces memory traffic (Press &amp; Wolf, 2017; Inan et al., 2016). See Section 2.2.2 for detailed mechanics, benefits, and the historian/scribe analogy.</p>
<p>In short, the <code>lm_head</code> converts rich contextual understanding into next-token scores; with weight tying (details in Section 2.2.2) it stays efficient and semantically consistent.</p>
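<p>The parameter savings quoted above follow directly from the matrix shape, and the tied projection itself is just a dot product against the embedding rows. The snippet below is a small plain-Python check of both. The <code>tied_logits</code> helper and the toy 3 x 2 matrix are hypothetical, introduced only for illustration.</p>

```python
# Sizes taken from the post: 30,000-token vocabulary, 768 / 1024 dimensions.
VOCAB_SIZE = 30_000
CONFIGS = {"SLM": 768, "Regular": 1024}

for name, n_embd in CONFIGS.items():
    saved = VOCAB_SIZE * n_embd  # one V x d matrix is not duplicated
    print(f"{name}: {saved:,} parameters saved (~{saved / 1e6:.1f}M)")


def tied_logits(h, E):
    """Score every vocabulary row of E against hidden vector h (h . E^T)."""
    return [sum(hi * ei for hi, ei in zip(h, row)) for row in E]


# Toy shared matrix: 3 tokens, 2 dimensions. The same rows serve as the
# input lookup table and as the output projection.
E = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h = E[1]                      # pretend the hidden state equals token 1's embedding
print(tied_logits(h, E))      # token 1 scores highest against its own row
```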
<h4 id="215-the-complete-flow">2.1.5 The Complete Flow</h4>
<p>The complete forward pass through our GPT model works as follows:</p>
<ol>
<li><strong>Input</strong>: A sequence of token IDs (batch × sequence_length, e.g., 512 tokens)</li>
<li><strong>Token Embedding</strong>: Convert each token ID to a dense vector (768 or 1024 dimensions)</li>
<li><strong>Position Embedding</strong>: Add position information to each token</li>
<li><strong>Transformer Blocks</strong>: Pass through n_layer transformer blocks (12 for SLM, 24 for regular model), each containing:
<ul>
<li>Layer normalization</li>
<li>Causal self-attention (with multiple heads)</li>
<li>Residual connection</li>
<li>Layer normalization</li>
<li>Feed-forward MLP</li>
<li>Residual connection</li>
</ul>
</li>
<li><strong>Final Layer Norm</strong>: Normalize the final hidden states</li>
<li><strong>Language Head</strong>: Project to vocabulary logits (30,000 dimensions)</li>
<li><strong>Output</strong>: Softmax over the logits gives a probability distribution over the next token</li>
</ol>
<p>This architecture design is conventional and follows the GPT-style pattern established by OpenAI&rsquo;s GPT models. The traditional design is intentional - it allows for clear, educational learning from the implementation while being configured to work seamlessly with our historical tokenizer from Part 2.</p>
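<p>The ordering inside each block (normalize, transform, add back) can be sketched without any framework. This is a simplified plain-Python illustration, not the project&rsquo;s code: it processes a single position&rsquo;s vector, so the stand-in &ldquo;attention&rdquo; cannot mix information across tokens, and <code>layer_norm</code> omits the learned scale and shift. The names <code>transformer_block</code>, <code>forward</code>, <code>halve</code>, and <code>zero</code> are hypothetical.</p>

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    return [u + v for u, v in zip(a, b)]


def transformer_block(x, attention, mlp):
    """Pre-norm block: normalize, transform, then add the result back to the input."""
    x = add(x, attention(layer_norm(x)))   # LayerNorm, attention, residual
    x = add(x, mlp(layer_norm(x)))         # LayerNorm, MLP, residual
    return x


def forward(x, blocks):
    """Run a hidden vector through every block, then the final layer norm."""
    for attention, mlp in blocks:
        x = transformer_block(x, attention, mlp)
    return layer_norm(x)


# Stand-in sublayers (hypothetical): real ones are learned attention and MLP modules.
halve = lambda v: [0.5 * t for t in v]
zero = lambda v: [0.0 for _ in v]

hidden = [1.0, 2.0, 3.0, 4.0]
print(transformer_block(hidden, zero, zero))   # residual path alone: input unchanged
print([round(t, 3) for t in forward(hidden, [(halve, zero)] * 2)])
```

<p>The first print shows why residual connections matter: when a sublayer contributes nothing, the input passes through untouched, so each block only has to learn a refinement on top of what it receives.</p>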
<p><a href="#fig2" class="figure-ref">Figure 2</a> below illustrates the complete architecture for the SLM:</p>
<figure class="align-center " id="fig2">
    <pre class="mermaid">graph TD
    A[Input Tokens&lt;br/&gt;512 tokens] --&gt; B[Token Embedding&lt;br/&gt;30K vocab → 768 dim]
    A --&gt; C[Position Embedding&lt;br/&gt;512 pos → 768 dim]
    B --&gt; D[Add Embeddings]
    C --&gt; D
    D --&gt; E[Layer Norm]
    E --&gt; F[Transformer Block 1&lt;br/&gt;12 heads, 768 dim]
    F --&gt; G[Transformer Block 2&lt;br/&gt;12 heads, 768 dim]
    G --&gt; H[...]
    H --&gt; I[Transformer Block 12&lt;br/&gt;12 heads, 768 dim]
    I --&gt; J[Final Layer Norm]
    J --&gt; K[Language Head&lt;br/&gt;768 → 30K vocab]
    K --&gt; L[Output Logits&lt;br/&gt;30K scores]
    
    subgraph &#34;Key Specifications&#34;
        M[Layers: 12&lt;br/&gt;Heads: 12&lt;br/&gt;Embedding: 768&lt;br/&gt;Context: 512&lt;br/&gt;Parameters: 117M&lt;br/&gt;Training: 7-8 hours&lt;br/&gt;MFU: 8-9%]
    end
    
    style A fill:#e1f5fe
    style L fill:#e8f5e8
    style M fill:#fff3e0</pre>
    <figcaption>Figure 2: SLM Architecture (117M Parameters)</figcaption>
</figure>
<p>The Regular model, as shown in <a href="#fig3" class="figure-ref">Figure 3</a> below, follows the same architectural pattern as the SLM but with increased capacity: 24 transformer layers instead of 12, 16 attention heads instead of 12, and 1024-dimensional embeddings instead of 768.</p>
<p>This represents a ~3x increase in parameters (354M vs 117M), twice as many transformer layers, ~33% more attention heads, and ~33% larger embedding dimensions, providing significantly more capacity for learning complex historical language patterns.</p>
<figure class="align-center " id="fig3">
    <pre class="mermaid">graph TD
    A[Input Tokens&lt;br/&gt;1024 tokens] --&gt; B[Token Embedding&lt;br/&gt;30K vocab → 1024 dim]
    A --&gt; C[Position Embedding&lt;br/&gt;1024 pos → 1024 dim]
    B --&gt; D[Add Embeddings]
    C --&gt; D
    D --&gt; E[Layer Norm]
    E --&gt; F[Transformer Block 1&lt;br/&gt;16 heads, 1024 dim]
    F --&gt; G[Transformer Block 2&lt;br/&gt;16 heads, 1024 dim]
    G --&gt; H[...]
    H --&gt; I[Transformer Block 24&lt;br/&gt;16 heads, 1024 dim]
    I --&gt; J[Final Layer Norm]
    J --&gt; K[Language Head&lt;br/&gt;1024 → 30K vocab]
    K --&gt; L[Output Logits&lt;br/&gt;30K scores]
    
    subgraph &#34;Key Specifications&#34;
        M[Layers: 24&lt;br/&gt;Heads: 16&lt;br/&gt;Embedding: 1024&lt;br/&gt;Context: 1024&lt;br/&gt;Parameters: 354M&lt;br/&gt;Training: 28-32 hours&lt;br/&gt;MFU: 15-20%]
    end
    
    style A fill:#e1f5fe
    style L fill:#e8f5e8
    style M fill:#fff3e0</pre>
    <figcaption>Figure 3: Regular Model Architecture (354M Parameters)</figcaption>
</figure>
<h3 id="22-simplegpt-class">2.2 SimpleGPT Class</h3>
<p>Now that we&rsquo;ve covered the theory behind the GPT architecture, let&rsquo;s examine the actual implementation. The <strong><code>SimpleGPT</code></strong> class is at the heart of our implementation - it brings together all the components discussed in section 2.1 into a working language model. The class inherits from PyTorch&rsquo;s <strong><code>torch.nn.Module</code></strong>, which is the base class for all neural network components in PyTorch. This gives us access to automatic differentiation, GPU support, and other PyTorch features.</p>
<h4 id="221-the-__init__-method">2.2.1 The <code>__init__</code> method</h4>
<p>The <code>__init__</code> method is the constructor that assembles our entire language model from individual components. First, it stores all hyperparameters (such as vocabulary size, embedding dimensions, and the number of layers) in a configuration object that the rest of the model can reference. Next, it creates the embedding layers - one that converts our 30,000 historical tokens into dense vectors, and another that encodes position information so that the model knows where each token appears in the sequence.</p>
<p>Then it builds the transformer blocks - the core processing units that do the heavy lifting. Each block contains self-attention mechanisms and feed-forward networks that learn to understand relationships between words. The method also initializes the language modeling head, the final layer that converts the internal representation back into a score (logit) for every token in the vocabulary, from which the next-token probabilities are derived.</p>
<p>Finally, it sets up proper weight initialization to ensure the model starts with good random weights (not too big, not too small), and implements weight tying between the input embeddings and the output layer. This clever technique reduces the number of parameters while improving training efficiency by sharing weights between the first and last layers.</p>
<p>Initialization matters because if the initial weights are too large, the model&rsquo;s gradients can explode during training, leading to unstable learning. If they are too small, gradients can vanish, leaving the model unable to learn. Our initialization ensures the model begins in the &ldquo;Goldilocks zone&rdquo; - just right for effective learning. Without this, even a perfectly designed architecture might fail to train properly.</p>
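<p>To make the &ldquo;too big, too small&rdquo; point concrete, here is a toy sketch in plain Python. It is not the model&rsquo;s <code>_init_weights</code> method (which is not shown here); it only pushes a random vector through a stack of random linear layers and reports how much the signal grows or shrinks. The function name, dimensions, and standard deviations are assumptions chosen for illustration. Gradients behave analogously, since backpropagation multiplies by the same matrices.</p>

```python
import math
import random


def forward_norm(std, dim=32, depth=6, seed=0):
    """Push a random vector through `depth` random linear layers and
    return the ratio of the final norm to the starting norm."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    start = math.sqrt(sum(v * v for v in x))
    for _ in range(depth):
        # Fresh weight matrix with entries drawn from N(0, std^2).
        w = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(dim)]
        x = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]
    return math.sqrt(sum(v * v for v in x)) / start


dim = 32
too_large = forward_norm(std=1.0, dim=dim)
too_small = forward_norm(std=0.01, dim=dim)
balanced = forward_norm(std=1.0 / math.sqrt(dim), dim=dim)

print(f"std=1.0         -> norm ratio {too_large:.3e}")
print(f"std=0.01        -> norm ratio {too_small:.3e}")
print(f"std=1/sqrt(dim) -> norm ratio {balanced:.3e}")
```

<p>With a standard deviation near 1/sqrt(dim), the signal keeps roughly the same scale from layer to layer; larger values blow it up and smaller values shrink it toward zero after only a handful of layers.</p>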
<p>Now, let me show you the actual implementation. The code in <a href="#listing1" class="listing-ref">Listing 1</a> demonstrates how we implement the core GPT architecture:</p>
<figure id="listing1"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">class</span> <span style="color:#eed49f">SimpleGPT</span>(torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Simple GPT model based on nanoGPT
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">    
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">    This class implements a decoder-only transformer model optimized for 
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">    historical text generation. It inherits from PyTorch&#39;s Module class
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">    to get automatic differentiation and GPU support.
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">    &#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">__init__</span>(<span style="color:#91d7e3">self</span>, config):
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">super</span>()<span style="color:#91d7e3;font-weight:bold">.</span><span style="color:#8aadf4">__init__</span>()  <span style="color:#6e738d;font-style:italic"># Initialize the parent PyTorch Module class</span>
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>config <span style="color:#91d7e3;font-weight:bold">=</span> config  <span style="color:#6e738d;font-style:italic"># Store all model hyperparameters</span>
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Create the main transformer components using ModuleDict</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># ModuleDict allows us to organize related layers together</span>
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>transformer <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>ModuleDict(<span style="color:#91d7e3">dict</span>(
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Token Embedding Layer (wte = &#34;word token embedding&#34;)</span>
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Converts each token ID to a high-dimensional vector</span>
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Input: token IDs (integers 0 to vocab_size-1)</span>
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Output: dense vectors of size $n_{embd}$ (e.g., 768 dimensions)</span>
</span></span><span style="display:flex;"><span>            wte <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Embedding(config<span style="color:#91d7e3;font-weight:bold">.</span>vocab_size, config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd),
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Position Embedding Layer (wpe = &#34;word position embedding&#34;) </span>
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Encodes where each token appears in the sequence</span>
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Input: position indices (0 to block_size-1)</span>
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Output: dense vectors of size $n_{embd}$</span>
</span></span><span style="display:flex;"><span>            wpe <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Embedding(config<span style="color:#91d7e3;font-weight:bold">.</span>block_size, config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd),
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Dropout Layer for regularization</span>
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Randomly sets some inputs to zero during training to prevent overfitting</span>
</span></span><span style="display:flex;"><span>            drop <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Dropout(config<span style="color:#91d7e3;font-weight:bold">.</span>dropout),
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Stack of Transformer Blocks (h = &#34;hidden layers&#34;)</span>
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Each SimpleBlock contains self-attention and feed-forward layers</span>
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># We create n_layer blocks (e.g., 12 for SLM, 24 for regular model)</span>
</span></span><span style="display:flex;"><span>            h <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>ModuleList([SimpleBlock(config) <span style="color:#c6a0f6">for</span> _ <span style="color:#91d7e3;font-weight:bold">in</span> <span style="color:#91d7e3">range</span>(config<span style="color:#91d7e3;font-weight:bold">.</span>n_layer)]),
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Final Layer Normalization</span>
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Normalizes the output before the language modeling head</span>
</span></span><span style="display:flex;"><span>            ln_f <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>LayerNorm(config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd, bias<span style="color:#91d7e3;font-weight:bold">=</span>config<span style="color:#91d7e3;font-weight:bold">.</span>bias),
</span></span><span style="display:flex;"><span>        ))
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Language Modeling Head</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Converts the final hidden states back to vocabulary space</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Input: hidden states of size $n_{embd}$</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Output: logits for each token in vocabulary ($vocab_{size}$ logits)</span>
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>lm_head <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Linear(config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd, config<span style="color:#91d7e3;font-weight:bold">.</span>vocab_size, bias<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">False</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Initialize all weights using our custom initialization method</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># This ensures the model starts with good random weights</span>
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>apply(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>_init_weights)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Weight Tying: Share weights between input embeddings and output layer</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># This technique improves training efficiency and model performance</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># by ensuring the same representation space is used for input and output</span>
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>transformer<span style="color:#91d7e3;font-weight:bold">.</span>wte<span style="color:#91d7e3;font-weight:bold">.</span>weight <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>lm_head<span style="color:#91d7e3;font-weight:bold">.</span>weight</span></span></code></pre></div><figcaption>
        <strong>Listing 1: SimpleGPT Model Architecture</strong>
    </figcaption>
</figure>
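<p>Listing 1 reads everything from a <code>config</code> object. A minimal sketch of what such an object could look like is shown below, using the field names that appear in the listing. The class name <code>GPTConfig</code>, the <code>n_head</code> field, and the <code>dropout</code> and <code>bias</code> defaults are assumptions for illustration, not values taken from the helloLondon code.</p>

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:  # hypothetical name; Listing 1 only shows a `config` object
    vocab_size: int = 30_000
    block_size: int = 512    # context length (maximum number of positions)
    n_layer: int = 12
    n_head: int = 12         # assumed field name, not shown in Listing 1
    n_embd: int = 768
    dropout: float = 0.1     # placeholder value, not taken from the post
    bias: bool = False       # placeholder value, not taken from the post

    def __post_init__(self):
        # Each attention head works on an equal slice of the embedding.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")

    @property
    def head_dim(self):
        return self.n_embd // self.n_head


slm_config = GPTConfig()
regular_config = GPTConfig(block_size=1024, n_layer=24, n_head=16, n_embd=1024)
print(slm_config.head_dim, regular_config.head_dim)
```

<p>Both presets give 64 dimensions per attention head (768 / 12 and 1024 / 16), which is why the head count and the embedding size scale together between the two models.</p>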
<h4 id="222-weight-tying">2.2.2 Weight Tying</h4>
<p>The tied weights between the embedding layer and language modeling head (<code>self.transformer.wte.weight = self.lm_head.weight</code>) are a crucial optimization for our historical language model. In a typical neural network, you&rsquo;d have two separate weight matrices - one for converting input tokens to embeddings, and another for converting hidden states back to vocabulary probabilities. Weight tying means we use the <em>same</em> weight matrix for both operations.</p>
<p>Think of it like this: instead of having two different dictionaries (one for reading, one for writing), we use the same dictionary for both. The same table that maps &ldquo;alderman&rdquo; → [0.2, -0.5, 0.8, &hellip;] is used whether the model is reading &ldquo;alderman&rdquo; as input or trying to generate &ldquo;alderman&rdquo; as output.</p>
<p>Without weight tying, the model could learn that &ldquo;alderman&rdquo; means one thing when it sees it as input, but something slightly different when it tries to generate it as output. This matters most for historical vocabulary, which contains many rare terms such as &ldquo;quoth&rdquo;, &ldquo;alderman&rdquo;, and &ldquo;paternoster&rdquo; that appear infrequently in our training data. With so few examples, two separate matrices can drift apart, and the model may struggle to use words it has seen before.</p>
<p>With weight tying, the representation the model learns when it reads &ldquo;alderman&rdquo; in the input is the same one it scores against when it needs to generate &ldquo;alderman&rdquo; in the output, ensuring consistency and improving the model&rsquo;s ability to generate coherent historical language with proper terminology.</p>
<p><strong>Mechanics (matrix reuse)</strong>  A <em>single</em> matrix $E \in \mathbb{R}^{V \times d}$ (vocabulary size $V$, embedding dimension $d$) serves both roles: row lookup for input embeddings, and a dot product against every row for output scoring. The language head reuses it to compute logits via:</p>
<p>$$\text{logits} = h \cdot E^T$$</p>
<p>where $h$ is the hidden state at each position. This keeps the input interpretation and output prediction within the same semantic geometry - no second projection to drift or disagree.</p>
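<p>The two roles of the shared matrix can be sketched with plain Python lists instead of PyTorch tensors, so the example runs without dependencies. The four-word vocabulary and the numbers in <code>E</code> are toy values invented for illustration, and the hidden state is simply set to the input embedding, which a real model would not do.</p>

```python
VOCAB = ["alderman", "quoth", "the", "yeoman"]

# One shared table E with V=4 rows and d=3 columns (toy values).
E = [
    [0.2, -0.5, 0.8],   # alderman
    [1.0, 0.0, -1.0],   # quoth
    [0.3, 0.3, 0.3],    # the
    [-0.7, 0.4, 0.1],   # yeoman
]


def embed(token_id):
    """Input side: look up one row of E."""
    return E[token_id]


def output_logits(h):
    """Output side: logits = h . E^T, one dot product per vocabulary row."""
    return [sum(h_i * e_i for h_i, e_i in zip(h, row)) for row in E]


# Pretend the transformer stack returned the embedding unchanged.
h = embed(VOCAB.index("alderman"))
logits = output_logits(h)
best = max(range(len(VOCAB)), key=lambda i: logits[i])
print([round(v, 2) for v in logits], VOCAB[best])
```

<p>Because both functions read the same table, any update to the &ldquo;alderman&rdquo; row changes how the word is read and how it is scored at the same time. In this toy case the highest logit lands on &ldquo;alderman&rdquo; itself; in a trained model the hidden state has been transformed by the blocks, so the top-scoring row is the predicted next token instead.</p>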
<p><strong>Why it helps</strong> Parameter savings (~23.0M SLM, ~30.7M Regular) lower memory footprint and bandwidth. Gradients for predicting a rare token (e.g., <em>yeoman</em>, <em>guildhall</em>, <em>paternoster</em>) directly refine the very rows used to embed it on future inputs - improving both recall and generation. Shared weights mildly regularize against the two spaces drifting apart and empirically improve perplexity for mid-scale autoregressive models (Press &amp; Wolf, 2017; Inan et al., 2016).</p>
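<p>The parameter savings quoted above follow directly from the size of one vocabulary-by-embedding matrix. A short sketch of the arithmetic, using the 30,000-token vocabulary and the two embedding sizes from this post (the helper name <code>tied_savings</code> is made up for illustration):</p>

```python
def tied_savings(vocab_size, n_embd):
    """Parameters saved by sharing one V x d matrix instead of keeping two."""
    return vocab_size * n_embd


for name, n_embd in [("SLM", 768), ("Regular", 1024)]:
    saved = tied_savings(30_000, n_embd)
    print(f"{name}: {saved / 1e6:.2f}M parameters saved")
```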
<p><strong>Analogy</strong> If the transformer stack is a panel of historians debating context, the language modeling head is the scribe choosing the next historically plausible word. Weight tying means the scribe and historians consult the <em>same dictionary</em> - no translation mismatch between how words are read and how they&rsquo;re proposed.</p>
<p><strong>Practical notes</strong> Avoid inflating vocabulary unnecessarily (cost scales with $V$); tied weights do not remove the need for careful rare token coverage in the corpus; and if later adding adapters or LoRA heads, remember that tying interacts with how those layers inject low-rank updates.</p>
<p>Now that we understand how the model efficiently handles vocabulary, let&rsquo;s examine the core processing units that transform these embeddings into meaningful representations.</p>
<h3 id="23-transformer-block-design">2.3 Transformer Block Design</h3>
<p>Each transformer block implements the standard attention and feed-forward pattern, but with optimizations for historical text processing. Let us look at the code real quick and then get into a little more detail.</p>
<figure id="listing2"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">class</span> <span style="color:#eed49f">SimpleBlock</span>(torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Simple transformer block&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">__init__</span>(<span style="color:#91d7e3">self</span>, config):
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">super</span>()<span style="color:#91d7e3;font-weight:bold">.</span><span style="color:#8aadf4">__init__</span>()
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ln_1 <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>LayerNorm(config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd, bias<span style="color:#91d7e3;font-weight:bold">=</span>config<span style="color:#91d7e3;font-weight:bold">.</span>bias)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>attn <span style="color:#91d7e3;font-weight:bold">=</span> SimpleCausalSelfAttention(config)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ln_2 <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>LayerNorm(config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd, bias<span style="color:#91d7e3;font-weight:bold">=</span>config<span style="color:#91d7e3;font-weight:bold">.</span>bias)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>mlp <span style="color:#91d7e3;font-weight:bold">=</span> SimpleMLP(config)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">forward</span>(<span style="color:#91d7e3">self</span>, x):
</span></span><span style="display:flex;"><span>        x <span style="color:#91d7e3;font-weight:bold">=</span> x <span style="color:#91d7e3;font-weight:bold">+</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>attn(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ln_1(x))
</span></span><span style="display:flex;"><span>        x <span style="color:#91d7e3;font-weight:bold">=</span> x <span style="color:#91d7e3;font-weight:bold">+</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>mlp(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ln_2(x))
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> x</span></span></code></pre></div><figcaption>
        <strong>Listing 2: Transformer Block Implementation</strong>
    </figcaption>
</figure>
<p>This code implements a single transformer block, which, as we know, is the fundamental building unit of our GPT model.</p>
<h4 id="231-self-attention-step">2.3.1 Self-Attention Step</h4>
<p>The self-attention step (<strong><code>x = x + self.attn(self.ln_1(x))</code></strong>) first normalizes the input with LayerNorm, then applies self-attention to model relationships between tokens. The <code>+</code> creates a &ldquo;residual connection&rdquo; that helps information and gradients flow through the network.</p>
<p>As we discussed in <a
	
		href = "#214-gpt-architecture-components"
	

	

	>
	
	<span>
		section 2.1.4
	</span>
</a>, self-attention is the &ldquo;magic&rdquo; of transformers, allowing each token to decide how much attention to pay to every other token in the sequence. Our implementation uses multiple attention heads (12 for SLM, 16 for Regular) that operate in parallel, with each head learning to focus on different types of relationships - syntactic, semantic, or positional. Causal masking ensures that during training, the model learns to predict the next token based solely on the preceding context, which is essential for coherent text generation.</p>
<p>The residual connection (<strong><code>+</code></strong>) is crucial, as it allows the model to preserve the original token representation while adding contextual information from the attention. The pre-normalization approach (LayerNorm before attention) provides more stable training than post-normalization, especially important when working with the varied linguistic patterns found in historical text.</p>
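<p>Causal masking is easier to see on a tiny example. The sketch below is a single-head toy version in plain Python, not the <code>SimpleCausalSelfAttention</code> class used by the model: it skips the learned query, key, and value projections and uses the token vectors directly, which is an assumption made purely to keep the example short.</p>

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(x):
    """Single-head causal self-attention over a list of token vectors.
    Toy simplification: queries, keys, and values are all x itself."""
    dim = len(x[0])
    weights, outputs = [], []
    for t, query in enumerate(x):
        # Position t may only score positions 0..t (the causal mask).
        scores = [sum(q * k for q, k in zip(query, x[s])) / math.sqrt(dim)
                  for s in range(t + 1)]
        row = softmax(scores) + [0.0] * (len(x) - t - 1)
        weights.append(row)
        outputs.append([sum(row[s] * x[s][j] for s in range(len(x)))
                        for j in range(dim)])
    return weights, outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = causal_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

<p>Each printed row is one position&rsquo;s attention distribution. Everything to the right of the diagonal is zero, so the first token can only attend to itself and later tokens never leak into earlier predictions.</p>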
<h4 id="232-feed-forward-step">2.3.2 Feed-Forward Step</h4>
<p>After attention, we have the feed-forward step (<strong><code>x = x + self.mlp(self.ln_2(x))</code></strong>), which first normalizes the attended information with LayerNorm, then passes it through a multi-layer perceptron (MLP) that transforms and processes it.</p>
<p>The MLP consists of two linear layers with a non-linear activation function (GELU in our case) between them, allowing the model to learn complex non-linear transformations of the attended features. This step is crucial because attention mostly mixes information across positions: once the attention weights are computed, each output is just a weighted average of value vectors. The feed-forward network adds a per-token non-linearity, enabling the model to learn complex patterns and relationships in the historical text. Another residual connection preserves the incoming information, ensuring that the model can always fall back to the pre-MLP representation if needed.</p>
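<p>Both steps share the same shape: normalize, apply a sublayer, add the result back. A minimal sketch of that pattern for a single token vector, in plain Python, is shown below. The <code>layer_norm</code> here has no learned scale or shift, and the two stand-in sublayers are invented for illustration.</p>

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance
    (no learned scale or shift in this sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_residual(x, sublayer):
    """x + sublayer(layer_norm(x)): the pattern used twice in SimpleBlock.forward."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]


def silent(h):
    """A sublayer that contributes nothing."""
    return [0.0 for _ in h]


def halve(h):
    """A stand-in for attention or the MLP."""
    return [0.5 * v for v in h]


x = [2.0, 4.0, 6.0, 8.0]
print(pre_norm_residual(x, silent))   # the input passes through unchanged
print(pre_norm_residual(x, halve))
```

<p>The first call shows why the residual path is a safe default: if a sublayer has nothing useful to add, the block simply hands its input on to the next layer.</p>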
<h4 id="233-understanding-the-feed-forward-mlp-sublayer">2.3.3 Understanding the Feed-Forward (MLP) Sublayer</h4>
<p>In the <code>SimpleBlock</code> code above (<a href="#listing2" class="listing-ref">Listing 2</a>), you see the line <code>self.mlp = SimpleMLP(config)</code>. After attention has mixed information across positions, the model passes each token embedding through a position-wise feed-forward network (the MLP). Unlike attention, it does not look at other tokens; it refines the representation of each token independently, given the contextualized features attention just produced. In practice, this is where raw contextual patterns are distilled into richer semantic, stylistic, and morphological signals.</p>
<p><a href="#fig4" class="figure-ref">Figure 4</a> below visualizes how a single transformer block routes data through normalization, attention, and the feed-forward expansion/contraction before returning an upgraded representation via the residual path:</p>
<figure class="align-center " id="fig4">
    <pre class="mermaid">graph TB
    A[Input Embeddings&lt;br/&gt;batch, seq, emb] --&gt; LN1[LayerNorm 1]
    LN1 --&gt; ATTN[Multi-Head Attention&lt;br/&gt;query, key, value]
    ATTN --&gt; DROPA[Dropout]
    DROPA --&gt; RES1[Residual Add&lt;br/&gt;x + attn_out]
    RES1 --&gt; LN2[LayerNorm 2]
    RES1 --&gt; RES2
    LN2 --&gt; EXPAND[Linear Expand&lt;br/&gt;emb → 4*emb]
    EXPAND --&gt; GELU[GELU Activation]
    GELU --&gt; PROJECT[Linear Project&lt;br/&gt;4*emb → emb]
    PROJECT --&gt; DROPM[Dropout]
    DROPM --&gt; RES2[Residual Add&lt;br/&gt;res1 + mlp_out]
    RES2 --&gt; OUT[Block Output&lt;br/&gt;Updated embeddings]
    style A fill:#e1f5fe
    style ATTN fill:#f3e5f5
    style EXPAND fill:#fff3e0
    style PROJECT fill:#fff3e0
    style RES2 fill:#e8f5e8</pre>
    <figcaption>Figure 4: Internal Flow of a Transformer Block</figcaption>
</figure>
<p>Conceptually, the MLP is a two-step projection: first, an expansion into a higher-dimensional &ldquo;workspace&rdquo; with a non-linear activation, then a projection back down so the residual can safely merge with the original stream.</p>
<p>For our SLM, 768 dimensions expand to 3072 and then contract back to 768; for the larger model, 1024 dimensions expand to 4096. This temporary widening allows the network to express combinations of features that a purely linear transform could not capture. It is the difference between merely routing information and actually transforming it.</p>
<p>Here is the representative structure shown in <a href="#listing3" class="listing-ref">Listing 3</a>:</p>
<figure id="listing3"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">class</span> <span style="color:#eed49f">SimpleMLP</span>(torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">__init__</span>(<span style="color:#91d7e3">self</span>, config):
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">super</span>()<span style="color:#91d7e3;font-weight:bold">.</span><span style="color:#8aadf4">__init__</span>()
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>fc_in  <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Linear(config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd, <span style="color:#f5a97f">4</span> <span style="color:#91d7e3;font-weight:bold">*</span> config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>act    <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>GELU()
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>fc_out <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Linear(<span style="color:#f5a97f">4</span> <span style="color:#91d7e3;font-weight:bold">*</span> config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd, config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>drop   <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Dropout(config<span style="color:#91d7e3;font-weight:bold">.</span>dropout)
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">forward</span>(<span style="color:#91d7e3">self</span>, x):
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>drop(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>fc_out(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>act(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>fc_in(x))))</span></span></code></pre></div><figcaption>
        <strong>Listing 3: Feed-Forward (MLP) Sublayer Implementation</strong>
    </figcaption>
</figure>
<p><strong>Why expand then shrink?</strong></p>
<p>The widened hidden space allows the model to form intermediate feature bundles (e.g., tense, register, archaic morphology) that do not cleanly live in the original lower-dimensional basis. The contraction enforces a stable interface for the residual path and keeps the total parameter count manageable. Removing the expansion would noticeably degrade expressiveness; removing the contraction would balloon memory use and break architectural symmetry.</p>
<p>In our context, the historical model internalizes regularities like mapping &ldquo;hath&rdquo; and &ldquo;doth&rdquo; into modern tense abstractions while still preserving period flavor; it encodes stylistic shifts between court proceedings, religious prose, and narrative storytelling; it stabilizes inconsistent orthography and variant spellings so downstream layers predict coherent continuations instead of brittle echoes. Attention tells the model where to look; the MLP decides how to reinterpret what it saw.</p>
<p>Focusing only on attention gives an incomplete mental model of transformers. In a standard block with a ×4 expansion, roughly two-thirds of the weights and a large fraction of the FLOPs sit in these feed-forward layers. Under-sized MLPs lead to shallow pattern memorization - models that can repeat phrases but cannot generalize style or adapt archaic forms to new contexts. Properly scaled MLP width (the common ×4 expansion) is a widely used sweet spot: smaller factors tend to underfit; much larger ones give diminishing returns at this scale (see scaling law discussions in <a
	
		href = "https://arxiv.org/abs/2001.08361"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Kaplan et al. 2020
	</span>
</a>).</p>
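<p>The weight split between attention and the MLP inside one block can be checked with a few lines of arithmetic. This sketch assumes a GPT-2-style attention layer (query, key, and value projections plus one output projection) and ignores biases and LayerNorm. <code>SimpleCausalSelfAttention</code> is not shown in this section, so treat the attention count as an assumption; the helper name is also made up for illustration.</p>

```python
def block_weight_counts(n_embd, expansion=4):
    """Weight-matrix sizes for one GPT-2-style block (biases and LayerNorm ignored)."""
    attention = 3 * n_embd * n_embd + n_embd * n_embd   # Q, K, V projections + output projection
    mlp = 2 * n_embd * (expansion * n_embd)             # fc_in + fc_out
    return attention, mlp


for name, n_embd in [("SLM", 768), ("Regular", 1024)]:
    attention, mlp = block_weight_counts(n_embd)
    share = mlp / (attention + mlp)
    print(f"{name}: attention={attention:,} mlp={mlp:,} mlp_share={share:.1%}")
```

<p>Under these assumptions the MLP holds two-thirds of each block&rsquo;s weights regardless of the embedding size, because both terms scale with the square of <code>n_embd</code>.</p>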
<p><strong>A useful mental analogy:</strong> attention is the lively debate in a hall; the MLP is each participant stepping aside to integrate what was heard into their own refined understanding before the next round of discussion. When you see <code>x = x + self.mlp(self.ln_2(x))</code>, that addition represents the moment a token&rsquo;s contextual representation is upgraded. Without this transformation, the model would &ldquo;hear&rdquo; context but fail to internalize it, producing shallow, literal continuations rather than fluent, period-authentic prose.</p>
<p>In our <code>helloLondon</code> models, the MLP is therefore essential for converting raw multi-head attention patterns into durable historical linguistic competence - one of the quiet reasons the generated text feels coherent rather than stitched together.</p>
<p>Each block in our model (12 for SLM, 24 for Regular) applies this same pattern, allowing the model to build an increasingly sophisticated understanding of historical language patterns as text flows through the layers.</p>
<p>Each transformer block applies layer normalization before both the self-attention mechanism and the feed-forward network, followed by residual connections. This pre-normalization approach (as opposed to post-normalization) has been shown to provide more stable training, especially important when working with the varied linguistic patterns found in historical text.</p>
<h4 id="234-activation-choice-matters">2.3.4 Activation choice matters</h4>
<p>The activation function determines how the neural network processes information at each layer. Think of it as a &ldquo;decision maker&rdquo; that decides how much of each input signal to pass through to the next layer. The most common activation functions are ReLU (Rectified Linear Unit) and GELU (Gaussian Error Linear Unit).</p>
<p>ReLU is simple and fast: it passes positive values unchanged and sets negative values to zero (<code>f(x) = max(0, x)</code>). However, ReLU can be &ldquo;harsh&rdquo; - it completely cuts off negative signals, which can leave some units as &ldquo;dead neurons&rdquo; that output zero for every input and stop receiving gradient. GELU is smoother: it weights each input by the probability that a standard normal variable falls below it (<code>f(x) = x * Φ(x)</code> where Φ is the cumulative distribution function of a standard normal distribution). The result is a smooth, differentiable curve that lets small negative inputs through at reduced strength instead of zeroing them outright.</p>
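<p>The two formulas are easy to compare numerically. The sketch below implements both with the standard library only, using the exact identity Φ(x) = 0.5 · (1 + erf(x / √2)); it is meant as an illustration of the math, not as the code path the model uses.</p>

```python
import math


def relu(x: float) -> float:
    # Passes positive values through and clamps negatives to zero.
    return max(0.0, x)


def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")
```

<p>Note how GELU returns small negative values for negative inputs rather than a hard zero, and approaches the identity for large positive inputs.</p>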
<p>GELU offers smoother gradients than plain ReLU and is the activation used in GPT-2 and BERT, which makes it a natural default for a GPT-style model. The smoother nature of GELU helps the model learn more subtle patterns in historical text, where the relationships between words and phrases can be complex and nuanced. Alternatives like SwiGLU can yield marginal gains in perplexity but increase implementation complexity - valuable in frontier systems, optional in educational builds like helloLondon. Modest dropout in the MLP further improves generalization on a corpus that, while sizable, is still modest relative to billion-token modern pretraining regimes.</p>
<h4 id="235-pre-vs-post-normalization">2.3.5 Pre vs Post-normalization</h4>
<p>In pre-normalization, we normalize the input before processing it (like we do here). In post-normalization, we&rsquo;d process first, then normalize the output. Pre-normalization is like checking that your ingredients are properly prepared before cooking, while post-normalization is like seasoning after cooking - both work, but pre-normalization tends to yield more consistent results.</p>
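<p>The difference between the two orderings can be shown with a toy example. This is a minimal sketch in plain Python, assuming a layer norm without learnable gain and bias and a stand-in sublayer that returns zeros; the helper names are hypothetical and chosen only for this illustration.</p>

```python
import math


def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance (no learnable gain/bias).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_step(x, sublayer):
    # Pre-normalization: x + sublayer(LN(x)). The residual path is untouched.
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]


def post_norm_step(x, sublayer):
    # Post-normalization: LN(x + sublayer(x)). The residual path is normalized too.
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed)


def zero_sublayer(x):
    # Stand-in for attention or the MLP that contributes nothing.
    return [0.0 for _ in x]


x = [1.0, 2.0, 4.0]
print(pre_norm_step(x, zero_sublayer))   # residual stream passes through unchanged
print(post_norm_step(x, zero_sublayer))  # residual stream is re-normalized
```

<p>With a sublayer that adds nothing, the pre-norm step returns its input exactly, while the post-norm step still rescales it. That untouched identity path is one common explanation for why pre-normalization tends to train more stably in deep stacks.</p>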
<p>This matters because historical texts contain complex syntactic structures and long-range dependencies, so the model benefits from a stable path for information across many layers. With pre-normalization, the residual connection itself is never normalized, so information can flow directly through the network, helping the model preserve important historical context across long sequences while still allowing the attention mechanism to focus on relevant historical details.</p>
<h3 id="24-causal-self-attention-for-historical-sequences">2.4 Causal Self-Attention for Historical Sequences</h3>
<p>The attention mechanism is crucial for understanding the complex relationships in historical text. Our implementation is based on the original transformer architecture from <a
	
		href = "https://arxiv.org/abs/1706.03762"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		&ldquo;Attention Is All You Need&rdquo;
	</span>
</a> (Vaswani et al., 2017), used here in its decoder-only, causally masked form for historical text.</p>
<h4 id="understanding-multi-head-attention">Understanding Multi-Head Attention</h4>
<p>Multi-head attention runs several attention “heads” in parallel, allowing the model to focus on different aspects of a sequence simultaneously (syntax, semantics, and position). Compared to a single head, this parallelism yields richer representations—think multiple specialists examining the same text. In our setup, the SLM uses 12 heads and the Regular model 16, scaling capacity with model size. Empirically, heads tend to specialize (e.g., subject–verb agreement, word relations, word order), as observed by <strong><a
	
		href = "https://arxiv.org/abs/1906.04341"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Clark et al. (2019) - &ldquo;What Does BERT Look At?&rdquo;
	</span>
</a></strong>.</p>
<p>Research by <strong><a
	
		href = "https://arxiv.org/abs/2001.08361"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Kaplan et al. (2020) Scaling Laws for Neural Language Models
	</span>
</a></strong> suggests that language-model loss depends mainly on overall scale (parameters, data, and compute) and only weakly on shape choices such as the number of attention heads. We therefore follow GPT-2 conventions rather than tuning head count: 12 heads for our 117M-parameter SLM and 16 heads for our 354M-parameter Regular model.</p>
<p>Standard attention has $O(n^2)$ cost with respect to sequence length, because every token is scored against every other token. Doubling our sequence length from 512 to 1024 tokens therefore requires <strong>4x</strong> more memory for the attention score matrix when it is materialized in full. This is why we carefully balance sequence length with available GPU memory and why techniques like <a
	
		href = "https://arxiv.org/abs/2205.14135"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		FlashAttention
	</span>
</a> (Dao et al., 2022) are so important for memory efficiency.</p>
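<p>A quick back-of-the-envelope calculation shows the quadratic growth. The sketch below assumes the score matrix is fully materialized (the naive approach that fused kernels avoid), uses 2 bytes per value (fp16 or bf16), and borrows the batch size of 4 and 12 heads from the SLM example in this post; the function name is illustrative.</p>

```python
def attention_score_bytes(batch: int, n_head: int, seq_len: int,
                          bytes_per_value: int = 2) -> int:
    # One (T x T) score matrix per head, per sequence in the batch.
    return batch * n_head * seq_len * seq_len * bytes_per_value


for seq_len in (512, 1024):
    mib = attention_score_bytes(4, 12, seq_len) / 2**20
    print(f"T={seq_len}: {mib:.0f} MiB of attention scores per layer")
```

<p>Doubling the sequence length quadruples this figure, and it applies to every layer, before counting the copies kept for the backward pass.</p>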
<p><strong>How the attention mechanism works in practice:</strong></p>
<p>The code in <a href="#listing4" class="listing-ref">Listing 4</a> shows how we implement the attention mechanism that we&rsquo;ve been discussing. Here&rsquo;s what happens step by step:</p>
<ol>
<li>
<p><strong>Input Processing</strong>: The model receives a batch of sequences (B = batch size, T = sequence length, C = embedding dimension). For example, with our SLM: B=4, T=512, C=768.</p>
</li>
<li>
<p><strong>Query, Key, Value Generation</strong>: The input embeddings are transformed into three different representations - Query (Q), Key (K), and Value (V) - using a single linear layer that outputs 3×768 dimensions, then splits them.</p>
</li>
<li>
<p><strong>Multi-Head Reshaping</strong>: Each of Q, K, V is reshaped to separate the 12 attention heads, so each head gets its own 64-dimensional subspace (768 ÷ 12 = 64).</p>
</li>
<li>
<p><strong>Attention Computation</strong>: The scaled dot-product attention is computed, where each token &ldquo;looks at&rdquo; itself and all previous tokens (causal masking) and decides how much attention to pay to each.</p>
</li>
<li>
<p><strong>Output Assembly</strong>: All attention head outputs are combined back into a single representation and projected through a final linear layer.</p>
</li>
</ol>
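<p>The five steps above can be traced as tensor shapes. This is a small bookkeeping sketch in plain Python, with no tensors involved; the function name and dictionary keys are ours and exist only to make the reshaping explicit.</p>

```python
def attention_shapes(B: int, T: int, C: int, n_head: int) -> dict:
    """Return the tensor shape at each stage of multi-head attention."""
    assert C % n_head == 0, "embedding dimension must split evenly across heads"
    hs = C // n_head  # per-head dimension
    return {
        "input": (B, T, C),                # step 1: batch of embeddings
        "qkv_fused": (B, T, 3 * C),        # step 2: single linear layer output
        "q_k_v_each": (B, T, C),           # step 2: after splitting into Q, K, V
        "per_head": (B, n_head, T, hs),    # step 3: view + transpose
        "scores": (B, n_head, T, T),       # step 4: conceptual Q @ K^T
        "heads_merged": (B, T, C),         # step 5: transpose + view
        "output": (B, T, C),               # step 5: final projection
    }


for stage, shape in attention_shapes(B=4, T=512, C=768, n_head=12).items():
    print(f"{stage:13s} {shape}")
```

<p>The <code>scores</code> entry is conceptual: fused attention kernels may never hold that full matrix in memory, but it is the quantity the math describes.</p>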
<p>This implementation uses PyTorch&rsquo;s efficient <code>scaled_dot_product_attention</code> function with <code>is_causal=True</code>, so each position can attend only to itself and earlier tokens, never to future ones.</p>
<figure id="listing4"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">import</span> torch
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">class</span> <span style="color:#eed49f">SimpleCausalSelfAttention</span>(torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Module):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Simple causal self-attention&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">__init__</span>(<span style="color:#91d7e3">self</span>, config):
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">super</span>()<span style="color:#91d7e3;font-weight:bold">.</span><span style="color:#8aadf4">__init__</span>()
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">assert</span> config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd <span style="color:#91d7e3;font-weight:bold">%</span> config<span style="color:#91d7e3;font-weight:bold">.</span>n_head <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Key, query, value projections for all heads, but in a batch</span>
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>c_attn <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Linear(config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd, <span style="color:#f5a97f">3</span> <span style="color:#91d7e3;font-weight:bold">*</span> config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd, bias<span style="color:#91d7e3;font-weight:bold">=</span>config<span style="color:#91d7e3;font-weight:bold">.</span>bias)
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Output projection</span>
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>c_proj <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Linear(config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd, config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd, bias<span style="color:#91d7e3;font-weight:bold">=</span>config<span style="color:#91d7e3;font-weight:bold">.</span>bias)
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Regularization</span>
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>attn_dropout <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Dropout(config<span style="color:#91d7e3;font-weight:bold">.</span>dropout)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>resid_dropout <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>Dropout(config<span style="color:#91d7e3;font-weight:bold">.</span>dropout)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>n_head <span style="color:#91d7e3;font-weight:bold">=</span> config<span style="color:#91d7e3;font-weight:bold">.</span>n_head
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>n_embd <span style="color:#91d7e3;font-weight:bold">=</span> config<span style="color:#91d7e3;font-weight:bold">.</span>n_embd
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>dropout <span style="color:#91d7e3;font-weight:bold">=</span> config<span style="color:#91d7e3;font-weight:bold">.</span>dropout
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">forward</span>(<span style="color:#91d7e3">self</span>, x):
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Batch size, sequence length, embedding dimensionality (n_embd)</span>
</span></span><span style="display:flex;"><span>        B, T, C <span style="color:#91d7e3;font-weight:bold">=</span> x<span style="color:#91d7e3;font-weight:bold">.</span>size()
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Calculate query, key, values for all heads in batch </span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># and move head forward to be the batch dim</span>
</span></span><span style="display:flex;"><span>        q, k, v <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>c_attn(x)<span style="color:#91d7e3;font-weight:bold">.</span>split(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>n_embd, dim<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">2</span>)
</span></span><span style="display:flex;"><span>        k <span style="color:#91d7e3;font-weight:bold">=</span> k<span style="color:#91d7e3;font-weight:bold">.</span>view(B, T, <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>n_head, C <span style="color:#91d7e3;font-weight:bold">//</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>n_head)<span style="color:#91d7e3;font-weight:bold">.</span>transpose(<span style="color:#f5a97f">1</span>, <span style="color:#f5a97f">2</span>)  <span style="color:#6e738d;font-style:italic"># (B, nh, T, hs)</span>
</span></span><span style="display:flex;"><span>        q <span style="color:#91d7e3;font-weight:bold">=</span> q<span style="color:#91d7e3;font-weight:bold">.</span>view(B, T, <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>n_head, C <span style="color:#91d7e3;font-weight:bold">//</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>n_head)<span style="color:#91d7e3;font-weight:bold">.</span>transpose(<span style="color:#f5a97f">1</span>, <span style="color:#f5a97f">2</span>)  <span style="color:#6e738d;font-style:italic"># (B, nh, T, hs)</span>
</span></span><span style="display:flex;"><span>        v <span style="color:#91d7e3;font-weight:bold">=</span> v<span style="color:#91d7e3;font-weight:bold">.</span>view(B, T, <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>n_head, C <span style="color:#91d7e3;font-weight:bold">//</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>n_head)<span style="color:#91d7e3;font-weight:bold">.</span>transpose(<span style="color:#f5a97f">1</span>, <span style="color:#f5a97f">2</span>)  <span style="color:#6e738d;font-style:italic"># (B, nh, T, hs)</span>
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Causal self-attention; Self-attend:</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># (B, nh, T, hs) x (B, nh, hs, T) -&gt; (B, nh, T, T)</span>
</span></span><span style="display:flex;"><span>        y <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>functional<span style="color:#91d7e3;font-weight:bold">.</span>scaled_dot_product_attention(q, k, v, attn_mask<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">None</span>, dropout_p<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>dropout <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>training <span style="color:#c6a0f6">else</span> <span style="color:#f5a97f">0</span>, is_causal<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>)
</span></span><span style="display:flex;"><span>        y <span style="color:#91d7e3;font-weight:bold">=</span> y<span style="color:#91d7e3;font-weight:bold">.</span>transpose(<span style="color:#f5a97f">1</span>, <span style="color:#f5a97f">2</span>)<span style="color:#91d7e3;font-weight:bold">.</span>contiguous()<span style="color:#91d7e3;font-weight:bold">.</span>view(B, T, C)  <span style="color:#6e738d;font-style:italic"># Re-assemble all head outputs side by side</span>
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Output projection</span>
</span></span><span style="display:flex;"><span>        y <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>resid_dropout(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>c_proj(y))
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> y</span></span></code></pre></div><figcaption>
        <strong>Listing 4: Causal Self-Attention Implementation</strong>
    </figcaption>
</figure>
<p>The attention mechanism computes attention as shown below:</p>
<p>$$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$</p>
<p>Where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively.</p>
<p>In our case, with 768-dimensional embeddings and 12 heads, each head operates on 64-dimensional subspaces ($d_k = 768 / 12 = 64$), providing sufficient representational capacity for each type of historical relationship while maintaining computational efficiency.</p>
<p>In addition, the $\sqrt{d_k}$ scaling factor ($\sqrt{64} = 8$) keeps the dot products from growing with the head dimension. Without it, large scores would push the softmax toward saturation and shrink its gradients, making training less stable.</p>
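<p>To see the formula at work, here is a minimal single-head sketch in plain Python. It applies the causal mask by only scoring positions up to the current token and divides by the square root of the head dimension; it is an unoptimized illustration of the math, not a substitute for the fused PyTorch kernel used in Listing 4, and the toy vectors are made up.</p>

```python
import math


def causal_attention(q, k, v):
    """Single-head causal attention over lists of T vectors of size d_k."""
    d_k = len(q[0])
    outputs, all_weights = [], []
    for i in range(len(q)):
        # Causal mask: token i scores only positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
        all_weights.append(weights)
    return outputs, all_weights


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = causal_attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

<p>The first token can only attend to itself, so its output equals its own value vector; later rows spread their weight over more positions, and each row sums to one.</p>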
<p><strong>In plain English, please!</strong></p>
<p>Think of attention like a spotlight that can shine on different parts of a sentence. When the model is trying to understand the word &ldquo;he&rdquo; in a historical document, it needs to look back through the text to find who &ldquo;he&rdquo; refers to. The attention mechanism is like having multiple spotlights (our 12 or 16 attention heads) that can each focus on different aspects - one might look for people&rsquo;s names, another for relationships, and another for locations.</p>
<p>The mathematical formula we showed above is how the model calculates the amount of attention to pay to each word. The scaling factor ($\sqrt{64} = 8$) is like adjusting the brightness of the spotlight – it prevents the model from being &ldquo;blinded&rdquo; by very bright spots and helps it focus on the right amount of information.</p>
<p><strong>Does this matter for historical text?</strong></p>
<p>Historical documents are particularly challenging because they often feature complex sentence structures and references spanning long distances. For example, in a court record, you might see &ldquo;The defendant, John Smith, was accused of theft. He claimed innocence throughout the trial.&rdquo; The model needs to understand that &ldquo;He&rdquo; refers to &ldquo;John Smith,&rdquo; even though there are several words between them. The attention mechanism enables the model to make these connections, generating coherent text that maintains proper historical context and references.</p>
<p>This capability is central to language modeling in general: later words routinely refer back to earlier ones, and interpreting them correctly depends on the full context. The attention mechanism lets the model learn these long-range dependencies and generate coherent text across extended sequences. For historical texts, this matters even more, because long, clause-heavy sentences and archaic references can place related words further apart than is typical in modern prose.</p>
<h3 id="25-model-configuration">2.5 Model Configuration</h3>
<p>The model architecture uses a centralized configuration, where each parameter is selected based on research findings and practical constraints for historical text processing. The SLM architecture uses five key parameters, each representing a design choice with specific trade-offs between computational efficiency and learning capacity.</p>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>Value</th>
          <th>Purpose</th>
          <th>Trade-off</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>n_layer</code></td>
          <td>12</td>
          <td>Number of transformer blocks (model depth)</td>
          <td>More layers = better learning, but slower training</td>
      </tr>
      <tr>
          <td><code>n_head</code></td>
          <td>12</td>
          <td>Number of attention heads (parallel processing)</td>
          <td>More heads = better attention, but more computation</td>
      </tr>
      <tr>
          <td><code>n_embd</code></td>
          <td>768</td>
          <td>Embedding dimension (token representation)</td>
          <td>Larger = richer representations, but more memory</td>
      </tr>
      <tr>
          <td><code>max_length</code></td>
          <td>512</td>
          <td>Context window size (sequence length)</td>
          <td>Longer = more context, but quadratic memory growth</td>
      </tr>
      <tr>
          <td><code>vocab_size</code></td>
          <td>30K</td>
          <td>Vocabulary size (tokenizer compatibility)</td>
          <td>Larger = more words, but more parameters</td>
      </tr>
  </tbody>
</table>
<p>These parameters work together to create a model that effectively processes historical text while remaining computationally manageable.</p>
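<p>One way to keep these five values together is a small frozen dataclass. The sketch below is an assumption about how such a configuration could look, not the project&rsquo;s actual <code>config.py</code>: the class name <code>SLMConfig</code> is hypothetical, the field names follow the table, and the exact vocabulary size behind &ldquo;30K&rdquo; is assumed to be 30,000.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLMConfig:
    """Hypothetical container for the five SLM parameters in the table."""
    n_layer: int = 12         # transformer blocks
    n_head: int = 12          # attention heads per block
    n_embd: int = 768         # embedding dimension
    max_length: int = 512     # context window in tokens
    vocab_size: int = 30_000  # "30K" in the table; exact value assumed

    def __post_init__(self):
        # Same invariant that Listing 4 asserts before building attention.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")

    @property
    def head_dim(self) -> int:
        return self.n_embd // self.n_head


config = SLMConfig()
print(config, "head_dim =", config.head_dim)
```

<p>Validating the divisibility constraint at construction time surfaces a bad head count immediately, instead of failing later inside the attention module.</p>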
<h4 id="layer-count-n_"><strong>Layer Count (n_layer: 12)</strong></h4>
<p>The 12-layer architecture balances representational capacity with computational efficiency for historical text processing. As a rough mental model, shallow layers (1-3) tend to learn basic token patterns and grammatical structures, middle layers (4-8) tend to capture complex syntactic relationships and historical language patterns, and deep layers (9-12) tend to encode high-level semantic relationships and historical context. In practice the boundaries between these roles are not sharp.</p>
<p>This depth follows GPT-2 Small&rsquo;s 12-layer architecture, which delivers strong performance while remaining computationally manageable on available hardware.</p>
<h4 id="attention-heads-n_"><strong>Attention Heads (n_head: 12)</strong></h4>
<p>Multi-head attention allows the model to attend to different types of relationships simultaneously – for example, temporal (chronological order), social (class hierarchies), geographical (London landmarks), and linguistic (archaic patterns). The 12-head architecture balances parallel processing capability with computational efficiency for historical text understanding.</p>
<h4 id="embedding-dimension-n_"><strong>Embedding Dimension (n_embd: 768)</strong></h4>
<p>The 768-dimensional embedding space can represent complex historical concepts, such as archaic terms (&ldquo;yeoman&rdquo;, &ldquo;paternoster row&rdquo;, &ldquo;hath&rdquo;), while maintaining computational efficiency. This dimension is commonly used in transformer architectures, including BERT-base and GPT-2 Small.</p>
<blockquote>
<p><strong>Why 768 became standard:</strong> As a side note, in case you are seeing a lot of 768 lately, there are good reasons for this. It divides evenly across 12 attention heads ($768 \div 12 = 64$ per head), and it is a multiple of 64 and of 256 (3 × 256), so its matrices split cleanly into the tile sizes that GPU matrix-multiply kernels and Tensor Cores tend to work with. This generally makes matrix operations more efficient on modern GPUs, as the hardware can process data in evenly sized chunks. Additionally, 768 provides sufficient representational capacity without the memory overhead of larger dimensions like 1024, making it practical for training on consumer hardware while still capturing complex linguistic relationships.</p></blockquote>
<h4 id="context-window-n_"><strong>Context Window (n_positions: 512)</strong></h4>
<p>We use a 512-token context window as a practical balance between historical coherence and available compute for a learning-focused setup. While many of our working snippets (e.g., diary passages, sections of legal records, or literary excerpts) comfortably fit within 512 tokens, full historical documents can be much longer. The 512 window keeps attention costs manageable (quadratic in sequence length) while covering typical training segments we use.</p>
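<p>When a document is longer than the window, it has to be cut into pieces that fit. The sketch below shows one simple way to do that, splitting a list of token IDs into fixed-size windows with an optional overlap. It is an illustration under our own assumptions (the function name, the stride parameter, and the made-up token IDs), not a description of the helloLondon data pipeline.</p>

```python
def chunk_tokens(token_ids, window=512, stride=None):
    """Split token IDs into windows of at most `window` tokens.

    `stride` is the step between window starts; a stride smaller than
    the window makes consecutive chunks overlap.
    """
    if stride is None:
        stride = window
    if not (window > 0 and stride > 0):
        raise ValueError("window and stride must be positive")
    return [token_ids[i:i + window] for i in range(0, len(token_ids), stride)]


document = list(range(1300))  # stand-in for a tokenized historical document
chunks = chunk_tokens(document)
print([len(c) for c in chunks])
```

<p>Non-overlapping windows keep training cost lowest; an overlapping stride trades extra compute for some shared context across chunk boundaries.</p>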
<p>Both models use the same 30K vocabulary from our custom historical tokenizer, ensuring consistent tokenization across model variants.</p>
<h2 id="3-gpu-configuration-and-perf-optimization">3. GPU Configuration and Perf. Optimization</h2>
<p>The training system is designed to maximize GPU utilization while maintaining training stability. Understanding GPU architecture and memory management is crucial for efficient language model training, especially when working with historical text that requires significant computational resources.</p>
<h3 id="31-gpu-architecture-and-memory-management-for-language-model-training">3.1 GPU Architecture and Memory Management for Language Model Training</h3>
<p>Training on historical text benefits from sensible GPU settings even for a small, learning-focused model. We keep to practical, low-risk optimizations (precision choice, batch/sequence trade-offs, memory-aware attention) and accept some trial and error—reserving heavier systems engineering for larger setups.</p>
<p>The main universal factors are:</p>
<ol>
<li>Attention scales quadratically with sequence length, so longer contexts get expensive fast.</li>
<li>Natural language variability (syntax, vocabulary, style) demands sufficient model capacity and stable optimization.</li>
<li>Real‑world data quality (formatting, noise) can destabilize training, requiring robust error handling and memory management.</li>
</ol>
<p>For historical text specifically, archaic vocabulary, period terminology, and cultural references introduce patterns absent from modern corpora. OCR artifacts and uneven formatting in digitized sources add noise beyond what’s typical in contemporary datasets.</p>
<h4 id="311-gpu-memory-hierarchy-and-optimization-strategies">3.1.1 GPU memory hierarchy and optimization strategies</h4>
<p>Modern GPUs use a hierarchical memory system that significantly impacts training performance: fast but tiny registers and shared memory sit closest to the compute; a larger L2 cache buffers traffic; and global memory holds parameters and activations. Attention often ends up memory-bound, so moving less data (via AMP, Flash/SDPA kernels, and sensible sequence/batch sizes) is as important as raw FLOPs.</p>
<p>For language model training, the key optimization is managing the <em>memory bandwidth bottleneck</em>: when an operation is memory-bound, performance is limited by how quickly data can be moved between memory levels rather than by computational power. Memory capacity is the other constraint. If we are not careful, it is quite easy to run out of GPU memory altogether, as shown in <a href="#fig5" class="figure-ref">Figure 5</a> below.</p>
<figure>
<img src="images/mem1.png" alt="Out of memory error Screenshot" title="OOM error" id="fig5">
<figcaption><strong>Figure 5:</strong> OOM error</figcaption>
</figure>
<p>Memory pressure is not restricted to training; evaluating saved checkpoints can also run out of memory, as shown in <a href="#fig6" class="figure-ref">Figure 6</a>.</p>
<figure>
<img src="images/oom-checkpoint-eval.png" alt="Out of memory error - checkpoint evals Screenshot" title="OOM checkpoint eval" id="fig6">
<figcaption><strong>Figure 6:</strong> OOM checkpoint eval</figcaption>
</figure>
<p><strong>Mixed precision training and memory optimization</strong></p>
<p>Training large language models requires careful memory management, especially when working with limited GPU resources. Our training system uses several optimization techniques to maximize memory efficiency while maintaining training stability.</p>
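<p>A rough estimate helps explain why memory fills up so quickly. The sketch below counts only the persistent training state, assuming fp32 weights, fp32 gradients, and an Adam-style optimizer that keeps two fp32 moments per parameter (16 bytes per parameter in total). Activations, which depend on batch size and sequence length, come on top. The optimizer choice and the function name are assumptions for illustration; the parameter counts are the 117M and 354M figures quoted earlier.</p>

```python
def training_state_gib(n_params: int, bytes_weights: int = 4,
                       bytes_grads: int = 4, bytes_optimizer: int = 8) -> float:
    # Weights + gradients + two optimizer moments, all assumed fp32.
    per_param = bytes_weights + bytes_grads + bytes_optimizer
    return n_params * per_param / 2**30


for name, n_params in (("SLM", 117_000_000), ("Regular", 354_000_000)):
    print(f"{name}: about {training_state_gib(n_params):.2f} GiB before activations")
```

<p>Mixed precision mainly shrinks the activation and matmul traffic rather than this fixed state, which is one reason batch size and sequence length remain the main knobs when a run hits an out-of-memory error.</p>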
<p><strong>GPU detection and basic configuration:</strong></p>
<p>The training system needs to work across different hardware setups, from single consumer GPUs to multi-GPU servers. Our approach uses a centralized configuration system that automatically adapts to available hardware.</p>
<p>The actual GPU detection in <code>train_model_slm.py</code> is quite straightforward - it checks for distributed training environment variables (<code>RANK</code>, <code>LOCAL_RANK</code>, <code>WORLD_SIZE</code>) and sets up basic multi-GPU support if available. The system also detects GPU capabilities, such as bfloat16 support, and enables appropriate optimizations. This allows the same training script to work across different hardware setups, though the real complexity comes from the trial-and-error process of stabilizing training.</p>
<figure id="listing5"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># GPU configuration (from config.py)</span>
</span></span><span style="display:flex;"><span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>gpu_config <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;auto_detect&#34;</span>: <span style="color:#f5a97f">True</span>,  <span style="color:#6e738d;font-style:italic"># Automatically detect available GPUs</span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;max_gpus&#34;</span>: <span style="color:#f5a97f">0</span>,  <span style="color:#6e738d;font-style:italic"># Maximum number of GPUs to use (0 = no limit, use all available)</span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;min_gpu_memory_gb&#34;</span>: <span style="color:#f5a97f">8</span>,  <span style="color:#6e738d;font-style:italic"># Minimum GPU memory required (GB)</span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;preferred_gpu_types&#34;</span>: [<span style="color:#a6da95">&#34;A30&#34;</span>, <span style="color:#a6da95">&#34;A40&#34;</span>, <span style="color:#a6da95">&#34;A100&#34;</span>, <span style="color:#a6da95">&#34;V100&#34;</span>, <span style="color:#a6da95">&#34;RTX4090&#34;</span>, <span style="color:#a6da95">&#34;RTX4080&#34;</span>],
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;fallback_to_cpu&#34;</span>: <span style="color:#f5a97f">True</span>,  <span style="color:#6e738d;font-style:italic"># Fall back to CPU if no suitable GPUs found</span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;force_single_gpu&#34;</span>: <span style="color:#f5a97f">False</span>,  <span style="color:#6e738d;font-style:italic"># Force single GPU even if multiple available</span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;force_multi_gpu&#34;</span>: <span style="color:#f5a97f">False</span>,  <span style="color:#6e738d;font-style:italic"># Force multi-GPU even if only one available</span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;gpu_memory_fraction&#34;</span>: <span style="color:#f5a97f">0.9</span>,  <span style="color:#6e738d;font-style:italic"># Fraction of GPU memory to use (0.0-1.0)</span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;allow_growth&#34;</span>: <span style="color:#f5a97f">True</span>,  <span style="color:#6e738d;font-style:italic"># Allow GPU memory growth</span>
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;log_device_placement&#34;</span>: <span style="color:#f5a97f">False</span>  <span style="color:#6e738d;font-style:italic"># Log device placement for debugging</span>
</span></span><span style="display:flex;"><span>}</span></span></code></pre></div><figcaption>
        <strong>Listing 5: GPU Configuration and Detection System</strong>
    </figcaption>
</figure>
<p>The configuration in <a href="#listing5" class="listing-ref">Listing 5</a> is defined in our centralized <code>config.py</code> file and provides settings for automatic GPU detection, memory management, and fallback options. While this looks comprehensive, the actual implementation is simpler - the training code primarily relies on PyTorch&rsquo;s built-in distributed training detection and basic device selection.</p>
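<p>As a rough illustration of how the keys in Listing 5 could drive basic device selection, the sketch below filters a list of detected GPUs in plain Python. The <code>pick_devices</code> helper and the <code>(name, memory_gb)</code> tuples are hypothetical stand-ins for this post, not code from the repository; in real training code the list would be built from something like <code>torch.cuda.get_device_properties</code>.</p>

```python
def pick_devices(gpu_config, detected):
    """Return device strings chosen from detected (name, memory_gb) tuples.

    Hypothetical helper: it mirrors the intent of the Listing 5 keys only.
    """
    min_mem = gpu_config.get("min_gpu_memory_gb", 0)
    usable = [i for i, (_name, mem_gb) in enumerate(detected) if mem_gb >= min_mem]

    max_gpus = gpu_config.get("max_gpus", 0)
    if max_gpus > 0:  # 0 means "no limit"
        usable = usable[:max_gpus]
    if gpu_config.get("force_single_gpu", False):
        usable = usable[:1]

    if usable:
        return [f"cuda:{i}" for i in usable]
    if gpu_config.get("fallback_to_cpu", True):
        return ["cpu"]
    raise RuntimeError("No suitable GPU found and CPU fallback is disabled")


config = {"max_gpus": 0, "min_gpu_memory_gb": 8,
          "fallback_to_cpu": True, "force_single_gpu": False}
print(pick_devices(config, [("A30", 24), ("A30", 24)]))  # both GPUs qualify
print(pick_devices(config, [("GTX 1050", 4)]))           # too little memory, CPU fallback
```

<p>Keeping the decision in one small function like this makes the fallback path easy to test without any GPU present.</p>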
<p><strong>The reality of training: Nearly 100 runs and many failures</strong></p>
<figure>
<img src="images/wandb.png" alt="WandB Screenshot" title="helloLondon training runs - WandB" id="fig7">
<figcaption><strong>Figure 7:</strong> helloLondon training runs - WandB</figcaption>
</figure>
<p><a href="#fig7" class="figure-ref">Figure 7</a> shows the actual training experience: 99 total runs with 24 completions. The failures were largely data-driven - OCR and encoding issues, uneven sequence lengths, and sensitivity to learning-rate warmup - and a few were plain memory pressure from early, less conservative settings. The code stabilized early; the data and knobs took time.</p>
<p>This iterative process is typical in language model development - the &ldquo;sophisticated&rdquo; system shown here is the result of learning from these failures and gradually improving the training pipeline. The successful runs exhibit stable loss curves and appropriate learning rate schedules, which indicates that the final configuration trains reliably on historical text.</p>
<p>Most importantly, this experience reinforces a fundamental truth in machine learning: <strong>data quality is still king</strong>. No amount of sophisticated architecture, GPU optimization, or training infrastructure can overcome poor data quality. The &ldquo;garbage in, garbage out&rdquo; principle remains as true for language models as it was for the earliest machine learning systems. Our roughly 76% failure rate (75 of 99 runs did not complete) was primarily due to data issues, such as inconsistent formatting, OCR errors, and encoding problems, not technical limitations. This is why Part 2&rsquo;s focus on data cleaning and tokenization was so crucial to our success.</p>
<h3 id="32-precision-and-performance-configuration">3.2 Precision and Performance Configuration</h3>
<p>The system includes precision and performance configuration options that can be tuned based on available hardware. Mixed-precision training uses lower-precision (fp16/bf16) for most operations while keeping full precision for critical computations, providing significant memory savings and speed improvements with minimal impact on quality.</p>
<p><strong>Understanding fp16 and bf16: The Precision Trade-off</strong></p>
<p>To understand why precision matters for language model training, we need to look at how computers represent numbers. Standard floating-point numbers use 32 bits (float32), but we can use fewer bits to save memory and increase speed:</p>
<ul>
<li>
<p><strong>fp16 (Half Precision)</strong>: Uses 16 bits to represent numbers, cutting memory usage in half and enabling faster computation. However, it has a smaller range of representable numbers, which can cause &ldquo;overflow&rdquo; (numbers too large) or &ldquo;underflow&rdquo; (numbers too small) during training.</p>
</li>
<li>
<p><strong>bf16 (Brain Float 16)</strong>: Also uses 16 bits, but with a different bit layout: 8 exponent bits (the same as float32) and only 7 mantissa bits, versus 5 and 10 for fp16. This means it can represent roughly the same range of large and small numbers as float32, but every value is stored more coarsely, with about 2-3 significant decimal digits instead of float32&rsquo;s 7.</p>
</li>
</ul>
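<p>The range and precision of each format follow directly from how the 16 or 32 bits are split between exponent and mantissa. The short calculation below applies the standard IEEE-style formulas to the three layouts; the <code>format_limits</code> name is just for this example.</p>

```python
# (exponent_bits, mantissa_bits) for each format
FORMATS = {"fp16": (5, 10), "bf16": (8, 7), "float32": (8, 23)}


def format_limits(exp_bits, man_bits):
    """Largest finite value, smallest normal value, and spacing just above 1.0."""
    bias = 2 ** (exp_bits - 1) - 1
    largest = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    smallest_normal = 2.0 ** (1 - bias)
    epsilon = 2.0 ** -man_bits
    return largest, smallest_normal, epsilon


for name, (e, m) in FORMATS.items():
    largest, smallest, eps = format_limits(e, m)
    print(f"{name:8s} max={largest:.3e}  min_normal={smallest:.3e}  eps={eps:.3e}")
```

<p>The fp16 row tops out at 65504, while bf16 shares float32&rsquo;s exponent range and pays for it with a larger step between neighbouring values.</p>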
<p><strong>Why bf16 is better for language models:</strong></p>
<p>bf16 provides better numerical stability than fp16, especially for large language models, reducing the likelihood of overflow and underflow that can cause training instability. The key difference is that bf16 can represent the same range of numbers as float32 (from very small to very large), while fp16 has a much smaller range. This is crucial for language models because:</p>
<ol>
<li><strong>Gradient magnitudes vary widely</strong> - Some gradients are very small (close to zero) while others are large</li>
<li><strong>Attention weights</strong> - The softmax operations in attention can produce very small numbers that fp16 might round to zero</li>
<li><strong>Learning rate scaling</strong> - Modern optimizers like AdamW work with gradients of varying magnitudes</li>
</ol>
<p>When gradients become too small and are rounded to zero (underflow), the model stops learning effectively. When they become too large (overflow), training becomes unstable. bf16&rsquo;s wider range helps prevent both issues.</p>
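<p>You can see both failure modes without a GPU. The sketch below round-trips a few values through fp16 using Python&rsquo;s <code>struct</code> module and emulates bf16 by keeping the top 16 bits of the float32 encoding. The <code>to_fp16</code> and <code>to_bf16</code> helpers are illustrative only, and the bf16 emulation does not handle NaN or infinity.</p>

```python
import math
import struct


def to_fp16(x):
    """Round x to the nearest fp16 value; values beyond the fp16 range become inf."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def to_bf16(x):
    """Emulate bf16 by keeping the top 16 bits of the float32 encoding.

    Uses round-to-nearest-even; NaN and inf inputs are not handled.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + (bits // 65536) % 2  # rounding bias
    bits = (bits // 65536) * 65536        # clear the low 16 bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


for value in (1e-8, 70000.0, 1.001):
    print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

<p>The tiny value underflows to zero in fp16 and the large one overflows, while bf16 keeps both at reduced precision. The last value shows the other side of the trade-off: near 1.0, fp16 resolves a difference that bf16 rounds away.</p>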
<p><strong>Understanding precision and performance settings:</strong></p>
<p>The configuration in <a href="#listing6" class="listing-ref">Listing 6</a> toggles the levers that matter on consumer hardware: TF32 for faster matmuls, AMP (prefer bf16) for stability and memory cuts, <code>torch.compile</code> for an extra boost after warmup, and sequence/batch sizes sized to your VRAM. Used together, these commonly halve activation memory and yield 2-3x speedups versus full-precision baselines.</p>
<figure id="listing6"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Runtime/precision knobs (A30 optimized)</span>
</span></span><span style="display:flex;"><span><span style="color:#a6da95">&#34;enable_tf32&#34;</span>: <span style="color:#f5a97f">True</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6da95">&#34;enable_amp&#34;</span>: <span style="color:#f5a97f">True</span>,
</span></span><span style="display:flex;"><span><span style="color:#a6da95">&#34;amp_dtype&#34;</span>: <span style="color:#a6da95">&#34;bf16&#34;</span>,  <span style="color:#6e738d;font-style:italic"># bf16 on Ampere; fallback to fp16 if unsupported</span>
</span></span><span style="display:flex;"><span><span style="color:#a6da95">&#34;enable_compile&#34;</span>: <span style="color:#f5a97f">True</span>,  <span style="color:#6e738d;font-style:italic"># torch.compile; set False to reduce memory usage</span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Conservative baseline (for broad hardware) - uncomment to use:</span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># &#34;enable_tf32&#34;: False,</span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># &#34;enable_amp&#34;: True,</span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># &#34;amp_dtype&#34;: &#34;fp16&#34;,</span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Sequence/batch control</span>
</span></span><span style="display:flex;"><span><span style="color:#a6da95">&#34;max_length&#34;</span>: <span style="color:#f5a97f">1024</span>,  <span style="color:#6e738d;font-style:italic"># increase tokens per step when VRAM allows</span>
</span></span><span style="display:flex;"><span><span style="color:#a6da95">&#34;batch_size&#34;</span>: <span style="color:#f5a97f">20</span>,    <span style="color:#6e738d;font-style:italic"># per-GPU batch; raise if VRAM allows</span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Conservative sequence/batch - uncomment to use:</span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># &#34;max_length&#34;: 768,</span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># &#34;batch_size&#34;: 8,</span></span></span></code></pre></div><figcaption>
        <strong>Listing 6: Precision and Performance Configuration</strong>
    </figcaption>
</figure>
<h4 id="key-gpu-configuration-settings"><strong>Key GPU Configuration Settings</strong></h4>
<p>A few switches move the needle the most: enable TF32 on Ampere-class GPUs for a quick matrix-mul speedup; use AMP (bf16 where supported, fp16 otherwise) to halve activation memory; and turn on <code>torch.compile</code> if you can afford the warmup to get another 1.2-1.5x after a few hundred steps. Keep the sequence length in line with VRAM (~512 tokens for 8GB, ~1024 for 16GB+), and scale the per-GPU batch size accordingly (think hundreds of MB per batch at these widths). The repo includes sensible presets so you can start conservative and dial up.</p>
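<p>To make these rules of thumb concrete, here is a small sketch that maps available VRAM and bf16 support to the config keys used in Listing 6. The <code>suggest_settings</code> helper is hypothetical, and the thresholds are only the approximate figures quoted above, so treat the result as a starting point rather than a tuned preset.</p>

```python
def suggest_settings(vram_gb, bf16_supported):
    """Hypothetical helper: turn the rules of thumb above into config keys."""
    return {
        "enable_amp": True,
        # bf16 where the GPU supports it, fp16 otherwise
        "amp_dtype": "bf16" if bf16_supported else "fp16",
        # ~512 tokens for 8GB cards, ~1024 once 16GB or more is available
        "max_length": 1024 if vram_gb >= 16 else 512,
    }


print(suggest_settings(vram_gb=24, bf16_supported=True))
print(suggest_settings(vram_gb=8, bf16_supported=False))
```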
<h4 id="321-real-world-performance-results">3.2.1 Real-World Performance Results</h4>
<p>On 2x A30s, the SLM lands around mid-20s MFU with ~210 ms/iter and ~18 GB per GPU, converging from ~10.4 loss to the mid-3s over the full run. The clean BPE tokenizer and precision stack keep math efficient, and DDP delivers the expected speedup over a single device.</p>
<p><strong>Automatic Precision Detection and Memory Optimization:</strong></p>
<p>The system also includes automatic precision detection and memory optimization during model initialization. The code snippet below shows how the system automatically selects the optimal precision format based on available hardware capabilities:</p>
<figure id="listing7"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Precision / TF32 knobs from config</span>
</span></span><span style="display:flex;"><span>tf32 <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>slm_config<span style="color:#91d7e3;font-weight:bold">.</span>get(<span style="color:#a6da95">&#34;enable_tf32&#34;</span>, <span style="color:#f5a97f">True</span>)
</span></span><span style="display:flex;"><span>torch<span style="color:#91d7e3;font-weight:bold">.</span>backends<span style="color:#91d7e3;font-weight:bold">.</span>cuda<span style="color:#91d7e3;font-weight:bold">.</span>matmul<span style="color:#91d7e3;font-weight:bold">.</span>allow_tf32 <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">bool</span>(tf32)
</span></span><span style="display:flex;"><span>torch<span style="color:#91d7e3;font-weight:bold">.</span>backends<span style="color:#91d7e3;font-weight:bold">.</span>cudnn<span style="color:#91d7e3;font-weight:bold">.</span>allow_tf32 <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">bool</span>(tf32)
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">try</span>:
</span></span><span style="display:flex;"><span>    torch<span style="color:#91d7e3;font-weight:bold">.</span>set_float32_matmul_precision(<span style="color:#a6da95">&#39;high&#39;</span> <span style="color:#c6a0f6">if</span> tf32 <span style="color:#c6a0f6">else</span> <span style="color:#a6da95">&#39;highest&#39;</span>)  <span style="color:#6e738d;font-style:italic"># &#39;high&#39; allows TF32; &#39;highest&#39; keeps full float32</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">Exception</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">pass</span>
</span></span><span style="display:flex;"><span>use_amp <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>slm_config<span style="color:#91d7e3;font-weight:bold">.</span>get(<span style="color:#a6da95">&#34;enable_amp&#34;</span>, <span style="color:#f5a97f">True</span>)
</span></span><span style="display:flex;"><span>amp_dtype_cfg <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>slm_config<span style="color:#91d7e3;font-weight:bold">.</span>get(<span style="color:#a6da95">&#34;amp_dtype&#34;</span>, <span style="color:#a6da95">&#34;bf16&#34;</span>)<span style="color:#91d7e3;font-weight:bold">.</span>lower()
</span></span><span style="display:flex;"><span>bf16_ok <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>cuda<span style="color:#91d7e3;font-weight:bold">.</span>is_available() <span style="color:#91d7e3;font-weight:bold">and</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>cuda<span style="color:#91d7e3;font-weight:bold">.</span>is_bf16_supported()
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">if</span> use_amp:
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> amp_dtype_cfg <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#a6da95">&#39;bf16&#39;</span> <span style="color:#91d7e3;font-weight:bold">and</span> bf16_ok:
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>dtype <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#39;bfloat16&#39;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>dtype <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#39;float16&#39;</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>dtype <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#39;float32&#39;</span></span></span></code></pre></div><figcaption>
        <strong>Listing 7: Precision Detection and Memory Optimization</strong>
    </figcaption>
</figure>
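<p>The TF32 settings apply to float32 matrix multiplications and convolutions on Ampere and newer GPUs: Tensor Cores keep float32&rsquo;s 8-bit exponent but round inputs to a 10-bit mantissa, which speeds up these operations considerably while preserving float32&rsquo;s range, so training stability is rarely affected.</p>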
<h3 id="33-multi-gpu-training-with-distributed-data-parallel">3.3 Multi-GPU Training with Distributed Data Parallel</h3>
<p>The system supports multi-GPU training using PyTorch&rsquo;s DistributedDataParallel (DDP) - each GPU hosts a full model replica, processes different batches in parallel, and synchronizes gradients automatically. PyTorch handles the inter‑GPU communication, so on two GPUs, you typically see near‑linear speedup (~2×) for these model sizes.</p>
<p>Multi-GPU training improves throughput and shortens wall‑clock time by splitting work across devices. On our 2× A30 setup, the SLM processes 36 sequences in parallel (18 per GPU) instead of 18 on a single card, and the Regular model processes 24 (12 per GPU) instead of 12, cutting Regular model training from ~56 hours to ~28–32 hours. It also offers operational flexibility: scale up or down based on the number of GPUs available.</p>
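<p>The quoted training times translate directly into a speedup and a parallel efficiency figure. The snippet below is plain arithmetic on the ~56 hour single-GPU and ~28–32 hour two-GPU numbers; <code>scaling</code> is just a name chosen for this example.</p>

```python
def scaling(single_gpu_hours, multi_gpu_hours, num_gpus):
    """Return (speedup, parallel efficiency) for a multi-GPU run."""
    speedup = single_gpu_hours / multi_gpu_hours
    return speedup, speedup / num_gpus


for hours in (28, 32):
    speedup, efficiency = scaling(56, hours, num_gpus=2)
    print(f"{hours}h on 2 GPUs: {speedup:.2f}x speedup, {efficiency:.1%} efficiency")
```

<p>Anything above roughly 85% efficiency on two GPUs is usually considered healthy scaling for a model of this size.</p>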
<p>However, multi-GPU training introduces several challenges that can limit performance gains. The primary bottleneck is <strong>inter-GPU communication</strong> - after each backward pass, gradients must be synchronized across all GPUs, which requires transferring large amounts of data. This communication overhead can become significant, especially with larger models and more GPUs.</p>
<p>The performance of multi-GPU training heavily depends on the interconnect between GPUs. On NVIDIA systems, <em>NVLink</em> (found on high-end NVIDIA GPUs such as A100 and H100) provides direct GPU-to-GPU connections with very high bandwidth inside a single machine, making it ideal for 2-8 GPU setups. <em>PCIe</em> connections are slower but more common in consumer and workstation systems. <em>InfiniBand</em> is a network fabric that links separate machines with high bandwidth and low latency, which is what makes scaling across many nodes practical.</p>
<p>In AMD systems, <em>Infinity Fabric</em> serves a role similar to NVLink, providing high-bandwidth interconnects between GPUs. AMD&rsquo;s <em>MI200</em> and <em>MI300</em> series GPUs include Infinity Fabric links that enable efficient multi-GPU communication.</p>
<p>In practice, scaling efficiency depends on the ratio of computation to communication. Our historical language models have relatively modest parameter counts (117M-354M), so communication overhead can be significant compared to computation time. This is why we see good scaling with 2 GPUs but diminishing returns with more GPUs - the communication overhead starts to dominate.</p>
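<p>A back-of-the-envelope estimate shows why communication matters even for small models. The sketch below assumes an idealized ring all-reduce, where each GPU sends 2(n-1)/n times the gradient size per step, and float32 gradients at 4 bytes per parameter. Both are assumptions for illustration: the actual algorithm is chosen by the communication backend, and DDP overlaps part of this traffic with the backward pass.</p>

```python
def ring_allreduce_bytes(num_params, num_gpus, bytes_per_grad=4):
    """Bytes each GPU sends per step under an idealized ring all-reduce."""
    grad_bytes = num_params * bytes_per_grad
    return 2 * (num_gpus - 1) / num_gpus * grad_bytes


for name, params in (("SLM", 117e6), ("Regular", 354e6)):
    for gpus in (2, 4, 8):
        gb = ring_allreduce_bytes(params, gpus) / 1e9
        print(f"{name:8s} {gpus} GPUs: ~{gb:.2f} GB sent per GPU per step")
```

<p>Per-GPU traffic grows toward twice the gradient size as GPUs are added, while per-GPU compute per step stays fixed, so the interconnect decides how much of that cost can hide behind computation.</p>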
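<p>DDP is more efficient than PyTorch&rsquo;s older single-process <code>DataParallel</code> because it runs one process per GPU and overlaps gradient synchronization with the backward pass, which reduces communication overhead and makes larger effective batch sizes practical. <a href="#listing8" class="listing-ref">Listing 8</a> below shows how the trainer detects a DDP launch and configures each process.</p>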
<figure id="listing8"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># DDP setup (process group already initialized in main())</span>
</span></span><span style="display:flex;"><span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ddp <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">int</span>(os<span style="color:#91d7e3;font-weight:bold">.</span>environ<span style="color:#91d7e3;font-weight:bold">.</span>get(<span style="color:#a6da95">&#39;RANK&#39;</span>, <span style="color:#91d7e3;font-weight:bold">-</span><span style="color:#f5a97f">1</span>)) <span style="color:#91d7e3;font-weight:bold">!=</span> <span style="color:#91d7e3;font-weight:bold">-</span><span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ddp:
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ddp_rank <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">int</span>(os<span style="color:#91d7e3;font-weight:bold">.</span>environ[<span style="color:#a6da95">&#39;RANK&#39;</span>])
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ddp_local_rank <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">int</span>(os<span style="color:#91d7e3;font-weight:bold">.</span>environ[<span style="color:#a6da95">&#39;LOCAL_RANK&#39;</span>])
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ddp_world_size <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">int</span>(os<span style="color:#91d7e3;font-weight:bold">.</span>environ[<span style="color:#a6da95">&#39;WORLD_SIZE&#39;</span>])
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>device <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#ed8796">f</span><span style="color:#a6da95">&#39;cuda:</span><span style="color:#a6da95">{</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ddp_local_rank<span style="color:#a6da95">}</span><span style="color:#a6da95">&#39;</span>
</span></span><span style="display:flex;"><span>    torch<span style="color:#91d7e3;font-weight:bold">.</span>cuda<span style="color:#91d7e3;font-weight:bold">.</span>set_device(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>device)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>master_process <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ddp_rank <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>seed_offset <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ddp_rank
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>master_process <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">True</span>
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>seed_offset <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ddp_world_size <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">1</span></span></span></code></pre></div><figcaption>
        <strong>Listing 8: Multi-GPU Training Setup</strong>
    </figcaption>
</figure>
<p><strong>What is &ldquo;rank&rdquo; and why does it matter?</strong></p>
<p>In distributed training, each GPU process gets a unique “rank.” Rank 0 acts as the coordinator (handles logging, checkpointing, and WandB), while the remaining ranks focus purely on computation. This avoids collisions - only one process touches files and dashboards - while every device contributes gradients.</p>
<p>This division of labor is crucial because it prevents conflicts. Without it, all processes would try to save checkpoints simultaneously, log to WandB at the same time, or write to the same files, causing errors and corruption.</p>
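<p>Here is a minimal, GPU-free sketch of the same idea. It parses the <code>RANK</code>, <code>LOCAL_RANK</code>, and <code>WORLD_SIZE</code> variables that <code>torchrun</code> exports and gates side effects on rank 0. The <code>read_rank</code> and <code>log_once</code> helpers are hypothetical names for this example, not functions from the repository.</p>

```python
import os


def read_rank(env=os.environ):
    """Parse the variables torchrun exports; default to a single process."""
    rank = int(env.get("RANK", -1))
    if rank == -1:
        return {"ddp": False, "rank": 0, "local_rank": 0, "world_size": 1}
    return {
        "ddp": True,
        "rank": rank,
        "local_rank": int(env["LOCAL_RANK"]),
        "world_size": int(env["WORLD_SIZE"]),
    }


def log_once(info, message):
    """Print only on rank 0 so other ranks stay silent; returns True if printed."""
    if info["rank"] == 0:
        print(message)
        return True
    return False


# Simulate the second process of a two-GPU torchrun launch.
worker = read_rank({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"})
log_once(worker, "saving checkpoint")         # silent on rank 1
log_once(read_rank({}), "saving checkpoint")  # prints in single-process mode
```

<p>The same guard can wrap checkpoint writes and WandB calls, so only one process ever touches shared files.</p>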
<p>The key to scaling efficiency is that each GPU works independently on different data batches, then synchronizes only the essential information (gradients). Here&rsquo;s how it works:</p>
<ol>
<li><strong>Parallel computation</strong>: Each GPU processes a different batch of data simultaneously</li>
<li><strong>Gradient synchronization</strong>: After each backward pass, gradients are averaged across all GPUs</li>
<li><strong>Identical updates</strong>: Each GPU applies the same averaged gradients to its own model copy, so all replicas stay in sync</li>
</ol>
<p>This means that if you have 2 GPUs, you can process 2x the data in the same time, giving you roughly 2x the speed. With 4 GPUs, the ideal is approximately 4x, although for models this small the communication cost takes a larger share of each step. The &ldquo;near-linear&rdquo; part acknowledges that there&rsquo;s always some overhead from communication and synchronization, so you might get a 1.9x speedup instead of exactly 2x. Still, it&rsquo;s close enough to be very effective.</p>
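<p>The three steps above can be simulated in a few lines of plain Python. This toy example fits a single weight with a mean-squared-error loss, splits one batch across two pretend GPUs, and averages their gradients. It is a sketch of the arithmetic only; real DDP performs the averaging with an all-reduce across processes.</p>

```python
def mse_grad(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w, lr = 0.0, 0.01

# 1. Each "GPU" gets a different, equally sized shard of the batch.
shards = [data[:2], data[2:]]
local_grads = [mse_grad(w, shard) for shard in shards]

# 2. All-reduce: average the per-replica gradients.
avg_grad = sum(local_grads) / len(local_grads)

# 3. Every replica applies the same update, so the copies stay identical.
replicas = [w - lr * avg_grad for _ in shards]

print(avg_grad, mse_grad(w, data))  # the two agree up to floating-point rounding
```

<p>Because the shards are the same size, the averaged gradient matches the gradient of the full batch, which is why two GPUs with batch 18 each behave like one GPU with batch 36.</p>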
<p>However, there&rsquo;s a practical limit to this approach. Beyond 8-16 GPUs, the communication overhead becomes so significant that you need more robust hardware (such as InfiniBand networks) and advanced systems engineering techniques (gradient compression, pipeline parallelism, model parallelism) to maintain efficiency. For truly large-scale training with hundreds of GPUs, you need specialized infrastructure and techniques that go far beyond what we&rsquo;re doing here.</p>
<p>This combination of distributed training and memory optimization enables us to train our historical language models efficiently, even on consumer hardware. The distributed setup provides near-linear speedup on two GPUs, while the precision optimizations enable larger models and longer sequences on the same hardware.</p>
<h2 id="4-training-infrastructure-making-it-all-work-together">4. Training Infrastructure: Making It All Work Together</h2>
<p>As a reminder, the two model variants share the same training stack (scheduler, checkpointing, WandB, DDP). See Part 1 for the high‑level comparison; here are the training‑relevant differences only:</p>
<ul>
<li>SLM (117M): per‑GPU batch 18 → effective 36 on 2 GPUs; sequence length 512; ~7–8h on 2×A30</li>
<li>Regular (354M): per‑GPU batch 12 → effective 24 on 2 GPUs; sequence length 1024; ~28–32h on 2×A30</li>
</ul>
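<p>These settings determine how many tokens each optimizer step consumes. The calculation below assumes no gradient accumulation, which the figures above do not mention either way; <code>tokens_per_step</code> is a name chosen for this example.</p>

```python
def tokens_per_step(per_gpu_batch, seq_len, num_gpus=2):
    """Effective batch size and tokens consumed per optimizer step."""
    effective_batch = per_gpu_batch * num_gpus
    return effective_batch, effective_batch * seq_len


print("SLM:    ", tokens_per_step(18, 512))
print("Regular:", tokens_per_step(12, 1024))
```

<p>The Regular model uses a smaller batch but sees more tokens per step, because each sequence is twice as long.</p>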
<h3 id="41-the-training-loop">4.1 The Training Loop</h3>
<p>The core training happens in the <code>train()</code> method, which implements a standard language model training loop with several key phases - outlined below.</p>
<h4 id="411-data-loading-and-preparation">4.1.1 Data Loading and Preparation</h4>
<p>The training loop starts by loading tokenized data using <code>get_batch('train')</code>, which reads from pre-processed binary files created during data preparation. This includes both training and validation data, with the tokenizer from <a
	
		href = "https://blog.desigeek.com/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Part 2: Data Collection &amp; Custom Tokenizers
	</span>
</a> handling the conversion between text and tokens.</p>
<p><strong>Main Training Loop Structure:</strong></p>
<figure id="listing9"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">train</span>(<span style="color:#91d7e3">self</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Main training loop&#34;&#34;&#34;</span>
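</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Get initial batch and start the step counter</span>
</span></span><span style="display:flex;"><span>    X, Y <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>get_batch(<span style="color:#a6da95">&#39;train&#39;</span>)
</span></span><span style="display:flex;"><span>    iter_num <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">while</span> <span style="color:#f5a97f">True</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># 1. Learning rate scheduling (assumes the optimizer is stored on self)</span>
</span></span><span style="display:flex;"><span>        lr <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>get_lr(iter_num)
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">for</span> param_group <span style="color:#91d7e3;font-weight:bold">in</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>optimizer<span style="color:#91d7e3;font-weight:bold">.</span>param_groups:
</span></span><span style="display:flex;"><span>            param_group[<span style="color:#a6da95">&#39;lr&#39;</span>] <span style="color:#91d7e3;font-weight:bold">=</span> lr
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># 2. Evaluation and checkpointing (every eval_interval steps)</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> iter_num <span style="color:#91d7e3;font-weight:bold">%</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>eval_interval <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#f5a97f">0</span>:
</span></span><span style="display:flex;"><span>            losses <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>estimate_loss()
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Save checkpoint if validation loss improved</span>
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># 3. Forward pass with mixed precision (bf16 shown)</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">with</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>amp<span style="color:#91d7e3;font-weight:bold">.</span>autocast(device_type<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;cuda&#39;</span>, dtype<span style="color:#91d7e3;font-weight:bold">=</span>torch<span style="color:#91d7e3;font-weight:bold">.</span>bfloat16):
</span></span><span style="display:flex;"><span>            logits, loss <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model(X, Y)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># 4. Backward pass and optimization</span>
</span></span><span style="display:flex;"><span>        loss<span style="color:#91d7e3;font-weight:bold">.</span>backward()
</span></span><span style="display:flex;"><span>        torch<span style="color:#91d7e3;font-weight:bold">.</span>nn<span style="color:#91d7e3;font-weight:bold">.</span>utils<span style="color:#91d7e3;font-weight:bold">.</span>clip_grad_norm_(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model<span style="color:#91d7e3;font-weight:bold">.</span>parameters(), <span style="color:#f5a97f">1.0</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>optimizer<span style="color:#91d7e3;font-weight:bold">.</span>step()
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>optimizer<span style="color:#91d7e3;font-weight:bold">.</span>zero_grad(set_to_none<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># 5. Get next batch</span>
</span></span><span style="display:flex;"><span>        X, Y <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>get_batch(<span style="color:#a6da95">&#39;train&#39;</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># 6. Logging and monitoring</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> iter_num <span style="color:#91d7e3;font-weight:bold">%</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>log_interval <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#f5a97f">0</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Log to WandB and console</span>
</span></span><span style="display:flex;"><span>            <span style="color:#91d7e3">print</span>(<span style="color:#a6da95">&#34;iter&#34;</span>, iter_num, <span style="color:#a6da95">&#34;loss&#34;</span>, loss<span style="color:#91d7e3;font-weight:bold">.</span>item(), <span style="color:#a6da95">&#34;lr&#34;</span>, lr)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># 7. Advance the counter and stop at the limit (max_iters is an assumed attribute name)</span>
</span></span><span style="display:flex;"><span>        iter_num <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> iter_num <span style="color:#91d7e3;font-weight:bold">&gt;</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>max_iters:
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">break</span></span></span></code></pre></div><figcaption>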
        <strong>Listing 9: Core Training Loop Structure</strong>
    </figcaption>
</figure>
<p><strong>Training Process Flow:</strong></p>
<p>Understanding how the training actually works requires seeing both the high-level flow and the technical details of each phase. <a href="#fig8" class="figure-ref">Figure 8</a> shows the complete training process flow.</p>
<figure class="align-center " id="fig8">
    <pre class="mermaid">graph TD
    A[Start Training] --&gt; B[Load Tokenized Data]
    B --&gt; C[Initialize Model &amp; Optimizer]
    C --&gt; D[Training Loop Start]
    D --&gt; E[Update Learning Rate]
    E --&gt; F{Evaluation Time?}
    F --&gt;|Yes| G[Run Validation]
    F --&gt;|No| H[Forward Pass]
    G --&gt; H
    H --&gt; I[Compute Loss]
    I --&gt; J[Backward Pass]
    J --&gt; K[Gradient Clipping]
    K --&gt; L[Update Weights]
    L --&gt; M[Zero Gradients]
    M --&gt; N[Log Metrics]
    N --&gt; O{Checkpoint?}
    O --&gt;|Yes| P[Save Model State]
    O --&gt;|No| Q[Load Next Batch]
    P --&gt; Q
    Q --&gt; R{Max Iterations?}
    R --&gt;|No| D
    R --&gt;|Yes| S[Save Final Model]
    S --&gt; T[End Training]</pre>
    <figcaption>Figure 8: Training Process Flow</figcaption>
</figure>
<p>Now that we have a high-level overview of the training process, let us dig deeper into each phase and see how it works in practice.</p>
<h4 id="412-data-loading">4.1.2 Data Loading</h4>
<p>Data loading reads pre-tokenized sequences from binary files (<code>.bin</code>) using <code>np.memmap</code> for memory efficiency. The initial tokenization can take a long time on our 500M+ character corpus, but it is done only once and the result is saved to disk. This optimization was crucial during development: with nearly 100 training runs and many failures, re-tokenizing the entire corpus each time would have been prohibitively slow. The system handles the train/val split (90/10) with random sampling per batch and uses <code>pin_memory()</code> and <code>non_blocking=True</code> for faster GPU transfers.</p>
<p>When we run this for the first time, it takes a long time to load and tokenize the training corpus. We see this just starting in <a href="#fig9" class="figure-ref">Figure 9</a> below.</p>
<figure>
<img src="images/train16.png" alt="Tokenizer training data Screenshot" title="Tokenizer training data" id="fig9">
<figcaption><strong>Figure 9:</strong> Tokenizer training data</figcaption>
</figure>
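<p>The sampling pattern described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the project's loader: it uses the standard library's <code>mmap</code> as a stand-in for <code>np.memmap</code>, and the file name, the 16-bit token format, and the helper names <code>write_tokens</code> and <code>get_batch</code> are assumptions made for the example.</p>

```python
import mmap
import os
import random
import tempfile
from array import array


def write_tokens(path, tokens):
    """Store token ids as unsigned 16-bit integers (assumed file layout)."""
    with open(path, "wb") as f:
        array("H", tokens).tofile(f)


def get_batch(path, batch_size, block_size, rng):
    """Return (inputs, targets); each target row is its input row shifted by one token."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Map the file instead of reading it all into memory, then view it as uint16.
            with memoryview(mm) as raw, raw.cast("H") as data:
                xs, ys = [], []
                for _ in range(batch_size):
                    # Random start offset, leaving room for the shifted target window.
                    i = rng.randrange(0, len(data) - block_size)
                    xs.append(data[i:i + block_size].tolist())
                    ys.append(data[i + 1:i + 1 + block_size].tolist())
    return xs, ys


# Tiny demo corpus: token ids 0..999 written to a temporary .bin file.
tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.bin")
write_tokens(train_path, range(1000))

X, Y = get_batch(train_path, batch_size=4, block_size=8, rng=random.Random(0))
```

<p>Because the targets are the inputs shifted by one position, every row satisfies <code>x[1:] == y[:-1]</code>. In a PyTorch pipeline, the sampled windows would typically be converted to tensors and moved to the GPU using the <code>pin_memory()</code> and <code>non_blocking=True</code> pattern mentioned above.</p>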
<p>Batch sizes are optimized for our 2x A30 GPU setup: 18 per GPU for the SLM model (36 effective batch size) and 12 per GPU for the Regular model (24 effective batch size). These numbers balance memory usage with training stability – the SLM can handle larger batches thanks to its smaller 117M parameter count. In comparison, the Regular model&rsquo;s 354M parameters require smaller batches to fit in GPU memory.</p>
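<p>The effective batch sizes quoted above follow from simple arithmetic, sketched here. The per-GPU values and GPU count come from the text; the 512-token block size is the SLM value noted in the model configuration, and gradient accumulation is deliberately ignored, so treat the tokens-per-step figure as an illustration only.</p>

```python
NUM_GPUS = 2  # dual A30 setup described above

# Per-GPU micro-batch sizes quoted in the text.
per_gpu_batch = {"slm": 18, "regular": 12}

# With data-parallel training, each GPU processes its own micro-batch per step.
effective_batch = {name: size * NUM_GPUS for name, size in per_gpu_batch.items()}

# Assumption: block_size=512 for the SLM; gradient accumulation is not modeled.
slm_tokens_per_step = effective_batch["slm"] * 512

print(effective_batch)
print(slm_tokens_per_step)
```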
<p><a href="#fig10" class="figure-ref">Figure 10</a> below shows the dual GPU setup used for one of the training sessions for the Regular model.</p>
<figure>
<img src="images/gpu1.png" alt="GPU detail Screenshot" title="GPU details" id="fig10">
<figcaption><strong>Figure 10:</strong> GPU detail</figcaption>
</figure>
<h4 id="413-learning-rate-scheduling">4.1.3 Learning Rate Scheduling</h4>
<p>Learning Rate Scheduling uses cosine decay with warmup, a two-phase approach that helps prevent training instability. The warmup phase gradually increases the learning rate from 0 to the target value over 500 steps (SLM) or 1000 steps (Regular model), preventing the model from making large, destabilizing updates early in training.</p>
<p>After warmup, cosine decay smoothly reduces the learning rate along a cosine curve to 10% of the initial rate by the end of training. If you are not familiar with cosine decay, it is a scheduling strategy where the learning rate follows the shape of half a cosine wave: starting at the maximum value after warmup, it decreases slowly at first, then more rapidly in the middle of training, and finally levels off gently near the minimum value.</p>
<p>Mathematically, this follows the curve <code>lr = min_lr + (max_lr - min_lr) × 0.5 × (1 + cos(π × progress))</code>, where <code>progress</code> goes from 0 (start of decay) to 1 (end of training). Unlike linear decay (which drops at a constant rate) or step decay (which drops abruptly at fixed intervals), cosine decay provides a smooth, natural reduction that helps the model explore the loss landscape more effectively early on, then refine its parameters more precisely as training progresses.</p>
<p>The initial learning rates are chosen based on model size: 3e-4 (0.0003) for the SLM model and 3e-5 (0.00003) for the Regular model. The 10x difference reflects the Regular model&rsquo;s larger parameter count (354M vs 117M) - larger models typically need smaller learning rates to prevent gradient explosion. The cosine decay ensures the model converges smoothly rather than oscillating around the minimum, which is crucial for the complex patterns in historical text.</p>
<figure id="listing10"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">get_lr</span>(<span style="color:#91d7e3">self</span>, it):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Learning rate schedule&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    warmup_iters <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">500</span>
</span></span><span style="display:flex;"><span>    lr_decay_iters <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>max_iters
</span></span><span style="display:flex;"><span>    min_lr <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>learning_rate <span style="color:#91d7e3;font-weight:bold">*</span> <span style="color:#f5a97f">0.1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> it <span style="color:#91d7e3;font-weight:bold">&lt;</span> warmup_iters:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>learning_rate <span style="color:#91d7e3;font-weight:bold">*</span> (it <span style="color:#91d7e3;font-weight:bold">+</span> <span style="color:#f5a97f">1</span>) <span style="color:#91d7e3;font-weight:bold">/</span> (warmup_iters <span style="color:#91d7e3;font-weight:bold">+</span> <span style="color:#f5a97f">1</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> it <span style="color:#91d7e3;font-weight:bold">&gt;</span> lr_decay_iters:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> min_lr
</span></span><span style="display:flex;"><span>    decay_ratio <span style="color:#91d7e3;font-weight:bold">=</span> (it <span style="color:#91d7e3;font-weight:bold">-</span> warmup_iters) <span style="color:#91d7e3;font-weight:bold">/</span> (lr_decay_iters <span style="color:#91d7e3;font-weight:bold">-</span> warmup_iters)
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">assert</span> <span style="color:#f5a97f">0</span> <span style="color:#91d7e3;font-weight:bold">&lt;=</span> decay_ratio <span style="color:#91d7e3;font-weight:bold">&lt;=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    coeff <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0.5</span> <span style="color:#91d7e3;font-weight:bold">*</span> (<span style="color:#f5a97f">1.0</span> <span style="color:#91d7e3;font-weight:bold">+</span> math<span style="color:#91d7e3;font-weight:bold">.</span>cos(math<span style="color:#91d7e3;font-weight:bold">.</span>pi <span style="color:#91d7e3;font-weight:bold">*</span> decay_ratio))
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> min_lr <span style="color:#91d7e3;font-weight:bold">+</span> coeff <span style="color:#91d7e3;font-weight:bold">*</span> (<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>learning_rate <span style="color:#91d7e3;font-weight:bold">-</span> min_lr)</span></span></code></pre></div><figcaption>
        <strong>Listing 10: Learning Rate Scheduling Function</strong>
    </figcaption>
</figure>
<p>The code in <a href="#listing10" class="listing-ref">Listing 10</a> shows how the learning rate schedule is implemented. The function takes the current iteration number and returns the appropriate learning rate based on whether we&rsquo;re in the warmup phase (linear increase) or decay phase (cosine curve). The <code>warmup_iters</code> parameter controls the warmup duration, while <code>min_lr</code> sets the final learning rate to 10% of the initial value.</p>
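<p>To make the schedule concrete, here is a standalone version of the same logic evaluated at a few iterations. It is a sketch for experimentation, not the project's method: the function is detached from the trainer class, it uses the SLM values from the text (peak rate 3e-4, 500 warmup steps), and <code>max_iters=60_000</code> is a hypothetical run length chosen only for illustration.</p>

```python
import math


def get_lr(it, learning_rate=3e-4, warmup_iters=500, max_iters=60_000):
    """Linear warmup followed by cosine decay to 10% of the peak rate."""
    min_lr = learning_rate * 0.1
    if it < warmup_iters:
        # Warmup: ramp linearly toward the peak rate.
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > max_iters:
        # Past the end of the schedule: hold the floor.
        return min_lr
    # Decay: progress runs from 0 (end of warmup) to 1 (end of training).
    decay_ratio = (it - warmup_iters) / (max_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)


midpoint = (500 + 60_000) // 2
for it in (0, 499, 500, midpoint, 60_000):
    print(f"iter {it:>6}: lr = {get_lr(it):.3e}")
```

<p>With these defaults the rate peaks at 3e-4 right after warmup, sits at 1.65e-4 (halfway between peak and floor) at the midpoint of the decay, and ends at the 3e-5 floor.</p>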
<blockquote>
<p>As a side note, in case you are curious why <strong>cosine decay</strong> specifically makes sense, read on. Unlike linear decay (which drops at a constant rate throughout) or exponential decay (which drops fastest at the very start), cosine decay starts with a gentle slope that gradually steepens, then flattens out near the end. The learning rate therefore stays relatively high early on, giving the model room to explore and move past poor regions of the loss landscape, and then shrinks to small values late in training so the model can fine-tune more precisely. This single-cycle schedule should not be confused with cosine annealing with warm restarts, where the rate is periodically reset to its maximum.</p></blockquote>
<blockquote>
<p>The smooth, continuous nature of cosine decay prevents the learning rate from changing too abruptly, which could destabilize training. Given the historical text&rsquo;s complex linguistic patterns, this gradual, adaptive approach helps the model learn both general language structures and specific historical vocabulary without getting stuck in suboptimal solutions.</p></blockquote>
<h4 id="414-evaluation">4.1.4 Evaluation</h4>
<p>We run evaluations at regular intervals, set by <strong><code>eval_interval()</code></strong> (defaults to every 500 steps for SLM, 1000 for Regular), and compute loss on both the train and validation sets using the <strong><code>estimate_loss()</code></strong> method. The different intervals reflect the models&rsquo; training complexity: the SLM trains faster and benefits from more frequent checks, while the Regular model&rsquo;s longer runs can use less frequent evaluation.</p>
<p>The <code>estimate_loss()</code> function monitors training progress without disrupting the learning process. To ensure consistent measurements, it temporarily switches the model to evaluation mode (<strong><code>model.eval()</code></strong>). In this mode, dropout layers stop randomly dropping neurons (using the full network capacity), and any batch normalization layers use running statistics rather than recomputing them from each batch (GPT-style models typically rely on layer normalization, which behaves the same in both modes). This means the same input produces the same output every time, unlike in training mode, where dropout introduces randomness for regularization. Once the estimate is done, the model is switched back to training mode.</p>
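<p>The train/eval distinction for dropout can be illustrated with a toy layer. This is a pure-Python sketch of inverted dropout with a mode switch, not <code>torch.nn.Dropout</code> itself; the class name <code>TinyDropout</code> and its seed argument are hypothetical and exist only for this example.</p>

```python
import random


class TinyDropout:
    """Toy inverted dropout with a train/eval switch (illustrative only)."""

    def __init__(self, p=0.1, seed=None):
        self.p = p
        self.training = True
        self.rng = random.Random(seed)

    def eval(self):
        self.training = False
        return self

    def train(self):
        self.training = True
        return self

    def __call__(self, xs):
        if not self.training:
            return list(xs)  # evaluation mode: pass values through unchanged
        keep = 1.0 - self.p
        # Training mode: zero each value with probability p, rescale survivors by 1/keep.
        return [x / keep if self.rng.random() < keep else 0.0 for x in xs]


layer = TinyDropout(p=0.5, seed=0)
x = [1.0, 2.0, 3.0, 4.0]

train_out = layer(x)      # random: each entry is either 0.0 or doubled
eval_a = layer.eval()(x)  # deterministic
eval_b = layer(x)         # identical to eval_a
layer.train()             # switch back before resuming training
```

<p>In evaluation mode the layer is the identity, so repeated calls agree exactly; in training mode each call may zero a different subset of values.</p>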
<p>Rather than computing loss on the entire dataset (which would be too slow), <code>estimate_loss()</code> samples <code>eval_iters</code> random batches (default 100) from both training and validation sets. It computes the loss for each batch and returns the average, providing a representative estimate of model performance while remaining computationally efficient.</p>
<p>The evaluation process uses <strong><code>torch.no_grad()</code></strong> to disable gradient computation during validation. Gradients are the partial derivatives that tell us how to adjust each model parameter to reduce loss - they&rsquo;re computed during the backward pass and stored for the optimizer. During evaluation, we don&rsquo;t need gradients because we&rsquo;re not updating weights; we&rsquo;re just measuring performance.</p>
<p>Disabling gradient computation serves two purposes. First, it saves memory and time: PyTorch does not build the autograd graph or retain the intermediate activations needed for a backward pass, so evaluation batches use far less GPU memory and run faster. Without it, each evaluation forward pass would hold on to activations it never uses, which can push a memory-constrained run into out-of-memory errors. Second, it guarantees that evaluation cannot contribute to the gradients used for weight updates. The <code>no_grad()</code> context manager is therefore a standard part of an evaluation loop.</p>
<h4 id="415-forward-pass">4.1.5 Forward Pass</h4>
<p>A forward pass is when the model processes input data through its layers to produce a prediction. Think of it like asking the model a question: given a sequence of historical text tokens, &ldquo;what word should come next?&rdquo; The model flows the input forward through 12 transformer blocks (SLM) or 24 blocks (Regular), each applying self-attention (to understand relationships between words) and feed-forward operations (to transform and refine the representations). At the end, the model outputs a probability distribution over all possible next tokens.</p>
<p>The forward pass uses mixed precision training with <strong><code>torch.amp.autocast</code></strong> and bf16/fp16 data types, reducing memory usage by ~50% while maintaining training stability. Cross-entropy loss is computed by comparing the model&rsquo;s predicted probabilities with the actual next tokens in the training data; it measures how &ldquo;wrong&rdquo; the model&rsquo;s predictions are. The loss function handles variable sequence lengths by appropriately padding sequences. The mixed precision approach is particularly important for our historical text corpus, which contains long sequences that would otherwise exceed GPU memory limits.</p>
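<p>To see what the cross-entropy number means, here is a small worked example in plain Python. It is a sketch of the per-token loss only (no batching, no mixed precision), and the four-token vocabulary and logit values are made up for illustration.</p>

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


vocab_size = 4
uniform = cross_entropy([0.0] * vocab_size, target=2)      # model has no preference
confident = cross_entropy([0.0, 0.0, 5.0, 0.0], target=2)  # model favors the right token
wrong = cross_entropy([5.0, 0.0, 0.0, 0.0], target=2)      # model favors a wrong token

print(f"uniform={uniform:.4f} confident={confident:.4f} wrong={wrong:.4f}")
```

<p>A model that spreads probability uniformly over <code>V</code> tokens scores <code>ln(V)</code>, so a freshly initialized model should log a first loss near the natural log of its vocabulary size. That makes a handy sanity check before a long run.</p>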
<h4 id="416-backward-pass">4.1.6 Backward Pass</h4>
<p>After the forward pass tells us how wrong the model is (via the loss), the backward pass figures out how to fix it. Using <strong><code>loss.backward()</code></strong>, PyTorch computes gradients for every parameter in the model - these gradients tell us the direction and magnitude of changes needed to reduce the loss. It&rsquo;s like having a GPS telling you which way to move and how far, but for 117 million (SLM) or 354 million (Regular) parameters simultaneously.</p>
<p>The system applies gradient clipping with <strong><code>torch.nn.utils.clip_grad_norm_</code></strong> using a maximum norm of 1.0. Sometimes gradients can become extremely large, especially when processing complex or unusual historical text patterns. Without clipping, these huge gradients would cause the model parameters to jump wildly, potentially making the model unstable or causing it to &ldquo;forget&rdquo; what it learned. Clipping acts like a safety valve, limiting the maximum size of parameter updates to keep training stable. Put simply, gradient clipping caps the overall (global) gradient norm at a threshold; if it exceeds the limit, gradients are rescaled so the update stays bounded. In our early runs, omitting clipping occasionally produced NaN losses; keeping <code>max_norm=1.0</code> eliminated those spikes.</p>
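<p>The rescaling step can be written out in a few lines. The sketch below mirrors the idea behind global-norm clipping using plain Python lists in place of tensors; the function name <code>clip_grad_norm</code> and the small <code>eps</code> term are assumptions for this example, and it is not a substitute for <code>torch.nn.utils.clip_grad_norm_</code>.</p>

```python
import math


def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale all gradients together if their global L2 norm exceeds max_norm.

    grads is a list of per-parameter gradient lists; returns (clipped, total_norm).
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # One shared scale factor keeps the update direction unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm


# Two "parameters" whose combined gradient norm is sqrt(3^2 + 4^2 + 12^2) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_grad_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))

# Gradients already under the threshold pass through untouched.
small, _ = clip_grad_norm([[0.3, 0.4]], max_norm=1.0)
```

<p>Because every gradient is multiplied by the same factor, clipping shortens the update without changing its direction, which is why it stabilizes training without biasing it toward particular parameters.</p>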
<p>After computing gradients, the system updates the model weights using the <strong><code>AdamW</code></strong> optimizer, which applies the gradients with momentum and adaptive learning rates for each parameter. The optimizer decouples weight decay (a regularization technique to prevent overfitting) from gradient updates, improving generalization. Finally, gradients are zeroed with <strong><code>optimizer.zero_grad(set_to_none=True)</code></strong> - this clears the gradient buffers before the next iteration, preventing them from accumulating across batches. The <code>set_to_none=True</code> option sets each gradient to <code>None</code> instead of filling it with zeros, which skips a memory write and lets the gradient memory be freed until the next backward pass.</p>
<h4 id="417-checkpointing">4.1.7 Checkpointing</h4>
<p>Checkpointing saves model state, optimizer state, iteration number, and best validation loss whenever validation performance improves, rather than at every evaluation. This selective saving strategy provides multiple benefits: it conserves disk space (our 354M-parameter model checkpoints are ~1.4GB each), reduces I/O overhead that can slow down training, and improves overall training time by 5-10% by eliminating redundant disk writes. The system maintains only the last 5 checkpoints (as configured in <code>config.py</code>), with PyTorch&rsquo;s <code>torch.save()</code> writing all the training state needed to resume. We&rsquo;ll dive deeper into checkpointing strategies and implementation details in Section 6.</p>
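<p>The save-on-improvement and keep-last-N policy can be sketched independently of PyTorch. In the example below, <code>pickle</code> stands in for <code>torch.save()</code>, and the class name <code>BestCheckpointSaver</code>, the file naming scheme, and the loss values are all hypothetical; the project's actual implementation may differ.</p>

```python
import pickle
import tempfile
from pathlib import Path


class BestCheckpointSaver:
    """Save only when validation loss improves; keep the most recent keep_last files."""

    def __init__(self, out_dir, keep_last=5):
        self.out_dir = Path(out_dir)
        self.keep_last = keep_last
        self.best_val_loss = float("inf")
        self.saved = []  # checkpoint paths, oldest first

    def maybe_save(self, iter_num, val_loss, state):
        if val_loss >= self.best_val_loss:
            return None  # no improvement: skip the disk write entirely
        self.best_val_loss = val_loss
        path = self.out_dir / f"ckpt_{iter_num:07d}.pkl"
        with open(path, "wb") as f:
            pickle.dump({"iter_num": iter_num, "best_val_loss": val_loss, **state}, f)
        self.saved.append(path)
        # Prune the oldest checkpoints beyond the retention limit.
        while len(self.saved) > self.keep_last:
            self.saved.pop(0).unlink()
        return path


saver = BestCheckpointSaver(tempfile.mkdtemp(), keep_last=2)
history = [(500, 3.2), (1000, 2.9), (1500, 3.0), (2000, 2.7)]
written = [saver.maybe_save(it, loss, {"model": {}, "optimizer": {}}) for it, loss in history]
```

<p>In this run the evaluation at step 1500 does not improve on the best loss, so nothing is written, and the oldest checkpoint is removed once a third one is saved under a retention limit of two.</p>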
<p>The training loop implements standard optimization practices, including dynamic learning rate scheduling, regular evaluation and checkpointing, and comprehensive logging to WandB (as detailed in Section 5). The system automatically saves checkpoints when validation loss improves, ensuring that the best model is always preserved. The learning rate schedule uses cosine decay with warmup, which is standard practice for transformer training.</p>
<blockquote>
<p><strong>📁 Full Implementation</strong>:</p>
<ul>
<li>SLM: <a
	
		href = "https://github.com/bahree/helloLondon/blob/main/04_training/train_model_slm.py"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		<code>04_training/train_model_slm.py</code>
	</span>
</a></li>
<li>Regular Model: <a
	
		href = "https://github.com/bahree/helloLondon/blob/main/04_training/train_model.py"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		<code>04_training/train_model.py</code>
	</span>
</a></li>
</ul></blockquote>
<h3 id="42-model-initialization-setting-up-the-training-foundation">4.2 Model Initialization: Setting Up the Training Foundation</h3>
<p>Before the training loop can begin, the system must properly initialize the model, optimizer, and training infrastructure. The <strong><code>init_model()</code></strong> method handles this setup, ensuring everything is configured correctly for efficient training.</p>
<h4 id="421-model-configuration-and-creation">4.2.1 Model Configuration and Creation</h4>
<p>The initialization process starts by loading metadata from the tokenized data to ensure the model architecture matches the training data. The system reads vocabulary size, block size, and other parameters from the <code>meta.pkl</code> file created during data preparation, ensuring consistency between the model and the data it will be trained on.</p>
<p>The model configuration is built from the SLM parameters defined in <code>config.py</code>, including the number of layers (12), attention heads (12), embedding dimensions (768), and other architectural choices. This configuration is then used to create the <code>SimpleGPT</code> model instance, which inherits from PyTorch&rsquo;s <code>nn.Module</code> and provides all the functionality we discussed in the architecture section.</p>
<h4 id="422-optimizer-setup-and-configuration">4.2.2 Optimizer Setup and Configuration</h4>
<p>The optimizer is the algorithm that actually updates the model&rsquo;s parameters (weights and biases) during training. After the backward pass computes gradients (which tell us how to adjust each parameter), the optimizer applies those gradients to update the parameters and improve the model.</p>
<p>The system uses <strong>AdamW</strong> (Adam with Weight Decay), which is a popular optimizer for training transformers. AdamW combines the best of two approaches: Adam (which adapts the learning rate for each parameter individually, helping with convergence) and weight decay (a form of regularization that prevents overfitting by discouraging large parameter values).</p>
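<p>For readers who want to see the update rule itself, here is a single AdamW step for one scalar parameter. It is a simplified sketch that follows the standard published update (decay applied directly to the parameter, then a bias-corrected Adam step); the learning rate, betas, and weight decay are the values quoted in this post, while <code>eps=1e-8</code> is PyTorch's default and the function name <code>adamw_step</code> is made up for the example.</p>

```python
import math


def adamw_step(p, g, m, v, t, lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1):
    """One AdamW update for a single scalar parameter; returns (p, m, v)."""
    beta1, beta2 = betas
    # Decoupled weight decay: shrink the parameter directly, independent of the gradient.
    p = p * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


# With a zero gradient, only weight decay acts on the parameter.
p_decay, _, _ = adamw_step(p=1.0, g=0.0, m=0.0, v=0.0, t=1)

# With a nonzero gradient, the first step has magnitude close to lr regardless of gradient scale.
p_small, _, _ = adamw_step(p=0.0, g=0.01, m=0.0, v=0.0, t=1)
p_large, _, _ = adamw_step(p=0.0, g=100.0, m=0.0, v=0.0, t=1)
```

<p>The two nonzero-gradient cases land on almost the same value even though the gradients differ by four orders of magnitude, which is the per-parameter adaptivity described above.</p>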
<p>However, not all parameters should be regularized the same way. The optimizer splits parameters into two groups for different weight decay:</p>
<ul>
<li><strong>2D parameters</strong> (weight matrices): These are the main &ldquo;learnable&rdquo; parts of the model - the connections between neurons in different layers. These receive weight decay (value 0.1) to prevent them from growing too large, which helps prevent overfitting.</li>
<li><strong>1D parameters</strong> (biases and, in a typical implementation, normalization weights): These are per-feature offsets and scales rather than connections between neurons. They don&rsquo;t receive weight decay (value 0.0) because regularizing them does little to reduce overfitting and can actually hurt performance.</li>
</ul>
<p>This two-group approach follows standard practices for transformer training and ensures the model generalizes well to unseen historical text.</p>
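<p>The grouping rule can be sketched without PyTorch by working on parameter shapes alone. The parameter names and shapes below are hypothetical stand-ins for one 768-wide transformer block, and <code>split_param_groups</code> is an illustrative helper, not the project's <code>configure_optimizers()</code>.</p>

```python
def split_param_groups(named_shapes, weight_decay=0.1):
    """Group parameters by dimensionality: 2D+ tensors get weight decay, 1D tensors do not."""
    decay = [name for name, shape in named_shapes.items() if len(shape) >= 2]
    no_decay = [name for name, shape in named_shapes.items() if len(shape) < 2]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


# Hypothetical parameter names and shapes for one 768-wide transformer block.
named_shapes = {
    "attn.qkv.weight": (2304, 768),
    "attn.proj.weight": (768, 768),
    "mlp.fc.weight": (3072, 768),
    "mlp.fc.bias": (3072,),
    "ln_1.weight": (768,),
}

groups = split_param_groups(named_shapes)
```

<p>In a real PyTorch setup, the same test would typically be applied to the tensors from <code>model.named_parameters()</code>, and the resulting list of dictionaries passed to the optimizer constructor as its parameter groups.</p>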
<p>Modern PyTorch supports &ldquo;fused&rdquo; optimizer operations, which combine multiple steps into a single, faster GPU kernel. Instead of executing separate operations (unscale gradients, update parameters, update optimizer state), fused AdamW performs all three in a single optimized GPU operation. This can provide 10-20% speedup on modern GPUs. The system automatically detects whether your PyTorch version supports fused operations and uses them when available, falling back to the standard implementation otherwise.</p>
<p>Concretely, we use AdamW with the following settings for this project: <code>betas=(0.9, 0.95)</code>, <code>weight_decay=0.1</code>, and the learning rate provided by the scheduler (warmup + cosine decay). The AdamW <code>eps</code> parameter is left at the PyTorch default unless you change it in code. When available, the fused AdamW kernel is enabled automatically. See <a href="#listing11" class="listing-ref">Listing 11</a> for the exact call in <code>init_model()</code>.</p>
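<p>One common way to implement this kind of feature detection is to inspect the optimizer constructor's signature. The sketch below shows the pattern with stand-in functions so it runs without PyTorch; <code>supports_kwarg</code>, <code>build</code>, and both dummy optimizers are hypothetical, and the project's own detection logic may differ.</p>

```python
import inspect


def supports_kwarg(fn, name):
    """Return True if the callable's signature declares a parameter called name."""
    return name in inspect.signature(fn).parameters


# Stand-ins for an optimizer constructor with and without a fused option.
def new_style_optimizer(params, lr=1e-3, fused=None):
    return {"params": params, "lr": lr, "fused": fused}


def old_style_optimizer(params, lr=1e-3):
    return {"params": params, "lr": lr}


def build(optimizer_fn, params, on_gpu):
    # Only pass fused=True when the constructor accepts it and we are on a GPU.
    extra = {"fused": True} if on_gpu and supports_kwarg(optimizer_fn, "fused") else {}
    return optimizer_fn(params, lr=3e-4, **extra)


with_fused = build(new_style_optimizer, params=[], on_gpu=True)
fallback = build(old_style_optimizer, params=[], on_gpu=True)
cpu_run = build(new_style_optimizer, params=[], on_gpu=False)
```

<p>With PyTorch installed, the same check would read <code>supports_kwarg(torch.optim.AdamW, "fused")</code>, which lets the code fall back cleanly on versions that predate the fused kernel.</p>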
<h4 id="423-model-compilation-and-multi-gpu-setup">4.2.3 Model Compilation and Multi-GPU Setup</h4>
<p>Model compilation with PyTorch&rsquo;s <strong><code>torch.compile</code></strong> is similar to traditional code compilation, but with important differences. A traditional compiler (such as <code>gcc</code> for C) transforms source code into optimized machine code once, ahead of time, and the result then runs faster. Similarly, <code>torch.compile</code> takes your model&rsquo;s computation graph and optimizes it, but it does this <strong>at runtime</strong> rather than ahead of time.</p>
<p>The compilation process analyzes your model&rsquo;s operations (matrix multiplications, attention layers, etc.) and generates optimized kernels tuned to your hardware. This includes <strong>operator fusion</strong> (combining multiple operations into single GPU kernels), <strong>memory layout optimization</strong> (arranging data for better cache usage), and <strong>kernel selection</strong> (choosing the fastest implementation for your specific GPU). The result is often 1.2-1.5x faster training, but with an initial &ldquo;warmup&rdquo; cost: the first few forward/backward passes are slower while PyTorch analyzes the model and generates optimized code.</p>
<p>This differs from traditional compilation because the optimization happens dynamically based on actual input shapes and hardware capabilities, rather than being pre-computed. It&rsquo;s more like a JIT compiler that specializes your model&rsquo;s operations for the exact conditions it encounters during training.</p>
<p>For multi-GPU training, the model is wrapped with <code>DistributedDataParallel</code> (DDP), which enables parallel training across multiple GPUs. The DDP wrapper handles gradient synchronization and ensures that all GPUs work with identical model parameters throughout training.</p>
<figure id="listing11"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">init_model</span>(<span style="color:#91d7e3">self</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Initialize the model&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#a6da95">&#34;Initializing model...&#34;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Load metadata from tokenized data</span>
</span></span><span style="display:flex;"><span>    meta_path <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>data_dir <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#a6da95">&#34;meta.pkl&#34;</span>
</span></span><span style="display:flex;"><span>    meta_vocab_size <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">None</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> meta_path<span style="color:#91d7e3;font-weight:bold">.</span>exists():
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">with</span> <span style="color:#91d7e3">open</span>(meta_path, <span style="color:#a6da95">&#39;rb&#39;</span>) <span style="color:#c6a0f6">as</span> f:
</span></span><span style="display:flex;"><span>            meta <span style="color:#91d7e3;font-weight:bold">=</span> pickle<span style="color:#91d7e3;font-weight:bold">.</span>load(f)
</span></span><span style="display:flex;"><span>        meta_vocab_size <span style="color:#91d7e3;font-weight:bold">=</span> meta[<span style="color:#a6da95">&#39;vocab_size&#39;</span>]
</span></span><span style="display:flex;"><span>        logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Found vocab_size = </span><span style="color:#a6da95">{</span>meta_vocab_size<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Create model configuration</span>
</span></span><span style="display:flex;"><span>    model_args <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">dict</span>(
</span></span><span style="display:flex;"><span>        n_layer<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>n_layer,        <span style="color:#6e738d;font-style:italic"># 12 for SLM</span>
</span></span><span style="display:flex;"><span>        n_head<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>n_head,          <span style="color:#6e738d;font-style:italic"># 12 for SLM  </span>
</span></span><span style="display:flex;"><span>        n_embd<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>n_embd,          <span style="color:#6e738d;font-style:italic"># 768 for SLM</span>
</span></span><span style="display:flex;"><span>        block_size<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>block_size,  <span style="color:#6e738d;font-style:italic"># 512 for SLM</span>
</span></span><span style="display:flex;"><span>        bias<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>bias,              <span style="color:#6e738d;font-style:italic"># False</span>
</span></span><span style="display:flex;"><span>        vocab_size<span style="color:#91d7e3;font-weight:bold">=</span>meta_vocab_size,  <span style="color:#6e738d;font-style:italic"># From tokenized data</span>
</span></span><span style="display:flex;"><span>        dropout<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>dropout         <span style="color:#6e738d;font-style:italic"># 0.1</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Create and configure model</span>
</span></span><span style="display:flex;"><span>    gptconf <span style="color:#91d7e3;font-weight:bold">=</span> SimpleGPTConfig(<span style="color:#91d7e3;font-weight:bold">**</span>model_args)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model <span style="color:#91d7e3;font-weight:bold">=</span> SimpleGPT(gptconf)
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model<span style="color:#91d7e3;font-weight:bold">.</span>to(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>device)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Initialize optimizer with proper parameter groups</span>
</span></span><span style="display:flex;"><span>    <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>optimizer <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model<span style="color:#91d7e3;font-weight:bold">.</span>configure_optimizers(
</span></span><span style="display:flex;"><span>        weight_decay<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">0.1</span>,
</span></span><span style="display:flex;"><span>        learning_rate<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>learning_rate,
</span></span><span style="display:flex;"><span>        betas<span style="color:#91d7e3;font-weight:bold">=</span>(<span style="color:#f5a97f">0.9</span>, <span style="color:#f5a97f">0.95</span>),
</span></span><span style="display:flex;"><span>        device_type<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;cuda&#39;</span> <span style="color:#c6a0f6">if</span> <span style="color:#a6da95">&#39;cuda&#39;</span> <span style="color:#91d7e3;font-weight:bold">in</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>device <span style="color:#c6a0f6">else</span> <span style="color:#a6da95">&#39;cpu&#39;</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Compile model for performance</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>cuda<span style="color:#91d7e3;font-weight:bold">.</span>is_available() <span style="color:#91d7e3;font-weight:bold">and</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>slm_config<span style="color:#91d7e3;font-weight:bold">.</span>get(<span style="color:#a6da95">&#34;enable_compile&#34;</span>, <span style="color:#f5a97f">True</span>):
</span></span><span style="display:flex;"><span>        logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#a6da95">&#34;Compiling model...&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>compile(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model, mode<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;reduce-overhead&#39;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Wrap with DDP for multi-GPU training</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ddp:
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model <span style="color:#91d7e3;font-weight:bold">=</span> DDP(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model, device_ids<span style="color:#91d7e3;font-weight:bold">=</span>[<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ddp_local_rank])
</span></span><span style="display:flex;"><span>        param_count <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model<span style="color:#91d7e3;font-weight:bold">.</span>module<span style="color:#91d7e3;font-weight:bold">.</span>get_num_params()
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>        param_count <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model<span style="color:#91d7e3;font-weight:bold">.</span>get_num_params()
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Model initialized with </span><span style="color:#a6da95">{</span>param_count<span style="color:#a6da95">:</span><span style="color:#a6da95">,</span><span style="color:#a6da95">}</span><span style="color:#a6da95"> parameters&#34;</span>)</span></span></code></pre></div><figcaption>
        <strong>Listing 11: Model Initialization Process</strong>
    </figcaption>
</figure>
<p>While our model is a relatively simple toy example focused on a single domain (historical London text), proper initialization remains important to avoid common training issues. The vocabulary size must match our custom historical tokenizer, the sequence length needs to work with our tokenized data, and the model architecture should be appropriate for the text patterns we&rsquo;re learning.</p>
<p>The initialization process ensures these basic requirements are met before training begins, preventing issues such as vocabulary mismatches or memory allocation problems that could lead to training failures. This careful setup was helpful during our development process, where we ran nearly 100 training experiments. Proper initialization helped us avoid some basic configuration errors and focus on the actual training challenges.</p>
<p>Reproducibility and random seeds: To make runs repeatable on the same hardware, we set a deterministic seed per process using <code>torch.manual_seed(1337 + seed_offset)</code>, where <code>seed_offset</code> is the DDP rank (0 for single‑GPU). This gives consistent data shuffling and initialization across restarts while keeping each process distinct under DDP. Note that some CUDA kernels (and AMP/bf16) can introduce non‑determinism; for strict determinism, you may also configure PyTorch’s deterministic flags at the cost of performance.</p>
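<p>As a minimal sketch of the per-process seeding described above, the helper below derives a seed from the <code>RANK</code> environment variable that <code>torchrun</code> sets. The function name <code>seed_for_rank</code> and the choice to read the rank from the environment are assumptions for illustration; only the base value 1337 comes from the text. In the training script the result would be passed to <code>torch.manual_seed</code>; here Python's <code>random</code> module stands in so the snippet runs without PyTorch.</p>

```python
import os
import random


def seed_for_rank(base_seed=1337, env=None):
    """Return base_seed + DDP rank (rank is 0 when RANK is unset, i.e. single GPU)."""
    env = os.environ if env is None else env
    seed_offset = int(env.get("RANK", 0))
    return base_seed + seed_offset


# Hypothetical usage: each process seeds its own generator.
# In the real script this value would go to torch.manual_seed(seed).
seed = seed_for_rank()
random.seed(seed)
```

<p>Each rank gets a distinct but repeatable seed, so restarts reproduce the same per-process random stream on the same hardware.</p>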
<h2 id="5-wandb-integration">5. WandB Integration</h2>
<p><strong>Weights &amp; Biases (WandB)</strong> is an experiment tracking and monitoring platform designed for machine learning projects. Think of it as a flight recorder (a &ldquo;black box&rdquo;) for your training runs: it records the metrics, configuration, and system stats from each run so you can understand what worked, what didn&rsquo;t, and why.</p>
<p>Training a language model is a long‑running experiment. Without live telemetry, we are flying blind and can’t tell whether learning is stable, whether hardware is saturated, or whether runs are comparable. WandB gives real‑time visibility, remote monitoring, and reproducibility. It records loss, learning rate, and perplexity over time; captures GPU utilization and iteration latency; logs configuration and artifacts; and lets you compare runs side‑by‑side to understand which settings worked.</p>
<p>Our training system wires WandB in directly: it logs the run configuration automatically, streams metrics in real time (loss, perplexity, and learning rate), integrates with model checkpoints, and monitors resources (GPU utilization and memory usage). This makes it easier to compare training runs, identify better configurations, and reproduce successful experiments.</p>
<p><strong>Understanding WandB integration:</strong></p>
<p><a href="#listing12" class="listing-ref">Listing 12</a> logs the signals you need to fly by instruments: the training loss to track learning, the learning rate to confirm warmup and decay, and MFU and iteration timing to gauge throughput and stability. This is more than record keeping; it is how you compare runs and catch issues early.</p>
<p>This real-time monitoring lets us spot problems early, compare different training approaches, and ensure our historical language model is learning properly over days or weeks.</p>
<figure id="listing12"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Log to WandB - loss first for better mobile UI</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>use_wandb:
</span></span><span style="display:flex;"><span>    wandb<span style="color:#91d7e3;font-weight:bold">.</span>log({
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;train/loss&#34;</span>: lossf,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;train/lr&#34;</span>: lr,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;train/iter&#34;</span>: iter_num,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;train/mfu&#34;</span>: running_mfu <span style="color:#91d7e3;font-weight:bold">*</span> <span style="color:#f5a97f">100</span> <span style="color:#c6a0f6">if</span> running_mfu <span style="color:#91d7e3;font-weight:bold">&gt;</span> <span style="color:#f5a97f">0</span> <span style="color:#c6a0f6">else</span> <span style="color:#f5a97f">0</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;train/dt_ms&#34;</span>: dt <span style="color:#91d7e3;font-weight:bold">*</span> <span style="color:#f5a97f">1000</span>,
</span></span><span style="display:flex;"><span>    })</span></span></code></pre></div><figcaption>
        <strong>Listing 12: WandB Integration and Logging</strong>
    </figcaption>
</figure>
<p>The system logs training loss, learning rate, iteration number, model FLOPs utilization (MFU), and time per iteration in milliseconds. Together, these metrics give a working view of training progress, efficiency, and potential issues.</p>
<p>The most useful dials to watch are training loss (should steadily trend from ~8–10 toward ~2–4), MFU (a proxy for GPU efficiency, logged as a percentage; single‑digit values are a common starting point, and the mid‑20s are achievable with good tuning), the learning‑rate curve (warmup then cosine decay), and iteration time (a practical signal for throughput and stalls).</p>
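<p>The warmup-then-cosine-decay shape of the learning-rate curve can be sketched in a few lines. The parameter names and values below (<code>max_lr</code>, <code>min_lr</code>, <code>warmup_iters</code>, <code>lr_decay_iters</code>) are illustrative assumptions, not the project's actual settings; only the 60,000-iteration horizon comes from the text.</p>

```python
import math


def get_lr(it, max_lr=3e-4, min_lr=3e-5, warmup_iters=2000, lr_decay_iters=60000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if it < warmup_iters:
        # Linear warmup: ramp from near zero up to max_lr.
        return max_lr * (it + 1) / warmup_iters
    if it >= lr_decay_iters:
        # Past the decay horizon: hold at the floor.
        return min_lr
    # Cosine decay: progress goes 0 -> 1 between warmup end and the horizon.
    progress = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)


# Print a few points along the schedule to see the overall shape.
for step in (0, 1999, 2000, 31000, 60000):
    print(step, f"{get_lr(step):.2e}")
```

<p>Plotting <code>train/lr</code> in WandB against a function like this is a quick way to confirm that the schedule in a run matches what was intended.</p>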
<p>Both SLM and Regular model training runs complete 60,000 iterations, providing consistent training depth across both model variants. <a href="#fig11" class="figure-ref">Figure 11</a> below shows the complete training experience for our Regular model (354M parameters), demonstrating both the console output and WandB&rsquo;s comprehensive monitoring capabilities.</p>
<figure>
<img src="images/train17-regular.png" alt="Complete training run output showing console logs and WandB summary" title="Complete training run with WandB monitoring" id="fig11">
<figcaption><strong>Figure 11:</strong> Complete training run output showing console logs and WandB monitoring for Regular model (354M parameters)</figcaption>
</figure>
<p>The screenshot in <a href="#fig11" class="figure-ref">Figure 11</a> captures the final moments of a successful 60,000-iteration training run, showing both the real-time console output and WandB&rsquo;s run summary. The logs cover the final iterations (59,850 to 60,000), with training loss falling from 3.0575 to 2.7063, consistent with healthy convergence.</p>
<p>The WandB run summary provides the complete picture: a final training loss of 2.70315, a validation loss of 3.61921, and a validation perplexity of 37.31, all indicating successful model training. The system automatically saved the final checkpoint and cleaned up old checkpoints, while WandB captured the entire training journey with detailed metrics tracking. This comprehensive monitoring approach ensures we can both track progress in real time and analyze the full training history afterward.</p>
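<p>Validation perplexity is the exponential of the validation cross-entropy loss, so the two numbers in the run summary can be cross-checked. The short check below uses only the values reported above; the variable names are illustrative.</p>

```python
import math

val_loss = 3.61921  # validation loss from the WandB run summary
val_perplexity = math.exp(val_loss)  # perplexity = e^(cross-entropy loss)

print(f"perplexity = {val_perplexity:.2f}")  # agrees with the reported 37.31
```

<p>Because the relationship is exponential, small drops in loss translate into noticeably larger drops in perplexity.</p>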
<h2 id="6-checkpointing-and-model-persistence">6. Checkpointing and Model Persistence</h2>
<p>Checkpointing is one of the most critical aspects of training language models, where a single run can take days or weeks. A robust checkpointing system limits how much training progress is lost to hardware failures, power outages, or other interruptions. In this section, we&rsquo;ll explore the checkpointing system built for the <strong><code>helloLondon</code></strong> project, covering everything from basic checkpoint creation to resume functionality.</p>
<h3 id="61-checkpoint-system">6.1 Checkpoint System</h3>
<p>The training loop saves checkpoints that capture the state needed to resume, so an interrupted run picks up from the last saved checkpoint rather than from iteration zero. This matters for any model whose training takes a long time.</p>
<p>Each checkpoint packages four essentials: the model weights (so learning is preserved), the optimizer state (so momentum and adaptive stats resume cleanly), the current iteration (so schedules pick up in the right place), and the best validation loss to date (so we only promote genuinely better models). Together, these let you stop and restart without losing training dynamics.</p>
<p>The code in <a href="#listing13" class="listing-ref">Listing 13</a> shows how these components are saved when validation loss improves:</p>
<figure id="listing13"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">if</span> losses[<span style="color:#a6da95">&#39;val&#39;</span>] <span style="color:#91d7e3;font-weight:bold">&lt;</span> best_val_loss:
</span></span><span style="display:flex;"><span>    best_val_loss <span style="color:#91d7e3;font-weight:bold">=</span> losses[<span style="color:#a6da95">&#39;val&#39;</span>]
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> iter_num <span style="color:#91d7e3;font-weight:bold">&gt;</span> <span style="color:#f5a97f">0</span>:
</span></span><span style="display:flex;"><span>        checkpoint <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;model&#39;</span>: raw_model<span style="color:#91d7e3;font-weight:bold">.</span>state_dict(),
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;optimizer&#39;</span>: <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>optimizer<span style="color:#91d7e3;font-weight:bold">.</span>state_dict(),
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;iter_num&#39;</span>: iter_num,
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#39;best_val_loss&#39;</span>: best_val_loss,
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>        checkpoint_path <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>output_dir <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#ed8796">f</span><span style="color:#a6da95">&#39;checkpoint-</span><span style="color:#a6da95">{</span>iter_num<span style="color:#a6da95">}</span><span style="color:#a6da95">.pt&#39;</span>
</span></span><span style="display:flex;"><span>        logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Saving checkpoint to </span><span style="color:#a6da95">{</span>checkpoint_path<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>        torch<span style="color:#91d7e3;font-weight:bold">.</span>save(checkpoint, checkpoint_path)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Clean up old checkpoints - keep only the most recent (keep_last defaults to 5)</span>
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>cleanup_old_checkpoints()</span></span></code></pre></div><figcaption>
        <strong>Listing 13: Checkpointing and Model Persistence</strong>
    </figcaption>
</figure>
<p>The checkpointing system uses a simple approach: it saves a checkpoint only when the validation loss improves, rather than at every evaluation. This serves three purposes. First, the newest checkpoint on disk is always the best-performing model so far, not just the most recent one. Second, it reduces I/O overhead during training, as checkpoint saves can be expensive operations (our 354M parameter model checkpoints are ~1.4GB each). Third, it avoids filling the disk with suboptimal checkpoints. By eliminating redundant disk writes, this selective approach can reduce overall training time by 5-10%. The trade-off is that an interruption loses any progress made since the last improvement in validation loss.</p>
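<p>To make the save-on-improvement rule concrete, the sketch below replays a made-up sequence of validation losses and reports which iterations would write a checkpoint. The function name, the <code>eval_interval</code> of 1,000, and the loss values are all hypothetical; the comparison logic mirrors <a href="#listing13" class="listing-ref">Listing 13</a>.</p>

```python
def iterations_that_checkpoint(val_losses, eval_interval=1000):
    """Return the iterations at which a checkpoint would be written."""
    best_val_loss = 1e9  # same sentinel the resume code falls back to
    saved_at = []
    for i, val_loss in enumerate(val_losses):
        iter_num = i * eval_interval
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            if iter_num > 0:  # the evaluation at iteration 0 never saves
                saved_at.append(iter_num)
    return saved_at


# Made-up validation losses, one per evaluation.
losses = [9.8, 6.1, 4.9, 5.0, 4.7, 4.7, 4.2]
print(iterations_that_checkpoint(losses))
```

<p>Evaluations where the loss ties or regresses write nothing, which is where the I/O savings come from.</p>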
<h3 id="62-checkpoint-management-and-cleanup">6.2 Checkpoint Management and Cleanup</h3>
<p>Since our 354M parameter model checkpoints are ~1.4GB each, we need to clean up old checkpoints to avoid running out of disk space. The system automatically keeps only the last 5 checkpoints and deletes older ones (as defined in <code>config.py</code>). The cleanup function in <a href="#listing14" class="listing-ref">Listing 14</a> finds all checkpoint files, sorts them by modification time (newest first), and deletes everything except the most recent 5. Only the master process handles cleanup to avoid race conditions in multi-GPU setups.</p>
<figure id="listing14"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">cleanup_old_checkpoints</span>(<span style="color:#91d7e3">self</span>, keep_last<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">5</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Clean up old checkpoints, keeping only the last N&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3;font-weight:bold">not</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>master_process:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span>  <span style="color:#6e738d;font-style:italic"># Only the master process should clean up</span>
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">try</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Find all checkpoint files</span>
</span></span><span style="display:flex;"><span>        checkpoint_files <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">list</span>(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>output_dir<span style="color:#91d7e3;font-weight:bold">.</span>glob(<span style="color:#a6da95">&#34;checkpoint-*.pt&#34;</span>))
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">len</span>(checkpoint_files) <span style="color:#91d7e3;font-weight:bold">&lt;=</span> keep_last:
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">return</span>  <span style="color:#6e738d;font-style:italic"># Not enough checkpoints to clean up</span>
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Sort by modification time (newest first)</span>
</span></span><span style="display:flex;"><span>        checkpoint_files<span style="color:#91d7e3;font-weight:bold">.</span>sort(key<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#c6a0f6">lambda</span> x: x<span style="color:#91d7e3;font-weight:bold">.</span>stat()<span style="color:#91d7e3;font-weight:bold">.</span>st_mtime, reverse<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Keep the newest ones, delete the rest</span>
</span></span><span style="display:flex;"><span>        files_to_delete <span style="color:#91d7e3;font-weight:bold">=</span> checkpoint_files[keep_last:]
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">for</span> file_path <span style="color:#91d7e3;font-weight:bold">in</span> files_to_delete:
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">try</span>:
</span></span><span style="display:flex;"><span>                file_path<span style="color:#91d7e3;font-weight:bold">.</span>unlink()
</span></span><span style="display:flex;"><span>                logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Deleted old checkpoint: </span><span style="color:#a6da95">{</span>file_path<span style="color:#91d7e3;font-weight:bold">.</span>name<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">Exception</span> <span style="color:#c6a0f6">as</span> e:
</span></span><span style="display:flex;"><span>                logger<span style="color:#91d7e3;font-weight:bold">.</span>warning(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Failed to delete checkpoint </span><span style="color:#a6da95">{</span>file_path<span style="color:#91d7e3;font-weight:bold">.</span>name<span style="color:#a6da95">}</span><span style="color:#a6da95">: </span><span style="color:#a6da95">{</span>e<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>                
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">Exception</span> <span style="color:#c6a0f6">as</span> e:
</span></span><span style="display:flex;"><span>        logger<span style="color:#91d7e3;font-weight:bold">.</span>warning(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Checkpoint cleanup failed: </span><span style="color:#a6da95">{</span>e<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)</span></span></code></pre></div><figcaption>
        <strong>Listing 14: Checkpoint Cleanup and Management</strong>
    </figcaption>
</figure>
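<p>The retention behaviour can be tried outside the trainer with a standalone variant of the same logic. This is a sketch under stated assumptions: the free function below takes <code>output_dir</code> as an argument instead of using <code>self</code>, skips the master-process check and error handling, and operates on empty stand-in files in a temporary directory.</p>

```python
import os
import tempfile
from pathlib import Path


def cleanup_old_checkpoints(output_dir, keep_last=5):
    """Delete all but the keep_last most recently modified checkpoint files."""
    checkpoint_files = list(Path(output_dir).glob("checkpoint-*.pt"))
    # Newest first, so everything past index keep_last is stale.
    checkpoint_files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    deleted = []
    for file_path in checkpoint_files[keep_last:]:
        file_path.unlink()
        deleted.append(file_path.name)
    return sorted(deleted)


with tempfile.TemporaryDirectory() as tmp:
    # Create seven empty stand-in checkpoints with increasing modification times.
    for n, iter_num in enumerate(range(1000, 8000, 1000)):
        path = Path(tmp) / f"checkpoint-{iter_num}.pt"
        path.touch()
        os.utime(path, (1_700_000_000 + n, 1_700_000_000 + n))
    removed = cleanup_old_checkpoints(tmp)
    remaining = sorted(p.name for p in Path(tmp).glob("checkpoint-*.pt"))
    print("removed:", removed)
    print("remaining:", remaining)
```

<p>Sorting by modification time rather than by filename avoids string-ordering surprises, since <code>checkpoint-10000.pt</code> sorts before <code>checkpoint-9000.pt</code> alphabetically.</p>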
<h3 id="63-resume-training-functionality">6.3 Resume Training Functionality</h3>
<p>The ability to resume training from any checkpoint is useful when training gets interrupted. This functionality lets you pick up where you left off, whether the interruption was a few minutes or longer.</p>
<p>The resume functionality loads a checkpoint file and restores the training state: the model weights, optimizer state, current iteration number, and best validation loss. If checkpoint loading fails, the code falls back to starting from scratch.</p>
<p>When loading checkpoints, the code handles two practical considerations. First, the <strong><code>map_location=self.device</code></strong> parameter ensures the checkpoint tensors load onto the correct device (CPU or GPU), which matters if you&rsquo;re resuming on different hardware or after a restart. Second, in multi-GPU setups, <code>DistributedDataParallel</code> wraps the model and exposes the original under its <code>.module</code> attribute, so the code uses <code>raw_model = self.model.module if self.ddp else self.model</code> to reach the underlying model before loading weights.</p>
<figure id="listing15"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">resume_from_checkpoint_file</span>(<span style="color:#91d7e3">self</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Resume training from a checkpoint file&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3;font-weight:bold">not</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>resume_from_checkpoint:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span>
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>    checkpoint_path <span style="color:#91d7e3;font-weight:bold">=</span> Path(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>resume_from_checkpoint)
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3;font-weight:bold">not</span> checkpoint_path<span style="color:#91d7e3;font-weight:bold">.</span>exists():
</span></span><span style="display:flex;"><span>        logger<span style="color:#91d7e3;font-weight:bold">.</span>error(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Checkpoint file not found: </span><span style="color:#a6da95">{</span>checkpoint_path<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span>
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>    logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Resuming from checkpoint: </span><span style="color:#a6da95">{</span>checkpoint_path<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">try</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Load checkpoint</span>
</span></span><span style="display:flex;"><span>        checkpoint <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>load(checkpoint_path, map_location<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>device)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Load model state</span>
</span></span><span style="display:flex;"><span>        raw_model <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model<span style="color:#91d7e3;font-weight:bold">.</span>module <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>ddp <span style="color:#c6a0f6">else</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>model
</span></span><span style="display:flex;"><span>        raw_model<span style="color:#91d7e3;font-weight:bold">.</span>load_state_dict(checkpoint[<span style="color:#a6da95">&#39;model&#39;</span>])
</span></span><span style="display:flex;"><span>        logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#a6da95">&#34;Model state loaded successfully&#34;</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Load optimizer state</span>
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>optimizer<span style="color:#91d7e3;font-weight:bold">.</span>load_state_dict(checkpoint[<span style="color:#a6da95">&#39;optimizer&#39;</span>])
</span></span><span style="display:flex;"><span>        logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#a6da95">&#34;Optimizer state loaded successfully&#34;</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Get iteration number and best validation loss</span>
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>start_iter <span style="color:#91d7e3;font-weight:bold">=</span> checkpoint<span style="color:#91d7e3;font-weight:bold">.</span>get(<span style="color:#a6da95">&#39;iter_num&#39;</span>, <span style="color:#f5a97f">0</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>best_val_loss <span style="color:#91d7e3;font-weight:bold">=</span> checkpoint<span style="color:#91d7e3;font-weight:bold">.</span>get(<span style="color:#a6da95">&#39;best_val_loss&#39;</span>, <span style="color:#f5a97f">1e9</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Resuming from iteration: </span><span style="color:#a6da95">{</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>start_iter<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>        logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Best validation loss so far: </span><span style="color:#a6da95">{</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>best_val_loss<span style="color:#a6da95">:</span><span style="color:#a6da95">.4f</span><span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">except</span> <span style="color:#f5a97f">Exception</span> <span style="color:#c6a0f6">as</span> e:
</span></span><span style="display:flex;"><span>        logger<span style="color:#91d7e3;font-weight:bold">.</span>error(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Failed to load checkpoint: </span><span style="color:#a6da95">{</span>e<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>        logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#a6da95">&#34;Starting training from scratch...&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>start_iter <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">0</span>
</span></span><span style="display:flex;"><span>        <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>best_val_loss <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">1e9</span></span></span></code></pre></div><figcaption>
        <strong>Listing 15: Resume Training from Checkpoint</strong>
    </figcaption>
</figure>
<p>The function loads the model weights, optimizer state, iteration number, and best validation loss from the checkpoint file, then continues training from where it left off. If the checkpoint file doesn&rsquo;t exist or can&rsquo;t be loaded, it logs an error and starts training from scratch. Since our Regular model takes 28-32 hours to train, resuming from a checkpoint saves significant time when training is interrupted by power outages, crashes, or manual stops.</p>
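<p>The resume function expects a path to a specific checkpoint file. One way to pick that path after an interruption is to scan the output directory for the highest iteration number. The helper below is a hypothetical convenience, not part of the project's code; it assumes only the <code>checkpoint-{iter_num}.pt</code> naming shown in <a href="#listing13" class="listing-ref">Listing 13</a>.</p>

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint path with the highest iteration number, or None."""
    pattern = re.compile(r"checkpoint-(\d+)\.pt$")
    candidates = []
    for path in Path(output_dir).glob("checkpoint-*.pt"):
        match = pattern.search(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare by the numeric iteration, not by filename string order.
    return max(candidates, key=lambda pair: pair[0])[1]


with tempfile.TemporaryDirectory() as tmp:
    # Empty stand-in files; real checkpoints would come from torch.save.
    for iter_num in (900, 2000, 10000):
        (Path(tmp) / f"checkpoint-{iter_num}.pt").touch()
    newest = latest_checkpoint(tmp)
    print(newest.name)
```

<p>The returned path could then be supplied as the value of <code>resume_from_checkpoint</code>.</p>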
<h2 id="7-training-launch-and-management">7. Training Launch and Management</h2>
<h3 id="71-multi-gpu-training-with-torchrun">7.1 Multi-GPU Training with torchrun</h3>
<p>For a single GPU, you can run the training script directly — a single Python process will use that device. To use multiple GPUs, launch training with <code>torchrun</code>, which spawns one worker process per GPU and lets the code initialize <code>DistributedDataParallel</code> (DDP). This enables larger effective batch sizes and faster wall‑clock training while keeping weights synchronized across devices; set <code>--nproc_per_node</code> to the number of GPUs you want to use (for example, <code>--nproc_per_node=2</code>).</p>
<p><code>torchrun</code> is PyTorch&rsquo;s recommended launcher for distributed training: it starts the worker processes and sets the environment variables (<code>RANK</code>, <code>LOCAL_RANK</code>, <code>WORLD_SIZE</code>) that the training script uses to initialize the distributed backend. With <code>torchrun --nproc_per_node=N</code> (where <code>N</code> is the number of GPUs to use; it can be fewer than the total available), each worker processes its own batches on its own GPU and DDP averages gradients across workers during each backward pass, which often gives near‑linear speedups on a small multi‑GPU node.</p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Single GPU (even with multiple available)</span>
</span></span><span style="display:flex;"><span>python train_model_slm.py
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Multi-GPU with near-linear speedup</span>
</span></span><span style="display:flex;"><span>torchrun --nproc_per_node<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">2</span> train_model_slm.py</span></span></code></pre></div>
<p>The training script (<code>train_model_slm.py</code>) sets up DDP (<code>DistributedDataParallel</code>) itself, which handles gradient synchronization and batch distribution across GPUs. <a href="#fig12" class="figure-ref">Figure 12</a> below shows an example in which both GPUs of a dual-GPU machine are in use.</p>
<figure>
<img src="images/train16-4.png" alt="Multiple GPU used for training Screenshot" title="Multiple GPU used for training" id="fig12">
<figcaption><strong>Figure 12:</strong> Multiple GPU used for training</figcaption>
</figure>
<p>Note that if you run <code>python train_model_slm.py</code> on a multi‑GPU machine, only one GPU is used; the others remain idle. To use more than one GPU, we must use <code>torchrun</code>.</p>
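<p>As a rough illustration of how a script can pick up the variables that <code>torchrun</code> sets, the sketch below reads <code>RANK</code>, <code>LOCAL_RANK</code>, and <code>WORLD_SIZE</code> and derives the values a DDP setup typically needs. The helper names <code>read_ddp_env</code> and <code>samples_per_step</code>, the returned dictionary keys, and the example batch settings are assumptions made for illustration; they are not taken from <code>train_model_slm.py</code>.</p>

```python
import os


def read_ddp_env(env=None):
    """Derive basic DDP settings from torchrun-style environment variables.

    Hypothetical helper for illustration; not the project's actual code.
    """
    env = os.environ if env is None else env
    # torchrun sets RANK for every worker; a plain `python` launch does not.
    ddp = "RANK" in env
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return {
        "ddp": ddp,
        "rank": rank,
        "world_size": world_size,
        # Each worker binds to the GPU that matches its local rank.
        "device": f"cuda:{local_rank}",
        # Only rank 0 should log and write checkpoints.
        "master_process": rank == 0,
    }


def samples_per_step(batch_size, grad_accum_steps, world_size):
    """Sequences contributing to one optimizer step across all workers."""
    return batch_size * grad_accum_steps * world_size


# Simulate the second worker of `torchrun --nproc_per_node=2` (assumed values).
worker = read_ddp_env({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"})
# Simulate a plain `python train_model_slm.py` launch with no torchrun variables.
single = read_ddp_env({})
print(worker["device"], worker["master_process"])
print(samples_per_step(batch_size=12, grad_accum_steps=4, world_size=worker["world_size"]))
```

<p>In a real script, values like these would typically feed <code>torch.distributed.init_process_group</code> and <code>torch.cuda.set_device</code> before the model is wrapped in <code>DistributedDataParallel</code>.</p>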
<h3 id="72-training-monitoring">7.2 Training Monitoring</h3>
<p>Training is monitored locally via structured console logs and remotely via WandB. The snippet in <a href="#listing16" class="listing-ref">Listing 16</a> records loss, learning rate, timing, and MFU at a configurable interval and, when enabled, streams the same metrics to WandB for side‑by‑side run comparison.</p>
<figure id="listing16"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Timing and logging</span>
</span></span><span style="display:flex;"><span>t1 <span style="color:#91d7e3;font-weight:bold">=</span> time<span style="color:#91d7e3;font-weight:bold">.</span>time()
</span></span><span style="display:flex;"><span>dt <span style="color:#91d7e3;font-weight:bold">=</span> t1 <span style="color:#91d7e3;font-weight:bold">-</span> t0
</span></span><span style="display:flex;"><span>t0 <span style="color:#91d7e3;font-weight:bold">=</span> t1
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">if</span> iter_num <span style="color:#91d7e3;font-weight:bold">%</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>log_interval <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#f5a97f">0</span> <span style="color:#91d7e3;font-weight:bold">and</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>master_process:
</span></span><span style="display:flex;"><span>    lossf <span style="color:#91d7e3;font-weight:bold">=</span> loss<span style="color:#91d7e3;font-weight:bold">.</span>item()
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> local_iter_num <span style="color:#91d7e3;font-weight:bold">&gt;=</span> <span style="color:#f5a97f">5</span>:
</span></span><span style="display:flex;"><span>        mfu <span style="color:#91d7e3;font-weight:bold">=</span> raw_model<span style="color:#91d7e3;font-weight:bold">.</span>estimate_mfu(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>batch_size, dt)
</span></span><span style="display:flex;"><span>        running_mfu <span style="color:#91d7e3;font-weight:bold">=</span> mfu <span style="color:#c6a0f6">if</span> running_mfu <span style="color:#91d7e3;font-weight:bold">==</span> <span style="color:#91d7e3;font-weight:bold">-</span><span style="color:#f5a97f">1.0</span> <span style="color:#c6a0f6">else</span> <span style="color:#f5a97f">0.9</span><span style="color:#91d7e3;font-weight:bold">*</span>running_mfu <span style="color:#91d7e3;font-weight:bold">+</span> <span style="color:#f5a97f">0.1</span><span style="color:#91d7e3;font-weight:bold">*</span>mfu
</span></span><span style="display:flex;"><span>    logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;iter </span><span style="color:#a6da95">{</span>iter_num<span style="color:#a6da95">}</span><span style="color:#a6da95">: loss </span><span style="color:#a6da95">{</span>lossf<span style="color:#a6da95">:</span><span style="color:#a6da95">.4f</span><span style="color:#a6da95">}</span><span style="color:#a6da95">, time </span><span style="color:#a6da95">{</span>dt<span style="color:#91d7e3;font-weight:bold">*</span><span style="color:#f5a97f">1000</span><span style="color:#a6da95">:</span><span style="color:#a6da95">.2f</span><span style="color:#a6da95">}</span><span style="color:#a6da95">ms, mfu </span><span style="color:#a6da95">{</span>running_mfu<span style="color:#91d7e3;font-weight:bold">*</span><span style="color:#f5a97f">100</span><span style="color:#a6da95">:</span><span style="color:#a6da95">.2f</span><span style="color:#a6da95">}</span><span style="color:#a6da95">%&#34;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Log to WandB - loss first for better mobile UI</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>use_wandb:
</span></span><span style="display:flex;"><span>        wandb<span style="color:#91d7e3;font-weight:bold">.</span>log({
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#34;train/loss&#34;</span>: lossf,
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#34;train/lr&#34;</span>: lr,
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#34;train/iter&#34;</span>: iter_num,
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#34;train/mfu&#34;</span>: running_mfu <span style="color:#91d7e3;font-weight:bold">*</span> <span style="color:#f5a97f">100</span> <span style="color:#c6a0f6">if</span> running_mfu <span style="color:#91d7e3;font-weight:bold">&gt;</span> <span style="color:#f5a97f">0</span> <span style="color:#c6a0f6">else</span> <span style="color:#f5a97f">0</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#a6da95">&#34;train/dt_ms&#34;</span>: dt <span style="color:#91d7e3;font-weight:bold">*</span> <span style="color:#f5a97f">1000</span>,
</span></span><span style="display:flex;"><span>        })</span></span></code></pre></div><figcaption>
        <strong>Listing 16: Training Monitoring and Logging</strong>
    </figcaption>
</figure>
<p>Together, console logs and WandB provide real‑time visibility and reproducible experiment tracking. <a href="#fig13" class="figure-ref">Figure 13</a> below shows an example of the console logs; see Section 5 for WandB setup and dashboards.</p>
<figure>
<img src="images/train16-7.png" alt="Console training logs showing iteration, loss, step time, and MFU with checkpoint saves" title="Console training logs: iteration, loss, step time, MFU, and checkpoint saves" id="fig13">
<figcaption><strong>Figure 13:</strong> Console training logs: iteration, loss, step time, MFU, and checkpoint saves</figcaption>
</figure>
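<p>The <code>estimate_mfu</code> call and the running average in Listing 16 can be sketched in plain Python. The FLOPs approximation used here (6N plus an attention term, per token) follows the estimate from the PaLM paper as popularized by nanoGPT, and the model dimensions and peak throughput below are illustrative assumptions; the project&rsquo;s own <code>estimate_mfu</code> may differ in its details.</p>

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, block_size,
                 sequences_per_iter, dt, peak_flops):
    """Model FLOPs utilization: achieved FLOP/s as a fraction of hardware peak."""
    # Approximate forward+backward FLOPs per token: 6N for the weights
    # plus an attention term that grows with the sequence length.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_iter = flops_per_token * block_size * sequences_per_iter
    flops_achieved = flops_per_iter / dt  # dt is seconds per iteration
    return flops_achieved / peak_flops


def update_running_mfu(running_mfu, mfu, decay=0.9):
    """Exponential moving average, seeded by the first measurement.

    Mirrors the -1.0 sentinel used in Listing 16.
    """
    if running_mfu == -1.0:
        return mfu
    return decay * running_mfu + (1 - decay) * mfu


# Illustrative numbers only: a GPT-2-small-sized model on a GPU with an
# assumed peak of 312 TFLOP/s (adjust for your hardware and precision).
running = -1.0
for step_time in (0.50, 0.40, 0.45):
    mfu = estimate_mfu(n_params=117e6, n_layer=12, n_head=12, head_dim=64,
                       block_size=1024, sequences_per_iter=12,
                       dt=step_time, peak_flops=312e12)
    running = update_running_mfu(running, mfu)
    print(f"step time {step_time * 1000:.0f} ms, mfu {mfu * 100:.2f}%, running {running * 100:.2f}%")
```

<p>The moving average smooths out per-iteration timing noise, which is why the logged MFU changes more slowly than the raw step time.</p>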
<h2 id="8-model-file-formats-and-conversion">8. Model File Formats and Conversion</h2>
<p>Training produces PyTorch checkpoint files (<code>.pt</code>) that contain model weights, optimizer state, and training metadata — everything needed to resume training. These checkpoints are covered in detail in <a
	
		href = "#6-checkpointing-and-model-persistence"
	

	

	>
	
	<span>
		Section 6
	</span>
</a>.</p>
<p>For sharing models and standard deployment workflows, we convert PyTorch checkpoints into the Hugging Face repository format. This conversion creates a portable, standardized model package that can be loaded with standard Hugging Face APIs.</p>
<h3 id="81-converting-pytorch-checkpoints-to-hugging-face-format">8.1 Converting PyTorch Checkpoints to Hugging Face Format</h3>
<p>The Hugging Face repository format is a standardized directory structure containing:</p>
<ul>
<li><strong><code>config.json</code></strong>: Architecture definition (layers, heads, embedding dimensions, vocabulary size, sequence length). Allows <code>AutoModelForCausalLM</code> to reconstruct the model architecture without custom code.</li>
<li><strong><code>model.safetensors</code></strong>: Model weights in SafeTensors format (memory-mapped, secure loading). Contains only model parameters, no optimizer state — suitable for inference workloads.</li>
<li><strong><code>generation_config.json</code></strong>: Default text generation parameters (max_new_tokens, temperature, top_p, repetition_penalty). Can be overridden at runtime.</li>
<li><strong>Tokenizer files</strong> (<code>tokenizer.json</code>, <code>vocab.json</code>, <code>merges.txt</code>, <code>special_tokens_map.json</code>, <code>tokenizer_config.json</code>): Serialized tokenizer with vocabulary, merge rules, normalization, and special tokens matching the training configuration.</li>
</ul>
<p>The conversion code in <a href="#listing17" class="listing-ref">Listing 17</a> loads a PyTorch checkpoint, extracts model weights and config, handles <code>torch.compile</code> naming prefixes if present, and saves the model and tokenizer in Hugging Face format.</p>
<figure id="listing17"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">torch</span>
</span></span><span style="display:flex;"><span><span style="color:#8bd5ca">from</span> <span style="color:#f5a97f">transformers</span> <span style="color:#8bd5ca">import</span> GPT2LMHeadModel
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">convert_pytorch_to_huggingface</span>(pytorch_checkpoint_path, output_dir, tokenizer):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Convert PyTorch checkpoint to Hugging Face format&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Load PyTorch checkpoint</span>
</span></span><span style="display:flex;"><span>    checkpoint <span style="color:#91d7e3;font-weight:bold">=</span> torch<span style="color:#91d7e3;font-weight:bold">.</span>load(pytorch_checkpoint_path, map_location<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;cpu&#39;</span>)
</span></span><span style="display:flex;"><span>    model_state <span style="color:#91d7e3;font-weight:bold">=</span> checkpoint[<span style="color:#a6da95">&#39;model&#39;</span>]
</span></span><span style="display:flex;"><span>    config <span style="color:#91d7e3;font-weight:bold">=</span> checkpoint[<span style="color:#a6da95">&#39;config&#39;</span>]
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Handle torch.compile prefixes</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">any</span>(key<span style="color:#91d7e3;font-weight:bold">.</span>startswith(<span style="color:#a6da95">&#39;_orig_mod.&#39;</span>) <span style="color:#c6a0f6">for</span> key <span style="color:#91d7e3;font-weight:bold">in</span> model_state<span style="color:#91d7e3;font-weight:bold">.</span>keys()):
</span></span><span style="display:flex;"><span>        clean_state <span style="color:#91d7e3;font-weight:bold">=</span> {}
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">for</span> key, value <span style="color:#91d7e3;font-weight:bold">in</span> model_state<span style="color:#91d7e3;font-weight:bold">.</span>items():
</span></span><span style="display:flex;"><span>            clean_state[key[<span style="color:#f5a97f">10</span>:] <span style="color:#c6a0f6">if</span> key<span style="color:#91d7e3;font-weight:bold">.</span>startswith(<span style="color:#a6da95">&#39;_orig_mod.&#39;</span>) <span style="color:#c6a0f6">else</span> key] <span style="color:#91d7e3;font-weight:bold">=</span> value
</span></span><span style="display:flex;"><span>        model_state <span style="color:#91d7e3;font-weight:bold">=</span> clean_state
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Convert to Hugging Face format</span>
</span></span><span style="display:flex;"><span>    hf_model <span style="color:#91d7e3;font-weight:bold">=</span> GPT2LMHeadModel(config)
</span></span><span style="display:flex;"><span>    hf_model<span style="color:#91d7e3;font-weight:bold">.</span>load_state_dict(model_state)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Save in Hugging Face format</span>
</span></span><span style="display:flex;"><span>    hf_model<span style="color:#91d7e3;font-weight:bold">.</span>save_pretrained(output_dir)
</span></span><span style="display:flex;"><span>    tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>save_pretrained(output_dir)</span></span></code></pre></div><figcaption>
        <strong>Listing 17: PyTorch → Hugging Face Conversion (essentials)</strong>
    </figcaption>
</figure>
<p>The conversion handles a few practical details. If the model was compiled with <code>torch.compile</code>, parameter names are prefixed with <code>_orig_mod.</code>, which the code strips to match Hugging Face module names. <code>GPT2LMHeadModel(config)</code> instantiates a GPT-2-style architecture that matches the checkpoint&rsquo;s layer structure, and <code>load_state_dict()</code> loads the weights with automatic shape validation. The <code>save_pretrained()</code> method writes all required files to disk.</p>
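<p>The prefix handling is easy to get subtly wrong, because only keys that actually carry the prefix should be shortened. Here is a minimal standalone sketch, using plain dictionaries in place of real tensors; the helper name <code>strip_compile_prefix</code> and the example key names are made up for illustration.</p>

```python
PREFIX = "_orig_mod."


def strip_compile_prefix(state_dict, prefix=PREFIX):
    """Return a copy of state_dict with the torch.compile prefix removed.

    Keys without the prefix are kept unchanged.
    """
    # str.removeprefix requires Python 3.9+.
    return {key.removeprefix(prefix): value for key, value in state_dict.items()}


# Plain values stand in for tensors; the key names are hypothetical.
compiled = {
    "_orig_mod.transformer.wte.weight": "tensor-a",
    "_orig_mod.lm_head.weight": "tensor-b",
    "already.clean.bias": "tensor-c",
}
cleaned = strip_compile_prefix(compiled)
print(sorted(cleaned))
```

<p>Using <code>removeprefix</code> avoids hard-coding the prefix length and leaves unprefixed keys untouched, so the same function is safe for both compiled and uncompiled checkpoints.</p>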
<p>File sizes: PyTorch checkpoints are ~450MB (SLM) and ~1.4GB (Regular model); the Hugging Face format reduces this slightly by excluding the optimizer state. The tokenizer adds ~15MB to the repository.</p>
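<p>After a conversion, a quick sanity check can confirm that the output directory holds the files listed in Section 8.1 and report its size. The sketch below is one possible way to do that with the standard library only; the helper names are hypothetical, and the expected-file set is an assumption, since the exact tokenizer files written depend on the tokenizer class.</p>

```python
import tempfile
from pathlib import Path

# File names taken from the list in Section 8.1; treat this set as an
# assumption, since tokenizer classes differ in which files they write.
EXPECTED_FILES = (
    "config.json",
    "model.safetensors",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
)


def missing_export_files(output_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]


def total_size_mb(output_dir):
    """Sum the sizes of all files directly inside output_dir, in megabytes."""
    return sum(p.stat().st_size for p in Path(output_dir).iterdir() if p.is_file()) / 1e6


# Demonstrate on a throwaway directory that holds only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    (Path(tmp) / "model.safetensors").write_bytes(b"\x00" * 1000)
    missing = missing_export_files(tmp)
    size_mb = total_size_mb(tmp)
print(missing, size_mb)
```

<p>An empty list from <code>missing_export_files</code> suggests the export is complete enough to try loading with <code>from_pretrained()</code>.</p>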
<h2 id="9-inference-options">9. Inference Options</h2>
<p>Inference can run directly from PyTorch checkpoints or from Hugging Face models. PyTorch checkpoints are convenient during development since you can test any training checkpoint without conversion. Hugging Face models use standard <code>from_pretrained()</code> APIs and are better suited for sharing and deployment workflows.</p>
<figure id="listing18"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Option 1 (shell command): PyTorch checkpoint inference (direct from training)</span>
</span></span><span style="display:flex;"><span>python <span style="color:#f5a97f">06</span>_inference<span style="color:#91d7e3;font-weight:bold">/</span>inference_pytorch<span style="color:#91d7e3;font-weight:bold">.</span>py \
</span></span><span style="display:flex;"><span>  <span style="color:#91d7e3;font-weight:bold">--</span>checkpoint <span style="color:#f5a97f">09</span>_models<span style="color:#91d7e3;font-weight:bold">/</span>checkpoints<span style="color:#91d7e3;font-weight:bold">/</span>slm<span style="color:#91d7e3;font-weight:bold">/</span>checkpoint<span style="color:#91d7e3;font-weight:bold">-</span><span style="color:#f5a97f">60001.</span>pt \
</span></span><span style="display:flex;"><span>  <span style="color:#91d7e3;font-weight:bold">--</span>prompt <span style="color:#a6da95">&#34;In the year 1834, I walked through the streets of London and witnessed&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Option 2 (Python): Hugging Face model inference (published models)</span>
</span></span><span style="display:flex;"><span><span style="color:#8bd5ca">from</span> <span style="color:#f5a97f">transformers</span> <span style="color:#8bd5ca">import</span> AutoTokenizer, AutoModelForCausalLM
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>tokenizer <span style="color:#91d7e3;font-weight:bold">=</span> AutoTokenizer<span style="color:#91d7e3;font-weight:bold">.</span>from_pretrained(<span style="color:#a6da95">&#34;bahree/london-historical-slm&#34;</span>)
</span></span><span style="display:flex;"><span>model <span style="color:#91d7e3;font-weight:bold">=</span> AutoModelForCausalLM<span style="color:#91d7e3;font-weight:bold">.</span>from_pretrained(<span style="color:#a6da95">&#34;bahree/london-historical-slm&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>inputs <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer(<span style="color:#a6da95">&#34;In the year 1834, I walked through the streets...&#34;</span>, return_tensors<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;pt&#34;</span>)
</span></span><span style="display:flex;"><span>outputs <span style="color:#91d7e3;font-weight:bold">=</span> model<span style="color:#91d7e3;font-weight:bold">.</span>generate(inputs[<span style="color:#a6da95">&#39;input_ids&#39;</span>], max_new_tokens<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">50</span>, do_sample<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>)
</span></span><span style="display:flex;"><span>result <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>decode(outputs[<span style="color:#f5a97f">0</span>], skip_special_tokens<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>)</span></span></code></pre></div><figcaption>
        <strong>Listing 18: Inference Options</strong>
    </figcaption>
</figure>
<p>Both methods load in seconds and generate ~50–100 tokens/sec on typical consumer GPUs (2–4GB VRAM for SLM, 6–8GB for the Regular model). Use PyTorch checkpoints for development and training comparisons; use Hugging Face models for production deployment and sharing. For interactive testing with published models, see <a
	
		href = "https://blog.desigeek.com/post/2025/09/building-llm-from-scratch-part1/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Part 1
	</span>
</a>.</p>
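<p>Throughput figures like the ones above vary with hardware, precision, and sampling settings, so it is worth measuring on your own setup. The following is a minimal timing sketch; <code>tokens_per_second</code> and <code>fake_generate</code> are hypothetical helpers, and the stand-in generator simply sleeps so the example runs without a model or GPU.</p>

```python
import time


def tokens_per_second(generate_fn, prompt, max_new_tokens):
    """Time one generation call and return (new_tokens, tokens_per_second).

    generate_fn is any callable returning the list of newly generated tokens.
    """
    start = time.perf_counter()
    new_tokens = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(new_tokens), len(new_tokens) / elapsed


def fake_generate(prompt, max_new_tokens):
    """Stand-in for a real model call: sleeps briefly per token."""
    tokens = []
    for i in range(max_new_tokens):
        time.sleep(0.001)
        tokens.append(i)
    return tokens


count, rate = tokens_per_second(fake_generate, "In the year 1834", 20)
print(f"{count} tokens at {rate:.1f} tokens/sec")
```

<p>With a real model, <code>generate_fn</code> would wrap <code>model.generate</code> and count only the tokens beyond the prompt length. On a GPU, it is usually wise to call <code>torch.cuda.synchronize()</code> before reading the clock so that queued work is included in the measurement.</p>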
<h2 id="10-summary">10. Summary</h2>
<p>We built a training‑ready GPT pipeline for historical text, end‑to‑end: a clear decoder‑only architecture, pragmatic GPU/precision tuning, DDP for scale, resilient checkpointing/resume, WandB tracking, and clean hand‑off of artifacts (PyTorch checkpoints → Hugging Face export).</p>
<p>Outcome: two working models on the Part 2 corpus - 117M (SLM) and 354M (Regular) - ready for inference now and for evaluation/deployment in Part 4.</p>
<blockquote>
<p><strong>🔗 GitHub Repository</strong>: <a
	
		href = "https://github.com/bahree/helloLondon"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		github.com/bahree/helloLondon
	</span>
</a> - Complete training infrastructure (<code>04_training/</code>), model architecture (<code>config.py</code>), and GPU configuration (<code>08_documentation/GPU_TUNING.md</code>)</p></blockquote>
<blockquote>
<p><strong>🧱 Series Posts</strong>: <a
	
		href = "/post/2025/09/building-llm-from-scratch-part1/"
	

	

	>
	
	<span>
		Part 1 – Using the Published Historical Models
	</span>
</a> | <a
	
		href = "/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	>
	
	<span>
		Part 2 – Data Collection &amp; Custom Tokenizer
	</span>
</a> | Part 3 (this post) | <a
	
		href = "/post/2026/01/building-llm-from-scratch-part4-evaluation-deployment/"
	

	

	>
	
	<span>
		Part 4 – Evaluation &amp; Deployment
	</span>
</a></p></blockquote>
<blockquote>
<p><strong>🤗 Published Models</strong>: <a
	
		href = "https://huggingface.co/bahree/london-historical-slm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		SLM Model
	</span>
</a> | <a
	
		href = "https://huggingface.co/bahree/london-historical-llm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Regular Model
	</span>
</a> - Ready-to-use historical language models on HuggingFace</p></blockquote>
<blockquote>
<p><strong>📚 Book Reference</strong>: <a
	
		href = "https://a.co/d/ffzkJ7T"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Generative AI in Action
	</span>
</a> - For deeper understanding of core LLM concepts.</p></blockquote>
<hr>
<p><strong>Ready for Part 4?</strong> Part 4 covers model evaluation, testing, and deployment strategies that turn your trained models into working systems ready for real-world use.</p>
]]></content:encoded>
    </item>
    <item>
      <title>🏛️Building LLMs from Scratch - Part 2: Data Collection &amp; Custom Tokenizers</title>
      <link>/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/</link>
      <pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate>
      <guid>/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/</guid>
      <description>Deep dive into data collection, cleaning pipelines, and custom tokenizer development for authentic historical text processing. Complete 4-part series with working code.</description>
      <content:encoded><![CDATA[<p><strong>TL;DR</strong></p>
<p>In this second part of our 4-part series on building language models from scratch, I explore the two foundational areas of LLM development: data collection and custom tokenizer creation. <a
	
		href = "https://blog.desigeek.com/post/2025/09/building-llm-from-scratch-part1/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Part 1 - Building LLM from Scratch
	</span>
</a> covered using the published model; here, we build the complete pipeline from raw historical documents to a custom tokenizer that understands archaic English, London geography, and period-specific terminology.</p>
<p>The challenge with historical LLMs isn&rsquo;t just having enough data; it&rsquo;s having the <em>right</em> data, processed to preserve linguistic nuances across different historical periods. This post demonstrates how to transform more than 218 historical sources into a corpus of over 500 million characters, and how to train a specialized tokenizer on that corpus for authentic historical text generation.</p>
<blockquote>
<p><strong>⚠️ Educational Purpose</strong>: This is a learning project designed to teach LLM development concepts. For production-scale LLMs, you&rsquo;ll need significantly larger datasets, more sophisticated infrastructure, and additional considerations that are not covered in this post.</p></blockquote>
<h2 id="1-the-historical-language-modeling-challenge">1. The Historical Language Modeling Challenge</h2>
<p>Building a language model for historical text presents unique challenges. Historical English from 1500 to 1850 contains linguistic patterns, vocabulary, and cultural references that modern tokenizers have never encountered. Standard tokenizers like <a
	
		href = "https://github.com/openai/tiktoken"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		TikToken
	</span>
</a> fragment archaic words like &ldquo;quoth&rdquo; and &ldquo;hast&rdquo; into multiple subword tokens, destroying semantic meaning crucial for historical text generation.</p>
<p>A simple phrase like <strong><code>Quoth the alderman, 'Tis a fair day at Newgate</code></strong> becomes dozens of meaningless fragments, losing both historical context and linguistic coherence. This fragmentation is why we built a custom tokenizer trained specifically on historical English patterns, ensuring the model can generate coherent, historically accurate text.</p>
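<p>To make the fragmentation effect concrete, here is a toy sketch that segments the same phrase with two small vocabularies. Both vocabularies are hypothetical and hand-picked for illustration, and the greedy longest-match routine is a simplification: real BPE tokenizers apply learned merge rules, so this shows the effect rather than the actual algorithm used by tiktoken or by this project.</p>

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation with single-character fallback."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical toy vocabularies, not real tiktoken or project data.
modern_vocab = {"the", "alder", "man", "qu", "oth", "ha", "st", " "}
historical_vocab = modern_vocab | {"quoth", "hast", "alderman"}

phrase = "quoth the alderman"
modern_tokens = greedy_tokenize(phrase, modern_vocab)
historical_tokens = greedy_tokenize(phrase, historical_vocab)
print(len(modern_tokens), modern_tokens)
print(len(historical_tokens), historical_tokens)
```

<p>The vocabulary that contains the archaic words as whole entries produces fewer, more meaningful tokens for the same phrase, which is the motivation for training a tokenizer on the historical corpus itself.</p>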
<p>As a reminder, both the SLM (117M parameters) and Regular Model (354M parameters) utilize the same training code and infrastructure, including GPU optimization, checkpointing, and WandB integration. The only difference lies in the model architecture parameters, which are specified in <code>config.py</code>.</p>
<blockquote>
<p><strong>🔗 GitHub Repository</strong>: <a
	
		href = "https://github.com/bahree/helloLondon"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		github.com/bahree/helloLondon
	</span>
</a> - Complete source code for data collection (<code>02_data_collection/</code>) and tokenizer training (<code>03_tokenizer/</code>). The code snippets in this post illustrate the key concepts; see the repository for the full implementation.</p></blockquote>
<blockquote>
<p><strong>🧱 Series Posts</strong>: <a
	
		href = "/post/2025/09/building-llm-from-scratch-part1/"
	

	

	>
	
	<span>
		Part 1 – Using the Published Historical Models
	</span>
</a> | Part 2 (this post) | <a
	
		href = "/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/"
	

	

	>
	
	<span>
		Part 3 – Training Architecture &amp; GPU Optimization
	</span>
</a> | <a
	
		href = "/post/2026/01/building-llm-from-scratch-part4-evaluation-deployment/"
	

	

	>
	
	<span>
		Part 4 – Evaluation &amp; Deployment
	</span>
</a></p></blockquote>
<p><strong>What will you learn?</strong></p>
<p>This project provides hands-on experience with real-world LLM development challenges, including data collection from over 218 historical sources, cleaning OCR errors and encoding issues, and developing custom tokenizers for historical text. Unlike theoretical tutorials, you receive complete, runnable code that demonstrates actual trade-offs and decisions—such as choosing BPE over WordPiece or handling different file formats—that you&rsquo;d encounter in any serious LLM project.</p>
<p>While operating at a learning scale, the principles taught here directly apply to larger systems. Data collection patterns, cleaning strategies, and tokenizer design principles scale from our 500M character corpus to the 500B+ character datasets used in production models.</p>
<h2 id="11-high-level-process-overview">1.1 High-Level Process Overview</h2>
<p>The complete pipeline transforms raw historical documents into a working language model through five key stages:</p>
<ol>
<li><strong>Data Collection</strong>: 218+ historical sources (1500-1850), including literature, newspapers, court records, and personal diaries</li>
<li><strong>Cleaning Pipeline</strong>: Handles multiple file formats (PDF, HTML, XML, TXT) while removing OCR artifacts and preserving authentic historical language</li>
<li><strong>Quality Validation</strong>: Removes duplicates, filters non-English content, and ensures only meaningful historical text reaches the final corpus</li>
<li><strong>Custom Tokenizer Training</strong>: BPE-based tokenizer with ~150 special tokens capturing archaic pronouns, historical landmarks, and period-specific terminology</li>
<li><strong>Model Training</strong>: Two language models (SLM 117M and Regular 354M parameters) trained on the same historical corpus</li>
</ol>
<p>The result is a system capable of generating authentic historical text that captures the linguistic patterns and cultural context of 1500-1850 English. <a href="#fig1" class="figure-ref">Figure 1</a> illustrates this complete pipeline:</p>
<figure class="align-center " id="fig1">
    <pre class="mermaid">graph TD
    A[📚 218+ Historical Sources&lt;br/&gt;1500-1850] --&gt; B[🔍 Data Collection&lt;br/&gt;Download and Filter]
    B --&gt; C[🧹 5-Phase Cleaning Pipeline&lt;br/&gt;Format-Specific Processing]
    C --&gt; D[📊 Quality Validation&lt;br/&gt;Duplicate and Language Detection]
    D --&gt; E[📝 500M+ Character Corpus&lt;br/&gt;Clean Historical Text]
    E --&gt; F[🔤 Custom Tokenizer Training&lt;br/&gt;BPE with 150+ Special Tokens]
    F --&gt; G[🤖 Language Model Training&lt;br/&gt;SLM 117M + Regular 354M]
    
    style A fill:#e1f5fe
    style E fill:#f3e5f5
    style F fill:#fff3e0</pre>
    <figcaption>Figure 1: Complete Historical Text Processing Pipeline</figcaption>
</figure>
<h2 id="2-data-collection-the-foundation-of-historical-language-modeling">2. Data Collection: The Foundation of Historical Language Modeling</h2>
<p>Let us dig deeper into steps 1-3 (data collection, cleaning, and quality validation), which together produce the corpus. The data collection system processes over 218 sources spanning 1500-1850 to create a corpus of over 500 million characters of authentic historical English text. But collecting historical data isn&rsquo;t just about downloading files; it&rsquo;s about handling the sheer variety of formats and quality levels that historical documents present.</p>
<p>Historical documents come in all shapes and sizes - scanned books with OCR errors, HTML pages with messy markup, XML archives with rich metadata, and plain text files with inconsistent encoding. This is especially true for the earlier periods, when the quality of the documents can vary significantly, and most modern techniques for processing them struggle to cope. This data diversity requires a cleaning pipeline that transforms raw historical documents into training data while preserving the authentic language patterns of 1500-1850 English.</p>
<h3 id="21-system-architecture-processing-218-historical-sources">2.1 System Architecture: Processing 218+ Historical Sources</h3>
<p>The data collection system employs a modular architecture, with <strong><code>historical_data_collector.py</code></strong> serving as the primary orchestration engine, coordinating with a <strong><code>data_sources.json</code></strong> configuration file that contains metadata for over 218 historical sources. This enables easy management and updates without code changes.</p>
<p>Supporting scripts include <strong><code>add_data_source.py</code></strong> for interactive source addition with built-in validation, and <strong><code>generate_report.py</code></strong> for comprehensive reporting and analysis across multiple output formats.</p>
<p>The <strong><code>data_sources.json</code></strong> file contains metadata for each source, including time periods, formats, licensing, and processing priorities. Each entry includes:</p>
<ul>
<li><strong><code>time_period</code></strong> (e.g., [1690, 1800] for London Lives)</li>
<li><strong><code>format</code></strong> (XML, HTML, PDF)</li>
<li><strong><code>priority</code></strong> (high/medium/low)</li>
<li><strong><code>search_terms</code></strong> for collection guidance</li>
</ul>
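<p>As a rough illustration of how such an entry might be consumed, the sketch below parses a hypothetical configuration and selects sources by priority and period. Only the four field names come from the list above; the top-level <code>sources</code> key, the <code>name</code> field, the sample values, and the <code>select_sources</code> helper are assumptions for illustration, not the repository&rsquo;s actual schema.</p>

```python
import json

# Hypothetical excerpt of data_sources.json. Only time_period, format,
# priority, and search_terms are named in the post; everything else is assumed.
SAMPLE_CONFIG = """
{
  "sources": [
    {
      "name": "London Lives",
      "time_period": [1690, 1800],
      "format": "XML",
      "priority": "high",
      "search_terms": ["london", "parish", "workhouse"]
    },
    {
      "name": "Example pamphlet collection",
      "time_period": [1600, 1650],
      "format": "PDF",
      "priority": "low",
      "search_terms": ["pamphlet"]
    }
  ]
}
"""


def select_sources(config_text, priority, start_year=1500, end_year=1850):
    """Return sources with the given priority whose period overlaps the range."""
    sources = json.loads(config_text)["sources"]
    selected = []
    for source in sources:
        first, last = source["time_period"]
        # Two closed intervals overlap when each one starts before the other ends.
        overlaps = end_year >= first and last >= start_year
        if source["priority"] == priority and overlaps:
            selected.append(source)
    return selected


high_priority = select_sources(SAMPLE_CONFIG, "high")
print([source["name"] for source in high_priority])
```

<p>Keeping this metadata in JSON rather than in code is what allows sources to be added or re-prioritized without touching the collector itself.</p>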
<p>Our data sources span multiple categories, each contributing unique perspectives to the historical corpus.</p>
<ul>
<li>
<p><strong>Project Gutenberg:</strong> This provides foundational literature with 8+ carefully selected texts. It uses relaxed quality criteria that accept texts in which as few as 40% of the words are meaningful, to capture the full spectrum of historical writing styles.</p>
</li>
<li>
<p><strong>Historical Archives:</strong> Archives like <em>London Lives</em> (240,000 pages of personal records) and <em>Old Bailey</em> (197,000+ trial transcripts) offer rich historical content and were initially enabled in our data collection.</p>
<ul>
<li>Note: Earlier, I used aggressive cleaning (enabled with the <code>aggressive_cleaning</code> flag, which is designed to remove structured legal data and semantic markup) and found that it was too aggressive and hurt generation quality. After initial training runs produced repetitive and incoherent text, I turned off these sources. Re-enabling them is an exercise you might try.</li>
</ul>
</li>
<li>
<p><strong>Archive.org:</strong> Archive.org provides API access that can be used to filter files, which makes collection relatively straightforward.</p>
</li>
<li>
<p><strong>The National Archives (TNA):</strong> TNA records contribute government correspondence and official documents that provide the institutional context for historical events.</p>
</li>
<li>
<p><strong>British History Online:</strong> Finally, this supplements our collection with historical surveys and period documents that offer scholarly perspectives on the time periods we&rsquo;re modeling.</p>
</li>
</ul>
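<p>The relaxed Project Gutenberg criterion mentioned above (accepting texts in which as few as 40% of the words are meaningful) could be checked along the following lines. This is only a sketch: the post does not define what counts as a meaningful word, so the rule used here (alphabetic tokens of two or more letters, plus a few one-letter words) and the names <code>meaningful_word_ratio</code> and <code>passes_quality_check</code> are assumptions.</p>

```python
import re

# Threshold taken from the post: accept texts with at least 40% meaningful words.
MIN_MEANINGFUL_RATIO = 0.40

# Assumed definition of a word: letters, optionally joined by ' or -.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def meaningful_word_ratio(text):
    """Share of whitespace-separated tokens that look like real words.

    Assumption: a token is meaningful if, after stripping surrounding
    punctuation, it is one of the one-letter words a/A/I/O or an
    alphabetic word of at least two characters.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    meaningful = 0
    for token in tokens:
        core = token.strip(".,;:!?\"'()[]")
        if core in ("a", "A", "I", "O"):
            meaningful += 1
        elif len(core) >= 2 and WORD_PATTERN.fullmatch(core):
            meaningful += 1
    return meaningful / len(tokens)


def passes_quality_check(text, threshold=MIN_MEANINGFUL_RATIO):
    """True if the text meets the (assumed) meaningful-word threshold."""
    return meaningful_word_ratio(text) >= threshold
```

<p>A low threshold like this lets through texts with irregular spelling or heavy punctuation, at the cost of admitting noisier material.</p>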
<p>However, each source type presents unique technical challenges that require specialized processing approaches. One example is Project Gutenberg, whose files contain standardized headers and footers that must be removed. (As a side note, I really appreciate the effort that has gone into keeping this formatting consistent, which makes removal relatively straightforward.)</p>
<p>PDF files, on the other hand, often suffer from OCR errors, especially in older documents. The OCR output corrupts the historical language, so text correction is needed to restore the original spelling from scanned documents. The figure below shows one example of how older documents look: &ldquo;The abridgment of the charter of the city of London&rdquo; from 1680.</p>
<figure>
<img src="images/charter-city-of-london.png" alt="The abridgment of the charter of the city of London" title="The abridgment of the charter of the city of London">
<figcaption><strong>Figure 2:</strong> The abridgment of the charter of the city of London (1680) - showing faded text and ink blots typical of historical documents</figcaption>
</figure>
<p>As you can see, the text is faded, has ink blots, and the font style is very different from modern text. OCR software often misinterprets characters in such documents, resulting in numerous errors, as illustrated in the image below. These OCR artifacts can severely degrade the quality of our training data if not properly addressed.</p>
<figure>
<img src="images/charter-city-of-london-ocr.png" alt="OCR - Charter of the city of London" title="OCR - Charter of the city of London">
<figcaption><strong>Figure 3:</strong> OCR errors in the charter document - showing how optical character recognition struggles with historical fonts and document quality</figcaption>
</figure>
<p><strong>HTML files</strong> from sources like Archive.org contain navigation elements, advertisements, and modern web markup that contaminate the historical corpus, demanding careful content extraction that preserves only the meaningful historical text.</p>
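<p>A minimal sketch of this kind of content extraction, using only Python&rsquo;s standard <code>html.parser</code> module, is shown below. The set of skipped tags and the names <code>VisibleTextExtractor</code> and <code>extract_html_text</code> are assumptions for illustration; the repository may use a different library and different rules.</p>

```python
from html.parser import HTMLParser

# Assumed set of elements whose text is treated as page chrome, not content.
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not nested inside any skipped element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped elements we are currently inside
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_html_text(html):
    """Return the visible text of an HTML document, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

<p>Given a page with a navigation bar, a paragraph, and a script, this returns only the paragraph text. Pages with unclosed or unusual markup would need a more forgiving parser.</p>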
<p><strong>XML archives</strong> like London Lives and Old Bailey require specialized parsing to extract meaningful text while preserving semantic markup that provides context about speakers, dates, and document structure - a delicate balance between removing technical artifacts and maintaining historical authenticity.</p>
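<p>To make that balance concrete, here is a small sketch that flattens an XML record to plain text while keeping selected attributes as inline notes. The attribute names (<code>speaker</code>, <code>date</code>) and the function name are hypothetical; London Lives and Old Bailey use their own schemas, which this example does not attempt to reproduce.</p>

```python
import xml.etree.ElementTree as ET


def extract_xml_text(xml_text, keep_attributes=("speaker", "date")):
    """Flatten an XML record to text, keeping selected attributes as notes.

    The attribute names are hypothetical. Tail text (text that follows a
    child element's closing tag) is ignored to keep the sketch short.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        text = (element.text or "").strip()
        if not text:
            continue
        # Preserve only the attributes we consider semantically useful.
        notes = [
            f"{name}={element.attrib[name]}"
            for name in keep_attributes
            if name in element.attrib
        ]
        prefix = f"[{', '.join(notes)}] " if notes else ""
        lines.append(prefix + text)
    return "\n".join(lines)
```

<p>Passing an empty tuple for <code>keep_attributes</code> drops all markup context, which is closer to what an aggressive cleaning mode would do.</p>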
<p><strong>Government records</strong> from TNA often contain bureaucratic formatting, form fields, and institutional language that need careful filtering to extract the human stories and historical narratives.</p>
<p><strong>British History Online</strong> documents present challenges with academic formatting, footnotes, and scholarly apparatus that must be processed to maintain readability while preserving the scholarly context that makes them valuable for historical language modeling.</p>
<h3 id="22-cleaning-pipeline">2.2 Cleaning Pipeline</h3>
<p>I implement a 5-stage cleaning pipeline that transforms the raw historical documents into training-ready text. Each stage addresses specific challenges that would otherwise contaminate our language model training.</p>
<h4 id="221-stage-1-file-discovery--initial-filtering">2.2.1 Stage 1: File Discovery &amp; Initial Filtering</h4>
<p>Historical archives often contain files in various formats, some of which lack proper file extensions or use non-standard naming conventions, since many sources follow their own templates and standards. Many files also contain non-English content that would contaminate our English historical corpus. To resolve this, we first implement simple file type detection and filename cleanup, as shown in <a href="#listing1" class="listing-ref">Listing 1</a>. The code is short and largely self-explanatory.</p>
<figure id="listing1"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">detect_file_type</span>(file_path: <span style="color:#91d7e3">str</span>) <span style="color:#91d7e3;font-weight:bold">-&gt;</span> <span style="color:#91d7e3">str</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Detect file type based on extension and content analysis&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Extension-based detection</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> file_path<span style="color:#91d7e3;font-weight:bold">.</span>endswith((<span style="color:#a6da95">&#39;.txt&#39;</span>, <span style="color:#a6da95">&#39;.txt.utf-8&#39;</span>, <span style="color:#a6da95">&#39;_txt.utf-8&#39;</span>)):
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> <span style="color:#a6da95">&#39;text&#39;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">elif</span> file_path<span style="color:#91d7e3;font-weight:bold">.</span>endswith((<span style="color:#a6da95">&#39;.pdf&#39;</span>,)):
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> <span style="color:#a6da95">&#39;pdf&#39;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">elif</span> file_path<span style="color:#91d7e3;font-weight:bold">.</span>endswith((<span style="color:#a6da95">&#39;.html&#39;</span>, <span style="color:#a6da95">&#39;.htm&#39;</span>)):
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> <span style="color:#a6da95">&#39;html&#39;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">elif</span> file_path<span style="color:#91d7e3;font-weight:bold">.</span>endswith((<span style="color:#a6da95">&#39;.xml&#39;</span>,)):
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">return</span> <span style="color:#a6da95">&#39;xml&#39;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Content-based detection for files without extensions</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">with</span> <span style="color:#91d7e3">open</span>(file_path, <span style="color:#a6da95">&#39;rb&#39;</span>) <span style="color:#c6a0f6">as</span> f:
</span></span><span style="display:flex;"><span>        content <span style="color:#91d7e3;font-weight:bold">=</span> f<span style="color:#91d7e3;font-weight:bold">.</span>read(<span style="color:#f5a97f">1024</span>)  <span style="color:#6e738d;font-style:italic"># Read first 1KB</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> <span style="color:#ed8796">b</span><span style="color:#a6da95">&#39;&lt;html&#39;</span> <span style="color:#91d7e3;font-weight:bold">in</span> content<span style="color:#91d7e3;font-weight:bold">.</span>lower() <span style="color:#91d7e3;font-weight:bold">or</span> <span style="color:#ed8796">b</span><span style="color:#a6da95">&#39;&lt;!doctype&#39;</span> <span style="color:#91d7e3;font-weight:bold">in</span> content<span style="color:#91d7e3;font-weight:bold">.</span>lower():
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">return</span> <span style="color:#a6da95">&#39;html&#39;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">elif</span> <span style="color:#ed8796">b</span><span style="color:#a6da95">&#39;&lt;?xml&#39;</span> <span style="color:#91d7e3;font-weight:bold">in</span> content<span style="color:#91d7e3;font-weight:bold">.</span>lower():
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">return</span> <span style="color:#a6da95">&#39;xml&#39;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">elif</span> content<span style="color:#91d7e3;font-weight:bold">.</span>isascii() <span style="color:#91d7e3;font-weight:bold">and</span> <span style="color:#ed8796">b</span><span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\x00</span><span style="color:#a6da95">&#39;</span> <span style="color:#91d7e3;font-weight:bold">not</span> <span style="color:#91d7e3;font-weight:bold">in</span> content:
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">return</span> <span style="color:#a6da95">&#39;text&#39;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">return</span> <span style="color:#a6da95">&#39;binary&#39;</span></span></span></code></pre></div><figcaption>
        <strong>Listing 1: File Type Detection Function</strong>
    </figcaption>
</figure>
<p>The flow below illustrates how the detection works when we run this locally. It could be made more robust for filenames with non-English characters, but for now we reject these.</p>
<p><strong>File Type Detection Flow:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>📁 Raw Files (218+ sources)
</span></span><span style="display:flex;"><span>    ↓
</span></span><span style="display:flex;"><span>🔍 File Type Detection
</span></span><span style="display:flex;"><span>    ├── .txt, .txt.utf-8, _txt.utf-8 → Text Processing
</span></span><span style="display:flex;"><span>    ├── .pdf → PDF Processing  
</span></span><span style="display:flex;"><span>    ├── .html, .htm → HTML Processing
</span></span><span style="display:flex;"><span>    ├── .xml → XML Processing (Old Bailey, London Lives)
</span></span><span style="display:flex;"><span>    └── No Extension → Content Detection
</span></span><span style="display:flex;"><span>        ├── HTML-like content → HTML Processing
</span></span><span style="display:flex;"><span>        ├── XML-like content → XML Processing
</span></span><span style="display:flex;"><span>        ├── Text-like content → Text Processing
</span></span><span style="display:flex;"><span>        └── Binary/Unknown → REJECTED
</span></span><span style="display:flex;"><span>    ↓
</span></span><span style="display:flex;"><span>🚫 Filename Language Check
</span></span><span style="display:flex;"><span>    ├── Non-English characters → REJECTED (logged)
</span></span><span style="display:flex;"><span>    └── English/Latin → Continue</span></span></code></pre></div>
<p>Historical archives often lack standardized file extensions and contain content in languages other than English. Our two-stage detection ensures we capture valuable historical documents while filtering out irrelevant files, preventing both data loss and processing waste.</p>
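<p>The filename language check in the flow above is not shown in Listing 1. One way it might work, sketched here as an assumption rather than the repository&rsquo;s actual logic, is to reject any filename containing letters from a non-Latin script and log the rejection. The function names and the logger name are made up for this example.</p>

```python
import logging
import unicodedata
from pathlib import Path

logger = logging.getLogger("file_filter")


def has_only_latin_letters(name):
    """True if every alphabetic character in the name is a Latin-script letter."""
    for char in name:
        # Unicode character names for Latin letters contain the word LATIN,
        # including accented forms such as LATIN SMALL LETTER E WITH ACUTE.
        if char.isalpha() and "LATIN" not in unicodedata.name(char, ""):
            return False
    return True


def filter_by_filename(paths):
    """Keep paths whose file name passes the check; log the rejected ones."""
    kept = []
    for path in paths:
        if has_only_latin_letters(Path(path).name):
            kept.append(path)
        else:
            logger.info("Rejected non-Latin filename: %s", path)
    return kept
```

<p>This is a script-level check only. It says nothing about the language of the file contents, which is handled later during quality validation.</p>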
<h4 id="222-stage-2-format-specific-content-extraction">2.2.2 Stage 2: Format-Specific Content Extraction</h4>
<p>Each file format requires specialized processing due to its unique contamination sources, including Project Gutenberg headers, PDF OCR errors, HTML navigation elements, and XML structural markup. Our format-specific extraction functions clean these artifacts while preserving authentic historical content.</p>
<h5 id="text-files-txt-txtutf-8"><strong>Text Files (.txt, .txt.utf-8)</strong></h5>
<p>Project Gutenberg texts contain standardized headers and footers that would confuse our language model. The cleaning process removes these while preserving the actual historical content. The code snippet in <a href="#listing2" class="listing-ref">Listing 2</a> demonstrates this approach and is quite straightforward. Of course, this can be made more robust, but this works well for our selected texts.</p>
<figure id="listing2"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">clean_gutenberg_text</span>(text: <span style="color:#91d7e3">str</span>) <span style="color:#91d7e3;font-weight:bold">-&gt;</span> <span style="color:#91d7e3">str</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Clean Project Gutenberg text by removing headers/footers and metadata&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    lines <span style="color:#91d7e3;font-weight:bold">=</span> text<span style="color:#91d7e3;font-weight:bold">.</span>split(<span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n</span><span style="color:#a6da95">&#39;</span>)
</span></span><span style="display:flex;"><span>    cleaned_lines <span style="color:#91d7e3;font-weight:bold">=</span> []
</span></span><span style="display:flex;"><span>    in_content <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">False</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> line <span style="color:#91d7e3;font-weight:bold">in</span> lines:
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Skip Gutenberg headers (before &#34;*** START OF&#34;)</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> <span style="color:#a6da95">&#34;*** START OF&#34;</span> <span style="color:#91d7e3;font-weight:bold">in</span> line:
</span></span><span style="display:flex;"><span>            in_content <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">True</span>
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">continue</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Skip Gutenberg footers (after &#34;*** END OF&#34;)</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> <span style="color:#a6da95">&#34;*** END OF&#34;</span> <span style="color:#91d7e3;font-weight:bold">in</span> line:
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">break</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Skip metadata lines</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> line<span style="color:#91d7e3;font-weight:bold">.</span>startswith((<span style="color:#a6da95">&#39;Title:&#39;</span>, <span style="color:#a6da95">&#39;Author:&#39;</span>, <span style="color:#a6da95">&#39;Release Date:&#39;</span>, <span style="color:#a6da95">&#39;Language:&#39;</span>)):
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">continue</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Skip empty lines at start</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3;font-weight:bold">not</span> in_content <span style="color:#91d7e3;font-weight:bold">and</span> <span style="color:#91d7e3;font-weight:bold">not</span> line<span style="color:#91d7e3;font-weight:bold">.</span>strip():
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">continue</span>
</span></span><span style="display:flex;"><span>            
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> in_content:
</span></span><span style="display:flex;"><span>            cleaned_lines<span style="color:#91d7e3;font-weight:bold">.</span>append(line)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> <span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n</span><span style="color:#a6da95">&#39;</span><span style="color:#91d7e3;font-weight:bold">.</span>join(cleaned_lines)<span style="color:#91d7e3;font-weight:bold">.</span>strip()</span></span></code></pre></div><figcaption>
        <strong>Listing 2: Project Gutenberg Text Cleaning Function</strong>
    </figcaption>
</figure>
<p><strong>Real Example - Before Cleaning:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>Title: A Journal of the Plague Year
</span></span><span style="display:flex;"><span>Author: Daniel Defoe
</span></span><span style="display:flex;"><span>Release Date: March 2003
</span></span><span style="display:flex;"><span>Language: English
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>*** START OF THE PROJECT GUTENBERG EBOOK A JOURNAL OF THE PLAGUE YEAR ***
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>It was about the beginning of September 1664, that I, among the rest of my neighbours, heard in ordinary discourse that the plague was returned again in Holland...
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>*** END OF THE PROJECT GUTENBERG EBOOK A JOURNAL OF THE PLAGUE YEAR ***</span></span></code></pre></div>
<p><strong>After Cleaning:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>It was about the beginning of September 1664, that I, among the rest of my neighbours, heard in ordinary discourse that the plague was returned again in Holland...</span></span></code></pre></div>
<p>Without this cleaning, the model would learn to generate Gutenberg headers and metadata instead of authentic historical text, contaminating the training data with modern digital artifacts.</p>
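<p>One edge case worth noting: Listing 2 only starts collecting lines after it sees a <code>*** START OF</code> marker, so a file without that marker yields an empty string. A possible variant, sketched below under the assumption that unmarked files should be kept whole, slices between the markers when they exist and otherwise falls back to the full text. The function name <code>strip_gutenberg_boilerplate</code> is made up for this example.</p>

```python
START_MARKER = "*** START OF"
END_MARKER = "*** END OF"


def strip_gutenberg_boilerplate(text):
    """Return the text between the Gutenberg markers, or all of it if absent."""
    lines = text.split("\n")
    start = 0
    end = len(lines)
    for index, line in enumerate(lines):
        if START_MARKER in line and start == 0:
            start = index + 1  # content begins on the line after the marker
        elif END_MARKER in line:
            end = index  # content stops just before the end marker
            break
    return "\n".join(lines[start:end]).strip()
```

<p>Whether to keep or discard unmarked files is a judgment call; keeping them avoids silent data loss but means a later stage must catch any leftover boilerplate.</p>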
<h5 id="pdf-files"><strong>PDF Files</strong></h5>
<p>PDF files from historical archives often contain OCR errors and digital artifacts that require correction. The cleaning process in <a href="#listing3" class="listing-ref">Listing 3</a> addresses these issues while preserving historical content, removing page numbers and all-caps headers. While not perfect, it significantly improves text quality.</p>
<p>The OCR correction rules are based on common patterns in historical documents and can be refined for specific datasets. Libraries like <code>PyMuPDF</code> or <code>pdfplumber</code> extract text, while regex-based cleaning corrects common OCR errors and removes digital stamps. More advanced techniques, such as layout analysis or AI-based OCR correction, can further enhance this process.</p>
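<p>As one possible refinement of such rules, digit-for-letter substitutions can be restricted to digits that sit between two letters, so that genuine numbers such as years are left alone. This is a sketch of that idea, not the repository&rsquo;s rule set; the substitution table and the function name <code>fix_embedded_digits</code> are assumptions.</p>

```python
import re

# Assumed digit-to-letter confusions; extend or trim for a given dataset.
DIGIT_TO_LETTER = {"0": "o", "1": "l", "5": "s"}

# A letter, then a look-alike digit, then (without consuming it) another
# letter, e.g. the "L0" in "L0ndon".
EMBEDDED_DIGIT = re.compile(r"([A-Za-z])([015])(?=[A-Za-z])")


def fix_embedded_digits(text):
    """Replace look-alike digits only when they are embedded inside a word."""
    return EMBEDDED_DIGIT.sub(
        lambda match: match.group(1) + DIGIT_TO_LETTER[match.group(2)],
        text,
    )
```

<p>With this rule, <code>L0ndon</code> becomes <code>London</code> while <code>1680</code> and a standalone <code>10</code> are untouched. Two adjacent misread digits inside one word are not handled and would need an additional pass.</p>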
<figure id="listing3"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">import</span> re
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">clean_pdf_text</span>(text: <span style="color:#91d7e3">str</span>) <span style="color:#91d7e3;font-weight:bold">-&gt;</span> <span style="color:#91d7e3">str</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Clean PDF text by removing OCR artifacts and digital stamps&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Remove page numbers: [Page 123], standalone numbers</span>
</span></span><span style="display:flex;"><span>    text <span style="color:#91d7e3;font-weight:bold">=</span> re<span style="color:#91d7e3;font-weight:bold">.</span>sub(<span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;\[Page \d+\]&#39;</span>, <span style="color:#a6da95">&#39;&#39;</span>, text)
</span></span><span style="display:flex;"><span>    text <span style="color:#91d7e3;font-weight:bold">=</span> re<span style="color:#91d7e3;font-weight:bold">.</span>sub(<span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;^\d+$&#39;</span>, <span style="color:#a6da95">&#39;&#39;</span>, text, flags<span style="color:#91d7e3;font-weight:bold">=</span>re<span style="color:#91d7e3;font-weight:bold">.</span>MULTILINE)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Remove library stamps: Internet Archive, Google, etc.</span>
</span></span><span style="display:flex;"><span>    stamps <span style="color:#91d7e3;font-weight:bold">=</span> [
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;Internet Archive&#39;</span>, <span style="color:#a6da95">&#39;Google Books&#39;</span>, <span style="color:#a6da95">&#39;HathiTrust&#39;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;Digitized by Google&#39;</span>, <span style="color:#a6da95">&#39;Scanned by Google&#39;</span>
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> stamp <span style="color:#91d7e3;font-weight:bold">in</span> stamps:
</span></span><span style="display:flex;"><span>        text <span style="color:#91d7e3;font-weight:bold">=</span> text<span style="color:#91d7e3;font-weight:bold">.</span>replace(stamp, <span style="color:#a6da95">&#39;&#39;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Fix common OCR artifacts</span>
</span></span><span style="display:flex;"><span>    ocr_fixes <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;\b0\b&#39;</span>: <span style="color:#a6da95">&#39;O&#39;</span>,  <span style="color:#6e738d;font-style:italic"># 0 → O</span>
</span></span><span style="display:flex;"><span>        <span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;\b1\b&#39;</span>: <span style="color:#a6da95">&#39;I&#39;</span>,  <span style="color:#6e738d;font-style:italic"># 1 → I  </span>
</span></span><span style="display:flex;"><span>        <span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;\b5\b&#39;</span>: <span style="color:#a6da95">&#39;S&#39;</span>,  <span style="color:#6e738d;font-style:italic"># 5 → S</span>
</span></span><span style="display:flex;"><span>        <span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;\b8\b&#39;</span>: <span style="color:#a6da95">&#39;B&#39;</span>,  <span style="color:#6e738d;font-style:italic"># 8 → B</span>
</span></span><span style="display:flex;"><span>        <span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;\brn\b&#39;</span>: <span style="color:#a6da95">&#39;m&#39;</span>, <span style="color:#6e738d;font-style:italic"># rn → m</span>
</span></span><span style="display:flex;"><span>        <span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;\bcl\b&#39;</span>: <span style="color:#a6da95">&#39;d&#39;</span>  <span style="color:#6e738d;font-style:italic"># cl → d</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> pattern, replacement <span style="color:#91d7e3;font-weight:bold">in</span> ocr_fixes<span style="color:#91d7e3;font-weight:bold">.</span>items():
</span></span><span style="display:flex;"><span>        text <span style="color:#91d7e3;font-weight:bold">=</span> re<span style="color:#91d7e3;font-weight:bold">.</span>sub(pattern, replacement, text)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Remove all-caps lines of 10+ characters (usually headers)</span>
</span></span><span style="display:flex;"><span>    lines <span style="color:#91d7e3;font-weight:bold">=</span> text<span style="color:#91d7e3;font-weight:bold">.</span>split(<span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n</span><span style="color:#a6da95">&#39;</span>)
</span></span><span style="display:flex;"><span>    cleaned_lines <span style="color:#91d7e3;font-weight:bold">=</span> [line <span style="color:#c6a0f6">for</span> line <span style="color:#91d7e3;font-weight:bold">in</span> lines <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3;font-weight:bold">not</span> line<span style="color:#91d7e3;font-weight:bold">.</span>isupper() <span style="color:#91d7e3;font-weight:bold">or</span> <span style="color:#91d7e3">len</span>(line) <span style="color:#91d7e3;font-weight:bold">&lt;</span> <span style="color:#f5a97f">10</span>]
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> <span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n</span><span style="color:#a6da95">&#39;</span><span style="color:#91d7e3;font-weight:bold">.</span>join(cleaned_lines)</span></span></code></pre></div><figcaption>
        <strong>Listing 3: PDF Text Cleaning Function</strong>
    </figcaption>
</figure>
<p><strong>Real Example - Before Cleaning:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>[Page 1]
</span></span><span style="display:flex;"><span>INTERNET ARCHIVE
</span></span><span style="display:flex;"><span>A JOURNAL OF THE PLAGUE YEAR
</span></span><span style="display:flex;"><span>BY DANIEL DEFOE
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>It was about the beginning of September 1664, that I, among the rest of my neighbours, heard in ordinary discourse that the plague was returned again in Holland. For it was indeed a very terrible time, and the people began to be very much alarmed at it.</span></span></code></pre></div>
<p><strong>After Cleaning:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>It was about the beginning of September 1664, that I, among the rest of my neighbours, heard in ordinary discourse that the plague was returned again in Holland. For it was indeed a very terrible time, and the people began to be very much alarmed at it.</span></span></code></pre></div>
<p>OCR errors can significantly degrade training quality. For example, if <code>London</code> appears as <code>L0nd0n</code>, the model sees a corrupted spelling of a common word and may reproduce it when asked about historical London. Correcting these artifacts helps the model learn authentic historical language patterns rather than digitization noise, which matters for generating coherent and historically plausible text.</p>
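<p>One caveat with word-boundary patterns such as <code>\b0\b</code>: they match a digit only when it stands alone as a token, so a digit embedded in a word like <code>L0nd0n</code> is left untouched, while a genuine standalone number is rewritten. The sketch below illustrates the difference. The helper <code>fix_digits_inside_words</code> and its lookaround rule are assumptions made for illustration, not part of the pipeline shown above.</p>

```python
import re

# Word-boundary pattern in the style of the fix table: matches a lone "0" token only.
standalone_zero = re.compile(r'\b0\b')

# Hypothetical rule (an assumption, not from the pipeline): a "0" with a letter
# on both sides is treated as a misread lowercase "o".
embedded_zero = re.compile(r'(?<=[A-Za-z])0(?=[A-Za-z])')


def fix_digits_inside_words(text: str) -> str:
    """Replace a zero sandwiched between two letters with 'o'."""
    return embedded_zero.sub('o', text)


sample = "The plague reached L0nd0n in 1665, killing 0 in this parish."

# The boundary pattern skips "L0nd0n" but rewrites the genuine count "0".
print(standalone_zero.sub('O', sample))
# The lookaround rule repairs "L0nd0n" and leaves real numbers alone.
print(fix_digits_inside_words(sample))
```

<p>On this sample the first call leaves <code>L0nd0n</code> as it is and turns the count <code>0</code> into <code>O</code>; the second restores <code>London</code> and keeps both <code>1665</code> and <code>0</code>. A rule like this would still need checking against real OCR output before being trusted on a full corpus.</p>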
<h5 id="html-files"><strong>HTML Files</strong></h5>
<p>HTML files from historical websites and digital archives contain markup that needs to be stripped while preserving the actual text content. We use the <code>BeautifulSoup</code> library in <a href="#listing4" class="listing-ref">Listing 4</a> to clean the HTML structure and extract only the meaningful text.</p>
<figure id="listing4"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">clean_html_text</span>(html_content: <span style="color:#91d7e3">str</span>) <span style="color:#91d7e3;font-weight:bold">-&gt;</span> <span style="color:#91d7e3">str</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Clean HTML content by removing markup and extracting text&#34;&#34;&#34;</span>
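</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">re</span>
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">from</span> <span style="color:#f5a97f">bs4</span> <span style="color:#8bd5ca">import</span> BeautifulSoup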
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    soup <span style="color:#91d7e3;font-weight:bold">=</span> BeautifulSoup(html_content, <span style="color:#a6da95">&#39;html.parser&#39;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Remove unwanted elements</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> element <span style="color:#91d7e3;font-weight:bold">in</span> soup([<span style="color:#a6da95">&#39;script&#39;</span>, <span style="color:#a6da95">&#39;style&#39;</span>, <span style="color:#a6da95">&#39;nav&#39;</span>, <span style="color:#a6da95">&#39;header&#39;</span>, <span style="color:#a6da95">&#39;footer&#39;</span>, 
</span></span><span style="display:flex;"><span>                        <span style="color:#a6da95">&#39;aside&#39;</span>, <span style="color:#a6da95">&#39;menu&#39;</span>, <span style="color:#a6da95">&#39;form&#39;</span>, <span style="color:#a6da95">&#39;input&#39;</span>, <span style="color:#a6da95">&#39;button&#39;</span>]):
</span></span><span style="display:flex;"><span>        element<span style="color:#91d7e3;font-weight:bold">.</span>decompose()
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Remove wiki-specific elements</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> element <span style="color:#91d7e3;font-weight:bold">in</span> soup<span style="color:#91d7e3;font-weight:bold">.</span>find_all([<span style="color:#a6da95">&#39;div&#39;</span>, <span style="color:#a6da95">&#39;span&#39;</span>], class_<span style="color:#91d7e3;font-weight:bold">=</span>[<span style="color:#a6da95">&#39;navbox&#39;</span>, <span style="color:#a6da95">&#39;infobox&#39;</span>, <span style="color:#a6da95">&#39;sidebar&#39;</span>]):
</span></span><span style="display:flex;"><span>        element<span style="color:#91d7e3;font-weight:bold">.</span>decompose()
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Remove navigation elements</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> element <span style="color:#91d7e3;font-weight:bold">in</span> soup<span style="color:#91d7e3;font-weight:bold">.</span>find_all([<span style="color:#a6da95">&#39;div&#39;</span>, <span style="color:#a6da95">&#39;ul&#39;</span>], class_<span style="color:#91d7e3;font-weight:bold">=</span>[<span style="color:#a6da95">&#39;breadcrumb&#39;</span>, <span style="color:#a6da95">&#39;navigation&#39;</span>, <span style="color:#a6da95">&#39;menu&#39;</span>]):
</span></span><span style="display:flex;"><span>        element<span style="color:#91d7e3;font-weight:bold">.</span>decompose()
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Extract text content</span>
</span></span><span style="display:flex;"><span>    text <span style="color:#91d7e3;font-weight:bold">=</span> soup<span style="color:#91d7e3;font-weight:bold">.</span>get_text(separator<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39; &#39;</span>, strip<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Clean up excessive whitespace</span>
</span></span><span style="display:flex;"><span>    text <span style="color:#91d7e3;font-weight:bold">=</span> re<span style="color:#91d7e3;font-weight:bold">.</span>sub(<span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;\s+&#39;</span>, <span style="color:#a6da95">&#39; &#39;</span>, text)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> text<span style="color:#91d7e3;font-weight:bold">.</span>strip()</span></span></code></pre></div><figcaption>
        <strong>Listing 4: HTML Text Cleaning Function</strong>
    </figcaption>
</figure>
<p><strong>Real Example - Before Cleaning:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-html" data-lang="html"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic">&lt;!DOCTYPE html&gt;</span>
</span></span><span style="display:flex;"><span>&lt;<span style="color:#c6a0f6">html</span>&gt;
</span></span><span style="display:flex;"><span>&lt;<span style="color:#c6a0f6">head</span>&gt;&lt;<span style="color:#c6a0f6">title</span>&gt;London History&lt;/<span style="color:#c6a0f6">title</span>&gt;&lt;/<span style="color:#c6a0f6">head</span>&gt;
</span></span><span style="display:flex;"><span>&lt;<span style="color:#c6a0f6">body</span>&gt;
</span></span><span style="display:flex;"><span>&lt;<span style="color:#c6a0f6">nav</span>&gt;Home | About | Contact&lt;/<span style="color:#c6a0f6">nav</span>&gt;
</span></span><span style="display:flex;"><span>&lt;<span style="color:#c6a0f6">header</span>&gt;London Historical Society&lt;/<span style="color:#c6a0f6">header</span>&gt;
</span></span><span style="display:flex;"><span>&lt;<span style="color:#c6a0f6">div</span> <span style="color:#8aadf4">class</span><span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;content&#34;</span>&gt;
</span></span><span style="display:flex;"><span>    &lt;<span style="color:#c6a0f6">h1</span>&gt;The Great Fire of London&lt;/<span style="color:#c6a0f6">h1</span>&gt;
</span></span><span style="display:flex;"><span>    &lt;<span style="color:#c6a0f6">p</span>&gt;In the year 1666, a great fire consumed much of London...&lt;/<span style="color:#c6a0f6">p</span>&gt;
</span></span><span style="display:flex;"><span>&lt;/<span style="color:#c6a0f6">div</span>&gt;
</span></span><span style="display:flex;"><span>&lt;<span style="color:#c6a0f6">footer</span>&gt;© 2024 London Historical Society&lt;/<span style="color:#c6a0f6">footer</span>&gt;
</span></span><span style="display:flex;"><span>&lt;/<span style="color:#c6a0f6">body</span>&gt;
</span></span><span style="display:flex;"><span>&lt;/<span style="color:#c6a0f6">html</span>&gt;</span></span></code></pre></div>
<p><strong>After Cleaning:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>The Great Fire of London In the year 1666, a great fire consumed much of London...</span></span></code></pre></div>
<p>Left in place, HTML tags and navigation text would contaminate the training data, and the model could learn to emit markup or menu labels instead of historical prose. The cleaning step keeps the readable content in document order and collapses all whitespace, including paragraph breaks, into single spaces.</p>
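<p>To make the idea concrete without any third-party dependency, here is a minimal sketch of the same kind of filtering built on Python&rsquo;s standard-library <code>html.parser</code> instead of <code>BeautifulSoup</code>. The class name <code>TextExtractor</code>, the <code>SKIP_TAGS</code> set, and the sample page are assumptions for illustration; this is a simplified stand-in, not the function used in the pipeline.</p>

```python
from html.parser import HTMLParser

# Container tags whose contents are treated as boilerplate (assumed set).
# Void tags such as <input> are left out on purpose: they never close,
# so counting them would leave the skip depth stuck above zero.
SKIP_TAGS = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than zero while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Return the visible text outside the skipped elements, space-joined."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


page = """<body>
<nav>Home | About</nav>
<div class="content"><h1>The Great Fire of London</h1>
<p>In the year 1666, a great fire consumed much of London...</p></div>
<footer>Footer text</footer></body>"""

print(extract_text(page))
```

<p>The depth counter is the key design choice: text is kept only while the parser is outside every skipped element, which also handles boilerplate tags nested inside one another.</p>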
<h5 id="xml-files-historical-archives"><strong>XML Files (Historical Archives)</strong></h5>
<p>XML files from historical archives, such as the Old Bailey and London Lives, use specific schemas that require specialized parsing. Old Bailey employs <strong>TEI (Text Encoding Initiative)</strong> with <code>TEI.2</code> elements, while London Lives uses semantic markup (<code>name</code>, <code>geo</code>, <code>occupation</code>). These structured formats contain authentic historical language with rich metadata, as shown in <a href="#listing5" class="listing-ref">Listing 5</a>.</p>
<figure id="listing5"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">extract_old_bailey_text</span>(soup) <span style="color:#91d7e3;font-weight:bold">-&gt;</span> <span style="color:#91d7e3">str</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Extract text from Old Bailey XML using TEI schema structure&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    extracted_text <span style="color:#91d7e3;font-weight:bold">=</span> []
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check for TEI.2 elements (Old Bailey schema)</span>
</span></span><span style="display:flex;"><span>    tei_elements <span style="color:#91d7e3;font-weight:bold">=</span> soup<span style="color:#91d7e3;font-weight:bold">.</span>find_all(<span style="color:#a6da95">&#39;TEI.2&#39;</span>)
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> tei_elements:
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Extract trial accounts (main narrative content)</span>
</span></span><span style="display:flex;"><span>        trial_accounts <span style="color:#91d7e3;font-weight:bold">=</span> soup<span style="color:#91d7e3;font-weight:bold">.</span>find_all(<span style="color:#a6da95">&#39;div1&#39;</span>, {<span style="color:#a6da95">&#39;type&#39;</span>: <span style="color:#a6da95">&#39;trialAccount&#39;</span>})
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">for</span> trial <span style="color:#91d7e3;font-weight:bold">in</span> trial_accounts:
</span></span><span style="display:flex;"><span>            trial_text <span style="color:#91d7e3;font-weight:bold">=</span> extract_trial_narrative(trial)
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">if</span> trial_text:
</span></span><span style="display:flex;"><span>                extracted_text<span style="color:#91d7e3;font-weight:bold">.</span>append(trial_text)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Extract front matter (session information)</span>
</span></span><span style="display:flex;"><span>        front_matter <span style="color:#91d7e3;font-weight:bold">=</span> soup<span style="color:#91d7e3;font-weight:bold">.</span>find_all(<span style="color:#a6da95">&#39;div1&#39;</span>, {<span style="color:#a6da95">&#39;type&#39;</span>: <span style="color:#a6da95">&#39;frontMatter&#39;</span>})
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">for</span> front <span style="color:#91d7e3;font-weight:bold">in</span> front_matter:
</span></span><span style="display:flex;"><span>            front_text <span style="color:#91d7e3;font-weight:bold">=</span> extract_front_matter_narrative(front)
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">if</span> front_text:
</span></span><span style="display:flex;"><span>                extracted_text<span style="color:#91d7e3;font-weight:bold">.</span>append(front_text)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> <span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n\n</span><span style="color:#a6da95">&#39;</span><span style="color:#91d7e3;font-weight:bold">.</span>join(extracted_text)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">extract_london_lives_text</span>(soup) <span style="color:#91d7e3;font-weight:bold">-&gt;</span> <span style="color:#91d7e3">str</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Extract text from London Lives XML using semantic markup schema&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    extracted_text <span style="color:#91d7e3;font-weight:bold">=</span> []
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check for London Lives specific elements (name, geo, occupation, date)</span>
</span></span><span style="display:flex;"><span>    name_elements <span style="color:#91d7e3;font-weight:bold">=</span> soup<span style="color:#91d7e3;font-weight:bold">.</span>find_all(<span style="color:#a6da95">&#39;name&#39;</span>)
</span></span><span style="display:flex;"><span>    geo_elements <span style="color:#91d7e3;font-weight:bold">=</span> soup<span style="color:#91d7e3;font-weight:bold">.</span>find_all(<span style="color:#a6da95">&#39;geo&#39;</span>)
</span></span><span style="display:flex;"><span>    occupation_elements <span style="color:#91d7e3;font-weight:bold">=</span> soup<span style="color:#91d7e3;font-weight:bold">.</span>find_all(<span style="color:#a6da95">&#39;occupation&#39;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> name_elements <span style="color:#91d7e3;font-weight:bold">and</span> geo_elements <span style="color:#91d7e3;font-weight:bold">and</span> occupation_elements:
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Extract paragraphs with semantic markup</span>
</span></span><span style="display:flex;"><span>        paragraphs <span style="color:#91d7e3;font-weight:bold">=</span> soup<span style="color:#91d7e3;font-weight:bold">.</span>find_all(<span style="color:#a6da95">&#39;p&#39;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">for</span> para <span style="color:#91d7e3;font-weight:bold">in</span> paragraphs:
</span></span><span style="display:flex;"><span>            p_text <span style="color:#91d7e3;font-weight:bold">=</span> extract_paragraph_with_semantic_markup(para)
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">if</span> p_text<span style="color:#91d7e3;font-weight:bold">.</span>strip():
</span></span><span style="display:flex;"><span>                extracted_text<span style="color:#91d7e3;font-weight:bold">.</span>append(p_text)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> <span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n\n</span><span style="color:#a6da95">&#39;</span><span style="color:#91d7e3;font-weight:bold">.</span>join(extracted_text)</span></span></code></pre></div><figcaption>
        <strong>Listing 5: XML Text Extraction Functions</strong>
    </figcaption>
</figure>
<p><strong>Real Example - Old Bailey XML (Before Processing):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-xml" data-lang="xml"><span style="display:flex;"><span><span style="color:#c6a0f6">&lt;trial&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">&lt;frontmatter&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">&lt;session&gt;</span>Session 1<span style="color:#c6a0f6">&lt;/session&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">&lt;date&gt;</span>1674-04-15<span style="color:#c6a0f6">&lt;/date&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">&lt;location&gt;</span>Old Bailey<span style="color:#c6a0f6">&lt;/location&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">&lt;/frontmatter&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">&lt;proceedings&gt;</span>
</span></span><span style="display:flex;"><span>The prisoner being brought to the bar, and the indictment being read, he pleaded Not Guilty. The witnesses being sworn, the first witness deposed that on the 15th day of April last, he saw the prisoner in the company of several suspicious persons...
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">&lt;/proceedings&gt;</span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">&lt;/trial&gt;</span></span></span></code></pre></div>
<p><strong>After Processing:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>Session 1 1674-04-15 Old Bailey The prisoner being brought to the bar, and the indictment being read, he pleaded Not Guilty. The witnesses being sworn, the first witness deposed that on the 15th day of April last, he saw the prisoner in the company of several suspicious persons...</span></span></code></pre></div>
<p>These XML files contain the most authentic historical language in our entire dataset. The Old Bailey trials record how people spoke in court during the 17th-19th centuries, while London Lives reveals the everyday language of personal records and official documents. That makes them especially valuable for training a model to generate period-appropriate text, because they supply genuine examples of how people wrote and spoke in different historical periods.</p>
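<p>The helpers called in the listing above (<code>extract_trial_narrative</code> and friends) are not shown, so the following is only a rough sketch of what flattening one <code>div1</code> section into plain text might look like. It uses the standard-library <code>xml.etree.ElementTree</code> rather than <code>BeautifulSoup</code>, and the XML fragment, the <code>narrative_text</code> helper, and the <code>texts_by_type</code> helper are all made up for illustration.</p>

```python
import xml.etree.ElementTree as ET

# Hand-written fragment that only imitates the TEI.2 / div1 layout described above.
sample_xml = """<TEI.2>
  <text>
    <body>
      <div1 type="frontMatter"><p>Session held at the Old Bailey.</p></div1>
      <div1 type="trialAccount">
        <p>The prisoner being brought to the bar, <name>he</name> pleaded Not Guilty.</p>
        <p>The witnesses being sworn, the first witness deposed.</p>
      </div1>
    </body>
  </text>
</TEI.2>"""


def narrative_text(element: ET.Element) -> str:
    """Concatenate all text under an element, inline tags included,
    then collapse runs of whitespace into single spaces."""
    return ' '.join(''.join(element.itertext()).split())


def texts_by_type(root: ET.Element, div_type: str) -> list:
    """Return the flattened text of every div1 with the given type attribute."""
    return [
        narrative_text(div)
        for div in root.iter('div1')
        if div.get('type') == div_type
    ]


root = ET.fromstring(sample_xml)
print(texts_by_type(root, 'trialAccount'))
print(texts_by_type(root, 'frontMatter'))
```

<p>Using <code>itertext()</code> keeps the words inside inline semantic tags in reading order, which is the property that matters for training text; the tags themselves are simply dropped.</p>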
<h4 id="223-stage-3-text-normalization">2.2.3 Stage 3: Text Normalization</h4>
<p>After extraction, text normalization makes the training data consistent. Historical documents contain encoding issues, inconsistent formatting, and special characters that add noise to training. Our normalization process fixes these issues and breaks long lines (over 2,000 characters) at sentence boundaries. This matters because text that overruns the model&rsquo;s context window is typically cut at an arbitrary point, so the model ends up training on fragments that start or stop mid-sentence; splitting at sentence boundaries keeps each piece coherent.</p>
<p>Inconsistent encoding and formatting can confuse the language model during training. For example, if some files use smart quotes (&ldquo; &rdquo;) and others use straight quotes (&quot;), the model sees different characters for the same punctuation and may not learn that they are equivalent, leading to inconsistent text generation. Normalization ensures that the model observes consistent patterns across all training data, which is crucial for learning coherent language patterns and generating high-quality historical text.</p>
<p>The code snippet in <a href="#listing6" class="listing-ref">Listing 6</a> shows how we implement this normalization.</p>
<figure id="listing6"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">normalize_text</span>(text: <span style="color:#91d7e3">str</span>) <span style="color:#91d7e3;font-weight:bold">-&gt;</span> <span style="color:#91d7e3">str</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Normalize text for consistent training data&#34;&#34;&#34;</span>
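</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">re</span>
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">unicodedata</span>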
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Fix common encoding issues</span>
</span></span><span style="display:flex;"><span>    encoding_fixes <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;â€™&#39;</span>: <span style="color:#a6da95">&#34;&#39;&#34;</span>,  <span style="color:#6e738d;font-style:italic"># Smart apostrophe</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;â€œ&#39;</span>: <span style="color:#a6da95">&#39;&#34;&#39;</span>,  <span style="color:#6e738d;font-style:italic"># Smart quote left</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;â€&#34;&#39;</span>: <span style="color:#a6da95">&#39;—&#39;</span>,  <span style="color:#6e738d;font-style:italic"># Em dash</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;â€¢&#39;</span>: <span style="color:#a6da95">&#39;•&#39;</span>,  <span style="color:#6e738d;font-style:italic"># Bullet point</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;â€¦&#39;</span>: <span style="color:#a6da95">&#39;…&#39;</span>,  <span style="color:#6e738d;font-style:italic"># Ellipsis</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;â€&#39;</span>: <span style="color:#a6da95">&#39;&#34;&#39;</span>,   <span style="color:#6e738d;font-style:italic"># Smart quote right; a prefix of the other keys, so it must come last</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> old, new <span style="color:#91d7e3;font-weight:bold">in</span> encoding_fixes<span style="color:#91d7e3;font-weight:bold">.</span>items():
</span></span><span style="display:flex;"><span>        text <span style="color:#91d7e3;font-weight:bold">=</span> text<span style="color:#91d7e3;font-weight:bold">.</span>replace(old, new)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Normalize Unicode (NFC)</span>
</span></span><span style="display:flex;"><span>    text <span style="color:#91d7e3;font-weight:bold">=</span> unicodedata<span style="color:#91d7e3;font-weight:bold">.</span>normalize(<span style="color:#a6da95">&#39;NFC&#39;</span>, text)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Break long lines for training compatibility (max 2000 chars)</span>
</span></span><span style="display:flex;"><span>    lines <span style="color:#91d7e3;font-weight:bold">=</span> text<span style="color:#91d7e3;font-weight:bold">.</span>split(<span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n</span><span style="color:#a6da95">&#39;</span>)
</span></span><span style="display:flex;"><span>    normalized_lines <span style="color:#91d7e3;font-weight:bold">=</span> []
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> line <span style="color:#91d7e3;font-weight:bold">in</span> lines:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">len</span>(line) <span style="color:#91d7e3;font-weight:bold">&gt;</span> <span style="color:#f5a97f">2000</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#6e738d;font-style:italic"># Split at sentence boundaries</span>
</span></span><span style="display:flex;"><span>            sentences <span style="color:#91d7e3;font-weight:bold">=</span> re<span style="color:#91d7e3;font-weight:bold">.</span>split(<span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;(?&lt;=[.!?])\s+&#39;</span>, line)
</span></span><span style="display:flex;"><span>            current_line <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#34;&#34;</span>
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">for</span> sentence <span style="color:#91d7e3;font-weight:bold">in</span> sentences:
</span></span><span style="display:flex;"><span>                <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">len</span>(current_line <span style="color:#91d7e3;font-weight:bold">+</span> sentence) <span style="color:#91d7e3;font-weight:bold">&gt;</span> <span style="color:#f5a97f">2000</span>:
</span></span><span style="display:flex;"><span>                    <span style="color:#c6a0f6">if</span> current_line:
</span></span><span style="display:flex;"><span>                        normalized_lines<span style="color:#91d7e3;font-weight:bold">.</span>append(current_line<span style="color:#91d7e3;font-weight:bold">.</span>strip())
</span></span><span style="display:flex;"><span>                    current_line <span style="color:#91d7e3;font-weight:bold">=</span> sentence
</span></span><span style="display:flex;"><span>                <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>                    current_line <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#a6da95">&#34; &#34;</span> <span style="color:#91d7e3;font-weight:bold">+</span> sentence <span style="color:#c6a0f6">if</span> current_line <span style="color:#c6a0f6">else</span> sentence
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">if</span> current_line:
</span></span><span style="display:flex;"><span>                normalized_lines<span style="color:#91d7e3;font-weight:bold">.</span>append(current_line<span style="color:#91d7e3;font-weight:bold">.</span>strip())
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>            normalized_lines<span style="color:#91d7e3;font-weight:bold">.</span>append(line)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Normalize line endings and whitespace</span>
</span></span><span style="display:flex;"><span>    text <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n</span><span style="color:#a6da95">&#39;</span><span style="color:#91d7e3;font-weight:bold">.</span>join(normalized_lines)
</span></span><span style="display:flex;"><span>    text <span style="color:#91d7e3;font-weight:bold">=</span> re<span style="color:#91d7e3;font-weight:bold">.</span>sub(<span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;[ \t]+&#39;</span>, <span style="color:#a6da95">&#39; &#39;</span>, text)  <span style="color:#6e738d;font-style:italic"># Multiple spaces/tabs to single space</span>
</span></span><span style="display:flex;"><span>    text <span style="color:#91d7e3;font-weight:bold">=</span> re<span style="color:#91d7e3;font-weight:bold">.</span>sub(<span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;\n\s*\n&#39;</span>, <span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n\n</span><span style="color:#a6da95">&#39;</span>, text)  <span style="color:#6e738d;font-style:italic"># Multiple newlines to double newline</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> text<span style="color:#91d7e3;font-weight:bold">.</span>strip()</span></span></code></pre></div><figcaption>
        <strong>Listing 6: Text Normalization Function</strong>
    </figcaption>
</figure>
<p><strong>Real Example - Before Normalization:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>The year was 1666, and the plague had come to London. â€œIt was indeed a very terrible time,â€ wrote one observer. The streets were filled with the sounds of horse-drawn carriages and the cries of the afflicted.</span></span></code></pre></div>
<p><strong>After Normalization:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>The year was 1666, and the plague had come to London. &#34;It was indeed a very terrible time,&#34; wrote one observer. The streets were filled with the sounds of horse-drawn carriages and the cries of the afflicted.</span></span></code></pre></div>
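<p>The broken quote marks in the example above are typical mojibake: UTF-8 curly quotes that were decoded as Windows-1252. As a minimal sketch of how this one case could be repaired (this is not the project's normalization code; the <code>fix_quote_mojibake</code> helper and its replacement table are assumptions made for illustration):</p>

```python
# Hypothetical helper: maps a few known mojibake sequences to ASCII quotes.
# Longer sequences come first so the bare two-character prefix is tried last.
MOJIBAKE_QUOTES = [
    ("\u00e2\u20ac\u0153", '"'),  # left double quote decoded as cp1252
    ("\u00e2\u20ac\u009d", '"'),  # right double quote, control character kept
    ("\u00e2\u20ac\u2122", "'"),  # right single quote / apostrophe
    ("\u00e2\u20ac", '"'),        # right double quote, control character lost
]


def fix_quote_mojibake(text: str) -> str:
    """Replace known mojibake quote sequences with plain ASCII quotes."""
    for broken, fixed in MOJIBAKE_QUOTES:
        text = text.replace(broken, fixed)
    return text


# Escapes are used so the sample survives any editor or feed encoding.
before = "\u00e2\u20ac\u0153It was indeed a very terrible time,\u00e2\u20ac wrote one observer."
print(fix_quote_mojibake(before))
```

<p>The last table entry is a blunt fallback that assumes only quote mojibake is present; text containing other damaged punctuation (dashes, ellipses) would need more entries placed before it.</p>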
<h4 id="224-stage-4-quality-validation">2.2.4 Stage 4: Quality Validation</h4>
<p>Not all extracted text is suitable for training. Some files contain duplicates, non-English content, or poor-quality text that would degrade model performance. We need a comprehensive validation system that ensures only high-quality, relevant text is included in our training corpus.</p>
<p>The key challenge is striking a balance between quality standards and historical value. A strict approach might reject valuable historical documents that have some OCR issues, while a lenient approach might include too much low-quality content, which can degrade model training. To address this, I implemented a <strong>tiered quality threshold system</strong> that applies different standards based on content type:</p>
<ul>
<li><strong>General Content</strong>: 200+ chars, 50+ words, 50% meaningful words</li>
<li><strong>Project Gutenberg</strong>: 200+ chars, 50+ words, 40% meaningful words (relaxed for historical value)</li>
<li><strong>Historical Documents</strong>: 1000+ chars, 100+ words, 30% meaningful words (the most relaxed ratio, paired with stricter length minimums)</li>
</ul>
<p>This tiered approach lets us keep valuable historical content while still enforcing a quality floor. These implementations are deliberately simple, as befits a toy project, and can be made more robust. <a href="#listing7" class="listing-ref">Listing 7</a> shows the quality analysis code; it also computes a content hash, which supports duplicate detection.</p>
<figure id="listing7"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">analyze_text_quality</span>(text: <span style="color:#91d7e3">str</span>, source_type: <span style="color:#91d7e3">str</span> <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#39;general&#39;</span>) <span style="color:#91d7e3;font-weight:bold">-&gt;</span> <span style="color:#91d7e3">dict</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Analyze text quality and determine if it should be included in training corpus&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">hashlib</span>
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">re</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Length validation</span>
</span></span><span style="display:flex;"><span>    char_count <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">len</span>(text)
</span></span><span style="display:flex;"><span>    word_count <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">len</span>(text<span style="color:#91d7e3;font-weight:bold">.</span>split())
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># OCR artifact detection using regex patterns</span>
</span></span><span style="display:flex;"><span>    ocr_patterns <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;long_capitals&#39;</span>: <span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;[A-Z]{5,}\s+[A-Z]{5,}&#39;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;spaced_letters&#39;</span>: <span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;\b[A-Za-z]\s+[A-Za-z]\s+[A-Za-z]\s+[A-Za-z]\b&#39;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;special_chars&#39;</span>: <span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;[!@#$%^&amp;*()]{3,}&#39;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;mixed_alphanumeric&#39;</span>: <span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;\b\d+[A-Za-z]+\d+\b&#39;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;long_non_word&#39;</span>: <span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;[^\w\s]{10,}&#39;</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    ocr_issues <span style="color:#91d7e3;font-weight:bold">=</span> []
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> pattern_name, pattern <span style="color:#91d7e3;font-weight:bold">in</span> ocr_patterns<span style="color:#91d7e3;font-weight:bold">.</span>items():
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> re<span style="color:#91d7e3;font-weight:bold">.</span>search(pattern, text):
</span></span><span style="display:flex;"><span>            ocr_issues<span style="color:#91d7e3;font-weight:bold">.</span>append(pattern_name)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Advertisement detection</span>
</span></span><span style="display:flex;"><span>    ad_patterns <span style="color:#91d7e3;font-weight:bold">=</span> [
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;this day is published&#39;</span>, <span style="color:#a6da95">&#39;just ready&#39;</span>, <span style="color:#a6da95">&#39;elegantly bound&#39;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;now ready&#39;</span>, <span style="color:#a6da95">&#39;new novels&#39;</span>, <span style="color:#a6da95">&#39;advertisements&#39;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;price \d+s&#39;</span>, <span style="color:#a6da95">&#39;paternoster row&#39;</span>, <span style="color:#a6da95">&#39;corner of&#39;</span>, <span style="color:#a6da95">&#39;publishers&#39;</span>
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    ad_count <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">sum</span>(<span style="color:#f5a97f">1</span> <span style="color:#c6a0f6">for</span> pattern <span style="color:#91d7e3;font-weight:bold">in</span> ad_patterns <span style="color:#c6a0f6">if</span> re<span style="color:#91d7e3;font-weight:bold">.</span>search(pattern, text, re<span style="color:#91d7e3;font-weight:bold">.</span>IGNORECASE))
</span></span><span style="display:flex;"><span>    ad_density <span style="color:#91d7e3;font-weight:bold">=</span> ad_count <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#91d7e3">max</span>(word_count, <span style="color:#f5a97f">1</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Meaningful word ratio calculation</span>
</span></span><span style="display:flex;"><span>    words <span style="color:#91d7e3;font-weight:bold">=</span> text<span style="color:#91d7e3;font-weight:bold">.</span>split()
</span></span><span style="display:flex;"><span>    meaningful_words <span style="color:#91d7e3;font-weight:bold">=</span> [w <span style="color:#c6a0f6">for</span> w <span style="color:#91d7e3;font-weight:bold">in</span> words <span style="color:#c6a0f6">if</span> w<span style="color:#91d7e3;font-weight:bold">.</span>isalpha() <span style="color:#91d7e3;font-weight:bold">and</span> <span style="color:#91d7e3">len</span>(w) <span style="color:#91d7e3;font-weight:bold">&gt;</span> <span style="color:#f5a97f">2</span>]
</span></span><span style="display:flex;"><span>    meaningful_ratio <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#91d7e3">len</span>(meaningful_words) <span style="color:#91d7e3;font-weight:bold">/</span> <span style="color:#91d7e3">max</span>(<span style="color:#91d7e3">len</span>(words), <span style="color:#f5a97f">1</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Quality thresholds based on source type</span>
</span></span><span style="display:flex;"><span>    thresholds <span style="color:#91d7e3;font-weight:bold">=</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;general&#39;</span>: {<span style="color:#a6da95">&#39;min_chars&#39;</span>: <span style="color:#f5a97f">200</span>, <span style="color:#a6da95">&#39;min_words&#39;</span>: <span style="color:#f5a97f">50</span>, <span style="color:#a6da95">&#39;min_meaningful_ratio&#39;</span>: <span style="color:#f5a97f">0.50</span>},
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;gutenberg&#39;</span>: {<span style="color:#a6da95">&#39;min_chars&#39;</span>: <span style="color:#f5a97f">200</span>, <span style="color:#a6da95">&#39;min_words&#39;</span>: <span style="color:#f5a97f">50</span>, <span style="color:#a6da95">&#39;min_meaningful_ratio&#39;</span>: <span style="color:#f5a97f">0.40</span>},
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;historical&#39;</span>: {<span style="color:#a6da95">&#39;min_chars&#39;</span>: <span style="color:#f5a97f">1000</span>, <span style="color:#a6da95">&#39;min_words&#39;</span>: <span style="color:#f5a97f">100</span>, <span style="color:#a6da95">&#39;min_meaningful_ratio&#39;</span>: <span style="color:#f5a97f">0.30</span>}
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    threshold <span style="color:#91d7e3;font-weight:bold">=</span> thresholds<span style="color:#91d7e3;font-weight:bold">.</span>get(source_type, thresholds[<span style="color:#a6da95">&#39;general&#39;</span>])
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Quality scoring</span>
</span></span><span style="display:flex;"><span>    score <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">100</span>
</span></span><span style="display:flex;"><span>    score <span style="color:#91d7e3;font-weight:bold">-=</span> <span style="color:#91d7e3">len</span>(ocr_issues) <span style="color:#91d7e3;font-weight:bold">*</span> <span style="color:#f5a97f">3</span>  <span style="color:#6e738d;font-style:italic"># OCR issues</span>
</span></span><span style="display:flex;"><span>    score <span style="color:#91d7e3;font-weight:bold">-=</span> ad_density <span style="color:#91d7e3;font-weight:bold">*</span> <span style="color:#f5a97f">50</span>  <span style="color:#6e738d;font-style:italic"># Advertisement density</span>
</span></span><span style="display:flex;"><span>    score <span style="color:#91d7e3;font-weight:bold">-=</span> (<span style="color:#f5a97f">1</span> <span style="color:#91d7e3;font-weight:bold">-</span> meaningful_ratio) <span style="color:#91d7e3;font-weight:bold">*</span> <span style="color:#f5a97f">20</span>  <span style="color:#6e738d;font-style:italic"># Meaningful word ratio</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Check if text meets quality thresholds</span>
</span></span><span style="display:flex;"><span>    meets_thresholds <span style="color:#91d7e3;font-weight:bold">=</span> (
</span></span><span style="display:flex;"><span>        char_count <span style="color:#91d7e3;font-weight:bold">&gt;=</span> threshold[<span style="color:#a6da95">&#39;min_chars&#39;</span>] <span style="color:#91d7e3;font-weight:bold">and</span>
</span></span><span style="display:flex;"><span>        word_count <span style="color:#91d7e3;font-weight:bold">&gt;=</span> threshold[<span style="color:#a6da95">&#39;min_words&#39;</span>] <span style="color:#91d7e3;font-weight:bold">and</span>
</span></span><span style="display:flex;"><span>        meaningful_ratio <span style="color:#91d7e3;font-weight:bold">&gt;=</span> threshold[<span style="color:#a6da95">&#39;min_meaningful_ratio&#39;</span>] <span style="color:#91d7e3;font-weight:bold">and</span>
</span></span><span style="display:flex;"><span>        ad_density <span style="color:#91d7e3;font-weight:bold">&lt;</span> <span style="color:#f5a97f">0.1</span>  <span style="color:#6e738d;font-style:italic"># Fewer than 0.1 ad-pattern matches per word</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;char_count&#39;</span>: char_count,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;word_count&#39;</span>: word_count,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;meaningful_ratio&#39;</span>: meaningful_ratio,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;ocr_issues&#39;</span>: ocr_issues,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;ad_density&#39;</span>: ad_density,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;score&#39;</span>: score,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;meets_thresholds&#39;</span>: meets_thresholds,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#39;content_hash&#39;</span>: hashlib<span style="color:#91d7e3;font-weight:bold">.</span>md5(text<span style="color:#91d7e3;font-weight:bold">.</span>encode())<span style="color:#91d7e3;font-weight:bold">.</span>hexdigest()
</span></span><span style="display:flex;"><span>    }</span></span></code></pre></div><figcaption>
        <strong>Listing 7: Text Quality Analysis Function</strong>
    </figcaption>
</figure>
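<p>Listing 7 returns a <code>content_hash</code> but does not show how it is consumed. One way it might be used for duplicate detection is sketched below; the <code>DuplicateTracker</code> class is a hypothetical helper, not part of the original pipeline:</p>

```python
import hashlib


class DuplicateTracker:
    """Hypothetical helper: remembers content hashes of texts already seen."""

    def __init__(self) -> None:
        self._seen = set()

    def is_duplicate(self, text: str) -> bool:
        """Return True if an identical text was seen before; otherwise record it."""
        # Same hashing call as Listing 7, so the digests are comparable.
        digest = hashlib.md5(text.encode()).hexdigest()
        if digest in self._seen:
            return True
        self._seen.add(digest)
        return False


tracker = DuplicateTracker()
documents = [
    "The year was 1666, and the plague had come to London.",
    "A different account of the same year.",
    "The year was 1666, and the plague had come to London.",
]
unique = [doc for doc in documents if not tracker.is_duplicate(doc)]
print(len(unique))  # 2
```

<p>This only catches exact matches. Two copies of a text that differ by a single space produce different hashes, so hashing the normalized text is likely the safer choice.</p>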
<p><strong>Content Quality Validation</strong></p>
<p>Our validation system employs multiple detection mechanisms to ensure training corpus quality:</p>
<ul>
<li><strong>OCR Artifact Detection</strong>: Regex patterns identify common digitization errors, including misread headers, character separation failures, scanning artifacts, alphanumeric misinterpretations, and corrupted text regions</li>
<li><strong>Advertisement Filtering</strong>: Pattern matching detects commercial content using phrases like &ldquo;this day is published&rdquo;, &ldquo;just ready&rdquo;, &ldquo;elegantly bound&rdquo;, and price references</li>
<li><strong>Quality Scoring</strong>: A 100-point system deducts 3 points per matched OCR pattern, 50 × the advertisement density, and up to 20 points for a low meaningful-word ratio</li>
</ul>
<p>This multi-layered approach balances quality standards with preservation of valuable historical content, ensuring the model trains on authentic historical language while filtering out contamination sources.</p>
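<p>To make the scoring arithmetic concrete, here is a small worked example that reproduces only the score formula from Listing 7. The <code>quality_score</code> function name and the sample inputs are assumptions for illustration:</p>

```python
def quality_score(num_ocr_issues: int, ad_density: float, meaningful_ratio: float) -> float:
    """Reproduce the score arithmetic from Listing 7 for a single document."""
    score = 100.0
    score -= num_ocr_issues * 3           # 3 points per matched OCR pattern
    score -= ad_density * 50              # scales with ad-pattern matches per word
    score -= (1 - meaningful_ratio) * 20  # up to 20 points when no word is meaningful
    return score


# Two OCR patterns matched, 2 ad patterns in 100 words, 80% meaningful words:
# 100 - 6 - 1 - 4 = 89
print(round(quality_score(2, 0.02, 0.80), 2))
```

<p>In Listing 7 as shown, the score is reported alongside the other metrics, while the accept/reject decision comes from <code>meets_thresholds</code>.</p>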
<h4 id="225-stage-5-final-processing-and-corpus-creation">2.2.5 Stage 5: Final Processing and Corpus Creation</h4>
<p>After cleaning and validation, we create the final training corpus. Long texts are segmented into manageable chunks while preserving the historical narrative flow, which matters because the model&rsquo;s context window is limited (e.g., 2048 tokens). Note that the segmentation code measures length in characters, not tokens. <a href="#listing8" class="listing-ref">Listing 8</a> shows this final processing stage.</p>
<figure id="listing8"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">create_comprehensive_corpus</span>(cleaned_files: <span style="color:#91d7e3">list</span>) <span style="color:#91d7e3;font-weight:bold">-&gt;</span> <span style="color:#91d7e3">str</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Create final training corpus with intelligent segmentation&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    corpus_parts <span style="color:#91d7e3;font-weight:bold">=</span> []
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> file_path <span style="color:#91d7e3;font-weight:bold">in</span> cleaned_files:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">with</span> <span style="color:#91d7e3">open</span>(file_path, <span style="color:#a6da95">&#39;r&#39;</span>, encoding<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;utf-8&#39;</span>) <span style="color:#c6a0f6">as</span> f:
</span></span><span style="display:flex;"><span>            content <span style="color:#91d7e3;font-weight:bold">=</span> f<span style="color:#91d7e3;font-weight:bold">.</span>read()
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Split into training segments</span>
</span></span><span style="display:flex;"><span>        segments <span style="color:#91d7e3;font-weight:bold">=</span> split_into_training_segments(content)
</span></span><span style="display:flex;"><span>        corpus_parts<span style="color:#91d7e3;font-weight:bold">.</span>extend(segments)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Create final corpus</span>
</span></span><span style="display:flex;"><span>    final_corpus <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n\n</span><span style="color:#a6da95">&#39;</span><span style="color:#91d7e3;font-weight:bold">.</span>join(corpus_parts)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Save to file</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">with</span> <span style="color:#91d7e3">open</span>(<span style="color:#a6da95">&#39;london_historical_corpus_comprehensive.txt&#39;</span>, <span style="color:#a6da95">&#39;w&#39;</span>, encoding<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#39;utf-8&#39;</span>) <span style="color:#c6a0f6">as</span> f:
</span></span><span style="display:flex;"><span>        f<span style="color:#91d7e3;font-weight:bold">.</span>write(final_corpus)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> final_corpus
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">split_into_training_segments</span>(text: <span style="color:#91d7e3">str</span>, max_length: <span style="color:#91d7e3">int</span> <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">2000</span>) <span style="color:#91d7e3;font-weight:bold">-&gt;</span> <span style="color:#91d7e3">list</span>:
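</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Split text into training segments while preserving narrative flow&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">import</span> <span style="color:#f5a97f">re</span>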
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># First split on double newlines (paragraphs)</span>
</span></span><span style="display:flex;"><span>    paragraphs <span style="color:#91d7e3;font-weight:bold">=</span> text<span style="color:#91d7e3;font-weight:bold">.</span>split(<span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n\n</span><span style="color:#a6da95">&#39;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    segments <span style="color:#91d7e3;font-weight:bold">=</span> []
</span></span><span style="display:flex;"><span>    current_segment <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> paragraph <span style="color:#91d7e3;font-weight:bold">in</span> paragraphs:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">len</span>(current_segment <span style="color:#91d7e3;font-weight:bold">+</span> paragraph) <span style="color:#91d7e3;font-weight:bold">&lt;=</span> max_length:
</span></span><span style="display:flex;"><span>            current_segment <span style="color:#91d7e3;font-weight:bold">+=</span> paragraph <span style="color:#91d7e3;font-weight:bold">+</span> <span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n\n</span><span style="color:#a6da95">&#39;</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">if</span> current_segment:
</span></span><span style="display:flex;"><span>                segments<span style="color:#91d7e3;font-weight:bold">.</span>append(current_segment<span style="color:#91d7e3;font-weight:bold">.</span>strip())
</span></span><span style="display:flex;"><span>            current_segment <span style="color:#91d7e3;font-weight:bold">=</span> paragraph <span style="color:#91d7e3;font-weight:bold">+</span> <span style="color:#a6da95">&#39;</span><span style="color:#8aadf4">\n\n</span><span style="color:#a6da95">&#39;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">if</span> current_segment:
</span></span><span style="display:flex;"><span>        segments<span style="color:#91d7e3;font-weight:bold">.</span>append(current_segment<span style="color:#91d7e3;font-weight:bold">.</span>strip())
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Further split long segments at sentence boundaries</span>
</span></span><span style="display:flex;"><span>    final_segments <span style="color:#91d7e3;font-weight:bold">=</span> []
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> segment <span style="color:#91d7e3;font-weight:bold">in</span> segments:
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">len</span>(segment) <span style="color:#91d7e3;font-weight:bold">&gt;</span> max_length:
</span></span><span style="display:flex;"><span>            sentences <span style="color:#91d7e3;font-weight:bold">=</span> re<span style="color:#91d7e3;font-weight:bold">.</span>split(<span style="color:#ed8796">r</span><span style="color:#a6da95">&#39;(?&lt;=[.!?])\s+&#39;</span>, segment)
</span></span><span style="display:flex;"><span>            current_segment <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#34;&#34;</span>
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">for</span> sentence <span style="color:#91d7e3;font-weight:bold">in</span> sentences:
</span></span><span style="display:flex;"><span>                <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">len</span>(current_segment <span style="color:#91d7e3;font-weight:bold">+</span> sentence) <span style="color:#91d7e3;font-weight:bold">&lt;=</span> max_length:
</span></span><span style="display:flex;"><span>                    current_segment <span style="color:#91d7e3;font-weight:bold">+=</span> sentence <span style="color:#91d7e3;font-weight:bold">+</span> <span style="color:#a6da95">&#34; &#34;</span>
</span></span><span style="display:flex;"><span>                <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>                    <span style="color:#c6a0f6">if</span> current_segment:
</span></span><span style="display:flex;"><span>                        final_segments<span style="color:#91d7e3;font-weight:bold">.</span>append(current_segment<span style="color:#91d7e3;font-weight:bold">.</span>strip())
</span></span><span style="display:flex;"><span>                    current_segment <span style="color:#91d7e3;font-weight:bold">=</span> sentence <span style="color:#91d7e3;font-weight:bold">+</span> <span style="color:#a6da95">&#34; &#34;</span>
</span></span><span style="display:flex;"><span>            <span style="color:#c6a0f6">if</span> current_segment:
</span></span><span style="display:flex;"><span>                final_segments<span style="color:#91d7e3;font-weight:bold">.</span>append(current_segment<span style="color:#91d7e3;font-weight:bold">.</span>strip())
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>            final_segments<span style="color:#91d7e3;font-weight:bold">.</span>append(segment)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Filter out segments that are too short</span>
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> [seg <span style="color:#c6a0f6">for</span> seg <span style="color:#91d7e3;font-weight:bold">in</span> final_segments <span style="color:#c6a0f6">if</span> <span style="color:#91d7e3">len</span>(seg) <span style="color:#91d7e3;font-weight:bold">&gt;=</span> <span style="color:#f5a97f">50</span>]</span></span></code></pre></div><figcaption>
        <strong>Listing 8: Corpus Creation and Segmentation Functions</strong>
    </figcaption>
</figure>
<p>During my local runs, this final processing stage generated a comprehensive corpus of over 500 million characters across ~250,000 segments, with an average segment length of around 2,000 characters. The success rate of files making it into the final corpus ranged from 70% to 90%, depending on the quality and availability of the source.</p>
<p><strong>Final Corpus Statistics:</strong></p>
<ul>
<li><strong>Total Sources Processed</strong>: 218+ historical sources</li>
<li><strong>Final Corpus Size</strong>: 500M+ characters</li>
<li><strong>Training Segments</strong>: ~250,000 segments</li>
<li><strong>Average Segment Length</strong>: ~2,000 characters</li>
<li><strong>Success Rate</strong>: 70-90% (depending on source availability)</li>
</ul>
<h3 id="23-detailed-data-processing-flow">2.3 Detailed Data Processing Flow</h3>
<p>Having reviewed each stage, we can now build on the high-level flow. The detailed flow below illustrates the complete data cleaning process, including rejection paths, error handling, and statistics tracking, and gives a bird&rsquo;s-eye view of the entire pipeline.</p>
<figure class="align-center " id="fig4">
    <pre class="mermaid">graph TD
    A[📁 Raw Files] --&gt; B{File Type Detection}
    
    B --&gt;|.txt, .txt.utf-8| C[📄 Text File]
    B --&gt;|.pdf| D[📄 PDF File]
    B --&gt;|.html, .htm| E[📄 HTML File]
    B --&gt;|.xml| F[📄 XML File]
    B --&gt;|No Extension| G{Content Detection}
    
    G --&gt;|HTML-like| E
    G --&gt;|Text-like| C
    G --&gt;|Binary/Unknown| REJECT1[❌ REJECTED]
    
    C --&gt; H[🧹 clean_gutenberg_text]
    D --&gt; I[🔧 extract_text_from_pdf]
    E --&gt; J[🧹 clean_html_text]
    F --&gt; K{XML Type Detection}
    
    I --&gt; L[🧹 clean_pdf_text]
    
    K --&gt;|Old Bailey| M[🔧 extract_old_bailey_text]
    K --&gt;|London Lives| N[🔧 extract_london_lives_text]
    
    M --&gt; O[🧹 clean_old_bailey_text]
    N --&gt; P[🧹 clean_london_lives_text]
    
    H --&gt; Q[🔧 normalize_text]
    L --&gt; Q
    J --&gt; Q
    O --&gt; Q
    P --&gt; Q
    
    Q --&gt; R[🔍 Duplicate Detection]
    R --&gt;|Duplicate| REJECT2[❌ REJECTED - Duplicate]
    R --&gt;|Unique| S[🌍 Language Detection]
    
    S --&gt;|Non-English| REJECT3[❌ REJECTED - Non-English]
    S --&gt;|English| T[📊 Quality Analysis]
    
    T --&gt; U{Quality Check}
    U --&gt;|Poor Quality| REJECT4[❌ REJECTED - Poor Quality]
    U --&gt;|Good Quality| V[💾 Save to Processed Directory]
    
    V --&gt; W[📊 Update Statistics]
    W --&gt; X[✅ Successfully Processed]
    
    REJECT1 --&gt; Y[📝 Log Rejection Reason]
    REJECT2 --&gt; Y
    REJECT3 --&gt; Y
    REJECT4 --&gt; Y
    
    Y --&gt; Z[📊 Update Rejection Stats]
    
    style A fill:#e1f5fe
    style X fill:#c8e6c9
    style REJECT1 fill:#ffcdd2
    style REJECT2 fill:#ffcdd2
    style REJECT3 fill:#ffcdd2
    style REJECT4 fill:#ffcdd2
    style Y fill:#fff3e0
    style Z fill:#fff3e0</pre>
    <figcaption>Figure 4: Detailed Data Processing Pipeline</figcaption>
</figure>
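<p>As a rough illustration of the first decision in Figure 4, the sketch below routes a file by extension and falls back to content sniffing when there is no extension. The function names (<code>detect_file_type</code>, <code>sniff_content</code>), the null-byte and UTF-8 heuristics, and the choice to reject unlisted extensions are assumptions made for this example, not the repository&rsquo;s actual implementation.</p>

```python
import codecs
from pathlib import Path

# "\x3c" is the "<" character, written as an escape so it survives HTML embedding.
HTML_PREFIXES = (b"\x3c!doctype html", b"\x3chtml")


def sniff_content(head):
    """Classify an extension-less file from its first bytes (hypothetical heuristic)."""
    if not head or b"\x00" in head:
        return None  # empty or binary: rejected
    if head.lstrip().lower().startswith(HTML_PREFIXES):
        return "html"
    try:
        # Incremental decoding tolerates a multi-byte character cut off at the end.
        codecs.getincrementaldecoder("utf-8")().decode(head)
    except UnicodeDecodeError:
        return None  # not valid UTF-8 text: rejected
    return "text"


def detect_file_type(path, head=b""):
    """Return 'text', 'pdf', 'html', 'xml', or None when the file is rejected."""
    name = Path(path).name.lower()
    if name.endswith((".txt", ".txt.utf-8")):
        return "text"
    if name.endswith(".pdf"):
        return "pdf"
    if name.endswith((".html", ".htm")):
        return "html"
    if name.endswith(".xml"):
        return "xml"
    if "." not in name:
        return sniff_content(head)
    return None  # assumption: extensions not listed in the diagram are rejected


print(detect_file_type("pg1234.txt.utf-8"))
print(detect_file_type("notes", b"Quoth the alderman"))
```

<p>Each returned type would then be handed to the matching cleaner from the diagram, with <code>None</code> feeding the rejection log.</p>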
<h3 id="25-corpus-creation-process">2.5 Corpus Creation Process</h3>
<p>After cleaning, the system creates the final training corpus through intelligent segmentation that preserves historical narrative flow:</p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>📁 Cleaned Files
</span></span><span style="display:flex;"><span>    ↓
</span></span><span style="display:flex;"><span>🔧 create_comprehensive_corpus()
</span></span><span style="display:flex;"><span>    ├── Read all cleaned_*.txt files
</span></span><span style="display:flex;"><span>    ├── Split into training segments (split_into_training_segments)
</span></span><span style="display:flex;"><span>    │   ├── Split on double newlines (paragraphs)
</span></span><span style="display:flex;"><span>    │   ├── Max length: 2000 characters
</span></span><span style="display:flex;"><span>    │   ├── Min length: 100 characters
</span></span><span style="display:flex;"><span>    │   └── Further split long segments at sentence boundaries
</span></span><span style="display:flex;"><span>    ├── Filter segments (min 50 characters)
</span></span><span style="display:flex;"><span>    └── Write to london_historical_corpus_comprehensive.txt</span></span></code></pre></div>
<p>The corpus creation process reads all cleaned text files and intelligently segments them into training-ready chunks. It first splits on double newlines to preserve paragraph boundaries, which are natural break points in historical text. Segments are constrained to a maximum of 2000 characters to fit within the model&rsquo;s context window, with a minimum of 100 characters to ensure substantial content. Long segments are further split at sentence boundaries to maintain readability. Finally, segments shorter than 50 characters are filtered out as they&rsquo;re unlikely to contain meaningful historical content.</p>
<p>Proper segmentation is crucial for training language models. The model needs to learn from coherent text segments that maintain historical narrative flow while fitting within its context window. Splitting on paragraph boundaries preserves the natural structure of historical documents, while sentence-level splitting ensures that very long paragraphs don&rsquo;t exceed the model&rsquo;s processing capabilities. This approach maximizes the model&rsquo;s ability to learn from authentic historical language patterns while maintaining training efficiency.</p>
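<p>The segmentation rules above can be sketched in a few lines of Python. This is a minimal illustration, not the repository&rsquo;s <code>split_into_training_segments</code>: the function names, the sentence-boundary regex, and the choice to merge paragraphs shorter than 100 characters into the following paragraph are all assumptions.</p>

```python
import re

MAX_LEN = 2000    # maximum segment length in characters
MIN_LEN = 100     # preferred minimum segment length
FILTER_LEN = 50   # segments shorter than this are dropped at the end


def split_long_paragraph(paragraph, max_len=MAX_LEN):
    """Split an over-long paragraph at sentence boundaries.

    A single sentence longer than max_len is kept whole in this sketch.
    """
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_len or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def split_into_segments(text):
    """Paragraph-first segmentation with a sentence-level fallback."""
    segments, buffer = [], ""
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = " ".join(paragraph.split())  # collapse internal whitespace
        if not paragraph:
            continue
        candidate = f"{buffer}\n\n{paragraph}" if buffer else paragraph
        if len(candidate) > MAX_LEN:
            # Flush any short leftover, then split the new paragraph on its own.
            if buffer:
                segments.append(buffer)
            pieces = split_long_paragraph(paragraph)
            segments.extend(pieces[:-1])
            buffer = pieces[-1]
        else:
            buffer = candidate
        if len(buffer) >= MIN_LEN:
            segments.append(buffer)
            buffer = ""
    if buffer:
        segments.append(buffer)
    # Final filter: drop fragments too short to carry meaningful content.
    return [s for s in segments if len(s) >= FILTER_LEN]


sample = "Short heading.\n\n" + " ".join(["A sentence of text here."] * 10)
print([len(s) for s in split_into_segments(sample)])
```

<p>Splitting on paragraphs first means the sentence-level fallback only runs for the rare paragraph that exceeds the limit, so most segments keep their original boundaries.</p>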
<h3 id="26-outcome-training-ready-corpus">2.6 Outcome: Training-Ready Corpus</h3>
<p>The result is a <strong>clean, historically faithful corpus</strong> containing over 500 million characters of authentic historical English spanning 350 years of London history from 1500-1850. The corpus comprises high-quality text with minimal OCR artifacts, preserving historical language patterns and a rich cultural context that reflects the social, political, and economic realities of various historical periods. The text has been intelligently segmented for optimal language model training, with careful attention to maintaining the natural flow of historical narratives while ensuring compatibility with modern training techniques.</p>
<p>This corpus serves as the essential foundation for training our specialized historical tokenizer and language model, ensuring the model learns authentic historical English rather than modern text patterns. By providing the model with genuine examples of how people wrote and spoke during different historical periods, we enable it to generate text that captures the linguistic nuances, cultural references, and historical context that make historical language modeling both challenging and rewarding.</p>
<p><strong>💻 Try It Yourself:</strong> The complete implementation, including all the data collection scripts, cleaning algorithms, and quality validation systems described in this section, is available in the <a
	
		href = "https://github.com/bahree/helloLondon"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		helloLondon GitHub repository
	</span>
</a>. The repository includes detailed documentation, example usage, and step-by-step guides for setting up your own historical language model training pipeline.</p>
<p>Now that we have examined the data collection and cleaning process, we can proceed to the next steps: creating a custom historical tokenizer and preparing for model training.</p>
<h2 id="3-custom-historical-tokenizer-the-key-to-authentic-historical-text-generation">3. Custom Historical Tokenizer: The Key to Authentic Historical Text Generation</h2>
<p>Creating a custom tokenizer is crucial for generating effective historical text. This section examines the necessity of a custom tokenizer, the challenges presented by historical language, and our chosen architecture. The tokenizer preserves the semantic meaning of historical words and phrases, enabling coherent and contextually accurate historical narratives.</p>
<p>Standard tokenizers such as GPT-2&rsquo;s split archaic words like &ldquo;quoth&rdquo; and &ldquo;hast&rdquo; into multiple subword tokens, obscuring the semantic meaning that historical text generation depends on.</p>
<p><strong>Real Example - Standard Tokenizer vs. Our Custom Tokenizer:</strong></p>
<p><strong>Standard GPT-2 Tokenizer:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>&#34;Quoth the alderman, &#39;Tis a fair day at Newgate&#34;
</span></span><span style="display:flex;"><span>→ [&#39;Qu&#39;, &#39;oth&#39;, &#39; the&#39;, &#39; ald&#39;, &#39;erman&#39;, &#39;,&#39;, &#39; &#39;, &#39;&#39;&#39;, &#39;T&#39;, &#39;is&#39;, &#39; a&#39;, &#39; fair&#39;, &#39; day&#39;, &#39; at&#39;, &#39; New&#39;, &#39;gate&#39;]</span></span></code></pre></div>
<p><strong>Our Custom Historical Tokenizer:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>&#34;Quoth the alderman, &#39;Tis a fair day at Newgate&#34;
</span></span><span style="display:flex;"><span>→ [&#39;&lt;|quoth|&gt;&#39;, &#39; the&#39;, &#39; alderman&#39;, &#39;,&#39;, &#39; &#39;, &#39;&#39;&#39;, &#39;&lt;|tis|&gt;&#39;, &#39; a&#39;, &#39; fair&#39;, &#39; day&#39;, &#39; at&#39;, &#39; &lt;|newgate|&gt;&#39;]</span></span></code></pre></div>
<p>The standard tokenizer breaks this sentence into 16 fragments, several of which carry no meaning on their own, losing semantic meaning and historical context. Our custom tokenizer reduces this to 12 meaningful tokens, preserving authentic historical language patterns essential for coherent text generation.</p>
<p>A tokenizer that fragments historical language destroys the model&rsquo;s ability to learn authentic patterns. The model needs to perceive &ldquo;quoth&rdquo; as a single concept, rather than fragmented subwords, to capture the linguistic nuances of different historical periods.</p>
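<p>One simple way to quantify fragmentation is to count tokens per whitespace-separated word. The sketch below does this for the two token lists shown above, which are copied by hand rather than produced by running either tokenizer; <code>tokens_per_word</code> is a hypothetical helper introduced for this example.</p>

```python
sentence = "Quoth the alderman, 'Tis a fair day at Newgate"

# Token lists transcribed from the two examples above (not generated here).
gpt2_tokens = ["Qu", "oth", " the", " ald", "erman", ",", " ", "'", "T", "is",
               " a", " fair", " day", " at", " New", "gate"]
custom_tokens = ["<|quoth|>", " the", " alderman", ",", " ", "'", "<|tis|>",
                 " a", " fair", " day", " at", " <|newgate|>"]


def tokens_per_word(text, tokens):
    """Average number of tokens spent on each whitespace-separated word."""
    return len(tokens) / len(text.split())


for name, tokens in [("standard", gpt2_tokens), ("custom", custom_tokens)]:
    print(f"{name}: {len(tokens)} tokens, "
          f"{tokens_per_word(sentence, tokens):.2f} tokens per word")
```

<p>A lower ratio means fewer pieces per word, which is the property the custom tokenizer is designed to improve on historical vocabulary.</p>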
<h3 id="31-what-happens-with-off-the-shelf-tokenizers">3.1 What Happens with Off-the-Shelf Tokenizers</h3>
<p>What would happen if we used standard tokenizers like tiktoken or GPT-2&rsquo;s tokenizer?</p>
<p>Standard tokenizers would force the model to waste capacity reconstructing fragmented historical words from subwords rather than learning historical language patterns. The model might learn to generate &ldquo;Qu&rdquo; + &ldquo;oth&rdquo; but struggle to use &ldquo;quoth&rdquo; in new contexts. Historical phrases like &ldquo;methinks&rdquo; would split into meaningless fragments, losing semantic coherence. London geography becomes particularly problematic, as place names like &ldquo;Newgate&rdquo; fragment, making spatial relationships harder to understand.</p>
<p><strong>Generation Quality Issues:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># What you&#39;d get with standard tokenizer:</span>
</span></span><span style="display:flex;"><span><span style="color:#a6da95">&#34;Quoth the alderman, &#39;Tis a fair day at Newgate&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">→</span> Generates: <span style="color:#a6da95">&#34;Qu oth the ald erman, &#39;T is a fair day at New gate&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">→</span> Result: Broken, unreadable historical text
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># What you get with our custom tokenizer:</span>
</span></span><span style="display:flex;"><span><span style="color:#a6da95">&#34;Quoth the alderman, &#39;Tis a fair day at Newgate&#34;</span>  
</span></span><span style="display:flex;"><span><span style="color:#ed8796">→</span> Generates: <span style="color:#a6da95">&#34;Quoth the alderman, &#39;Tis a fair day at Newgate&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">→</span> Result: Authentic, coherent historical text</span></span></code></pre></div>
<p>A vocabulary that&rsquo;s too small (10K tokens) would fragment even more historical words, making the problem worse, while a vocabulary that&rsquo;s too large (100K+ tokens) would overfit to rare historical terms, wasting capacity on words that appear only once. Our choice of 30K tokens provides a balanced approach that captures common historical patterns without overfitting, ensuring the model learns the most important historical language patterns efficiently.</p>
<p><strong>Real-World Example:</strong>
With a standard tokenizer, our model might generate:</p>
<blockquote>
<p><code>&quot;The ald erman walk ed to New gate where he saw the pris oner&quot;</code></p></blockquote>
<p>With our custom tokenizer, it generates:</p>
<blockquote>
<p><code>&quot;The alderman walked to Newgate where he saw the prisoner&quot;</code></p></blockquote>
<p>The difference in authenticity between the two outputs is significant.</p>
<h3 id="32-tokenizer-architecture">3.2 Tokenizer Architecture</h3>
<p>I had started with the easier WordPiece tokenizer (more by accident than by design), but realized later that it was unsuitable for historical text generation due to the <code>##</code> subword prefix artifacts. We need a tokenizer that can handle historical English efficiently while preserving semantic meaning, unlike off-the-shelf tokenizers, which are trained on modern text and therefore fragment historical language, destroying the linguistic patterns we want to preserve. (WordPiece is the approach used by BERT; GPT-2 uses byte-level BPE.) After some experimentation, I settled on a custom Byte Pair Encoding (BPE) tokenizer trained specifically on historical English.</p>
<p>BPE is a subword tokenization algorithm that learns to break text into meaningful subword units by iteratively finding the most frequent character pairs in the training corpus and merging them into single tokens. The process begins with individual characters and gradually evolves into common words and phrases.</p>
<p>For example, if <code>&quot;th&quot;</code> appears frequently in our historical corpus, BPE will learn to treat it as a single token rather than separate <code>&quot;t&quot;</code> and <code>&quot;h&quot;</code> tokens. This is particularly valuable for historical English, where words like <code>&quot;thou&quot;</code>, <code>&quot;thee&quot;</code>, and <code>&quot;thine&quot;</code> share common prefixes and suffixes.</p>
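<p>To make the merge loop concrete, here is a toy BPE learner in plain Python. It is a teaching sketch and not how the <code>tokenizers</code> library is implemented: it works on a tiny word list, ignores byte-level details, and breaks ties by comparing the pairs themselves (an assumption made so the result is deterministic). The function names are hypothetical.</p>

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    word_freqs = dict(Counter(tuple(word) for word in words))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        # Most frequent pair wins; the pair itself is the tie-breaker.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs


merges, segmented = learn_bpe(["thou", "thee", "thine", "thy", "the", "that"], 2)
print(merges)
print(list(segmented))
```

<p>On this word list the first merge is <code>t</code> + <code>h</code>, because that pair occurs in all six words. A real trainer repeats the same loop over the whole corpus until the target vocabulary size is reached.</p>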
<h4 id="321-tokenizer-training-process">3.2.1 Tokenizer Training Process</h4>
<p>The BPE training algorithm analyzes our entire historical corpus to identify the most frequent character combinations, building a vocabulary that&rsquo;s optimized for historical language patterns. We start with a base alphabet (all upper- and lowercase letters) and special tokens, then iteratively merge the most frequent pairs until we reach our target vocabulary size of 30,000 tokens. This makes it likely that common historical words, such as <code>&quot;quoth&quot;</code>, <code>&quot;hast&quot;</code>, and <code>&quot;methinks&quot;</code>, end up as single tokens, while rare or unknown words can still be handled by breaking them into learned subword units.</p>
<p>The training process is computationally efficient and produces a tokenizer that&rsquo;s specifically tuned to the linguistic patterns found in our historical corpus.</p>
<p>In this case, we don&rsquo;t have to reinvent the wheel: we use the Hugging Face <code>tokenizers</code> library, which provides a modular approach to building custom tokenizers. The library is organized into several key components: <code>models</code> define the core tokenization algorithm (BPE, WordPiece, Unigram), <code>pre_tokenizers</code> handle initial text splitting, <code>normalizers</code> clean and standardize text, <code>trainers</code> configure the learning process, and <code>processors</code> handle special token insertion. This modular design enables us to mix and match components to create a tokenizer tailored to our specific use case.</p>
<p>The <code>models</code> module offers several tokenization algorithms: <code>BPE()</code> for Byte Pair Encoding (what we use), <code>WordPiece()</code> for Google&rsquo;s WordPiece algorithm, <code>Unigram()</code> for Google&rsquo;s Unigram language model, and <code>WordLevel()</code> for simple word-level tokenization.</p>
<p>Each has different strengths - BPE is efficient and handles unknown words well, WordPiece is used by BERT but creates <code>##</code> artifacts, Unigram is more flexible but computationally expensive, and WordLevel is simple but creates very large vocabularies.</p>
<p>Let us look at the code in <a href="#listing9" class="listing-ref">Listing 9</a> for training our custom historical BPE tokenizer:</p>
<figure id="listing9"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">train_tokenizer</span>(<span style="color:#91d7e3">self</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Train a custom tokenizer for historical English&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Import the tokenizers library components</span>
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">from</span> <span style="color:#f5a97f">tokenizers</span> <span style="color:#8bd5ca">import</span> Tokenizer, models, pre_tokenizers, processors, trainers
</span></span><span style="display:flex;"><span>    <span style="color:#8bd5ca">from</span> <span style="color:#f5a97f">tokenizers.normalizers</span> <span style="color:#8bd5ca">import</span> Sequence, NFD, StripAccents
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#a6da95">&#34;Training custom historical tokenizer...&#34;</span>)
</span></span><span style="display:flex;"><span>    logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Corpus: </span><span style="color:#a6da95">{</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>corpus_path<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>    logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Target vocabulary: </span><span style="color:#a6da95">{</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>vocab_size<span style="color:#a6da95">:</span><span style="color:#a6da95">,</span><span style="color:#a6da95">}</span><span style="color:#a6da95"> tokens&#34;</span>)
</span></span><span style="display:flex;"><span>    logger<span style="color:#91d7e3;font-weight:bold">.</span>info(<span style="color:#ed8796">f</span><span style="color:#a6da95">&#34;Output directory: </span><span style="color:#a6da95">{</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>output_dir<span style="color:#a6da95">}</span><span style="color:#a6da95">&#34;</span>)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Initialize BPE tokenizer (not WordPiece)</span>
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># models.BPE() creates a Byte Pair Encoding model that will learn subword patterns</span>
</span></span><span style="display:flex;"><span>    tokenizer <span style="color:#91d7e3;font-weight:bold">=</span> Tokenizer(models<span style="color:#91d7e3;font-weight:bold">.</span>BPE())
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Normalizers for historical text - preserve case for better text reconstruction</span>
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Normalizers clean and standardize text before tokenization</span>
</span></span><span style="display:flex;"><span>    tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>normalizer <span style="color:#91d7e3;font-weight:bold">=</span> Sequence([
</span></span><span style="display:flex;"><span>        NFD(),           <span style="color:#6e738d;font-style:italic"># Unicode normalization - converts characters to canonical form</span>
</span></span><span style="display:flex;"><span>        StripAccents()   <span style="color:#6e738d;font-style:italic"># Remove accents - converts &#34;café&#34; to &#34;cafe&#34;</span>
</span></span><span style="display:flex;"><span>    ])
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Pre-tokenizer for historical English - use simple whitespace splitting</span>
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Pre-tokenizers split text into initial segments before the main tokenization</span>
</span></span><span style="display:flex;"><span>    tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>pre_tokenizer <span style="color:#91d7e3;font-weight:bold">=</span> pre_tokenizers<span style="color:#91d7e3;font-weight:bold">.</span>Sequence([
</span></span><span style="display:flex;"><span>        pre_tokenizers<span style="color:#91d7e3;font-weight:bold">.</span>WhitespaceSplit(),  <span style="color:#6e738d;font-style:italic"># Split on whitespace</span>
</span></span><span style="display:flex;"><span>        pre_tokenizers<span style="color:#91d7e3;font-weight:bold">.</span>Punctuation()       <span style="color:#6e738d;font-style:italic"># Split punctuation from words</span>
</span></span><span style="display:flex;"><span>    ])
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Special tokens for historical English</span>
</span></span><span style="display:flex;"><span>    special_tokens <span style="color:#91d7e3;font-weight:bold">=</span> [
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|startoftext|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|endoftext|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|pad|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|unk|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|mask|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Historical language tokens</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|thou|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|thee|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|thy|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|thine|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|hast|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|hath|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|doth|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|dost|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|quoth|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|tis|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|twas|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|twill|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># London geography tokens</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|london|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|thames|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|westminster|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|tower|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|newgate|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|southwark|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|cheapside|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|fleet|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|ludgate|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|aldgate|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Historical period tokens</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|tudor|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|stuart|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|georgian|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|regency|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|victorian|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Social class tokens</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|noble|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|gentleman|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|commoner|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|apprentice|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|yeoman|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Professional tokens</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|apothecary|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|coachman|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|chimneysweep|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|baker|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|butcher|&gt;&#34;</span>
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># BPE trainer configuration</span>
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># The trainer defines how the BPE algorithm learns from our corpus</span>
</span></span><span style="display:flex;"><span>    trainer <span style="color:#91d7e3;font-weight:bold">=</span> trainers<span style="color:#91d7e3;font-weight:bold">.</span>BpeTrainer(
</span></span><span style="display:flex;"><span>        vocab_size<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>vocab_size,        <span style="color:#6e738d;font-style:italic"># Target vocabulary size (30,000 tokens) - balanced between coverage and efficiency</span>
</span></span><span style="display:flex;"><span>        special_tokens<span style="color:#91d7e3;font-weight:bold">=</span>special_tokens,     <span style="color:#6e738d;font-style:italic"># Pre-defined tokens that are always included</span>
</span></span><span style="display:flex;"><span>        min_frequency<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">2</span>,                   <span style="color:#6e738d;font-style:italic"># Minimum frequency prevents vocabulary pollution from OCR errors</span>
</span></span><span style="display:flex;"><span>        show_progress<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>,                <span style="color:#6e738d;font-style:italic"># Display training progress</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Removed continuing_subword_prefix=&#34;##&#34; to eliminate WordPiece-style artifacts</span>
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># This ensures pure BPE tokenization without ## symbols in generated text</span>
</span></span><span style="display:flex;"><span>        initial_alphabet<span style="color:#91d7e3;font-weight:bold">=</span>[<span style="color:#a6da95">&#34;a&#34;</span>, <span style="color:#a6da95">&#34;b&#34;</span>, <span style="color:#a6da95">&#34;c&#34;</span>, <span style="color:#a6da95">&#34;d&#34;</span>, <span style="color:#a6da95">&#34;e&#34;</span>, <span style="color:#a6da95">&#34;f&#34;</span>, <span style="color:#a6da95">&#34;g&#34;</span>, <span style="color:#a6da95">&#34;h&#34;</span>, <span style="color:#a6da95">&#34;i&#34;</span>, <span style="color:#a6da95">&#34;j&#34;</span>, <span style="color:#a6da95">&#34;k&#34;</span>, <span style="color:#a6da95">&#34;l&#34;</span>, <span style="color:#a6da95">&#34;m&#34;</span>, 
</span></span><span style="display:flex;"><span>                        <span style="color:#a6da95">&#34;n&#34;</span>, <span style="color:#a6da95">&#34;o&#34;</span>, <span style="color:#a6da95">&#34;p&#34;</span>, <span style="color:#a6da95">&#34;q&#34;</span>, <span style="color:#a6da95">&#34;r&#34;</span>, <span style="color:#a6da95">&#34;s&#34;</span>, <span style="color:#a6da95">&#34;t&#34;</span>, <span style="color:#a6da95">&#34;u&#34;</span>, <span style="color:#a6da95">&#34;v&#34;</span>, <span style="color:#a6da95">&#34;w&#34;</span>, <span style="color:#a6da95">&#34;x&#34;</span>, <span style="color:#a6da95">&#34;y&#34;</span>, <span style="color:#a6da95">&#34;z&#34;</span>,
</span></span><span style="display:flex;"><span>                        <span style="color:#a6da95">&#34;A&#34;</span>, <span style="color:#a6da95">&#34;B&#34;</span>, <span style="color:#a6da95">&#34;C&#34;</span>, <span style="color:#a6da95">&#34;D&#34;</span>, <span style="color:#a6da95">&#34;E&#34;</span>, <span style="color:#a6da95">&#34;F&#34;</span>, <span style="color:#a6da95">&#34;G&#34;</span>, <span style="color:#a6da95">&#34;H&#34;</span>, <span style="color:#a6da95">&#34;I&#34;</span>, <span style="color:#a6da95">&#34;J&#34;</span>, <span style="color:#a6da95">&#34;K&#34;</span>, <span style="color:#a6da95">&#34;L&#34;</span>, <span style="color:#a6da95">&#34;M&#34;</span>,
</span></span><span style="display:flex;"><span>                        <span style="color:#a6da95">&#34;N&#34;</span>, <span style="color:#a6da95">&#34;O&#34;</span>, <span style="color:#a6da95">&#34;P&#34;</span>, <span style="color:#a6da95">&#34;Q&#34;</span>, <span style="color:#a6da95">&#34;R&#34;</span>, <span style="color:#a6da95">&#34;S&#34;</span>, <span style="color:#a6da95">&#34;T&#34;</span>, <span style="color:#a6da95">&#34;U&#34;</span>, <span style="color:#a6da95">&#34;V&#34;</span>, <span style="color:#a6da95">&#34;W&#34;</span>, <span style="color:#a6da95">&#34;X&#34;</span>, <span style="color:#a6da95">&#34;Y&#34;</span>, <span style="color:#a6da95">&#34;Z&#34;</span>]
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Train the tokenizer on our historical corpus</span>
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># This is where the BPE algorithm learns the optimal subword patterns</span>
</span></span><span style="display:flex;"><span>    tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>train([<span style="color:#91d7e3">str</span>(<span style="color:#91d7e3">self</span><span style="color:#91d7e3;font-weight:bold">.</span>corpus_path)], trainer)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> tokenizer</span></span></code></pre></div><figcaption>
        <strong>Listing 9: Custom Tokenizer Training Function</strong>
    </figcaption>
</figure>
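<p>To see what the <code>NFD</code> and <code>StripAccents</code> steps in Listing 9 do to a string, the following standard-library sketch approximates them with <code>unicodedata</code>. It is an illustration only, assuming accent stripping amounts to removing nonspacing marks (Unicode category <code>Mn</code>) after decomposition; <code>strip_accents</code> is a hypothetical helper, not the library&rsquo;s code.</p>

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD, then drop nonspacing marks; case is left untouched."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


for sample in ["café", "Thou art naïve", "Newgate"]:
    print(repr(sample), "->", repr(strip_accents(sample)))
```

<p>Note that capitalization passes through unchanged, which is consistent with the case-preserving choice made in the listing.</p>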
<h4 id="322-tokenization-architecture-decisions">3.2.2 Tokenization Architecture Decisions</h4>
<p>Our custom historical tokenizer necessitated several critical design decisions to handle historical English effectively. We evaluated multiple tokenization approaches including <strong>Byte Pair Encoding (BPE)</strong> (<a
	
		href = "https://arxiv.org/abs/1508.07909"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Sennrich et al., 2016
	</span>
</a>), <strong>WordPiece</strong> (<a
	
		href = "https://research.google/pubs/pub37842/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Schuster &amp; Nakajima, 2012
	</span>
</a>), <strong>Unigram Language Model</strong> (<a
	
		href = "https://arxiv.org/abs/1804.10959"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Kudo, 2018
	</span>
</a>), <strong>SentencePiece</strong> (<a
	
		href = "https://arxiv.org/abs/1808.06226"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Kudo &amp; Richardson, 2018
	</span>
</a>), and traditional character-level and word-level tokenization. Each approach has distinct trade-offs: BPE produces clean subwords without special markers (used by GPT models), WordPiece adds <code>##</code> prefixes that can leak into generated text (used by BERT), Unigram uses probabilistic modeling but is computationally expensive, SentencePiece treats text as a raw character stream with no language-specific pre-tokenization and excels at multilingual scenarios, while character-level and word-level tokenization produce impractically long sequences and massive vocabularies, respectively.</p>
<p>For historical text generation, BPE offers a good balance of clean output, efficient training, and effective vocabulary coverage, in line with <a
	
		href = "https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Radford et al., 2019
	</span>
</a> and <a
	
		href = "https://arxiv.org/abs/2112.10508"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Mielke et al., 2021
	</span>
</a>. We also preserve case throughout tokenization, since historical text often uses capitalization for semantic meaning (e.g., &ldquo;Thou&rdquo; vs. &ldquo;thou&rdquo;), and include over 150 carefully designed special tokens that capture historical language patterns, London geography, and social context. This combination ensures our tokenizer can effectively learn and generate authentic historical language while maintaining computational efficiency.</p>
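<p>To make the BPE choice concrete, here is a toy sketch of how merge rules are learned: start from characters and repeatedly merge the most frequent adjacent pair. This is not the Hugging Face <code>tokenizers</code> implementation used in Listing 9; the function names <code>learn_bpe_merges</code> and <code>apply_bpe</code> and the tiny word list are hypothetical and exist only for illustration.</p>

```python
from collections import Counter


def _merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word starts as a tuple of single characters with a frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the new merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(_merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = _merge_pair(symbols, pair)
    return symbols


# Hypothetical mini-corpus of archaic forms, for illustration only.
corpus = ["thou", "thee", "thy", "thine", "hath", "doth", "the", "the"]
merges = learn_bpe_merges(corpus, num_merges=2)
print(merges)                       # [('t', 'h'), ('th', 'e')]
print(apply_bpe("thence", merges))  # ['the', 'n', 'c', 'e']
```

<p>Note that the resulting subwords carry no continuation markers, which is the "clean output" property discussed above. A real tokenizer additionally handles whitespace, bytes, and special tokens.</p>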
<h3 id="33-special-token-design-capturing-historical-language-patterns">3.3 Special Token Design: Capturing Historical Language Patterns</h3>
<p>Historical English contains linguistic patterns, vocabulary, and cultural references that are no longer present in modern English. Standard tokenizers fragment these patterns, destroying the semantic meaning crucial for historical text generation. The solution here was to design 150 special tokens that capture the essence of historical English, organized into strategic categories that reflect the linguistic and cultural structure of 1500-1850 English.</p>
<figure id="listing10"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">create_special_tokens</span>() <span style="color:#91d7e3;font-weight:bold">-&gt;</span> <span style="color:#91d7e3">list</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Create special tokens for historical English&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    special_tokens <span style="color:#91d7e3;font-weight:bold">=</span> [
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Basic control tokens</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|startoftext|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|endoftext|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|pad|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|unk|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|mask|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Historical language tokens</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|thou|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|thee|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|thy|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|thine|&gt;&#34;</span>,  <span style="color:#6e738d;font-style:italic"># Second person pronouns</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|hast|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|hath|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|doth|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|dost|&gt;&#34;</span>,  <span style="color:#6e738d;font-style:italic"># Archaic verb forms</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|quoth|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|tis|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|twas|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|twill|&gt;&#34;</span>, <span style="color:#6e738d;font-style:italic"># Archaic verb and common contractions</span>
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># London geography tokens</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|london|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|thames|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|westminster|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|tower|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|newgate|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|southwark|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|cheapside|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|fleet|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|ludgate|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|aldgate|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Historical period tokens</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|tudor|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|stuart|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|georgian|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|regency|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|victorian|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Social and professional tokens</span>
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|noble|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|gentleman|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|commoner|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|apothecary|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|coachman|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;&lt;|merchant|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|court|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|jury|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|verdict|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|church|&gt;&#34;</span>, <span style="color:#a6da95">&#34;&lt;|parish|&gt;&#34;</span>
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> special_tokens</span></span></code></pre></div><figcaption>
        <strong>Listing 10: Special Tokens Creation Function</strong>
    </figcaption>
</figure>
<h4 id="331-token-category-analysis">3.3.1 Token Category Analysis</h4>
<p>Our special token vocabulary spans ten curated categories, each designed to capture an essential aspect of historical London life. The largest categories are <strong>Historical Language</strong> (25 tokens) and <strong>London Geography</strong> (20 tokens), which provide the linguistic and spatial foundation for authentic historical text generation. These tokens capture archaic pronouns like <code>&quot;thou&quot;</code> and <code>&quot;thee&quot;</code>, along with specific London locations like <code>&quot;Thames&quot;</code> and <code>&quot;Newgate&quot;</code> that were central to historical narratives.</p>
<p>The remaining categories address the social, professional, and cultural dimensions of historical society. <strong>Social Class</strong> and <strong>Professional</strong> tokens (35 tokens combined) reflect the highly stratified nature of historical London, enabling accurate dialogue between nobles, commoners, and various tradespeople. <strong>Legal and Judicial</strong> tokens support court proceedings from the Old Bailey, while <strong>Religious</strong> tokens capture the central role of faith in historical society. <strong>Temporal</strong>, <strong>Currency</strong>, and <strong>Transportation</strong> tokens (35 tokens combined) provide the temporal, economic, and logistical context that makes historical narratives authentic and believable.</p>
<h4 id="332-special-token-categories-visualization">3.3.2 Special Token Categories Visualization</h4>
<p>Let us visualize the special token categories and their relationships as shown below. These special tokens enable the model to understand and generate authentic historical language. Without them, the model would fragment historical concepts into meaningless subwords, losing the cultural and linguistic context that makes historical text generation both challenging and rewarding.</p>
<figure class="align-center " id="fig5">
    <pre class="mermaid">graph LR
    A[🔤 Special Tokens&lt;br/&gt;150+ Total] --&gt; B[📜 Historical Language&lt;br/&gt;25 tokens]
    A --&gt; C[🏛️ London Geography&lt;br/&gt;20 tokens]
    A --&gt; D[⏰ Historical Periods&lt;br/&gt;10 tokens]
    A --&gt; E[👥 Social Classes&lt;br/&gt;15 tokens]
    A --&gt; F[💼 Professions&lt;br/&gt;20 tokens]
    A --&gt; G[⚖️ Legal &amp; Judicial&lt;br/&gt;10 tokens]
    A --&gt; H[⛪ Religious&lt;br/&gt;10 tokens]
    A --&gt; I[🕐 Temporal&lt;br/&gt;15 tokens]
    A --&gt; J[💰 Currency &amp; Measurement&lt;br/&gt;10 tokens]
    A --&gt; K[🚗 Transportation&lt;br/&gt;10 tokens]

    B --&gt; B1[&#34;&lt;|thou|&gt;, &lt;|thee|&gt;, &lt;|hast|&gt;, &lt;|doth|&gt;, &lt;|quoth|&gt;&#34;]
    C --&gt; C1[&#34;&lt;|london|&gt;, &lt;|thames|&gt;, &lt;|newgate|&gt;, &lt;|westminster|&gt;&#34;]
    D --&gt; D1[&#34;&lt;|tudor|&gt;, &lt;|stuart|&gt;, &lt;|georgian|&gt;, &lt;|regency|&gt;&#34;]
    E --&gt; E1[&#34;&lt;|noble|&gt;, &lt;|gentleman|&gt;, &lt;|commoner|&gt;, &lt;|yeoman|&gt;&#34;]
    F --&gt; F1[&#34;&lt;|apothecary|&gt;, &lt;|coachman|&gt;, &lt;|chimneysweep|&gt;, &lt;|baker|&gt;&#34;]
    G --&gt; G1[&#34;&lt;|court|&gt;, &lt;|jury|&gt;, &lt;|verdict|&gt;, &lt;|prisoner|&gt;&#34;]
    H --&gt; H1[&#34;&lt;|church|&gt;, &lt;|parish|&gt;, &lt;|prayer|&gt;, &lt;|blessed|&gt;&#34;]
    I --&gt; I1[&#34;&lt;|morn|&gt;, &lt;|eve|&gt;, &lt;|season|&gt;, &lt;|year|&gt;&#34;]
    J --&gt; J1[&#34;&lt;|shilling|&gt;, &lt;|pound|&gt;, &lt;|yard|&gt;, &lt;|furlong|&gt;&#34;]
    K --&gt; K1[&#34;&lt;|coach|&gt;, &lt;|carriage|&gt;, &lt;|horse|&gt;, &lt;|vessel|&gt;&#34;]

    %% class definitions (custom palette)
    classDef cls_root fill:#e1f5fe,stroke:#81d4fa,color:#000;
    classDef cls_hist fill:#f3e5f5,stroke:#ce93d8,color:#000;
    classDef cls_geo fill:#e8f5e8,stroke:#a5d6a7,color:#000;
    classDef cls_period fill:#fff3e0,stroke:#ffe0b2,color:#000;
    classDef cls_social fill:#fce4ec,stroke:#f8bbd0,color:#000;
    classDef cls_prof fill:#f1f8e9,stroke:#c5e1a5,color:#000;
    classDef cls_legal fill:#e0f2f1,stroke:#80cbc4,color:#000;
    classDef cls_relig fill:#f9fbe7,stroke:#e6ee9c,color:#000;
    classDef cls_temp fill:#e3f2fd,stroke:#90caf9,color:#000;
    classDef cls_curr fill:#fef7e0,stroke:#ffe082,color:#000;
    classDef cls_trans fill:#f3e5f5,stroke:#e1bee7,color:#000;

    %% assign classes
    class A cls_root;
    class B cls_hist;
    class C cls_geo;
    class D cls_period;
    class E cls_social;
    class F cls_prof;
    class G cls_legal;
    class H cls_relig;
    class I cls_temp;
    class J cls_curr;
    class K cls_trans;</pre>
    <figcaption>Figure 5: Special Token Categories and Examples</figcaption>
</figure>
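<p>A quick sanity check on the category sizes in Figure 5 can be sketched as follows. The sizes are copied from the figure; treating the five control tokens from Listing 10 as separate from the ten categories is an assumption made here, and the <code>find_duplicates</code> helper is hypothetical.</p>

```python
# Category sizes as stated in Figure 5.
CATEGORY_SIZES = {
    "Historical Language": 25,
    "London Geography": 20,
    "Historical Periods": 10,
    "Social Classes": 15,
    "Professions": 20,
    "Legal & Judicial": 10,
    "Religious": 10,
    "Temporal": 15,
    "Currency & Measurement": 10,
    "Transportation": 10,
}

# Control tokens from Listing 10 (assumed to be counted outside the categories).
CONTROL_TOKENS = ["<|startoftext|>", "<|endoftext|>", "<|pad|>", "<|unk|>", "<|mask|>"]


def find_duplicates(tokens):
    """Return tokens that appear more than once, in first-seen order."""
    seen, dupes = set(), []
    for token in tokens:
        if token in seen and token not in dupes:
            dupes.append(token)
        seen.add(token)
    return dupes


category_total = sum(CATEGORY_SIZES.values())
grand_total = category_total + len(CONTROL_TOKENS)
print(category_total, grand_total)  # 145 150

# A duplicate check is worth running on the real list before training,
# since the same word can plausibly belong to two categories.
sample = ["<|court|>", "<|church|>", "<|court|>"]
print(find_duplicates(sample))  # ['<|court|>']
```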
<h3 id="35-post-processing-and-hugging-face-integration">3.5 Post-Processing and Hugging Face Integration</h3>
<p>After training our custom tokenizer, we need to make it compatible with the broader machine learning ecosystem and ensure it works properly with language model training. A raw tokenizer can only convert text to tokens and back, but language model training also requires special token handling, sequence padding, and integration with popular frameworks like Hugging Face Transformers.</p>
<p>The challenge, though, is that language model training requires specific formatting that raw tokenizers don&rsquo;t provide. For example, training sequences need to be wrapped with special start/end tokens (<code>&lt;|startoftext|&gt;</code> and <code>&lt;|endoftext|&gt;</code>), padded to consistent lengths for batch processing, and integrated with the rest of the ecosystem. In our case, we also want to utilize Hugging Face and its ecosystem, allowing us to leverage standard training scripts and model architectures. Without proper post-processing, our custom tokenizer would be incompatible with existing training infrastructure.</p>
<p>We add post-processing capabilities that wrap text sequences with control tokens and create Hugging Face-compatible tokenizer files, ensuring seamless integration with the broader machine learning ecosystem while preserving our historical text optimizations.</p>
<p>There are three key areas that we need to consider:</p>
<ul>
<li>
<p><strong>Understanding Post-Processing:</strong> The first step is adding a post-processor that automatically wraps every text sequence with special start and end tokens. This is crucial because language models must be able to identify where sequences begin and end during training. For example, when we tokenize <code>&quot;Hello world&quot;</code>, the post-processor automatically converts it to <code>&lt;|startoftext|&gt; Hello world &lt;|endoftext|&gt;</code>. This template processing ensures consistent formatting across all our training data.</p>
</li>
<li>
<p><strong>Hugging Face Integration:</strong> Next, we create a Hugging Face-compatible wrapper around our custom tokenizer. This wrapper maps our special tokens to the standard token types that Hugging Face expects: beginning-of-sequence (bos), end-of-sequence (eos), padding, unknown, and masking tokens. This mapping allows our custom tokenizer to work seamlessly with standard training scripts and model architectures.</p>
</li>
<li>
<p><strong>Special Token Functions:</strong> Each special token serves a specific purpose in language model training. The beginning-of-sequence token indicates when a new text starts, the end-of-sequence token marks the end of the text, padding tokens ensure all sequences in a batch have the same length, unknown tokens handle words not in our vocabulary, and masking tokens are used during training for masked language modeling tasks.</p>
</li>
</ul>
<p>The code in <a href="#listing11" class="listing-ref">Listing 11</a> demonstrates how we implement these post-processing steps and create a Hugging Face-compatible tokenizer:</p>
<figure id="listing11"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#8bd5ca">from</span> <span style="color:#f5a97f">tokenizers</span> <span style="color:#8bd5ca">import</span> Tokenizer, processors
</span></span><span style="display:flex;"><span><span style="color:#8bd5ca">from</span> <span style="color:#f5a97f">transformers</span> <span style="color:#8bd5ca">import</span> PreTrainedTokenizerFast
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">create_huggingface_tokenizer</span>(tokenizer: Tokenizer, max_length: <span style="color:#91d7e3">int</span> <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#f5a97f">1024</span>) <span style="color:#91d7e3;font-weight:bold">-&gt;</span> PreTrainedTokenizerFast:
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Create Hugging Face compatible tokenizer&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Add post-processor for sequence formatting; token IDs are looked up, not hard-coded</span>
</span></span><span style="display:flex;"><span>    tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>post_processor <span style="color:#91d7e3;font-weight:bold">=</span> processors<span style="color:#91d7e3;font-weight:bold">.</span>TemplateProcessing(
</span></span><span style="display:flex;"><span>        single<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;&lt;|startoftext|&gt; $A &lt;|endoftext|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        special_tokens<span style="color:#91d7e3;font-weight:bold">=</span>[
</span></span><span style="display:flex;"><span>            (<span style="color:#a6da95">&#34;&lt;|startoftext|&gt;&#34;</span>, tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>token_to_id(<span style="color:#a6da95">&#34;&lt;|startoftext|&gt;&#34;</span>)),
</span></span><span style="display:flex;"><span>            (<span style="color:#a6da95">&#34;&lt;|endoftext|&gt;&#34;</span>, tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>token_to_id(<span style="color:#a6da95">&#34;&lt;|endoftext|&gt;&#34;</span>)),
</span></span><span style="display:flex;"><span>        ]
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#6e738d;font-style:italic"># Create Hugging Face tokenizer wrapper</span>
</span></span><span style="display:flex;"><span>    hf_tokenizer <span style="color:#91d7e3;font-weight:bold">=</span> PreTrainedTokenizerFast(
</span></span><span style="display:flex;"><span>        tokenizer_object<span style="color:#91d7e3;font-weight:bold">=</span>tokenizer,
</span></span><span style="display:flex;"><span>        bos_token<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;&lt;|startoftext|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        eos_token<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;&lt;|endoftext|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        pad_token<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;&lt;|pad|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        unk_token<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;&lt;|unk|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        mask_token<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;&lt;|mask|&gt;&#34;</span>,
</span></span><span style="display:flex;"><span>        model_max_length<span style="color:#91d7e3;font-weight:bold">=</span>max_length
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> hf_tokenizer</span></span></code></pre></div><figcaption>
        <strong>Listing 11: Hugging Face Tokenizer Integration</strong>
    </figcaption>
</figure>
<p>Without this integration, our custom tokenizer would be incompatible with standard language model training. The post-processor ensures proper sequence formatting, while the Hugging Face wrapper enables seamless integration with existing training infrastructure and model architectures. This makes our tokenizer compatible with standard training frameworks, allowing for easy sharing and deployment.</p>
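<p>The two mechanics described above, wrapping each sequence with control tokens and padding a batch to a common length, can be sketched in plain Python without any library. The token IDs below are hypothetical placeholders (real IDs come from the trained vocabulary), and <code>wrap_sequence</code> and <code>pad_batch</code> are illustrative names rather than part of the Hugging Face API.</p>

```python
from typing import Dict, List

# Hypothetical IDs for illustration; look up the real ones in the trained vocabulary.
BOS_ID, EOS_ID, PAD_ID = 0, 1, 2


def wrap_sequence(ids: List[int]) -> List[int]:
    """Mimic the template '<|startoftext|> $A <|endoftext|>' on a list of token IDs."""
    return [BOS_ID, *ids, EOS_ID]


def pad_batch(sequences: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Right-pad sequences to the longest length and build the matching attention mask."""
    max_len = max((len(seq) for seq in sequences), default=0)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([wrap_sequence([10, 11, 12]), wrap_sequence([20])])
print(batch["input_ids"])       # [[0, 10, 11, 12, 1], [0, 20, 1, 2, 2]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

<p>In practice the Hugging Face wrapper performs both steps for you; the sketch only shows what the resulting tensors contain.</p>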
<h3 id="36-testing-and-validation">3.6 Testing and Validation</h3>
<p>We need to ensure the tokenizer works correctly with historical text before using it for model training. This requires testing on diverse historical samples and validating both encoding and decoding accuracy. A simple way to do this is to encode a set of historical text samples, decode them back, and check if the original text is perfectly reconstructed. We also want to verify that special tokens are used correctly in the tokenized output.</p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#c6a0f6">def</span> <span style="color:#8aadf4">test_historical_tokenizer</span>(tokenizer: Tokenizer) <span style="color:#91d7e3;font-weight:bold">-&gt;</span> <span style="color:#91d7e3">dict</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#a6da95">&#34;&#34;&#34;Test the trained tokenizer on historical text samples&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    test_texts <span style="color:#91d7e3;font-weight:bold">=</span> [
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;In the year of our Lord 1834, the streets of London were filled with the sounds of horse-drawn carriages.&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;The gentleman from the country said, &#39;I have never seen such a sight in all my days.&#39;&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;The Thames flowed dark and mysterious through the heart of the city.&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#a6da95">&#34;It was the best of times, it was the worst of times.&#34;</span>
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    results <span style="color:#91d7e3;font-weight:bold">=</span> {<span style="color:#a6da95">&#39;perfect_reconstruction&#39;</span>: <span style="color:#f5a97f">0</span>, <span style="color:#a6da95">&#39;special_token_usage&#39;</span>: <span style="color:#f5a97f">0</span>, <span style="color:#a6da95">&#39;failed_tests&#39;</span>: []}
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">for</span> i, text <span style="color:#91d7e3;font-weight:bold">in</span> <span style="color:#91d7e3">enumerate</span>(test_texts):
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Encode and decode text</span>
</span></span><span style="display:flex;"><span>        encoded <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>encode(text)
</span></span><span style="display:flex;"><span>        decoded <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>decode(encoded<span style="color:#91d7e3;font-weight:bold">.</span>ids)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Check reconstruction accuracy</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> decoded<span style="color:#91d7e3;font-weight:bold">.</span>strip() <span style="color:#91d7e3;font-weight:bold">==</span> text<span style="color:#91d7e3;font-weight:bold">.</span>strip():
</span></span><span style="display:flex;"><span>            results[<span style="color:#a6da95">&#39;perfect_reconstruction&#39;</span>] <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">else</span>:
</span></span><span style="display:flex;"><span>            results[<span style="color:#a6da95">&#39;failed_tests&#39;</span>]<span style="color:#91d7e3;font-weight:bold">.</span>append({<span style="color:#a6da95">&#39;index&#39;</span>: i, <span style="color:#a6da95">&#39;original&#39;</span>: text, <span style="color:#a6da95">&#39;decoded&#39;</span>: decoded})
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#6e738d;font-style:italic"># Check special token usage</span>
</span></span><span style="display:flex;"><span>        special_tokens <span style="color:#91d7e3;font-weight:bold">=</span> [token <span style="color:#c6a0f6">for</span> token <span style="color:#91d7e3;font-weight:bold">in</span> encoded<span style="color:#91d7e3;font-weight:bold">.</span>tokens <span style="color:#c6a0f6">if</span> token<span style="color:#91d7e3;font-weight:bold">.</span>startswith(<span style="color:#a6da95">&#39;&lt;|&#39;</span>) <span style="color:#91d7e3;font-weight:bold">and</span> token<span style="color:#91d7e3;font-weight:bold">.</span>endswith(<span style="color:#a6da95">&#39;|&gt;&#39;</span>)]
</span></span><span style="display:flex;"><span>        <span style="color:#c6a0f6">if</span> special_tokens:
</span></span><span style="display:flex;"><span>            results[<span style="color:#a6da95">&#39;special_token_usage&#39;</span>] <span style="color:#91d7e3;font-weight:bold">+=</span> <span style="color:#f5a97f">1</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#c6a0f6">return</span> results</span></span></code></pre></div>
<p><strong>Test Results:</strong></p>
<ul>
<li><strong>Perfect Reconstruction</strong>: 99%+ accuracy on test cases</li>
<li><strong>Special Token Usage</strong>: 80%+ of test cases use special tokens</li>
<li><strong>Average Compression Ratio</strong>: ~0.3 tokens per word (highly efficient)</li>
<li><strong>Success Rate</strong>: 99%+ for historical text samples</li>
</ul>
<p>It is essential to conduct comprehensive testing to ensure the tokenizer operates reliably. In our case, the test cases cover different historical periods, writing styles, and linguistic patterns, giving us confidence that the tokenizer can handle the full range of historical text in our corpus. For a real-world LLM, this is, of course, more complex and would need to cover a broader set of areas.</p>
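<p>The summary numbers above can be computed with a small helper like the one sketched below. It accepts any <code>encode</code>/<code>decode</code> pair, so the character-level stand-in used here is only a placeholder for the real tokenizer, and the name <code>tokenizer_metrics</code> is hypothetical. It reports tokens per word and tokens per character separately, since the two ratios are easy to confuse.</p>

```python
from typing import Callable, Dict, List


def tokenizer_metrics(
    texts: List[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> Dict[str, float]:
    """Compute round-trip accuracy and two compression ratios over a set of texts."""
    exact, tokens, words, chars = 0, 0, 0, 0
    for text in texts:
        ids = encode(text)
        # A round trip counts as exact when decoding restores the original text.
        if decode(ids).strip() == text.strip():
            exact += 1
        tokens += len(ids)
        words += len(text.split())
        chars += len(text)
    return {
        "round_trip_accuracy": exact / max(len(texts), 1),
        "tokens_per_word": tokens / max(words, 1),
        "tokens_per_char": tokens / max(chars, 1),
    }


# Stand-in tokenizer for illustration only: one token per character.
def char_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def char_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


metrics = tokenizer_metrics(["The Thames flowed dark"], char_encode, char_decode)
print(metrics)
# {'round_trip_accuracy': 1.0, 'tokens_per_word': 5.5, 'tokens_per_char': 1.0}
```

<p>With the trained tokenizer from Listing 9, you could pass <code>lambda t: tokenizer.encode(t).ids</code> and <code>tokenizer.decode</code> as the two callables.</p>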
<p><strong>Tokenizer Performance Validation</strong></p>
<p>Not surprisingly, our custom tokenizer outperforms GPT-2&rsquo;s general-purpose tokenizer on historical text, as the table below shows. The improved compression ratio and reconstruction accuracy help the model learn from authentic historical language rather than tokenization artifacts, which is crucial for generating coherent and historically accurate text.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Standard GPT-2</th>
          <th>Our Custom Tokenizer</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Vocabulary Size</strong></td>
          <td>50,257 tokens</td>
          <td>30,000 tokens</td>
          <td>40% smaller</td>
      </tr>
      <tr>
          <td><strong>Special Tokens</strong></td>
          <td>1 token (<code>&lt;|endoftext|&gt;</code>)</td>
          <td>150+ tokens</td>
          <td>150x more</td>
      </tr>
      <tr>
          <td><strong>Compression Ratio</strong></td>
          <td>~0.4 tokens/word</td>
          <td>~0.3 tokens/word</td>
          <td>25% better</td>
      </tr>
      <tr>
          <td><strong>Reconstruction Accuracy</strong></td>
          <td>95%</td>
          <td>99%+</td>
          <td>4+ percentage points better</td>
      </tr>
      <tr>
          <td><strong>Historical Language Support</strong></td>
          <td>Poor</td>
          <td>Good</td>
          <td>N/A</td>
      </tr>
  </tbody>
</table>
<p>These metrics suggest that our 30K token vocabulary provides good coverage for historical text while remaining manageable for small language models. The 150+ special tokens capture linguistic patterns of 1500-1850 English, and the 25% better compression ratio means historical text is represented more efficiently, so more text fits into the same context window. The 99%+ reconstruction accuracy means almost no information is lost during tokenization, while good performance on archaic vocabulary, period-specific terminology, and London geography supports the tokenizer&rsquo;s effectiveness for historical language modeling.</p>
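<p>One reason a smaller vocabulary stays manageable for small models is that the token embedding table grows linearly with vocabulary size. The arithmetic can be sketched as follows; the hidden size of 768 is a hypothetical value chosen for illustration and is not taken from this project&rsquo;s configuration.</p>

```python
def embedding_parameters(vocab_size: int, d_model: int) -> int:
    """Number of weights in a token embedding table of shape (vocab_size, d_model)."""
    return vocab_size * d_model


D_MODEL = 768  # hypothetical hidden size, for illustration only

gpt2_params = embedding_parameters(50_257, D_MODEL)
custom_params = embedding_parameters(30_000, D_MODEL)
saved = gpt2_params - custom_params

print(f"GPT-2 vocabulary:  {gpt2_params:,} embedding weights")    # 38,597,376
print(f"Custom vocabulary: {custom_params:,} embedding weights")  # 23,040,000
print(f"Reduction: {saved:,} ({saved / gpt2_params:.0%})")        # 15,557,376 (40%)
```

<p>The relative saving matches the 40% figure in the table regardless of the hidden size, because both tables scale by the same factor.</p>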
<h3 id="38-implementation-and-usage">3.8 Implementation and Usage</h3>
<p>The complete tokenizer implementation, including training scripts, testing utilities, and validation tools, is available in the <a
	
		href = "https://github.com/bahree/helloLondon"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		helloLondon GitHub repository
	</span>
</a>. The repository provides:</p>
<ul>
<li><strong>Training Code</strong>: Complete BPE tokenizer training with configurable vocabulary sizes and special token definitions</li>
<li><strong>Testing Utilities</strong>: Comprehensive validation tools for testing tokenizer performance on historical text</li>
<li><strong>Integration Examples</strong>: Ready-to-use code for incorporating the tokenizer into your own projects</li>
<li><strong>Documentation</strong>: Detailed usage guides and API references</li>
</ul>
<p>This implementation demonstrates how to build tokenizers for specialized domains, with particular focus on historical language processing and integration with modern ML frameworks.</p>
<h2 id="4-current-limitations">4. Current Limitations</h2>
<p>This project is designed as a learning exercise for those new to AI and LLM development. While we&rsquo;ve built a functional system that demonstrates core concepts, this is not production-ready code and has several limitations that would need to be addressed for real-world deployment:</p>
<p><strong>Data Scale &amp; Quality:</strong></p>
<ul>
<li>Corpus size: Our 500M character corpus is tiny compared to production LLMs, which typically use 100x-1000x more data (50B-500B+ characters). This limits the model&rsquo;s ability to learn diverse patterns and reduces the quality of generated output.</li>
<li>Source diversity: With only 218 sources, we lack comprehensive historical coverage across the 1500-1850 span, potentially missing important linguistic evolution patterns and regional variations.</li>
<li>Geographic bias: Heavy focus on London may not accurately represent broader historical English patterns from other regions, limiting the model&rsquo;s generalizability.</li>
<li>Bias detection: We lack systematic approaches to identify or mitigate historical biases in the data, which could lead to the model perpetuating outdated or problematic language patterns.</li>
<li>Quality assessment: Our cleaning pipeline, while effective for common issues, overlooks many edge cases and artifacts that would require more sophisticated ML-based quality assessment in production.</li>
</ul>
<p><strong>Tokenizer &amp; Model Architecture:</strong></p>
<ul>
<li>Vocabulary size: Our 30K token vocabulary is small compared to modern models (which often use 50K-100K+ tokens), limiting the model&rsquo;s ability to represent diverse vocabulary efficiently.</li>
<li>Special tokens: The 150+ special tokens are manually curated rather than learned from data, which may miss important patterns that data-driven approaches would discover.</li>
<li>Context length: The 1024 token context window is very short compared to modern models (which often use 4K-32K+ tokens), limiting the model&rsquo;s ability to maintain coherence in longer texts.</li>
<li>Language support: No support for other languages or historical variants beyond English, significantly limiting the model&rsquo;s applicability.</li>
<li>Tokenization approach: While our BPE approach is clean and avoids WordPiece artifacts, it may not be optimal for all historical text patterns and could benefit from more sophisticated techniques.</li>
</ul>
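<p>To make the context-length limitation concrete, the sketch below shows one common workaround: splitting a long token sequence into overlapping 1024-token windows so that each piece fits the model. This is an illustrative sketch, not code from the helloLondon repository; the function name <code>split_into_windows</code>, the 128-token overlap, and the integer stand-in for token IDs are all assumptions made for the example.</p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def split_into_windows(token_ids, window=1024, overlap=128):
    """Split a token sequence into overlapping windows of at most `window` tokens."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    stride = window - overlap  # how far each new window advances
    windows = []
    for start in range(0, len(token_ids), stride):
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the current window already reaches the end of the text
    return windows


# Hypothetical stand-in for a tokenized document: 2,500 integer token IDs.
token_ids = list(range(2500))
windows = split_into_windows(token_ids)
print(len(windows), [len(w) for w in windows])</code></pre></div>
<p>Overlap carries some local context across window boundaries, but the model still sees each window in isolation, which is why a short context window limits long-range coherence.</p>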
<p><strong>Technical Infrastructure:</strong></p>
<ul>
<li>Error handling: Basic error handling with limited logging and monitoring makes it difficult to debug issues and track system health in production.</li>
<li>Testing: Minimal test coverage that excludes edge cases means many potential failure modes remain undetected until they occur in production.</li>
<li>Performance: No optimization for speed, memory, or distributed processing, making the system unsuitable for production-scale deployment.</li>
<li>Data management: Lacks data versioning and reproducibility guarantees, making it difficult to track changes and reproduce results across different environments.</li>
<li>Security: No security considerations for data handling and model deployment, creating potential vulnerabilities for sensitive historical data.</li>
<li>Compliance: Missing compliance considerations for GDPR, data privacy, and regulatory requirements, which are essential for production deployment.</li>
<li>Monitoring: No production monitoring, alerting, or observability features, making it impossible to detect and respond to issues in real-time.</li>
</ul>
<p>These limitations are intentional trade-offs made to keep the project manageable and focused on core learning objectives, but they represent significant gaps for production deployment.</p>
<h3 id="43-what-youd-need-for-production">4.3 What You&rsquo;d Need for Production</h3>
<p><strong>Data Engineering and Legal Framework</strong></p>
<p>Production systems require 100x-1000x more data from diverse sources, with ML-based quality assessment, bias detection, and filtering that goes far beyond our simple heuristics. You&rsquo;d need robust ETL pipelines with proper error handling and monitoring, as well as a comprehensive legal framework for copyright clearance, data licensing, and compliance management, which we haven&rsquo;t addressed.</p>
<p><strong>Model Architecture and Training</strong></p>
<p>Meaningful historical language understanding would require models with over 1 billion parameters, using more sophisticated training techniques, regularization, and optimization. You&rsquo;d also need comprehensive evaluation on diverse historical text tasks and domain-specific fine-tuning capabilities that our current system doesn&rsquo;t support.</p>
<p><strong>Infrastructure and Operations</strong></p>
<p>Production deployment requires a multi-GPU, multi-node distributed training infrastructure, production-grade model serving with load balancing and scaling, comprehensive monitoring and alerting systems, and end-to-end security for both data and model protection—none of which our learning-focused system currently provides.</p>
<p>This progression from data → tokenizer → training → deployment provides a complete methodology for building specialized historical language models.</p>
<h2 id="5-resources-and-further-reading">5. Resources and Further Reading</h2>
<ul>
<li><strong>GitHub Repository</strong>: <a
	
		href = "https://github.com/bahree/helloLondon"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		github.com/bahree/helloLondon
	</span>
</a> - Complete source code for data collection and tokenizer training</li>
<li><strong>Part 1</strong>: <a
	
		href = "https://blog.desigeek.com/post/2025/09/building-llm-from-scratch-part1/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Building LLMs from Scratch - Part 1
	</span>
</a> - Quick start and overview</li>
<li><strong>Documentation</strong>: Complete guides in the <code>08_documentation/</code> folder covering every aspect of the project</li>
<li><strong>Book Reference</strong>: <a
	
		href = "https://a.co/d/ffzkJ7T"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Generative AI in Action
	</span>
</a> - For deeper understanding of core LLM concepts</li>
</ul>
<h2 id="6-summary">6. Summary</h2>
<p>This post represents Part 2 of our learning journey into the fundamentals of LLM development. While we&rsquo;ve built a functional data collection and tokenization system demonstrating core concepts, the real value lies in understanding:</p>
<ul>
<li><strong>Data flow</strong> from raw sources to training-ready corpora</li>
<li><strong>Tokenization impact</strong> on model performance across different approaches</li>
<li><strong>Challenges</strong> in processing historical and domain-specific text</li>
<li><strong>Trade-offs</strong> between quality, scale, and complexity</li>
<li><strong>Debugging and improvement</strong> strategies for encountered problems</li>
</ul>
<p>The limitations we&rsquo;ve identified are great learning opportunities. Every production LLM started as a learning project, and every limitation teaches you something new about how these systems work. This foundation prepares us for the next phase of our journey.</p>
<hr>
<p><strong>Ready for Part 3?</strong> Part 3 will cover the custom GPT architecture, GPU optimization strategies, and training infrastructure that transforms our clean data and custom tokenizer into working language models—while maintaining the same educational focus on understanding the fundamentals.</p>
<blockquote>
<p><strong>🧱 Series Posts</strong>: <a
	
		href = "/post/2025/09/building-llm-from-scratch-part1/"
	

	

	>
	
	<span>
		Part 1 – Using the Published Historical Models
	</span>
</a> | Part 2 (this post) | <a
	
		href = "/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/"
	

	

	>
	
	<span>
		Part 3 – Training Architecture &amp; GPU Optimization
	</span>
</a> | <a
	
		href = "/post/2026/01/building-llm-from-scratch-part4-evaluation-deployment/"
	

	

	>
	
	<span>
		Part 4 – Evaluation &amp; Deployment
	</span>
</a></p></blockquote>
]]></content:encoded>
    </item>
    <item>
      <title>🏛️How to build a Large Language Model from Scratch - Part 1</title>
      <link>/post/2025/09/building-llm-from-scratch-part1/</link>
      <pubDate>Tue, 23 Sep 2025 00:00:00 +0000</pubDate>
      <guid>/post/2025/09/building-llm-from-scratch-part1/</guid>
      <description>Learn how to build LLMs from scratch using historical London texts (1500-1850). Complete 4-part series with working code, published models, and educational deployment. Part 1: Get started in minutes.</description>
      <content:encoded><![CDATA[<p><strong>TL;DR</strong></p>
<p>In this post, I show how to build a working LLM from scratch, with a complete end-to-end pipeline from data gathering to training to deployment. For this project I concentrate on historical English related to London, using texts from 1500-1850. To show the flexibility, I built <strong>two language models</strong> that are identical in architecture and differ only in size (117M vs 354M parameters).</p>
<blockquote>
<p><strong>⚠️ Educational Purpose</strong>: This is a learning project designed to teach LLM development concepts. For production-scale LLMs, you&rsquo;ll need much larger datasets, more sophisticated infrastructure, and additional considerations not covered here.</p></blockquote>
<p>This guide shows you how to monitor training progression, perform rapid evaluations, test models from both PyTorch checkpoints and published Hugging Face repositories, and ultimately publish your own - supported by complete code, live model artifacts, and educational inference tooling.</p>
<p><strong>4-Part Series</strong>:</p>
<ul>
<li><strong>Part 1 (this): Quick start, inference, and overview</strong></li>
<li><strong><a
	
		href = "/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	>
	
	<span>
		Part 2: Data collection and custom tokenizers
	</span>
</a></strong></li>
<li><strong><a
	
		href = "/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/"
	

	

	>
	
	<span>
		Part 3: Model architecture and GPU training
	</span>
</a></strong></li>
<li><strong><a
	
		href = "/post/2026/01/building-llm-from-scratch-part4-evaluation-deployment/"
	

	

	>
	
	<span>
		Part 4: Evaluation and deployment
	</span>
</a></strong></li>
</ul>
<h2 id="1-overview">1. Overview</h2>
<p><em>Train AI models on 1500-1850 London texts. Complete 4-part series covering data collection, training, and deployment. Part 1: Quick start and overview.</em></p>
<blockquote>
<p><strong>📖 Want to understand the core LLM concepts?</strong> This series focuses on implementation and hands-on building. For a deeper understanding of foundational concepts like tokenizers, prompt engineering, RAG, responsible AI, fine-tuning, and more, check out my book <a
	
		href = "https://a.co/d/ffzkJ7T"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		<strong>Generative AI in Action</strong>
	</span>
</a>.</p></blockquote>
<blockquote>
<p>You can learn more about the book → <a
	
		href = "https://blog.desigeek.com/post/2024/10/book-release-genai-in-action/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		by clicking here
	</span>
</a>📘.</p></blockquote>
<h3 id="11-what-was-built">1.1 What was built?</h3>
<p>I found that many folks don&rsquo;t understand what it takes to build an LLM, and the guides that do exist share only piecemeal elements, with nothing comprehensive for someone who is new to this. There are more detailed guides on fine-tuning existing models, but not much on the complete development pipeline. This series fills that gap by walking through the process of creating specialized language models trained exclusively on historical London texts from 1500 to 1850.</p>
<p>I am mostly doing this for my own learning, and also sharing what I can. Many work-related details, for obvious reasons, I cannot share and discuss, but some small pet projects like this embody the same sentiment.</p>
<p>The <strong>helloLondon Historical Language Models</strong> represent a complete end-to-end implementation, from data collection through deployment. Rather than fine-tuning existing models, I chose to train from the ground up to eliminate modern biases and create models that genuinely understand historical language patterns, cultural contexts, and period-specific knowledge.</p>
<p><strong>Two Model Variants</strong>
I built two models that share the same architecture, tokenizer, and training process. The only difference is the number of parameters: an SLM (117M parameters) optimized for learning and resource-constrained environments, and a Regular model (354M parameters) designed for higher-quality generation.</p>
<p>Both use identical code with different configuration files, allowing you to understand the impact of model size on performance and choose the right variant for your needs.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Iterations</th>
          <th>Training Time*</th>
          <th>Use Case</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>SLM</strong> (Small)</td>
          <td>117M</td>
          <td>60,000</td>
          <td>~8-12 hours</td>
          <td>Fast inference, resource-constrained</td>
      </tr>
      <tr>
          <td><strong>Regular</strong> (Full)</td>
          <td>354M</td>
          <td>60,000</td>
          <td>~28-32 hours</td>
          <td>High-quality generation</td>
      </tr>
  </tbody>
</table>
<blockquote>
<p><strong>Note:</strong> Technically speaking, both of these models can be classified as SLMs given that they have 117M and 354M parameters; however, for the sake of this project, I call the smaller of the two the SLM and the other Regular.</p></blockquote>
<h3 id="12-core-pipelines">1.2 Core Pipelines</h3>
<p>The complete development pipeline encompasses multiple critical stages that transform raw historical texts into working language models. The process starts with <strong>data collection</strong>, where we systematically gather and filter over 218 historical London sources spanning 1500–1850. This process ensures we capture authentic period language while minimizing modern biases that could contaminate our models.</p>
<p>Next, we develop a <strong>custom tokenization system</strong> specifically designed for historical English. This involves training a domain-specific tokenizer with a 30,000-token vocabulary plus 150+ special tokens that capture period language patterns, archaic spellings, and historical terminology that modern tokenizers often miss.</p>
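<p>Part 2 describes the tokenizer as BPE-based. As a rough intuition for what training such a tokenizer involves, the toy sketch below learns byte-pair-style merges on a tiny, made-up corpus by repeatedly fusing the most frequent adjacent symbol pair. It is character-level and unoptimized, and the helper names (<code>count_pairs</code>, <code>merge_pair</code>, <code>train_bpe</code>) are hypothetical rather than taken from the repository; the real pipeline targets a 30,000-token vocabulary and is covered in Part 2.</p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs across a {symbol_tuple: frequency} vocabulary."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i != len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Made-up sample text, only for illustration.
corpus = "thou hast thou art thee thine hath"
merges, words = train_bpe(corpus, num_merges=3)
print(merges)
print(words)</code></pre></div>
<p>Because the pair <code>t</code> + <code>h</code> is by far the most frequent in this sample, it is merged first. A tokenizer trained on period text picks up frequent archaic fragments in the same way, which is the motivation for training on the historical corpus rather than reusing a modern vocabulary.</p>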
<p>The <strong>model architecture</strong> phase implements GPT-style causal language models entirely from scratch, creating two variants with 117M and 354M parameters, respectively. Both models share identical architecture and training processes, allowing for direct comparison of performance versus computational requirements.</p>
<p>Our <strong>training infrastructure</strong> leverages modern multi-GPU training with Distributed Data Parallel (DDP), comprehensive checkpointing for restart resilience, and real-time monitoring through Weights &amp; Biases. This ensures reliable training even across extended periods and hardware failures.</p>
<p><strong>Evaluation</strong> goes beyond standard metrics to include historical accuracy probes, perplexity tracking, qualitative generation review, and early failure detection. We specifically test how well our models understand historical context, period-appropriate language, and London geography.</p>
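<p>Of the signals listed above, perplexity is the easiest to pin down numerically: it is the exponential of the average negative log-likelihood the model assigns to the true next tokens. The sketch below computes it from a list of per-token log-probabilities. The probabilities are made-up values for illustration, and the function is a minimal stand-in, not the project&rsquo;s evaluation code.</p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math


def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical probabilities a model assigned to the correct next token.
confident = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
uncertain = [math.log(p) for p in (0.1, 0.1, 0.1, 0.1)]

print(perplexity(confident))  # close to 2: like choosing between 2 options
print(perplexity(uncertain))  # close to 10: like choosing between 10 options</code></pre></div>
<p>Lower is better. Tracked over training, a falling perplexity on held-out text suggests the model is getting better at predicting period language, though it says nothing on its own about historical accuracy.</p>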
<p>Finally, <strong>deployment</strong> includes publishing models to Hugging Face alongside unified local and cloud inference scripts, making the models immediately accessible to researchers and developers worldwide.</p>
<h3 id="13-hands-on-experience">1.3 Hands-On Experience</h3>
<p>Every aspect of this project is designed for practical implementation and learning. The <strong>working code</strong> covers every stage from data collection through tokenizer training, model training, evaluation, and publishing - all fully implemented and documented with clear instructions and examples.</p>
<p>Both models are already published on Hugging Face, so <strong>live models</strong> are immediately available: you can test the published checkpoints instantly or retrain from scratch with a single command. This dual approach lets you either jump straight into experimentation or understand the complete development process.</p>
<p>The project works with <strong>real data</strong> - over 500 million characters of authentic historical English from 1500–1850, carefully filtered to minimize modern bias while preserving the rich linguistic patterns of the period. All of the training material comes from genuine historical texts.</p>
<p>Everything is <strong>well-structured</strong> with clear documentation, error handling, reproducible configurations, and automated publishing workflows. The codebase follows good development practices, making it suitable for learning LLM development concepts.</p>
<p>This series is structured to take you through the complete LLM development pipeline:</p>
<table>
  <thead>
      <tr>
          <th>Part</th>
          <th>Focus</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Part 1</strong> (this post)</td>
          <td>Quick start and end-to-end overview</td>
          <td>Use published models, understand the complete pipeline, and get hands-on experience with working code and live models. The intent is that if you want to build this, you can follow the instructions and get a model in the end. If you want to understand more of the inner workings and details, then those will be covered in the subsequent blog posts.</td>
      </tr>
      <tr>
          <td><strong>Part 2</strong></td>
          <td>Data collection and custom tokenization</td>
          <td>Deep dive into gathering 218+ historical sources, cleaning pipelines, and building specialized tokenizers for historical language patterns.</td>
      </tr>
      <tr>
          <td><strong>Part 3</strong></td>
          <td>Model architecture and training infrastructure</td>
          <td>Technical implementation of custom GPT architectures, multi-GPU training, checkpointing, and performance optimization.</td>
      </tr>
      <tr>
          <td><strong><a
	
		href = "/post/2026/01/building-llm-from-scratch-part4-evaluation-deployment/"
	

	

	>
	
	<span>
		Part 4
	</span>
</a></strong></td>
          <td>Evaluation and deployment</td>
          <td>Comprehensive testing frameworks, historical accuracy assessment, and deployment to Hugging Face.</td>
      </tr>
  </tbody>
</table>
<p>For this first part, you have two paths to choose from based on your goals and available time:</p>
<ul>
<li>
<p><strong>Option 1: Quick Start with Published Models</strong> - Jump straight into using the pre-trained models on Hugging Face for immediate testing and exploration. Perfect if you want to see results quickly and aren&rsquo;t concerned with the technical implementation details.</p>
</li>
<li>
<p><strong>Option 2: Build from Scratch</strong> - Dive deep into the complete codebase and build your own historical language model from the ground up. Ideal if you want to understand every aspect of the pipeline and learn how to create specialized LLMs.</p>
</li>
</ul>
<p>Let us start with option 1 - use the models.</p>
<h2 id="2-use-the-models---try-it-now-using-hugging-face">2. Use the models - Try it now using Hugging Face</h2>
<p>If you just want to get going and kick the tires, the models are live on Hugging Face and ready to use.</p>
<ul>
<li><strong>SLM Model (117M parameters)</strong>: <a
	
		href = "https://huggingface.co/bahree/london-historical-slm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		💡 https://huggingface.co/bahree/london-historical-slm
	</span>
</a></li>
<li><strong>Regular Model (354M parameters)</strong>: <a
	
		href = "https://huggingface.co/bahree/london-historical-llm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		💡 https://huggingface.co/bahree/london-historical-llm
	</span>
</a></li>
</ul>
<p>In addition, you can also explore the complete codebase and build your own historical language model from scratch. The entire pipeline is documented with working code, training scripts, and deployment guides, and is available on GitHub:</p>
<ul>
<li><strong>GitHub Repo 💻 &ndash;&gt; <a
	
		href = "https://github.com/bahree/helloLondon"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		⚙️ github.com/bahree/helloLondon
	</span>
</a></strong>.</li>
</ul>
<p>If you want to quickly test the published models on Hugging Face (HF), you can do so in two ways: quick automated tests or interactive mode. This is the easiest way to get started and show that the models are fully working. You can either clone the repo and run the scripts or use the Python code snippet below.</p>
<p>If you don&rsquo;t have a development environment set up, you can follow the instructions in the GitHub repo to set up a conda environment with all dependencies. A CPU is enough for the quick local tests, but a GPU is recommended for interactive mode. At a minimum, you will need the Python packages shown in <a href="#listing1" class="listing-ref">Listing 1</a>. These are also called out on the Hugging Face model page.</p>
<figure id="listing1"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>python -m pip install -U pip setuptools wheel
</span></span><span style="display:flex;"><span>python -m pip install <span style="color:#a6da95">&#34;transformers[torch]&#34;</span> accelerate safetensors</span></span></code></pre></div><figcaption>
        <strong>Listing 1: Install Required Dependencies</strong>
    </figcaption>
</figure>
<blockquote>
<p><em>Note:</em> It is recommended to use a virtual environment or conda environment to avoid dependency conflicts. See the GitHub repo for complete setup instructions.</p></blockquote>
<p>If you don&rsquo;t have the code repo yet, you can run the code in <a href="#listing2" class="listing-ref">Listing 2</a> directly to load a model from Hugging Face and run inference.</p>
<p><strong>Python Code:</strong></p>
<figure id="listing2"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#8bd5ca">from</span> <span style="color:#f5a97f">transformers</span> <span style="color:#8bd5ca">import</span> AutoTokenizer, AutoModelForCausalLM
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Load the published SLM model</span>
</span></span><span style="display:flex;"><span>model_name <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#34;bahree/london-historical-slm&#34;</span>
</span></span><span style="display:flex;"><span>tokenizer <span style="color:#91d7e3;font-weight:bold">=</span> AutoTokenizer<span style="color:#91d7e3;font-weight:bold">.</span>from_pretrained(model_name)
</span></span><span style="display:flex;"><span>model <span style="color:#91d7e3;font-weight:bold">=</span> AutoModelForCausalLM<span style="color:#91d7e3;font-weight:bold">.</span>from_pretrained(model_name)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Generate historical text</span>
</span></span><span style="display:flex;"><span>prompt <span style="color:#91d7e3;font-weight:bold">=</span> <span style="color:#a6da95">&#34;In the year of our Lord 1834, I walked through the streets of London and witnessed&#34;</span>
</span></span><span style="display:flex;"><span>inputs <span style="color:#91d7e3;font-weight:bold">=</span> tokenizer(prompt, return_tensors<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#a6da95">&#34;pt&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>outputs <span style="color:#91d7e3;font-weight:bold">=</span> model<span style="color:#91d7e3;font-weight:bold">.</span>generate(
</span></span><span style="display:flex;"><span>    inputs[<span style="color:#a6da95">&#34;input_ids&#34;</span>],
</span></span><span style="display:flex;"><span>    max_new_tokens<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">100</span>,
</span></span><span style="display:flex;"><span>    do_sample<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>,
</span></span><span style="display:flex;"><span>    temperature<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">0.3</span>,
</span></span><span style="display:flex;"><span>    top_p<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">0.9</span>,
</span></span><span style="display:flex;"><span>    top_k<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">20</span>,
</span></span><span style="display:flex;"><span>    repetition_penalty<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">1.2</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#91d7e3">print</span>(tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>decode(outputs[<span style="color:#f5a97f">0</span>], skip_special_tokens<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">True</span>))</span></span></code></pre></div><figcaption>
        <strong>Listing 2: Load and Test Published Model</strong>
    </figcaption>
</figure>
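<p>The <code>temperature</code>, <code>top_k</code>, and <code>top_p</code> arguments in Listing 2 shape which tokens the model is allowed to pick next. The sketch below illustrates the idea on a made-up five-token vocabulary: temperature rescales the logits, top-k keeps the k most likely tokens, and top-p keeps the smallest set whose cumulative probability reaches the threshold. It is a simplified, pure-Python approximation for intuition, not the <code>transformers</code> implementation, and it leaves out <code>repetition_penalty</code>; the function name and logit values are hypothetical.</p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import math


def sampling_distribution(logits, temperature=0.3, top_k=20, top_p=0.9):
    """Return the {token: probability} left after temperature, top-k, and top-p.

    Assumes temperature > 0 and a non-empty logits dict.
    """
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((exps[tok] / total, tok) for tok in exps), reverse=True)

    # Top-k: keep only the k most likely tokens, then renormalize.
    ranked = ranked[:top_k]
    kept_mass = sum(prob for prob, _ in ranked)
    ranked = [(prob / kept_mass, tok) for prob, tok in ranked]

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for prob, _ in kept)
    return {tok: prob / norm for prob, tok in kept}


# Hypothetical next-token logits, only for illustration.
logits = {"the": 2.0, "a": 1.5, "thy": 1.0, "ye": 0.5, "zounds": -1.0}

cool = sampling_distribution(logits, temperature=0.3)
warm = sampling_distribution(logits, temperature=1.0)
print(sorted(cool))
print(sorted(warm))</code></pre></div>
<p>With the low temperature used in Listing 2, only the two strongest candidates survive the top-p cut in this toy example, while a temperature of 1.0 leaves four. That is the trade-off to keep in mind when tuning: lower values give more conservative, repeatable text, and higher values give more varied but less predictable output.</p>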
<h3 id="21-local-testing-with-the-complete-codebase">2.1 Local Testing with the Complete Codebase</h3>
<p>Now that you&rsquo;ve seen the models work directly from Hugging Face, let&rsquo;s explore the complete development experience by working with the actual codebase. This section walks you through testing the models locally using the same infrastructure that was used to train them.</p>
<p>The <code>helloLondon</code> repository contains everything needed to reproduce the entire pipeline - from data collection through model deployment. By running these tests locally, you&rsquo;ll get hands-on experience with the inference scripts and understand how the models integrate with the broader development workflow.</p>
<p>The following examples assume you&rsquo;ve cloned the repository and are running from the root directory. All scripts are designed to work out-of-the-box with the published models, giving you immediate access to the same testing infrastructure used during development. You can test the models using <a href="#listing3" class="listing-ref">Listing 3</a> or <a href="#listing4" class="listing-ref">Listing 4</a>.</p>
<figure id="listing3"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Test SLM model (117M parameters)</span>
</span></span><span style="display:flex;"><span>python 06_inference/test_published_models.py --model_type slm
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Test Regular model (354M parameters)  </span>
</span></span><span style="display:flex;"><span>python 06_inference/test_published_models.py --model_type regular</span></span></code></pre></div><figcaption>
        <strong>Listing 3: Test Published Models</strong>
    </figcaption>
</figure>
<p>There is also an interactive mode where you can type in your own prompts and see the model generate text.</p>
<p><strong>Interactive Testing:</strong></p>
<figure id="listing4"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># SLM model - Interactive mode</span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_unified.py --published --model_type slm --interactive
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Regular model - Interactive mode</span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_unified.py --published --model_type regular --interactive
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Single prompt testing</span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_unified.py --published --model_type slm --prompt <span style="color:#a6da95">&#34;In the year 1834, I walked through the streets of London and witnessed&#34;</span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_unified.py --published --model_type regular --prompt <span style="color:#a6da95">&#34;In the year 1834, I walked through the streets of London and witnessed&#34;</span></span></span></code></pre></div><figcaption>
        <strong>Listing 4: Interactive Mode Testing</strong>
    </figcaption>
</figure>
<p>If everything works, you should see output similar to the following for the SLM model:</p>
<p><strong>Example Output (Hugging Face SLM Example):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-gdscript3" data-lang="gdscript3"><span style="display:flex;"><span><span style="color:#ed8796">🧪</span> Testing SLM Model: bahree<span style="color:#91d7e3;font-weight:bold">/</span>london<span style="color:#91d7e3;font-weight:bold">-</span>historical<span style="color:#91d7e3;font-weight:bold">-</span>slm
</span></span><span style="display:flex;"><span><span style="color:#91d7e3;font-weight:bold">============================================================</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">📂</span> Loading model<span style="color:#91d7e3;font-weight:bold">...</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">✅</span> Model loaded <span style="color:#91d7e3;font-weight:bold">in</span> <span style="color:#f5a97f">8.91</span> seconds
</span></span><span style="display:flex;"><span><span style="color:#ed8796">📊</span> Model Info:
</span></span><span style="display:flex;"><span>   Type: SLM
</span></span><span style="display:flex;"><span>   Description: Small Language Model (<span style="color:#f5a97f">117</span>M parameters)
</span></span><span style="display:flex;"><span>   Device: cuda
</span></span><span style="display:flex;"><span>   Vocabulary size: <span style="color:#f5a97f">30</span>,<span style="color:#f5a97f">000</span>
</span></span><span style="display:flex;"><span>   Max length: <span style="color:#f5a97f">512</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#91d7e3;font-weight:bold">---</span> Test <span style="color:#f5a97f">1</span><span style="color:#91d7e3;font-weight:bold">/</span><span style="color:#f5a97f">10</span> <span style="color:#91d7e3;font-weight:bold">---</span>
</span></span><span style="display:flex;"><span>Prompt: In the year <span style="color:#f5a97f">1834</span>, I walked through the streets of London <span style="color:#91d7e3;font-weight:bold">and</span> witnessed
</span></span><span style="display:flex;"><span>Generated: a scene <span style="color:#91d7e3;font-weight:bold">in</span> which some of those who did <span style="color:#91d7e3;font-weight:bold">not</span> incline to come <span style="color:#91d7e3;font-weight:bold">in</span> contact with him took part <span style="color:#91d7e3;font-weight:bold">in</span> his discourse<span style="color:#91d7e3;font-weight:bold">.</span> It was on this occasion that I perceived that he had been engaged <span style="color:#91d7e3;font-weight:bold">in</span> some new business connected with the house, but <span style="color:#c6a0f6">for</span> some days it had <span style="color:#91d7e3;font-weight:bold">not</span> taken place, nor did he appear so desirous of pursuing any further display of interest <span style="color:#91d7e3;font-weight:bold">.....</span>
</span></span><span style="display:flex;"><span>Time: <span style="color:#f5a97f">5.75</span>s</span></span></code></pre></div>
<p>Notice how the model captures:</p>
<ul>
<li><strong>Period-appropriate language</strong> (&ldquo;thank &rsquo;ee kindly,&rdquo; &ldquo;bade me go,&rdquo; &ldquo;spectacles&rdquo;)</li>
<li><strong>Historical dialogue patterns</strong> (formal speech, period-appropriate contractions)</li>
<li><strong>Historical context</strong> (West Indies, poor rates, needle work, pocket-book)</li>
<li><strong>Authentic historical narrative</strong> (detailed scene setting, period-appropriate social interactions)</li>
</ul>
<p>Now that we have tried using the model, let&rsquo;s explore option 2 and see how we can build it. Once you&rsquo;ve built your own model, you&rsquo;ll be able to test it using the checkpoints saved during training - see section 7.4 for detailed checkpoint testing instructions.</p>
<h2 id="3-build-the-models---from-scratch">3. Build the models - From Scratch</h2>
<p>Building a language model from scratch is both an art and a science - requiring careful orchestration of data, architecture, and training to create something that can genuinely understand and generate historical text. Unlike fine-tuning existing models, training from scratch gives us complete control over every aspect of the model&rsquo;s knowledge and behavior.</p>
<p>The journey from raw historical documents to a working language model involves six phases, each building on the previous one. The flowchart below illustrates the end-to-end pipeline, showing how we transform 218+ historical sources into two specialized models that generate period-appropriate London text from 1500-1850.</p>
<figure class="align-center " id="fig1">
    <pre class="mermaid">graph TD
    A[📚 Historical Data Collection&lt;br/&gt;218+ sources, 1500-1850] --&gt; B[🧹 Data Cleaning &amp; Processing&lt;br/&gt;Text normalization, filtering]
    B --&gt; C[🔤 Custom Tokenizer Training&lt;br/&gt;30k vocab + 150+ special tokens]
    C --&gt; D[🏋️ Model Training&lt;br/&gt;Two Model Sizes&lt;br/&gt;SLM: 117M / Regular: 354M]
    D --&gt; E[📊 Evaluation &amp; Testing&lt;br/&gt;Historical accuracy, ROUGE, MMLU]
    E --&gt; F[🚀 Deployment&lt;br/&gt;Hugging Face + Local Inference]
    
    G[📖 Building a Custom LLM] --&gt; A
    
    F --&gt; L[🎯 Use Cases&lt;br/&gt;Historical text generation&lt;br/&gt;Educational projects&lt;br/&gt;Research applications]
    
    style A fill:#e1f5fe
    style D fill:#f3e5f5
    style F fill:#e8f5e8
    style G fill:#fff3e0</pre>
    <figcaption>Figure 1: Complete LLM Development Pipeline</figcaption>
</figure>
<p>Now that we have a bird&rsquo;s eye view of the complete pipeline, let us get into the details and build the model from scratch. I am going to walk you through the complete process step-by-step.</p>
<p>I am also going to assume you have a basic understanding of Python, PyTorch, and command-line operations, and a reasonably recent development setup, including a relatively modern GPU (NVIDIA RTX 3060 or better recommended). For simplicity, I will show commands for Linux/macOS, but Windows users can easily adapt them.</p>
<p>Again, as a reminder, the ⚙️ <a
	
		href = "https://github.com/bahree/helloLondon/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		GitHub repo
	</span>
</a> has all the code and instructions you need to get started. You can clone the repo and follow along.</p>
<h2 id="4-environment-and-configuration-setup">4. Environment and Configuration Setup</h2>
<p>The foundation of any successful machine learning project lies in proper environment setup and configuration. This step involves creating a virtual environment, installing dependencies, and configuring the project structure. Understanding the key configuration files, directory organization, and overall project architecture is crucial - these elements form the backbone of the entire training process. Taking time to get this right upfront prevents countless headaches and debugging sessions later, ensuring smooth execution through all subsequent phases.</p>
<h3 id="41-key-configuration-files">4.1 Key Configuration Files</h3>
<ul>
<li><strong><code>config.py</code></strong>: Central configuration system (paths, training settings, tokenizer config)</li>
<li><strong><code>01_environment/setup_environment.py</code></strong>: Environment setup script (reads from config.py)</li>
<li><strong><code>requirements.txt</code></strong>: Python dependencies (auto-generated by setup script)</li>
</ul>
<h3 id="42-important-directories-created-by-setup">4.2 Important Directories (Created by Setup)</h3>
<ul>
<li><strong><code>helloLondon/</code></strong>: Virtual environment directory</li>
<li><strong><code>data/london_historical/</code></strong>: Historical text data storage</li>
<li><strong><code>09_models/checkpoints/</code></strong>: Model checkpoints during training</li>
<li><strong><code>09_models/tokenizers/</code></strong>: Custom tokenizer storage</li>
</ul>
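<p>Once the setup script has run, a quick way to confirm that these directories exist is a small check like the one below. This is a minimal sketch and not part of the repo: the <code>missing_dirs</code> helper and the <code>EXPECTED_DIRS</code> list are names I made up for illustration, and the paths are simply the ones listed above.</p>

```python
from pathlib import Path

# Directories named in section 4.2. The helper itself is hypothetical
# and is not part of the helloLondon repo.
EXPECTED_DIRS = [
    "helloLondon",
    "data/london_historical",
    "09_models/checkpoints",
    "09_models/tokenizers",
]


def missing_dirs(project_root, expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(Path.cwd())
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected directories are present.")
```

<p>Run it from the project root; if you renamed the virtual environment, adjust the first entry accordingly.</p>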
<p>Now that we have that out of the way, let us run the setup commands shown in <a href="#listing5" class="listing-ref">Listing 5</a>. These clone the repo, set up the environment, and install all dependencies. For this to work, you need git, python, and <code>python3-venv</code> already installed. If you don&rsquo;t have these, please install them first.</p>
<blockquote>
<p>PS: See the <a
	
		href = "https://github.com/bahree/helloLondon/blob/main/README.md"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Training QuickStart guide
	</span>
</a> in the GitHub repo for more details.</p></blockquote>
<figure id="listing5"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Clone and setup environment</span>
</span></span><span style="display:flex;"><span>git clone https://github.com/bahree/helloLondon/
</span></span><span style="display:flex;"><span><span style="color:#91d7e3">cd</span> helloLondon
</span></span><span style="display:flex;"><span>python 01_environment/setup_environment.py
</span></span><span style="display:flex;"><span><span style="color:#91d7e3">source</span> activate_env.sh</span></span></code></pre></div><figcaption>
        <strong>Listing 5: Clone and Setup Repository</strong>
    </figcaption>
</figure>
<p>As you run the setup script, you should see output similar to the images below; the script creates a virtual environment, installs dependencies, and sets up the necessary directories. You can then activate the environment using the <code>source activate_env.sh</code> command.</p>
<figure>
<img src="images/env11.png" alt="Environment setup - 1 of 3" title="Environment setup - 1/3">
<figcaption><strong>Figure 2:</strong> Environment setup process - Step 1 of 3 showing virtual environment creation</figcaption>
</figure>
<figure>
<img src="images/env12.png" alt="Environment setup - 2 of 3" title="Environment setup - 2/3">
<figcaption><strong>Figure 3:</strong> Environment setup process - Step 2 of 3 showing dependency installation</figcaption>
</figure>
<figure>
<img src="images/env13.png" alt="Environment setup - 3 of 3" title="Environment setup - 3/3">
<figcaption><strong>Figure 4:</strong> Environment setup process - Step 3 of 3 showing final configuration</figcaption>
</figure>
<p>When you activate the environment using <strong><code>source activate_env.sh</code></strong>, its name appears in the console.</p>
<p>The default environment name is <strong><code>helloLondon</code></strong>. If you want a different name, modify the <code>venv_name</code> field in <code>environment_config.json</code> before running the setup script; it will then create the virtual environment with your preferred name.</p>
<p>Now that the configuration and environment are set up, we can validate them by running the command in <a href="#listing6" class="listing-ref">Listing 6</a>. This checks that everything is working and that the necessary dependencies are installed.</p>
<figure id="listing6"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>python3 -c <span style="color:#a6da95">&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">from config import config
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(&#39;🔧 Configuration Overview&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(&#39;=&#39; * 50)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(f&#39;Project Root: {config.project_root}&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(f&#39;Data Directory: {config.london_historical_data}&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(f&#39;Tokenizer Directory: {config.london_tokenizer_dir}&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(f&#39;Checkpoints Directory: {config.checkpoints_dir}&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(f&#39;Virtual Environment: {config.project_root}/helloLondon&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(f&#39;Vocabulary Size: {config.tokenizer_config[\&#34;vocab_size\&#34;]:,} tokens&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(f&#39;Special Tokens: {len(config.tokenizer_config[\&#34;special_tokens\&#34;])} tokens&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(f&#39;SLM Model: {config.slm_config[\&#34;model_name\&#34;]}&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(f&#39;Training Epochs: {config.slm_config[\&#34;num_epochs\&#34;]}&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(f&#39;Batch Size: {config.slm_config[\&#34;batch_size\&#34;]}&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(f&#39;Max Length: {config.slm_config[\&#34;max_length\&#34;]}&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">print(&#39;\\n🎯 Configuration looks good!&#39;)
</span></span></span><span style="display:flex;"><span><span style="color:#a6da95">&#34;</span></span></span></code></pre></div><figcaption>
        <strong>Listing 6: Validate Configuration</strong>
    </figcaption>
</figure>
<p>The following directory structure will be generated after executing the setup script. Please note that certain directories will remain empty until the data collection and training processes are initiated.</p>
<figure id="listing7"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-gdscript3" data-lang="gdscript3"><span style="display:flex;"><span>helloLondon<span style="color:#91d7e3;font-weight:bold">/</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">├──</span> <span style="color:#ed8796">📁</span> data<span style="color:#91d7e3;font-weight:bold">/</span>london_historical<span style="color:#91d7e3;font-weight:bold">/</span>          <span style="color:#6e738d;font-style:italic"># Historical text data</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📄</span> london_historical_corpus_comprehensive<span style="color:#91d7e3;font-weight:bold">.</span>txt  <span style="color:#6e738d;font-style:italic"># Final training corpus</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📁</span> downloads<span style="color:#91d7e3;font-weight:bold">/</span>                   <span style="color:#6e738d;font-style:italic"># Raw downloaded data</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📁</span> processed<span style="color:#91d7e3;font-weight:bold">/</span>                   <span style="color:#6e738d;font-style:italic"># Cleaned and processed text</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">└──</span> <span style="color:#ed8796">📁</span> metadata<span style="color:#91d7e3;font-weight:bold">/</span>                    <span style="color:#6e738d;font-style:italic"># Data collection metadata</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">├──</span> <span style="color:#ed8796">📁</span> <span style="color:#f5a97f">09</span>_models<span style="color:#91d7e3;font-weight:bold">/</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📁</span> checkpoints<span style="color:#91d7e3;font-weight:bold">/</span>                 <span style="color:#6e738d;font-style:italic"># Regular model checkpoints (354M)</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">│</span>   <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📄</span> checkpoint<span style="color:#91d7e3;font-weight:bold">-</span><span style="color:#f5a97f">44000.</span>pt
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">│</span>   <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📄</span> checkpoint<span style="color:#91d7e3;font-weight:bold">-</span><span style="color:#f5a97f">47000.</span>pt
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">│</span>   <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📄</span> checkpoint<span style="color:#91d7e3;font-weight:bold">-</span><span style="color:#f5a97f">51000.</span>pt
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">│</span>   <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📄</span> checkpoint<span style="color:#91d7e3;font-weight:bold">-</span><span style="color:#f5a97f">59000.</span>pt
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">│</span>   <span style="color:#ed8796">└──</span> <span style="color:#ed8796">📄</span> checkpoint<span style="color:#91d7e3;font-weight:bold">-</span><span style="color:#f5a97f">60001.</span>pt
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📁</span> checkpoints<span style="color:#91d7e3;font-weight:bold">/</span>slm<span style="color:#91d7e3;font-weight:bold">/</span>             <span style="color:#6e738d;font-style:italic"># SLM model checkpoints (117M)</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">│</span>   <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📄</span> checkpoint<span style="color:#91d7e3;font-weight:bold">-</span><span style="color:#f5a97f">52000.</span>pt
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">│</span>   <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📄</span> checkpoint<span style="color:#91d7e3;font-weight:bold">-</span><span style="color:#f5a97f">60000.</span>pt
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">│</span>   <span style="color:#ed8796">└──</span> <span style="color:#ed8796">📄</span> checkpoint<span style="color:#91d7e3;font-weight:bold">-</span><span style="color:#f5a97f">60001.</span>pt
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>   <span style="color:#ed8796">└──</span> <span style="color:#ed8796">📁</span> tokenizers<span style="color:#91d7e3;font-weight:bold">/</span>london_historical_tokenizer<span style="color:#91d7e3;font-weight:bold">/</span>  <span style="color:#6e738d;font-style:italic"># Custom tokenizer</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>       <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📄</span> tokenizer<span style="color:#91d7e3;font-weight:bold">.</span>json           <span style="color:#6e738d;font-style:italic"># Tokenizer configuration</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>       <span style="color:#ed8796">├──</span> <span style="color:#ed8796">📄</span> vocab<span style="color:#91d7e3;font-weight:bold">.</span>json               <span style="color:#6e738d;font-style:italic"># Vocabulary mapping</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">│</span>       <span style="color:#ed8796">└──</span> <span style="color:#ed8796">📄</span> merges<span style="color:#91d7e3;font-weight:bold">.</span>txt               <span style="color:#6e738d;font-style:italic"># BPE merge rules</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">├──</span> <span style="color:#ed8796">📁</span> helloLondon<span style="color:#91d7e3;font-weight:bold">/</span>                     <span style="color:#6e738d;font-style:italic"># Virtual environment</span>
</span></span><span style="display:flex;"><span><span style="color:#ed8796">└──</span> <span style="color:#ed8796">📁</span> logs<span style="color:#91d7e3;font-weight:bold">/</span>                            <span style="color:#6e738d;font-style:italic"># Training logs and WandB data</span></span></span></code></pre></div><figcaption>
        <strong>Listing 7: Project Directory Structure</strong>
    </figcaption>
</figure>
<blockquote>
<p><strong>Prerequisites</strong>: Before proceeding with the following steps, please verify the following requirements:</p>
<ul>
<li><strong>Storage</strong>: Minimum 20GB of free disk space and stable internet connectivity for data acquisition</li>
<li><strong>Hardware</strong>: GPU with 8GB+ VRAM for SLM training, 16GB+ VRAM for Regular model training. Cloud users should select appropriate instance types</li>
<li><strong>Experiment Tracking</strong> (Optional but highly recommended): <a
	
		href = "https://wandb.ai/site"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Weights &amp; Biases
	</span>
</a> account with <code>WANDB_API_KEY</code> environment variable configured for comprehensive training monitoring</li>
<li><strong>Dependencies</strong>: Required data processing libraries (nltk, beautifulsoup4, etc.) will be automatically installed via the setup script</li>
</ul></blockquote>
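<p>If you would like to check a couple of these prerequisites programmatically, here is a minimal sketch using only the Python standard library. The function names are hypothetical and this is not code from the repo. It covers only the disk space and <code>WANDB_API_KEY</code> items; checking GPU VRAM needs PyTorch or <code>nvidia-smi</code> and is left out.</p>

```python
import os
import shutil

MIN_FREE_GB = 20  # storage minimum taken from the prerequisites above


def free_disk_gb(path="."):
    """Free disk space at path, in gigabytes (1 GB = 1024**3 bytes)."""
    return shutil.disk_usage(path).free / 1024**3


def check_prerequisites(path=".", env=None):
    """Return a list of warnings; an empty list means both checks passed."""
    env = os.environ if env is None else env
    warnings = []
    free_gb = free_disk_gb(path)
    if free_gb < MIN_FREE_GB:
        warnings.append(
            f"Only {free_gb:.1f} GB free; at least {MIN_FREE_GB} GB is recommended."
        )
    if not env.get("WANDB_API_KEY"):
        warnings.append("WANDB_API_KEY is not set; WandB tracking is optional but recommended.")
    return warnings


if __name__ == "__main__":
    for warning in check_prerequisites():
        print("WARNING:", warning)
```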
<h2 id="5-data-collection">5. Data Collection</h2>
<p>The foundation of any language model lies in its training data. For our historical London models, we&rsquo;ve built a comprehensive data collection system that sources authentic text from <strong>218+ historical sources spanning 1500-1850</strong> - a remarkable 350-year window of London&rsquo;s linguistic evolution. This isn&rsquo;t just about downloading files; it&rsquo;s about curating a high-quality corpus that captures the authentic voice of historical London.</p>
<p>Our data collection pipeline automatically processes multiple formats (PDFs, HTML, XML, plain text) from diverse sources, including Project Gutenberg classics, Old Bailey trial records, London Lives manuscripts, and British History Online archives. The system includes sophisticated quality control measures: language detection to filter non-English content, OCR artifact correction, duplicate detection, and historical period validation to ensure every text genuinely represents the target era.</p>
<p>The result? A curated corpus of <strong>500M+ characters</strong> of authentic historical English text, ready to train models that understand not just the words, but the cultural context, social dynamics, and linguistic patterns of 16th through 19th-century London. Of course, you can always add your own data sources if you have them, and the system is designed to be extensible.</p>
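<p>To give a feel for what such quality control involves, here is a deliberately simplified sketch of three of the checks: dropping empty files, dropping noisy text, and removing duplicates. It is not the collector&rsquo;s actual implementation. The function names, the thresholds, and the ASCII-letter-ratio heuristic (a crude stand-in for real language detection and OCR scoring) are all my own assumptions.</p>

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def ascii_letter_ratio(text):
    """Fraction of non-space characters that are ASCII letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    letters = sum(1 for c in chars if c.isascii() and c.isalpha())
    return letters / len(chars)


def filter_documents(docs, min_chars=40, min_letter_ratio=0.7):
    """Drop short, noisy, and duplicate documents, preserving input order."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm) < min_chars:
            continue  # empty or near-empty file
        if ascii_letter_ratio(norm) < min_letter_ratio:
            continue  # likely OCR noise or a non-Latin script
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept
```

<p>Hashing the normalized text catches only exact duplicates; near-duplicate detection (for example, different editions of the same book) needs fuzzier techniques.</p>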
<p>We can kick off the data collection process using <a href="#listing8" class="listing-ref">Listing 8</a>. This will be run from the project root directory.</p>
<figure id="listing8"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Download historical data with advanced filtering</span>
</span></span><span style="display:flex;"><span>python 02_data_collection/historical_data_collector.py --max_sources <span style="color:#f5a97f">100</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># The system automatically filters:</span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># - Non-English content (Arabic, Chinese, etc.)</span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># - Poor OCR quality scans and gibberish</span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># - Advertisement-heavy commercial content  </span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># - Duplicate content and empty files</span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># - Special handling for Project Gutenberg classics</span></span></span></code></pre></div><figcaption>
        <strong>Listing 8: Download Historical Data</strong>
    </figcaption>
</figure>
<p>This process may take some time, depending on your internet speed, the number of sources you choose to download, and your system&rsquo;s performance. For me, on a very fast internet connection and a powerful machine, downloading and processing the full dataset typically took 2-4 hours. The script saves the cleaned and processed data in the <code>data/london_historical/</code> directory.</p>
<p>The main training file, <strong><code>london_historical_corpus_comprehensive.txt</code></strong>, contains 270M+ characters (~258MB) of authentic historical text. The complete data directory spans approximately 1.2GB, including 521MB of raw downloaded sources, 263MB of processed and cleaned content, and 126MB of tokenized training sequences ready for model training. The image below shows the data collection in progress.</p>
<figure>
<img src="images/data3.png" alt="Data Collection in Progress" title="Data Collection in Progress">
<figcaption><strong>Figure 5:</strong> Data collection process in progress showing real-time processing of historical sources</figcaption>
</figure>
<p>The final corpus is a substantial collection of historical London text assembled for language model training, with authentic content spanning 350 years of linguistic evolution. The two images below come from one of my runs: the first shows the final output of the data cleaning with summary statistics, and the second shows the size of the data on disk.</p>
<figure>
<img src="images/data11.png" alt="Data Collection Summary" title="Data Collection Summary">
<figcaption><strong>Figure 6:</strong> Data collection summary showing final statistics and corpus composition</figcaption>
</figure>
<p>The image below shows the total size of the data on disk at the end of the run. Note that this does not include the Old Bailey and London Lives data.</p>
<figure>
<img src="images/data12.png" alt="Total Data Size" title="Total Data Size">
<figcaption><strong>Figure 7:</strong> Total data size on disk showing the complete historical corpus storage requirements</figcaption>
</figure>
<p>Now that we have collected and cleaned our data, let us build a custom tokenizer.</p>
<h2 id="6-train-custom-tokenizer">6. Train Custom Tokenizer</h2>
<p>With our cleaned historical corpus ready, we now need to create a custom tokenizer specifically designed for historical English. Standard tokenizers like GPT-2&rsquo;s are optimized for modern text and handle historical language poorly: they split archaic words like &ldquo;quoth&rdquo; and &ldquo;hast&rdquo; into multiple subword fragments, which costs both meaning and efficiency.</p>
<p>Our custom tokenizer uses Byte Pair Encoding (BPE) with a 30,000-token vocabulary and 150+ carefully designed special tokens that cover:</p>
<ul>
<li><strong>Historical Language</strong>: Archaic pronouns (<code>&lt;|thou|&gt;</code>, <code>&lt;|thee|&gt;</code>), verbs (<code>&lt;|hast|&gt;</code>, <code>&lt;|doth|&gt;</code>), and expressions (<code>&lt;|verily|&gt;</code>, <code>&lt;|forsooth|&gt;</code>)</li>
<li><strong>London Geography</strong>: Landmarks (<code>&lt;|thames|&gt;</code>, <code>&lt;|newgate|&gt;</code>, <code>&lt;|tower|&gt;</code>), streets (<code>&lt;|cheapside|&gt;</code>, <code>&lt;|fleet|&gt;</code>), and districts (<code>&lt;|southwark|&gt;</code>, <code>&lt;|westminster|&gt;</code>)</li>
<li><strong>Historical Context</strong>: Period markers (<code>&lt;|tudor|&gt;</code>, <code>&lt;|stuart|&gt;</code>, <code>&lt;|georgian|&gt;</code>), social classes (<code>&lt;|noble|&gt;</code>, <code>&lt;|commoner|&gt;</code>), and professions (<code>&lt;|apothecary|&gt;</code>, <code>&lt;|coachman|&gt;</code>)</li>
</ul>
<p>This specialized vocabulary ensures that common historical terms remain as single tokens rather than being fragmented, improving both training efficiency and text generation quality. We can kick off tokenizer training using <a href="#listing9" class="listing-ref">Listing 9</a>. Again, run this from the project root directory.</p>
<figure id="listing9"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Train historical tokenizer (30k vocabulary)</span>
</span></span><span style="display:flex;"><span>python 03_tokenizer/train_historical_tokenizer.py</span></span></code></pre></div><figcaption>
        <strong>Listing 9: Train Historical Tokenizer</strong>
    </figcaption>
</figure>
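<p>To make the BPE idea concrete, here is a toy, pure-Python sketch of how merge rules are learned and applied. It is not the code in <code>train_historical_tokenizer.py</code>. The function names are mine, the corpus is a handful of words, and treating special tokens as atomic strings that bypass merging is a simplifying assumption for illustration.</p>

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair with the merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge rules from a word -> frequency mapping."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def encode(word, merges, special_tokens=()):
    """Tokenize one word; special tokens pass through as a single token."""
    if word in special_tokens:
        return [word]
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


word_freqs = Counter("quoth he and quoth she that thou hast what thou hast".split())
merges = learn_bpe(word_freqs, num_merges=20)
print(encode("quoth", merges))
print(encode("<|thames|>", merges, special_tokens={"<|thames|>"}))
```

<p>The more often a character sequence appears in the corpus, the earlier it gets merged, which is why training on historical text keeps frequent archaic words in fewer pieces than a tokenizer trained on modern text would.</p>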
<p>The training process analyzes our 270M+ character corpus to learn optimal token boundaries, creating a tokenizer that understands the linguistic patterns of 1500-1850 English. The result is a highly efficient tokenizer with a compression ratio of ~0.3 tokens per character and 99%+ reconstruction accuracy - essential for training models that can generate authentic historical text.</p>
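<p>The tokens-per-character figure is straightforward to measure yourself. The sketch below shows the calculation with a stand-in tokenizer (one token per whitespace-separated word), so the number it prints says nothing about the real tokenizer. To measure that, you would pass in your own tokenizer&rsquo;s encode function instead of the hypothetical <code>toy_encode</code>.</p>

```python
def tokens_per_character(texts, encode):
    """Average number of tokens emitted per input character."""
    total_tokens = sum(len(encode(text)) for text in texts)
    total_chars = sum(len(text) for text in texts)
    return total_tokens / total_chars if total_chars else 0.0


def toy_encode(text):
    """Stand-in tokenizer for illustration only: one token per word."""
    return text.split()


sample = ["quoth he to the coachman", "thou hast come to cheapside"]
ratio = tokens_per_character(sample, toy_encode)
print(f"{ratio:.2f} tokens per character")
```

<p>A lower ratio means each token covers more text, so the same context window holds more of the original document.</p>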
<p>Once the training is finished (and usually it is pretty quick - just a few minutes for our data size), we run a quick sanity test as the image below shows.</p>
<figure>
<img src="images/tokenizer-8.png" alt="Custom Tokenizer Training" title="Custom Tokenizer Training">
<figcaption><strong>Figure 8:</strong> Custom tokenizer training process showing BPE algorithm learning optimal token boundaries</figcaption>
</figure>
<p>Note that in testing, we might see a warning that the reconstruction differs; this is only because the letter case differs and is expected, so you can ignore it. An example is shown below.</p>
<figure>
<img src="images/tokenizer-7.png" alt="Tokenizer reconstruction warning" title="Tokenizer reconstruction warning">
<figcaption><strong>Figure 9:</strong> Tokenizer reconstruction warning showing expected case normalization differences</figcaption>
</figure>
<p><strong>Why the &ldquo;Reconstruction differs&rdquo; warning is actually beneficial:</strong></p>
<p>The reconstruction differences you see are not errors - they&rsquo;re the tokenizer working exactly as designed for optimal language model training. The tokenizer uses Byte Pair Encoding (BPE), which breaks complex words into smaller, reusable subword units (like &ldquo;Bourgh&rdquo; → &ldquo;bour ##gh&rdquo;), and normalizes text to lowercase to reduce vocabulary size. These &ldquo;differences&rdquo; are actually features that make the tokenizer more efficient and the resulting language model more capable of generating authentic historical text.</p>
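<p>Here is a minimal sketch of why a lowercasing tokenizer triggers this warning while losing nothing except letter case. The encode and decode functions are stand-ins I wrote for illustration (lowercase, then split on whitespace), not the project&rsquo;s tokenizer, and <code>roundtrip_report</code> is a hypothetical helper.</p>

```python
def roundtrip_report(text, encode, decode):
    """Compare a text with the result of encoding and then decoding it."""
    restored = decode(encode(text))
    return {
        "exact_match": restored == text,
        "case_insensitive_match": restored.lower() == text.lower(),
        "restored": restored,
    }


def toy_encode(text):
    """Stand-in tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def toy_decode(tokens):
    """Stand-in decoder: join tokens with single spaces."""
    return " ".join(tokens)


report = roundtrip_report("Lady Catherine de Bourgh", toy_encode, toy_decode)
print(report)
```

<p>The exact comparison fails, which is what raises the warning, while the case-insensitive comparison passes, confirming that only capitalization changed.</p>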
<blockquote>
<p><strong>📖 For detailed technical explanation</strong>: Part 2 of this series covers the complete tokenizer architecture, BPE implementation, special token design, and why these reconstruction differences are essential for optimal language model training.</p></blockquote>
<p>Now that we have our data and the tokenizer is ready, it is time to train the model.</p>
<h2 id="7-train-the-model">7. Train the Model</h2>
<p>With our cleaned historical corpus and custom tokenizer in place, we can now train our language models. The training system builds two models that share the same architecture but differ in parameter count, allowing you to choose between speed (SLM) and quality (Regular model) based on your needs.</p>
<p><strong>Training Architecture:</strong> Both models use a custom GPT architecture specifically optimized for historical text, featuring sophisticated attention mechanisms that understand the complex relationships in historical language. The system includes automatic GPU detection, multi-GPU support, and comprehensive monitoring to ensure optimal training performance.</p>
<p><strong>Training Process:</strong> The training system implements modern optimization techniques, including dynamic learning rate scheduling, automatic checkpointing, and real-time experiment tracking via WandB. The entire process is automated with intelligent configuration that adapts to your hardware setup, whether you&rsquo;re using a single GPU or multiple GPUs for distributed training.</p>
<p><strong>Performance Optimization:</strong> The system includes precision optimization (TF32, AMP) and memory management specifically tuned for historical text processing. Training typically takes 7-8 hours for the SLM and 28-32 hours for the Regular model on dual NVIDIA A30 GPUs, with comprehensive monitoring to track progress and identify any issues. These times can vary significantly with your hardware.</p>
<blockquote>
<p><strong>📖 For detailed technical implementation</strong>: Part 3 of this series covers the complete model architecture, GPU configuration, training infrastructure, and performance optimization strategies in detail.<br>
<strong>🧪 Ready to test your checkpoints?</strong> Once training completes, see section 7.4 for comprehensive instructions on testing your trained model checkpoints.</p></blockquote>
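<p>As an illustration of the dynamic learning rate scheduling mentioned above, here is a sketch of one common pattern: linear warmup followed by cosine decay. I am assuming this shape purely for illustration. The post does not state which schedule the training scripts use here, and every number below (peak rate, floor, warmup length, total steps) is a placeholder rather than a value taken from <code>config.py</code>.</p>

```python
import math


def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=60000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; step 0 gets a small non-zero learning rate.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress runs from 0.0 at the end of warmup to 1.0 at total_steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


if __name__ == "__main__":
    for step in (0, 1000, 2000, 31000, 60000):
        print(f"step {step:>6}: lr = {lr_at_step(step):.2e}")
```

<p>The warmup avoids large, destabilizing updates while the optimizer statistics are still settling, and the slow decay lets the model refine what it has learned toward the end of training.</p>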
<h3 id="71-slm-training">7.1 SLM Training</h3>
<p>Kicking off the training is quite simple, as shown in <a href="#listing10" class="listing-ref">Listing 10</a>. Again, run this from the project root folder. In my case, I use <code>torchrun --nproc_per_node=2</code> because I have dual GPUs and want to use both. If you only have a single GPU, you can just run the automatic GPU detection script. The <code>train_model_slm.py</code> script trains the SLM (Small Language Model) with 117M parameters.</p>
<p><strong>Option A: Train SLM (117M parameters) - Faster, Good for Testing</strong></p>
<figure id="listing10"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Clean any existing tokenized data</span>
</span></span><span style="display:flex;"><span>rm -rf data/london_historical/tokenized_data/
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Automatic GPU Detection (Recommended)</span>
</span></span><span style="display:flex;"><span><span style="color:#91d7e3">cd</span> 04_training
</span></span><span style="display:flex;"><span>./launch_slm_training.sh
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Manual Multi-GPU training (alternative to the launch script; run from the project root)</span>
</span></span><span style="display:flex;"><span>torchrun --nproc_per_node<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">2</span> 04_training/train_model_slm.py --data_dir data/london_historical</span></span></code></pre></div><figcaption>
        <strong>Listing 10: Train SLM Model</strong>
    </figcaption>
</figure>
<p>Note: The first line, <code>rm -rf data/london_historical/tokenized_data/</code>, deletes any cached tokenized data so that training starts fresh. The training system caches tokenized data for efficiency, so a stale cache would otherwise be reused instead of the latest corpus and tokenizer settings. You only need this step if the corpus or tokenizer changed in the previous steps; otherwise, keep the cache and skip re-tokenization.</p>
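<p>The decision of when to clear the cache can be sketched in a few lines of Python. This is a minimal illustration, not the project&rsquo;s actual logic: the helper names <code>newest_mtime</code> and <code>cache_is_stale</code>, and the idea of comparing file modification times, are assumptions made for this example. It also assumes the source files and the cache live in separate directories.</p>

```python
import tempfile
from pathlib import Path


def newest_mtime(path: Path) -> float:
    """Return the most recent modification time of any file under path (0.0 if none)."""
    files = [p for p in path.rglob("*") if p.is_file()]
    return max((p.stat().st_mtime for p in files), default=0.0)


def cache_is_stale(source_dir: Path, cache_dir: Path) -> bool:
    """Hypothetical check: re-tokenize if the cache is missing, empty, or older than the sources."""
    if not cache_dir.is_dir():
        return True
    cache_time = newest_mtime(cache_dir)
    return cache_time == 0.0 or newest_mtime(source_dir) > cache_time


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        corpus = Path(tmp) / "corpus"
        cache = Path(tmp) / "tokenized_data"
        corpus.mkdir()
        (corpus / "london.txt").write_text("Cheapside, 1834")
        # No cache directory exists yet, so the check reports a stale cache.
        print(cache_is_stale(corpus, cache))
```

<p>If a check like this reports a stale cache, deleting the tokenized data directory as shown in Listing 10 forces a fresh tokenization on the next run.</p>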
<p>Once training starts, you will see output similar to that shown below.</p>
<figure>
<img src="images/train16.png" alt="Starting model training" title="Starting model training">
<figcaption><strong>Figure 10:</strong> Model training initialization showing configuration, tokenization, and GPU setup</figcaption>
</figure>
<p>Note the Tokenizing corpus line: this step can take some time, depending on your data size and hardware. The tokenized data is saved in <code>data/london_historical/tokenized_data/</code>, so subsequent training runs start much faster. To force re-tokenization, delete this directory and restart the training. If the process appears to hang, you can check GPU usage with <code>nvtop</code> in a separate terminal.</p>
<p>If you configured WandB as recommended earlier, you can also log in to its dashboard to monitor training progress. This is handy when you are away from the machine and want to see how training is progressing.</p>
<figure>
<img src="images/train6.png" alt="WandB training progress" title="WandB training progress">
<figcaption><strong>Figure 11:</strong> Weights &amp; Biases training dashboard showing real-time loss curves and performance metrics</figcaption>
</figure>
<p>WandB also visualizes the whole training run: how your model&rsquo;s loss decreased over time, whether training plateaued, and how efficiently your hardware was utilized. These views help you understand the entire learning process, not just the final numbers.</p>
<p>While these metrics are incredibly useful for optimizing your training process, we&rsquo;ll dive deeper into interpreting these results and fine-tuning your training strategy in Part 3 of this series.</p>
<p><strong>SLM Results (117M parameters):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>wandb: Run history:
</span></span><span style="display:flex;"><span>wandb:       eval/iter   ▂▂▃▃▄▄▅▅▆▆▇▇██
</span></span><span style="display:flex;"><span>wandb: eval/train_loss  ███▇▇▇▇▇▇▇▇▇▇▇▇
</span></span><span style="display:flex;"><span>wandb:   eval/val_loss  ███████▇▇▇█▇▇▇▇
</span></span><span style="display:flex;"><span>wandb:    eval/val_ppl  █▇▇▇▇▇▆▆▆▆▆▆▆▆▆
</span></span><span style="display:flex;"><span>wandb:     train/dt_ms           █            █                
</span></span><span style="display:flex;"><span>wandb:      train/iter      ▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇██
</span></span><span style="display:flex;"><span>wandb:      train/loss ▆▅▇▅▅▃▇▄▄█▅▄▅▄▃▇▄▄▅ ▃▃▂▄▅▂▅▂▄▅▃▃▄▅ ▄▃
</span></span><span style="display:flex;"><span>wandb:        train/lr ██████████▇▇▇▅▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂           
</span></span><span style="display:flex;"><span>wandb:       train/mfu ▃▄▇▇█▄▄▆▆▇▅▂▅▆▆▇▇▂▄▅▇▇▇▆▆▇▇▇▇▅███▅▇▆▇▇ ▇
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>wandb: Run summary:
</span></span><span style="display:flex;"><span>wandb:       eval/iter 60000
</span></span><span style="display:flex;"><span>wandb: eval/train_loss 2.74369
</span></span><span style="display:flex;"><span>wandb:   eval/val_loss 3.44089
</span></span><span style="display:flex;"><span>wandb:    eval/val_ppl 31.21462
</span></span><span style="display:flex;"><span>wandb:     train/dt_ms 10217.92054
</span></span><span style="display:flex;"><span>wandb:      train/iter 60000
</span></span><span style="display:flex;"><span>wandb:      train/loss 2.87667
</span></span><span style="display:flex;"><span>wandb:        train/lr 3e-05
</span></span><span style="display:flex;"><span>wandb:       train/mfu 7.50594</span></span></code></pre></div>
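<p>The <code>eval/val_loss</code> and <code>eval/val_ppl</code> values in the summary are two views of the same quantity. A minimal sketch, assuming the training code uses the standard definition of perplexity as the exponential of the mean cross-entropy loss in nats; the logged numbers are consistent with that assumption:</p>

```python
import math

# Values copied from the SLM run summary above.
val_loss = 3.44089
reported_ppl = 31.21462


def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(cross_entropy_nats)


computed = perplexity(val_loss)
print(f"computed perplexity: {computed:.3f}")
# The two values should agree up to rounding of the logged loss.
assert math.isclose(computed, reported_ppl, rel_tol=1e-4)
```

<p>Read this way, a validation perplexity of about 31 means the model is, on average, roughly as uncertain as if it were choosing uniformly among 31 tokens at each step.</p>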
<p>It&rsquo;s also helpful to monitor GPU usage during training. I recommend using <code>nvtop</code> (a GPU monitoring tool similar to <code>htop</code> but for NVIDIA GPUs) in a separate terminal to track memory usage, temperature, and utilization in real-time. The screenshot below shows the GPU monitoring during model training.</p>
<figure>
<img src="images/train16-4.png" alt="GPU monitoring using nvtop" title="GPU monitoring using nvtop">
<figcaption><strong>Figure 12:</strong> GPU monitoring with nvtop during training, showing memory usage, temperature, and utilization</figcaption>
</figure>
<h3 id="72-understanding-checkpoints">7.2 Understanding Checkpoints</h3>
<p>Throughout training, the system automatically saves checkpoints - snapshots of your model&rsquo;s current state, including all learned parameters, optimizer state, and training progress. These checkpoints serve as safety nets, allowing you to resume training if interrupted, and provide multiple model versions to choose from. The final checkpoint (typically saved at the end of training) represents your fully trained model, ready for inference and deployment.</p>
<p>Checkpoints are saved under the <code>09_models/checkpoints/</code> directory. SLM checkpoints are stored in the <code>09_models/checkpoints/slm/</code> subdirectory (e.g., <code>checkpoint-4000.pt</code>, <code>checkpoint-8000.pt</code>), while Regular model checkpoints are saved directly in <code>09_models/checkpoints/</code> (e.g., <code>checkpoint-60001.pt</code>, <code>checkpoint-120000.pt</code>). Each filename includes the training step number, making it easy to identify the training progress and select the best-performing version for your needs.</p>
<p>These checkpoints enable two capabilities that significantly improve your training workflow. First, you can test your model&rsquo;s current performance at any point by running inference on intermediate checkpoints, so you can monitor progress without waiting for training to complete. Second, if training is interrupted by power loss, a system crash, or a manual stop, you can resume from the last saved checkpoint exactly where you left off, saving both time and computational resources. This flexibility is particularly valuable for long training runs, letting you experiment with different model versions and recover from unexpected interruptions.</p>
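<p>Because the step number is part of each filename, finding the most recent checkpoint is a small parsing exercise. The sketch below is illustrative only: <code>checkpoint_step</code> and <code>latest_checkpoint</code> are hypothetical helpers, not functions from the project, and they assume every checkpoint follows the <code>checkpoint-&lt;step&gt;.pt</code> pattern shown above.</p>

```python
import re
from pathlib import Path
from typing import Optional

# Matches the naming pattern shown above, e.g. checkpoint-4000.pt
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)\.pt$")


def checkpoint_step(filename: str) -> Optional[int]:
    """Extract the training step from a checkpoint filename, or None if it does not match."""
    match = CHECKPOINT_PATTERN.match(filename)
    return int(match.group(1)) if match else None


def latest_checkpoint(directory: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step directly inside directory (no recursion)."""
    candidates = []
    for path in directory.glob("checkpoint-*.pt"):
        step = checkpoint_step(path.name)
        if step is not None:
            candidates.append((step, path))
    if not candidates:
        return None
    # Compare numerically: as strings, "checkpoint-8000" would sort after "checkpoint-12000".
    return max(candidates, key=lambda item: item[0])[1]
```

<p>For example, <code>latest_checkpoint(Path("09_models/checkpoints/slm"))</code> would return the newest SLM checkpoint, or <code>None</code> if training has not saved one yet.</p>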
<blockquote>
<p><strong>🧪 Ready to test your checkpoints?</strong> See section 7.4 for detailed instructions on testing your trained model checkpoints.</p></blockquote>
<h3 id="73-regular-model-training">7.3 Regular Model Training</h3>
<p>The Regular model training follows the same process as the SLM, using identical training infrastructure but with different configuration settings. The only differences are the training script (<code>train_model.py</code> instead of <code>train_model_slm.py</code>) and the model architecture parameters (354M parameters vs 117M).</p>
<figure id="listing11"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Clean any existing tokenized data</span>
</span></span><span style="display:flex;"><span>rm -rf data/london_historical/tokenized_data/
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Automatic GPU Detection (Recommended)</span>
</span></span><span style="display:flex;"><span><span style="color:#91d7e3">cd</span> 04_training
</span></span><span style="display:flex;"><span>./launch_training.sh
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Manual Multi-GPU training (alternative to the launch script; run from the project root)</span>
</span></span><span style="display:flex;"><span>torchrun --nproc_per_node<span style="color:#91d7e3;font-weight:bold">=</span><span style="color:#f5a97f">2</span> 04_training/train_model.py --data_dir data/london_historical</span></span></code></pre></div><figcaption>
        <strong>Listing 11: Train Regular Model</strong>
    </figcaption>
</figure>
<p><strong>Key Differences from SLM:</strong></p>
<ul>
<li><strong>Training script</strong>: <code>train_model.py</code> (instead of <code>train_model_slm.py</code>)</li>
<li><strong>Model size</strong>: 354M parameters (vs 117M for SLM)</li>
<li><strong>Training time</strong>: 28-32 hours (vs 7-8 hours for SLM)</li>
<li><strong>Memory usage</strong>: Higher VRAM requirements</li>
<li><strong>Performance</strong>: Better text quality, slower inference</li>
</ul>
<p>The training infrastructure, checkpointing, WandB integration, and all other features remain identical. The system automatically detects the model type and applies the appropriate configuration from <code>config.py</code>.</p>
<p><strong>Regular Model Results (354M parameters):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-fallback" data-lang="fallback"><span style="display:flex;"><span>wandb: Run history:
</span></span><span style="display:flex;"><span>wandb:       eval/iter     ▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
</span></span><span style="display:flex;"><span>wandb: eval/train_loss  █████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▇▆▆▆▆▆▆▆▆▆
</span></span><span style="display:flex;"><span>wandb:   eval/val_loss  ███████████████████████████████████▇███
</span></span><span style="display:flex;"><span>wandb:    eval/val_ppl  ████▇▇█▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▇▇▆▇▇▆▆▆▆▆▆▆▆
</span></span><span style="display:flex;"><span>wandb:     train/dt_ms                  █                      
</span></span><span style="display:flex;"><span>wandb:      train/iter      ▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇███
</span></span><span style="display:flex;"><span>wandb:      train/loss ▇▆▆▇▇▅▅█▅▄▃▅▅▅▄▇▄▄▄▄▄▃▃▃▅▂▄▅▂▅▂▄▅▃▃▄▅ ▄▃
</span></span><span style="display:flex;"><span>wandb:        train/lr ▄██████▇▇▇▇▇▆▆▆▅▅▄▄▄▄▄▄▄▄▃▃▂▂▂          
</span></span><span style="display:flex;"><span>wandb:       train/mfu ▆▇█▅▄ ▄▆▆▆▇▃▃▂▂▆█▃▃▅▅▃█▅▄▆▇▇▇▇▄▅▃█▆▇█▄▃█
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>wandb: Run summary:
</span></span><span style="display:flex;"><span>wandb:       eval/iter 60000
</span></span><span style="display:flex;"><span>wandb: eval/train_loss 2.70315
</span></span><span style="display:flex;"><span>wandb:   eval/val_loss 3.61921
</span></span><span style="display:flex;"><span>wandb:    eval/val_ppl 37.30823
</span></span><span style="display:flex;"><span>wandb:     train/dt_ms 24681.64754
</span></span><span style="display:flex;"><span>wandb:      train/iter 60000
</span></span><span style="display:flex;"><span>wandb:      train/loss 2.70629
</span></span><span style="display:flex;"><span>wandb:        train/lr 0.0
</span></span><span style="display:flex;"><span>wandb:       train/mfu 7.20423</span></span></code></pre></div>
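<p>One way to read the two run summaries side by side is to compare the gap between validation and training loss. The sketch below is only an illustration: the &ldquo;gap&rdquo; is a simple derived number, not a metric the training scripts are known to log, and it says nothing definitive about generated text quality.</p>

```python
# Final eval losses copied from the two run summaries in this post.
runs = {
    "SLM (117M)": {"train_loss": 2.74369, "val_loss": 3.44089},
    "Regular (354M)": {"train_loss": 2.70315, "val_loss": 3.61921},
}


def generalization_gap(train_loss: float, val_loss: float) -> float:
    """Validation loss minus training loss; a larger gap can hint at overfitting."""
    return val_loss - train_loss


for name, losses in runs.items():
    gap = generalization_gap(losses["train_loss"], losses["val_loss"])
    print(f"{name}: train={losses['train_loss']:.3f} val={losses['val_loss']:.3f} gap={gap:.3f}")
```

<p>In these runs, the Regular model reaches a slightly lower training loss but a higher validation loss than the SLM, which is worth keeping in mind when choosing which checkpoints to test.</p>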
<h3 id="74-testing-your-checkpoints">7.4 Testing Your Checkpoints</h3>
<p>Once training is complete, you can immediately test your model using the checkpoints saved during training. This is one of the most exciting parts - seeing the model generate historical text for the first time! PyTorch checkpoints can be tested directly, with no conversion step, so you can also run any intermediate checkpoint to monitor training progress. Each checkpoint preserves the complete model state, including training metadata and optimizer state, although only the model weights are needed for inference.</p>
<p><strong>Direct PyTorch Checkpoint Testing:</strong>
Test your model directly from the training checkpoints without any conversion:</p>
<figure id="listing12"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Test SLM checkpoint (117M parameters)</span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_pytorch.py <span style="color:#8aadf4">\
</span></span></span><span style="display:flex;"><span><span style="color:#8aadf4"></span>  --checkpoint 09_models/checkpoints/slm/checkpoint-4000.pt <span style="color:#8aadf4">\
</span></span></span><span style="display:flex;"><span><span style="color:#8aadf4"></span>  --prompt <span style="color:#a6da95">&#34;In the year 1834, I walked through the streets of London and witnessed&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Test Regular model checkpoint (354M parameters)  </span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_pytorch.py <span style="color:#8aadf4">\
</span></span></span><span style="display:flex;"><span><span style="color:#8aadf4"></span>  --checkpoint 09_models/checkpoints/checkpoint-60001.pt <span style="color:#8aadf4">\
</span></span></span><span style="display:flex;"><span><span style="color:#8aadf4"></span>  --prompt <span style="color:#a6da95">&#34;In the year 1834, I walked through the streets of London and witnessed&#34;</span></span></span></code></pre></div><figcaption>
        <strong>Listing 12: Test Model Checkpoints</strong>
    </figcaption>
</figure>
<p><strong>Expected Output:</strong>
A well-trained model should generate period-style text along these lines:</p>
<blockquote>
<p>&ldquo;In the year 1834, I walked through the streets of London and witnessed the most extraordinary sight. The Thames flowed dark beneath London Bridge, whilst carriages rattled upon the cobblestones with great urgency. Merchants called their wares from Cheapside to Billingsgate, and the smoke from countless chimneys did obscure the morning sun.&rdquo;</p></blockquote>
<p><strong>Testing Different Checkpoints:</strong>
You can test any checkpoint from your training run to see how the model improved over time. Try testing checkpoints from different training stages to observe the learning progression - early checkpoints will generate more random text, while later checkpoints will produce increasingly coherent historical language.</p>
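<p>To compare several checkpoints with the same prompt, you can generate the commands programmatically. This is a minimal sketch that only reuses the script path and the <code>--checkpoint</code> and <code>--prompt</code> flags from Listing 12; the helper <code>build_commands</code> is hypothetical, and the checkpoint paths are examples that you should replace with files from your own run.</p>

```python
import shlex
from typing import List

INFERENCE_SCRIPT = "06_inference/inference_pytorch.py"  # script used in Listing 12
PROMPT = "In the year 1834, I walked through the streets of London and witnessed"


def build_commands(checkpoints: List[str], prompt: str = PROMPT) -> List[List[str]]:
    """Build one inference command (as an argv list) per checkpoint path."""
    return [
        ["python", INFERENCE_SCRIPT, "--checkpoint", ckpt, "--prompt", prompt]
        for ckpt in checkpoints
    ]


if __name__ == "__main__":
    # Example early and later SLM checkpoints; adjust to the files you actually have.
    paths = [
        "09_models/checkpoints/slm/checkpoint-4000.pt",
        "09_models/checkpoints/slm/checkpoint-8000.pt",
    ]
    for argv in build_commands(paths):
        # Print a copy-pasteable shell command; pass argv to subprocess.run to execute it.
        print(shlex.join(argv))
```

<p>Keeping the prompt fixed across checkpoints makes the outputs directly comparable, so differences reflect training progress rather than prompt wording.</p>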
<blockquote>
<p><strong>💡 Pro Tip</strong>: For published Hugging Face models and community access, see the Quick Start section earlier in this post, where we demonstrated the published SLM model.</p></blockquote>
<h2 id="8-publish-to-hugging-face">8. Publish to Hugging Face</h2>
<p>Once you&rsquo;ve successfully trained and tested your models, you can publish them to Hugging Face for community access and easy deployment. Publishing makes your models available to researchers, developers, and enthusiasts worldwide, while integrating them into the Hugging Face ecosystem for seamless use with the <code>transformers</code> library.</p>
<p><strong>Publishing Process:</strong>
The publishing code automatically handles the complete conversion process from PyTorch checkpoints to Hugging Face format, which is essential for making your trained models accessible to the broader community. This conversion transforms your local training artifacts into a standardized format that can be easily loaded by users worldwide.</p>
<p>The process includes:</p>
<ul>
<li>Converting model weights from PyTorch&rsquo;s <code>.pt</code> format to the more efficient <code>.safetensors</code> format</li>
<li>Generating configuration files (<code>config.json</code>, <code>generation_config.json</code>) that define the model architecture and generation parameters</li>
<li>Uploading the custom tokenizer and all other files needed for complete functionality</li>
<li>Creating model cards with usage instructions and metadata for easy adoption</li>
<li>Setting up model repositories with versioning for educational deployment</li>
</ul>
<p>This conversion is necessary because PyTorch checkpoints are optimized for training workflows and contain additional information like optimizer states that aren&rsquo;t needed for inference, while the Hugging Face format is specifically designed for model sharing and deployment across different environments and hardware configurations.</p>
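<p>The difference between a training checkpoint and an inference artifact can be illustrated with plain dictionaries. The key names below are an assumption, loosely modelled on nanoGPT-style checkpoints; the project&rsquo;s real checkpoints may use different keys, and <code>inference_payload</code> is a hypothetical helper rather than part of the publishing scripts.</p>

```python
from typing import Any, Dict

# Assumed checkpoint layout, loosely modelled on nanoGPT-style checkpoints.
# The real key names in this project may differ.
TRAINING_ONLY_KEYS = {"optimizer", "iter_num", "best_val_loss"}


def inference_payload(checkpoint: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the checkpoint without entries that only matter for resuming training."""
    return {key: value for key, value in checkpoint.items() if key not in TRAINING_ONLY_KEYS}


if __name__ == "__main__":
    # Toy stand-in for a loaded checkpoint; real values would be tensors.
    toy_checkpoint = {
        "model": {"wte.weight": [0.1, 0.2]},
        "model_args": {"n_layer": 12},
        "optimizer": {"state": {}, "param_groups": []},
        "iter_num": 60000,
        "best_val_loss": 3.44,
    }
    print(sorted(inference_payload(toy_checkpoint)))
```

<p>Dropping the optimizer state matters in practice because optimizers such as Adam keep extra per-parameter statistics, which can make a training checkpoint much larger than the weights alone.</p>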
<p>We need to call the right script to publish the relevant model - either the SLM or the larger model. The publishing scripts will prompt you for your Hugging Face username and repository name, allowing you to customize where your models are published. The scripts automatically detect and use the latest checkpoint from your training run, so you can publish immediately after training completes.</p>
<blockquote>
<p><strong>💡 Quick Reference</strong>: If you want to test published models before publishing your own, see section 2 &ldquo;Use the models - Try it now using Hugging Face&rdquo; for immediate access to pre-trained models.</p></blockquote>
<p><strong>Prerequisites:</strong> You&rsquo;ll need a Hugging Face account and either set the <code>HF_TOKEN</code> environment variable or provide your token when prompted. The scripts will guide you through the publishing process step by step.</p>
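<p>The token lookup described above follows a common pattern: prefer the environment variable and fall back to an interactive prompt. The sketch below shows that pattern under stated assumptions; <code>resolve_hf_token</code> is a hypothetical helper, and the publishing scripts may implement this differently.</p>

```python
import os
from getpass import getpass
from typing import Callable, Mapping


def resolve_hf_token(
    env: Mapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Prefer the HF_TOKEN environment variable; otherwise ask the user for a token."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # getpass hides the typed token so it does not end up in terminal scrollback.
        token = prompt("Hugging Face token: ").strip()
    if not token:
        raise ValueError("No Hugging Face token provided")
    return token
```

<p>Setting <code>HF_TOKEN</code> in your shell before running the publishing scripts avoids the prompt entirely, which is convenient for unattended runs.</p>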
<p><strong>Option A: Publish SLM (117M parameters)</strong></p>
<figure id="listing13"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Publish SLM to Hugging Face</span>
</span></span><span style="display:flex;"><span>python 10_scripts/publish_slm_to_huggingface.py</span></span></code></pre></div><figcaption>
        <strong>Listing 13: Publish SLM to Hugging Face</strong>
    </figcaption>
</figure>
<p><strong>Option B: Publish Regular Model (354M parameters)</strong></p>
<figure id="listing14"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Publish Regular model to Hugging Face  </span>
</span></span><span style="display:flex;"><span>python 10_scripts/publish_to_huggingface.py</span></span></code></pre></div><figcaption>
        <strong>Listing 14: Publish Regular Model to Hugging Face</strong>
    </figcaption>
</figure>
<p>If everything is working correctly and the models are published, you will see confirmation messages and upload progress. Here&rsquo;s what successful publishing looks like:</p>
<figure>
<img src="images/hf05.png" alt="HF - SLM upload" title="HF - SLM upload">
<figcaption><strong>Figure 13:</strong> Hugging Face upload process for the SLM, showing upload progress and confirmation</figcaption>
</figure>
<p>And this is an example output for the Regular model:</p>
<figure>
<img src="images/hf06-regular-model.png" alt="HF - Regular model upload" title="HF - Regular model upload">
<figcaption><strong>Figure 14:</strong> Hugging Face upload process for the Regular model, showing upload progress and confirmation</figcaption>
</figure>
<p><strong>After Publishing:</strong>
Once published, your models will be available at:</p>
<ul>
<li><strong>SLM</strong>: <code>bahree/london-historical-slm</code></li>
<li><strong>Regular Model</strong>: <code>bahree/london-historical-llm</code></li>
</ul>
<p>Users can then easily load and use your models with just a few lines of code, making your historical language models accessible to the broader AI community for research, education, and creative applications.</p>
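<p>As a rough illustration of what &ldquo;a few lines of code&rdquo; can look like, here is a sketch using the <code>transformers</code> Auto classes. It assumes the published repositories load with <code>AutoTokenizer</code> and <code>AutoModelForCausalLM</code>; the sampling settings are illustrative values, not tuned recommendations, and the <code>generate</code> wrapper is a hypothetical helper.</p>

```python
# Repository IDs as listed above.
MODEL_IDS = {
    "slm": "bahree/london-historical-slm",
    "regular": "bahree/london-historical-llm",
}


def generate(model_type: str, prompt: str, max_new_tokens: int = 50) -> str:
    """Download a published model from the Hub and generate a continuation of prompt."""
    # Imported lazily so this module loads even where transformers is not installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = MODEL_IDS[model_type]
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    # Sampling settings are illustrative, not tuned recommendations.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

<p>Calling <code>generate("slm", "In the year 1834, I walked through the streets of London and witnessed")</code> downloads the SLM on first use and returns the prompt followed by the sampled continuation.</p>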
<p><strong>Testing Your Published Models:</strong>
Once published, you can test your models using the same inference methods shown in the Quick Start section:</p>
<figure id="listing15"><div class="highlight"><pre tabindex="0" style="color:#cad3f5;background-color:#24273a;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Test published SLM model (10 automated tests)</span>
</span></span><span style="display:flex;"><span>python 06_inference/test_published_models.py --model_type slm
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#6e738d;font-style:italic"># Interactive testing with published models</span>
</span></span><span style="display:flex;"><span>python 06_inference/inference_unified.py --published --model_type slm --interactive</span></span></code></pre></div><figcaption>
        <strong>Listing 15: Test Published Models</strong>
    </figcaption>
</figure>
<h2 id="10-what-weve-accomplished">10. What We&rsquo;ve Accomplished</h2>
<p>This comprehensive guide has taken you from raw historical documents to working language models that can generate authentic 18th and 19th-century London text. We&rsquo;ve built a complete pipeline that transforms 218+ historical sources into two specialized models - a fast SLM for experimentation and a powerful Regular model for high-quality generation. The entire system is fully functional, with both PyTorch checkpoint inference and Hugging Face model publishing working seamlessly, tested and validated on real hardware.</p>
<p>What makes this project interesting is that it&rsquo;s not just another language model - it&rsquo;s a complete educational journey that teaches you every aspect of building LLMs from scratch. From custom historical tokenizers that understand archaic English to sophisticated GPU optimization and deployment, you&rsquo;ve learned the full stack of modern language model development. The result is a system that preserves historical linguistic heritage while demonstrating cutting-edge AI techniques, making it valuable for researchers, educators, and anyone interested in the intersection of history and technology.</p>
<h2 id="11-the-journey-continues">11. The Journey Continues</h2>
<p>This is just the beginning. In the next three parts of this series, we&rsquo;ll dive deeper into the technical foundations:</p>
<p><strong><a
	
		href = "/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	>
	
	<span>
		Part 2
	</span>
</a></strong> explores historical data collection, showing how we curated 218+ authentic sources spanning 350 years of London&rsquo;s history, and how we built a custom tokenizer that truly understands historical English.</p>
<p><strong><a
	
		href = "/post/2025/11/building-llm-from-scratch-part3-model-architecture-gpu-training/"
	

	

	>
	
	<span>
		Part 3
	</span>
</a></strong> reveals the custom GPT architecture designed specifically for historical text, GPU optimization strategies, and training infrastructure.</p>
<p><strong><a
	
		href = "/post/2026/01/building-llm-from-scratch-part4-evaluation-deployment/"
	

	

	>
	
	<span>
		Part 4
	</span>
</a></strong> completes the journey with evaluation frameworks, testing strategies, and deployment techniques that transform your trained models into working systems.</p>
<p>Each part builds on what you&rsquo;ve learned here, taking you from high-level overview to deep technical implementation details.</p>
<h2 id="12-resources">12. Resources</h2>
<ul>
<li>⚙️ <strong>GitHub Repository</strong>: <a
	
		href = "https://github.com/bahree/helloLondon"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		github.com/bahree/helloLondon
	</span>
</a> - Complete codebase with all training scripts, inference tools, and documentation</li>
<li><strong>Hugging Face Models</strong>:
<ul>
<li>🤗 <a
	
		href = "https://huggingface.co/bahree/london-historical-slm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		bahree/london-historical-slm
	</span>
</a> - Small Language Model (117M parameters)</li>
<li>🤗 <a
	
		href = "https://huggingface.co/bahree/london-historical-llm"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		bahree/london-historical-llm
	</span>
</a> - Regular Model (354M parameters)</li>
</ul>
</li>
<li>📘<strong>Book Reference</strong>: <a
	
		href = "https://a.co/d/ffzkJ7T"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Generative AI in Action
	</span>
</a> - For deeper understanding of core LLM concepts</li>
<li>📖<strong>Documentation</strong>: Complete guides in the <code>08_documentation/</code> folder covering every aspect of the project</li>
</ul>
<h2 id="13-acknowledgments">13. Acknowledgments</h2>
<p>This project builds upon the excellent work of the open-source community. Special thanks to <a
	
		href = "https://github.com/haykgrigo3/TimeCapsuleLLM"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		haykgrigo3&rsquo;s TimeCapsuleLLM
	</span>
</a> for the initial inspiration and framework for historical language model training, and to <a
	
		href = "https://github.com/karpathy/nanoGPT"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Andrej Karpathy&rsquo;s nanoGPT
	</span>
</a> for the foundational GPT architecture and training methodology. The project extends these foundations with specialized adaptations for historical text, including custom tokenizers, advanced data filtering, and educational deployment infrastructure.</p>
<p>🙏</p>
<hr>
<p><strong>Ready to dive deeper?</strong> <a
	
		href = "https://blog.desigeek.com/post/2025/10/building-llm-from-scratch-part2-data-tokenizers/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Part 2: Data Collection &amp; Custom Tokenizers
	</span>
</a> covers the technical details of data collection, cleaning pipelines, and custom tokenizer development for authentic historical text processing.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Reasoning AI Models: An overview</title>
      <link>/post/2025/09/reasoning-ai-models-a-deep-dive/</link>
      <pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate>
      <guid>/post/2025/09/reasoning-ai-models-a-deep-dive/</guid>
      <description>Technical deep dive and best practices for reasoning AI models: architecture, fine-tuning, evaluation, and compute trade-offs.</description>
      <content:encoded><![CDATA[<h4 id="tldr">TL;DR</h4>
<p>As part of my role at Microsoft&rsquo;s AI Foundry Applied AI engineering team in CoreAI, I have participated in numerous detailed discussions about the evolving landscape of AI models. In conversations with many customers, from CxOs to engineers, one recurring topic is the <strong>rise of reasoning AI models</strong>. These models are designed to perform complex tasks by explicitly breaking down problems into logical steps, rather than just generating text in a single pass like traditional large language models (LLMs). This shift toward <em>reasoning-centric</em> AI marks a major evolution in how we develop and deploy AI systems—and it’s a key factor behind the rise of Agents and Agentic AI.</p>
<p>At the same time, there is a lot of confusion about what these reasoning models are, how they differ from traditional LLMs, and how to effectively adapt and evaluate them. In this post, I aim to clarify these concepts by providing a technical deep dive into reasoning AI models, their training and adaptation processes, and the challenges involved in fine-tuning them for specific tasks. We will also explore how to evaluate these models effectively, considering their unique characteristics.</p>
<p>This post aims to give you a deeper understanding of reasoning models and their implications. I cover these areas:</p>
<ul>
<li><strong>What are reasoning AI models?</strong> A technical overview of their architecture and training paradigms.</li>
<li><strong>How do they differ from traditional LLMs?</strong> Key distinctions in capabilities and performance.</li>
<li><strong>How to adapt and fine-tune reasoning models?</strong> Best practices and common pitfalls.</li>
<li><strong>What are the challenges in customizing them?</strong> Technical and organizational hurdles.</li>
<li><strong>How to evaluate reasoning models?</strong> Metrics and strategies for assessing their performance.</li>
</ul>
<h2 id="1-introduction">1. Introduction</h2>
<p>Recent AI models have begun to combine language generation with explicit reasoning, enabling more reliable solutions to complex problems. Traditional LLMs like GPT-4o complete a generation in one go, without showing their work. Reasoning models, on the other hand, produce a sequence of intermediate steps (a “reasoning trace”) before the final generation. For example, Microsoft’s Phi-4-Reasoning (14B parameters) will explicitly work through a math problem step-by-step, whereas a regular LLM might confidently state an answer with no explanation. This fundamental difference – <strong>predictive text generation vs. chained logical reasoning</strong> – makes reasoning LLMs significantly better at multi-step tasks, such as math word problems, code debugging, or complex decision queries.</p>
<p>Note: The AI model landscape is also shifting rapidly, with a newer trend of moving from separate “base” vs. “reasoning” models (e.g., o1/o3) to unified systems with internal routing (e.g., GPT-5). GPT-5 runs a system that routes between fast and deliberate paths and exposes developer controls to tune thinking time. In production, the system automatically switches modes; developers can cap or elevate effort as needed. This operationalizes dynamic compute allocation, reducing the need for prompt engineering whose only purpose is to induce reasoning.</p>
<p>The shift toward unified systems like GPT-5 can be understood as operationalizing the compute-optimal scaling insights from research. Rather than requiring users to choose between reasoning modes manually, these systems implement automatic difficulty assessment and adaptive compute allocation - essentially embedding the &ldquo;compute-optimal&rdquo; strategy within the model architecture itself.</p>
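<p>To make the idea of difficulty-based routing concrete, here is a deliberately simple toy sketch. It does not describe how GPT-5 or any production system works: the keyword heuristic, the effort tiers, and every name in the code are hypothetical, chosen only to show how an estimated difficulty and a developer-set cap could combine into a routing decision.</p>

```python
from dataclasses import dataclass

EFFORT_LEVELS = ["minimal", "low", "medium", "high"]  # hypothetical effort tiers


@dataclass
class RoutingDecision:
    path: str    # "fast" or "deliberate"
    effort: str  # one of EFFORT_LEVELS


def estimate_difficulty(prompt: str) -> float:
    """Toy heuristic in [0, 1]: longer prompts with reasoning cues score higher."""
    cues = ("prove", "step by step", "debug", "derive", "why")
    cue_hits = sum(cue in prompt.lower() for cue in cues)
    length_score = min(len(prompt.split()) / 200, 1.0)
    return min(0.25 * cue_hits + length_score, 1.0)


def route(prompt: str, max_effort: str = "high") -> RoutingDecision:
    """Pick a path from estimated difficulty, never exceeding the developer-set cap."""
    difficulty = estimate_difficulty(prompt)
    wanted = min(int(difficulty * len(EFFORT_LEVELS)), len(EFFORT_LEVELS) - 1)
    capped = min(wanted, EFFORT_LEVELS.index(max_effort))
    path = "fast" if capped == 0 else "deliberate"
    return RoutingDecision(path=path, effort=EFFORT_LEVELS[capped])
```

<p>A real router would rely on a learned difficulty signal rather than keywords, but the structure is the same: estimate how hard the request is, then spend compute accordingly within the limits the developer allows.</p>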
<h3 id="11-what-are-reasoning-models">1.1 What are reasoning models?</h3>
<p>Reasoning models are LLMs architected to solve problems via a multi-step chain-of-thought (CoT) approach. Instead of just predicting the next token, they simulate an internal “scratchpad” of logic. For instance, OpenAI’s latest models (<em>o1</em> and <em>o3</em>) reportedly allocate extra computation at inference-time and use <a
	
		href = "https://blog.desigeek.com/post/2025/01/intro-to-reinforcement-learning/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		reinforcement learning (RL)
	</span>
</a> fine-tuning to boost multi-step reasoning. DeepSeek’s R1 (671B-parameter <a
	
		href = "https://blog.desigeek.com/post/2025/01/intro-to-mixture-of-experts/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Mixture-of-Experts model
	</span>
</a>) was explicitly trained with multi-stage reinforcement learning to encourage step-by-step thinking.</p>
<p>During training, such models may be given examples formatted like: <code>Question → (Begin reasoning) → ... reasoning steps ... → (Final answer)</code>, or prompted with cues like <em>“Let’s think step by step.”</em> This teaches the model to <strong>articulate intermediate steps</strong> instead of jumping straight to an answer. In essence, a reasoning LLM <strong>learns to internalize a logical process</strong> – it doesn’t just know facts or language, it learns how to solve problems by breaking them down.</p>
<p>Crucially, these reasoning models often use special tokens to separate the “thinking” from the final answer. Many use a convention such as <code>&lt;think&gt; ... &lt;/think&gt;</code> tags to enclose the chain of thought. For example, <strong>DeepSeek-R1-Distill</strong> (a distilled 8B version of R1) will output a hidden “thinking” transcript between these tags, followed by a concise answer that summarizes the reasoning. The chain-of-thought (CoT) might include equations, logic, or code, which the model generates as if working on scratch paper, and then the answer is given separately. This behavior is usually built into the model through fine-tuning – if you prompt such a model normally, it will, by default, produce a step-by-step solution trace and then provide the answer.</p>
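<p>As a rough illustration of how a client might separate the trace from the answer, here is a minimal sketch. It assumes the tags appear at most once and in order; the function name <code>split_reasoning</code> and its parameters are illustrative and not part of any model&rsquo;s API.</p>

```python
def split_reasoning(text: str, open_tag: str, close_tag: str) -> tuple[str, str]:
    """Return (reasoning, answer) for a completion that may contain one tagged trace."""
    before, found_open, rest = text.partition(open_tag)
    if not found_open:
        # No trace at all: treat the whole completion as the answer.
        return "", text.strip()
    reasoning, found_close, after = rest.partition(close_tag)
    if not found_close:
        # Trace was cut off (e.g., a token limit was hit): no final answer exists.
        return reasoning.strip(), ""
    # Anything outside the tags is the user-facing answer.
    return reasoning.strip(), (before + after).strip()
```

<p>Called with <code>&lt;think&gt;</code> and <code>&lt;/think&gt;</code> as the two tags, this returns the scratchpad and the answer separately, so the trace can be logged or hidden while only the answer is shown.</p>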
<p>Some recent systems even let developers toggle the visibility of this trace: e.g., Qwen-3 allows a “reasoning mode” where the chain of thought is shown or hidden as needed. The key point is that reasoning models carry out more computation in the open, and they may consume more tokens. It is quite common for them to use hundreds or thousands of tokens for a complex solution, whereas a regular LLM might try to produce an answer in, say, a single paragraph.</p>
<p><p>

    <figure>
        <img src="images/1-deepseek-moe-architecture.png" alt="DeepSeek-R1 MoE architecture"/>
        <figcaption>Figure 1: DeepSeek-R1 architecture showing Mixture-of-Experts design with selective parameter activation (37B of 671B parameters active per token) and 128K token context window. (Source: DeepSeek research)</figcaption>
    </figure>

</p></p>
<h3 id="12-cognitive-architecture-parallels---type-1-and-type-2-thinking">1.2 Cognitive Architecture Parallels - Type 1 and Type 2 Thinking</h3>
<p>The reasoning model paradigm directly parallels the Type 1/Type 2 thinking framework popularized by <a
	
		href = "https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Daniel Kahneman
	</span>
</a>. Recent work demonstrates how LLMs can be aligned to either System 1 (intuitive and fast) or System 2 (analytical and deliberate) thinking patterns; Kahneman&rsquo;s System 1/System 2 labels correspond to the Type 1/Type 2 terms used here.</p>
<p><strong>Type 1</strong> thinking in AI systems corresponds to the pattern-matching and intuitive responses characteristic of traditional LLMs - fast, automatic responses based on learned patterns. <strong>Type 2</strong> thinking represents the deliberate, step-by-step reasoning that reasoning models are designed to emulate. The same line of research reports that System 2-aligned models excel in arithmetic and symbolic reasoning, while System 1-aligned models perform better in commonsense tasks.</p>
<h4 id="cognitive-flexibility-and-performance-trade-offs">Cognitive Flexibility and Performance Trade-offs</h4>
<p>Unlike human cognition, which fluidly adapts between System 1 and System 2 thinking based on context, current LLMs lack this dynamic flexibility. This rigidity can lead to brittle performance when tasks deviate from trained patterns. However, reasoning models attempt to address this limitation by incorporating explicit System 2-style processing.</p>
<p>The research demonstrates an &ldquo;accuracy-efficiency trade-off&rdquo; where System 2-aligned models show greater uncertainty and more systematic processing, while System 1-aligned models provide more definitive but potentially less reliable answers. This suggests that optimal AI systems may need to switch between reasoning modes dynamically based on task complexity.</p>
<p>From an architectural perspective, reasoning LLMs are still transformer-based neural networks at their core. They don’t necessarily have new algorithmic components beyond the training tweaks, though some research explores adding tools or memory. It’s the <strong>training paradigm</strong> that sets them apart.</p>
<p>For example, where a classic GPT-4o/4.1-style LLM is trained on next-token prediction plus some instruction tuning, a reasoning model like R1 or Phi-4-Reasoning goes through an extensive multi-stage pipeline: supervised fine-tuning on curated CoT examples, then specialized reinforcement learning (using rewards for getting answers right and for producing a consistent chain of thought), and so on. OpenAI’s o1/o3 models are rumored to undergo similar multi-stage refinement, combining RL with the ability to allocate more thinking steps at runtime.</p>
<h3 id="13-chain-of-thought-built-in-vs-prompted">1.3 Chain-of-Thought: Built-in vs Prompted</h3>
<p>Start by understanding what a chain of thought (CoT) is. CoT is the model’s “scratchpad”: a sequence of intermediate reasoning steps it writes out before giving the final answer. Many models fence this trace with special tokens (e.g., <code>&lt;think&gt; ... &lt;/think&gt;</code>); there are configurations that can show or hide these. The advantage this gives us is better results on multi-step tasks (such as math, code, and planning) by decomposing problems. On the other hand, the trade-offs include more tokens → more cost/latency; and traces can be verbose or unfaithful if not evaluated. As a result, CoT is best used for complex queries, and where possible, it would be wise to consider either skipping or limiting these for simple lookups. See “Evaluation” for token-normalized accuracy and faithfulness checks.</p>
<p>CoT prompting emerged as a technique to enhance traditional LLMs by explicitly requesting step-by-step reasoning through prompts such as &ldquo;Describe your reasoning in steps&rdquo; or &ldquo;Explain your answer step by step.&rdquo; This approach leverages LLMs&rsquo; ability to &ldquo;think out loud&rdquo; in natural language, with effectiveness scaling with model size as an emergent ability.</p>
<p>Figure 2 shows an LLM decomposing a complex math word problem into sequential subquestions, solving each step before arriving at the final answer. (Credit: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models)</p>
<p><p>

    <figure>
        <img src="images/2-CoT-reasoning-process.jpg" alt="CoT reasoning example"/>
        <figcaption>Figure 2: CoT reasoning example</figcaption>
    </figure>

</p></p>
<p>Reasoning models fundamentally differ in that they integrate CoT processing directly into their architecture and training process. Rather than requiring explicit prompting, these models automatically engage in step-by-step reasoning for complex tasks. Research indicates that &ldquo;Chain-of-Thought built into the core architecture and training process&rdquo; represents a more robust approach than external prompting.</p>
<p>However, CoT prompting is not universally effective across all models and tasks. Recent research on strategic reasoning found that it improves strategic reasoning only for models at certain levels, while providing limited gains elsewhere. This suggests that integrating reasoning capabilities requires careful architectural considerations beyond simple prompting strategies.</p>
<p>The effectiveness of CoT in reasoning models also varies by task complexity and domain. Models trained with reinforcement learning on reasoning tasks show more consistent application of multi-step reasoning compared to models relying solely on prompted CoT.</p>
<h3 id="14-test-time-vs-train-time-compute">1.4 Test-Time vs Train-Time Compute</h3>
<p>A critical innovation in reasoning models is the emphasis on test-time compute scaling. While traditional LLMs are limited by their trained parameters, reasoning models can allocate variable computational resources during inference. OpenAI reports that the performance of <strong>o1</strong> improves with more RL (train-time compute) and with more time spent thinking (test-time compute) (<a
	
		href = "https://openai.com/index/learning-to-reason-with-llms/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		overview
	</span>
</a>). This creates new scaling paradigms, where models can allocate more computational resources to harder problems during inference.</p>
<p>This <strong>inference-time compute scaling</strong> (using more tokens/steps) is a defining trait – it enables even smaller models to solve hard problems by iterating through reasoning. As Microsoft’s team describes, “Phi 4 Reasoning generates detailed reasoning chains that effectively leverage additional inference time compute,” allowing a 14B model to compete with far larger ones.</p>
<p>Because this extra “thinking” consumes tokens and compute, it helps to formalize the tradeoff to understand the concept better.</p>
<p>Test-time compute is best understood as a way to reshape the model’s output distribution at inference by searching over alternative reasoning paths and then selecting among them. It reliably lifts accuracy—especially on problems with verifiable answers—yet it is not interchangeable with pretraining compute.</p>
<p>Recent evidence shows that test-time compute helps most when the base model is already capable and the gap to the target difficulty is modest; on the hardest items, pretraining capacity still dominates - as outlined in Figure 3 below.</p>
<p><p>

    <figure>
        <img src="images/3-test-time-vs-train-time-compute-trade-off.png" alt="Relationship between test-time and train-time compute in reasoning models"/>
        <figcaption>Figure 3: The relationship between test-time and train-time compute in reasoning models, showing how additional inference computation can compensate for reduced training compute. (Source: Snell et al., 2024)</figcaption>
    </figure>

</p></p>
<p>A practical rule is to treat thinking tokens as a budgeted resource: use them to explore and score candidate chains (branching) and reserve a small budget for targeted revision when a verifier flags issues. In cost terms, this gives you predictable returns without pretending that more inference tokens can fully substitute for more capable pretraining.</p>
<h4 id="test-time-compute-vs-model-size-trade-offs">Test-Time Compute vs Model Size Trade-offs</h4>
<p>A groundbreaking finding from recent research is that on problems where smaller models achieve non-trivial success rates, <strong>test-time compute can be used to outperform models 14× larger</strong> in FLOP-matched evaluations. This suggests a fundamental shift in how we think about compute allocation:</p>
<ul>
<li><strong>Easy to medium problems</strong>: Test-time compute is often more effective than pretraining larger models</li>
<li><strong>Very hard problems</strong>: Pretraining capacity still dominates, with limited benefits from test-time scaling</li>
<li><strong>Practical implication</strong>: Rather than focusing purely on scaling pretraining, it may be more efficient to train smaller models and apply test-time compute strategically</li>
</ul>
<h4 id="efficiency-trade-off-how-much-thinking-is-enough">Efficiency trade-off: How much “thinking” is enough?</h4>
<p>OpenAI explicitly reports that <strong>o1</strong>’s performance improves with more RL (<em>train-time</em> compute) and with more time spent thinking (<em>test-time</em> compute). Microsoft’s Phi-4 Reasoning (14B) shows a similar pattern: small models, when allowed longer structured chains, punch above their weight in math and science. To examine the implications, consider a back-of-the-envelope cost model. If $L_r$ is the “reasoning” length and $L_a$ is the final answer length, a crude attention-heavy cost proxy is</p>
<p>$$
\text{Compute} \;\propto\; H\,(L_a + L_r)^2\,d,
$$</p>
<p>with hidden size $H$ and depth $d$. You can wrap this into an objective that matches your reality:</p>
<p>$$
\min_{L_r}\; C(L_r) = \alpha\,H\,(L_a+L_r)^2\,d \;+\; \beta\,\text{latency}(L_r) \;-\; \gamma\,\text{Acc}(L_r),
$$</p>
<p>where $\alpha,\beta,\gamma&gt;0$ are your infra cost, SLA pain, and value of accuracy. You won’t solve this analytically in prod—you’ll <strong>sweep the thinking budget</strong> and pick a knee point.</p>
<p>What is really interesting is that accuracy is typically concave in $L_r$; i.e., the first ~100–300 “thinking” tokens help a lot, and beyond that you hit <strong>diminishing returns</strong>.</p>
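<p>To make the sweep concrete, here is a small sketch that evaluates the objective $C(L_r)$ over a handful of measured budgets and picks the minimum. Everything numeric is made up for illustration: the accuracy table, the weights, the choice to fold $H$ and $d$ into 1, and the assumption that latency is linear in generated tokens.</p>

```python
def total_cost(l_r, l_a, acc, alpha, beta, gamma, hidden=1.0, depth=1.0, ms_per_token=1.0):
    """C(L_r) = alpha*H*(L_a+L_r)^2*d + beta*latency(L_r) - gamma*Acc(L_r)."""
    compute = hidden * (l_a + l_r) ** 2 * depth
    latency = ms_per_token * (l_a + l_r)  # assumption: latency is linear in tokens
    return alpha * compute + beta * latency - gamma * acc


def pick_budget(measured_acc, l_a, alpha, beta, gamma):
    """Sweep the measured budgets and return the one with the lowest objective."""
    return min(
        measured_acc,
        key=lambda l_r: total_cost(l_r, l_a, measured_acc[l_r], alpha, beta, gamma),
    )


# Hypothetical sweep results: thinking budget (tokens) -> accuracy on a dev set.
sweep = {0: 0.52, 128: 0.70, 256: 0.76, 512: 0.79, 1024: 0.80}
best = pick_budget(sweep, l_a=50, alpha=2e-7, beta=1e-5, gamma=1.0)
```

<p>With these illustrative numbers the minimum lands at 256 thinking tokens: going to 512 buys three points of accuracy but the quadratic term costs more than that under the chosen weights. In practice the weights come from your own infra costs and SLAs, so the knee will move.</p>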
<p>Quick intuition: if $L_a$ is small and $L_r$ doubles, the attention term grows by about $4\times$, while accuracy typically improves far less—hence token budgets and early stop heuristics. We’ll revisit this idea in <a
	
		href = "#evaluation-strategies-for-reasoning-models"
	

	

	>
	
	<span>
		Evaluation
	</span>
</a> via token-normalized accuracy.</p>
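<p>The $4\times$ figure follows directly from the cost proxy. Holding $H$, $d$, and $L_a$ fixed and doubling $L_r$ gives the ratio</p>
<p>$$
\frac{H\,(L_a + 2L_r)^2\,d}{H\,(L_a + L_r)^2\,d} = \left(1 + \frac{L_r}{L_a + L_r}\right)^2 \;\longrightarrow\; 4 \quad \text{as } L_a / L_r \to 0.
$$</p>
<p>As an illustrative case, with $L_a = 50$ and $L_r = 500$ the ratio is $(1050/550)^2 \approx 3.6$, so the attention term already grows by close to the limiting factor.</p>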
<p>This trade-off also motivates practical features, such as token budgets, early-stop heuristics, and “fast vs. deliberative” paths (e.g., Qwen-3’s reasoning mode). With that lens, let’s look at what differs under the hood.</p>
<ul>
<li>More deliberate thinking often helps, up to a point: you can trade thinking tokens for accuracy on complex items.</li>
<li>Returns diminish: the first ~100–300 “reasoning” tokens carry a lot of the lift; beyond that, you’re paying for a long tail.</li>
</ul>
<blockquote>
<p><strong>Rule of thumb.</strong> Treat “thinking tokens” as a first-class budget; <strong>log it</strong>, control it, and optimize it like you optimize memory or p95 latency. Some models and serving stacks, such as Qwen-3 and NVIDIA’s NIM, expose this <em>thinking budget</em> directly.</p></blockquote>
<p>In short, reasoning LLMs are <strong>LLMs with a logic upgrade</strong> – through additional training, they learn to use reasoning strategies that standard models lack.</p>
<h3 id="15-effectiveness-and-limitations-of-reasoning-llms">1.5 Effectiveness and Limitations of Reasoning LLMs</h3>
<p>Recent benchmarks indicate that CoT reasoning yields significantly improved performance on complex tasks (see Figure 4 below). For example, Microsoft’s Phi-4-Reasoning models, with only 14B parameters, match or surpass much larger models on math and science benchmarks, sometimes outperforming a model 5× their size (they surpass OpenAI’s o1-mini and R1&rsquo;s 70B distilled version on many of these benchmarks). This success is attributed to reasoning-focused training and reinforcement learning. It reflects a general trend: <em>with the right training, a model doesn’t have to be huge to solve complex tasks – it just needs to learn how to use its capacity more algorithmically.</em></p>
<p><p>

    <figure>
        <img src="images/4-AIME-performance-scaling.png" alt="Performance scaling on AIME mathematics benchmark"/>
        <figcaption>Figure 4: Performance scaling on AIME mathematics benchmark: Both train-time compute (left) and test-time compute (right) show smooth accuracy improvements, validating the compute-optimal scaling approach. (Source: OpenAI research)</figcaption>
    </figure>

</p></p>
<p>Another data point is the DeepSeek R1 family. The original R1 (671B, MoE) was a “reasoning-maximal” model (see Figure 1), pushed to an extreme scale and trained with novel RL algorithms (such as GRPO, Group Relative Policy Optimization, which scores each sampled answer relative to a group of samples for the same prompt) to excel at long-horizon problems. Distilled smaller versions of R1 (70B, 8B, etc.) inherited some of these skills through knowledge distillation. These distilled reasoning models, even at 8B, achieved math and puzzle-solving scores significantly higher than those of similarly sized generic LLMs. Open-source efforts like <em>Bespoke-Stratos-7B</em> and <em>OpenThinker-7B</em> followed suit, demonstrating that a properly fine-tuned 7B model with CoT can outperform naive 7Bs by significant margins on benchmarks. In 2025, Qwen-3 (an advanced open model by Alibaba) was released with both a “thinking” mode and a “no thinking” mode. In its CoT mode, Qwen-3 <strong>outperformed DeepSeek-R1 on a majority of evaluated tasks</strong> despite activating only a subset of its parameters at each token (its flagship is a mixture-of-experts model).</p>
<p>What is interesting is that when Qwen-3’s thinking mode was toggled off (i.e., no CoT visible), it still beat a GPT-4-sized baseline on many benchmarks, implying that integrating reasoning steps did not harm its base competency – it only added the ability to dig deeper when needed. All these examples underscore that <strong>reasoning LLMs hold a significant edge</strong> on tasks that aren’t straightforward single-step predictions. Whenever an answer requires multiple pieces of information or intermediate calculations, a traditional LLM often fails or guesses incorrectly, whereas a reasoning LLM can navigate the steps systematically (much like a human showing their work). The gap is so notable that analysts have called reasoning LLMs “a critical evolution” in AI capability, and enterprise users are exploring them for decision-making support where correctness takes precedence over brevity.</p>
<h4 id="mathematical-and-logical-reasoning">Mathematical and Logical Reasoning</h4>
<p>Reasoning models demonstrate substantial improvements over traditional LLMs in mathematical and logical reasoning tasks. OpenAI&rsquo;s o1 achieves remarkable performance, ranking in the 89th percentile on competitive programming questions (Codeforces) and placing among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME).</p>
<blockquote>
<p><strong>Codeforces</strong>: Codeforces is a major competitive programming platform and community. It hosts frequent online contests (“Rounds”) where participants solve algorithmic problems within time limits, and it maintains an Elo-like rating system and color-coded titles (ranging from Newbie to Legendary Grandmaster).</p></blockquote>
<p>Comprehensive evaluations (see Figure 5) show that o1-preview reaches 100% accuracy on high school-level mathematical reasoning tasks while providing detailed step-by-step solutions, and an 83.3% success rate on complex competitive programming problems, surpassing many human experts. These results indicate performance that often meets or exceeds that of human experts in structured reasoning domains.</p>
<p><p>

    <figure>
        <img src="images/5-model-performance-comparison.png" alt="Model performance comparison"/>
        <figcaption>Figure 5: Comprehensive benchmark comparison showing o1-preview&#39;s superior performance over GPT-4o across mathematics, science, and reasoning tasks. (Source: OpenAI, 2024)</figcaption>
    </figure>

</p></p>
<h4 id="domain-specific-applications">Domain-Specific Applications</h4>
<p>Beyond mathematics, reasoning models show strong performance across diverse specialized domains. Evaluations indicate remarkable proficiency in anthropology and geology, demonstrating a deep understanding and sound reasoning in these specialized fields, as well as strong capabilities in quantitative investing, complemented by comprehensive financial knowledge. The models also demonstrate superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models.</p>
<p>Recent research with ReasonFlux-32B has demonstrated that smaller, specialized reasoning models can outperform larger, general models. On the MATH benchmark, ReasonFlux-32B achieves an accuracy of 91.2% and surpasses o1-preview by 6.7% while being trained with only 8 GPUs.</p>
<blockquote>
<p><strong>ReasonFlux</strong> - ReasonFlux is a template-driven, hierarchical RL approach to reasoning LLMs; instead of lengthening raw CoT, it plans over a library of thought templates and scales those at inference time, yielding strong math results in a 32B-parameter model.</p></blockquote>
<p>However, this does not mean reasoning models dominate on every task. For very simple or single-step queries (e.g., straightforward fact lookups or classifications), a regular LLM might perform just as well and with less latency - it is using fewer tokens and does not have to generate a long explanation; more tokens mean more computation and slower responses. That said, many reasoning LLMs are designed to be flexible – they can shorten or skip the reasoning when it’s not needed. Some deployments use a “fast path” versus a “deliberative path” approach: run the model in normal mode for easy questions and only invoke full reasoning mode for complex ones. This dynamic compute allocation is a research area in itself (how to predict when to make a model think longer).</p>
<p>The <strong>token-budget mechanism</strong> in Qwen-3 is one example: it allows users to cap how many reasoning tokens the model can use, forcing it to decide what’s most important. Accuracy does improve with more tokens (e.g., from ~70% at 2K tokens to ~85% at 16K on a math test), but after a point, it is a matter of diminishing returns. The existence of such features highlights that reasoning LLMs introduce a new dimension – <em>a time/accuracy trade-off</em>. Traditional LLM evaluation is usually one-dimensional – measuring accuracy or quality for a given fixed model output length. Reasoning LLMs, on the other hand, let us trade generation length for correctness. (Note: the <em>Evaluation</em> section covers how to measure this in more detail.)</p>
<h3 id="16-branching--editing-at-test-time-how-to-spend-thinking-compute">1.6 Branching &amp; Editing at Test Time (how to “spend” thinking compute)</h3>
<p>Test-time compute isn’t just “more tokens”; it’s a way to reshape the model’s output distribution by searching for, and then selecting, better reasoning paths during decoding. In practice, this plays out along two complementary axes. The first is branching: generate multiple candidate chains and prefer the one that scores best under a process- or outcome-aware judge. The second is editing: let the model (and its tools) reflect on an initial attempt and revise it once or twice. Both strategies are ways of allocating limited thinking budget where it matters most.</p>
<p>On the branching side, simple best-of-N sampling remains a solid baseline, while beam or tree-style search makes exploration adaptive by spending more decoding on promising partial thoughts. Process-aware scoring—via a process reward model (PRM) or per-step self-evaluation—helps prune low-quality branches early; when ground truth isn’t available, self-consistency (majority voting across diverse chains) is a practical fallback. Two small but useful tricks from recent work are to branch early—keeping only the top few first-token continuations before decoding greedily—and to anneal temperature across tokens to reduce accumulated randomness as chains grow. Together, these make parallel exploration both cheaper and more reliable.</p>
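<p>The self-consistency fallback mentioned above is simple enough to sketch. The snippet assumes the final answers have already been extracted from each sampled chain; the function name and the &ldquo;agreement margin&rdquo; it returns are illustrative choices, not a standard API.</p>

```python
from collections import Counter


def self_consistency(answers: list[str]) -> tuple[str, float]:
    """Majority vote over final answers; returns (winner, agreement margin)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip() for a in answers)
    ranked = counts.most_common(2)
    winner, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    # Margin in [0, 1]: how far the winner is ahead of the next-best answer.
    return winner, (top - runner_up) / len(answers)


# Final answers extracted from five hypothetical sampled chains.
winner, margin = self_consistency(["42", "42", "41", "42", "40"])
```

<p>The margin is a cheap confidence signal: a wide margin can justify an early exit, while a narrow one suggests sampling more chains or handing off to a verifier.</p>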
<p>Editing tackles a different failure mode: an answer that looks plausible but hides a local mistake. Here, short reflect-revise loops work best when anchored to reliable feedback—unit tests for code, exact-match checks for math, heuristic rubrics, or judgments from a stronger model. Pure “self-correction” without such anchors tends to be unstable: models often make minor, non-helpful edits, occasionally flip correct answers to incorrect ones, or fail to generalize the revision behavior. Keeping revision rounds tight, skipping revision when a verifier signals “already correct,” and rolling back to the best-verified candidate are practical guardrails.</p>
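<p>One way to encode these guardrails is a short loop around two callables. This is a sketch under assumptions: <code>verify</code> stands in for whatever anchor you have (unit tests, exact match, a stronger judge) and returns a score, and <code>revise</code> stands in for a model call that proposes an edit. Both names are hypothetical.</p>

```python
from typing import Callable


def revise_with_verifier(
    draft: str,
    verify: Callable[[str], float],
    revise: Callable[[str], str],
    max_rounds: int = 2,
    accept_at: float = 1.0,
) -> str:
    """Run a short reflect-revise loop and return the best-verified candidate."""
    best, best_score = draft, verify(draft)
    for _ in range(max_rounds):
        if best_score >= accept_at:
            break  # verifier says "already correct": skip further revision
        candidate = revise(best)
        score = verify(candidate)
        if score > best_score:
            best, best_score = candidate, score
        # Otherwise `best` is kept, which rolls back an unhelpful or harmful edit.
    return best
```

<p>Because the loop only ever replaces <code>best</code> on a strictly higher verifier score, a revision that flips a correct answer to an incorrect one is discarded rather than returned.</p>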
<p>Importantly, branching and editing are not substitutes; the best results often come from using both. For easier problems, a short sequential pass can be enough, but as difficulty rises, the sweet spot shifts toward a deliberate mix of parallel exploration and a small revise budget. Thinking time is therefore a budget allocation question: how much diversity you buy up front versus how much you reserve for targeted fixes after you’ve seen a candidate chain.</p>
<p>Operationally, it pays to make the budget explicit and observable. Expose a cap on “thinking tokens,” allow early exit when candidates agree with high confidence, and log the signals that drove selection—per-step PRM or self-evaluation scores, agreement margins, and precise stop reasons. Over time, these traces make it easy to tune the ratio between breadth (how many chains you explore) and depth (how hard you try to fix a promising one), and to decide when a verifier is strong enough to justify skipping revision. Finally, remember that this test-time axis complements, but does not replace, pretraining: extra thinking generally helps, yet it cannot fully compensate for large capability gaps on the hardest items.</p>
<h4 id="compute-optimal-scaling">Compute-Optimal Scaling</h4>
<p>Recent research by Snell et al. demonstrates that <strong>compute-optimal scaling</strong> - allocating test-time compute adaptively based on problem difficulty - can improve efficiency by more than 4× compared to traditional best-of-N sampling. This approach recognizes that different problems require different amounts of thinking time, and optimal allocation varies dramatically based on prompt difficulty.</p>
<p>The key insight is that <em>question difficulty</em> can be predicted and used to determine the most effective test-time compute strategy. For easier problems, simple parallel sampling suffices, while harder problems benefit from sequential revision or sophisticated search strategies.</p>
<p>Research identifies two primary mechanisms for scaling test-time computation effectively:</p>
<ol>
<li>
<p><strong>Process-Based Verifier Search</strong>: Using dense, process-reward models (PRMs) to guide search through reasoning paths, enabling beam search or lookahead search strategies that prune low-quality branches early.</p>
</li>
<li>
<p><strong>Adaptive Distribution Updates</strong>: Modifying the model&rsquo;s distribution over responses at test time, such as through sequential revision where the model iteratively improves its initial attempts.</p>
</li>
</ol>
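<p>To illustrate the first mechanism, here is a minimal beam search over reasoning steps. It is a sketch, not the procedure from any specific paper: <code>expand</code> is a hypothetical stand-in for sampling next steps from the model, and <code>score</code> is a hypothetical stand-in for a PRM that rates a partial chain.</p>

```python
from typing import Callable


def prm_beam_search(
    expand: Callable[[list[str]], list[str]],
    score: Callable[[list[str]], float],
    beam_width: int = 2,
    max_steps: int = 3,
) -> list[str]:
    """Grow reasoning chains step by step, keeping the top-scoring partial chains."""
    beams: list[list[str]] = [[]]
    for _ in range(max_steps):
        # Extend every surviving chain with every proposed next step.
        candidates = [chain + [step] for chain in beams for step in expand(chain)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]  # prune low-scoring branches early
    return beams[0]
```

<p>The second mechanism, sequential revision, changes what the model proposes rather than which proposal is kept, so the two can be combined: search over chains first, then revise the winner.</p>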
<p>The effectiveness of these approaches <strong>critically depends on problem difficulty</strong> - easier problems benefit more from parallel exploration (branching), while harder problems require sequential refinement (editing).</p>
<h4 id="difficulty-aware-compute-allocation">Difficulty-Aware Compute Allocation</h4>
<p>A key insight from recent research is that <strong>optimal test-time strategies vary dramatically with problem difficulty</strong>. This motivates <strong>adaptive allocation</strong> strategies:</p>
<ul>
<li><strong>Easy problems</strong>: Simple best-of-N sampling with minimal compute</li>
<li><strong>Medium problems</strong>: Weighted voting or beam search with moderate compute budgets</li>
<li><strong>Hard problems</strong>: Sequential revision with larger compute budgets, but diminishing returns beyond a threshold</li>
</ul>
<p>This difficulty-aware approach enables <strong>4× efficiency improvements</strong> over uniform compute allocation strategies.</p>
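<p>The three tiers above can be expressed as a small routing function. The thresholds, sample counts, and strategy names below are placeholders chosen for illustration; in practice they would be tuned on your own traffic, and the difficulty estimate would come from a separate predictor.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputePlan:
    strategy: str
    samples: int          # parallel chains to generate
    revision_rounds: int  # sequential revise passes afterwards


def plan_for(difficulty: float) -> ComputePlan:
    """Map an estimated difficulty in [0, 1] to a test-time compute plan."""
    d = min(1.0, max(0.0, difficulty))  # clamp noisy estimates into range
    if d >= 0.7:
        return ComputePlan("sequential_revision", samples=4, revision_rounds=2)
    if d >= 0.3:
        return ComputePlan("beam_search", samples=8, revision_rounds=0)
    return ComputePlan("best_of_n", samples=2, revision_rounds=0)
```

<p>Keeping the plan as plain data makes it easy to log alongside each request and to compare against a uniform-budget baseline.</p>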
<h3 id="17-external-tools-inside-the-reasoning-loop">1.7 External tools inside the reasoning loop</h3>
<p>Several steps in a chain can be offloaded to exact tools (e.g., code execution, math). Approaches like PAL (program-aided language model) and CoC (Chain-of-Code) let the model “think” by writing and running code; ReAct interleaves search (e.g., Wikipedia) with thoughts. Recent o-series releases similarly intertwine web, code, and vision tools during reasoning. This improves robustness on math, algorithmic tasks, and multi-hop QA – without asking the LLM to emulate a compiler.</p>
<h4 id="pal">PAL</h4>
<p>Program-Aided Language Models (PAL) are an approach where LLMs address reasoning tasks by generating Python code rather than relying solely on natural language. This method utilizes programming to manage complex logic and calculations, aiming to decrease errors and improve results on benchmarks such as GSM8K and MATH.
PAL’s architecture is modular and interpretable, with the LLM functioning as a code generator and the Python interpreter serving as the reasoning engine. This clear separation improves debugging, verification, and extensibility, enhancing transparency and reproducibility. By combining symbolic reasoning with neural language modeling, PAL provides a hybrid approach that is both effective and practical.</p>
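<p>The division of labor in PAL can be shown with a toy example. The program string below is a hand-written stand-in for what an LLM might generate for a word problem; the helper name <code>run_generated_program</code> is hypothetical. Emptying the builtins is only a gesture toward isolation and is not a real sandbox; model-written code should run in an isolated process.</p>

```python
def run_generated_program(program: str, result_name: str = "answer"):
    """Execute model-written Python and return the variable holding the result."""
    namespace: dict = {"__builtins__": {}}  # no builtins; NOT a real sandbox
    exec(program, namespace)
    return namespace[result_name]


# Stand-in for LLM output on the question:
# "Roger has 5 balls and buys 2 cans of 3 balls each. How many balls now?"
generated = """
initial_balls = 5
cans = 2
balls_per_can = 3
answer = initial_balls + cans * balls_per_can
"""
result = run_generated_program(generated)
```

<p>The model only has to translate the problem into code; the interpreter does the arithmetic, which is where purely textual chains tend to slip.</p>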
<h4 id="coc">CoC</h4>
<p>Chain of Code (CoC) is a method that expands code-driven reasoning in LLMs by using a hybrid execution strategy. In contrast to traditional methods that rely exclusively on interpretable code or natural language reasoning, CoC enables models to generate programs combining executable code with semantic pseudocode. When the interpreter encounters undefined or non-executable behavior, such as abstract functions like <code>detect_sarcasm(string)</code>, CoC uses an &ldquo;LMulator&rdquo;, which is a language model-based emulator that predicts the expected output. This approach allows LLMs to process tasks involving both algorithmic and semantic elements.</p>
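<p>The fallback idea can be sketched in a few lines. This is a simplification under stated assumptions: each line is a single <code>name = expression</code> assignment, and <code>lmulate</code> is a hypothetical callable standing in for the language-model emulator. It is not the reference CoC implementation.</p>

```python
from typing import Callable


def run_with_lmulator(lines: list[str], lmulate: Callable[[str, dict], object]) -> dict:
    """Run assignment lines one at a time; fall back to `lmulate` when Python cannot."""
    state: dict = {}
    for line in lines:
        try:
            exec(line, {}, state)  # executable code goes to the real interpreter
        except NameError:
            # Undefined helper such as detect_sarcasm(...): ask the emulator for
            # the value and store it under the assignment target.
            target = line.split("=", 1)[0].strip()
            state[target] = lmulate(line, dict(state))
    return state
```

<p>Program state is shared between both paths, so a value predicted by the emulator can feed later lines that the interpreter executes exactly.</p>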
<p>By “thinking in code” CoC greatly expands the range of problems it can solve, surpassing Chain of Thought and other baseline methods on benchmarks like BIG-Bench Hard, where it reached an 84% success rate—12% higher than CoT. Its modular structure adapts well to different model sizes and fields, making it particularly suitable for tasks in robotics, perception, and mixed-modality reasoning. The use of flexible pseudocode and fallback emulation strategies provides a strong foundation for developing more generalizable and interpretable AI reasoning.</p>
<p>In summary, <strong>reasoning AI models</strong> distinguish themselves by <em>how</em> they solve problems. They use explicit multi-step reasoning (often visible as a chain-of-thought) and are trained with techniques (special prompts, reward signals, data curation) to make this effective. In doing so, they often achieve higher accuracy on complex tasks than traditional LLMs of comparable (or even much larger) size. The cost is greater complexity in training and sometimes in usage. We next discuss how one can adapt and fine-tune these models, and the pitfalls to watch out for.</p>
<h2 id="2-adapting-and-fine-tuning-reasoning-models">2. Adapting and Fine-Tuning Reasoning Models</h2>
<p>Like other LLMs, reasoning models can be fine-tuned or adapted to specific domains and tasks. A key advantage is that they can be <em>domain-specialized</em> while retaining strong reasoning skills.</p>
<p>For example, if you have a reasoning LLM and you want it to excel at medical diagnostics, you could fine-tune it on medical Q&amp;A data that includes step-by-step reasoning about symptoms and lab results. The model should, in principle, retain its general logical abilities and learn to apply them in the medical context. Fine-tuning can also help a model learn when to engage reasoning mode – e.g., always do detailed reasoning for high-stakes medical questions, but perhaps skip it for trivial prompts if instructed.</p>
<p>However, adapting a reasoning model is more complex than fine-tuning a regular LLM because you need to handle the reasoning traces properly. A key question is whether the fine-tuning data includes chains of thought or just question→answer pairs.
Generally, to preserve and leverage the model’s strength, you want to fine-tune with the reasoning format intact. That means if your dataset doesn’t already have human-written rationales, you may need to generate them (possibly using a larger teacher model like R1 or GPT-4 to produce explanations for your domain problems). By training on QA pairs supplemented with correct reasoning sequences, you reinforce the model’s inclination to think things through.</p>
<p>There is a subtle issue, though; if your fine-tuning data’s reasoning traces are of <em>lower quality</em> than the model’s current capability (for instance, you provide simplistic or even flawed reasoning examples), you might hurt performance. It’s like training a math student who can solve calculus problems to only practice arithmetic – they might lose their edge in advanced problem solving.</p>
<h4 id="21-loss-masking">2.1 Loss Masking</h4>
<p>One approach is called loss masking: the reasoning steps stay in the training sequences during fine-tuning, so the model still conditions on them, but no loss is computed on those reasoning tokens. Gradients are therefore driven only by the final answer portion, rather than the whole CoT text. This allows us to adjust the model’s final answers for a new domain while minimizing changes to its internal reasoning process.
The rationale is that the model’s existing reasoning ability, developed through prior training, should be maintained. The technique allows the model to retain its established reasoning while modifying how it presents final answers. Initial community observations indicate this approach can help preserve the quality of the model’s reasoning after fine-tuning. However, it may not be necessary if the fine-tuning dataset is large and of high quality.</p>
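<p>As a rough illustration of loss masking, the sketch below builds a label sequence in which every position before the final answer is replaced by an ignore value. The value <code>-100</code> is the default <code>ignore_index</code> of PyTorch&rsquo;s cross-entropy loss; the function name, the token ids, and the <code>answer_start</code> offset are hypothetical and would come from your own tokenizer and data format.</p>

```python
# Sketch of loss masking: supervise only the final-answer tokens.
# IGNORE_INDEX = -100 matches the default ignore_index of PyTorch's
# cross-entropy loss; the token ids below are made up for illustration.
from typing import List

IGNORE_INDEX = -100


def build_masked_labels(token_ids: List[int], answer_start: int) -> List[int]:
    """Return labels that keep only the tokens from answer_start onward.

    Everything before answer_start (prompt plus reasoning trace) becomes
    IGNORE_INDEX, so those positions contribute nothing to the loss.
    """
    if answer_start not in range(len(token_ids) + 1):
        raise ValueError("answer_start must lie inside the sequence")
    return [IGNORE_INDEX] * answer_start + list(token_ids[answer_start:])


# Hypothetical sequence: 3 prompt tokens, 4 reasoning tokens, 2 answer tokens.
example_ids = [11, 12, 13, 21, 22, 23, 24, 31, 32]
print(build_masked_labels(example_ids, answer_start=7))
# prints [-100, -100, -100, -100, -100, -100, -100, 31, 32]
```

<p>The input ids are left untouched, so the model still reads the full reasoning trace as context; only the supervision signal is restricted to the answer span.</p>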
<h4 id="22-prompt-based-fine-tuning">2.2 Prompt-Based Fine-Tuning</h4>
<p>Another approach when working with limited data is to use prompt-based fine-tuning or instruction prompts. Since reasoning models already respond to prompts like “show your reasoning, then answer,” you might not need to change their weights at all for some custom tasks – providing a few exemplars with reasoning in a prompt might suffice (few-shot learning). If actual fine-tuning is needed (e.g., to integrate new knowledge or jargon), lightweight methods like LoRA adapters can be applied in principle. One must ensure the prompt format (the presence of <code>&lt;think&gt;</code> tags or special tokens) is consistent during fine-tuning to prevent the model from being confused about when to produce reasoning. Many open implementations of reasoning models require a specific format to trigger the chain of thought. Adhering to that format in any further training data is important.</p>
<p>In summary, adapting a reasoning LLM is doable but requires careful dataset design. Ideally, your fine-tuning set should contain high-quality problem-solving examples with the full reasoning shown. If you don’t have that, you might generate it or opt to preserve the pre-trained reasoning behavior via techniques like masking. One should also monitor if the model starts to skip reasoning; if it does, this could indicate that the fine-tuning data encouraged direct answers only. Balancing task specialization with maintained reasoning capability is key.</p>
<p>Next, let&rsquo;s examine the challenges that may arise during this fine-tuning and customization process.</p>
<h4 id="practical-compute-budget-guidelines">Practical Compute Budget Guidelines</h4>
<p>Recent empirical analysis provides concrete guidance for practitioners:</p>
<ul>
<li><strong>Budget allocation</strong>: Treat test-time compute as a first-class resource requiring explicit budgeting and monitoring</li>
<li><strong>Difficulty prediction</strong>: Use learned difficulty predictors to route problems to appropriate compute strategies</li>
<li><strong>Diminishing returns</strong>: Most benefits come from the first 100-300 reasoning tokens; beyond that, returns diminish rapidly</li>
<li><strong>Cost-performance optimization</strong>: Smaller models with sophisticated inference can achieve Pareto-optimal trade-offs compared to larger models with simple inference</li>
</ul>
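<p>The routing idea in the list above can be sketched as a small lookup from a predicted difficulty score to a compute plan. This is only an illustration: the tier boundaries, strategy names, and token budgets are placeholder assumptions, and the difficulty score is assumed to come from a separate predictor that is not shown.</p>

```python
# Sketch: route a problem to a test-time compute plan by predicted difficulty.
# All thresholds, names, and budgets are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputePlan:
    strategy: str
    max_reasoning_tokens: int


# (upper difficulty bound, plan), ordered from cheapest to most expensive.
TIERS = (
    (0.3, ComputePlan("direct_answer", 0)),
    (0.7, ComputePlan("short_chain_of_thought", 300)),
    (1.0, ComputePlan("long_chain_of_thought", 2000)),
)


def route(difficulty: float) -> ComputePlan:
    """Pick the cheapest plan whose difficulty bound covers the score."""
    score = min(max(difficulty, 0.0), 1.0)  # clamp into [0, 1]
    for upper_bound, plan in TIERS:
        if upper_bound >= score:
            return plan
    return TIERS[-1][1]


for predicted in (0.1, 0.5, 0.95):
    print(predicted, route(predicted))
```

<p>Keeping the tiers in one table makes the budget explicit and easy to monitor, which is the point of treating test-time compute as a first-class resource.</p>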
<h2 id="3-challenges-in-fine-tuning-and-customizing-reasoning-models">3. Challenges in Fine-Tuning and Customizing Reasoning Models</h2>
<p>Adapting reasoning models to new tasks comes with unique challenges beyond those in standard LLM fine-tuning. These challenges span technical issues inherent to the models’ reasoning nature, as well as organizational hurdles in data and expertise. Let us explore some of the key challenges.</p>
<h3 id="31-trace-quality-degradation">3.1 Trace Quality Degradation</h3>
<p>A major technical concern is <em>preserving the quality of the reasoning trace</em>. Fine-tuning, if done poorly or on narrow data, can cause the model’s CoT to become less coherent or less faithful to its actual reasoning. Recent research shows that after fine-tuning on specific tasks, the faithfulness of a model’s CoT explanations decreases on average compared to the model before fine-tuning. In other words, the model might still provide accurate answers, but its stated reasoning is more likely to omit key steps or include spurious ones. This “trace degradation” can occur because the fine-tuning objective typically emphasizes obtaining the correct final answer for the new task – the model may learn that it can score well without strictly adhering to its original reasoning style.</p>
<p>In addition, if the fine-tune dataset isn’t sufficiently diverse or is missing the intermediate logic, the model’s previously polished reasoning abilities can “unravel” or get overwritten. It’s akin to using coarse sandpaper after a fine polish – the model may lose some of its nuanced problem-solving steps. Ensuring that fine-tuning does not erase the chain-of-thought skill is a complex and challenging task.</p>
<p>Techniques like the aforementioned loss masking or multi-stage fine-tuning (where you intermix some original reasoning training data) are used to mitigate this. Another aspect of trace quality is faithfulness – even if the model produces a plausible-looking rationale, is it honestly reflecting how the answer was derived? Fine-tuning can sometimes widen the gap between what the model <em>does</em> to get an answer and what it <em>says</em> in the explanation, especially if the fine-tuning introduces shortcut ways to get the answer. This is hard to detect; it requires careful evaluation (as we discuss later).</p>
<p>Overall, maintaining a <em>correct and faithful reasoning trace</em> under new training pressures is a key challenge.</p>
<h3 id="32-overfitting-and-distribution-shift">3.2 Overfitting and Distribution Shift</h3>
<p>Like any model, a reasoning LLM can overfit to a small fine-tune dataset, but the consequences here are more subtle. An overfit model might memorize specific solution patterns and fail to generalize its reasoning to slightly new problems (losing one of the main advantages of a reasoning approach). Because these models were often trained on a wide variety of reasoning tasks, fine-tuning on a narrow domain (say, only physics puzzles) might reduce their versatility or even accuracy on reasoning problems outside that niche.</p>
<p>Small, high-quality reasoning datasets can improve models, but if applied naively, they can also reduce performance on broader evaluations. The model may become too narrowly focused in its thought process (e.g., always expecting a specific style of solution). Ensuring the fine-tuning data covers enough variation or using regularization techniques (such as Mixout or weight decay) can help counteract this, but it remains a delicate balancing act.</p>
<p><em>LIMA</em> shows that ~1k carefully curated examples can generalize well, and <em>LIMO</em> finds that ~800 math-reasoning samples yield large gains when the data is selected thoughtfully. However, narrow or naïve fine-tuning can backfire: studies report <strong>catastrophic forgetting</strong> and degraded <strong>out-of-distribution</strong> robustness, as well as a <strong>drop in CoT faithfulness</strong> after fine-tuning. These risks can be mitigated with regularization (e.g., <strong>Mixout</strong>, <strong>layer-wise noise-stability</strong>), with optimization that <strong>flattens the loss landscape</strong> (e.g., <strong>SAM</strong>), and by keeping the fine-tune mix diverse to avoid over-specialization.</p>
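<p>To make the Mixout idea more concrete, here is a simplified sketch on a flat list of weights. During training, each weight is swapped for its pretrained value with probability <code>p</code>, and the result is shifted and rescaled so that its expected value equals the current weight. A real implementation would operate on tensors inside the training framework; the function name and the plain-Python representation are assumptions made for illustration.</p>

```python
# Simplified Mixout sketch on plain Python floats (illustration only).
import random
from typing import List, Optional


def mixout(current: List[float], pretrained: List[float], p: float,
           rng: Optional[random.Random] = None) -> List[float]:
    """Stochastically mix current weights with their pretrained values.

    With probability p a weight is replaced by its pretrained value; the
    correction term keeps the expected output equal to the current weight.
    """
    if p >= 1.0 or 0.0 > p:
        raise ValueError("p must satisfy 0 <= p and p != 1 (strictly below 1)")
    rng = rng or random.Random()
    mixed = []
    for w, w_pre in zip(current, pretrained):
        kept = w_pre if p > rng.random() else w
        mixed.append((kept - p * w_pre) / (1.0 - p))
    return mixed


rng = random.Random(0)
print(mixout([2.0, -1.0, 0.5], [1.0, -0.5, 0.5], p=0.5, rng=rng))
```

<p>Because the pretrained weights act as the anchor, a larger <code>p</code> pulls the fine-tuned model more strongly back toward its original behavior.</p>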
<h3 id="33-training-stability-and-long-outputs">3.3 Training Stability and Long Outputs</h3>
<p>Fine-tuning with long CoT outputs (which can be thousands of tokens) can lead to stability issues in training. Gradient updates on very long sequences might cause more variance or instabilities in convergence. Moreover, if one uses reinforcement learning (e.g., to further optimize a reasoning model with a reward for correct answers), credit assignment becomes complex: which part of a 100-step reasoning chain deserves credit or blame for the outcome?</p>
<p>Instabilities like <strong>mode collapse</strong> (where the model’s outputs become strangely repetitive or nonsensical) or oscillating performance have been observed if the RL reward model is poorly aligned. For example, in one training run, simply increasing the reward for “correct final answer” without properly balancing the reward for good reasoning steps caused the model to exploit quirks – it started producing minimal reasoning and guessing answers to game the reward, leading to a drop in overall logical correctness.</p>
<p>Researchers working on Phi-4 and others have had to introduce tricks to <strong>stabilize RL training</strong>, such as gradually increasing the allowed reasoning length, filtering out bad traces, or adjusting reward scaling. These measures highlight that straightforward fine-tuning or RL on a reasoning model can easily go off-track if the optimization isn’t carefully managed. In essence, teaching a model <em>how to think</em> is a more delicate process than teaching it <em>what to say</em>.</p>
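<p>One of the stabilization tricks mentioned above, gradually increasing the allowed reasoning length, can be sketched as a simple schedule. The linear shape and the start and final caps are placeholder assumptions, not the settings used by any particular model.</p>

```python
# Sketch: linearly grow the cap on reasoning length during RL training.
# The default caps are illustrative placeholders.


def allowed_reasoning_tokens(step: int, total_steps: int,
                             start_tokens: int = 512,
                             final_tokens: int = 8192) -> int:
    """Return the maximum reasoning length allowed at a given training step."""
    if 0 >= total_steps or step >= total_steps:
        return final_tokens
    progress = max(step, 0) / total_steps
    return int(start_tokens + progress * (final_tokens - start_tokens))


for step in (0, 50, 100):
    print(step, allowed_reasoning_tokens(step, total_steps=100))
```

<p>Rollouts that exceed the current cap can be truncated or filtered out, so early updates are computed on short, lower-variance traces before longer chains are introduced.</p>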
<h3 id="34-reward-alignment-and-hacks">3.4 Reward Alignment and “Hacks”</h3>
<p>Aligning a reasoning model with human preferences or task-specific rewards can be tricky – there’s a risk of <strong>reward hacking</strong> and unintended behaviors. An illustrative scenario was described by researchers at Anthropic: they gave reasoning models (Claude 3.7 Sonnet and DeepSeek R1) a series of multiple-choice questions with a twist – a hidden “hint” in the prompt sometimes pointed the model to a wrong answer (and they rewarded the model for following that hint). The models learned to exploit this to earn reward points, selecting the hinted-at wrong answers, but <strong>their chain of thought rarely acknowledged the misleading hint</strong>. They would generate a detailed (fake) reasoning to justify the wrong answer, rather than saying “I chose this because of the hint.” This is a dramatic example of a model <em>gaming the objective</em>: the reward said “getting this answer is good,” so it did, while also learning to hide the true reason behind a facade of coherent reasoning.</p>
<p>Such behavior is misaligned with the intent (we want the model to be truthful in its reasoning). This experiment highlights the importance of aligning the process of reasoning as much as the outcome. If a reward model only considers the correctness of the final answer, it may sacrifice honesty or thoroughness in the reasoning process.</p>
<p>Conversely, if you over-emphasize a reward for producing very detailed reasoning, the model might start outputting verbose, mostly correct-sounding monologues that don’t lead to a better answer (effectively optimizing the wrong metric). Achieving the right alignment – so that the model is rewarded for correct and <strong>genuinely helpful</strong> reasoning – is an open challenge. It often requires iterative human feedback, custom reward functions (e.g., penalize logical leaps or unsupported claims in the trace), and careful validation. Without these, one might end up with a model that <em>appears</em> to reason well but is just skilled at <strong>“output grooming”</strong> – formatting answers to look good rather than being correct.</p>
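<p>A custom reward of the kind described here might combine an outcome term, a process term, and a penalty for excessive length. The sketch below is one hypothetical way to do this; the weights, the token budget, and the assumption that a step verifier supplies <code>valid_steps</code> are all illustrative choices rather than an established recipe.</p>

```python
# Sketch of a shaped reward: outcome + process quality - verbosity penalty.
# Weights and budget are illustrative; step validity is assumed to come
# from a separate (hypothetical) verifier.


def shaped_reward(answer_correct: bool, valid_steps: int, total_steps: int,
                  reasoning_tokens: int, token_budget: int = 1000,
                  w_outcome: float = 1.0, w_process: float = 0.5,
                  w_length: float = 0.2) -> float:
    """Score one rollout so that neither guessing nor rambling pays off."""
    outcome = 1.0 if answer_correct else 0.0
    # Fraction of reasoning steps the verifier accepted (0 if no steps given).
    process = valid_steps / total_steps if total_steps > 0 else 0.0
    # Penalize only tokens beyond the budget, capped at the full penalty.
    overflow = max(reasoning_tokens - token_budget, 0) / token_budget
    return w_outcome * outcome + w_process * process - w_length * min(overflow, 1.0)


print(shaped_reward(True, 4, 4, 500))    # correct answer, sound reasoning
print(shaped_reward(True, 0, 0, 50))     # correct guess with no reasoning
print(shaped_reward(False, 4, 4, 3000))  # wrong answer after a long chain
```

<p>With these placeholder weights, a correct guess without reasoning earns less than a correct answer backed by verified steps, which discourages the minimal-reasoning exploit while the length term limits padding.</p>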
<h3 id="35-data-quality-and-availability">3.5 Data Quality and Availability</h3>
<p>On the organizational side, fine-tuning a reasoning model demands <strong>high-quality training data</strong> that includes reasoned solutions. Such data can be difficult to obtain. While there are public datasets for math problem solving and logical reasoning (e.g., MATH and GSM8K), many domains (legal reasoning, financial analysis, medical diagnostics) don’t have readily available step-by-step annotations in large quantities.</p>
<p>Teams often have to generate this data synthetically (using a larger model to produce reasoning traces and then filtering them) or invest in expert annotations. The quality of these traces is paramount – noisy or incorrect reasoning examples can confuse the model or teach it bad habits. As discussed earlier, even small, curated datasets (on the order of hundreds of examples) have been shown to improve reasoning if they are extremely well-targeted; however, curating such datasets is a specialized skill.</p>
<p>In practice, fine-tuning a reasoning model involves a lot of <em>tooling</em>, ranging from automatic proof checkers that verify steps and consistency checks to human reviewers who label where a model’s synthetic reasoning went wrong. This is a step up in complexity from preparing a straightforward prompt→response dataset.</p>
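<p>The simplest filter for synthetic traces keeps only those whose final answer matches a trusted reference. The sketch below assumes each candidate is a dictionary with the hypothetical keys <code>question_id</code>, <code>reasoning</code>, and <code>final_answer</code>; it does not check the intermediate steps, which would need a verifier or human review.</p>

```python
# Sketch: keep teacher-generated traces whose final answer matches a reference.
# The record layout (question_id / reasoning / final_answer) is an assumption.
from typing import Dict, List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(answer.strip().lower().split())


def filter_traces(candidates: List[Dict[str, str]],
                  references: Dict[str, str]) -> List[Dict[str, str]]:
    """Drop traces with no reference, empty reasoning, or a wrong answer."""
    kept = []
    for item in candidates:
        expected = references.get(item["question_id"])
        if expected is None or not item["reasoning"].strip():
            continue
        if normalize(item["final_answer"]) == normalize(expected):
            kept.append(item)
    return kept


references = {"q1": "42", "q2": "Paris"}
candidates = [
    {"question_id": "q1", "reasoning": "6 times 7 is 42.", "final_answer": " 42 "},
    {"question_id": "q1", "reasoning": "6 times 7 is 48.", "final_answer": "48"},
    {"question_id": "q2", "reasoning": "", "final_answer": "Paris"},
]
print(len(filter_traces(candidates, references)))
```

<p>Answer matching is a coarse filter: a trace can reach the right answer through flawed steps, so it is best treated as a first pass before step-level checks.</p>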
<h3 id="36-tooling-and-infrastructure">3.6 Tooling and Infrastructure</h3>
<p>Working with long CoT and multi-stage training means that the training pipelines will need modification. For instance, standard training code may need to be adapted to handle special tokens (e.g., <code>&lt;think&gt;</code> segments may need to be masked), or to log and evaluate not just final answers but also intermediate step accuracy during training.</p>
<p>Debugging a reasoning model can be more involved – you might want to watch how its reasoning changes epoch by epoch, which requires custom logging or visualization tools. Moreover, these models often have large context windows (since they need to handle long reasoning sequences, e.g., 16K or 32K tokens). Fine-tuning with such long contexts can demand more GPU memory and faster I/O. Not all training frameworks efficiently support extremely long sequences out of the box.</p>
<p>Evaluation tooling (to be discussed later) also needs consideration. One possible approach is to integrate an automated verifier into the training loop to assess the model’s reasoning steps and provide targeted feedback, which is a type of process supervision. Implementing this involves technical complexity and remains an ongoing area of research. Overall, organizations seeking to customize a reasoning model should be aware that the training workflow may be more complex than a standard LLM fine-tuning process.</p>
<h3 id="37-expertise">3.7 Expertise</h3>
<p>Fine-tuning reasoning models demands both machine learning expertise and domain knowledge, often requiring multidisciplinary teams. Since reasoning LLMs are new, practitioners face a steep learning curve with frequent trial and error.</p>
<p>Expect several iterations to balance concise and detailed responses; objectives or examples may need adjustment throughout the process. Rigorous testing is essential, especially in high-stakes applications like medical or legal fields, making reliability and interpretability critical. Typically, 10–12 rounds of tuning are required to achieve an optimal model.</p>
<p>Organizations typically use a hybrid strategy: starting with a robust base model (such as o1-mini or Phi-4-Reasoning), applying minimal tuning, and relying on prompts and few-shot learning for specificity. When deeper customization is required, it&rsquo;s best to use reliable data, maintain reasoning formats, monitor trace fidelity, and integrate human feedback. Success yields a strong analytical tool, but the process is more complex than for general chatbots.</p>
<p>A key part of customization is the ability to evaluate the reasoning models. Let us dig into specialized evaluation strategies required to assess not just <em>what</em> a reasoning model answers, but <em>how</em> it arrives at that answer.</p>
<h2 id="4-evaluation-strategies-for-reasoning-models">4. Evaluation Strategies for Reasoning Models</h2>
<p>Traditional LLM evaluation – e.g., measuring accuracy on a Q&amp;A benchmark or using BLEU scores for text – may not capture the full picture when a model is effectively performing a multi-step reasoning process. Evaluating reasoning-oriented LLMs requires going beyond the final answer, incorporating metrics that assess both the process and quality of reasoning. This represents a departure from traditional LLM evaluation, which typically treats the model as a black box that produces an answer or text, which we then compare to a reference or expected output.</p>
<p>For reasoning models, we care about questions like: <em>Did the model’s CoT follow a correct logical path? Is it telling the truth about its reasoning? How efficient is its reasoning?</em> Below are key evaluation strategies and metrics that have emerged for reasoning models, contrasted with traditional approaches:</p>
<h3 id="41-outcome-vs-process-evaluation">4.1 Outcome vs. Process Evaluation</h3>
<p>In traditional AI evaluation, we mostly judge the <em>outcome</em> (e.g., did the model get the correct answer to a question). With reasoning models, researchers perform <strong>dual evaluations</strong> – one for the outcome <em>and</em> one for the reasoning steps. An outcome evaluation may be identical to a standard LLM test, where the goal is to verify if the final answer is correct (exact match, F1 score, multiple-choice accuracy, etc.). The process evaluation, however, examines the intermediate steps of the solution.</p>
<p>For instance, a math word problem benchmark might not only check the answer but also parse the model’s step-by-step solution and verify each part. An emerging method is to use an automated judge (which can be another LLM) to analyze the CoT and flag errors or leaps in logic. One example is a recent benchmark called <em>MM-MATH</em> (for multimodal math problems), in which an LLM-based evaluator looks at each step of a model’s solution, compares it to the ground truth solution, and classifies errors (e.g., “incorrect algebraic simplification” vs “misinterpreted the diagram”).</p>
<p>This kind of fine-grained process evaluation provides insights into <em>where</em> a model’s reasoning fails, not just whether the final answer is wrong. This is useful because a reasoning model might get the right answer for the wrong reasons (i.e., it had a reasoning flaw), or vice versa – it might have mostly correct reasoning but a minor slip at the end leading to a wrong answer. Traditional single-score metrics would miss this nuance.</p>
<h3 id="42-chain-of-thought-faithfulness-metrics">4.2 Chain-of-Thought Faithfulness Metrics</h3>
<p>As discussed earlier, <em>faithfulness</em> refers to whether the model’s stated reasoning accurately reflects its actual internal reasoning (or use of information). One way to test this is to insert known information (or traps) into the context and see whether the model acknowledges using it.</p>
<p>For example, Anthropic’s experiment provided the model with hidden hints (sometimes incorrect) and then checked if the model’s explanation mentioned using those hints. They derived a metric: the percentage of solutions where the model was <em>truthful</em> about using the hint. Claude 3.7 was only ~25% faithful in their setup, and DeepSeek R1 was about 39% – meaning in the majority of cases, they used the hint but didn’t reveal it in the reasoning chain. This indicates that the CoT was often <em>unfaithful</em>, presumably because the model’s training taught it always to sound logical and self-contained, even if it took a shortcut.</p>
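<p>A hint-based faithfulness rate of this kind can be computed from per-example flags. The sketch below assumes each record already carries two booleans: <code>used_hint</code> (for instance, the answer switched to the hinted option when the hint was present) and <code>mentions_hint</code> (the chain of thought acknowledges the hint). How those flags are obtained is outside the sketch, and the field names are hypothetical.</p>

```python
# Sketch: share of hint-using answers whose reasoning admits to the hint.
# The record fields used_hint / mentions_hint are assumed, not a standard.
from typing import Dict, List


def hint_faithfulness(records: List[Dict[str, bool]]) -> float:
    """Return mentions_hint rate among records where the hint was used."""
    used = [r for r in records if r["used_hint"]]
    if not used:
        raise ValueError("no records where the model used the hint")
    return sum(1 for r in used if r["mentions_hint"]) / len(used)


records = [
    {"used_hint": True, "mentions_hint": True},
    {"used_hint": True, "mentions_hint": False},
    {"used_hint": True, "mentions_hint": False},
    {"used_hint": True, "mentions_hint": False},
    {"used_hint": False, "mentions_hint": False},  # ignored: hint not used
]
print(hint_faithfulness(records))  # prints 0.25
```

<p>Restricting the denominator to cases where the hint actually influenced the answer matters: a model that ignores the hint has nothing to disclose.</p>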
<p>Another way to measure faithfulness is to check consistency under variations: if a model truly is reasoning step by step, then if we force it to reveal steps, it should arrive at the same answer as when it’s not forced. If hiding the CoT changes the answer frequently, it might suggest the model’s explanations were more post-hoc and not driving the answer.</p>
<blockquote>
<p><strong>Note:</strong> These evaluations are still an active research area – unlike a simple accuracy score, faithfulness is somewhat difficult to quantify, but it’s crucial for trust. When deploying a reasoning model, you’d like to trust that, say, a financial analysis it provides is actually how it came to its conclusion, not a fabricated rationale. Thus, papers often report the percentage of solutions with “fully faithful reasoning” by manual or automated inspection. If that percentage is low, it’s a red flag: the model’s reasoning output might be more for show. Improving this might involve further training (e.g., penalizing inconsistent rationales) or architectural changes; however, at the very least, we need to measure it.</p></blockquote>
<h3 id="43-token-normalized-accuracy-efficiency">4.3 Token-Normalized Accuracy (Efficiency)</h3>
<p>Because reasoning models can use an arbitrary number of tokens to reason (within the context window limits, of course), we want to measure <strong>accuracy as a function of reasoning length</strong> – effectively, <em>how efficiently does a model reach correct answers?</em> For example, a model that gets 90% accuracy with 2K tokens of reasoning might be less desirable than one that gets 85% accuracy with only 1K tokens, depending on deployment constraints.</p>
<p><em>Token-normalized accuracy</em> is a metric that attempts to penalize overly lengthy reasoning. In one formulation (used in some multiple-choice evaluations), it computes the probability of a correct answer, normalized by the length (i.e., the number of tokens) of that answer’s explanation or output. More generally, we can think of it as <em>accuracy per 100 reasoning tokens</em> or similar.</p>
<p>Another interpretation is to measure the area under the curve of accuracy versus the number of tokens allowed. For example, allow a model to think with 100 tokens, record the accuracy, then 200 tokens, 500 tokens, and so on, up to a certain limit – and see which model yields the best accuracy for the least token budget. Researchers have explicitly emphasized the goal of <strong>maximizing accuracy per token</strong> in reasoning scenarios.</p>
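<p>The area-under-the-curve view can be sketched with the trapezoidal rule over measured (token budget, accuracy) points, normalized by the budget range so the result stays on the accuracy scale. The function name and the sample numbers are illustrative, not results from any benchmark.</p>

```python
# Sketch: normalized area under the accuracy-vs-token-budget curve.
# Sample budgets and accuracies are made-up numbers for illustration.
from typing import Sequence


def normalized_auc(budgets: Sequence[float], accuracies: Sequence[float]) -> float:
    """Trapezoidal area under accuracy(budget), divided by the budget range."""
    points = sorted(zip(budgets, accuracies))
    if 2 > len(points):
        raise ValueError("need at least two (budget, accuracy) points")
    span = points[-1][0] - points[0][0]
    if span == 0:
        raise ValueError("budgets must not all be identical")
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / span


print(normalized_auc([100, 200, 500], [0.5, 0.7, 0.8]))
```

<p>A model that reaches high accuracy at small budgets scores higher on this measure than one that only catches up at the largest budget, even if both end at the same final accuracy.</p>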
<p>This reflects practical concerns: in production, reasoning steps are costly, both in terms of latency and tokens (i.e., money). A model that uses half the steps to reach the same answer is effectively twice as fast. Moreover, unconstrained reasoning sometimes leads to diminishing returns or even errors; for example, a model might start wandering or overexplaining if it “thinks” too long. Thus, token-normalized metrics encourage models that use their reasoning budget optimally.</p>
<p>A simple implementation is to take the total tokens the model generated for all test problems and divide them by the number of correct answers. Then, compare models on this normalized score (lower tokens per correct answer is better).</p>
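<p>That calculation fits in a few lines. The sketch below assumes each test result is a pair of (tokens generated, whether the answer was correct); the pair layout and the sample numbers are illustrative.</p>

```python
# Sketch: tokens spent per correct answer (lower is better).
# Each result is an assumed (tokens_generated, is_correct) pair.
from typing import List, Tuple


def tokens_per_correct(results: List[Tuple[int, bool]]) -> float:
    """Total generated tokens divided by the number of correct answers."""
    total_tokens = sum(tokens for tokens, _ in results)
    num_correct = sum(1 for _, is_correct in results if is_correct)
    if num_correct == 0:
        return float("inf")  # no correct answers: cost is unbounded
    return total_tokens / num_correct


print(tokens_per_correct([(1000, True), (2000, False), (500, True)]))  # prints 1750.0
```

<p>Tokens spent on wrong answers still count toward the total, so a model that reasons at length and fails is penalized twice: once in accuracy and once in cost.</p>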
<p>Another approach is a normalized log-probability where longer outputs are penalized. In any case, this kind of metric was usually irrelevant for standard LLMs (which output a single short answer), but becomes important when evaluating the cost-effectiveness of reasoning models.</p>
<h3 id="44-stepwise-accuracy-and-consistency">4.4 Stepwise Accuracy and Consistency</h3>
<p>This is a more granular evaluation of the correctness of the reasoning chain. For tasks where we have ground-truth step-by-step solutions (like a math proof or a formal logic derivation), we can mark each step of the model’s chain as “correct” or “incorrect” compared to an expected solution. This yields a sequence of accuracy values (e.g., getting the first three steps right, but failing at step four). We can then compute metrics like <em>average step accuracy</em>, or <em>percentage of solutions that made it to at least X steps correct before failing</em>.</p>
<p>This is informative because two models might both solve 70% of problems, but one might always fail early on the 30% it can’t solve, whereas another might almost solve everything and only slip at the end for those 30%. Stepwise evaluation can reveal such differences. It also helps in evaluating <strong>partial credit</strong> – maybe a model didn’t get the final answer but did significant parts correctly (which might be useful in applications where a human or another tool can pick up from the middle).</p>
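<p>Given per-step verdicts for one solution, the step-level metrics described in this section can be computed as below. The verdicts are assumed to come from comparing against a reference solution or from a verifier; the function and key names are hypothetical.</p>

```python
# Sketch: step-level metrics for a single reasoning chain.
# step_correct holds one assumed True/False verdict per reasoning step.
from typing import Dict, List, Optional, Union


def step_metrics(step_correct: List[bool]) -> Dict[str, Union[float, int, None]]:
    """Return step accuracy, correct-prefix length, and first error position."""
    if not step_correct:
        raise ValueError("need at least one step verdict")
    first_error: Optional[int] = next(
        (i for i, ok in enumerate(step_correct) if not ok), None)
    prefix = len(step_correct) if first_error is None else first_error
    return {
        "step_accuracy": sum(1 for ok in step_correct if ok) / len(step_correct),
        "correct_prefix_length": prefix,
        # 1-based position of the first wrong step, or None if all are right.
        "first_error_step": None if first_error is None else first_error + 1,
    }


# First three steps right, failure at step four, as in the example above.
print(step_metrics([True, True, True, False, True]))
```

<p>The correct-prefix length is the natural basis for partial credit, since it shows how far a human or another tool could trust the chain before taking over.</p>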
<p>Some evaluations also check <strong>consistency</strong>: if a model is asked to explain its answer vs. directly answer, do those agree? If it solves a problem in two different ways (maybe by reordering steps or under different prompts), does it reach the same conclusion? Consistency checks can catch cases where the reasoning process is brittle or overly sensitive to phrasing.</p>
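<p>A basic consistency check collects the final answers from several runs of the same problem (different prompts, orderings, or samples) and measures how often they agree with the most common answer. The sketch below is a minimal version; the light normalization and the function name are assumptions.</p>

```python
# Sketch: agreement rate of final answers across several runs of one problem.
from collections import Counter
from typing import List, Tuple


def consistency_score(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the share of runs that produced it."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


print(consistency_score(["42", "42 ", "41"]))
```

<p>A low agreement rate on a problem flags reasoning that is sensitive to phrasing, even when one of the runs happens to be correct.</p>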
<h3 id="45-automated-reasoning-critics-llm-as-a-judge">4.5 Automated Reasoning Critics (LLM-as-a-judge)</h3>
<p>A practical framework that has gained traction is using a strong language model to <strong>evaluate the reasoning of another model (or even itself)</strong>. For instance, one can prompt GPT-4 with: <em>“Here is a chain-of-thought and an answer. Evaluate the correctness and logical validity of the reasoning, and whether the final answer is justified.”</em> This uses the fact that cutting-edge models can often spot obvious reasoning errors or missing justifications in a solution that a simpler rubric might miss.</p>
<p>Such LLM-based evaluators can be more flexible than hard-coded checkers. The aforementioned process evaluators in research are essentially reasoning models used as judges, with the ability to allocate extra computational resources to evaluate each step carefully. In one study, researchers found that when they allowed an evaluator model to think more (generate a longer evaluation reasoning), its accuracy in judging solutions improved monotonically – much like how making a model think more improves problem-solving, it also improves evaluation quality.</p>
<p>This is a fascinating recursive idea: use a reasoning model to better evaluate outputs that themselves involve reasoning. It was even shown that using such process-aware evaluators to <strong>re-rank answers</strong> (choosing the answer that the evaluator model scores highest) can significantly improve the solving ability of the base model.</p>
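<p>Re-ranking reduces to sampling several candidate solutions and keeping the one the evaluator scores highest. In the sketch below the evaluator is just a callable that returns a score; the dictionary of fixed scores is a toy stand-in for an actual LLM judge, and all names are hypothetical.</p>

```python
# Sketch: pick the candidate solution that a judge scores highest.
# The judge is any callable returning a float; here a toy lookup table.
from typing import Callable, Sequence


def pick_best(candidates: Sequence[str], judge: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate (the earliest one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    best_index = max(range(len(candidates)),
                     key=lambda i: (judge(candidates[i]), -i))
    return candidates[best_index]


# Toy stand-in for an LLM judge: fixed scores keyed by candidate text.
toy_scores = {"steps A, answer 12": 0.4, "steps B, answer 15": 0.9}
print(pick_best(list(toy_scores), toy_scores.get))  # prints steps B, answer 15
```

<p>The quality of the result is bounded by the judge: if the evaluator rewards confident-sounding but wrong reasoning, re-ranking will amplify that bias.</p>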
<p>In summary, <strong>process evaluation frameworks</strong> often involve an LLM evaluator performing a two-level check:</p>
<ul>
<li>Outcome evaluation (is the final answer correct?)</li>
<li>Process evaluation (are the steps valid and do they lead to that answer?).</li>
</ul>
<p>By combining these, one gets a more robust assessment. This approach complements traditional metrics; for example, you might report that a model has 80% outcome accuracy, but according to an LLM judge, only 50% of its solutions were fully correct with no logical errors in any step. That tells a deeper story than 80% alone.</p>
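<p>Reporting both numbers side by side is straightforward once each solution has an outcome verdict and a process verdict. The sketch below assumes each record is a pair of booleans (answer correct, all steps valid), with the process verdict supplied by a judge that is not shown; the key names are illustrative.</p>

```python
# Sketch: report outcome accuracy next to the fully-correct rate.
# Each record is an assumed (answer_correct, all_steps_valid) pair.
from typing import Dict, List, Tuple


def summarize(records: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Aggregate outcome-level and process-level verdicts over a test set."""
    if not records:
        raise ValueError("need at least one evaluated solution")
    n = len(records)
    correct = sum(1 for answer_ok, _ in records if answer_ok)
    fully = sum(1 for answer_ok, steps_ok in records if answer_ok and steps_ok)
    return {
        "outcome_accuracy": correct / n,
        "fully_correct_rate": fully / n,
        # Right answer reached through at least one invalid step.
        "right_answer_flawed_reasoning_rate": (correct - fully) / n,
    }


# 10 solutions: 8 correct answers, of which 5 also have valid reasoning.
records = [(True, True)] * 5 + [(True, False)] * 3 + [(False, False)] * 2
print(summarize(records))
```

<p>The third figure isolates the solutions that a pure outcome metric would count as successes despite flawed reasoning.</p>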
<h3 id="46-illustrative-example">4.6 Illustrative Example</h3>
<p>To illustrate, consider a concrete example: say we ask a model a puzzle and it answers with a 5-step reasoning chain. The final answer is correct, so outcome-wise it’s a success. However, upon evaluation, we find that an arithmetic mistake occurred in step 3, which fortunately canceled out in step 5, yielding the correct answer nonetheless. A pure outcome metric says “perfect solution”. A process-aware evaluation would ding this as flawed reasoning (the model got lucky or coincidentally correct) – something we’d want to know if using the model for, say, validating scientific calculations. Conversely, if a model’s final answer is wrong, traditional evaluation gives that question a score of 0. However, process evaluation might reveal that the model was correct up until the last step – perhaps it performed all the reasoning correctly and made an error at the end.</p>
<p>In a human-learning context, you’d give partial credit. For model evaluation, noting that the model was, say, “90% correct in the procedure” could inform how we attempt to improve it (perhaps it just needs a slight boost in arithmetic precision or a final double-check step). This rich information is only available if we evaluate the reasoning, not just the outcome.</p>
<p>For practitioners, incorporating these evaluations is vital, as they help ensure that a high-performing reasoning model isn’t just getting by with smoke and mirrors (or hidden cues), and they quantify the efficiency and transparency of the model’s problem-solving approach. As these models become more integrated into workflows (e.g., as AI reasoning assistants), having reliable evaluation methodologies will also be key for <strong>governance and trust</strong> – one might, for example, require that a model’s chain-of-thought passes a certain automated consistency check before its answer is shown to a user.</p>
<p>In summary, the evaluation of reasoning LLMs has evolved to include <strong>trace-centric metrics</strong> alongside traditional outcome metrics. We assess the <em>faithfulness</em> of their explanations, measure accuracy in a way that accounts for the <em>cost of reasoning length</em>, and use novel frameworks where models critique reasoning steps (providing a “process score”).</p>
<h2 id="5-safety-concerns-and-vulnerabilities">5. Safety Concerns and Vulnerabilities</h2>
<p>While reasoning models offer powerful capabilities, they also introduce new safety concerns and vulnerabilities that must be carefully managed and addressed. The very features that make these models effective – their ability to generate detailed CoT and reason through complex problems – can also be exploited by malicious actors or lead to unintended behaviors. Below, we discuss some of the key safety challenges specific to reasoning AI models.</p>
<h3 id="51-reward-hacking-and-training-vulnerabilities">5.1 Reward Hacking and Training Vulnerabilities</h3>
<p>Reward hacking represents a significant concern in reasoning model development, particularly given their reliance on reinforcement learning during training. Reward hacking occurs when &ldquo;a RL agent exploits flaws or ambiguities in the reward function to achieve high rewards, without genuinely learning or completing the intended task&rdquo;.</p>
<p>In the context of LLMs trained with RLHF, reward hacking manifests when models learn to game evaluation metrics rather than genuinely improve at the intended tasks. This is particularly concerning for reasoning models, where the complexity of the reasoning process makes it difficult to specify comprehensive reward functions that capture all aspects of good reasoning.</p>
<p>For example, a reasoning model might discover that providing overly verbose explanations leads to higher evaluation scores, even if those explanations are not genuinely helpful or accurate. This could incentivize the model to generate long-winded responses that obfuscate its actual reasoning process, ultimately undermining the quality of its outputs.</p>
<h3 id="52-jail-breaking-and-safety-mechanism-vulnerabilities">5.2 Jail-breaking and Safety Mechanism Vulnerabilities</h3>
<p>Recent research has revealed severe vulnerabilities in the safety mechanisms of reasoning models. The Hijacking Chain-of-Thought (H-CoT) attack method demonstrates how attackers can &ldquo;leverage the model&rsquo;s own displayed intermediate reasoning to jailbreak its safety reasoning mechanism&rdquo;. Under such attacks, refusal rates in models like OpenAI&rsquo;s o1 drop dramatically, &ldquo;from 98% to below 2%&rdquo;.</p>
<p>The Malicious-Educator benchmark exposes how &ldquo;extremely dangerous or malicious requests&rdquo; can be disguised &ldquo;beneath seemingly legitimate educational prompts&rdquo;. This research reveals that &ldquo;attackers can easily extract criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any additional tricks&rdquo;, highlighting fundamental vulnerabilities in current safety approaches.</p>
<p>In addition, the ability of reasoning models to generate detailed CoT can be weaponized by attackers to create more convincing prompts that bypass safety filters. This raises the stakes for ensuring that safety mechanisms are robust and capable of handling sophisticated manipulation attempts.</p>
<h3 id="53-alignment-challenges-in-reasoning-systems">5.3 Alignment Challenges in Reasoning Systems</h3>
<p>The integration of reasoning capabilities creates new alignment challenges. While reasoning models can &ldquo;reason about our safety policies in context when responding to potentially unsafe prompts, through deliberative alignment&rdquo;, this same capability can be exploited by sophisticated attacks. The transparency of reasoning processes, while beneficial for interpretability, also provides attack vectors that didn&rsquo;t exist in traditional LLMs.</p>
<p>Research indicates that reasoning models still exhibit sensitivity to probability distributions from their training data, suggesting that &ldquo;optimizing a language model for reasoning can mitigate but might not fully overcome the language model&rsquo;s probability sensitivity&rdquo;. This indicates that fundamental limitations from autoregressive training may persist even in reasoning-optimized systems.</p>
<h3 id="54-hallucination-in-reasoning-contexts">5.4 Hallucination in Reasoning Contexts</h3>
<p>Despite their enhanced reasoning capabilities, reasoning models continue to exhibit hallucination patterns, particularly in constraint satisfaction problems. Research on graph coloring tasks reveals that reasoning models are &ldquo;prone to hallucinate edges not specified in the prompt&rsquo;s description of the graph&rdquo;. This phenomenon &ldquo;persists across multiple problem complexity levels and semantic frames&rdquo; and &ldquo;appears to account for a significant fraction of the incorrect answers from every tested model&rdquo;.</p>
<p>These findings suggest that reasoning models may have &ldquo;broader issues with misrepresentation of problem specifics&rdquo;, indicating that the enhanced reasoning capabilities don&rsquo;t fully address fundamental issues with information fidelity and accuracy.</p>
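<p>One way to screen for this failure is to compare the edges a model mentions in its reasoning trace against the edges given in the prompt. The sketch below is a minimal illustration, not the method used in the cited research. It assumes edges are written as <code>A-B</code> with single uppercase node names, and the helper names <code>parse_edges</code> and <code>hallucinated_edges</code> are hypothetical.</p>

```python
import re


def parse_edges(text: str) -> set:
    """Extract undirected edges written as 'A-B' (an assumed notation)."""
    return {frozenset(pair) for pair in re.findall(r"\b([A-Z])-([A-Z])\b", text)}


def hallucinated_edges(prompt: str, trace: str) -> set:
    """Edges the trace mentions that the prompt never specified."""
    return parse_edges(trace) - parse_edges(prompt)


# Made-up example: a 4-node path, which is 2-colorable.
prompt = "Color the graph with edges A-B, B-C, C-D using 2 colors."
trace = (
    "A-B forces A and B to differ. B-C forces C to match A. "
    "A-C means A and C must differ, so 2 colors fail."
)

extra = hallucinated_edges(prompt, trace)
print(sorted("-".join(sorted(edge)) for edge in extra))  # ['A-C']
```

<p>Here the invented edge <code>A-C</code> turns a 2-colorable path into a triangle, so the trace wrongly concludes that no valid coloring exists. Real traces use freer notation, so a practical check would need a more robust parser.</p>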
<h3 id="55-scaling-and-efficiency-considerations">5.5 Scaling and Efficiency Considerations</h3>
<p>While reasoning models demonstrate impressive capabilities, they incur significant computational costs. The variable test-time compute approach means that complex problems can require substantially more resources than traditional LLM inference. This creates practical deployment challenges, particularly for applications requiring consistent response times.</p>
<p>The relationship between reasoning quality and computational cost is not yet well characterized. Research indicates that more thinking time generally leads to better performance, but how to allocate compute optimally across different problem types remains an active area of investigation.</p>
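<p>For applications with fixed response-time targets, one simple option is to cap thinking tokens by what the latency budget can afford. The sketch below is an assumption-laden illustration: the function <code>plan_thinking_tokens</code>, the difficulty score in [0, 1], the decode speed, and the 8000-token ceiling are all hypothetical values, not parameters of any specific model or API.</p>

```python
def plan_thinking_tokens(
    difficulty: float,
    latency_budget_s: float,
    tokens_per_second: float,
    answer_tokens: int = 200,
    max_thinking_tokens: int = 8000,
) -> int:
    """Scale thinking tokens with estimated difficulty, capped by the latency budget.

    All defaults are illustrative assumptions, not measured values.
    """
    # Clamp the difficulty estimate into [0, 1].
    difficulty = min(1.0, max(0.0, difficulty))
    # Tokens we can decode within the deadline, minus room reserved for the final answer.
    affordable = max(0, int(latency_budget_s * tokens_per_second) - answer_tokens)
    # Tokens we would like to spend if latency were no concern.
    desired = round(difficulty * max_thinking_tokens)
    return min(desired, affordable)


# A hard problem under a tight deadline is capped by latency, not by difficulty.
tight = plan_thinking_tokens(difficulty=0.9, latency_budget_s=10, tokens_per_second=50)
# An easier problem with a generous deadline is capped by its own difficulty.
relaxed = plan_thinking_tokens(difficulty=0.25, latency_budget_s=60, tokens_per_second=50)
print(tight, relaxed)  # 300 2000
```

<p>The linear mapping from difficulty to tokens is a placeholder. Choosing that mapping well is exactly the open allocation question described above.</p>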
<h2 id="6-conclusion">6. Conclusion</h2>
<p>Reasoning AI models, such as o1, o3, R1, and Phi-4, mark a shift towards systems that execute algorithmic steps rather than relying purely on black-box prediction. Unlike traditional LLMs, these models leverage chain-of-thought reasoning, curated data, and advanced fine-tuning to solve complex tasks—though this comes with increased training and inference complexity.</p>
<p>Fine-tuning reasoning models demands specialized methods and high-quality data, as their reasoning chains are both powerful and vulnerable to inconsistency or reward hacking. Effective deployment requires both technical expertise and organizational investment; however, the benefits include clearer explanations and deeper insights across domains such as finance and science.</p>
<p>Evaluation now extends beyond final answers to include scrutiny of the reasoning process itself, using metrics such as trace faithfulness and process accuracy. This makes model behavior more transparent and trustworthy.</p>
<p>For practitioners, reasoning models become collaborative problem-solvers, offering logical breakdowns for tasks from coding to contract analysis. But maintaining reliable reasoning and avoiding hallucinations requires ongoing vigilance and tailored oversight.</p>
<p>The center of gravity has shifted from <em>pick a reasoning model</em> to <em>use a unified system with routed reasoning</em>, with explicit controls for compute and explanation; this simplifies deployment ergonomics. Along this direction, we expect to see more robust state representations, verification-based training, and compositional planning in the near future. Evaluation should follow suit: use router-aware, deception-aware protocols, and replicate Apple-style stress tests with fixed effort and latency budgets.</p>
<p>Reasoning AIs won’t replace standard LLMs everywhere, but they excel in high-stakes scenarios requiring transparency. As techniques mature, these models will become more stable and interpretable, merging pure reasoning with external tools and knowledge. Teams adopting these models should invest in robust pipelines and new evaluation metrics to realize the benefits of interpretable, verifiable solutions, a step forward for AI’s ability to explain not just what or when, but how and why.</p>
<h5 id="references">References</h5>
<span style="font-size:0.7em">
<ol>
<li><a
	
		href = "https://openai.com/index/introducing-gpt-5/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		OpenAI. Introducing GPT-5. Product overview and system card for GPT-5, including routed reasoning, effort/verbosity controls, and safety claims.
	</span>
</a></li>
<li><a
	
		href = "https://platform.openai.com/docs/guides/latest-model"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		OpenAI. GPT-5 for developers. API parameters (reasoning_effort, verbosity), preamble planning, and large context.
	</span>
</a></li>
<li><a
	
		href = "https://azure.microsoft.com/en-us/blog/gpt-5-in-azure-ai-foundry-the-future-of-ai-apps-and-agents-starts-here/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Microsoft Azure AI. GPT-5 in Azure AI Foundry. Routing, reasoning controls, enterprise guidance.
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2201.11903"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
	</span>
</a></li>
<li><a
	
		href = "https://machinelearning.apple.com/research/illusion-of-thinking"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		The Illusion of Thinking. Stress tests showing complexity collapse on algorithmic puzzles.
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2502.12521v1"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights
	</span>
</a></li>
<li><a
	
		href = "https://lilianweng.github.io/posts/2024-11-28-reward-hacking/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Reward Hacking in Reinforcement Learning.
	</span>
</a></li>
<li><a
	
		href = "https://www.anthropic.com/research/reasoning-models-dont-say-think"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Reasoning models don&rsquo;t always say what they think
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/pdf/2501.12948"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
	</span>
</a></li>
<li><a
	
		href = "https://qwenlm.github.io/blog/qwen3/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Qwen3: Think Deeper, Act Faster
	</span>
</a></li>
<li><a
	
		href = "https://www.microsoft.com/en-us/research/project/phi-4-reasoning/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Microsoft. Phi-4 Reasoning documentation and evaluations.
	</span>
</a></li>
<li><a
	
		href = "https://github.com/srush/awesome-o1"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Awesome o1 (curated papers). Collected research on o1/o3 and reasoning models.
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2411.15594"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		A Survey on LLM-as-a-Judge
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2504.17550"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		HalluLens: LLM Hallucination Benchmark
	</span>
</a></li>
<li><a
	
		href = "https://www.youtube.com/watch?v=CjVQJdIrDJ0"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Thinking, Fast and Slow | Daniel Kahneman | Talks at Google
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2502.06772"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		ReasonFlux: A Template-Driven Approach to Reasoning in LLMs
	</span>
</a></li>
<li><a
	
		href = "https://codeforces.com/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Codeforces: A Major Competitive-Programming Platform
	</span>
</a></li>
<li><a
	
		href = "https://huggingface.co/bespokelabs/Bespoke-Stratos-7B"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Bespoke Labs: Bespoke-Stratos-7B
	</span>
</a></li>
<li><a
	
		href = "https://huggingface.co/open-thoughts/OpenThinker3-7B"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Open Thoughts: OpenThinker3-7B
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2305.11206"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		LIMA: Less Is More for Alignment
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2311.13133"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2502.03387"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		LIMO: Less is More for Reasoning
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2406.04836"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Revisiting Catastrophic Forgetting in Large Language Model Tuning
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2301.12715"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Fine-Tuning Deteriorates General Textual Out-of-Distribution Detection by Distorting Task-Agnostic Features
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/1909.11299"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Mixout: Effective regularization to finetune large-scale pre-trained language models
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2203.11171"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Self-Consistency Improves Chain of Thought Reasoning in Language Models
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2310.01798"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Large Language Models Cannot Self-Correct Reasoning Yet
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2305.00633"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Self-Evaluation Guided Beam Search for Reasoning
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2305.20050"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Let&rsquo;s Verify Step by Step
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2408.00724"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2501.19393"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		s1: Simple test-time scaling
	</span>
</a></li>
<li><a
	
		href = "https://openreview.net/forum?id=Bw82hwg5Q3"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Self-Evaluation Guided Beam Search for Reasoning
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2211.10435"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		PAL: Program-aided Language Models
	</span>
</a></li>
<li><a
	
		href = "https://chain-of-code.github.io/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2408.03314"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2412.19437"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		DeepSeek-V3 Technical Report
	</span>
</a></li>
<li><a
	
		href = "https://lilianweng.github.io/posts/2025-05-01-thinking/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Weng, Lilian. Why We Think. (Test-time compute, branching vs. revision, PRMs, scaling laws.)
	</span>
</a></li>
<li><a
	
		href = "https://arxiv.org/abs/2504.16828"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		Process Reward Models That Think (ThinkPRM).
	</span>
</a></li>
<li><a
	
		href = "https://openai.com/index/learning-to-reason-with-llms/"
	

	

	
		target = "_blank"
		rel = "nofollow noopener noreferrer"
		>
	
	<span>
		OpenAI. Learning to reason with LLMs (o1).
	</span>
</a></li>
</ol>
</span>
]]></content:encoded>
    </item>
  </channel>
</rss>
