🏛️ How to build a Large Language Model from Scratch - Part 1

Wed, 24 Sep 2025 00:00:00 +0000

TL;DR

In this post, I show how to build a working LLM from scratch, with a complete end-to-end pipeline from data gathering through training to deployment. For this project, I concentrate on historical English as used in London, training on historical London texts (1500-1850). To show the flexibility, I built two language models that are identical in architecture; the only difference is their size (117M vs 354M parameters).

This guide shows you how to monitor training progression, perform rapid evaluations, test models from both PyTorch checkpoints and published Hugging Face repositories, and ultimately publish your own - supported by complete code, live model artifacts, and production-ready inference tooling.

4-Part Series:

Part 1 (this): Quick start, inference, and overview
Part 2: Data collection and custom tokenizers
Part 3: Model architecture and GPU training
Part 4: Evaluation and deployment

1. Overview

Train AI models on 1500-1850 London texts. Complete 4-part series covering data collection, training, and deployment. Part 1: Quick start and overview.

📖 Want to understand the core LLM concepts? This series focuses on implementation and hands-on building. For a deeper understanding of foundational concepts like tokenizers, prompt engineering, RAG, responsible AI, fine-tuning, and more, check out my book Generative AI in Action.

You can learn more about the book → by clicking here 📘.

1.1 What was built?

I found that many folks don’t understand what it takes to build an LLM. The guides that do exist tend to share piecemeal elements rather than anything comprehensive for someone new to this. There are more detailed guides on fine-tuning existing models, but not much on the complete development pipeline. This series fills that gap by walking through the process of creating specialized language models trained exclusively on historical London texts from 1500 to 1850.

I am mostly doing this for my own learning, and also sharing what I can. Many work-related details, for obvious reasons, I cannot share and discuss, but some small pet projects like this embody the same sentiment.

The helloLondon Historical Language Models represent a complete end-to-end implementation, from data collection through deployment. Rather than fine-tuning existing models, I chose to train from the ground up to eliminate modern biases and create models that genuinely understand historical language patterns, cultural contexts, and period-specific knowledge.

Two Model Variants: I built two identical models with the same architecture, tokenizer, and training process. The only difference is the number of parameters: an SLM (117M parameters) optimized for learning and resource-constrained environments, and a Regular model (354M parameters) designed for higher-quality generation.

Both use identical code with different configuration files, allowing you to understand the impact of model size on performance and choose the right variant for your needs.

Model	Parameters	Iterations	Training Time*	Use Case
SLM (Small)	117M	60,000	~8-12 hours	Fast inference, resource-constrained
Regular (Full)	354M	60,000	~28-32 hours	High-quality generation

Note: Technically speaking, both of these models can be classified as SLMs given that they are 117M and 354M parameters; however, for the sake of this project, I call the smaller of the two the SLM and the other Regular.
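To give a feel for where parameter counts of this size come from, here is a minimal sketch that estimates the size of a GPT-2-style decoder from its shape. The layer counts and hidden sizes are my assumptions (GPT-2 small and medium shapes), not values read from the repo's configuration files; only the 30,000-token vocabulary and the 512-token maximum length come from this post. The totals land in the same ballpark as 117M and 354M but will not match them exactly.

```python
def gpt_param_count(n_layer: int, d_model: int, vocab_size: int, context_len: int) -> int:
    """Approximate parameter count for a GPT-2-style decoder with tied embeddings."""
    # Per block: attention (4*d^2 weights) + MLP (8*d^2 weights),
    # plus biases (9*d) and two LayerNorms (4*d).
    per_block = 12 * d_model**2 + 13 * d_model
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d_model + context_len * d_model
    final_norm = 2 * d_model
    return n_layer * per_block + embeddings + final_norm


# Hypothetical shapes (assumptions, not the repo's configs).
small = gpt_param_count(n_layer=12, d_model=768, vocab_size=30_000, context_len=512)
medium = gpt_param_count(n_layer=24, d_model=1024, vocab_size=30_000, context_len=512)

print(f"small-like shape:  {small / 1e6:.1f}M parameters")
print(f"medium-like shape: {medium / 1e6:.1f}M parameters")
```

Most of the growth comes from the `12 * d_model**2` term per block, which is why doubling the depth and widening the hidden size roughly triples the model.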

1.2 Core Pipelines

The complete development pipeline encompasses multiple critical stages that transform raw historical texts into production-ready language models. The process starts with data collection, where we systematically gather and filter over 218 historical London sources spanning 1500–1850. This process ensures we capture authentic period language while minimizing modern biases that could contaminate our models.

Next, we develop a custom tokenization system specifically designed for historical English. This involves training a domain-specific tokenizer with a 30,000-token vocabulary plus 150+ special tokens that capture period language patterns, archaic spellings, and historical terminology that modern tokenizers often miss.

The model architecture phase implements GPT-style causal language models entirely from scratch, creating two variants with 117M and 354M parameters, respectively. Both models share identical architecture and training processes, allowing for direct comparison of performance versus computational requirements.

Our training infrastructure leverages modern multi-GPU training with Distributed Data Parallel (DDP), comprehensive checkpointing for restart resilience, and real-time monitoring through Weights & Biases. This ensures reliable training even across extended periods and hardware failures.

Evaluation goes beyond standard metrics to include historical accuracy probes, perplexity tracking, qualitative generation review, and early failure detection. We specifically test how well our models understand historical context, period-appropriate language, and London geography.

Finally, deployment includes publishing models to Hugging Face alongside unified local and cloud inference scripts, making the models immediately accessible to researchers and developers worldwide.

1.3 Hands-On Experience

Every aspect of this project is designed for practical implementation and learning. The working code covers every stage from data collection through tokenizer training, model training, evaluation, and publishing - all fully implemented and documented with clear instructions and examples.

Both models are already published on Hugging Face, so live models are immediately available: you can test the published checkpoints right away or retrain from scratch with a single command. This dual approach lets you either jump straight into experimentation or understand the complete development process.

The project works with real data: over 500 million characters of genuine historical English from 1500–1850, carefully filtered to minimize modern bias while preserving the rich linguistic patterns of the period.

Everything is production-ready with structured logging, comprehensive error handling, reproducible configurations, and automated publishing workflows. The codebase follows professional development practices, making it suitable for both learning and real-world deployment.

This series is structured to take you through the complete LLM development pipeline:

Part	Focus	Description
Part 1 (this post)	Quick start and end-to-end overview	Use published models, understand the complete pipeline, and get hands-on experience with working code and live models. The intent is that if you want to build this, you can follow the instructions and get a model in the end. If you want to understand more of the inner workings and details, then those will be covered in the subsequent blog posts.
Part 2	Data collection and custom tokenization	Deep dive into gathering 218+ historical sources, cleaning pipelines, and building specialized tokenizers for historical language patterns.
Part 3	Model architecture and training infrastructure	Technical implementation of custom GPT architectures, multi-GPU training, checkpointing, and performance optimization.
Part 4	Evaluation and deployment	Comprehensive testing frameworks, historical accuracy assessment, and production deployment to Hugging Face.

For this first part, you have two paths to choose from based on your goals and available time:

Option 1: Quick Start with Published Models - Jump straight into using the pre-trained models on Hugging Face for immediate testing and exploration. Perfect if you want to see results quickly and aren’t concerned with the technical implementation details.
Option 2: Build from Scratch - Dive deep into the complete codebase and build your own historical language model from the ground up. Ideal if you want to understand every aspect of the pipeline and learn how to create specialized LLMs.

Let us start with option 1 - use the models.

2. Use the models - Try it now using Hugging Face

If you want to get going and kick the tires, the models are live on Hugging Face and ready to use.

SLM Model (117M parameters): 💡 https://huggingface.co/bahree/london-historical-slm
Regular Model (354M parameters): 💡 https://huggingface.co/bahree/london-historical-llm

In addition, you can also explore the complete codebase and build your own historical language model from scratch. The entire pipeline is documented with working code, training scripts, and deployment guides, and is available on GitHub:

GitHub Repo 💻: ⚙️ github.com/bahree/helloLondon

If you want to quickly test the published models on Hugging Face (HF), you can do so in two ways: quick automated tests or interactive mode. This is the easiest way to get started and show that the models are fully working. You can either clone the repo and run the scripts or use the Python code snippet below.

If you don’t have a development environment set up, you can follow the instructions in the GitHub repo to set up a conda environment with all dependencies. And just for the local testing, you can use CPU only, but for interactive mode, a GPU is recommended. Finally, you will need at a minimum the following Python packages. Note, these are also called out on the Hugging Face model page.

python -m pip install -U pip setuptools wheel
python -m pip install "transformers[torch]" accelerate safetensors

Note: It is recommended to use a virtual environment or conda environment to avoid dependency conflicts. See the GitHub repo for complete setup instructions.

If you don’t have the code repo yet, you can run the following commands directly and run inference from Hugging Face.

Python Code:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the published SLM model
model_name = "bahree/london-historical-slm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate historical text
prompt = "In the year of our Lord 1834, I walked through the streets of London and witnessed"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    inputs["input_ids"],
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.2
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
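If the sampling arguments in that call are unfamiliar, the following pure-Python sketch shows conceptually what temperature, top_k, and top_p do to a next-token distribution. It is an illustration on a made-up four-token vocabulary, not the transformers implementation, and it leaves out repetition_penalty.

```python
import math


def filter_distribution(logits, temperature=0.8, top_k=40, top_p=0.95):
    """Return renormalized token probabilities after temperature, top-k, and top-p."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}  # stable softmax

    # Top-k: keep only the k most likely tokens, then renormalize.
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(weight for _, weight in ranked)
    ranked = [(tok, weight / total) for tok, weight in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


# Toy logits for illustration only.
toy_logits = {"the": 3.0, "a": 2.0, "streets": 1.0, "zebra": -4.0}
print(filter_distribution(toy_logits, temperature=0.8, top_k=3, top_p=0.95))
```

The model then samples the next token from whatever survives the filtering, which is why lower top_p or top_k values give more conservative text.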

2.1 Local Testing with the Complete Codebase

Now that you’ve seen the models work directly from Hugging Face, let’s explore the complete development experience by working with the actual codebase. This section walks you through testing the models locally using the same infrastructure that was used to train them.

The helloLondon repository contains everything needed to reproduce the entire pipeline - from data collection through model deployment. By running these tests locally, you’ll get hands-on experience with the production-ready inference scripts and understand how the models integrate with the broader development workflow.

The following examples assume you’ve cloned the repository and are running from the root directory. All scripts are designed to work out of the box with the published models, giving you immediate access to the same testing infrastructure used during development.

# Test SLM model (117M parameters)
python 06_inference/test_published_models.py --model_type slm

# Test Regular model (354M parameters)  
python 06_inference/test_published_models.py --model_type regular

There is also an interactive mode where you can type in your own prompts and see the model generate text.

Interactive Testing:

# SLM model - Interactive mode
python 06_inference/inference_unified.py --published --model_type slm --interactive

# Regular model - Interactive mode
python 06_inference/inference_unified.py --published --model_type regular --interactive

# Single prompt testing
python 06_inference/inference_unified.py --published --model_type slm --prompt "In the year 1834, I walked through the streets of London and witnessed"
python 06_inference/inference_unified.py --published --model_type regular --prompt "In the year 1834, I walked through the streets of London and witnessed"

If everything works, you should see output similar to the following for the SLM model:

Example Output (Hugging Face SLM Example):

🧪 Testing SLM Model: bahree/london-historical-slm
============================================================
📂 Loading model...
✅ Model loaded in 8.91 seconds
📊 Model Info:
   Type: SLM
   Description: Small Language Model (117M parameters)
   Device: cuda
   Vocabulary size: 30,000
   Max length: 512

--- Test 1/10 ---
Prompt: In the year 1834, I walked through the streets of London and witnessed
Generated: a scene in which some of those who did not incline to come in contact with him took part in his discourse. It was on this occasion that I perceived that he had been engaged in some new business connected with the house, but for some days it had not taken place, nor did he appear so desirous of pursuing any further display of interest .....
Time: 5.75s

Notice how the model captures:

Period-appropriate language (“thank ’ee kindly,” “bade me go,” “spectacles”)
Historical dialogue patterns (formal speech, period-appropriate contractions)
Historical context (West Indies, poor rates, needle work, pocket-book)
Authentic historical narrative (detailed scene setting, period-appropriate social interactions)

Now that we have tried using the model, let’s explore option 2 and see how we can build it. Once you’ve built your own model, you’ll be able to test it using the checkpoints saved during training - see section 7.4 for detailed checkpoint testing instructions.

3. Build the models - From Scratch

Building a language model from scratch is both an art and a science - requiring careful orchestration of data, architecture, and training to create something that can genuinely understand and generate historical text. Training from scratch gives us complete control over every aspect of the model’s knowledge and behavior.

The journey from raw historical documents to a working language model involves six critical phases, each building upon the previous one. The flowchart below illustrates this complete end-to-end pipeline, showing how we transform 218+ historical sources into two specialized models that can generate authentic historical London text.

graph TD
    A["📚 Historical Data Collection<br/>218+ sources, 1500-1850"] --> B["🧹 Data Cleaning & Processing<br/>Text normalization, filtering"]
    B --> C["🔤 Custom Tokenizer Training<br/>30k vocab + 150+ special tokens"]
    C --> D["🏋️ Model Training<br/>Two Identical Models<br/>SLM: 117M / Regular: 354M"]
    D --> E["📊 Evaluation & Testing<br/>Historical accuracy, ROUGE, MMLU"]
    E --> F["🚀 Deployment<br/>Hugging Face + Local Inference"]

    G["📖 Building a Custom LLM"] --> A

    F --> L["🎯 Use Cases<br/>Historical text generation<br/>Educational projects<br/>Research applications"]

    style A fill:#e1f5fe
    style D fill:#f3e5f5
    style F fill:#e8f5e8
    style G fill:#fff3e0

Now that we have a bird’s eye view of the complete pipeline, let us get into the details and build the model from scratch. I am going to walk you through the complete process step-by-step.

I am also going to assume you have a basic understanding of Python, PyTorch, and command-line operations, and a reasonably recent development setup, including a relatively modern GPU (NVIDIA RTX 3060 or better recommended). For the sake of simplicity, I will show commands for Linux/macOS, but Windows users can easily adapt them.

Again, as a reminder, the ⚙️ GitHub repo has all the code and instructions you need to get started. You can clone the repo and follow along.

4. Environment and Configuration Setup

The foundation of any successful machine learning project lies in proper environment setup and configuration. This step involves creating a virtual environment, installing dependencies, and configuring the project structure. Understanding the key configuration files, directory organization, and overall project architecture is crucial - these elements form the backbone of the entire training process. Taking time to get this right upfront prevents countless headaches and debugging sessions later, ensuring smooth execution through all subsequent phases.

Key Configuration Files:

config.py: Central configuration system (paths, training settings, tokenizer config)
01_environment/setup_environment.py: Environment setup script (reads from config.py)
requirements.txt: Python dependencies (auto-generated by setup script)

Important Directories (Created by Setup):

helloLondon/: Virtual environment directory
data/london_historical/: Historical text data storage
09_models/checkpoints/: Model checkpoints during training
09_models/tokenizers/: Custom tokenizer storage
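Once setup has run, a quick way to confirm that these directories exist is a few lines of Python. This is a small sketch of my own rather than a script from the repo; the directory names are the ones listed above, and the helper name is hypothetical.

```python
from pathlib import Path

# Directories named in the list above, relative to the repository root.
EXPECTED_DIRS = [
    "helloLondon",
    "data/london_historical",
    "09_models/checkpoints",
    "09_models/tokenizers",
]


def missing_directories(project_root: str = ".") -> list[str]:
    """Return the expected directories that do not exist under project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


missing = missing_directories()
print("All expected directories found." if not missing else f"Missing: {missing}")
```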

Now that we have that out of the way, let us run the setup commands as shown in the listing below. This will clone the repo, set up the environment, and install all dependencies. For this to work, you need git, python, and python3-venv already installed. If you don’t have these, please install them first.

PS: See the Training QuickStart guide in the GitHub repo for more details.

# Clone and setup environment
git clone https://github.com/bahree/helloLondon/
cd helloLondon
python 01_environment/setup_environment.py
source activate_env.sh

As you run the setup script, you should see output similar to the images shown below; the script will create a virtual environment, install dependencies, and set up necessary directories. And then you can activate the environment using the source activate_env.sh command.

Environment setup - 1/3

Environment setup - 2/3

Environment setup - 3/3

When you activate the environment using source activate_env.sh, you will see the environment name in the console.

The default environment name is helloLondon. If you want to change it to something else, modify the venv_name field in environment_config.json before running the setup script. This will create a virtual environment with your preferred name.

Now that the configuration and environment are set up, we can validate them by running the following command. This checks that everything is working and that the necessary dependencies are installed.

python3 -c "
from config import config
print('🔧 Configuration Overview')
print('=' * 50)
print(f'Project Root: {config.project_root}')
print(f'Data Directory: {config.london_historical_data}')
print(f'Tokenizer Directory: {config.london_tokenizer_dir}')
print(f'Checkpoints Directory: {config.checkpoints_dir}')
print(f'Virtual Environment: {config.project_root}/helloLondon')
print(f'Vocabulary Size: {config.tokenizer_config[\"vocab_size\"]:,} tokens')
print(f'Special Tokens: {len(config.tokenizer_config[\"special_tokens\"])} tokens')
print(f'SLM Model: {config.slm_config[\"model_name\"]}')
print(f'Training Epochs: {config.slm_config[\"num_epochs\"]}')
print(f'Batch Size: {config.slm_config[\"batch_size\"]}')
print(f'Max Length: {config.slm_config[\"max_length\"]}')
print('\\n🎯 Configuration looks good!')
"
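If you do decide to rename the environment, the venv_name change mentioned earlier can be scripted before running the setup script. The sketch below assumes that venv_name is a top-level key in environment_config.json, which I have not confirmed here, so check the file's actual layout first; the function name is hypothetical.

```python
import json
from pathlib import Path


def set_venv_name(config_path: str, new_name: str) -> dict:
    """Rewrite the venv_name field in a JSON config and return the updated mapping."""
    path = Path(config_path)
    settings = json.loads(path.read_text(encoding="utf-8"))
    settings["venv_name"] = new_name  # assumption: venv_name is a top-level key
    path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings


# Example usage (run before 01_environment/setup_environment.py):
# set_venv_name("environment_config.json", "london_env")
```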

The following directory structure will be generated after executing the setup script. Please note that certain directories will remain empty until the data collection and training processes are initiated.

helloLondon/
├── 📁 data/london_historical/          # Historical text data
│   ├── 📄 london_historical_corpus_comprehensive.txt  # Final training corpus
│   ├── 📁 downloads/                   # Raw downloaded data
│   ├── 📁 processed/                   # Cleaned and processed text
│   └── 📁 metadata/                    # Data collection metadata
├── 📁 09_models/
│   ├── 📁 checkpoints/slm/             # Model checkpoints during training
│   │   ├── 📁 checkpoint-500/          # Checkpoint every 500 steps
│   │   ├── 📁 checkpoint-1000/
│   │   └── 📁 pretokenized_data/       # Pre-tokenized data (performance boost)
│   └── 📁 tokenizers/london_historical_tokenizer/  # Custom tokenizer
│       ├── 📄 tokenizer.json           # Tokenizer configuration
│       ├── 📄 vocab.json               # Vocabulary mapping
│       └── 📄 merges.txt               # BPE merge rules
├── 📁 helloLondon/                     # Virtual environment
└── 📁 logs/                            # Training logs and WandB data

Prerequisites: Before proceeding with the following steps, please verify the following requirements:

Storage: Minimum 20GB of free disk space and stable internet connectivity for data acquisition

Hardware: GPU with 8GB+ VRAM for SLM training, 16GB+ VRAM for Regular model training. Cloud users should select appropriate instance types

Experiment Tracking (Optional but highly recommended): Weights & Biases account with WANDB_API_KEY environment variable configured for comprehensive training monitoring

Dependencies: Required data processing libraries (nltk, beautifulsoup4, etc.) will be automatically installed via the setup script
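Two of these prerequisites are easy to check with the standard library alone. The sketch below is my own helper, not part of the repo: it checks free disk space against the 20GB minimum and whether WANDB_API_KEY is set. A GPU memory check would need PyTorch, so it is left out here.

```python
import os
import shutil


def check_prerequisites(path: str = ".", min_free_gb: float = 20.0) -> list[str]:
    """Return human-readable warnings for unmet prerequisites (empty list means OK)."""
    warnings = []
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        warnings.append(f"Only {free_gb:.1f} GB free; {min_free_gb:.0f} GB recommended.")
    if not os.environ.get("WANDB_API_KEY"):
        warnings.append("WANDB_API_KEY is not set; experiment tracking will be unavailable.")
    return warnings


for message in check_prerequisites():
    print("WARNING:", message)
```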

5. Data Collection

The foundation of any language model lies in its training data. For our historical London models, we’ve built a comprehensive data collection system that sources authentic text from 218+ historical sources spanning 1500-1850 - a remarkable 350-year window of London’s linguistic evolution. This isn’t just about downloading files; it’s about curating a high-quality corpus that captures the authentic voice of historical London.

Our data collection pipeline automatically processes multiple formats (PDFs, HTML, XML, plain text) from diverse sources, including Project Gutenberg classics, Old Bailey trial records, London Lives manuscripts, and British History Online archives. The system includes sophisticated quality control measures: language detection to filter non-English content, OCR artifact correction, duplicate detection, and historical period validation to ensure every text genuinely represents the target era.

The result? A curated corpus of 500M+ characters of authentic historical English text, ready to train models that understand not just the words, but the cultural context, social dynamics, and linguistic patterns of 16th- to 19th-century London. Of course, you can always add your own data sources if you have them, and the system is designed to be extensible.

We can kick off the data collection process using the command below. This will be run from the project root directory.

# Download historical data with advanced filtering
python 02_data_collection/historical_data_collector.py --max_sources 100

# The system automatically filters:
# - Non-English content (Arabic, Chinese, etc.)
# - Poor OCR quality scans and gibberish
# - Advertisement-heavy commercial content  
# - Duplicate content and empty files
# - Special handling for Project Gutenberg classics
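To make a couple of those filters concrete, here is a simplified sketch of duplicate and junk filtering. It is not the collector's actual logic: the 0.6 threshold, the function names, and the sample sentences are all made up for illustration, and the real pipeline also handles language detection and period validation, which are omitted here.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def looks_like_gibberish(text: str, min_alpha_ratio: float = 0.6) -> bool:
    """Crude OCR-quality heuristic: too few letters and spaces suggests a bad scan."""
    if not text.strip():
        return True
    readable = sum(ch.isalpha() or ch.isspace() for ch in text)
    return readable / len(text) < min_alpha_ratio


def filter_documents(documents: list[str]) -> list[str]:
    """Drop empty, gibberish, and duplicate documents, keeping first occurrences."""
    seen, kept = set(), []
    for doc in documents:
        if looks_like_gibberish(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The prisoner was indicted for stealing a silver watch.",
    "the prisoner   was indicted for stealing a silver watch.",  # duplicate after normalizing
    "~~#@ 1l|| ]]]] 0O0 ^^^ ;;;; ////",  # OCR debris
    "",  # empty file
]
print(filter_documents(docs))
```

Hashing the normalized text catches exact and near-exact copies cheaply; fuzzier duplicates would need something like shingling, which is beyond this sketch.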

This process may take some time, depending on your internet speed, the number of sources you choose to download, and your system’s performance. For me, on a very fast internet connection and a powerful machine, downloading and processing the full dataset typically took 2-4 hours. The script saves the cleaned and processed data in the data/london_historical/ directory, creating a comprehensive historical corpus.

The data collection process creates a comprehensive historical corpus with the main training file london_historical_corpus_comprehensive.txt containing 270M+ characters (~258MB) of authentic historical text. The complete data directory spans approximately 1.2GB, including 521MB of raw downloaded sources, 263MB of processed and cleaned content, and 126MB of tokenized training sequences ready for model training. The image below shows the data collection in progress.

Data Collection in Progress

The final corpus represents one of the largest collections of historical London text ever assembled for language model training, with authentic content spanning 350 years of linguistic evolution. The two images below come from one of my runs: the first shows the final output of the data cleaning with summary statistics, and the second shows the size of the data on disk.

Data Collection Summary

The total size on disk at the end of data collection. Note this does not include the Old Bailey and London Lives data.

Total Data Size

Now that we have our data and have cleaned it, let us build a custom tokenizer.

6. Train Custom Tokenizer

With our cleaned historical corpus ready, we now need to create a custom tokenizer specifically designed for historical English. Standard tokenizers like GPT-2 are optimized for modern text and fail catastrophically with historical language - treating archaic words like “quoth” and “hast” as multiple subword fragments, losing both meaning and efficiency.

Our custom tokenizer uses Byte Pair Encoding (BPE) with a 30,000 vocabulary size and 150+ carefully designed special tokens that understand:

Historical Language: Archaic pronouns (<|thou|>, <|thee|>), verbs (<|hast|>, <|doth|>), and expressions (<|verily|>, <|forsooth|>)
London Geography: Landmarks (<|thames|>, <|newgate|>, <|tower|>), streets (<|cheapside|>, <|fleet|>), and districts (<|southwark|>, <|westminster|>)
Historical Context: Period markers (<|tudor|>, <|stuart|>, <|georgian|>), social classes (<|noble|>, <|commoner|>), and professions (<|apothecary|>, <|coachman|>)

This specialized vocabulary ensures that common historical terms remain as single tokens rather than being fragmented, dramatically improving both training efficiency and text generation quality. We can kick off the tokenizer using the command below. Again, this will be run from the project root directory.

# Train historical tokenizer (30k vocabulary)
python 03_tokenizer/train_historical_tokenizer.py

The training process analyzes our 270M+ character corpus to learn optimal token boundaries, creating a tokenizer that understands the linguistic patterns of 1500-1850 English. The result is a highly efficient tokenizer with a compression ratio of ~0.3 tokens per character and 99%+ reconstruction accuracy - essential for training models that can generate authentic historical text.
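To show what "learning token boundaries" means in practice, here is a toy Byte Pair Encoding learner in pure Python. It is a teaching sketch on a made-up word list, not the code in 03_tokenizer/train_historical_tokenizer.py, and it skips special tokens, byte-level handling, and everything else a production tokenizer needs.

```python
from collections import Counter


def apply_merge(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words: list, num_merges: int) -> list:
    """Learn BPE merges by repeatedly fusing the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        updated = Counter()
        for symbols, freq in corpus.items():
            updated[apply_merge(symbols, best)] += freq
        corpus = updated
    return merges


def encode(word: str, merges: list) -> tuple:
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Toy corpus: frequent archaic words end up as single tokens.
toy_words = ["hast", "hast", "hath", "hath", "thou", "thee", "quoth"] * 10
merges = learn_bpe_merges(toy_words, num_merges=12)
tokens = encode("hast", merges)
print(tokens, f"-> {len(tokens) / len('hast'):.2f} tokens per character")
```

The tokens-per-character figure printed at the end is the same kind of compression ratio quoted above, just measured on one word instead of the whole corpus.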

Once the training is finished (and usually it is pretty quick - just a few minutes for our data size), we run a quick sanity test as the image below shows.

Custom Tokenizer Training

Note that in testing, we might see a warning that the reconstruction differs; this is only because the letter case differs, and it is expected. You can ignore this. An example of this is shown below.

Tokenizer reconstruction warning

Why the “Reconstruction differs” warning is actually beneficial:

The reconstruction differences you see are not errors - they’re the tokenizer working exactly as designed for optimal language model training. The tokenizer uses Byte Pair Encoding (BPE), which breaks complex words into smaller, reusable subword units (like “Bourgh” → “bour ##gh”), and normalizes text to lowercase to reduce vocabulary size. These “differences” are actually features that make the tokenizer more efficient and the resulting language model more capable of generating authentic historical text.
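The lowercase part of this is easy to reproduce. In the sketch below, a toy whitespace tokenizer stands in for the real one, and I assume lowercasing is the only normalization step; the round trip then fails an exact comparison but passes a case-insensitive one, which is what the warning reports.

```python
def normalize(text: str) -> str:
    """Stand-in for the tokenizer's normalizer: lowercase only (an assumption)."""
    return text.lower()


def round_trip(text: str) -> str:
    """Encode then decode with a toy whitespace tokenizer that normalizes first."""
    tokens = normalize(text).split()
    return " ".join(tokens)


original = "Lady Catherine de Bourgh"
restored = round_trip(original)
print("exact match:         ", restored == original)
print("match ignoring case: ", restored == original.lower())
```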

📖 For detailed technical explanation: Part 2 of this series covers the complete tokenizer architecture, BPE implementation, special token design, and why these reconstruction differences are essential for optimal language model training.

Now that we have our data and the tokenizer is ready, it is time to train the model.

7. Train the Model

With our cleaned historical corpus and custom tokenizer in place, we can now train our language models. The training system is designed to build two identical models with different parameter counts, allowing you to choose between speed (SLM) and quality (Regular model) based on your needs.

Training Architecture: Both models use a custom GPT architecture specifically optimized for historical text, featuring sophisticated attention mechanisms that understand the complex relationships in historical language. The system includes automatic GPU detection, multi-GPU support, and comprehensive monitoring to ensure optimal training performance.

Training Process: The training system implements modern optimization techniques, including dynamic learning rate scheduling, automatic checkpointing, and real-time experiment tracking via WandB. The entire process is automated with intelligent configuration that adapts to your hardware setup, whether you’re using a single GPU or multiple GPUs for distributed training.

Performance Optimization: The system includes precision optimization (TF32, AMP) and memory management specifically tuned for historical text processing. Training typically takes 7-8 hours for the SLM and 28-32 hours for the Regular model on modern hardware, with comprehensive monitoring to track progress and identify any issues. Note, this time can vary significantly based on your hardware. The times mentioned here are based on dual NVIDIA A30s.
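As an illustration of what dynamic learning rate scheduling can look like, here is a common warmup-plus-cosine-decay schedule. I am assuming this shape rather than quoting the training script. The 60,000 iterations come from this post, and the 3e-5 floor matches the final train/lr value in the run summary in section 7.1, but the peak rate of 3e-4 and the 2,000 warmup steps are hypothetical.

```python
import math


def learning_rate(step: int, max_lr: float, min_lr: float,
                  warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return min_lr + cosine * (max_lr - min_lr)


# Hypothetical peak rate and warmup length; see the assumptions above.
for step in (0, 2_000, 30_000, 60_000):
    print(step, f"{learning_rate(step, 3e-4, 3e-5, 2_000, 60_000):.2e}")
```

The warmup avoids large, noisy updates while the optimizer statistics are still settling, and the slow decay lets the model make finer adjustments late in training.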

📖 For detailed technical implementation: Part 3 of this series covers the complete model architecture, GPU configuration, training infrastructure, and performance optimization strategies in detail.
🧪 Ready to test your checkpoints? Once training completes, see section 7.4 for comprehensive instructions on testing your trained model checkpoints.

7.1 SLM Training

To kick off the training, the code is quite simple, as shown below. Again, this would be from the project root folder. In my case, I am using torchrun --nproc_per_node=2 because I have dual GPUs and I want to use both. If you only have a single GPU, you can just run the automatic GPU detection script. The train_model_slm.py script specifically trains the SLM (Small Language Model) with 117M parameters.

Option A: Train SLM (117M parameters) - Faster, Good for Testing

# Clean any existing tokenized data (run from the project root)
rm -rf data/london_historical/tokenized_data/

# Either: Automatic GPU Detection (Recommended)
cd 04_training
./launch_slm_training.sh

# Or: Manual Multi-GPU training (run from the project root, not from 04_training)
torchrun --nproc_per_node=2 04_training/train_model_slm.py --data_dir data/london_historical

Note: The first line rm -rf data/london_historical/tokenized_data/ cleans any existing tokenized data to ensure a fresh start. This is important because the training system caches tokenized data for efficiency, and we want to ensure it uses the latest corpus and tokenizer settings rather than potentially outdated cached data. You only need to do this if the corpus or tokenizer has changed since the cache was created.

Once the training starts, you will see a similar output to the one shown below.

Starting model training

Note the Tokenizing corpus line - this will take some time, depending on your data size and hardware. The tokenized data will be saved in data/london_historical/tokenized_data/ for future runs so that subsequent training runs will be much faster. If you want to force re-tokenization, you can delete this directory and restart the training. And if you think this is hung, you can check the GPU usage using nvtop in a separate terminal.
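If you are unsure whether the cached tokenized data is out of date, one simple rule is to compare file modification times. The helper below is a sketch of that idea and not something the training script does, as far as I have described here; the paths in the commented example are the ones used in this post.

```python
from pathlib import Path


def cache_is_stale(cache_dir: str, *source_files: str) -> bool:
    """True if the cache is missing or empty, or older than any source file."""
    cache = Path(cache_dir)
    cached = [p for p in cache.rglob("*") if p.is_file()] if cache.is_dir() else []
    if not cached:
        return True
    oldest_cached = min(p.stat().st_mtime for p in cached)
    newest_source = max(Path(f).stat().st_mtime for f in source_files)
    return newest_source > oldest_cached


# Example usage from the project root (the staleness rule is an illustration):
# cache_is_stale(
#     "data/london_historical/tokenized_data",
#     "data/london_historical/london_historical_corpus_comprehensive.txt",
#     "09_models/tokenizers/london_historical_tokenizer/tokenizer.json",
# )
```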

And if you have configured WandB as recommended earlier, you can log in to that dashboard and monitor the training progress there. This is quite handy when you are away from the machine and want to see how training is progressing.

WandB Training progress

WandB also provides valuable insights into your model’s training performance through comprehensive visualizations. The dashboard shows the complete training journey, revealing how your model’s loss decreased over time, whether the training plateaued, and how efficiently your hardware was utilized. These visualizations help you understand not just the final results, but the entire learning process - identifying if the model continued improving throughout training or if it reached a performance plateau.

While these metrics are incredibly useful for optimizing your training process, we’ll dive deeper into interpreting these results and fine-tuning your training strategy in Part 3 of this series.

SLM Results (117M parameters):

wandb: Run history:
wandb:       eval/iter   ▂▂▃▃▄▄▅▅▆▆▇▇██
wandb: eval/train_loss  ███▇▇▇▇▇▇▇▇▇▇▇▇
wandb:   eval/val_loss  ███████▇▇▇█▇▇▇▇
wandb:    eval/val_ppl  █▇▇▇▇▇▆▆▆▆▆▆▆▆▆
wandb:     train/dt_ms           █            █                
wandb:      train/iter      ▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇██
wandb:      train/loss ▆▅▇▅▅▃▇▄▄█▅▄▅▄▃▇▄▄▅ ▃▃▂▄▅▂▅▂▄▅▃▃▄▅ ▄▃
wandb:        train/lr ██████████▇▇▇▅▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂           
wandb:       train/mfu ▃▄▇▇█▄▄▆▆▇▅▂▅▆▆▇▇▂▄▅▇▇▇▆▆▇▇▇▇▅███▅▇▆▇▇ ▇

wandb: Run summary:
wandb:       eval/iter 60000
wandb: eval/train_loss 2.74369
wandb:   eval/val_loss 3.44089
wandb:    eval/val_ppl 31.21462
wandb:     train/dt_ms 10217.92054
wandb:      train/iter 60000
wandb:      train/loss 2.87667
wandb:        train/lr 3e-05
wandb:       train/mfu 7.50594
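The eval/val_ppl and eval/val_loss values in the summary are related: perplexity is conventionally the exponential of the mean cross-entropy loss. The short check below assumes the training script computes it that way (the logged numbers are consistent with this) and uses only the values copied from the run summary.

```python
import math

# Values copied from the SLM run summary above.
val_loss = 3.44089
reported_ppl = 31.21462

# Perplexity = exp(mean cross-entropy loss in nats).
ppl = math.exp(val_loss)

print(f"exp({val_loss}) = {ppl:.2f} (reported: {reported_ppl:.2f})")
assert math.isclose(ppl, reported_ppl, rel_tol=1e-3)
```

Read this way, a validation perplexity of about 31 means the model is, on average, roughly as uncertain as if it were choosing uniformly among 31 tokens at each step.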

It’s also helpful to monitor GPU usage during training. I recommend using nvtop (a GPU monitoring tool similar to htop but for NVIDIA GPUs) in a separate terminal to track memory usage, temperature, and utilization in real-time. The screenshot below shows the GPU monitoring during model training.

GPU monitoring using nvtop

7.2 Understanding Checkpoints

Throughout training, the system automatically saves checkpoints - snapshots of your model’s current state, including all learned parameters, optimizer state, and training progress. These checkpoints serve as safety nets, allowing you to resume training if interrupted, and provide multiple model versions to choose from. The final checkpoint (typically saved at the end of training) represents your fully trained model, ready for inference and deployment.

Checkpoints are saved in the 09_models/checkpoints/ directory, with separate subdirectories for each model type. SLM checkpoints are stored in 09_models/checkpoints/slm/ (e.g., checkpoint-4000.pt, checkpoint-8000.pt), while regular model checkpoints are saved directly in 09_models/checkpoints/ (e.g., checkpoint-60001.pt, checkpoint-120000.pt). The checkpoint filenames include the training step number, making it easy to identify the training progress and select the best-performing version for your needs.
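Because the step number is part of the filename, picking the most recent checkpoint is a matter of parsing and sorting. Here is a minimal sketch, assuming the checkpoint-<step>.pt naming shown above; list_checkpoints and latest_checkpoint are illustrative helper names, not functions from the repository.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Matches names such as checkpoint-4000.pt and captures the step number.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)\.pt$")


def list_checkpoints(directory: Path) -> List[Tuple[int, Path]]:
    """Return (step, path) pairs for checkpoint files directly inside directory."""
    found = []
    for path in directory.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match and path.is_file():
            found.append((int(match.group(1)), path))
    # Sort numerically by step; a plain string sort would put 8000 after 60001.
    return sorted(found, key=lambda item: item[0])


def latest_checkpoint(directory: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None if there are none."""
    checkpoints = list_checkpoints(directory)
    return checkpoints[-1][1] if checkpoints else None
```

The search is deliberately non-recursive, so pointing it at 09_models/checkpoints/ does not pick up SLM checkpoints from the slm/ subdirectory, and vice versa.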

These checkpoints enable two capabilities that improve your training workflow. First, you can test your model’s current performance at any point by running inference on intermediate checkpoints, without waiting for training to complete. Second, if training is interrupted by power loss, a system crash, or a manual stop, you can resume from the last saved checkpoint rather than starting over, saving both time and computational resources. This flexibility is particularly valuable for long training runs, where it lets you experiment with different model versions and recover from unexpected interruptions.

🧪 Ready to test your checkpoints? See section 7.4 for detailed instructions on testing your trained model checkpoints.

7.3 Regular Model Training

The Regular model training follows the same process as the SLM, using identical training infrastructure but with different configuration settings. The only differences are the training script (train_model.py instead of train_model_slm.py) and the model architecture parameters (354M parameters vs 117M).

# Clean any existing tokenized data
rm -rf data/london_historical/tokenized_data/

# Automatic GPU Detection (Recommended)
cd 04_training
./launch_training.sh

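# Manual Multi-GPU training (alternative to the launcher; run from the repository root, since the paths below are relative to it)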
torchrun --nproc_per_node=2 04_training/train_model.py --data_dir data/london_historical

Key Differences from SLM:

Training script: train_model.py (instead of train_model_slm.py)
Model size: 354M parameters (vs 117M for SLM)
Training time: 28-32 hours (vs 7-8 hours for SLM)
Memory usage: Higher VRAM requirements
Performance: Better text quality, slower inference

The training infrastructure, checkpointing, WandB integration, and all other features remain identical. The system automatically detects the model type and applies the appropriate configuration from config.py.

Regular Model Results (354M parameters):

wandb: Run history:
wandb:       eval/iter     ▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: eval/train_loss  █████████▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▆▇▆▆▆▆▆▆▆▆▆
wandb:   eval/val_loss  ███████████████████████████████████▇███
wandb:    eval/val_ppl  ████▇▇█▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▇▇▆▇▇▆▆▆▆▆▆▆▆
wandb:     train/dt_ms                  █                      
wandb:      train/iter      ▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇███
wandb:      train/loss ▇▆▆▇▇▅▅█▅▄▃▅▅▅▄▇▄▄▄▄▄▃▃▃▅▂▄▅▂▅▂▄▅▃▃▄▅ ▄▃
wandb:        train/lr ▄██████▇▇▇▇▇▆▆▆▅▅▄▄▄▄▄▄▄▄▃▃▂▂▂          
wandb:       train/mfu ▆▇█▅▄ ▄▆▆▆▇▃▃▂▂▆█▃▃▅▅▃█▅▄▆▇▇▇▇▄▅▃█▆▇█▄▃█

wandb: Run summary:
wandb:       eval/iter 60000
wandb: eval/train_loss 2.70315
wandb:   eval/val_loss 3.61921
wandb:    eval/val_ppl 37.30823
wandb:     train/dt_ms 24681.64754
wandb:      train/iter 60000
wandb:      train/loss 2.70629
wandb:        train/lr 0.0
wandb:       train/mfu 7.20423
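A quick way to compare the two runs is the gap between validation and training loss at the final evaluation. The sketch below uses only the eval/train_loss and eval/val_loss values from the two run summaries; the helper name generalization_gap is mine, and the gap is a rough signal rather than a full evaluation.

```python
# Final evaluation losses copied from the two run summaries in this post.
runs = {
    "SLM (117M)": {"train_loss": 2.74369, "val_loss": 3.44089},
    "Regular (354M)": {"train_loss": 2.70315, "val_loss": 3.61921},
}


def generalization_gap(run: dict) -> float:
    """Validation loss minus training loss; larger values suggest more overfitting."""
    return run["val_loss"] - run["train_loss"]


gaps = {name: generalization_gap(run) for name, run in runs.items()}
for name, gap in gaps.items():
    print(f"{name}: val_loss - train_loss = {gap:.3f}")
```

For these runs the gap is about 0.70 for the SLM and about 0.92 for the Regular model. Part 3 covers how to interpret numbers like these in more depth.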

7.4 Testing Your Checkpoints

Once training is complete, you can immediately test your model using the checkpoints saved during training. This is one of the most exciting parts - seeing the model generate historical text for the first time! The PyTorch checkpoint approach needs no conversion, so you can test any checkpoint to monitor training progress. Each checkpoint preserves the complete model state, including training metadata and optimizer state, although only the model weights are needed for inference.

Direct PyTorch Checkpoint Testing: Test your model directly from the training checkpoints without any conversion:

# Test SLM checkpoint (117M parameters)
python 06_inference/inference_pytorch.py \
  --checkpoint 09_models/checkpoints/slm/checkpoint-4000.pt \
  --prompt "In the year 1834, I walked through the streets of London and witnessed"

# Test Regular model checkpoint (354M parameters)  
python 06_inference/inference_pytorch.py \
  --checkpoint 09_models/checkpoints/checkpoint-60001.pt \
  --prompt "In the year 1834, I walked through the streets of London and witnessed"

Expected Output: Your trained model will generate historical-style text similar to the following (the exact wording varies between runs and checkpoints):

“In the year 1834, I walked through the streets of London and witnessed the most extraordinary sight. The Thames flowed dark beneath London Bridge, whilst carriages rattled upon the cobblestones with great urgency. Merchants called their wares from Cheapside to Billingsgate, and the smoke from countless chimneys did obscure the morning sun.”

Testing Different Checkpoints: You can test any checkpoint from your training run to see how the model improved over time. Try testing checkpoints from different training stages to observe the learning progression - early checkpoints will generate more random text, while later checkpoints will produce increasingly coherent historical language.
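To compare checkpoints side by side, it helps to run the same prompt against each one. The sketch below only builds the command lines, using the script path and the --checkpoint and --prompt flags shown above; build_command is an illustrative helper, and the two checkpoint paths are the example filenames from section 7.2, so substitute the ones your run actually produced.

```python
import shlex
from typing import List

SCRIPT = "06_inference/inference_pytorch.py"
PROMPT = "In the year 1834, I walked through the streets of London and witnessed"


def build_command(checkpoint: str, prompt: str = PROMPT) -> List[str]:
    """Build the argument list for one inference run against a checkpoint."""
    return ["python", SCRIPT, "--checkpoint", checkpoint, "--prompt", prompt]


# Example paths only; replace with checkpoints from your own training run.
checkpoints = [
    "09_models/checkpoints/slm/checkpoint-4000.pt",
    "09_models/checkpoints/slm/checkpoint-8000.pt",
]

commands = [build_command(path) for path in checkpoints]
for command in commands:
    # shlex.join quotes the prompt so the line can be pasted into a shell.
    print(shlex.join(command))
```

To execute the runs from Python instead of printing them, each argument list can be passed to subprocess.run(command, check=True) from the repository root.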

💡 Pro Tip: For published Hugging Face models and community access, see the Quick Start section earlier in this post, where we demonstrated the published SLM model.

8. Publish to Hugging Face

Once you’ve successfully trained and tested your models, you can publish them to Hugging Face for community access and easy deployment. Publishing makes your models available to researchers, developers, and enthusiasts worldwide, while integrating them into the Hugging Face ecosystem for seamless use with the transformers library.

Publishing Process: The publishing code automatically handles the complete conversion process from PyTorch checkpoints to Hugging Face format, which is essential for making your trained models accessible to the broader community. This conversion transforms your local training artifacts into a standardized format that can be easily loaded by users worldwide.

The process includes:

Converting model weights from PyTorch’s .pt format to the more efficient .safetensors format
Generating proper configuration files (config.json, generation_config.json) that define the model architecture and generation parameters
Uploading the custom tokenizer and all necessary files to ensure complete functionality
Creating comprehensive model cards with usage instructions and metadata for easy adoption
Setting up proper model repositories with versioning for professional deployment

This conversion is necessary because PyTorch checkpoints are optimized for training workflows and contain additional information like optimizer states that aren’t needed for inference, while the Hugging Face format is specifically designed for model sharing and deployment across different environments and hardware configurations.

We need to call the right script to publish the relevant model - either the SLM or the larger model. The publishing scripts will prompt you for your Hugging Face username and repository name, allowing you to customize where your models are published. The scripts automatically detect and use the latest checkpoint from your training run, so you can publish immediately after training completes.

💡 Quick Reference: If you want to test published models before publishing your own, see section 2 “Use the models - Try it now using Hugging Face” for immediate access to pre-trained models.

Prerequisites: You’ll need a Hugging Face account and either set the HF_TOKEN environment variable or provide your token when prompted. The scripts will guide you through the publishing process step by step.
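Before launching a publishing script, you can check whether the token is already present in your environment. This is a small sketch that only reads the HF_TOKEN variable mentioned above; resolve_hf_token is a hypothetical helper and not part of the publishing scripts.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the HF_TOKEN value without surrounding whitespace, or None if unset or empty."""
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if resolve_hf_token() is None:
    print("HF_TOKEN is not set; expect the publishing script to prompt for a token.")
else:
    print("HF_TOKEN found in the environment.")
```

The check never prints the token itself, which keeps it out of terminal history and logs.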

Option A: Publish SLM (117M parameters)

# Publish SLM to Hugging Face
python 10_scripts/publish_slm_to_huggingface.py

Option B: Publish Regular Model (354M parameters)

# Publish Regular model to Hugging Face  
python 10_scripts/publish_to_huggingface.py

If everything is working correctly and the models are published, you will see confirmation messages and upload progress. Here’s what successful publishing looks like:

HF - SLM upload

And this is an example output for the Regular model:

HF - Regular model upload

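After Publishing: Once published, your models will be available under the Hugging Face username and repository name you entered. For reference, the models from this project are at: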

SLM: bahree/london-historical-slm
Regular Model: bahree/london-historical-llm

Users can then easily load and use your models with just a few lines of code, making your historical language models accessible to the broader AI community for research, education, and creative applications.

Testing Your Published Models: Once published, you can test your models using the same inference methods shown in the Quick Start section:

# Test published SLM model (10 automated tests)
python 06_inference/test_published_models.py --model_type slm

# Interactive testing with published models
python 06_inference/inference_unified.py --published --model_type slm --interactive

10. What We’ve Accomplished

This comprehensive guide has taken you from raw historical documents to production-ready language models that can generate London text in the style of 1500-1850. We’ve built a complete pipeline that transforms 218+ historical sources into two specialized models - a fast SLM for experimentation and a powerful Regular model for high-quality generation. The entire system is fully functional, with both PyTorch checkpoint inference and Hugging Face model publishing working seamlessly, tested and validated on real hardware.

What makes this project interesting is that it’s not just another language model - it’s a complete educational journey that teaches you every aspect of building LLMs from scratch. From custom historical tokenizers that understand archaic English to sophisticated GPU optimization and production-ready deployment, you’ve learned the full stack of modern language model development. The result is a system that preserves historical linguistic heritage while demonstrating cutting-edge AI techniques, making it valuable for researchers, educators, and anyone interested in the intersection of history and technology.

11. The Journey Continues

This is just the beginning. In the next three parts of this series, we’ll dive deeper into the technical foundations:

Part 2 explores historical data collection, showing how we curated 218+ authentic sources spanning 350 years of London’s history, and how we built a custom tokenizer that truly understands historical English.

Part 3 reveals the custom GPT architecture designed specifically for historical text, GPU optimization strategies, and production-ready training infrastructure.

Part 4 completes the journey with professional evaluation frameworks, testing strategies, and deployment techniques that transform your trained models into production-ready systems.

Each part builds on what you’ve learned here, taking you from a high-level overview to deep technical implementation details.

12. Resources

⚙️ GitHub Repository: github.com/bahree/helloLondon - Complete codebase with all training scripts, inference tools, and documentation
🤗 Hugging Face Models:
- bahree/london-historical-slm - Small Language Model (117M parameters)
- bahree/london-historical-llm - Regular Model (354M parameters)
📘 Book Reference: Generative AI in Action - For deeper understanding of core LLM concepts
📖 Documentation: Complete guides in the 08_documentation/ folder covering every aspect of the project

13. Acknowledgments

This project builds upon the excellent work of the open-source community. Special thanks to haykgrigo3’s TimeCapsuleLLM for the initial inspiration and framework for historical language model training, and to Andrej Karpathy’s nanoGPT for the foundational GPT architecture and training methodology. The project extends these foundations with specialized adaptations for historical text, including custom tokenizers, advanced data filtering, and production-ready deployment infrastructure.

🙏

Ready to dive deeper? Part 2 will cover the technical details of data collection, data cleaning, and building the custom tokenizer.
