Building LLMs from Scratch - Part 4: Evaluation & Deployment

TL;DR

In this final part of our 4-part series on building language models from scratch, we explore the evaluation, testing, and deployment pipeline that transforms our trained historical language models into working systems. Part 1 showed you how to use the published models, Part 2 covered data collection and custom tokenization, and Part 3 detailed the model architecture and training infrastructure. Here, we complete the journey with evaluation frameworks, testing infrastructure, and deployment to Hugging Face Hub.

⚠️ Educational Purpose: This is a learning project designed to teach LLM development concepts. For production-scale LLMs, you’ll need much larger datasets, more sophisticated infrastructure, and additional considerations not covered here.

As outlined in Part 1 , both the SLM (117M parameters) and the Regular Model (354M parameters) use the same training code and infrastructure with different configurations defined in config.py. The evaluation and deployment infrastructure is also identical - only the model architecture parameters differ.

Both PyTorch checkpoint inference and Hugging Face model inference are fully working and available. Both the SLM and the Regular model are published on Hugging Face Hub . Local PyTorch checkpoints can be used directly for inference with the script inference_pytorch.py.

🔗 GitHub Repository: github.com/bahree/helloLondon - Complete evaluation and deployment infrastructure (05_evaluation/, 06_inference/, 10_scripts/) plus guides (08_documentation/EVALUATION_GUIDE.md, 08_documentation/HUGGINGFACE_PUBLISHING.md, 08_documentation/DEPLOYMENT_GUIDE.md)
🟥 Series Posts: Part 1 - Using the Published Historical Models | Part 2 - Data Collection & Custom Tokenizer | Part 3 - Training Architecture & GPU Optimization | Part 4 (this post)
🟧 Published Models: SLM Model | Regular Model - Ready-to-use historical language models on Hugging Face
📗 Book Reference: Generative AI in Action - For deeper understanding of core LLM concepts

1. The Evaluation Challenge: Measuring What Matters for Historical Language Models

Now that we have trained models from Part 3 , we face a critical question: How do we know if our models actually work? This isn’t just about checking if the code runs - it’s about validating that the models can generate historically accurate, linguistically appropriate text that captures the essence of 1500-1850 London English.

The challenge with evaluating historical language models goes far beyond standard LLM metrics. Standard evaluation approaches like Perplexity and BLEU scores (we explain these and other metrics in Section 2.1 ) tell us whether the model generates fluent text. Still, they don’t answer the questions that matter for historical applications: Does the model avoid anachronisms? Can it distinguish between Tudor and Victorian language patterns? Does it understand London geography and historical context?

Consider a simple example: if we prompt the model with “In the year 1600, I traveled to London by railway”, a standard language model might generate this without flagging the obvious problem - railways didn’t exist in 1600. The evaluation framework needs to catch these temporal inconsistencies, period-inappropriate language, and historical inaccuracies that standard metrics miss.

This evaluation challenge requires building a specialized assessment pipeline that understands historical context, temporal boundaries, and period-specific linguistic patterns. We need metrics that can distinguish between a model that generates fluent modern English and one that produces authentic historical text - two very different capabilities.

1.1 High-Level Evaluation Strategy

Our evaluation framework provides two complementary approaches that work with both PyTorch checkpoints and Hugging Face models, as illustrated in Figure 1 below.

graph TD
    A[🤖 Trained Models<br/>SLM 117M / Regular 354M] --> B{Evaluation Type}
    
    B -->|Quick| C[⚡ Quick Evaluation<br/>Historical accuracy, language quality, coherence]
    B -->|Comprehensive| D[🔬 Comprehensive Evaluation<br/>Benchmarks, G-Eval, groundedness]
    
    C --> E[📊 Evaluation Results<br/>Historical accuracy scores, metrics]
    D --> E
    
    E --> F{Quality OK?}
    F -->|Yes| G[🚀 Deployment Options]
    F -->|No| H[🔄 Retrain/Adjust]
    H --> A
    
    G --> I[📦 PyTorch Checkpoints<br/>Direct inference]
    G --> J[🤗 Hugging Face Hub<br/>Published models]
    G --> K[💻 Local Deployment<br/>API, CLI, notebooks]
    
    I --> L[✅ Working Models<br/>Ready for use]
    J --> L
    K --> L
    
    style A fill:#e1f5fe
    style E fill:#f3e5f5
    style L fill:#e8f5e8
    style H fill:#fff3e0

Figure 1: Complete Evaluation and Deployment Pipeline

Quick Evaluation (quick_eval.py): Rapid validation testing historical accuracy on key events (e.g., 1665 plague, 1666 fire, etc.), language quality metrics (vocabulary diversity, historical pattern detection, readability), and coherence (ROUGE scores). Runs in minutes without external APIs.

Comprehensive Evaluation (comprehensive_evaluator.py): Extends quick evaluation with benchmark datasets (small MMLU and HellaSWAG subsets), groundedness/fluency metrics, and optional LLM-as-a-judge scoring via G-Eval (using an external GPT model). Produces detailed reports with generation samples.

Both evaluators test across historical periods (such as Tudor, Stuart, and Georgian), language patterns (archaic pronouns and verb forms), and London-specific knowledge (geography and landmarks). The framework goes beyond standard LM metrics to assess period-appropriate language, temporal consistency, and historical accuracy.

2. Model Evaluation Framework

Now that we’ve outlined the evaluation challenge, let’s dive into the implementation. Our evaluation framework provides two complementary approaches that work with both PyTorch checkpoints and Hugging Face models. The framework is designed to be practical for a learning project while still providing meaningful insights into model performance.

2.1 Historical, Linguistic, and Category-Specific Evaluation

To make the evaluation concrete, we look at the model from three complementary aspects that together capture how well it understands the period, writes fluent text, and handles the different slices of the corpus. This multi-dimensional approach ensures we catch various types of failures - a model might generate grammatically perfect text but fail historically, or vice versa.

Historical assessments: Quick evaluation uses targeted prompts around key events (e.g., 1665 plague, 1666 fire, Old Bailey trials) and checks for expected keywords and phrases. Comprehensive evaluation adds temporal consistency checks (forbidden/required terms per period), date-range sanity checks, and historical benchmarks (custom historical questions and the MMLU subset).
Linguistic assessments: We measure surface quality (chars/words/sentences per sample, words per sentence), vocabulary diversity (unique/total tokens), readability (Flesch-style scores), and presence of historical patterns (archaic verb forms like hath, doth, pronouns like thou, thee, conjunctions and interjections). This shows whether the model writes in a historically flavored yet readable style.
Category-specific benchmarks: Evaluations are grouped by period (Tudor, Stuart, Georgian), by linguistic phenomena (archaic forms, dialogue patterns), and by London knowledge (Thames, Westminster, Old Bailey, etc.). The comprehensive evaluator further probes general reasoning using HellaSWAG and MMLU subsets to assess the model’s performance across broader benchmarks.

Industry-Standard Evaluation Metrics and Benchmarks
Our evaluation framework uses several standard metrics and benchmarks from LLM research. Here’s what each one measures and why we include it:
Perplexity: How surprised the model is by the reference text; lower is better because it means the model assigns higher probability to what actually happened in the corpus.
BLEU / ROUGE: N-gram overlap between generated and reference text, giving a rough sense of literal similarity and how closely the model “sticks” to the reference phrasing. We use ROUGE-L (longest common subsequence) to evaluate coherence and narrative flow.
MMLU (Massive Multitask Language Understanding): A large multiple-choice exam covering many academic subjects. Here, we use a tiny subset as a sanity check for general knowledge and reasoning, not as a primary goal.
HellaSWAG: A commonsense inference benchmark where the model must pick a plausible continuation for a short story-like context. We use it to see whether the model’s basic reasoning looks sensible.
G-Eval: An LLM-as-a-judge pattern where a stronger reference model (for example, GPT) scores generated text along dimensions like coherence or groundedness. In this project, it is optional and requires an external API key.
Groundedness: Asks: does the model stick to the provided context / known facts, or hallucinate? Our implementation approximates this by comparing generations against reference answers and historical constraints.
For a deeper treatment of evaluation benchmarks (including MMLU, HellaSWAG, and LLM-as-a-judge methods like G-Eval), see Chapter 12 - Evaluating and Monitoring Generative Systems in the book 📘 Generative AI in Action .

2.2 Automated Evaluation Pipeline

The run_comprehensive_evaluation function in 05_evaluation/comprehensive_evaluator.py orchestrates the entire evaluation process. Listing 1 shows how it works: We iterate over test sets, generate text with the model, compute all the metrics defined above, and aggregate the results into a results dictionary for analysis.

def run_comprehensive_evaluation(model, tokenizer, test_data, device='cuda'):
    """Run comprehensive evaluation on historical language model"""
    
    # Initialize evaluation metrics
    metrics = {
        'perplexity': [],
        'bleu_scores': [],
        'rouge_scores': [],
        'historical_accuracy': [],
        'linguistic_quality': [],
        'coherence_scores': [],
        'temporal_consistency': []
    }
    
    # Evaluate on different text types
    for text_type, samples in test_data.items():
        logger.info(f"Evaluating on {text_type} samples...")
        
        for sample in samples:
            # Generate text
            generated = generate_text(model, tokenizer, sample['prompt'], device)
            
            # Calculate metrics
            perplexity = calculate_perplexity(model, tokenizer, sample['text'], device)
            bleu = calculate_bleu(generated, sample['reference'])
            rouge = calculate_rouge(generated, sample['reference'])
            hist_acc = assess_historical_accuracy(generated, sample['context'])
            ling_qual = assess_linguistic_quality(generated)
            coherence = assess_coherence(generated)
            temp_cons = assess_temporal_consistency(generated, sample['time_period'])
            
            # Store metrics
            metrics['perplexity'].append(perplexity)
            metrics['bleu_scores'].append(bleu)
            metrics['rouge_scores'].append(rouge)
            metrics['historical_accuracy'].append(hist_acc)
            metrics['linguistic_quality'].append(ling_qual)
            metrics['coherence_scores'].append(coherence)
            metrics['temporal_consistency'].append(temp_cons)
    
    # Calculate aggregate metrics
    results = {}
    for metric_name, values in metrics.items():
        results[metric_name] = {
            'mean': np.mean(values),
            'std': np.std(values),
            'min': np.min(values),
            'max': np.max(values),
            'median': np.median(values)
        }
    
    return results

Listing 1: Comprehensive Evaluation Pipeline

The pipeline computes all the metrics we outlined above (standard LM metrics such as perplexity and BLEU/ROUGE, plus our historically specific assessments of accuracy, linguistic quality, and coherence). Each metric provides a different lens through which to view model performance: perplexity measures how well the model predicts the training distribution, BLEU/ROUGE measures literal similarity to the reference text, and the custom metrics assess historical authenticity and linguistic appropriateness.

Why This Multi-Metric Approach Matters?

Standard language model evaluation often focuses on perplexity and n-gram overlap metrics, which measure general language quality but miss domain-specific requirements. For historical language models, we need to know not just whether the text is fluent, but whether it’s historically accurate, temporally consistent, and linguistically appropriate for the target period. This multi-metric approach ensures we catch different types of failures - a model might generate grammatically perfect text but fail historically, or produce historically accurate content with poor linguistic quality.

The aggregation step (computing mean, std, min, max, median) provides a comprehensive view of model performance across different test cases. This statistical summary helps identify whether the model performs consistently or has high variance, whether certain types of prompts cause failures, and how the model compares across different historical periods and linguistic phenomena.

2.3 Historical Accuracy Assessment

Standard LLM evaluation metrics (perplexity, BLEU, ROUGE) measure general language quality, but they don’t tell us whether the model generates historically accurate text for London between 1500-1850. To address this, we built customized evaluation tools that check period-appropriate language, temporal consistency, London-specific geography and landmarks, and historical fact accuracy. These tools are implemented in 05_evaluation/comprehensive_evaluator.py as shown in Listing 2:

def assess_historical_accuracy(generated_text, historical_context):
    """Assess the historical accuracy of generated text"""
    
    accuracy_score = 0.0
    total_checks = 0
    
    # Check temporal consistency
    temporal_score = check_temporal_consistency(generated_text, historical_context['time_period'])
    accuracy_score += temporal_score
    total_checks += 1
    
    # Check historical facts
    fact_score = check_historical_facts(generated_text, historical_context['facts'])
    accuracy_score += fact_score
    total_checks += 1
    
    # Check period-appropriate language
    language_score = check_period_language(generated_text, historical_context['time_period'])
    accuracy_score += language_score
    total_checks += 1
    
    # Check geographical accuracy
    geo_score = check_geographical_accuracy(generated_text, historical_context['location'])
    accuracy_score += geo_score
    total_checks += 1
    
    # Check social context accuracy
    social_score = check_social_context(generated_text, historical_context['social_class'])
    accuracy_score += social_score
    total_checks += 1
    
    return accuracy_score / total_checks

def check_temporal_consistency(text, time_period):
    """Check if text maintains temporal consistency with the specified period"""
    
    # Define period-specific constraints
    period_constraints = {
        '1500-1600': {
            'forbidden_terms': ['electricity', 'steam engine', 'railway'],
            'required_terms': ['ye', 'hath', 'doth'],
            'date_range': (1500, 1600)
        },
        '1600-1700': {
            'forbidden_terms': ['railway', 'telegraph'],
            'required_terms': ['hath', 'doth', 'verily'],
            'date_range': (1600, 1700)
        },
        '1700-1800': {
            'forbidden_terms': ['telegraph', 'telephone'],
            'required_terms': ['hath', 'doth', 'indeed'],
            'date_range': (1700, 1800)
        },
        '1800-1850': {
            'forbidden_terms': ['telephone', 'automobile'],
            'required_terms': ['indeed', 'verily', 'pray'],
            'date_range': (1800, 1850)
        }
    }
    
    if time_period not in period_constraints:
        return 0.5  # Neutral score for unknown periods
    
    constraints = period_constraints[time_period]
    score = 1.0
    
    # Check for forbidden terms (anachronisms)
    for term in constraints['forbidden_terms']:
        if term.lower() in text.lower():
            score -= 0.2
    
    # Check for required period-appropriate terms
    period_terms_found = 0
    for term in constraints['required_terms']:
        if term.lower() in text.lower():
            period_terms_found += 1
    
    if constraints['required_terms']:
        score += 0.3 * (period_terms_found / len(constraints['required_terms']))
    
    # Check date references
    date_score = check_date_references(text, constraints['date_range'])
    score += 0.2 * date_score
    
    return max(0.0, min(1.0, score))

Listing 2: Historical Accuracy Assessment

The forbidden terms (like “electricity” for 1500-1600, “railway” for 1600-1700) are anachronisms - technologies or concepts that didn’t exist in those periods. We selected them based on historical timelines: electricity wasn’t harnessed until the late 1700s, railways didn’t appear until the early 1800s, and telegraphs came later. Similarly, the required terms (such as “hath”, “doth”, and “verily”) are archaic language patterns we observed frequently in the training corpus for each period.

We analyzed the corpus to identify which linguistic markers were most characteristic of each era, then selected a small set that would catch obvious anachronisms without being overly restrictive. This is a practical heuristic rather than an exhaustive historical grammar - we focus on high-impact anachronisms and common period markers that are easy to detect automatically.

How the scoring works

The check_temporal_consistency() function starts with a score of 1.0 and applies penalties and bonuses: each forbidden term found subtracts 0.2 (so finding “railway” in 1600-1700 text drops the score), while finding required period-appropriate terms adds up to 0.3 based on how many are present. Date references within the period add up to 0.2. The final score ranges from 0.0 to 1.0.

The overall assess_historical_accuracy() function then averages the five component scores (temporal consistency, historical facts, period-appropriate language, geographical accuracy, and social context) to produce a single score between 0 and 1, with higher values indicating better historical accuracy. In practice (and yes, we are generalizing), scores above 0.7 indicate good historical consistency, while scores below 0.5 suggest significant anachronisms or factual errors.

2.4 Linguistic Quality Evaluation

While historical accuracy checks whether the model gets facts and period-appropriate terms right, linguistic quality measures how well the model writes - grammar, coherence, vocabulary diversity, sentence structure, and the presence of historical language patterns.

Standard metrics like BLEU and ROUGE don’t capture whether the text reads naturally or uses appropriate archaic forms. We built customized tools that assess these dimensions, implemented in 05_evaluation/comprehensive_evaluator.py as shown in Listing 3:

To make this easier to read, it helps to view the code as a scoring scaffold rather than a complete NLP system. Each check_* function is expected to return a normalized score in the range [0, 1] (higher is better), and assess_linguistic_quality() simply averages those components so you can track one headline number over time.

This mirrors patterns from earlier in the series: in Part 2 we used lightweight, automatable checks to validate data quality, and in Part 3 we relied on simple, repeatable metrics to judge training health. Here, we do the same for generation quality: start with cheap checks that run everywhere, then iterate toward richer evaluators as needed.

Also note that the exact weights (0.3/0.2, etc.) are tunable. The main benefit is splitting “linguistic quality” into components you can inspect individually, so when output is bad, you can tell why (grammar-ish structure vs coherence vs vocabulary vs historically flavored patterns).

def assess_linguistic_quality(generated_text):
    """Assess the linguistic quality of generated historical text"""
    
    quality_score = 0.0
    total_checks = 0
    
    # Check grammatical correctness
    grammar_score = check_grammatical_correctness(generated_text)
    quality_score += grammar_score
    total_checks += 1
    
    # Check coherence and flow
    coherence_score = check_text_coherence(generated_text)
    quality_score += coherence_score
    total_checks += 1
    
    # Check vocabulary appropriateness
    vocab_score = check_vocabulary_appropriateness(generated_text)
    quality_score += vocab_score
    total_checks += 1
    
    # Check sentence structure variety
    structure_score = check_sentence_structure_variety(generated_text)
    quality_score += structure_score
    total_checks += 1
    
    # Check historical language patterns
    pattern_score = check_historical_language_patterns(generated_text)
    quality_score += pattern_score
    total_checks += 1
    
    return quality_score / total_checks

def check_grammatical_correctness(text):
    """Check grammatical correctness of generated text"""
    
    # Parse text into sentences
    sentences = nltk.sent_tokenize(text)
    
    if not sentences:
        return 0.0
    
    correct_sentences = 0
    
    for sentence in sentences:
        # Check for basic grammatical patterns
        if check_sentence_grammar(sentence):
            correct_sentences += 1
    
    return correct_sentences / len(sentences)

def check_historical_language_patterns(text):
    """Check if text follows appropriate historical language patterns"""
    
    score = 0.0
    total_patterns = 0
    
    # Check for appropriate use of historical verb forms
    historical_verbs = ['hath', 'doth', 'dost', 'art', 'wilt', 'shalt']
    verb_score = 0
    for verb in historical_verbs:
        if verb in text.lower():
            verb_score += 1
    if historical_verbs:
        score += 0.3 * (verb_score / len(historical_verbs))
    total_patterns += 1
    
    # Check for appropriate use of historical pronouns
    historical_pronouns = ['thou', 'thee', 'thy', 'thine', 'ye']
    pronoun_score = 0
    for pronoun in historical_pronouns:
        if pronoun in text.lower():
            pronoun_score += 1
    if historical_pronouns:
        score += 0.3 * (pronoun_score / len(historical_pronouns))
    total_patterns += 1
    
    # Check for appropriate use of historical conjunctions
    historical_conjunctions = ['whilst', 'betwixt', 'amongst', 'ere', 'anon']
    conj_score = 0
    for conj in historical_conjunctions:
        if conj in text.lower():
            conj_score += 1
    if historical_conjunctions:
        score += 0.2 * (conj_score / len(historical_conjunctions))
    total_patterns += 1
    
    # Check for appropriate use of historical interjections
    historical_interjections = ['verily', 'indeed', 'forsooth', 'prithee', 'marry']
    interj_score = 0
    for interj in historical_interjections:
        if interj in text.lower():
            interj_score += 1
    if historical_interjections:
        score += 0.2 * (interj_score / len(historical_interjections))
    total_patterns += 1
    
    return score / total_patterns if total_patterns > 0 else 0.0

Listing 3: Linguistic Quality Evaluation

About NLTK: We use NLTK (Natural Language Toolkit), a standard Python library for natural language processing, to handle text tokenization. If you followed Part 1 ’s setup instructions, NLTK was already installed as part of the data processing dependencies. In check_grammatical_correctness(), we use nltk.sent_tokenize() to split text into sentences so we can evaluate grammar sentence-by-sentence. NLTK also provides word tokenization (word_tokenize) and BLEU score calculation (sentence_bleu), which are used elsewhere in the evaluation pipeline.

We chose NLTK because it’s well-established, handles edge cases (like abbreviations and historical punctuation), and provides reliable sentence boundaries even with archaic English patterns. The same qualities made it useful during data collection and cleaning (covered in Part 2 ).

The historical language patterns we check (verbs like hath, doth, pronouns like thou, thee, conjunctions like whilst, betwixt, and interjections like verily, forsooth) are the same archaic forms we identified during corpus analysis for temporal consistency. The difference here is that we’re measuring their presence as a positive signal of historical authenticity, rather than using them as required/forbidden constraints. Each pattern category (verbs, pronouns, conjunctions, interjections) contributes proportionally to the score based on how many patterns from that category appear in the text.

How the scoring works

The assess_linguistic_quality() function averages five component scores (grammar, coherence, vocabulary appropriateness, sentence structure variety, and historical language patterns) to produce a single score between 0 and 1. Each component is evaluated independently and returns a score in the range [0, 1].

For example, check_grammatical_correctness() counts the proportion of grammatically correct sentences, while check_historical_language_patterns() weights the presence of archaic verb forms (30%), pronouns (30%), conjunctions (20%), and interjections (20%) to produce a pattern score. The final linguistic quality score is the simple average of all five components. In practice, scores above 0.75 indicate strong linguistic quality with good grammar and historical flavor, while scores below 0.6 suggest the model struggles with either basic grammar or historical language patterns.

2.5 Running Evaluations

You can run the evaluators directly from the command line. The framework defaults to CPU for safety (so you can evaluate during training without GPU conflicts), but you can use --device gpu when the GPU is free for faster evaluation.

Quick example:

# Quick evaluation (runs in minutes, no external APIs)
python 05_evaluation/run_evaluation.py --mode quick --device cpu

# Comprehensive evaluation (includes benchmarks, optional G-Eval)
python 05_evaluation/run_evaluation.py --mode comprehensive --device cpu

The unified launcher (run_evaluation.py) supports multiple modes: setup (install dependencies), quick (fast validation), comprehensive (full suite with benchmarks), dataset (generate test cases), and all (complete evaluation). You can also call quick_eval.py or comprehensive_evaluator.py directly if you need more control.

Practical Evaluation Workflow:

Our typical evaluation workflow follows this pattern:

After Training: Run a quick evaluation to get immediate feedback on model performance
Before Publishing: Run a comprehensive evaluation to ensure the model meets quality standards
During Development: Use interactive testing to explore model behavior on specific prompts
For Research: Generate custom test datasets and run targeted evaluations

The framework defaults to CPU for safety (so you can evaluate during training without GPU conflicts), but you can use --device gpu when the GPU is free for faster evaluation. This design allows continuous assessment throughout the training process without interfering with GPU resources needed for training.

For complete usage examples, command-line options, and troubleshooting, see the Evaluation Guide in the repository.

3. Comprehensive Testing Pipeline

3.1 Automated Testing Framework

The 06_testing package contains a parallel set of tests that double-check the full system. Listing 4 captures the idea behind run_comprehensive_tests.

We group tests into basic functionality, historical accuracy, linguistic quality, performance, edge cases, and integration, then run them as a batch and emit a structured report. This mirrors how you would build a real CI test suite, but at a scale appropriate for this learning project.

def run_comprehensive_tests(model, tokenizer, device='cuda'):
    """Run comprehensive tests on historical language model"""
    
    test_results = {
        'basic_functionality': test_basic_functionality(model, tokenizer, device),
        'historical_accuracy': test_historical_accuracy(model, tokenizer, device),
        'linguistic_quality': test_linguistic_quality(model, tokenizer, device),
        'performance_metrics': test_performance_metrics(model, tokenizer, device),
        'edge_cases': test_edge_cases(model, tokenizer, device),
        'integration_tests': test_integration(model, tokenizer, device)
    }
    
    # Generate test report
    generate_test_report(test_results)
    
    return test_results

def test_basic_functionality(model, tokenizer, device):
    """Test basic model functionality"""
    
    tests = {
        'text_generation': test_text_generation(model, tokenizer, device),
        'tokenization': test_tokenization(tokenizer),
        'model_loading': test_model_loading(model, device),
        'memory_usage': test_memory_usage(model, device),
        'inference_speed': test_inference_speed(model, tokenizer, device)
    }
    
    return tests

def test_historical_accuracy(model, tokenizer, device):
    """Test historical accuracy of generated text"""
    
    tests = {
        'temporal_consistency': test_temporal_consistency(model, tokenizer, device),
        'factual_accuracy': test_factual_accuracy(model, tokenizer, device),
        'period_appropriate_language': test_period_language(model, tokenizer, device),
        'geographical_accuracy': test_geographical_accuracy(model, tokenizer, device),
        'social_context_accuracy': test_social_context(model, tokenizer, device)
    }
    
    return tests

Listing 4: Comprehensive Testing Framework

Automated tests cover basics, historical accuracy, linguistic quality, performance, edge cases, and integration.

3.2 Interactive Testing and Validation

For manual exploration, the interactive testing interface (conceptually similar to the CLI flows in 06_inference/inference_unified.py) lets you type prompts, trigger specific test groups, and immediately inspect analysis for each generation. Listing 5 shows a simple REPL loop that dispatches to the same evaluation helpers used in the automated tests.

def interactive_testing(model, tokenizer, device='cuda'):
    """Interactive testing interface for historical language model"""
    
    print("Interactive Testing Mode")
    print("=" * 50)
    print("Enter prompts to test the model. Type 'quit' to exit.")
    print("Available commands:")
    print("  - Enter any text prompt to generate continuation")
    print("  - 'test_historical' - Run historical accuracy tests")
    print("  - 'test_linguistic' - Run linguistic quality tests")
    print("  - 'test_performance' - Run performance tests")
    print("  - 'quit' - Exit testing mode")
    print()
    
    while True:
        try:
            prompt = input("Enter prompt: ").strip()
            
            if prompt.lower() == 'quit':
                break
            elif prompt.lower() == 'test_historical':
                run_historical_tests(model, tokenizer, device)
            elif prompt.lower() == 'test_linguistic':
                run_linguistic_tests(model, tokenizer, device)
            elif prompt.lower() == 'test_performance':
                run_performance_tests(model, tokenizer, device)
            elif prompt:
                # Generate text
                generated = generate_text(model, tokenizer, prompt, device)
                print(f"Generated: {generated}")
                print()
                
                # Analyze generated text
                analysis = analyze_generated_text(generated, prompt)
                print(f"Analysis: {analysis}")
                print()
            else:
                print("Please enter a valid prompt or command.")
                
        except KeyboardInterrupt:
            print("\nExiting interactive testing mode...")
            break
        except Exception as e:
            print(f"Error: {e}")
            print("Please try again.")

def analyze_generated_text(text, prompt):
    """Analyze generated text for quality and accuracy"""
    
    analysis = {
        'length': len(text),
        'sentences': len(nltk.sent_tokenize(text)),
        'historical_accuracy': assess_historical_accuracy(text, {}),
        'linguistic_quality': assess_linguistic_quality(text),
        'coherence': assess_coherence(text),
        'relevance': assess_relevance(text, prompt)
    }
    
    return analysis

Listing 5: Interactive Testing Interface

Interactive mode lets you try prompts, run quick tests, and see immediate analysis.

3.3 Performance Benchmarking

Performance benchmarking follows the same pattern: generate controlled workloads and measure speed and resource usage. Listing 6 illustrates how we vary sequence length, measure average latency, and compute tokens-per-second, alongside separate helpers for memory, batch throughput, long-sequence handling, and basic concurrency.

def benchmark_model_performance(model, tokenizer, device='cuda'):
    """Benchmark model performance across different scenarios"""
    
    benchmarks = {
        'inference_speed': benchmark_inference_speed(model, tokenizer, device),
        'memory_usage': benchmark_memory_usage(model, device),
        'batch_processing': benchmark_batch_processing(model, tokenizer, device),
        'long_sequence_handling': benchmark_long_sequences(model, tokenizer, device),
        'concurrent_requests': benchmark_concurrent_requests(model, tokenizer, device)
    }
    
    return benchmarks

def benchmark_inference_speed(model, tokenizer, device):
    """Benchmark inference speed for different sequence lengths"""
    
    sequence_lengths = [50, 100, 200, 500, 1000]
    results = {}
    
    for length in sequence_lengths:
        # Generate test prompts of different lengths
        prompts = generate_test_prompts(length, num_prompts=100)
        
        # Measure inference time
        start_time = time.time()
        for prompt in prompts:
            generate_text(model, tokenizer, prompt, device)
        end_time = time.time()
        
        total_time = end_time - start_time
        avg_time_per_prompt = total_time / len(prompts)
        tokens_per_second = length / avg_time_per_prompt
        
        results[length] = {
            'avg_time_per_prompt': avg_time_per_prompt,
            'tokens_per_second': tokens_per_second,
            'total_time': total_time
        }
    
    return results

Listing 6: Performance Benchmarking

Benchmarks capture inference speed, memory, batch throughput, long-sequence handling, and simple concurrency.

4. Model Deployment and Publishing

With evaluation and testing complete, we’re ready to make our models available for use. This section covers the two deployment paths we support: direct inference from PyTorch checkpoints (useful during development and for maximum control) and publishing to Hugging Face Hub (for easy sharing and community access).

As called out in Part 1 , both the SLM (117M parameters) and the Regular Model (354M parameters) are fully trained and available. The SLM has already been published on Hugging Face Hub , while the Regular Model is ready for publication. Both can also be run directly from local PyTorch checkpoints.

4.1 Two Paths to Inference

We provide two complementary ways to run inference, each suited to different use cases.

PyTorch Checkpoint Inference gives you direct access to the trained model weights without any conversion overhead. This is ideal during development, when you want to test a freshly trained checkpoint, or when you need maximum control over the inference process. The checkpoints live in 09_models/checkpoints/ - the SLM at slm/checkpoint-4000.pt (117M parameters) and the Regular Model at checkpoint-60001.pt (354M parameters). The inference_pytorch.py script handles loading these directly: Listing 7

# SLM inference from checkpoint
python 06_inference/inference_pytorch.py \
  --checkpoint 09_models/checkpoints/slm/checkpoint-4000.pt \
  --prompt "In the year 1834, I walked through the streets of London and witnessed"

# Regular model inference from checkpoint
python 06_inference/inference_pytorch.py \
  --checkpoint 09_models/checkpoints/checkpoint-60001.pt \
  --prompt "In the year 1834, I walked through the streets of London and witnessed"

Listing 7: Running Inference from PyTorch Checkpoints

Hugging Face Model Inference uses the published models on Hugging Face Hub, which means anyone can load and use them with just a few lines of code - no need to download checkpoints or set up the full training environment. The inference_unified.py script provides a consistent interface for both published models and local checkpoints: Listing 8

# Published model inference (downloads from Hugging Face Hub)
python 06_inference/inference_unified.py \
  --published \
  --model_name bahree/london-historical-slm \
  --prompt "In the year 1834, I walked through the streets of London and witnessed"

# Interactive mode for exploration
python 06_inference/inference_unified.py --published --model_type slm --interactive

# Demo mode with curated historical prompts
python 06_inference/inference_unified.py --published --model_type slm --demo

Listing 8: Hugging Face Model Inference

We’ve tested both paths extensively. The published SLM loads in about 9 seconds on a GPU, generates text in under 6 seconds, and passes all 10 automated validation tests. The unified inference script provides clean logging, proper model detection, and accurate parameter counts - small details that make a big difference when debugging or demonstrating the models.

4.2 Publishing to Hugging Face Hub

Publishing to Hugging Face Hub makes our models accessible to the broader community without requiring anyone to clone our repository or set up a training environment. The process involves converting our PyTorch checkpoints to the Hugging Face format, creating a model card with documentation, and uploading everything to the Hub.

The publishing workflow is handled by scripts in 10_scripts/ - specifically publish_slm_to_huggingface.py for the SLM and publish_to_huggingface.py for the Regular Model. Listing 9 shows the core publishing flow: authenticate with the Hub, create (or reuse) a repository, save the model and tokenizer locally in Hugging Face format, upload the folder, and generate a model card.

def publish_to_huggingface(model, tokenizer, model_name, description, tags):
    """Publish model to Hugging Face Hub"""
    from huggingface_hub import HfApi
    
    api = HfApi()
    
    # Create model repository
    repo_id = f"bahree/{model_name}"
    api.create_repo(repo_id=repo_id, exist_ok=True)
    
    # Save model and tokenizer locally
    model.save_pretrained(f"./models/{model_name}")
    tokenizer.save_pretrained(f"./models/{model_name}")
    
    # Upload to Hub
    api.upload_folder(
        folder_path=f"./models/{model_name}",
        repo_id=repo_id,
        commit_message="Initial model upload"
    )
    
    # Generate and upload model card (README.md)
    model_card = generate_model_card(model_name, description, tags)
    api.upload_file(
        path_or_fileobj=model_card,
        path_in_repo="README.md",
        repo_id=repo_id
    )
    
    return repo_id

Listing 9: Hugging Face Publishing

The generate_model_card() function creates the README.md that appears on the Hugging Face model page. This includes model description, architecture details, training data sources, usage examples, and limitations. You can see the live model cards at bahree/london-historical-slm and bahree/london-historical-llm .

Listing 10 shows how to load and use the published models:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the published model
model_name = "bahree/london-historical-slm"  # or "bahree/london-historical-llm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Generate historical text
prompt = "In the year 1834, I walked through the streets of London and witnessed"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(
    inputs["input_ids"],
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Listing 10: Loading Models from Hugging Face Hub

4.3 Publishing Workflow

If you want to publish your own trained model to Hugging Face Hub, here’s the workflow we followed:

Set up authentication: Install huggingface_hub and authenticate with a token that has Write permissions. You can generate tokens at huggingface.co/settings/tokens .
Convert the checkpoint: PyTorch training checkpoints include optimizer states and training metadata that aren’t needed for inference. The conversion scripts extract just the model weights and translate them to Hugging Face’s naming conventions (covered in detail in Section 5).
Prepare the tokenizer: Save the tokenizer files alongside the model. Our custom tokenizer with 30,000 tokens and 150+ historical special tokens needs to be converted to the transformers library format.
Generate a model card: The README.md on your Hugging Face model page serves as documentation. Include model architecture details, training data sources, usage examples, evaluation results, and limitations. The scripts generate this automatically, but you should review and customize it.
Upload and validate: Push everything to the Hub, then immediately test with from_pretrained() to ensure the published model loads and generates correctly.

📝 Full documentation: See 08_documentation/HUGGINGFACE_PUBLISHING.md and 08_documentation/DEPLOYMENT_GUIDE.md in the repository for the complete step-by-step workflow with troubleshooting guidance.

5. PyTorch to Hugging Face Format Conversion

5.1 Why Format Conversion is Necessary

During training, our models are saved in PyTorch’s native .pt format. These checkpoints include everything needed to resume training: model weights, optimizer states, learning rate schedules, and training metadata. However, for deployment and sharing, we need a leaner, inference-optimized format compatible with the broader machine learning ecosystem.

Think of it like the difference between a development environment and a production deployment: training checkpoints are like a developer’s workspace with all the tools and intermediate files, while Hugging Face format is like a clean, standardized package that anyone can use without understanding the internal training details.

The Hugging Face Hub expects models to follow specific file structures, naming conventions, and metadata requirements. The conversion process extracts just the model weights (discarding optimizer states and training metadata), translates weight names to match Hugging Face conventions, creates proper configuration files, and ensures the tokenizer is compatible with the transformers library.

5.2 The Conversion Process

The conversion handles several transformations to bridge PyTorch and Hugging Face formats:

Weight name mapping: PyTorch layer names like transformer.h.0.attn.c_attn.weight become Hugging Face names like transformer.h.0.attn.c_attn.weight (mostly the same for GPT-2, but with careful handling of edge cases)
Automatic torch.compile handling: If you used torch.compile() during training, weights get prefixed with _orig_mod. - the conversion strips these prefixes
Configuration translation: Model hyperparameters (n_layer, n_head, n_embd, etc.) are mapped to Hugging Face’s config.json format
Tokenizer conversion: Our custom 30,000-token vocabulary with 150+ historical special tokens is converted to transformers library format
Validation: After conversion, we verify that the model loads correctly and produces expected outputs

💻 Full Implementation: See 10_scripts/publish_slm_to_huggingface.py for the complete conversion pipeline with error handling, validation, and model card generation.

5.3 Dependencies for Hugging Face Integration

The Hugging Face integration requires specific dependencies and follows established patterns for model publishing and usage: Listing 11

# Required dependencies for Hugging Face integration
huggingface_dependencies = {
    "transformers": ">=4.21.0",
    "torch": ">=1.12.0",
    "tokenizers": ">=0.12.0",
    "safetensors": ">=0.3.0",
    "accelerate": ">=0.20.0",
    "huggingface_hub": ">=0.10.0"
}

# Model loading and usage example
def load_published_model(model_name="bahree/london-historical-slm"):
    """Load published model from Hugging Face Hub"""
    
    # Suppress warnings for cleaner output
    import os
    import warnings
    import logging
    
    os.environ['TRANSFORMERS_VERBOSITY'] = 'error'
    warnings.filterwarnings('ignore')
    logging.getLogger("transformers").setLevel(logging.ERROR)
    os.environ['TOKENIZERS_PARALLELISM'] = 'false'
    
    # Load model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    
    # Set pad token if not set
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    
    return model, tokenizer

def generate_historical_text(model, tokenizer, prompt, max_length=50, temperature=0.3):
    """Generate historical text using the published model"""
    
    # Tokenize input
    inputs = tokenizer.encode(prompt, return_tensors="pt")
    
    # Generate text
    with torch.no_grad():
        outputs = model.generate(
            inputs,
            max_new_tokens=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            top_k=20,
            repetition_penalty=1.2,
            no_repeat_ngram_size=3,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            early_stopping=True
        )
    
    # Decode output
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Example usage
if __name__ == "__main__":
    # Load the published model
    model, tokenizer = load_published_model()
    
    # Test prompts
    test_prompts = [
        "In the year 1834, I walked through the streets of London and witnessed",
        "The gentleman from the country said, 'we have never seen such a sight",
        "The Thames flowed dark and mysterious through the heart",
        "Parliament sat in Westminster Hall",
        "The Great Fire of 1666 had destroyed"
    ]
    
    # Generate text for each prompt
    for prompt in test_prompts:
        generated = generate_historical_text(model, tokenizer, prompt)
        print(f"Prompt: {prompt}")
        print(f"Generated: {generated}")
        print("-" * 80)

Listing 11: Hugging Face Dependencies

Hugging Face integration provides standard from_pretrained() loading and generation with minimal setup, making the models easy to share and reuse.

5.4 Comprehensive Testing and Validation Framework

Once a model is on the Hub, 06_inference/test_published_models.py provides a concrete implementation of the testing pattern in Listing 12. It loads the model via from_pretrained, runs functional, historical, linguistic, and performance checks, and prints a human-readable summary so you can verify the published artefact behaves like your local checkpoints.

def test_published_model(model_name="bahree/london-historical-slm"):
    """Comprehensive testing of published model"""
    
    print(f"Testing published model: {model_name}")
    
    # Load model
    model, tokenizer = load_published_model(model_name)
    
    # Test basic functionality
    basic_tests = test_basic_functionality(model, tokenizer)
    
    # Test historical accuracy
    historical_tests = test_historical_accuracy(model, tokenizer)
    
    # Test linguistic quality
    linguistic_tests = test_linguistic_quality(model, tokenizer)
    
    # Test performance metrics
    performance_tests = test_performance_metrics(model, tokenizer)
    
    # Compile results
    results = {
        "basic_functionality": basic_tests,
        "historical_accuracy": historical_tests,
        "linguistic_quality": linguistic_tests,
        "performance_metrics": performance_tests
    }
    
    # Print summary
    print("\nTest Results Summary:")
    print("=" * 50)
    for category, tests in results.items():
        print(f"\n{category.replace('_', ' ').title()}:")
        for test_name, result in tests.items():
            status = "PASS" if result else "FAIL"
            print(f"  {test_name}: {status}")
    
    return results

def test_basic_functionality(model, tokenizer):
    """Test basic model functionality"""
    
    tests = {}
    
    # Test model loading
    tests["model_loading"] = model is not None and tokenizer is not None
    
    # Test tokenizer functionality
    test_text = "In the year 1834, London was"
    tokens = tokenizer.encode(test_text)
    decoded = tokenizer.decode(tokens)
    tests["tokenizer_encode_decode"] = test_text in decoded
    
    # Test model generation
    try:
        inputs = tokenizer.encode(test_text, return_tensors="pt")
        with torch.no_grad():
            outputs = model.generate(inputs, max_new_tokens=10, do_sample=False)
        generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
        tests["model_generation"] = len(generated) > len(test_text)
    except Exception as e:
        tests["model_generation"] = False
    
    # Test special tokens
    special_tokens = ["<|london|>", "<|thou|>", "<|hath|>", "<|doth|>"]
    special_token_tests = []
    for token in special_tokens:
        if token in tokenizer.get_vocab():
            special_token_tests.append(True)
        else:
            special_token_tests.append(False)
    tests["special_tokens"] = any(special_token_tests)
    
    return tests

def test_historical_accuracy(model, tokenizer):
    """Test historical accuracy of generated text"""
    
    tests = {}
    
    # Test prompts for different historical periods
    period_prompts = {
        "1500-1600": "In the year 1550, the gentleman said",
        "1600-1700": "In the year 1650, the gentleman said",
        "1700-1800": "In the year 1750, the gentleman said",
        "1800-1850": "In the year 1834, the gentleman said"
    }
    
    for period, prompt in period_prompts.items():
        try:
            generated = generate_historical_text(model, tokenizer, prompt, max_length=30)
            
            # Check for period-appropriate language
            period_terms = {
                "1500-1600": ["ye", "hath", "doth", "thou", "thee"],
                "1600-1700": ["hath", "doth", "thou", "thee", "verily"],
                "1700-1800": ["hath", "doth", "thou", "thee", "indeed"],
                "1800-1850": ["indeed", "verily", "whilst", "pray"]
            }
            
            found_terms = sum(1 for term in period_terms[period] if term in generated.lower())
            tests[f"period_{period}"] = found_terms > 0
            
        except Exception as e:
            tests[f"period_{period}"] = False
    
    # Test London-specific knowledge
    london_prompts = [
        "The Thames flowed through",
        "Westminster Hall was",
        "The Tower of London",
        "Cheapside was filled with"
    ]
    
    london_tests = []
    for prompt in london_prompts:
        try:
            generated = generate_historical_text(model, tokenizer, prompt, max_length=20)
            london_tests.append(len(generated) > len(prompt))
        except Exception as e:
            london_tests.append(False)
    
    tests["london_knowledge"] = any(london_tests)
    
    return tests

def test_linguistic_quality(model, tokenizer):
    """Test linguistic quality of generated text"""
    
    tests = {}
    
    # Test prompts for linguistic quality
    quality_prompts = [
        "The gentleman walked through the garden",
        "In the morning, the sun rose",
        "The old man sat by the fire",
        "The young woman read her book"
    ]
    
    quality_tests = []
    for prompt in quality_prompts:
        try:
            generated = generate_historical_text(model, tokenizer, prompt, max_length=30)
            
            # Check for basic linguistic quality
            sentences = generated.split('.')
            quality_tests.append(len(sentences) > 1)
            
        except Exception as e:
            quality_tests.append(False)
    
    tests["linguistic_quality"] = any(quality_tests)
    
    # Test coherence
    coherence_prompt = "The gentleman walked through the garden and"
    try:
        generated = generate_historical_text(model, tokenizer, coherence_prompt, max_length=50)
        tests["coherence"] = len(generated) > len(coherence_prompt)
    except Exception as e:
        tests["coherence"] = False
    
    return tests

def test_performance_metrics(model, tokenizer):
    """Test performance metrics of the model"""
    
    tests = {}
    
    # Test inference speed
    test_prompt = "In the year 1834, London was"
    try:
        start_time = time.time()
        generated = generate_historical_text(model, tokenizer, test_prompt, max_length=50)
        end_time = time.time()
        
        inference_time = end_time - start_time
        tests["inference_speed"] = inference_time < 5.0  # Should complete within 5 seconds
        
    except Exception as e:
        tests["inference_speed"] = False
    
    # Test memory usage
    try:
        import psutil
        process = psutil.Process()
        memory_usage = process.memory_info().rss / 1024 / 1024  # MB
        tests["memory_usage"] = memory_usage < 8000  # Should use less than 8GB
        
    except Exception as e:
        tests["memory_usage"] = True  # Skip if psutil not available
    
    return tests

Listing 12: Testing Published Models

Published model tests validate loading, generation, historical accuracy, and basic performance before and after publication.

5.5 Model Card Generation

The model card serves as the primary documentation on Hugging Face Hub, making it the first thing users see when they discover your model. A well-crafted model card helps users understand what the model does, how to use it, and its limitations. The generate_comprehensive_model_card() function in 10_scripts/publish_slm_to_huggingface.py creates this documentation automatically.

What Makes an Effective Model Card:

The model card for our historical language models includes several key sections that provide users with everything they need to get started. At a minimum, include:

Model Description & Key Features: A clear explanation that the model was trained from scratch (not fine-tuned), emphasizing the 117M parameter SLM variant and 354M parameter Regular Model, with details about the custom 30,000-token vocabulary and 150+ historical special tokens.
Setup Instructions: Platform-specific guidance for creating virtual environments (Linux/macOS/Windows), installing dependencies (transformers, torch, accelerate), and handling different accelerators (CPU, NVIDIA CUDA, AMD ROCm).
Quick Start Code: Auto-device detection that works across CPU, CUDA, and ROCm with sensible generation parameters (temperature=0.8, top_p=0.95, repetition_penalty=1.2).
Training Details: Architecture specifics (GPT-2 Small/Medium), training infrastructure (2x GPU with Distributed Data Parallel), performance metrics (training loss, MFU utilization), and data sources (218+ historical sources spanning 1500-1850).
Example Prompts: Period-specific prompts demonstrating different historical eras (Tudor, Stuart, Georgian, Victorian) and London-specific contexts (Thames, Westminster, Parliament).
Testing & Validation: Instructions for running the automated test suite (test_published_models.py) and interactive testing with custom prompts.
Troubleshooting: Common issues and solutions for PyTorch installation, GPU detection, and memory constraints.
Citation & License: BibTeX citation format and MIT license information.

Key Implementation Details:

The model card generation follows Hugging Face conventions with YAML frontmatter specifying license, library, pipeline type, language, and tags. The script emphasizes that models were trained from scratch (not fine-tuned) and provides device-agnostic code examples that run on CPU, CUDA, and ROCm.

The card also includes detailed model selection guidance comparing the SLM (faster, lower memory) versus the Regular Model (higher quality, more parameters), helping users choose the right model for their use case - whether that’s quick experimentation, educational purposes, or production deployment.

💻 Complete Implementation: See 10_scripts/publish_slm_to_huggingface.py and 10_scripts/publish_to_huggingface.py for the full model card generation implementation.

👀 Live Model Cards: View the published cards at bahree/london-historical-slm and bahree/london-historical-llm .

📝 Documentation: See HUGGINGFACE_PUBLISHING.md and DEPLOYMENT_GUIDE.md for complete publishing and deployment workflows.

5.6 Local Deployment Options

Finally, Listing 13 sketches how you might wrap a trained model into a simple REST API or CLI. These patterns are intentionally minimal, meant to help you connect the dots between the inference utilities in 06_inference/ and real applications (dashboards, notebooks, small services).

def setup_local_deployment(model, tokenizer, deployment_type='api'):
    """Set up local deployment for historical language model"""
    
    if deployment_type == 'api':
        return setup_api_deployment(model, tokenizer)
    elif deployment_type == 'cli':
        return setup_cli_deployment(model, tokenizer)
    elif deployment_type == 'notebook':
        return setup_notebook_deployment(model, tokenizer)
    else:
        raise ValueError(f"Unknown deployment type: {deployment_type}")

def setup_api_deployment(model, tokenizer):
    """Set up REST API deployment"""
    
    from flask import Flask, request, jsonify
    import torch
    
    app = Flask(__name__)
    
    @app.route('/generate', methods=['POST'])
    def generate_text():
        data = request.get_json()
        prompt = data.get('prompt', '')
        max_length = data.get('max_length', 100)
        temperature = data.get('temperature', 0.3)
        
        # Generate text
        inputs = tokenizer.encode(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = model.generate(
                inputs,
                max_length=max_length,
                temperature=temperature,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
        
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        return jsonify({
            'generated_text': generated_text,
            'prompt': prompt,
            'parameters': {
                'max_length': max_length,
                'temperature': temperature
            }
        })
    
    @app.route('/health', methods=['GET'])
    def health_check():
        return jsonify({'status': 'healthy', 'model_loaded': True})
    
    return app

def setup_cli_deployment(model, tokenizer):
    """Set up command-line interface deployment"""
    
    import argparse
    
    def main():
        parser = argparse.ArgumentParser(description='Historical Language Model CLI')
        parser.add_argument('--prompt', type=str, required=True, help='Text prompt')
        parser.add_argument('--max_length', type=int, default=100, help='Maximum length')
        parser.add_argument('--temperature', type=float, default=0.3, help='Temperature')
        parser.add_argument('--interactive', action='store_true', help='Interactive mode')
        
        args = parser.parse_args()
        
        if args.interactive:
            run_interactive_mode(model, tokenizer)
        else:
            generate_and_print(model, tokenizer, args.prompt, args.max_length, args.temperature)
    
    return main

Listing 13: Local Deployment Setup

Local deployment options: REST API, CLI, or notebook integration for different workflows.

6. Quality Assurance and Validation

Before wrapping up, let’s look at the quality assurance systems that ensure the models behave reliably across different scenarios.

6.1 Automated Quality Checks

The system includes automated quality checks that validate model performance and reliability: Listing 14

def run_quality_checks(model, tokenizer, device='cuda'):
    """Run quality checks on historical language model"""
    
    quality_checks = {
        'model_integrity': check_model_integrity(model),
        'tokenizer_consistency': check_tokenizer_consistency(tokenizer),
        'generation_quality': check_generation_quality(model, tokenizer, device),
        'historical_accuracy': check_historical_accuracy(model, tokenizer, device),
        'performance_metrics': check_performance_metrics(model, tokenizer, device),
        'memory_usage': check_memory_usage(model, device),
        'error_handling': check_error_handling(model, tokenizer, device)
    }
    
    # Generate quality report
    quality_report = generate_quality_report(quality_checks)
    
    return quality_checks, quality_report

def check_model_integrity(model):
    """Check model integrity and consistency"""
    
    checks = {
        'parameter_count': check_parameter_count(model),
        'weight_distribution': check_weight_distribution(model),
        'gradient_flow': check_gradient_flow(model),
        'activation_patterns': check_activation_patterns(model)
    }
    
    return checks

def check_generation_quality(model, tokenizer, device):
    """Check quality of generated text"""
    
    test_prompts = [
        "In the year of our Lord 1750, London was",
        "The Thames flowed through the heart of",
        "Merchants and tradesmen plied their wares",
        "The Great Fire of 1666 had changed",
        "Parliament sat in Westminster, making laws"
    ]
    
    quality_scores = []
    
    for prompt in test_prompts:
        # Generate text
        generated = generate_text(model, tokenizer, prompt, device)
        
        # Check quality metrics
        quality_score = {
            'coherence': assess_coherence(generated),
            'grammatical_correctness': assess_grammatical_correctness(generated),
            'historical_accuracy': assess_historical_accuracy(generated, {}),
            'linguistic_quality': assess_linguistic_quality(generated),
            'relevance': assess_relevance(generated, prompt)
        }
        
        quality_scores.append(quality_score)
    
    return quality_scores

Listing 14: Quality Assurance Checks

Quality checks cover model integrity, generation quality, historical accuracy, performance, and error handling, ensuring the models behave reliably across different scenarios.

6.2 Continuous Integration and Testing

If you want to wire this into a lightweight CI gate, keep it simple and CPU-friendly. The goal is not to re-run full benchmarks in CI - it’s to catch obvious regressions (can the model load, can it generate, do the evaluators still run).

Minimal CI smoke checks (suggested):

# 1) Run a fast, local evaluation pass (no external APIs)
python 05_evaluation/run_evaluation.py --mode quick --device cpu

# 2) Run a local inference smoke test from a checkpoint (replace with your path)
python 06_inference/inference_pytorch.py --checkpoint <path-to-checkpoint.pt> --prompt "In the year 1834, London was"

# 3) Optional: test the published model (requires downloading from Hugging Face)
python 06_inference/test_published_models.py --model_name bahree/london-historical-slm

7. Summary

We’ve now completed the full cycle of building language models from scratch. This final part has shown how to transform trained models into working systems that can be evaluated, tested, and deployed for real-world use. The journey that began in Part 1 with using published models, continued through Part 2 ’s data collection and tokenization, and Part 3 ’s training architecture, now concludes with evaluation and deployment - the critical final steps that make models usable.

What we’ve built:

The evaluation, testing, and deployment pipeline provides a practical approach for bringing historical language models from research to deployment. We’ve created specialized assessment metrics that go beyond standard LLM evaluation to catch historical inaccuracies, temporal inconsistencies, and period-inappropriate language. The testing infrastructure ensures reliability across different scenarios, while multiple deployment options make the models accessible to researchers, educators, and developers worldwide.

Current Deployment Status:

PyTorch Checkpoint Inference: Fully working with both SLM and Regular models
Hugging Face Model Inference: SLM published and available, Regular model ready
Local Testing: Both inference methods tested and validated on a remote Ubuntu machine
Documentation: Complete guides and examples for all inference methods
Performance: Clean logging, proper model detection, accurate parameter counts

The Complete Pipeline:

This four-part series has demonstrated the complete LLM development lifecycle:

Data Collection ( Part 2 ): We gathered 218+ historical sources spanning 1500-1850, processed them through a sophisticated cleaning pipeline, and created a 500M+ character corpus of authentic historical English.
Custom Tokenization ( Part 2 ): We built a specialized BPE tokenizer with 30,000 vocabulary tokens and 150+ special tokens that understand historical language patterns, London geography, and period-specific terminology.
Model Training ( Part 3 ): We implemented custom GPT architectures, optimized for multi-GPU training, and successfully trained two models - an SLM (117M parameters) and a Regular model (354M parameters) - both capable of generating authentic historical text.
Evaluation & Deployment (This Part): We built comprehensive evaluation frameworks that assess historical accuracy, linguistic quality, and temporal consistency. We created a testing infrastructure for reliability and deployed models to the Hugging Face Hub for community access.

The Learning Journey:

What started as a learning project has become a complete, working system that demonstrates every aspect of LLM development - from raw data collection through model deployment. The principles and techniques we’ve covered scale from the 500M-character corpus to production-scale systems, and the evaluation frameworks we’ve built can be adapted to any domain-specific language modeling task.

Whether you’re a researcher exploring historical linguistics, an educator teaching AI concepts, or a developer building specialized language models, this series provides the complete toolkit for understanding and implementing LLM development from scratch. The models are published, the code is available, and the journey from data to deployment is complete.

🔗 GitHub Repository: github.com/bahree/helloLondon - Complete training infrastructure (04_training/), model architecture (config.py), and evaluation/deployment (05_evaluation/, 06_inference/, 10_scripts/)
🟥 Series Posts: Part 1 - Using the Published Historical Models | Part 2 - Data Collection & Custom Tokenizer | Part 3 - Training Architecture & GPU Optimization | Part 4 (this post)
🟧 Published Models: SLM Model | Regular Model - Ready-to-use historical language models on Hugging Face
📗 Book Reference: Generative AI in Action - For deeper understanding of core LLM concepts

8. Resources

If you want to reproduce the full pipeline (or adapt it to your own domain), these are the most useful starting points:

Public GitHub repo: github.com/bahree/helloLondon
Evaluation guide: https://github.com/bahree/helloLondon/blob/main/08_documentation/EVALUATION_GUIDE.md
Hugging Face publishing guide: https://github.com/bahree/helloLondon/blob/main/08_documentation/HUGGINGFACE_PUBLISHING.md
Deployment guide: https://github.com/bahree/helloLondon/blob/main/08_documentation/DEPLOYMENT_GUIDE.md
Published models: bahree/london-historical-slm | bahree/london-historical-llm

References

Vaswani et al. (2017) - Attention Is All You Need: https://arxiv.org/abs/1706.03762
Radford et al. (2019) - Language Models are Unsupervised Multitask Learners: https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe
Lin (2004) - ROUGE: A Package for Automatic Evaluation of Summaries: https://aclanthology.org/W04-1013/
Papineni et al. (2002) - BLEU: A Method for Automatic Evaluation of Machine Translation: https://aclanthology.org/P02-1040/
Hendrycks et al. (2021) - Measuring Massive Multitask Language Understanding (MMLU): https://arxiv.org/abs/2009.03300
Zellers et al. (2019) - HellaSwag: Can a Machine Really Finish Your Sentence?: https://arxiv.org/abs/1905.07830
Liu et al. (2023) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment: https://arxiv.org/abs/2303.16634

Acknowledgments

This project builds upon the excellent work of the open-source community. Special thanks to haykgrigo3’s TimeCapsuleLLM for the initial inspiration and framework for historical language model training, and to Andrej Karpathy’s nanoGPT for the foundational GPT architecture and training methodology. The project extends these foundations with specialized adaptations for historical text, including custom tokenizers, advanced data filtering, evaluation frameworks, and educational deployment infrastructure.