5-Part Series:

GitHub Repository: bahree/rust-microkernel - full source code and build scripts

Docker Image: amitbahree/rust-microkernel - prebuilt dev environment with Rust, QEMU, and source code


Recap from Part 3: we added timer interrupts and preemptive multitasking. The ARM Generic Timer fires every 100ms, the GIC routes interrupts to our handler, and a full context switch (31 registers + SP + ELR + SPSR) lets the OS forcibly alternate between two threads that never yield. The kernel is now in charge of scheduling, not the tasks.

Every program you’ve ever written has been lying to you about memory addresses. When your code reads at address 0x1000, it’s not actually reading the physical byte at 0x1000 in RAM. There’s a hardware translator sitting between your code and memory, silently remapping every address. We’re about to build that translator.

This is the final part of the series. Virtual memory is the foundation that makes modern operating systems possible: process isolation, memory protection, shared libraries, and even swap space. And the core mechanism is surprisingly elegant. You build a lookup table, point the hardware at it, flip a bit, and suddenly every memory access in the system goes through your table.

Let’s build it. 😃

TL;DR

We implement virtual memory on AArch64:

  • Frame allocator: A bump allocator that hands out 4KB physical memory pages. Simple by design; a real OS needs an allocator that can also free and reuse frames.
  • Page tables: The 4-level translation structures that map virtual addresses to physical ones.
  • MMU enablement: Configure the hardware translation unit and flip it on.
  • Verification: Write through a virtual address, read it back, and prove the translation worked.
  • TLB: The translation lookaside buffer, a hardware cache of recent VA-to-PA translations, which we invalidate before switching the MMU on.

1. Why virtual memory?

Virtual memory underpins everything a modern operating system does: process isolation, memory protection, shared libraries, and even swap space. The core mechanism is surprisingly simple for what it buys you. You build a lookup table, point the hardware at it, flip a bit, and suddenly every memory access in the system goes through your table. It's an awesome example of how hardware and software can work together to create a powerful abstraction.

1.1 The MMU is hardware, not software

Before we dive in, let’s ensure we’re on the same page. The Memory Management Unit (MMU) is a physical circuit inside the CPU, sitting between the processor core and the memory bus. When the CPU reads or writes any address, the MMU intercepts that access and translates it using a lookup table (the page table) before the request reaches RAM.

This is not a software layer; it’s dedicated hardware that runs on every single memory access, at full speed, with no CPU involvement once configured. All we need to do is build the lookup table and tell the MMU where to find it.

1.2 The problem with physical addressing

Right now, our kernel uses physical addresses directly. Address 0x4000_0000 in our code refers to physical byte 0x4000_0000 in RAM. This works fine for a single kernel, but it falls apart when you want multiple tasks to run simultaneously and independently. Say Task A and Task B both want to use address 0x1000. With physical addressing, they'd both be writing to the same byte in RAM, clobbering each other's data, even though two independent tasks should never be able to touch each other's memory.

This is where virtual memory comes in. Virtual memory gives each task its own view of the address space. Task A thinks it's using address 0x1000, but the MMU translates that to physical address 0x4001_0000. Task B also thinks it's using address 0x1000, but the MMU translates that to 0x4011_0000. Neither task is aware of the other's memory.

Task A sees:                    Physical memory:
0x0000 -> 0x4000_0000 (RAM)      0x4000_0000: Task A code
0x1000 -> 0x4001_0000 (RAM)      0x4010_0000: Task B code
                                  0x5000_0000: Kernel
Task B sees:
0x0000 -> 0x4010_0000 (RAM)      ^ MMU translates
0x1000 -> 0x4011_0000 (RAM)      | every access

  • Isolation: Tasks can't access each other's memory; they have different translation tables.
  • Relocation: Every task sees the same virtual addresses.
  • Protection: You can mark pages read-only, non-executable, or kernel-only.
  • Flexibility: Sparse virtual address spaces waste almost no physical memory.

2. Pages and frames

The MMU doesn’t translate individual bytes. That would require billions of table entries (one per byte of address space). Instead, it groups addresses into fixed-size chunks called pages. We use 4KB pages (4096 bytes = 2^12), which has been the standard since the 1980s.

A page is a 4KB block in the virtual address space. A frame is a 4KB block in physical memory. The page table maps pages to frames. Think of it like a library catalog: a page is a shelf label (where you look), and a frame is the actual physical shelf (where the books are). The catalog maps labels to shelves, and you can rearrange which label points to which shelf without moving any books.

Think of the relationship between a page and a frame as a one-to-one mapping: one page maps to one frame, and they are always the same size, 4KB of virtual space onto 4KB of physical memory. AArch64 also supports larger mappings (2MB and 1GB "blocks") at the middle levels of the page table tree, but there too the virtual and physical regions match in size; a block is just a bigger page.

The bottom 12 bits of an address (the offset within the page) pass through untranslated, since both the virtual page and the physical frame share the same internal layout. For example, if virtual address 0x4000_1ABC maps to physical frame 0x7000_1000, then the offset 0xABC (the low 12 bits) is the same on both sides. The MMU only translates the upper bits, replacing 0x4000_1 with 0x7000_1, and the offset passes straight through.

Why 4KB specifically? It's a trade-off. Smaller pages (e.g., 512 bytes) give finer-grained control over permissions and sharing, but need 8x more table entries to cover the same address range. On the other hand, larger pages (say 64KB) waste space when a program only needs a few hundred bytes - the rest of the page sits allocated but unused (this is called internal fragmentation). 4KB balances these concerns well for general-purpose systems. ARM also supports 16KB and 64KB granules, but Linux typically uses 4KB on ARM, and we'll use the same.

3. The page table walk

Let us walk through how the hardware translates a 48-bit virtual address (VA). First it splits the address into five fields, and each of the first four indexes into a different level of the page table. Think of it like a postal address: country, city, street, house number - each part narrows down the search. The result is a tree, with the top level the most general and the bottom level the most specific.

The hardware starts at the root (L0), uses the first 9 bits to index into L0 and find the L1 table, the next 9 bits to index into L1 and find the L2 table, the next 9 bits to find the L3 table, and the final 9 bits to pick the entry in L3 that holds the physical frame. The remaining 12 bits are the offset within that page.

48-bit Virtual Address (VA). The low 12 bits are the page offset and pass through unchanged during translation:

+------+------+------+------+--------------+
|  L0  |  L1  |  L2  |  L3  | Page Offset  |
|  9b  |  9b  |  9b  |  9b  |    12b       |
+------+------+------+------+--------------+
 47:39  38:30  29:21  20:12      11:0

Each 9-bit index can hold values 0 through 511, so each page table has exactly 512 entries. At 8 bytes per entry, that’s 512 x 8 = 4096 bytes. One page table fits exactly in one 4KB page. Not a coincidence - the hardware designers intentionally made it this way to simplify memory management.

The hardware walks the tree on every memory access, starting from the root (L0) and following the pointers down to the leaf (L3) that contains the physical frame address. If any entry along the way is invalid, the MMU raises a page fault exception, which the OS can handle (e.g., by loading a page from disk or killing the offending process).

flowchart TD
    A["Virtual Address: 0x8000_1234"] --> B["L0 index = bits 47:39 = 0"]
    B --> C["L0 Table entry 0"]
    C --> D["Points to L1 Table"]
    D --> E["L1 index = bits 38:30 = 2"]
    E --> F["L1 Table entry 2"]
    F --> G["Points to L2 Table"]
    G --> H["L2 index = bits 29:21 = 0"]
    H --> I["L2 Table entry 0"]
    I --> J["Points to L3 Table"]
    J --> K["L3 index = bits 20:12 = 1"]
    K --> L["L3 Table entry 1"]
    L --> M["Physical Frame address"]
    M --> N["Add page offset (bits 11:0 = 0x234)"]
    N --> O["Physical Address"]

    style A fill:#9ff,stroke:#333
    style M fill:#ff9,stroke:#333
    style O fill:#9f9,stroke:#333
Figure 1: Virtual address translation

3.1 Working example

Let's trace a concrete translation of VA 0x8000_0000, the test address we'll use later in this post:

  • L0 index: bits 47:39 = 0x8000_0000 >> 39 = 0. First entry in L0.
  • L1 index: bits 38:30 = (0x8000_0000 >> 30) & 0x1FF = 2. Third entry in L1 (covers the 2-3 GB range).
  • L2 index: bits 29:21 = (0x8000_0000 >> 21) & 0x1FF = 0. First entry in L2.
  • L3 index: bits 20:12 = (0x8000_0000 >> 12) & 0x1FF = 0. First entry in L3.
  • Page offset: bits 11:0 = 0. Start of the page.

So the hardware walks: L0[0] -> L1[2] -> L2[0] -> L3[0] -> physical frame; then adds the page offset (0 in this case) to get the final physical address. If any of those entries were invalid, we’d get a page fault instead.

This multi-level structure allows us to efficiently map a huge virtual address space without needing an enormous flat page table. Each level of the tree only exists for the parts of the address space we actually use. If a program only uses a few megabytes of memory, we only need a handful of page table entries, not billions.

4. Why 4 levels?

Imagine a naive single-level page table. With a 48-bit virtual address space (256 TB) and 4KB pages, you'd need 256 TB / 4 KB = 68 billion entries. At 8 bytes each, that's 512 GB per process just for the page table, which is completely impractical. Even if you had that much RAM, the CPU would be overwhelmed trying to search through such a huge table on every memory access. That's why we use a multi-level page table.

Each level of the tree allows us to cover a large portion of the address space with a single entry. The top-level L0 table has 512 entries, each covering 512 GB. The next level, L1, has 512 entries, each covering 1 GB. The next level, L2, has 512 entries, each covering 2 MB. Finally, the L3 level has 512 entries, each covering 4 KB (one page).

The solution is a tree structure where you only allocate page table nodes for memory regions you actually use; a small program using 8 KB of memory needs:

  • 1 L0 table (4 KB)
  • 1 L1 table (4 KB)
  • 1 L2 table (4 KB)
  • 1 L3 table (4 KB)
  • Total: 16 KB of page table overhead

Compare that to 512 GB for the flat table. That’s a factor of 33 million. Wow! 🫨 Memory efficiency is the main reason for the multi-level design. The tree structure also allows for efficient lookups. The hardware walks down the tree, and if it encounters an invalid entry at any level, it can immediately raise a page fault without searching a huge flat table.

graph TD
    L0["L0 Table (root)<br/>512 entries<br/>each covers 512 GB"]
    L1a["L1 Table<br/>512 entries<br/>each covers 1 GB"]
    L2a["L2 Table<br/>512 entries<br/>each covers 2 MB"]
    L3a["L3 Table<br/>512 entries<br/>each covers 4 KB"]
    P1["4 KB Frame"]
    P2["4 KB Frame"]

    L0 --> L1a
    L0 -.-> L1b["(unused)"]
    L1a --> L2a
    L1a -.-> L2b["(unused)"]
    L2a --> L3a
    L2a -.-> L3b["(unused)"]
    L3a --> P1
    L3a --> P2

    style L0 fill:#f99,stroke:#333
    style L1a fill:#ff9,stroke:#333
    style L2a fill:#9f9,stroke:#333
    style L3a fill:#9ff,stroke:#333
    style P1 fill:#99f,stroke:#333
Figure 2: Page table tree structure

Of course, in the real world, programs use scattered memory regions (heap, stack, code, libraries at various addresses), but since the tree is sparse, you only pay for what you use.

5. Frame allocator

Before we can build page tables, we need a way to allocate physical memory. Our frame allocator is the simplest kind: a bump allocator.

A what now? There are many frame-allocation strategies. A bump allocator is the simplest: it just keeps a pointer to the next free frame and advances ("bumps") it on each allocation. This is fast and simple, but it can never free memory.

There is also a free list allocator that maintains a linked list of free frames, allowing for reuse but with more overhead. And finally, there is a bitmap allocator that uses a bitmap to track which frames are free or in use, which can be more space-efficient but also more complex.

For our demo, the bump allocator is sufficient since we only need to allocate a few frames during initialization.

There are two main challenges here: finding free memory to use as frames, and ensuring we don’t overwrite our kernel code or stack. We solve both by starting our allocator at the end of the kernel image (after the stack) and just bumping up from there. The linker script gives us a symbol (__stack_top) that marks the end of the kernel image, so we can safely start allocating frames from that point onward. This way, we avoid overwriting any critical data structures.

In a real OS, you’d want a more robust memory management system that can handle fragmentation and support freeing frames, but this simple approach is enough for our demo.

5.1 The actual code

Let’s look at the code; below is the implementation from mem.rs. The FrameAlloc struct keeps track of the next free frame and the end of available memory. The alloc() method returns the next free frame and advances the pointer, or returns None if we’ve run out of memory.

The code below is a simplified version of a frame allocator, suitable for our demo; in a production OS, you’d want to handle fragmentation, support freeing frames, and possibly implement more complex allocation strategies. The constants at the top define the RAM region provided by QEMU’s virt machine, and the align_up function ensures that our allocations are properly aligned to page boundaries.

const RAM_START: u64 = 0x4000_0000;
const RAM_SIZE: u64 = 256 * 1024 * 1024;
const RAM_END: u64 = RAM_START + RAM_SIZE;
const PAGE_SIZE: u64 = 4096;

extern "C" {
    static __stack_top: u8;
}

#[inline(always)]
fn align_up(x: u64, align: u64) -> u64 {
    (x + align - 1) & !(align - 1)
}

struct FrameAlloc {
    next: u64,
    end: u64,
}

impl FrameAlloc {
    fn new(start: u64, end: u64) -> Self {
        Self { next: start, end }
    }

    fn alloc(&mut self) -> Option<u64> {
        let p = self.next;
        if p + PAGE_SIZE > self.end {
            return None;
        }
        self.next += PAGE_SIZE;
        Some(p)
    }
}
Listing 1: Frame allocator (mem.rs)

QEMU’s virt machine gives us 256 MB of RAM starting at 0x4000_0000. The kernel image and stack live at the beginning of this region. We start allocating frames after the stack ends.

The __stack_top linker symbol

This doesn’t hold a value like a normal variable. Its address marks the end of the kernel image in memory. The linker script defines it. We take its address using &__stack_top as *const u8 as u64 to determine where free memory begins. This is a common pattern in OS development: using linker symbols to mark important memory boundaries (like the end of the kernel, the start of the heap, etc.) without hardcoding addresses. The linker ensures that __stack_top is placed at the correct location in the final binary, so we can rely on it to give us the starting point for our frame allocator.

The align_up bit trick

The expression (x + align - 1) & !(align - 1) rounds x up to the next multiple of align. Here’s how. If align is 4096 (0x1000), then align - 1 is 0xFFF (twelve 1-bits). Adding that ensures we overshoot if not already aligned. The bitwise AND with !(align - 1) = 0xFFFF_FFFF_FFFF_F000 clears the bottom 12 bits, snapping down to the nearest page boundary. This helps ensure that all our frame allocations are properly aligned to page boundaries, which is a requirement for the MMU.

Working example: align_up(0x4083, 0x1000) = (0x4083 + 0xFFF) & 0xFFFFF000 = 0x5082 & 0xFFFFF000 = 0x5000.

6. Page table structures in Rust

OK, now let’s define the page table structures for our page table. Each level of the page table is represented by a PageTable struct, which contains an array of 512 entries. We declare static instances for the L0, L1, and L2 tables, as well as a separate L3 table for our test virtual address. The hardware requires these tables to be 4KB-aligned, so we use #[repr(align(4096))] to ensure that the Rust compiler places them at the correct boundaries in memory.

The code below defines the PageTable struct and the static instances for each level of the page table. Each table has 512 entries, and we initialize them to zero. The new() method is a const fn, allowing us to create these tables at compile time.

#[repr(align(4096))]
struct PageTable {
    entries: [u64; 512],
}

impl PageTable {
    const fn new() -> Self {
        Self { entries: [0; 512] }
    }
}

static mut TT_L0: PageTable = PageTable::new();
static mut TT_L1: PageTable = PageTable::new();
static mut TT_L2_0: PageTable = PageTable::new(); // VA 0..1GB (UART)
static mut TT_L2_1: PageTable = PageTable::new(); // VA 1..2GB (RAM)
static mut TT_L2_2: PageTable = PageTable::new(); // VA 2..3GB (test VA)
static mut TT_L3_TEST: PageTable = PageTable::new();
Listing 2: Page table struct and static tables (mem.rs)

Why #[repr(align(4096))]?

As we touched earlier, the hardware requires page tables to be 4KB-aligned. This means the physical address of each table must be divisible by 4096, which guarantees the low 12 bits are all zeros. This isn’t just a performance optimization - it’s a hard requirement. The hardware uses the lower 12 bits of page table pointers to store attribute flags (valid bit, table/block bit, etc.).

If the table weren’t aligned, its address would have non-zero low bits that would collide with the flag bits, and the MMU would misinterpret flags as address bits or vice versa. Each table is exactly 4KB (512 entries x 8 bytes per entry), conveniently fitting within a single physical frame.

Without #[repr(align(4096))], Rust would use its default alignment for [u64; 512], which is just 8 bytes (the alignment of u64). That’s not nearly enough. The repr(align) attribute tells the Rust compiler and linker to place this struct at a 4096-byte boundary.

Why static mut?

Page tables must live at fixed, known addresses because we write those addresses into hardware registers (TTBR0_EL1) and into other page table entries (each table descriptor contains the physical address of the next-level table). static mut gives us globally accessible, mutable, fixed-address data.

Normally, static mut is dangerous in Rust (data race risk), but it’s fine here because we only modify these tables during single-threaded initialization before the MMU is on. Once the MMU is enabled, we never modify these tables again in our demo. A real OS would need proper synchronization (and TLB invalidation) when modifying page tables at runtime.

Why separate L2 tables?

We have three L2 tables (TT_L2_0, TT_L2_1, TT_L2_2) for the three 1GB regions we map. Each L1 entry covers 1GB, and each L2 table covers the 512 2MB sub-regions within that 1GB range. We could theoretically use one big L2 table, but splitting them makes the code clearer and matches the logical structure: one for UART (0-1GB), one for RAM (1-2GB), one for our test VA (2-3GB).

In a real OS, you’d likely have many more L2 tables as you map more of the address space, but for our learning purposes, three is enough to illustrate the concept.

7. Descriptor format

A descriptor is a 64-bit value that encodes both the physical address of the next-level table (or the mapped frame) and various attribute flags (valid, block/table bit, access permissions, etc.). This is the format defined by the ARM architecture for page table entries. The hardware expects this exact layout, and any deviation will cause translation failures or incorrect behavior.

Below we have the bit layout of a page table entry (descriptor) for AArch64. The upper bits (47:12) contain the physical address of the next-level table or the mapped frame, while the lower bits contain various flags that control how the MMU interprets this entry.

 63           48 47                          12 11        2  1  0
+---------------+------------------------------+-----------+--+--+
| Upper attrs   |  Physical Address [47:12]    | Lower     |T |V |
| (UXN, PXN)    |  (36 bits)                   | attrs     |  |  |
+---------------+------------------------------+-----------+--+--+

Key bits:
  [0]  : Valid (1 = entry is active)
  [1]  : Table/Page (1) vs Block (0)
  [10] : AF (Access Flag, must be 1)
  [9:8]: SH (Shareability)
  [4:2]: AttrIndx (memory type index into MAIR)
  [54] : UXN (User Execute Never)
  [53] : PXN (Privileged Execute Never)
Descriptor format (64 bits)

And below are the descriptor constants from our code; these are the flags we OR into the entries when building our page tables.

const DESC_VALID: u64 = 1 << 0;
const DESC_TABLE: u64 = 1 << 1;   // table/page descriptor
const DESC_BLOCK: u64 = 0 << 1;   // block descriptor

const AF: u64 = 1 << 10;          // Access Flag
const SH_INNER: u64 = 0b11 << 8;  // Inner Shareable
const ATTRIDX0: u64 = 0 << 2;     // Normal memory
const ATTRIDX1: u64 = 1 << 2;     // Device memory

const PXN: u64 = 1 << 53;
const UXN: u64 = 1 << 54;
Listing 3: Descriptor bit constants (mem.rs)
  • The DESC_VALID bit indicates that an entry is valid.
  • The DESC_TABLE bit indicates that this entry points to another table (as opposed to a block or page).
  • The AF bit is the Access Flag, which must be set for the MMU to consider the entry valid.
  • The SH_INNER bits mark the memory as inner shareable, which affects caching and ordering.
  • The ATTRIDX0 and ATTRIDX1 bits select between normal and device memory types defined in the MAIR register.
  • The PXN and UXN bits control execution permissions for privileged and user modes, respectively.

Why you can OR flags into addresses?

Page tables are 4KB-aligned, so the low 12 bits of any table address are guaranteed zero. We can safely OR attribute flags into those bits without corrupting the address. The address lives in bits 47:12, the flags live in bits 11:0 (plus some high bits like UXN/PXN). This is a common hardware trick: alignment guarantees give you “free” bits for metadata.

Table vs Block vs Page descriptors

At L0, L1, and L2, bit [1] = 1 means “this entry points to another table” (table descriptor). At L1 and L2, bit [1] = 0 means “this entry directly maps a large region” (block descriptor, 1GB at L1 or 2MB at L2). At L3, bit [1] = 1 means “this entry maps a 4KB page” (page descriptor). Blocks are useful for mapping large contiguous regions with a single entry, rather than an entire next-level table.

8. Building page tables

Let’s build the actual page tables. We have two goals here: create some identity mappings for the kernel and peripherals, and create one non-identity mapping to prove that translation is working.

Our build_tables() function creates three kinds of mappings. The first two are identity mappings for the UART and RAM, which are necessary for the kernel to continue functioning after the MMU is enabled. The third mapping is a non-identity mapping for a test virtual address (0x8000_0000) that points to an allocated frame. This allows us to verify that the MMU is correctly translating addresses.

Here’s the mapping layout:

Virtual Address             Physical Address      Type     Size
0x0900_0000                 0x0900_0000 (UART)    Device   2MB block
0x4000_0000 - 0x5000_0000   Same (RAM)            Normal   2MB blocks
0x8000_0000                 frame0 (allocated)    Normal   4KB page

As we touched on earlier, the first two are identity mappings (VA = PA). This is critical during boot because the kernel code was loaded at specific physical addresses, and the program counter already contains a physical address.

When the MMU turns on, it translates all addresses, including the instruction the CPU is currently executing. If that address isn’t identity-mapped, the CPU can’t fetch the next instruction and crashes.

The UART is identity-mapped, so we can keep printing debug messages after the MMU is on. The RAM is identity-mapped so the kernel can continue accessing its data structures. The test VA is a non-identity mapping that points to an allocated frame, allowing us to verify that the MMU is correctly translating addresses.

Let's double-click and see what the code looks like for building these tables. The function build_tables() initializes the page tables with the appropriate entries to create the mappings described above. It sets up the L0, L1, and L2 tables to point to each other, and then fills in the L2 and L3 entries for the UART, RAM, and test VA. The function returns the physical address of the L0 table (which we will load into TTBR0_EL1) and the test virtual address for later verification.

Note, the code uses unsafe blocks to modify the static mutable page tables, which is necessary because we’re directly manipulating memory structures that the hardware will read.

fn build_tables(frame0: u64) -> (u64, u64) {
    let test_va: u64 = 0x8000_0000; // 2GB

    unsafe {
        TT_L0.entries = [0; 512];
        TT_L1.entries = [0; 512];
        TT_L2_0.entries = [0; 512];
        TT_L2_1.entries = [0; 512];
        TT_L2_2.entries = [0; 512];
        TT_L3_TEST.entries = [0; 512];

        // L0[0] -> L1 (covers low VA range)
        TT_L0.entries[0] =
            (&raw const TT_L1 as *const _ as u64) | DESC_VALID | DESC_TABLE;

        // L1[0] (0..1GB) -> L2_0
        TT_L1.entries[0] =
            (&raw const TT_L2_0 as *const _ as u64) | DESC_VALID | DESC_TABLE;
        // L1[1] (1..2GB) -> L2_1 (RAM at 0x4000_0000)
        TT_L1.entries[1] =
            (&raw const TT_L2_1 as *const _ as u64) | DESC_VALID | DESC_TABLE;
        // L1[2] (2..3GB) -> L2_2 (test VA)
        TT_L1.entries[2] =
            (&raw const TT_L2_2 as *const _ as u64) | DESC_VALID | DESC_TABLE;

        // Map UART 0x0900_0000 as a 2MB device block (identity).
        let uart_va: u64 = 0x0900_0000;
        let uart_l2 = ((uart_va >> 21) & 0x1FF) as usize;
        TT_L2_0.entries[uart_l2] =
            (uart_va & 0xFFFF_FFFF_FFE0_0000)
                | DESC_VALID | DESC_BLOCK | ATTRIDX1 | AF | PXN | UXN;

        // Map RAM 0x4000_0000..0x5000_0000 as 2MB blocks, normal memory.
        let blocks = RAM_SIZE / (2 * 1024 * 1024);
        for i in 0..blocks {
            let va = RAM_START + i * 2 * 1024 * 1024;
            let pa = va;
            let idx = ((va >> 21) & 0x1FF) as usize;
            TT_L2_1.entries[idx] =
                (pa & 0xFFFF_FFFF_FFE0_0000)
                    | DESC_VALID | DESC_BLOCK | ATTRIDX0 | AF | SH_INNER;
        }

        // Map test_va -> frame0 as a single 4KB page (through L3).
        let test_l2 = ((test_va >> 21) & 0x1FF) as usize;
        TT_L2_2.entries[test_l2] =
            (&raw const TT_L3_TEST as *const _ as u64)
                | DESC_VALID | DESC_TABLE;
        let test_l3 = ((test_va >> 12) & 0x1FF) as usize;
        TT_L3_TEST.entries[test_l3] = (frame0 & 0xFFFF_FFFF_FFFF_F000)
            | DESC_VALID
            | DESC_TABLE
            | ATTRIDX0
            | AF
            | SH_INNER;
    }

    let ttbr0 = &raw const TT_L0 as *const _ as u64;
    (ttbr0, test_va)
}
Listing 4: Building page tables (mem.rs)

Next, let us walk through the three mapping types - identity map for UART, identity map for RAM, and non-identity map for the test VA.

8.1 UART identity map (device memory)

The UART lives at physical address 0x0900_0000. We need it identity-mapped so we can keep printing after the MMU is on. If we didn’t map it, any access to the UART registers would fail once the MMU is enabled, and we’d lose our ability to print debug messages.

The UART is a memory-mapped device, so we mark it as device memory in the page table entry. Device memory has different caching and ordering rules than normal RAM, which is critical for correct operation. We also set the execute-never bits (PXN and UXN) to prevent any code execution from that region, which is a common safety measure for peripheral registers.

To find which L2 entry to write, we extract the L2 index from the address using a right bit shift: 0x0900_0000 >> 21 = 72. The >> operator shifts the binary representation right by 21 positions - equivalent to dividing by 2^21 - which moves the L2 index field (bits 29:21 per the page table layout) down to the lowest bit positions. As a sanity check: each L2 entry covers 2MB, so entry 72 starts at 72 x 2MB = 144MB = 0x0900_0000. That's exactly where our UART lives.

We write this entry with ATTRIDX1 (device memory), PXN and UXN (prevent any code execution from UART registers), and AF (access flag, required by the hardware). This creates a single 2MB block mapping for the UART. We could descend to L3 for finer-grained 4KB page control, but the UART registers are all within this one 2MB region, so a block descriptor is simpler and more efficient.

8.2 RAM identity map (normal memory, 2MB blocks)

The RAM identity map is straightforward. We want the entire RAM region to be accessible at the same addresses as their physical locations, so we can continue using our existing code without modification. In our case, the RAM spans 0x4000_0000 to 0x5000_0000 (256 MB). That’s 128 blocks, each 2 MB. We loop through all 128 and create block descriptors with ATTRIDX0 (normal memory, cacheable) and SH_INNER (inner shareable, for future multi-core support).

Using 2MB blocks instead of 4KB pages means we only need 128 L2 entries, not 65,536 L3 entries. Fewer entries means less memory used for page tables and fewer TLB misses. The downside is that we can’t set different permissions for individual 4KB pages within those 2MB blocks, but for our simple kernel, that’s not a concern.

In a real OS, you might want finer-grained control and use page descriptors at L3 for some regions. The loop calculates the virtual and physical addresses for each block, which are the same because of the identity mapping. The L2 index is calculated as (va >> 21) & 0x1FF, which gives us the correct entry in the L2 table for each 2MB block.

8.3 Test VA mapping (4KB page through L3)

Before we dive in, a quick reminder: VA stands for "virtual address" - an address used by software, which the MMU translates to a physical address (PA). The low 12 bits are the page offset and pass through unchanged; the MMU replaces the upper bits (the virtual page number) with the physical frame number.

Finally, the test VA mapping is the most interesting one. We want to prove that the MMU is actually translating addresses, so we create a non-identity mapping: virtual address 0x8000_0000 maps to a physical frame we allocated earlier (let’s call it frame0). This means that when we access 0x8000_0000 in our code, the MMU will translate it to frame0 in physical memory. If we can read and write to that address successfully, it proves that the MMU is working correctly.

It is also important to note that this is not an identity map. The VA and PA are different. This is the whole point: proving that the MMU is actually translating.

The walk: L0[0] -> L1[2] -> L2_2[0] -> L3_TEST[0] -> frame0.

This demonstrates a full 4-level walk with a non-identity mapping. The L0 entry points to the L1 table, the L1 entry at index 2 points to the L2_2 table, the L2_2 entry at index 0 points to the L3_TEST table, and the L3_TEST entry at index 0 maps to frame0. This shows that the MMU is correctly following the page table pointers and applying the translations as expected.

Why L1 index 2? Because 0x8000_0000 >> 30 = 2. Each L1 entry covers 1GB: index 0 covers 0-1GB, index 1 covers 1-2GB, index 2 covers 2-3GB. Our test address is at the start of the 2-3GB range. We point L1[2] to TT_L2_2, then L2_2[0] points to TT_L3_TEST, and finally L3_TEST[0] maps the 4KB page to frame0.
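With a 4KB granule, each level consumes 9 index bits and the bottom 12 bits are the page offset. A quick sketch of the index extraction (the helper name is illustrative, not the repo’s):

```rust
/// Split a virtual address into (L0, L1, L2, L3) table indices,
/// assuming a 4KB granule: 9 index bits per level, 12 offset bits.
fn indices(va: u64) -> (usize, usize, usize, usize) {
    let l0 = ((va >> 39) & 0x1FF) as usize; // each L0 entry covers 512 GB
    let l1 = ((va >> 30) & 0x1FF) as usize; // each L1 entry covers 1 GB
    let l2 = ((va >> 21) & 0x1FF) as usize; // each L2 entry covers 2 MB
    let l3 = ((va >> 12) & 0x1FF) as usize; // each L3 entry covers 4 KB
    (l0, l1, l2, l3)
}
```

For the test VA 0x8000_0000 this yields (0, 2, 0, 0) - exactly the walk L0[0] -> L1[2] -> L2_2[0] -> L3_TEST[0] described above.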

9. MMU configuration

MMU configuration is critical. When we enable the MMU, the hardware reads a handful of system registers to determine how to interpret our page tables and perform translations. Getting any of them wrong means either an immediate CPU fault or (worse) silently incorrect translations - from bad caching or permission attributes - that can cause data corruption or security issues. Let us walk through each register and explain the fields we set and why.

These registers are: MAIR_EL1 (Memory Attribute Indirection Register), TCR_EL1 (Translation Control Register), and TTBR0_EL1 (Translation Table Base Register 0). Let’s dig into each one in detail.

9.1 MAIR_EL1 (Memory Attribute Indirection Register)

The MAIR defines the memory types that page table entries can reference. Each type specifies caching and ordering rules for memory accesses. This register defines up to 8 memory types (indexed 0 through 7). Each page table entry’s AttrIndx field (bits [4:2]) selects one of these types. Rather than encoding full memory attributes in every page table entry (which would take too many bits), ARM uses this level of indirection: the page table entry says “use type 3” and MAIR defines what “type 3” means.

We define two types:

  • Attr0 = 0xFF: Normal memory, write-back write-allocate. This is the highest-performance caching mode for RAM. Both inner and outer caches are enabled, writes go to the cache first and are flushed to the RAM later. The 0xFF encoding means inner write-back read/write-allocate (bits [7:4] = 0xF) and outer write-back read/write-allocate (bits [3:0] = 0xF).
  • Attr1 = 0x04: Device memory, nGnRE. The acronym stands for non-Gathering, non-Reordering, Early-acknowledgment. Every access goes directly to hardware, in exactly the order your code specifies. No caching, no write combining, no speculative reads. The CPU treats each load and store as a side effect that must be visible to the device immediately. nGnRE is a common choice for memory-mapped peripherals like the UART, ensuring correct behavior without risking stale data or out-of-order accesses.

Why two types?

Because UART registers and RAM need fundamentally different treatment. If you cache UART reads, you’ll see stale data (the UART has new input, but the cache still holds the old value). If you make RAM non-cacheable, performance drops by 10-100x because every load and store goes directly to DRAM instead of hitting the L1 cache. Getting this distinction right is one of the most common stumbling blocks in bare-metal ARM development. The MAIR allows us to define these types once and then reference them in our page tables, keeping our entries compact while still providing the necessary flexibility.
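Packing the two attributes into the register value is just byte placement: Attrn lives at bits [8n+7:8n]. A sketch of the math (constant names are mine, not the repo’s):

```rust
const ATTR0_NORMAL_WBWA: u64 = 0xFF;  // inner & outer write-back read/write-allocate
const ATTR1_DEVICE_NGNRE: u64 = 0x04; // Device-nGnRE

/// MAIR_EL1 value: Attr0 in bits [7:0], Attr1 in bits [15:8],
/// Attr2..Attr7 left as 0 (unused here).
fn mair_value() -> u64 {
    ATTR0_NORMAL_WBWA | (ATTR1_DEVICE_NGNRE << 8)
}
```

A page table entry with AttrIndx = 0 gets normal cacheable memory; AttrIndx = 1 gets device memory. The resulting value, 0x04FF, is exactly what the enable_mmu assembly builds with its mov/lsl/orr sequence.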

9.2 TCR_EL1 (Translation Control Register)

This is the most complex of the three registers. It configures the virtual address width, page granule, caching for the page table walk itself, and which translation tables are active. All of these are necessary for the MMU to function correctly. The TCR tells the hardware how to interpret our page tables and how to perform translations. If we set the virtual address size too small, we won’t be able to use the full address space. If we set the wrong granule size, the hardware will misinterpret our page table entries. If we forget to disable TTBR1, we might accidentally have translations from that register interfering with our intended mappings.

Before the field-by-field breakdown, one quick definition: a granule is the MMU page size for this translation regime. In this post, granule = 4KB, which means each translation unit is 4KB, page-table pages are 4KB, and all descriptor math (offset bits, alignment, index extraction) is based on that size. A useful mental model: if the granule changes, the “grid” the MMU uses to carve up memory changes too.

Here are the fields we set:

  • T0SZ = 16: This determines the virtual address size. The formula is: VA bits = 64 - T0SZ. So 64 - 16 = 48 bits, giving us a 256 TB virtual address space. Larger T0SZ values mean smaller address spaces but fewer levels of page table to walk. We choose 16 to get a 48-bit VA space, which is more than enough for our demo and matches what Linux typically uses on AArch64. Setting this too small (e.g., T0SZ=32 for a 32-bit VA space) would limit us to 4 GB of virtual memory, which isn’t enough for modern OSes. Setting it too large (e.g., T0SZ=0 for a full 64-bit VA space) would require more levels of page tables and more memory overhead.

  • TG0 = 0b00: Selects a 4KB granule for TTBR0 translations. This must match the software side (PAGE_SIZE = 4096) and the table assumptions we’ve used throughout the post. ARM also supports 16KB and 64KB granules, but if TG0 says 16KB while your tables are laid out for 4KB, the MMU parses the descriptors with the wrong geometry (different alignment and bit interpretation), and translations fail immediately.

  • IRGN0/ORGN0 = 0b01: The page table walk hardware itself does memory reads to traverse the tree. These fields control caching for those reads. Write-back write-allocate means the table entries get cached, so repeated walks to the same region are fast. If we set these to non-cacheable, every walk would go to RAM, causing a huge performance hit. If we set them to write-through, we lose the performance benefits of caching. Write-back write-allocate is the best choice for page table walks.

  • SH0 = 0b11: Inner shareable. This ensures that page table updates are visible across cores in a multi-core system. Even on our single-core setup, it’s good practice to set this correctly. If we set it to non-shareable, other cores might not see updates to the page tables, leading to stale translations and hard-to-debug issues in a multi-core OS. Setting it to outer shareable is also an option, but inner shareable is typically recommended for page tables since they are frequently accessed and updated by the CPU.

  • EPD1 = 1: Disable TTBR1 translations. AArch64 supports two translation table base registers: TTBR0 (typically for user space, lower VA range) and TTBR1 (typically for kernel, upper VA range). We only use TTBR0, so we disable TTBR1 to avoid accidental translations. If we forget to disable TTBR1, and it contains a non-zero value, the hardware might use it for addresses in the upper VA range, causing unexpected translations and potential security issues. Disabling it ensures that only TTBR0 is used for all translations, simplifying our setup and reducing the risk of mistakes.

  • IPS = 0b010: 40-bit physical address space, supporting up to 1 TB of physical RAM. QEMU’s virt machine only has 256 MB, but setting this wider does no harm and matches what Linux typically uses. Setting IPS too small (e.g., 32-bit PA space) means any output address above 4 GB would fault, which isn’t enough for modern hardware. Going the other way, IPS shouldn’t exceed the physical address size the CPU actually implements (reported in ID_AA64MMFR0_EL1); the descriptor format itself doesn’t change with IPS - it simply caps the output addresses the MMU will accept.
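Putting the fields together, here is the same TCR_EL1 value the enable_mmu assembly builds, as Rust bit math (a sketch; the field positions are architectural: T0SZ at [5:0], IRGN0 at [9:8], ORGN0 at [11:10], SH0 at [13:12], EPD1 at bit 23, IPS at [34:32]):

```rust
/// Assemble the TCR_EL1 value used in this post.
fn tcr_value() -> u64 {
    let t0sz = 16u64;          // 64 - 16 = 48-bit virtual address space
    let irgn0 = 0b01u64 << 8;  // inner cacheability for walks: WBWA
    let orgn0 = 0b01u64 << 10; // outer cacheability for walks: WBWA
    let sh0 = 0b11u64 << 12;   // inner shareable
    let epd1 = 1u64 << 23;     // disable TTBR1 walks
    let ips = 0b010u64 << 32;  // 40-bit physical address space
    // TG0 = 0b00 (4KB granule) contributes no set bits.
    t0sz | irgn0 | orgn0 | sh0 | epd1 | ips
}
```

Printing the result gives 0x2_0080_3510 - a handy constant to compare against when debugging the assembly version.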

9.3 TTBR0_EL1 (Translation Table Base Register)

This is the simplest of the three registers - it holds the physical address of our L0 page table. The MMU uses it as the starting point for every translation. When the CPU accesses virtual address X, the hardware reads TTBR0 to find the L0 table, then walks from there. Changing TTBR0 changes the entire virtual address space - that’s how operating systems switch between process page tables on a context switch. In our case, we set TTBR0 to the physical address of TT_L0, which is the root of our page table tree. This tells the MMU where to start when translating addresses.

If we set TTBR0 to the wrong address, the MMU will read garbage data as the L0 table, causing all translations to fail. If it points to a valid but incorrect page table, we might get seemingly random translations that are very hard to debug.

Ensuring that TTBR0 points to the correct physical address of our L0 table is critical for the MMU to function correctly. We calculate this address using &raw const TT_L0 as *const _ as u64, which gives us the physical address of the TT_L0 static variable. This is the root of our page table hierarchy, and the MMU will use it to start translating virtual addresses.

10. Enabling the MMU

Enabling the MMU is a delicate dance of setting up the right registers, ensuring all memory writes are visible to the hardware, and then flipping the enable bit. The sequence matters, and so do the barriers. A barrier is a CPU instruction that forces ordering: “finish these operations first, then continue”. In this section we’ll use two of them: dsb (complete/commit prior memory effects) and isb (refresh the instruction pipeline after control-register changes). If you skip a barrier or use the wrong order, the MMU can observe stale state and fail in ways that are very hard to debug - sometimes even the UART stops printing, leaving you blind.

The listing below is the actual enable_mmu assembly from boot.S. It performs the following steps: one, configure the MAIR and TCR registers; two, set TTBR0 to point to our L0 table; three, use barriers to ensure all writes are visible and the TLB is flushed; and four, set the M bit in SCTLR_EL1 to turn on the MMU.

The code can be a bit intimidating at first glance, but each step is necessary for correct MMU operation. The barriers (dsb and isb) ensure that all previous memory operations are complete and visible to the hardware before we enable the MMU. The TLB invalidate ensures that no stale translations are cached when we turn on the MMU. Finally, setting the M bit in SCTLR_EL1 actually enables the MMU; until that point, all addresses remain physical.

.global enable_mmu
enable_mmu:
  // MAIR_EL1: Attr0 = 0xFF (Normal WBWA), Attr1 = 0x04 (Device nGnRE)
  mov x1, #0xFF
  mov x2, #0x04
  lsl x2, x2, #8
  orr x1, x1, x2
  msr mair_el1, x1

  // TCR_EL1: T0SZ=16, 4k granule, SH=inner, IRGN/ORGN=WBWA, EPD1=1, IPS=40-bit
  mov x1, #16                  // T0SZ
  mov x2, #(0b11)              // SH0 = inner shareable
  lsl x2, x2, #12
  orr x1, x1, x2
  mov x2, #(0b01)              // ORGN0 = WBWA
  lsl x2, x2, #10
  orr x1, x1, x2
  mov x2, #(0b01)              // IRGN0 = WBWA
  lsl x2, x2, #8
  orr x1, x1, x2
  mov x2, #(1)                 // EPD1 = disable TTBR1
  lsl x2, x2, #23
  orr x1, x1, x2
  mov x2, #(0b010)             // IPS = 40-bit PA
  lsl x2, x2, #32
  orr x1, x1, x2
  msr tcr_el1, x1

  // TTBR0_EL1 (x0 = L0 table base, passed as argument)
  msr ttbr0_el1, x0

  // Synchronize and invalidate TLB
  dsb sy
  isb
  tlbi vmalle1
  dsb ish
  isb

  // Enable MMU: read-modify-write SCTLR_EL1 to set M=1
  mrs x1, sctlr_el1
  orr x1, x1, #1              // M = 1 (MMU on)
  bic x1, x1, #(1 << 2)       // C = 0 (data cache off)
  bic x1, x1, #(1 << 12)      // I = 0 (instruction cache off)
  msr sctlr_el1, x1
  isb
  ret
Listing 5: enable_mmu assembly (boot.S)

Let us walk through the critical steps and barriers in this code as they are often the source of confusion and bugs.

dsb sy (Data Synchronization Barrier, System)

This waits until all pending memory writes complete. Our page table entries might still be in write buffers, not yet committed to RAM. Without this, the MMU could read a partially-written entry.

This is a common pitfall: you set up your page tables in memory, but the CPU hasn’t actually written them to RAM yet. If you enable the MMU before those writes are visible, the hardware will read garbage data for your page tables, causing all translations to fail.

The dsb sy ensures that all those writes are flushed out and visible to the MMU before we proceed.

isb (Instruction Synchronization Barrier)

This flushes the CPU’s instruction pipeline. After changing a system register, the pipeline still contains instructions fetched under the old settings. isb forces the CPU to re-fetch everything.

This is crucial after enabling the MMU because the very next instructions will be executed with virtual address translation active. If we didn’t have this barrier, the CPU might execute some instructions with the old physical address mappings, which could lead to unpredictable behavior or crashes. The isb ensures that all subsequent instructions are fetched and executed under the new MMU configuration.

tlbi vmalle1 (TLB Invalidate All at EL1)

The TLB caches old VA-to-PA translations. Stale entries could cause the MMU to use wrong translations. This instruction throws them all away. If we forget this step, the TLB might still contain entries from before we set up our new page tables, leading to incorrect translations and very confusing bugs.

For example, if the TLB has an old entry for a virtual address that points to a different physical address than what we set up in our new tables, the MMU will use that stale translation instead of the correct one, causing memory corruption or faults. The tlbi vmalle1 ensures that all TLB entries are invalidated, so the MMU will fetch fresh translations from our newly configured page tables.

The read-modify-write pattern

mrs x1, sctlr_el1 reads the entire system control register. orr x1, x1, #1 sets bit 0 (MMU enable) while preserving everything else. bic clears the cache bits (we leave caches off for simplicity). msr sctlr_el1, x1 writes it back. We don’t just write a fresh value because SCTLR has dozens of other control bits we don’t want to disturb.

If we instead built a fresh value from scratch (say, mov x1, #1 then msr sctlr_el1, x1 - msr can’t take an immediate for SCTLR anyway), we’d accidentally clear all those other bits, which could disable caches, change endianness, or cause other unintended side effects. By using the read-modify-write pattern, we ensure that we only change the bits we intend to (enabling the MMU) while leaving all other settings intact.
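The read-modify-write step is easy to mirror in plain bit math. This sketch mimics what the assembly’s mrs/orr/bic/msr sequence does to the register value (the function is illustrative; the real code stays in assembly):

```rust
/// Apply the same bit changes enable_mmu makes to SCTLR_EL1.
fn update_sctlr(old: u64) -> u64 {
    let mut v = old;
    v |= 1;         // M = 1: MMU on           (orr x1, x1, #1)
    v &= !(1 << 2); // C = 0: data cache off   (bic x1, x1, #(1 << 2))
    v &= !(1 << 12); // I = 0: i-cache off     (bic x1, x1, #(1 << 12))
    v               // every other SCTLR bit is preserved
}
```

The key property is the last line of the comment: bits we didn’t name pass through untouched, which is exactly what a blind write of a fresh value would destroy.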

After this function returns, every memory access goes through the page tables. Even the stack pointer is now a virtual address (which is fine because we identity-mapped RAM). If we set up everything correctly, the MMU will translate addresses according to our page tables, and we can access the UART and RAM through their virtual addresses. If we made a mistake in any of the previous steps (e.g., incorrect MAIR/TCR settings, wrong TTBR0 value, missing barriers), we might end up with a non-functional MMU, which can be very difficult to debug since even our debug prints might not work.

The sequence diagram below summarizes the interactions between our Rust code, the assembly function that enables the MMU, and the MMU hardware itself. It shows the steps taken to configure the MMU and the critical barriers that ensure correct operation.

sequenceDiagram
    participant Rust as Rust Code
    participant ASM as enable_mmu (Assembly)
    participant MMU as MMU Hardware

    Rust->>ASM: Call enable_mmu(ttbr0)
    activate ASM
    ASM->>MMU: Write MAIR_EL1 (memory attributes)
    ASM->>MMU: Write TCR_EL1 (translation control)
    ASM->>MMU: Write TTBR0_EL1 (page table base)
    ASM->>MMU: dsb sy + isb (drain writes)
    ASM->>MMU: tlbi vmalle1 (flush TLB)
    ASM->>MMU: dsb ish + isb (ensure flush complete)
    ASM->>MMU: Set SCTLR_EL1.M=1 (enable!)
    ASM->>MMU: isb (flush pipeline)
    MMU->>MMU: Translation now active
    ASM->>Rust: Return (virtual addresses in use)
    deactivate ASM
    Note over Rust: All memory accesses<br/>now go through page tables
Figure 3: MMU enablement sequence

11. Running the demo

OK, we built the page tables, we configured the MMU registers, and we enabled the MMU. How do we know it worked? The only way to truly prove that the MMU is functioning correctly is to perform a memory access through a virtual address that goes through our page tables and see if we get the expected result. If the MMU isn’t working, we’ll either get a fault (if the hardware detects an invalid access) or incorrect data (if the hardware misinterprets our page tables). In our demo, we write a known value (0xDEAD_BEEF) to the test virtual address (0x8000_0000) that we set up to point to frame0. If we can read back the same value from that virtual address after enabling the MMU, it proves that the MMU is correctly translating addresses through our page tables.

The listings below show how to run the full memory management demo, which includes frame allocation, page table construction, MMU enablement, and virtual address translation. The demo() function in mem.rs orchestrates all these steps and prints out the results.

pub fn demo() {
    UartLogger::puts("mm: demo start\n");

    let kernel_end = unsafe { &__stack_top as *const u8 as u64 };
    let free_start = align_up(kernel_end, PAGE_SIZE);

    put_hex("mm: kernel_end=0x", kernel_end);
    put_hex("mm: free_start=0x", free_start);
    put_hex("mm: ram_end=0x", RAM_END);

    let mut fa = FrameAlloc::new(free_start, RAM_END);

    // Allocate a few frames and write/read patterns.
    let f0 = fa.alloc().expect("no frame");
    let f1 = fa.alloc().expect("no frame");
    put_hex("mm: frame0=0x", f0);
    put_hex("mm: frame1=0x", f1);

    unsafe {
        write_volatile(f0 as *mut u32, 0xAABB_CCDD);
        write_volatile(f1 as *mut u32, 0x1122_3344);
        let r0 = read_volatile(f0 as *const u32) as u64;
        let r1 = read_volatile(f1 as *const u32) as u64;
        put_hex("mm: read0=0x", r0);
        put_hex("mm: read1=0x", r1);
    }

    // Build page tables and enable MMU.
    let (ttbr0, test_va) = build_tables(f0);
    put_hex("mm: ttbr0=0x", ttbr0);
    put_hex("mm: test_va=0x", test_va);

    UartLogger::puts("mm: enabling MMU (caches off)...\n");
    unsafe { enable_mmu(ttbr0) };

    // If we survived, translation is live!
    unsafe {
        let p = test_va as *mut u32;
        write_volatile(p, 0xDEAD_BEEF);
        let r = read_volatile(p) as u64;
        put_hex("mm: test_va_read=0x", r);
    }

    UartLogger::puts("mm: demo done (MMU is ON)\n");
}
Listing 6: Memory management demo (mem.rs)

The code allocates frames, builds the page tables with the mappings we discussed, enables the MMU, and then performs a read/write test on the virtual address to verify that translation is working. The expected output includes the addresses of the kernel end, free memory start, RAM end, allocated frames, page table base, test virtual address, and the value read back from the test virtual address after MMU enablement.

The unsafe blocks are necessary because we’re directly manipulating memory and hardware registers, which is inherently unsafe in Rust. The write_volatile and read_volatile functions are used to ensure that the compiler doesn’t optimize away our memory accesses, which is critical when working with memory-mapped hardware and page tables.

Now, let us build and run the demo with the following commands:

./scripts/build-aarch64-virt.sh demo-memory
./scripts/run-aarch64-virt.sh
Listing 7: Build and run the memory demo

The listing below shows a trace from my run to give you a sense of the expected output; the exact addresses may vary, but the key part is that we read back the correct value from the test virtual address, proving that translation works.

rustOS: aarch64 QEMU virt boot OK
rustOS: memory management demo (frames + page tables)
mm: demo start
mm: kernel_end=0x000000004009A010
mm: free_start=0x000000004009B000
mm: ram_end=0x0000000050000000
mm: frame0=0x000000004009B000
mm: frame1=0x000000004009C000
mm: read0=0x00000000AABBCCDD
mm: read1=0x0000000011223344
mm: ttbr0=0x0000000040085000
mm: test_va=0x0000000080000000
mm: enabling MMU (caches off)...
mm: test_va_read=0x00000000DEADBEEF
mm: demo done (MMU is ON)
Figure 4: Memory management demo showing frame allocation, page table construction, MMU enablement, and virtual address translation.

So did it all work? Or is it all mumbo jumbo? Well, three things prove it worked:

  1. The MMU enabled without crashing. If our page tables had any errors (unmapped kernel code, unmapped stack, misaligned tables), the CPU would have faulted immediately, and we would never have been able to print anything afterward. Getting as far as printing “MMU is ON” means the MMU is fetching instructions and accessing memory without faults - the first and most basic proof that it is working.

  2. The UART still works after MMU enable. That means our device memory identity mapping is correct. If we had forgotten to map the UART or set the wrong attributes, we would have lost our ability to print immediately after enabling the MMU. The fact that we can still print debug messages after MMU enablement is a strong indication that our page tables are correctly set up to allow access to the UART registers, and that the MMU is correctly translating those addresses.

  3. test_va_read=0xDEADBEEF: We wrote 0xDEAD_BEEF to virtual address 0x8000_0000, and read it back successfully. The MMU translated that VA through our L0/L1/L2/L3 tables to the physical frame we allocated. Translation works. This is the ultimate proof that our MMU setup is correct. Woot!

If the MMU wasn’t working, we would either get a fault when trying to access test_va, or we would read back an incorrect value. The fact that we read back the exact value we wrote confirms that the MMU is correctly translating virtual addresses to physical addresses according to our page tables.

12. The TLB (Translation Lookaside Buffer)

You might be wondering - doesn’t walking 4 levels of page tables on every memory access make everything incredibly slow? Each level requires a memory read, so that’s 4 extra memory accesses for every load or store your program does. If a simple ldr x0, [x1] normally takes ~4 cycles, adding 4 table lookups would make it ~20 cycles. That’s a 5x slowdown on everything - obviously unacceptable for a real OS.

Fortunately, the hardware has a solution to this problem - it’s called the Translation Lookaside Buffer (TLB). The TLB is a small, fast cache that stores recent virtual-to-physical address translations. When the MMU needs to translate a VA, it first checks the TLB to see if it already has a cached translation for that address. If it does (a TLB hit), it can get the physical address in about 1 cycle, which is much faster than walking the page tables. If it doesn’t (a TLB miss), then it falls back to walking the full page table tree.

Think of the TLB like a cheat sheet: instead of walking the full page table tree every time, the MMU first checks if it already knows the answer from a recent lookup. The TLB is typically 48-1536 entries (varying by CPU), stored in extremely fast SRAM right next to the MMU logic.

On a TLB hit (which happens ~95-99% of the time for typical workloads), the MMU gets the physical address in about 1 cycle - the same speed as if there were no translation at all. The high hit rate comes from temporal locality (programs tend to access the same addresses repeatedly) and spatial locality (a single 4KB page covers many consecutive accesses). Only on a TLB miss does the hardware walk the full 4-level tree, which might take 20-40 cycles depending on whether the page table entries themselves are in the data cache.
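The payoff is easy to quantify with the classic effective-access-time formula. Using the illustrative numbers from the text (~1 cycle on a hit, ~30 cycles for a miss-triggered walk - these are ballpark figures, not measurements):

```rust
/// Expected translation cost per access, weighted by TLB hit rate.
fn effective_cycles(hit_rate: f64, hit_cost: f64, miss_cost: f64) -> f64 {
    hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost
}
```

At a 99% hit rate, effective_cycles(0.99, 1.0, 30.0) comes out to about 1.29 cycles per translation - which is why the 4-level walk is affordable in practice: you almost never actually pay for it.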

When you switch between processes (changing TTBR0 to a different page table), you need to invalidate the TLB because the old translations belong to a different address space. Without invalidation, Process B might use a stale TLB entry from Process A and access the wrong physical memory - a security and correctness disaster. That’s what our tlbi vmalle1 instruction does in enable_mmu.

ARM supports ASIDs (Address Space IDs) to avoid this cost. Each TLB entry is tagged with a small process ID (8 or 16 bits). When you switch to Process B, the TLB entries tagged with Process A’s ASID are simply ignored rather than flushed. Process B’s entries from a previous run might still be there, avoiding cold-start misses. This is a significant optimization on context-switch-heavy workloads, but it’s a story for another day.
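A toy model makes the ASID idea concrete. This hypothetical structure is nothing like the hardware’s actual SRAM arrays - it just shows the matching rule: a hit requires both the virtual page number and the current ASID to match, so another process’s entries are simply ignored rather than flushed.

```rust
/// One cached translation, tagged with the address space it belongs to.
struct TlbEntry {
    asid: u16, // which address space installed this entry
    vpn: u64,  // virtual page number
    pfn: u64,  // physical frame number
}

struct ToyTlb {
    entries: Vec<TlbEntry>,
}

impl ToyTlb {
    fn new() -> Self {
        ToyTlb { entries: Vec::new() }
    }

    fn insert(&mut self, asid: u16, vpn: u64, pfn: u64) {
        self.entries.push(TlbEntry { asid, vpn, pfn });
    }

    /// A hit requires BOTH the VPN and the current ASID to match;
    /// entries tagged with another ASID are invisible, not flushed.
    fn lookup(&self, current_asid: u16, vpn: u64) -> Option<u64> {
        self.entries
            .iter()
            .find(|e| e.asid == current_asid && e.vpn == vpn)
            .map(|e| e.pfn)
    }
}
```

Switching processes in this model is just changing current_asid: no invalidation needed, and the old process’s entries are still warm if it runs again soon.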

13. Summary

To wrap up, let’s take a step back and look at the big picture. We started with a blank slate and built up a functioning kernel with memory management capabilities. We implemented a frame allocator to manage physical memory, constructed multi-level page tables to define our virtual address space, configured the MMU registers to tell the hardware how to perform translations, and finally enabled the MMU to activate virtual memory.

That is pretty cool - not because the code is sophisticated (I can tell you, it’s not; it’s a teaching OS), but because these are the same fundamental mechanisms used by Linux, Windows, macOS, and every other operating system. The differences are in scale and sophistication, not in kind. A 4-level page table walk on Linux works exactly the same way ours does. The context switch saves the same registers. The GIC uses the same IAR/EOIR protocol.

Just for fun, let’s compare our tiny teaching kernel to real-world operating systems and see how we stack up.

rustOS vs Linux: Linux has ~30 million lines of code, supports 30+ architectures, has 10,000+ drivers, CFS scheduling, demand paging, swap, huge pages, NUMA, SELinux, namespaces, cgroups, and a full TCP/IP stack. We have about ~2K lines and a UART. But our boot sequence, IPC mechanism, context switch, and page table setup are structurally identical to what Linux does. If you read Linux’s arch/arm64/kernel/head.S, you’ll recognize the EL2-to-EL1 drop, the vector table installation, and the MMU enable sequence. You built that.

rustOS vs seL4: seL4 is a formally verified microkernel used in aerospace and medical devices. About ~10K lines of kernel code, backed by 200K lines of mathematical proofs showing the code behaves correctly. Its IPC takes ~100 cycles (whilst we don’t measure ours, it will be much slower). It has capability-based security, hard real-time guarantees, and true user/kernel separation. Our IPC design (endpoint-based mailboxes) is actually inspired by seL4’s endpoint model, just without the verification or performance.

rustOS vs xv6: MIT’s xv6 is the closest comparison. It’s a teaching OS in about ~10K lines of C, with a monolithic Unix-like design. It has a shell, a filesystem, fork()/exec(), and pipe-based IPC. Where xv6 goes deeper into Unix APIs, we go deeper into bare-metal ARM specifics.

14. Where to go next

If you want to keep going, here’s a rough roadmap, loosely ordered by difficulty and dependency.

Heap allocator. Right now everything is on the stack. Implementing GlobalAlloc (start with a linked-list allocator) unlocks Box, Vec, String, and the whole alloc crate. This enables almost everything else. Phil Oppermann’s heap allocation post is an excellent guide.

Process abstraction. Replace our static two-thread array with a proper process table: PIDs, state machine (ready, running, blocked), dynamic creation and destruction. See OSTEP Chapter 4.

User mode. Drop from EL1 to EL0 to run user code. This requires per-process page tables (TTBR0 swap on context switch), separate user/kernel stacks, and exception handling for the EL0-to-EL1 transition. High difficulty, high reward.

System calls. The user-kernel API. On ARM, the svc instruction traps from EL0 to EL1. You need a syscall dispatch table, argument passing conventions, and at minimum exit(), write(), and yield().

I’ll stop there, but the sky’s the limit. You can implement filesystems, drivers, networking, SMP, graphics, and more. The only real limit is your time and interest.

15. Resources for going deeper

I did not set out to write a 5-part series on OS development, but quite a few folks reached out, so I wanted to provide a roadmap for those who want to go deeper. I am no expert; the resources below are from people who are, and they are excellent for learning more about operating systems, ARM architecture, and low-level programming. This is by no means an exhaustive list, but it should give you a solid starting point for further exploration.

15.1 Books

  • Operating Systems: Three Easy Pieces (Arpaci-Dusseau). Free online at ostep.org. I think this is one of the best introductions to OS concepts! 💖
  • Computer Systems: A Programmer’s Perspective (Bryant & O’Hallaron). Essential CS fundamentals.
  • Linux Kernel Development (Robert Love). Practical Linux internals.

15.2 Papers

15.3 Projects

15.4 Communities

16. Final thoughts

Building an OS is hard. You’ve dealt with assembly boot code, interrupt timing constraints, context switching where a single wrong byte offset corrupts everything, and page tables where one misplaced bit means instant crash.

But you’ve also seen that it’s possible. The mechanisms aren’t magic. A timer fires, you save some registers, you load others, you jump. An address goes through a tree lookup. That’s it. The complexity in real operating systems comes from scale (thousands of devices, millions of users, decades of edge cases), not from fundamentally different ideas.

Even if you never write another line of kernel code, you now know why malloc can fail, why programs crash with “segmentation fault,” why fork() is fast (copy-on-write page tables), and why your laptop doesn’t freeze when one tab hangs (preemptive scheduling). You see through the abstractions.

Thanks for following along. Hopefully you picked up as much reading this as I did building it. And remember, next time you get a segfault, you know exactly what’s going on under the hood. Happy hacking! 😍
