5-Part Series:

GitHub Repository: bahree/rust-microkernel — full source code and build scripts

Docker Image: amitbahree/rust-microkernel — prebuilt dev environment with Rust, QEMU, and source code


You’re about to write code that runs with nothing underneath it.

No operating system. No standard library. No println!. No heap. No file descriptors. No threads. Just you, a CPU, and some RAM.

That probably sounds terrifying. Or maybe thrilling. Honestly, it’s both. The first time you get a bare-metal kernel to print a single character to a serial port, you’ll stare at your terminal for a solid minute. One letter. It takes days of work. And it feels like you’ve built a cathedral.

This post walks through everything that happens between “the CPU wakes up” and “a message appears on screen.” We’ll go instruction by instruction through the assembly boot code, byte by byte through the UART driver, and concept by concept through the design decisions that make it all work. By the end, you’ll have a bootable AArch64 kernel running in QEMU that prints to a virtual serial port, built on a clean, platform-agnostic kernel architecture.

Let’s get to it.

TL;DR

In this part, we boot a minimal Rust kernel on the AArch64 QEMU virt machine. We write ARM assembly to set up a stack, zero the BSS, drop from EL2 (hypervisor level) to EL1 (kernel level), enable floating-point registers, and jump into Rust. The Rust side implements a PL011 UART driver using memory-mapped I/O, wraps it behind a Logger trait for platform abstraction, and calls into a shared kernel entry point. The whole thing compiles to a single ELF binary that QEMU loads directly.

1. What “bare metal” actually means

When you write a normal Rust program (or any program in a higher-level language, for that matter), there’s a tower of software beneath you. Think about what happens when you call println!("hello"). First, Rust’s standard library formats the string and then calls the OS to write it to stdout. Next, the OS looks up your process’s file descriptor table, finds that stdout points to a terminal, copies your bytes into a kernel buffer, and eventually a terminal emulator reads those bytes and draws glyphs on screen. And making all of this possible, sitting below everything, is the firmware, which already tested the RAM, configured the CPU caches, initialized the PCI bus, and handed control to the bootloader.

Guess what? We’re removing all of that. 🤖

Our code is the first thing that runs after the CPU wakes up. There’s no runtime to set up the stack. No OS to provide virtual memory. No firmware to initialize the UART (well, QEMU helps us a little here, but we will ignore that for now). We are the bottom of the stack. Everything above us, we have to build.

In Rust terms, this means two attributes at the top of every bare-metal crate:

#![no_std]
#![no_main]
Bare-metal crate attributes

#![no_std] tells the compiler: don’t link the standard library. The standard library (std) gives you Vec, String, Box, println!, file I/O, threads, and networking. All of those features need an underlying operating system. Vec needs a heap allocator, which needs mmap or brk from the OS. println! needs stdout, which needs file descriptors, which need a kernel. Since we ARE the kernel, there’s nothing beneath us to provide those services.

With no_std, we only get core: basic types, traits, iterators, Option, Result, math operations. Things that need zero OS support.

#![no_main] is subtler. Normally, when you compile a Rust binary, the compiler generates a hidden entry point that sets up the stack, initializes the heap allocator, configures panic handling, spawns the main thread, and then calls your main() function. None of that machinery exists on bare metal. We handle all initialization ourselves, in assembly, and we tell the compiler not to look for a standard main entry point.

2. What’s an ELF?

Before we dive into boot code, a quick concept: when you compile a Rust program (or C, or anything that targets a Unix-like system), the output isn’t just a flat blob of machine code. It’s a structured file called an ELF (Executable and Linkable Format).

An ELF file is like a shipping manifest. It says: “here’s a chunk of executable code, load it at this address. Here’s read-only data, put it over here. Here’s uninitialized data (BSS), allocate this much space and zero it. Oh, and the program starts executing at this address.”

When we build our kernel, cargo and the linker produce an ELF binary. QEMU knows how to read ELF files, so it parses the headers, loads each section to the right memory address, and sets the program counter to the entry point (_start). On real hardware without an ELF-aware loader, you’d strip the ELF headers and produce a raw binary. But QEMU makes our lives easier.

The important sections in our ELF:

  • .text.boot: The very first code that runs (our assembly entry point)
  • .text: The rest of our compiled Rust code
  • .rodata: Read-only data (string constants, lookup tables)
  • .data: Initialized mutable data (statics with non-zero initial values)
  • .bss: Uninitialized data (statics that start at zero, just a size, no actual bytes in the file)

3. The QEMU virt machine

We’re targeting qemu-system-aarch64 -machine virt, which is QEMU’s generic ARM virtual machine. It doesn’t model any specific real-world board. Instead, it gives us a clean, well-documented set of virtual hardware:

Component             Details
CPU                   Cortex-A53 (emulated), 64-bit ARMv8-A
RAM                   256 MB starting at 0x4000_0000
UART                  PL011 at 0x0900_0000
Interrupt controller  GICv2 (Generic Interrupt Controller)
Timer                 ARM Generic Timer (CNTP)
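These addresses come up again and again in the code. As a sketch, they could be recorded as constants along with a helper to check which window an address falls in (the constant and function names here are ours for illustration, not from the repository):

```rust
// Hypothetical constants mirroring the QEMU virt memory map above.
const RAM_BASE: usize = 0x4000_0000;       // 256 MB of RAM starts here
const RAM_SIZE: usize = 256 * 1024 * 1024;
const UART0_BASE: usize = 0x0900_0000;     // PL011 UART (MMIO, not RAM)

/// True if `addr` falls inside the RAM window.
fn is_ram(addr: usize) -> bool {
    addr >= RAM_BASE && addr < RAM_BASE + RAM_SIZE
}

fn main() {
    assert!(is_ram(0x4008_0000));   // where QEMU loads our kernel image
    assert!(!is_ram(UART0_BASE));   // the UART is a device, not memory
    println!("memory map checks passed");
}
```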

Here’s the QEMU command we use to run our kernel:

qemu-system-aarch64 \
  -machine virt,gic-version=2 \
  -cpu cortex-a53 \
  -m 256M \
  -nographic \
  -serial mon:stdio \
  -kernel dist/virt/os-aarch64-virt.elf
QEMU launch command for AArch64 virt

Let’s break down what each of these means:

  • -machine virt,gic-version=2: Use the generic ARM virtual machine with a GICv2 interrupt controller. GICv2 is simpler than GICv3 and sufficient for single-core work.
  • -cpu cortex-a53: Emulate a Cortex-A53 core. This is the same CPU in the Raspberry Pi Zero 2 W.
  • -m 256M: Give the machine 256 MB of RAM.
  • -nographic: Don’t open a GUI window. We’re using serial output, not a display.
  • -serial mon:stdio: Connect the virtual serial port to our terminal’s stdin/stdout. This is how we see output.
  • -kernel dist/virt/os-aarch64-virt.elf: Load our ELF binary directly. QEMU parses the ELF, loads sections to the right addresses, and jumps to _start.

The beauty of virt is its simplicity. No GPU firmware to deal with, no board-specific quirks, no SD card flashing. Build, run, see output. The iteration cycle is seconds.

4. Setup

You need Rust nightly (for bare-metal features like no_std binaries) and the AArch64 bare-metal target. Our repository includes a rust-toolchain.toml that handles most of this:

[toolchain]
channel = "nightly"
components = ["llvm-tools-preview", "rust-src"]
targets = ["x86_64-unknown-none", "aarch64-unknown-none"]
rust-toolchain.toml

If you’re setting up from scratch:

# Install Rust nightly with AArch64 target
rustup default nightly
rustup target add aarch64-unknown-none
rustup component add llvm-tools-preview rust-src

# Install QEMU (Linux)
sudo apt install qemu-system-aarch64

# Install QEMU (macOS)
brew install qemu

# Clone the repository
git clone https://github.com/bahree/rust-microkernel.git
cd rust-microkernel
Install Rust toolchain and QEMU

5. Build and run

./scripts/build-aarch64-virt.sh demo-ipc
./scripts/run-aarch64-virt.sh
Build and run the AArch64 virt kernel

Expected output:

rustOS: aarch64 QEMU virt boot OK
rustOS: IPC + cooperative scheduling demo
rustOS: kernel online
rustOS: microkernel step 1 (IPC + cooperative scheduling)
sched: starting
task/ping: poll
task/ping: sent ping
task/pong: got ping
task/ping: got pong
Expected serial output
Figure 1: Boot output showing the kernel printing to serial, proving our boot assembly, UART driver, and platform abstraction all work.

If you see that, everything’s working. Press Ctrl-A then X to exit QEMU.

If you don’t pass a feature flag, the default build uses demo-memory (which we’ll cover in Part 4). The demo-ipc flag gives us the cooperative scheduling output shown above.

6. The boot sequence

Here’s what happens from the moment QEMU starts to the moment you see “boot OK” on your terminal:

flowchart TD
    A[QEMU loads ELF to 0x40080000] --> B[CPU starts at _start, running at EL2]
    B --> C[Set stack pointer]
    C --> D[Zero BSS section]
    D --> E{Current exception level?}
    E -->|EL2| F[Set EL1 stack pointer]
    F --> G[Configure SPSR_EL2 for EL1h return]
    G --> H[Set ELR_EL2 to el1_start]
    H --> I[Disable FP/ASIMD trapping at EL2]
    I --> J[eret: drop to EL1]
    J --> K[el1_start]
    E -->|EL1| K
    K --> L[Install exception vector table]
    L --> M[Enable FP/ASIMD at EL1]
    M --> N["bl rust_main"]
    N --> O[Initialize PL011 UART]
    O --> P["Print: boot OK"]
    P --> Q[Call kernel::kmain]

    style B fill:#f9f,stroke:#333
    style J fill:#ff9,stroke:#333
    style Q fill:#9f9,stroke:#333
AArch64 virt boot sequence

That’s a lot of steps. Let’s walk through them. First, though, you need to be able to read the assembly.

7. ARM assembly primer

If you’ve never read assembly before, don’t worry: you only need a handful of concepts to follow the boot code. And once you get the hang of it, assembly is surprisingly readable. It’s just very, very explicit about everything it does.

7.1 Registers

ARM gives you 31 general-purpose 64-bit registers, named x0 through x30. Think of them as 31 local variables that live inside the CPU itself, way faster than RAM. A register access takes maybe one clock cycle. A RAM access? Hundreds.

Some registers have conventional roles:

  • x0 through x7: Function arguments and return values
  • x29: Frame pointer (like rbp on x86)
  • x30: Link register, holds the return address after a bl (branch-with-link) call
  • sp: Stack pointer, tracks the top of the call stack
  • xzr: The zero register. Always reads as zero, discards writes. Surprisingly useful.

You’ll also see w0 through w30. These are the lower 32 bits of the corresponding x register. w0 is the bottom half of x0. ARM uses these when working with 32-bit values.

7.2 System registers

Beyond the general-purpose registers, ARM has a separate world of system registers that control CPU behavior. You can’t use them in normal instructions like add or sub. You access them with special instructions:

  • mrs x0, CurrentEL: Move from system register to general-purpose register. Read the current exception level into x0.
  • msr spsr_el2, x0: Move from general-purpose register to system register. Write x0 into the saved program status register for EL2.

These are the knobs and dials that configure how the CPU works - what exception level we’re at, whether floating-point is enabled, where the exception vector table lives, and what happens on an eret.

7.3 Common instructions

Here’s what you’ll see in the boot code:

Instruction               What it does
ldr x0, =label            Load the address of label into x0
mov sp, x0                Copy x0 into the stack pointer
str xzr, [x1], #8         Store zero to the address in x1, then add 8 to x1
cmp x1, x2                Compare x1 and x2 (sets condition flags)
b.ge 2f                   Branch forward to label 2: if the comparison was greater-than-or-equal
b.ne label                Branch to label if not equal
bl rust_main              Branch with link: save return address in x30, jump to rust_main
bic x0, x0, #(1 << 10)    Bit clear: clear bit 10 in x0
orr x0, x0, #(3 << 20)    Bitwise OR: set bits 20 and 21 in x0
lsr x1, x1, #2            Logical shift right by 2 bits
and x1, x1, #3            Bitwise AND with 3 (keep only bits 0 and 1)
adr x0, label             Load the PC-relative address of label into x0
isb                       Instruction synchronization barrier (flush pipeline)
eret                      Exception return (drop to lower exception level)
wfe                       Wait for event (low-power sleep)

7.4 Local labels and directives

Assembly uses numbered local labels like 1:, 2:, 3:. You reference them with a direction suffix: 1b means “search backward for label 1:” and 2f means “search forward for label 2:.” This avoids name collisions when the same pattern (like a loop) appears multiple times in the same file.

.section .text.boot tells the assembler: “put the following code into a section named .text.boot.” The linker script (which we’ll cover later) uses this name to ensure our boot code lands at the very start of the binary, exactly where the CPU expects it.

.global _start makes the _start symbol visible to the linker so that it can be used as the entry point.

8. Walking through boot.S

This is the heart of the post. Let’s go through the actual boot assembly line by line.

Here’s the file: crates/arch_aarch64_virt/src/boot.S. We’re showing the boot-sequence portion (the first 60 lines); you can see the full code in the repository.

.section .text.boot
.global _start
_start:
  // Set stack
  ldr x0, =__stack_top
  mov sp, x0

  // Zero BSS: [__bss_start, __bss_end)
  ldr x1, =__bss_start
  ldr x2, =__bss_end
1:
  cmp x1, x2
  b.ge 2f
  str xzr, [x1], #8
  b 1b
2:
  // If we entered at EL2 (typical for QEMU virt), drop to EL1 so the kernel runs
  // in a simpler environment (EL1 + GICv2 + CNTP timer).
  mrs x1, CurrentEL
  lsr x1, x1, #2
  and x1, x1, #3
  cmp x1, #2
  b.ne el1_start

  // Set up an EL1 stack pointer.
  ldr x0, =__stack_top
  msr sp_el1, x0

  // Configure EL2 to return to EL1h.
  mov x0, #(0b0101)         // EL1h
  msr spsr_el2, x0
  adr x0, el1_start
  msr elr_el2, x0

  // Enable FP/ASIMD access at EL1 and ensure EL2 doesn't trap it.
  mrs x0, cptr_el2
  bic x0, x0, #(1 << 10)    // TFP = 0 (don't trap FP/ASIMD)
  msr cptr_el2, x0
  isb
  eret

el1_start:
  // Install a minimal exception vector table for the exception level the
  // kernel now runs at. By this point we've dropped to EL1, so VBAR_EL1 applies.
  adr x0, vectors
  msr vbar_el1, x0
  isb

  // Enable FP/ASIMD for Rust/LLVM.
  // Rust/LLVM may use NEON registers for struct copies/memcpy even in early bring-up.
  // If FP is disabled, this traps with EC=0x07 (FP/ASIMD access trap).
  mrs x0, cpacr_el1
  orr x0, x0, #(3 << 20)   // FPEN = 0b11
  msr cpacr_el1, x0
  isb

  bl rust_main
3:
  wfe
  b 3b
boot.S: the complete boot sequence (crates/arch_aarch64_virt/src/boot.S, lines 1-60)

Now let’s walk through each piece to understand what it does.

8.1 Stack setup (lines 5-6)

  ldr x0, =__stack_top
  mov sp, x0
Setting up the stack pointer

The very first thing we do is set up a stack. But why? Because Rust literally cannot run without one. On AArch64, bl saves the return address in x30, but any non-leaf function’s prologue spills x30 (along with local variables) to the stack. Without a valid stack pointer, the first such prologue would try to store to… nowhere. Instant crash.

__stack_top is a symbol defined by our linker script. It points to the top of a 64 KB stack memory region. ARM stacks grow downward (from high addresses to low), so we point sp at the top.

Why 64 KB? It’s a reasonable starting size. We don’t have many nested function calls yet, and we don’t have a heap, so the stack doesn’t need to be huge. If we run out, we’ll get a data abort (the ARM equivalent of a segfault). We’d increase it then.

8.2 BSS zeroing (lines 8-16)

  ldr x1, =__bss_start
  ldr x2, =__bss_end
1:
  cmp x1, x2
  b.ge 2f
  str xzr, [x1], #8
  b 1b
2:
Zeroing the BSS section

BSS stands for “Block Started by Symbol” (a historical name from the 1950s). It’s the section where uninitialized global and static variables live. In Rust, if you write static mut COUNTER: u32 = 0;, the compiler puts it in .bss. The key insight: the ELF file doesn’t actually store the zeros. It just records the size. That saves space in the binary. But it means the memory might contain garbage when we start.

C guarantees that uninitialized statics start at zero, and Rust places zero-initialized statics in .bss with the same expectation. Before any Rust code runs, we have to zero the entire BSS region manually. That’s what this loop does:

  1. Load the start and end addresses of .bss (defined by the linker script) into x1 and x2.
  2. Compare them. If x1 >= x2, the BSS is empty (or we’re done), skip to label 2:
  3. Store zero (xzr, the zero register) to the address in x1, then increment x1 by 8 (one 64-bit word)
  4. Jump back to the comparison

The [x1], #8 syntax is a post-increment addressing mode. It means “use the address in x1 for the store, THEN add 8 to x1.” It’s a single instruction that does two things. ARM loves these compound operations.
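To make the loop concrete, here’s the same zeroing logic as ordinary (hosted) Rust over an owned buffer instead of linker symbols; zero_words is our name for illustration, not a function from the repository:

```rust
// Sketch of the boot.S BSS loop: zero [start, end) one 64-bit word at a time.
fn zero_words(start: *mut u64, end: *mut u64) {
    let mut p = start;
    while p < end {
        unsafe {
            // str xzr, [x1], #8: store zero, then advance one word
            p.write_volatile(0);
            p = p.add(1);
        }
    }
}

fn main() {
    let mut fake_bss = vec![0xDEAD_BEEFu64; 32]; // stand-in for garbage RAM
    let range = fake_bss.as_mut_ptr_range();
    zero_words(range.start, range.end);
    assert!(fake_bss.iter().all(|&w| w == 0));
    println!("fake BSS zeroed");
}
```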

8.3 Checking the exception level (lines 19-23)

  mrs x1, CurrentEL
  lsr x1, x1, #2
  and x1, x1, #3
  cmp x1, #2
  b.ne el1_start
Detecting the current exception level

Now things get interesting. We need to know what exception level the CPU is running at. QEMU’s virt machine starts us at EL2 (hypervisor level), but we want to run our kernel at EL1 (the normal kernel level). If we’re already at EL1 for some reason, we skip the drop.

mrs x1, CurrentEL reads the CurrentEL system register. The exception level is encoded in bits [3:2] of this register (ARM’s register designs can be a little eccentric). So we shift right by 2 (lsr x1, x1, #2) and mask off everything except the bottom 2 bits (and x1, x1, #3). Now x1 holds 0, 1, 2, or 3 corresponding to EL0 through EL3.

If it’s 2, we proceed with the EL2-to-EL1 drop. If it’s not, we jump straight to el1_start.
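The lsr/and pair is just a two-bit field extraction. As a sketch in hosted Rust (exception_level is our name, not from the repository):

```rust
// The same decode as the lsr + and instructions:
// CurrentEL encodes the exception level in bits [3:2].
fn exception_level(currentel: u64) -> u64 {
    (currentel >> 2) & 3
}

fn main() {
    assert_eq!(exception_level(0b1000), 2); // EL2, where QEMU virt starts us
    assert_eq!(exception_level(0b0100), 1); // EL1, where the kernel runs
    println!("CurrentEL decode ok");
}
```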

8.4 Exception levels explained

Before we get into the drop, let’s talk about why exception levels exist.

ARM’s AArch64 architecture has four privilege levels:

Level  Name            Who runs here             What they can do
EL0    Application     User programs             Normal instructions only. Can’t touch hardware.
EL1    Kernel          Operating systems         Configure MMU, handle interrupts, access all memory
EL2    Hypervisor      Virtual machine monitors  Virtualize EL1 guests, trap privileged operations
EL3    Secure Monitor  TrustZone firmware        Switch between secure and non-secure worlds

Think about it this way: EL0 is a sandbox. Applications can compute, but they can’t mess with memory mappings, disable interrupts, or talk to hardware directly. If they try, the CPU traps to EL1, and the kernel decides what to do (usually: kill the process with a signal).

EL1 is where kernels live. Linux, macOS, Windows, and our rustOS all run here. You can configure page tables, handle interrupts, and access device memory.

EL2 is for hypervisors. If you want to run multiple operating systems on the same hardware (like AWS does with EC2 instances), the hypervisor at EL2 virtualizes the hardware so each guest OS at EL1 thinks it has the machine to itself.

EL3 is for firmware-level security (ARM TrustZone). We don’t touch it.

QEMU starts us at EL2 because it’s the most flexible starting point. If you were writing a hypervisor, you’d stay there. Since we’re writing a kernel, we drop to EL1.

8.5 The EL2 to EL1 drop (lines 26-40)

  // Set up an EL1 stack pointer.
  ldr x0, =__stack_top
  msr sp_el1, x0

  // Configure EL2 to return to EL1h.
  mov x0, #(0b0101)         // EL1h
  msr spsr_el2, x0
  adr x0, el1_start
  msr elr_el2, x0

  // Enable FP/ASIMD access at EL1 and ensure EL2 doesn't trap it.
  mrs x0, cptr_el2
  bic x0, x0, #(1 << 10)    // TFP = 0 (don't trap FP/ASIMD)
  msr cptr_el2, x0
  isb
  eret
Dropping from EL2 to EL1

This is the most intricate part of the boot sequence. Here’s the trick: you can’t just “jump” to a lower exception level. ARM doesn’t have a “go to EL1” instruction. Instead, you use the exception return mechanism backwards.

Normally, when an exception occurs (such as an interrupt), the CPU saves the current state and jumps to a higher exception level. The handler processes the exception, then executes eret (exception return) to return. We’re abusing this: we set up the return state registers to point at EL1, then execute eret as if we were “returning” from an exception that never happened.

Here’s each step:

msr sp_el1, x0 sets the stack pointer that EL1 will use. Each exception level has its own stack pointer. We’re at EL2 right now, so sp refers to sp_el2. We need to pre-configure sp_el1 before we get there.

mov x0, #(0b0101) / msr spsr_el2, x0 configures the Saved Program Status Register. When eret executes, the CPU restores the processor state from spsr_el2. The value 0b0101 (decimal 5) means: return to EL1 using the handler stack pointer variant (called “EL1h”). Bits [3:2] = 01 select EL1. Bit [0] = 1 selects the h variant, meaning EL1 will use sp_el1 as its stack pointer (rather than sp_el0).

adr x0, el1_start / msr elr_el2, x0 sets the Exception Link Register. This is the address the CPU will jump to on eret. We point it at el1_start, the label where our EL1 code begins. adr computes a PC-relative address, which is important because our code might not be running at the address the linker assumed (though, in practice, it is for us).

bic x0, x0, #(1 << 10) / msr cptr_el2, x0 disables the floating-point trap at EL2. By default, EL2 can trap FP/ASIMD (NEON) instructions executed at EL1. If we don’t clear this bit, the first time Rust tries to use a NEON register (which happens sooner than you’d think), the CPU will trap to EL2 and crash. bic stands for “bit clear.” It clears bit 10 (the TFP flag) in the cptr_el2 register.

isb is an Instruction Synchronization Barrier. It forces the CPU to finish processing all pending changes to system registers before executing the next instruction. Without it, the pipeline might still be using stale configuration values.

eret does the actual drop. It atomically loads the program counter from elr_el2 and the processor state from spsr_el2. In one instruction, we go from EL2 to EL1 and start executing at el1_start.

It’s a beautiful hack, honestly. The hardware designers intended eret for returning from exception handlers. We’re using it as a one-way door to a lower privilege level.
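To see why 0b0101 is the right SPSR value, here’s the field breakdown as a hosted-Rust sketch (the helper names are ours for illustration):

```rust
// The SPSR_EL2 value written before eret, decomposed field by field.
const SPSR_EL1H: u64 = 0b0101;

fn target_el(spsr: u64) -> u64 {
    (spsr >> 2) & 3 // M[3:2]: the exception level eret returns to
}

fn uses_sp_elx(spsr: u64) -> bool {
    spsr & 1 == 1 // M[0]: 1 = "h" variant (use that level's own stack pointer)
}

fn main() {
    assert_eq!(target_el(SPSR_EL1H), 1); // eret lands at EL1...
    assert!(uses_sp_elx(SPSR_EL1H));     // ...running on sp_el1, not sp_el0
    println!("SPSR_EL1h = {:#06b}", SPSR_EL1H);
}
```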

8.6 Vector table installation (lines 43-47)

  adr x0, vectors
  msr vbar_el1, x0
  isb
Installing the exception vector table

The CPU needs to know where to jump when an exception occurs (an interrupt, a system call, a data abort, etc.). ARM uses a vector table: a fixed-layout block of code that contains an entry for each exception type. There are 16 entries, every 128 bytes (0x80) apart, for a total of 2048 bytes (0x800).

We load the address of our vectors table and write it to vbar_el1 (Vector Base Address Register for EL1). When an exception happens at EL1, the CPU adds an offset to this base address and starts executing there. We’ll cover the vector table in detail in Part 3 when we implement interrupt handling.
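The fixed layout is simple arithmetic: entry i lives at base + i * 0x80. A quick sketch (vector_offset is our illustrative name):

```rust
// Vector table layout: 16 entries, 0x80 bytes apart, 0x800 bytes total.
const ENTRY_SIZE: usize = 0x80;
const NUM_ENTRIES: usize = 16;

fn vector_offset(index: usize) -> usize {
    assert!(index < NUM_ENTRIES);
    index * ENTRY_SIZE
}

fn main() {
    assert_eq!(vector_offset(0), 0x000);          // first entry
    assert_eq!(vector_offset(15), 0x780);         // last entry
    assert_eq!(NUM_ENTRIES * ENTRY_SIZE, 0x800);  // 2048 bytes total
}
```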

8.7 FP/ASIMD enablement: why Rust needs NEON (lines 49-55)

  mrs x0, cpacr_el1
  orr x0, x0, #(3 << 20)   // FPEN = 0b11
  msr cpacr_el1, x0
  isb
Enabling floating-point and SIMD at EL1

You might wonder: we’re writing a kernel, not a graphics engine. Why do we need floating-point?

Here’s the thing. LLVM (Rust’s code generator) uses NEON registers for more than just floating-point math. It uses them for memory operations like memcpy and struct copies. If you have a struct that’s, say, 32 bytes, LLVM might decide the fastest way to copy it is to load it into a pair of 128-bit NEON registers and store it back. This happens even in integer-only code. Even in early boot.

If FP/ASIMD is disabled when this happens, the CPU generates a synchronous exception with exception class EC=0x07 (FP/ASIMD access trap). Your kernel crashes before it even prints “hello.”

We handled the EL2 side already (clearing TFP in cptr_el2). Now at EL1, we set FPEN (bits [21:20]) in cpacr_el1 to 0b11, which means “don’t trap FP/ASIMD at EL1 or EL0.” The orr instruction sets those bits. Another isb to synchronize.
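The orr is a two-bit set. As a hosted-Rust sketch (enable_fpen is our name for illustration):

```rust
// The cpacr_el1 update from boot.S: set FPEN (bits [21:20]) to 0b11.
fn enable_fpen(cpacr: u64) -> u64 {
    cpacr | (0b11 << 20) // FPEN = 0b11: don't trap FP/ASIMD at EL1 or EL0
}

fn main() {
    let cpacr = enable_fpen(0);
    assert_eq!((cpacr >> 20) & 0b11, 0b11);
    assert_eq!(cpacr, 0x30_0000);
    println!("FPEN bits set");
}
```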

8.8 Jumping to Rust (lines 57-60)

  bl rust_main
3:
  wfe
  b 3b
Calling into Rust

Finally, bl rust_main is a branch-with-link: it saves the return address in x30 (the link register) and jumps to rust_main. This is where we leave assembly and enter Rust.

The three lines after it are a safety net. rust_main is declared as -> ! (never returns), but if something goes horribly wrong and it does return, we don’t want the CPU executing random memory. So we enter an infinite loop: wfe (wait for event, which puts the CPU in a low-power sleep state) followed by a branch back to wfe.

9. The Rust entry point

Now we’re in Rust. Let’s look at the actual main.rs for the AArch64 virt platform.

#![no_std]
#![no_main]

use core::panic::PanicInfo;
use hal::log::Logger;

core::arch::global_asm!(include_str!("boot.S"));

// QEMU `virt` PL011 UART base.
const UART0_BASE: usize = 0x0900_0000;

struct UartLogger;

impl UartLogger {
    #[inline(always)]
    fn mmio_write(offset: usize, val: u32) {
        unsafe { core::ptr::write_volatile((UART0_BASE + offset) as *mut u32, val) }
    }

    #[inline(always)]
    fn mmio_read(offset: usize) -> u32 {
        unsafe { core::ptr::read_volatile((UART0_BASE + offset) as *const u32) }
    }

    fn putc(c: u8) {
        // FR (0x18) bit5 = TXFF (transmit FIFO full)
        while (Self::mmio_read(0x18) & (1 << 5)) != 0 {}
        Self::mmio_write(0x00, c as u32);
    }

    pub(crate) fn puts(s: &str) {
        for &b in s.as_bytes() {
            if b == b'\n' {
                Self::putc(b'\r');
            }
            Self::putc(b);
        }
    }
}

impl hal::log::Logger for UartLogger {
    fn log(&self, s: &str) {
        UartLogger::puts(s);
    }
}

mod timer;
mod preempt;
mod mem;

#[unsafe(no_mangle)]
pub extern "C" fn rust_main() -> ! {
    let logger = UartLogger;
    logger.log("rustOS: aarch64 QEMU virt boot OK\n");

    #[cfg(feature = "demo-ipc")]
    {
        logger.log("rustOS: IPC + cooperative scheduling demo\n");
        kernel::kmain(&logger)
    }

    #[cfg(feature = "demo-timer")]
    {
        logger.log("rustOS: timer interrupts demo\n");
        timer::init();
        logger.log("rustOS: timer started, entering idle loop\n");
        loop {
            hal::arch::halt();
        }
    }

    #[cfg(feature = "demo-preempt")]
    {
        logger.log("rustOS: preemptive multitasking demo\n");
        preempt::init();
        extern "C" {
            fn start_first(ctx: *const preempt::Context) -> !;
        }
        unsafe { start_first(preempt::first_context()) }
    }

    #[cfg(feature = "demo-memory")]
    {
        logger.log("rustOS: memory management demo (frames + page tables)\n");
        mem::demo();
        loop {
            hal::arch::halt();
        }
    }

    #[cfg(not(any(feature = "demo-ipc", feature = "demo-timer",
                   feature = "demo-preempt", feature = "demo-memory")))]
    {
        logger.log("rustOS: no demo selected, halting\n");
        loop {
            hal::arch::halt();
        }
    }
}

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    UartLogger::puts("rustOS: PANIC\n");
    loop {
        hal::arch::halt();
    }
}
crates/arch_aarch64_virt/src/main.rs

Let’s unpack the important pieces.

9.1 global_asm! and including boot.S

core::arch::global_asm!(include_str!("boot.S"));
Including assembly in the Rust build

This line tells the Rust compiler: “take the contents of boot.S and include them as global assembly in this compilation unit.” The assembler processes our boot code, the linker resolves the symbols (_start, rust_main, __stack_top, etc.), and everything ends up in one binary. It’s the bridge between our assembly boot code and Rust.

9.2 #[unsafe(no_mangle)] and extern "C"

#[unsafe(no_mangle)]
pub extern "C" fn rust_main() -> ! {
The Rust entry point signature

Two things happening here.

#[unsafe(no_mangle)] prevents the Rust compiler from mangling the function name. Normally, Rust encodes type information into symbol names (so rust_main might become something like _ZN17arch_aarch64_virt9rust_main17h3a2b1c4d5e6f7g8hE). Our assembly code calls bl rust_main and needs to find that exact name.

extern "C" says: use the C calling convention. Rust’s own calling convention isn’t stable and can change between compiler versions. The C ABI is well-defined on every platform. On AArch64, the C calling convention puts the first argument in x0, the second in x1, return values in x0, and so on. Since assembly calls this function, we need a stable, predictable calling convention.

-> ! means this function never returns. On bare metal, there’s nothing to return to. The CPU would start executing whatever random bytes follow in memory. The function must either loop forever or halt the CPU.

9.3 The feature gates

#[cfg(feature = "demo-ipc")]
{
    logger.log("rustOS: IPC + cooperative scheduling demo\n");
    kernel::kmain(&logger)
}
Compile-time feature selection

The #[cfg(feature = "...")] attributes are compile-time switches. Depending on which feature you pass to the build script, a different code path gets compiled in. Only one demo runs at a time. This is a common pattern in embedded Rust: use cargo features to select between different configurations without runtime overhead.

9.4 The panic handler

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    UartLogger::puts("rustOS: PANIC\n");
    loop {
        hal::arch::halt();
    }
}
Bare-metal panic handler

On bare metal, you must define what happens when Rust panics (array out of bounds, unwrap() on None, explicit panic!(), etc.). There’s no OS to catch it. Our handler prints “PANIC” to the UART and halts. In a more sophisticated kernel, you’d print the panic message and a stack trace. For now, knowing something panicked is enough.

10. PL011 UART: talking to hardware via MMIO

In a normal program, every memory address maps to a byte of RAM. Address 0x1000 holds some data, you read it, you write it, it’s just storage. On bare metal, some addresses are special: they’re connected to hardware peripherals instead of RAM. Writing to address 0x0900_0000 on our QEMU virt machine doesn’t store anything in memory. It sends a byte out the serial line. This is memory-mapped I/O (MMIO).

The PL011 UART sits at that address. When you write a byte to offset 0x00 from that base, the UART hardware transmits it. When you read from offset 0x18, you get the UART’s flag register, telling you whether the transmit FIFO is full. These addresses come from the hardware manufacturer’s specification (or in QEMU’s case, from QEMU’s source code and the virt machine documentation). There’s no way to discover them at runtime. You look them up.

10.1 The UartLogger implementation

Let’s look at how we actually write bytes to the UART:

const UART0_BASE: usize = 0x0900_0000;

impl UartLogger {
    #[inline(always)]
    fn mmio_write(offset: usize, val: u32) {
        unsafe { core::ptr::write_volatile((UART0_BASE + offset) as *mut u32, val) }
    }

    #[inline(always)]
    fn mmio_read(offset: usize) -> u32 {
        unsafe { core::ptr::read_volatile((UART0_BASE + offset) as *const u32) }
    }
}
MMIO helper functions (from main.rs)

write_volatile and read_volatile are critical. Normally, the Rust compiler (through LLVM) is free to optimize away memory operations. If you write to an address and never read it back, the optimizer might skip the write entirely. “Why bother,” it thinks, “nobody’s going to look at it.”

But for MMIO, the hardware IS looking at it. Writing to the UART data register transmits a byte. The compiler can’t see that side effect. volatile tells the compiler: “this access has effects you can’t reason about. Do it exactly as written, in exactly this order.”

Both functions are unsafe because we’re casting raw integers to pointers and dereferencing them. In safe Rust, pointer arithmetic is forbidden. On bare metal, it’s the only way to talk to hardware.
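You can try the volatile API in ordinary hosted Rust, too. This sketch round-trips a value through a raw pointer to a stack variable; on bare metal the pointer would instead come from the device memory map:

```rust
use core::ptr::{read_volatile, write_volatile};

fn main() {
    let mut reg: u32 = 0;
    let p: *mut u32 = &mut reg;
    unsafe {
        write_volatile(p, 0x42);            // the compiler must emit this store
        assert_eq!(read_volatile(p), 0x42); // ...and this load, in this order
    }
    println!("volatile round-trip ok");
}
```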

10.2 Sending a character

fn putc(c: u8) {
    // FR (0x18) bit5 = TXFF (transmit FIFO full)
    while (Self::mmio_read(0x18) & (1 << 5)) != 0 {}
    Self::mmio_write(0x00, c as u32);
}
PL011 character transmission (from main.rs)

The PL011 UART has a transmit FIFO (a small hardware buffer, typically 16 bytes). When you write a byte to the data register (offset 0x00), it goes into the FIFO. The UART hardware drains the FIFO at the configured baud rate, sending bits out the serial line.

But the FIFO can fill up. If you write faster than the UART can transmit, the FIFO overflows and bytes get lost. So before writing, we check the Flag Register (offset 0x18). Bit 5 is TXFF (transmit FIFO full). If it’s set, we spin-wait until there’s room.

This is called polling or busy-waiting. It’s the simplest approach and perfectly fine for debug output. In a production driver, you’d use interrupts instead (the UART can signal when the FIFO has room, so the CPU can do other work instead of spinning).

10.3 Sending a string

pub(crate) fn puts(s: &str) {
    for &b in s.as_bytes() {
        if b == b'\n' {
            Self::putc(b'\r');
        }
        Self::putc(b);
    }
}
PL011 string transmission with newline conversion (from main.rs)

Two things to notice. First, we iterate over the raw bytes of the string (s.as_bytes()), not characters. At this level, we’re dealing with bytes, not Unicode code points.

Second, the \n to \r\n conversion. Serial terminals expect both a carriage return (\r, move cursor to column 0) and a line feed (\n, move cursor down one line). If you send just \n, many terminals will move down without returning to the left edge, giving you a staircase effect. So we insert a \r before every \n.

10.4 The Logger trait implementation

impl hal::log::Logger for UartLogger {
    fn log(&self, s: &str) {
        UartLogger::puts(s);
    }
}
Logger trait implementation (from main.rs)

This bridges the gap between our platform-specific UART driver and the platform-agnostic kernel. The kernel never calls UartLogger::puts directly. It calls logger.log() on a trait object. More on this in the next section.

11. The platform-agnostic kernel

Here’s the design payoff. The kernel crate knows nothing about ARM, nothing about UART, nothing about QEMU. It just knows it has something that can log strings.

11.1 The Logger trait

pub trait Logger {
    fn log(&self, s: &str);
}
crates/hal/src/log.rs

Three lines. That’s the entire hardware abstraction layer for output. Any platform that can print strings can implement this trait.
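To see how little the trait demands, here is a hypothetical second implementation (not part of the repo) that logs to an in-memory buffer instead of a UART. This is the kind of thing you’d use to unit-test kernel logic on your host machine; `BufferLogger` and its `RefCell` are our own invention for illustration.

```rust
use std::cell::RefCell;

// The trait, reproduced from crates/hal/src/log.rs.
pub trait Logger {
    fn log(&self, s: &str);
}

// A host-side test double: same interface, no hardware underneath.
// `log` takes `&self`, so we need interior mutability to accumulate output.
struct BufferLogger(RefCell<String>);

impl Logger for BufferLogger {
    fn log(&self, s: &str) {
        self.0.borrow_mut().push_str(s);
    }
}

fn main() {
    let logger = BufferLogger(RefCell::new(String::new()));
    logger.log("hello from a non-UART logger\n");
    assert_eq!(logger.0.borrow().as_str(), "hello from a non-UART logger\n");
    print!("{}", logger.0.borrow());
}
```

Anything that accepts `&dyn Logger`, like `kmain`, would accept this just as happily as the real PL011 driver.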

11.2 The kernel entry

#![no_std]

use hal::log::Logger;

mod ipc;
mod sched;

use core::cell::UnsafeCell;

#[repr(transparent)]
struct RouterCell(UnsafeCell<ipc::Router>);
unsafe impl Sync for RouterCell {}

// Force the router into a writable section. On some bare-metal targets, a `static`
// with interior mutability can otherwise end up in a read-only segment, causing
// a data abort when we first write to it (exactly what we saw on aarch64 QEMU virt).
#[link_section = ".data"]
static ROUTER: RouterCell = RouterCell(UnsafeCell::new(ipc::Router::new()));

pub fn kmain(logger: &dyn Logger) -> ! {
    logger.log("rustOS: kernel online\n");
    logger.log("rustOS: microkernel step 1 (IPC + cooperative scheduling)\n");

    let router: &mut ipc::Router = unsafe { &mut *ROUTER.0.get() };

    let mut ping = sched::PingTask::new();
    let mut pong = sched::PongTask::new();
    let mut tasks: [&mut dyn sched::Task; 2] = [&mut ping, &mut pong];

    sched::run(&mut tasks, logger, router)
}
crates/kernel/src/lib.rs

Look at the kmain signature: fn kmain(logger: &dyn Logger) -> !. It takes a trait object (&dyn Logger), not a concrete type. The kernel doesn’t know whether logger is a PL011 UART, a COM1 serial port, or a carrier pigeon with a Morse code encoder. It just knows it can call .log(). If you wanted to port this kernel to RISC-V or even a microcontroller with an SPI display, you’d implement Logger for that platform’s output device and call kmain with it.

Notice the #[link_section = ".data"] on the ROUTER static. This is a hard-won lesson. On AArch64, the linker sometimes places statics with interior mutability (like UnsafeCell) into read-only sections. The first write causes a data abort. Forcing it into .data guarantees it lands in a writable section. We learned this the painful way.

The IPC router and scheduler are the subject of Part 2. For now, just know that kmain is where platform-agnostic kernel logic lives.

11.3 The halt function

#[inline(always)]
pub fn halt() {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        core::arch::asm!("hlt", options(nomem, nostack, preserves_flags));
    }

    #[cfg(target_arch = "aarch64")]
    unsafe {
        // Use WFI (wait-for-interrupt) so we reliably sleep until the next IRQ.
        // WFE can return immediately if an event is already latched.
        core::arch::asm!("wfi", options(nomem, nostack, preserves_flags));
    }

    #[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
    {
        loop {}
    }
}
crates/hal/src/arch.rs

This is another piece of the hardware abstraction layer. On AArch64, wfi (wait for interrupt) puts the CPU into a low-power sleep state until an interrupt fires. On x86_64, the equivalent is hlt. The kernel code just calls hal::arch::halt() and doesn’t care which instruction runs underneath.

The comment about wfe vs wfi is interesting. wfe (wait for event) can return immediately if an event flag is already set, which means your “sleep” loop might spin instead of sleeping. wfi is more predictable: it always waits for an actual interrupt.

12. Linker scripts

On bare metal, there’s no OS to decide where your code lives in memory. The linker script is the document that says: put .text.boot first (so the CPU’s entry point is at the right address), put .text after it, then .rodata, .data, .bss, and finally the stack.

Our linker script for the virt platform places the kernel at 0x40080000. This is within the RAM region that starts at 0x40000000 on the QEMU virt machine. The 0x80000 offset is a convention (it matches the Raspberry Pi’s load address offset), though for QEMU it’s somewhat arbitrary as long as it’s within RAM.

The linker script also defines symbols like __bss_start, __bss_end, and __stack_top that the assembly boot code references. The linker resolves them when it lays out the binary; without them, the boot code would have no idea where BSS begins or where to put the stack.
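To make that concrete, here is a hypothetical sketch of what the relevant parts of such a linker script look like; the actual linker.ld in the repo may differ in details like stack size and alignment.

```ld
/* Sketch only -- see linker.ld in the repo for the real thing. */
ENTRY(_start)
SECTIONS
{
    . = 0x40080000;                       /* load address on QEMU virt */
    .text.boot : { KEEP(*(.text.boot)) }  /* entry code must come first */
    .text      : { *(.text*) }
    .rodata    : { *(.rodata*) }
    .data      : { *(.data*) }
    .bss (NOLOAD) : {
        __bss_start = .;                  /* symbols the boot code zeroes between */
        *(.bss*) *(COMMON)
        __bss_end = .;
    }
    . = ALIGN(16);
    . += 0x4000;                          /* reserve 16 KiB for the stack */
    __stack_top = .;                      /* stack grows down from here */
}
```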

The build.rs file for our crate tells cargo to use this linker script:

fn main() {
    println!("cargo:rerun-if-changed=linker.ld");
    println!("cargo:rerun-if-changed=src/boot.S");
    let manifest_dir = std::env::var("CARGO_MANIFEST_DIR")
        .expect("CARGO_MANIFEST_DIR not set");
    println!("cargo:rustc-link-arg=-T{}/linker.ld", manifest_dir);
}
crates/arch_aarch64_virt/build.rs

The cargo:rustc-link-arg=-T.../linker.ld line passes the linker script to the linker. Without it, the linker would use its default layout, which almost certainly wouldn’t put our code at the right address.

13. Hands-on exercise

Here’s a quick challenge. Open crates/kernel/src/lib.rs and add a second log message to kmain:

pub fn kmain(logger: &dyn Logger) -> ! {
    logger.log("rustOS: kernel online\n");
    logger.log("rustOS: microkernel step 1 (IPC + cooperative scheduling)\n");
    logger.log("rustOS: hello from YOUR_NAME_HERE!\n");  // add this
    // ... rest of the function
Adding a custom log message

Rebuild and run:

./scripts/build-aarch64-virt.sh demo-ipc
./scripts/run-aarch64-virt.sh
Rebuild and verify

You should see your message appear between the boot messages and the scheduler output. One line of Rust. It traveled through a trait object, into a UART driver, through memory-mapped I/O, into a virtual PL011 peripheral, and out QEMU’s emulated serial port to your terminal.

Let’s step back and look at what we’ve got. A 60-line assembly boot sequence that sets up a stack, zeros BSS, drops from EL2 to EL1, enables floating-point, installs an exception vector table, and hands off to Rust. A Rust entry point that initializes a PL011 UART driver and calls into a platform-agnostic kernel. A three-line Logger trait that abstracts away all the hardware details. And a kernel entry function that doesn’t know or care what CPU architecture it’s running on. It boots. It prints. It works. And everything from here builds on this foundation. The IPC system in Part 2, the timer interrupts in Part 3, the MMU in Part 4: they all stand on the boot sequence and UART driver we just walked through. Get this part right, and the rest is (relatively) straightforward.

14. Brief appendix: how would this differ on other platforms?

If you’re curious about x86_64 or Raspberry Pi, here’s a quick sketch.

On x86_64, the boot process is much more complex because of backwards compatibility with the 1978 Intel 8086. The CPU starts in 16-bit real mode, where you can only address 1 MB of memory. You have to transition through protected mode (32-bit) and then long mode (64-bit), setting up GDTs (Global Descriptor Tables), enabling PAE (Physical Address Extension), and configuring initial page tables (paging is required for 64-bit mode). Most Rust OS projects use the bootloader crate by Philipp Oppermann, which handles all these mode transitions and loads your kernel ELF. For serial output, x86 has legacy COM ports at I/O address 0x3F8, accessed with special in and out instructions (port-mapped I/O rather than memory-mapped I/O). The uart_16550 crate wraps this nicely.
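For contrast with the MMIO code above, here is a hypothetical sketch (not part of this repo) of what the x86 side looks like: the COM1 data register sits at I/O port 0x3F8 and is reached with the `out` instruction, not a memory store.

```rust
// Port-mapped I/O sketch for x86_64. COM1's data register is at port 0x3F8.
const COM1: u16 = 0x3F8;

#[cfg(target_arch = "x86_64")]
unsafe fn outb(port: u16, val: u8) {
    // `out dx, al` writes AL to the I/O port number held in DX.
    core::arch::asm!(
        "out dx, al",
        in("dx") port,
        in("al") val,
        options(nomem, nostack, preserves_flags),
    );
}

fn main() {
    // We deliberately don't call `outb` here: on a hosted OS, raw port I/O
    // from user space would fault. This only shows the shape of the call.
    println!("COM1 data register is I/O port {:#06x}", COM1);
}
```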

On Raspberry Pi, something unexpected happens: the GPU boots first, not the CPU. The VideoCore GPU reads bootcode.bin from the SD card, loads start.elf, reads config.txt for settings, and then loads your kernel (kernel8.img) to address 0x80000. Only then does the ARM CPU start. The boot assembly is simpler than our QEMU virt version (no EL2 drop needed, the firmware handles that), but the UART driver is more complex. The Pi has two UARTs: a full PL011 (which is connected to Bluetooth by default) and a simpler Mini-UART on GPIO pins 14/15. The Mini-UART’s clock is tied to the GPU frequency, which makes baud rate configuration trickier. And of course, the development cycle is slower: build, copy to SD card, plug into Pi, power on, check output.

Both platforms share the same kernel crate and Logger trait. That’s the whole point of the abstraction.

15. References

