5-Part Series:
- Part 0: Why build an OS from scratch?
- Part 1: Foundations
- Part 2: Communication
- Part 3 (this): Concurrency
- Part 4: Memory and beyond
GitHub Repository: bahree/rust-microkernel — full source code and build scripts
Docker Image: amitbahree/rust-microkernel — prebuilt dev environment with Rust, QEMU, and source code
Recap from Part 2
We built message-passing IPC with a mailbox router and a cooperative scheduler. Two tasks — PingTask and PongTask — exchange messages through endpoint-based routing, taking turns via poll(). It works great, as long as every task plays nice.
Right now, our tasks are polite. They take turns, each one calling poll() and returning promptly. But what happens when a task doesn’t feel like returning? The whole system freezes. That’s the problem we’re solving today.
We’re going to teach our kernel to yank the CPU away from a misbehaving task, save everything that task was doing, and hand control to somebody else. By the end of this post, two threads will be running infinite loops, never yielding or returning, yet they’ll alternate perfectly. The OS will be in charge, not the tasks. 😈
TL;DR
We add timer interrupts and preemptive multitasking to rustOS on AArch64:
- Timer interrupts fire every 100ms, giving us a real tick counter
- The GIC routes hardware interrupts to our handler
- Exception vectors catch IRQs and dispatch them to Rust code
- Context switching saves all 31 general-purpose registers (plus SP, ELR, SPSR) from one thread and restores another’s
- Preemptive scheduling means tasks can’t monopolize the CPU
A few terms we’ll define as we go, but here’s the quick version:
- Cooperative scheduling: Tasks voluntarily give up the CPU. If one task loops forever, everyone else starves, and the system locks up.
- Preemptive scheduling: The OS forcibly takes the CPU away on a timer tick. No task can hog the processor because the OS interrupts it and switches to another task.
- Context switch: Saving the complete CPU state of one task and loading another’s, so the second task resumes exactly where it left off. This includes all registers, the stack pointer, and CPU flags.
Time: ~4-6 hours
1. Why interrupts matter
Let’s start with the problem. Why do we need interrupts? Why can’t we just keep polling?
1.1 The polling problem
Remember our scheduler from Part 2?
```rust
let mut tick: u64 = 0;
loop {
    for task in tasks.iter_mut() {
        task.poll(logger, router, tick);
    }
    tick += 1; // Just counting loop iterations, not real time!
}
```

This has three problems:
- First, tick doesn't mean anything real. One iteration isn't one millisecond or one second; it's just however long the tasks happen to take.
- Second, the CPU burns at 100% spinning through that tight loop even when there's nothing to do.
- Third, if any task decides to loop forever inside poll(), the entire system locks up.
Think about it this way. Without interrupts, your CPU is like a waiter who walks to each table asking, “Ready to order?” If table 3 is still deciding, the waiter is stuck standing there forever. Everyone else is hungry and just waiting. With interrupts, each table has a bell. The waiter can do something else (or just rest) and respond when a bell rings.
1.2 The solution: Timer interrupts
What if hardware could tap the CPU on the shoulder at regular intervals? That’s exactly what a timer interrupt does. We configure a piece of hardware to fire a signal every 100 milliseconds. The CPU stops whatever it’s doing, runs our handler, and then goes back. Now we have real time. And if we’re clever about what the handler does, we can use it to switch between tasks.
```rust
static TICKS: AtomicU64 = AtomicU64::new(0);

// Called by hardware every 100ms, not by our code
fn timer_irq_handler() {
    TICKS.fetch_add(1, Ordering::Relaxed);
}

// Main loop can now sleep between interrupts
loop {
    let tick = TICKS.load(Ordering::Relaxed);
    for task in tasks.iter_mut() {
        task.poll(logger, router, tick);
    }
    hal::arch::halt(); // Sleep until next interrupt
}
```

2. What is an interrupt?
There’s literally a wire on the CPU chip (well, a signal line) dedicated to receiving interrupt requests. Between executing each instruction, the CPU checks this signal. When it sees the line asserted (fancy way of saying the voltage changed by an external device), the CPU finishes its current instruction, saves its state, looks up the address of a handler function from a table in memory, and jumps there.
Think of it like a doorbell ringing while you’re reading a book. You finish your sentence, bookmark your page, answer the door, deal with whatever it is, then come back and pick up exactly where you left off. The bookmark is your “saved state.” Answering the door is the “handler.”
```mermaid
sequenceDiagram
    participant CPU
    participant Hardware as Timer Hardware
    participant Handler as IRQ Handler
    participant Task as Task A Code
    Task->>CPU: mov x0, #42
    Task->>CPU: add x1, x0, #10
    Hardware->>CPU: Timer IRQ Signal
    activate CPU
    Note over CPU: Save CPU State<br/>(x0-x30, SP, PC, flags)
    CPU->>Handler: Jump to handler
    activate Handler
    Handler->>Handler: TICKS += 1
    Handler->>Hardware: Acknowledge IRQ
    Handler->>CPU: Return
    deactivate Handler
    Note over CPU: Restore CPU State
    CPU->>Task: Resume at add x1, x0, #10
    deactivate CPU
    Task->>CPU: ... continue execution ...
```

A few key properties of interrupts make them special:
- Asynchronous: They can happen at any point during execution - literally between any two instructions. Your code doesn’t ask for them.
- Hardware-driven: The CPU doesn’t poll. A physical signal line tells it “something needs attention.”
- Context preservation: The handler must not corrupt the interrupted program’s state. When the handler returns, the original code must resume as if nothing happened.
2.1 IRQs and exception types
IRQ stands for Interrupt Request. It’s a signal from a hardware device (timer, UART, network card, etc.) asking the CPU to pause and handle something. On ARM, there are four types of exceptions:
| Type | Trigger | Example |
|---|---|---|
| Synchronous | Caused by an instruction | Undefined instruction, data abort |
| IRQ | Hardware interrupt request | Timer, UART |
| FIQ | Fast interrupt request | High-priority devices |
| SError | Asynchronous system error | Memory system errors |
For our timer, we care about the IRQ. We'll handle the other exception types later (for now, their handlers just hang).
You’ll see “mask” and “unmask” a lot in this post. Masked means blocked. When an interrupt is masked, the hardware device still generates the signal, but the CPU ignores it. Think of it like “Do Not Disturb” on your phone: calls still come in, your phone just doesn’t ring.
Unmasked means enabled. The CPU will respond to the interrupt signal normally. We mask interrupts during critical setup (like installing the handler table) and unmask them once everything’s ready. If an interrupt fired before the handler table was installed, the CPU would jump to garbage memory and crash.
3. The ARM Generic Timer
ARM processors have a built-in timer called the Generic Timer. Unlike external timer chips that communicate over a bus, this timer is integrated into the CPU itself. It counts at a fixed frequency and generates an interrupt when the countdown reaches zero. It’s perfect for our needs: we can set it to fire every 100ms, and it will reliably interrupt the CPU at that interval. The Generic Timer is a standard feature on all ARMv8-A (AArch64) CPUs, so our code will work on real hardware, not just QEMU. The only catch is that the timer’s frequency is set by platform firmware and can vary, so we need to read it at runtime. This is a common pattern in OS development: you write code that adapts to the hardware it’s running on, rather than hardcoding assumptions.
3.1 Timer registers
In reality, there are actually several timers available (physical, virtual, hypervisor), each with its own set of registers. The reason for this complexity is that ARM’s architecture supports multiple execution levels (EL0 for user code, EL1 for kernel, EL2 for hypervisor, EL3 for secure monitor). Each level has access to different timers. We only care about the physical timer accessible at EL1, since that’s what our kernel runs on. Specifically, we use the CNTP (Counter-timer Physical Timer) registers, which is a subset of the full Generic Timer system and is accessible from EL1 (kernel privilege):
- CNTFRQ_EL0: The timer’s frequency in Hz. Read-only, set by firmware. On QEMU virt, it’s typically 62,500,000 (62.5 MHz).
- CNTP_TVAL_EL0: The countdown value. Write a number here, and the timer counts down to zero, then fires an interrupt.
- CNTP_CTL_EL0: Control register. Bit 0 enables the timer.
A quick note on those _EL0 suffixes. That doesn’t mean the register “belongs to” userspace. It means the register is accessible from EL0 and above. Our kernel at EL1 can read and write it just fine. Registers named _EL1 are only accessible from EL1 and above.
3.2 Programming the timer
Programming the timer is straightforward. We read the frequency, calculate the countdown value for our desired tick interval (100ms), write it to CNTP_TVAL_EL0, and enable the timer. The timer then counts down at the counter frequency, and when it reaches zero, it fires an IRQ. It does not reload itself: to keep ticking every 100ms, we reprogram it with the same value each time the interrupt fires. That reprogramming happens inside the handler, and it’s exactly what lets the timer drive our scheduler: every tick gives us a chance to check whether it’s time to switch tasks.
Here’s how we set up a 100ms tick. We read the frequency, divide by 10 to get the countdown value for 100ms, write it to CNTP_TVAL_EL0, and enable the timer:
```rust
fn program_timer(freq: u64) {
    // 100ms tick: frequency / 10
    let tval = (freq / 10) as u64;
    unsafe {
        core::arch::asm!(
            "msr cntp_tval_el0, {tval}",
            "mov x0, #1",
            "msr cntp_ctl_el0, x0",
            tval = in(reg) tval,
            out("x0") _,
            options(nostack, nomem)
        );
    }
}
```

If the frequency is 62,500,000 Hz, then freq / 10 = 6,250,000. The timer counts down from 6,250,000 at 62.5 MHz, reaching zero in exactly 100ms. When it hits zero, IRQ 30 fires (that’s the physical timer’s assigned interrupt ID on ARM).
Note that the timer doesn’t automatically reload. It just fires once - it is a one-shot timer. If we want it to fire again after the next 100ms, we have to write CNTP_TVAL_EL0 again. This gives us flexibility: we can change the tick interval on the fly if we want. We do this inside the IRQ handler every time the timer interrupt fires.
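The countdown arithmetic is worth sanity-checking on the host. This is a minimal sketch, not kernel code; tval_for is a hypothetical helper generalizing the freq / 10 calculation to any tick interval:

```rust
/// Countdown value for a given tick interval at a given counter frequency.
/// `tval_for` is an illustrative helper, not part of the kernel source.
fn tval_for(freq_hz: u64, interval_ms: u64) -> u64 {
    freq_hz / 1_000 * interval_ms
}

fn main() {
    // QEMU virt's typical 62.5 MHz counter with a 100ms tick:
    assert_eq!(tval_for(62_500_000, 100), 6_250_000);
    // A faster 10ms tick would need a proportionally smaller countdown:
    assert_eq!(tval_for(62_500_000, 10), 625_000);
    println!("100ms tick -> TVAL = {}", tval_for(62_500_000, 100));
}
```

Because the timer is one-shot, changing the tick interval is just a matter of writing a different value on the next reprogram.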
4. The GIC (Generic Interrupt Controller)
The Generic Interrupt Controller (GIC) is the hardware component responsible for managing and routing interrupts on ARM systems. It acts as a traffic cop, receiving interrupt signals from various hardware devices (like our timer) and directing them to the appropriate CPU core. The GIC also handles prioritization, masking, and acknowledgment of interrupts.
When the timer fires, it sends an interrupt signal to the GIC. The GIC then checks its configuration to determine which CPU core should handle that interrupt and delivers it accordingly. The GIC also provides registers to enable/disable specific interrupts, set their priorities, and acknowledge them once handled.
The key point: the timer can generate an interrupt, but something needs to route that interrupt to the CPU core. That’s the GIC’s job.
4.1 Why two components?
The GIC is split into two parts:
GICD (Distributor): One per system, shared by all cores. It receives interrupts from all hardware devices and decides which CPU core should handle each one. Think of it as a switchboard operator receiving all incoming calls. Then it forwards the call to the correct phone (core) based on its configuration.
GICC (CPU Interface): One per core. It delivers interrupts to its specific CPU and handles acknowledgment. Think of it as the phone on each person’s desk. The GICC listens for calls from the GICD and rings when an interrupt is delivered. It also provides registers for the CPU to acknowledge that it has handled the interrupt, allowing the GICD to send the next one.
This split matters for multi-core systems (which core handles the timer? which core handles the UART?), but even on our single-core setup, we need to configure both. This helps us write code that works on real hardware with multiple cores, not just in QEMU, and it also gives us a clearer separation of concerns in our code. The GICD is responsible for global interrupt configuration, while the GICC is responsible for local interrupt delivery and acknowledgment on each core.
4.2 Memory-mapped registers
Memory-mapped I/O means that hardware devices expose their control registers as specific addresses in the system’s memory map. To interact with the GIC, we read and write to these addresses. This is done using special functions that perform volatile reads and writes, ensuring that the compiler doesn’t optimize these accesses away. We treat these addresses as pointers to hardware registers. When we write to a GIC register, we’re actually sending commands to the hardware. When we read from a GIC register, we’re getting status information from the hardware.
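The text refers to mmio_read32 and mmio_write32 helpers without showing them. Here’s a sketch of their assumed shape (base address plus byte offset, accessed through volatile pointer operations); since we can’t poke real GIC registers from a host program, the demo below writes to a local buffer standing in for device memory:

```rust
use core::ptr::{read_volatile, write_volatile};

/// Assumed shape of the kernel's MMIO helpers: volatile accesses so the
/// compiler never elides, caches, or reorders reads/writes to hardware.
unsafe fn mmio_write32(base: usize, offset: usize, value: u32) {
    write_volatile((base + offset) as *mut u32, value);
}

unsafe fn mmio_read32(base: usize, offset: usize) -> u32 {
    read_volatile((base + offset) as *const u32)
}

fn main() {
    // Host-side demo: a local array stands in for device registers.
    let mut fake_gicd = [0u32; 4];
    let base = fake_gicd.as_mut_ptr() as usize;
    unsafe {
        mmio_write32(base, 0x0, 1); // "set the enable bit"
        assert_eq!(mmio_read32(base, 0x0), 1);
    }
    println!("fake GICD_CTLR = {}", fake_gicd[0]);
}
```

The volatile part is not optional: a plain write to a never-read address is exactly the kind of thing optimizers delete.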
On QEMU’s virt machine, the GIC lives at two base addresses: the GICD at 0x0800_0000 and the GICC at 0x0801_0000.
Here are the registers we’ll touch:
GICD registers:
| Offset | Register | What it does |
|---|---|---|
| 0x000 | GICD_CTLR | Bit 0 enables the distributor |
| 0x100 | GICD_ISENABLER0 | Write a 1-bit to enable that IRQ number |
| 0x400 | GICD_IPRIORITYR | Priority for each IRQ (lower = higher priority) |
| 0x800 | GICD_ITARGETSR | Which CPU core handles each IRQ |
GICC registers:
| Offset | Register | What it does |
|---|---|---|
| 0x000 | GICC_CTLR | Bit 0 enables this core’s interface |
| 0x004 | GICC_PMR | Priority mask: only deliver IRQs above this priority |
| 0x00C | GICC_IAR | Read this to learn which IRQ fired (and tell GIC you’re handling it) |
| 0x010 | GICC_EOIR | Write the IRQ number here to say “I’m done” |
The IAR/EOIR protocol is important. Reading GICC_IAR does two things at once: it returns the interrupt number and atomically marks that interrupt as “active.” Writing GICC_EOIR marks it “complete.” If you forget the EOIR write, the GIC thinks you’re still handling that interrupt and won’t deliver another one. Classic bug that makes the system appear to freeze after the first interrupt. Always remember: read IAR to start handling, write EOIR to finish.
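Two details of the IAR value are easy to get wrong: only the low 10 bits hold the interrupt ID, and IDs 1020-1023 are special (1023 means “spurious: nothing actually pending”). A small sketch, with irq_id and is_spurious as hypothetical helper names:

```rust
/// The interrupt ID lives in the low 10 bits of GICC_IAR; upper bits
/// carry metadata and must be masked off before comparing IDs.
fn irq_id(iar: u32) -> u32 {
    iar & 0x3FF
}

/// IDs 1020..=1023 are reserved; 1023 is the spurious-interrupt ID.
fn is_spurious(iar: u32) -> bool {
    irq_id(iar) >= 1020
}

fn main() {
    assert_eq!(irq_id(30), 30);         // the physical timer
    assert_eq!(irq_id(0xC00 | 30), 30); // metadata bits ignored
    assert!(is_spurious(1023));         // nothing to handle
    assert!(!is_spurious(30));
    println!("timer IRQ id = {}", irq_id(30));
}
```

Spurious interrupts should be acknowledged and ignored, never EOIR’d as if they were real work.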
4.3 The actual init code
Let us bring it all together. Here’s our real timer::init() function. It sets up the GIC, programs the timer, and unmasks IRQs:
```rust
static TICKS: AtomicU64 = AtomicU64::new(0);
static CNTFRQ: AtomicU64 = AtomicU64::new(0);

const GICD_BASE: usize = 0x0800_0000;
const GICC_BASE: usize = 0x0801_0000;
const IRQ_CNTPNS: u32 = 30;

pub fn init() {
    // Enable GIC distributor
    mmio_write32(GICD_BASE, 0x000, 1);
    // Enable GIC CPU interface, accept all priorities
    mmio_write32(GICC_BASE, 0x004, 0xFF);
    mmio_write32(GICC_BASE, 0x000, 1);
    // Enable physical timer interrupt (IRQ 30)
    enable_irq(IRQ_CNTPNS);
    // Read counter frequency and program timer
    let freq: u64;
    unsafe {
        core::arch::asm!(
            "mrs {0}, cntfrq_el0",
            out(reg) freq,
            options(nostack, nomem)
        );
    }
    CNTFRQ.store(freq, Ordering::Relaxed);
    program_timer(freq);
    // Unmask IRQs (clear DAIF.I)
    unsafe {
        core::arch::asm!("msr daifclr, #2", options(nostack, nomem));
    }
}
```

The sequence here really matters. We enable the distributor first, then the CPU interface, then the specific IRQ, then program the timer, and finally unmask interrupts at the CPU level. If we unmasked interrupts before the GIC was ready, the CPU might see a spurious interrupt and crash. If we enabled the timer before the GIC, the timer would fire, but the interrupt would never reach the CPU. If we enabled the CPU interface before the distributor, the CPU would be ready to receive interrupts, but none would be sent. The order of operations is crucial for a stable system.
Why AtomicU64 instead of a regular u64? Even on a single-core system, interrupt handlers create concurrency. The main loop and the interrupt handler interleave unpredictably. If the main loop reads TICKS as two 32-bit halves (which happens on architectures without native 64-bit atomic loads) and the timer interrupt fires between the two halves, you get a corrupted value. AtomicU64 guarantees each load and store is indivisible. We use Ordering::Relaxed because on a single core, there’s no other CPU to synchronize with. The atomicity is still needed to prevent tearing, but we don’t need stronger ordering guarantees since there’s no cross-core communication.
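This is easy to demonstrate on the host. The sketch below is an analogy, not kernel code: a spawned thread plays the role of the IRQ handler bumping TICKS while the “main loop” reads it. Relaxed ordering suffices because all we need is that each individual load and store is indivisible:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let ticks = Arc::new(AtomicU64::new(0));
    // Stand-in for the timer IRQ handler.
    let irq = {
        let ticks = Arc::clone(&ticks);
        thread::spawn(move || {
            for _ in 0..1000 {
                ticks.fetch_add(1, Ordering::Relaxed);
            }
        })
    };
    // The "main loop" may observe any value 0..=1000, but never a torn
    // half-written one: atomicity rules that out by construction.
    let observed = ticks.load(Ordering::Relaxed);
    assert!(observed <= 1000);
    irq.join().unwrap();
    assert_eq!(ticks.load(Ordering::Relaxed), 1000);
    println!("final tick count = {}", ticks.load(Ordering::Relaxed));
}
```

With a plain `static mut u64` the equivalent code would be undefined behavior; the atomic type is what makes the interleaving well-defined.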
5. The exception vector table
The exception vector table is a key piece of the interrupt handling mechanism. It’s a fixed block of memory that the CPU consults when an exception (like an IRQ) occurs. The CPU uses the exception type and the current execution context to calculate an offset into this table, where it expects to find a branch instruction to the appropriate handler.
When an IRQ fires, the CPU needs to know where to jump. On ARM, you set up a vector table: a fixed-layout block of code in memory containing branch instructions for each exception type.
AArch64’s vector table has 16 entries arranged in a 4x4 grid: 4 exception types (Synchronous, IRQ, FIQ, SError) × 4 execution contexts. The contexts are:
- Current EL, SP0 (using the EL0 stack pointer, uncommon)
- Current EL, SPx (using the current EL’s own stack pointer, the normal case for kernel code)
- Lower EL, AArch64 (interrupts from user mode running 64-bit code)
- Lower EL, AArch32 (interrupts from 32-bit code)
For our kernel running at EL1, the “Current EL, SPx” row is the one that matters. That’s at offset 0x200 from the table base.
Each entry gets exactly 0x80 (128) bytes of space. That’s enough room for a branch instruction to a larger handler elsewhere. The .align 11 directive aligns the table to 2^11 = 2048 bytes, which is a hardware requirement: the CPU will fault if VBAR_EL1 points to an unaligned address. That’s why we have the .align 11 at the start of the table. Each entry is a simple branch to a common handler for that exception type. For now, all IRQ entries point to the same exc_irq handler, which will read the interrupt ID and dispatch accordingly.
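To make the layout concrete: an entry’s offset is just group × 0x200 plus exception type × 0x80. A small host-side check (vector_offset is an illustrative helper, not part of the kernel):

```rust
/// Offset of a vector table entry. Groups (0..4): Current EL SP0,
/// Current EL SPx, Lower EL AArch64, Lower EL AArch32. Types (0..4):
/// Sync, IRQ, FIQ, SError. Each entry is 0x80 bytes; each group 0x200.
fn vector_offset(group: usize, exception_type: usize) -> usize {
    group * 0x200 + exception_type * 0x80
}

fn main() {
    assert_eq!(vector_offset(0, 0), 0x000); // Current EL SP0, Sync
    assert_eq!(vector_offset(1, 0), 0x200); // Current EL SPx, Sync
    assert_eq!(vector_offset(1, 1), 0x280); // Current EL SPx, IRQ
    assert_eq!(vector_offset(3, 3), 0x780); // last entry in the table
    println!("timer IRQs land at table offset {:#x}", vector_offset(1, 1));
}
```

That 0x280 result is exactly where our timer IRQs land.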
Here’s the actual vector table from our boot.S:
```asm
.align 11
vectors:
    // 0x000: Current EL, SP0, Sync
    b exc_sync
    .space 0x80 - 4
    // 0x080: Current EL, SP0, IRQ
    b exc_irq
    .space 0x80 - 4
    // 0x100: Current EL, SP0, FIQ
    b exc_fiq
    .space 0x80 - 4
    // 0x180: Current EL, SP0, SError
    b exc_serr
    .space 0x80 - 4
    // 0x200: Current EL, SPx, Sync
    b exc_sync
    .space 0x80 - 4
    // 0x280: Current EL, SPx, IRQ  <--- Timer IRQs land here
    b exc_irq
    .space 0x80 - 4
    // 0x300: Current EL, SPx, FIQ
    b exc_fiq
    .space 0x80 - 4
    // 0x380: Current EL, SPx, SError
    b exc_serr
    .space 0x80 - 4
    // 0x400-0x780: Lower EL entries (same pattern)
    // ... (all branches to the same handlers for now)
```

The table is installed during boot with:
```asm
el1_start:
    adr x0, vectors
    msr vbar_el1, x0
    isb
```

VBAR_EL1 (Vector Base Address Register) tells the CPU where the table lives. The isb (Instruction Synchronization Barrier) flushes the pipeline so the CPU sees the new table address immediately. If we forgot the isb, the CPU might still use the old (zero) value of VBAR_EL1 for a few instructions, and if an interrupt fired during that window, it would jump to address 0 and crash. Always remember to use isb after changing VBAR_EL1.
6. The Rust IRQ handler
In Rust, we write a safe wrapper around the raw assembly handler. The assembly code in exc_irq will save all the registers, call our Rust function, and then restore registers based on the return value. This way, we can write our interrupt handling logic in Rust without worrying about the low-level details of context saving and restoring. The reason we return a pointer to a Context struct is that the handler can decide to switch to a different thread. If it returns the same pointer it was given, the assembly code will restore that context and resume the same thread. If it returns a different pointer, the assembly code will restore the new context, effectively switching threads. This is how we implement preemptive multitasking: the timer interrupt can decide to switch to a different thread every time it fires.
When a timer interrupt fires, the CPU jumps to exc_irq in the vector table, which (after saving context, more on that later) calls our Rust handler. Here’s the actual code:
```rust
#[unsafe(no_mangle)]
pub extern "C" fn rust_irq_handler(current: *mut Context) -> *const Context {
    let iar = mmio_read32(GICC_BASE, GICC_IAR);
    let id = iar & 0x3FF;
    let mut next: *const Context = current;
    if id == IRQ_CNTPNS {
        let t = TICKS.fetch_add(1, Ordering::Relaxed) + 1;
        let freq = CNTFRQ.load(Ordering::Relaxed);
        if freq != 0 {
            program_timer(freq);
        }
        // Switch threads every 5 ticks (~500ms with 100ms tick).
        if (t % 5) == 0 {
            next = super::preempt::switch_next(current);
        }
    }
    // End of interrupt
    mmio_write32(GICC_BASE, GICC_EOIR, iar);
    next
}
```

If we look closely at the signature, we see that it takes a *mut Context (a pointer to the current thread’s saved state) and returns a *const Context (a pointer to the next thread’s state). If no switch is needed, it returns the same pointer it was given. The assembly code that calls this function uses the return value to decide which context to restore.
Here’s what happens step by step:
- Read IAR: Find out which interrupt fired. We mask with 0x3FF because the lower 10 bits hold the interrupt ID.
- Check if it’s the timer: ID 30 is our physical timer.
- Increment tick counter: fetch_add atomically adds 1 and returns the old value, so we add 1 to get the current tick.
- Reprogram timer: Set up the next 100ms countdown. Without this, the timer only fires once.
- Maybe switch threads: Every 5 ticks (500ms), call switch_next to pick the other thread.
- Write EOIR: Tell the GIC we’re done handling this interrupt.
7. DAIF bits and unmasking
ARM has four masking bits in the processor state register (PSTATE), collectively called DAIF. These processor flags control whether certain types of exceptions are masked (blocked) or unmasked (enabled) and are critical for safely handling interrupts - hence the name “DAIF” for Debug, SError, IRQ, FIQ. These bits act as individual on/off switches for different categories of exceptions:
- D (bit 3): Debug exceptions — breakpoints, watchpoints, single-step traps. Used by debuggers.
- A (bit 2): SError (asynchronous aborts) — serious hardware errors like uncorrectable memory faults. You generally want these enabled so the system can respond to hardware failures rather than silently ignoring them.
- I (bit 1): IRQ (normal interrupts) — this is the one we care about most. Timer interrupts, UART receive interrupts, and most peripheral interrupts are IRQs.
- F (bit 0): FIQ (fast interrupts) — a higher-priority interrupt channel. On some systems, FIQ is reserved for secure-world use (TrustZone). We don’t use it in our kernel.
When a bit is set (1), that exception type is masked (blocked). By default, when the CPU enters EL1, all four bits are set, meaning everything is blocked. This is a safety measure: the kernel starts with interrupts disabled so it can set up handlers and data structures before anything fires.
The instruction msr daifclr, #2 means “DAIF Clear with bitmask 0b0010,” which clears only the I bit, unmasking normal IRQs. We deliberately leave D, A, and F alone. We do this at the very end of timer::init(), after the GIC is configured, the timer is programmed, and the vector table is installed. The ordering is crucial: if we unmasked IRQs before the vector table was ready, the first timer tick would jump to an uninitialized address, causing the system to crash.
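The bit arithmetic is small enough to check directly. A sketch with the DAIF bit positions as used by the daifclr immediate:

```rust
// DAIF bit positions as encoded in `msr daifclr, #imm`:
// D = bit 3, A = bit 2, I = bit 1, F = bit 0.
const DAIF_F: u8 = 1 << 0;
const DAIF_I: u8 = 1 << 1;
const DAIF_A: u8 = 1 << 2;
const DAIF_D: u8 = 1 << 3;

fn main() {
    // `msr daifclr, #2` clears exactly the I bit:
    assert_eq!(DAIF_I, 0b0010);
    // Starting from "everything masked" (the state on EL1 entry)
    // and clearing only I:
    let all_masked = DAIF_D | DAIF_A | DAIF_I | DAIF_F;
    let after_unmask = all_masked & !DAIF_I;
    assert_eq!(after_unmask, 0b1101); // D, A, F remain masked
    println!("PSTATE mask after daifclr #2: {:#06b}", after_unmask);
}
```

Clearing a different subset (say, I and F together) would just be a different immediate: #3.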
You might wonder: why not just unmask everything? Because we want precise control. We only enable what we’ve set up handlers for. Unmasking FIQ without a handler would cause the same crash problem. And debug exceptions should only be enabled when we actually have a debugger attached. BTW, leaving them masked by default is a good defensive programming practice.
8. Running the timer demo
OK, this is the moment of truth. Let’s build and run the timer demo. If everything is set up correctly, you should see the system boot, print a message about the timer demo, and then enter an idle loop that wakes on each timer tick. You can run just the timer demo (without preemption) using the demo-timer feature:
```sh
./scripts/build-aarch64-virt.sh demo-timer
./scripts/run-aarch64-virt.sh
```

Expected behavior: the system enters a wfe (Wait For Event) loop and wakes on each timer tick. At this point, we’re not printing ticks to keep things simple. The key proof that it works: the system doesn’t hang. If interrupts weren’t firing, wfe would sleep forever.
Now here’s the key insight: if we can interrupt a running task, we can save its state and run a different task instead. Think about what the interrupt handler already does. When a timer IRQ fires, the CPU automatically saves the program counter and status flags. Our handler saves the remaining registers. After incrementing the tick counter, we restore everything and resume the interrupted code. The interrupted code never knows it was paused. But what if, instead of restoring the same code’s registers, we restored different code’s registers? We’d resume a completely different thread. That’s preemptive multitasking. That’s the whole trick.
Pretty cool, right? With just a timer interrupt and some context saving/restoring, we can switch between threads without them ever asking for it. The timer is the “clock” that drives our scheduler, and the interrupt handler is the “switchboard operator” that decides which thread to run next.
9. What is CPU context?
We talk about saving and restoring “context,” but what does that actually mean? We need to save enough of the CPU’s state so that when we switch back to a thread, it can resume exactly where it left off, with all its registers and flags intact. This includes all the general-purpose registers (x0-x30), the stack pointer (SP), the program counter (PC, stored in ELR_EL1), and the CPU flags (stored in SPSR_EL1).
In other words, Context is everything the CPU needs to resume a task exactly where it left off. Here’s our actual Context struct from preempt.rs:
```rust
#[repr(C)]
pub struct Context {
    pub x: [u64; 31], // x0..x30
    pub sp: u64,      // Stack pointer
    pub elr: u64,     // Exception Link Register (where to resume)
    pub spsr: u64,    // Saved Program Status Register (CPU flags)
}
```

Size: 31 registers + SP + ELR + SPSR = 34 fields, each 8 bytes = 272 bytes per task.
Why #[repr(C)]? By default, Rust is free to reorder struct fields and add padding however it likes. But our assembly code accesses fields by hardcoded byte offsets ([x9, #0x00] for x0, [x9, #0xF8] for SP, etc.). If the compiler rearranged the fields, the assembly would save and restore the wrong data, silently corrupting the task state. #[repr(C)] forces fields to appear in memory in declaration order. This way, we can reliably calculate offsets for assembly access.
For example, x0 is at offset 0, x1 at 8, …, x30 at 240, sp at 248, elr at 256, and spsr at 264. The assembly code uses these offsets to save and restore the correct registers; if the struct layout changed, the offsets would be wrong, and we’d end up with a very buggy system. This is also a common pattern in OS development: you have a data structure shared between Rust and assembly, and you need to ensure they agree on its layout. #[repr(C)] is the standard way to achieve this.
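One way to guard against layout drift (a host-side check, assuming Rust 1.77+ for the offset_of! macro; real kernels often put these in const assertions next to the struct):

```rust
use std::mem::{offset_of, size_of};

#[repr(C)]
pub struct Context {
    pub x: [u64; 31],
    pub sp: u64,
    pub elr: u64,
    pub spsr: u64,
}

fn main() {
    // These are the hardcoded offsets the assembly relies on.
    assert_eq!(offset_of!(Context, x), 0x000);
    assert_eq!(offset_of!(Context, sp), 0x0F8);   // 248
    assert_eq!(offset_of!(Context, elr), 0x100);  // 256
    assert_eq!(offset_of!(Context, spsr), 0x108); // 264
    assert_eq!(size_of::<Context>(), 272);
    println!("Context layout matches the assembly's offsets");
}
```

If someone reorders a field, these assertions fail at test time instead of the kernel corrupting thread state at runtime.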
9.1 Why save ALL 31 registers?
In theory, we could get away with saving only the registers that the interrupted code was actually using. But in practice, we have no way of knowing which registers those are. The compiler can use any register for any purpose at any time. Even if we tried to analyze the code to figure out which registers are live at the interrupt point, we’d have to do that for every possible instruction and every possible interrupt timing. It would be a nightmare.
Imagine Thread A is in the middle of a loop. Register x5 holds a loop counter, and x19 holds a critical pointer. The timer fires, we switch to Thread B, which runs for a while and overwrites x5 and x19 with its own data. When we switch back to Thread A, those registers are corrupted. Thread A crashes or (worse) silently computes the wrong thing.
The problem is we don’t know which registers the interrupted code was using. It could be any of them. So we save all of them.
| Registers | Role | What breaks if not saved |
|---|---|---|
| x0-x7 | Function arguments, return values | Function calls fail |
| x8 | Indirect result location | Large struct returns corrupt memory |
| x9-x15 | Temporaries (caller-saved) | Mid-calculation values corrupted |
| x16-x17 | Intra-procedure scratch | Dynamic linking breaks |
| x18 | Platform register | Reserved for OS use |
| x19-x28 | Callee-saved | Long-lived variables corrupted (hardest bugs) |
| x29 | Frame pointer | Debugger backtraces break |
| x30 | Link register (return address) | Functions return to wrong address |
| SP | Stack pointer | Stack access goes to wrong memory |
| ELR_EL1 | Exception return address | CPU resumes at wrong instruction |
| SPSR_EL1 | Saved CPU flags | Conditional branches break, IRQ state wrong |
What about NEON/FP registers (v0-v31)? AArch64 has 32 additional 128-bit registers for SIMD and floating-point. We don’t save them because it would add 512 bytes per task and ~200 extra cycles per switch. For our demo, we assume tasks don’t use floating-point. A real OS would implement lazy FP saving: track whether a task used FP registers and only save them if it did.
10. Thread setup
For preemption to work, we need at least two threads. Each thread needs its own stack and a context that points to its entry function. When the timer interrupt fires, the handler can switch between these threads by returning different context pointers. With just one thread, the timer would interrupt it, but we’d always return the same context, so it would just resume the same thread. With two threads, we can alternate between them on each timer tick.
Now let’s look at the complete preemption system. We need stacks, entry points, and context initialization.
10.1 Stacks and static contexts
Because we’re writing a kernel, we don’t have dynamic memory allocation yet. We can’t malloc stacks for our threads. Instead, we define static arrays in Rust to serve as stacks. Each thread gets its own stack array. We also define a static array of Context structs to hold the CPU state for each thread. This way, we can set up everything at compile time without needing any heap allocation. The code below is quite straightforward: we define two 16 KB stacks and an array of two Context structs. The CURRENT atomic variable tracks which thread is currently running.
```rust
const STACK_SIZE: usize = 16 * 1024; // 16 KB per thread

#[repr(align(16))]
struct Stack([u8; STACK_SIZE]);

static STACK0: Stack = Stack([0; STACK_SIZE]);
static STACK1: Stack = Stack([0; STACK_SIZE]);

static mut CTX: [Context; 2] = [
    Context { x: [0; 31], sp: 0, elr: 0, spsr: 0x5 },
    Context { x: [0; 31], sp: 0, elr: 0, spsr: 0x5 },
];

static CURRENT: AtomicUsize = AtomicUsize::new(0);
```

The #[repr(align(16))] on Stack guarantees 16-byte alignment. ARM requires the stack pointer to be 16-byte aligned at all times. Violating this causes an alignment fault.
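You can verify both alignment properties on the host. A minimal sketch: the attribute fixes the type’s alignment, and because STACK_SIZE is itself a multiple of 16, the stack top (base + size) inherits that alignment:

```rust
use std::mem::align_of;

const STACK_SIZE: usize = 16 * 1024;

#[repr(align(16))]
struct Stack([u8; STACK_SIZE]);

static STACK0: Stack = Stack([0; STACK_SIZE]);

fn main() {
    // The attribute guarantees the type's alignment...
    assert_eq!(align_of::<Stack>(), 16);
    // ...so the initial SP (base + STACK_SIZE, since stacks grow
    // downward) is 16-byte aligned as well.
    let top = STACK0.0.as_ptr() as usize + STACK_SIZE;
    assert_eq!(top % 16, 0);
    println!("stack top {:#x} is 16-byte aligned", top);
}
```

Without the attribute, a `[u8; N]` only guarantees 1-byte alignment, and the very first push on that stack could fault.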
The spsr: 0x5 value means 0b0000_0101, which encodes EL1h: Exception Level 1, using SP_EL1 (the kernel’s own stack pointer). When eret restores this SPSR value, the CPU runs at the kernel privilege level. That’s what we want for our kernel threads.
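The mode encoding lives in SPSR’s low four bits (M[3:0]). A tiny decoder sketch (mode_name is an illustrative helper covering only the modes we care about):

```rust
/// Decode SPSR_EL1's M[3:0] field: which EL and stack pointer `eret`
/// returns to. Only the cases relevant to this kernel are named.
fn mode_name(spsr: u64) -> &'static str {
    match spsr & 0xF {
        0b0000 => "EL0t", // EL0
        0b0100 => "EL1t", // EL1, using SP_EL0
        0b0101 => "EL1h", // EL1, using SP_EL1
        _ => "other",
    }
}

fn main() {
    assert_eq!(mode_name(0x5), "EL1h"); // our thread contexts
    assert_eq!(mode_name(0x4), "EL1t");
    println!("spsr 0x5 -> {}", mode_name(0x5));
}
```

When we later add userspace, contexts for user threads would carry 0b0000 (EL0t) here instead.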
10.2 Thread entry points
Entry points are the functions where each thread starts executing. When we initialize the context for each thread, we set the elr field to point to its entry function. When the scheduler switches to that thread, it will “return” to that function as if it had been interrupted there. Each thread runs an infinite loop that prints its name every 10 ticks. Thread A prints “A\n” at ticks 10, 20, 30, etc., while Thread B prints “B\n” at ticks 5, 15, 25, etc. This way, we can visually confirm that both threads are running and that the timer interrupt is switching between them.
pub(crate) extern "C" fn thread_a_entry() -> ! {
// Enable the timer after the first thread context is active.
timer::init();
let mut last_tick: u64 = 0;
loop {
let t = timer::ticks();
if t != last_tick {
last_tick = t;
if (t % 10) == 0 { UartLogger::puts("A\n"); }
}
core::hint::spin_loop();
}
}
pub(crate) extern "C" fn thread_b_entry() -> ! {
let mut last_tick: u64 = 0;
loop {
let t = timer::ticks();
if t != last_tick {
last_tick = t;
if (t % 10) == 5 { UartLogger::puts("B\n"); }
}
core::hint::spin_loop();
}
}
Notice something subtle: Thread A checks (t % 10) == 0 but Thread B checks (t % 10) == 5. Why not both check (t % 10) == 0?
Context switches happen every 5 ticks, so at tick 10 (a multiple of 10) the scheduler switches from whichever thread is running to the other. But the switch happens during that very tick, which means Thread B might never actually observe tick 10 while it's running — it gets switched out at exactly that moment. By having B check for tick 5 instead, we offset its check from the switch boundary that would hide it, and both threads print reliably.
Both functions are extern "C" because they’ll be called from assembly via eret. The C ABI ensures the calling convention matches what our assembly code expects.
10.3 Context initialization
Context initialization is the process of setting up the Context structs for each thread so that when we switch to them, they start executing at their entry points with a valid stack. We set the sp field to point to the top of each thread’s stack (remember, stacks grow downward), and we set the elr field to point to the thread’s entry function. The spsr field is set to 0x5, which means EL1h mode (kernel mode using SP_EL1). This way, when we “return” to this context, the CPU will jump to the entry function with the correct privileges.
pub fn init() {
unsafe {
let top0 = STACK0.0.as_ptr().add(STACK_SIZE) as u64;
let top1 = STACK1.0.as_ptr().add(STACK_SIZE) as u64;
CTX[0].sp = top0;
CTX[0].elr = thread_a_entry as *const () as usize as u64;
CTX[0].spsr = 0x5;
CTX[1].sp = top1;
CTX[1].elr = thread_b_entry as *const () as usize as u64;
CTX[1].spsr = 0x5;
}
}
Why SP points to the END of the stack: Stacks grow downward on ARM (and x86). Pushing data decreases the stack pointer, so the "bottom" (starting point) is the highest address. We compute STACK0.0.as_ptr().add(STACK_SIZE) to get one-past-the-end. If we set SP to the beginning of the allocation, the very first push would write below it, corrupting other memory.
The eret trick: We set elr to the thread’s entry function address. When eret executes, the CPU loads the PC from the ELR and jumps to it. The CPU doesn’t know (or care) whether it’s “returning” from a real exception or just bootstrapping a new thread. It simply loads and jumps. This is how the first task on every OS starts: fill in the context as if the task had been interrupted, then “return” to it.
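The stack-top arithmetic and the alignment guarantee can both be checked on the host. This is a standalone sketch reusing the names from above for illustration:

```rust
// Host-checkable sketch of the stack-top computation: the aligned type
// and the one-past-the-end arithmetic mirror the kernel code above.
const STACK_SIZE: usize = 16 * 1024;

#[repr(align(16))]
struct Stack([u8; STACK_SIZE]);

static STACK0: Stack = Stack([0; STACK_SIZE]);

fn main() {
    let base = STACK0.0.as_ptr() as usize;
    let top = base + STACK_SIZE; // one past the end: the initial SP

    assert_eq!(base % 16, 0); // #[repr(align(16))] guarantees this
    assert_eq!(top % 16, 0);  // 16 KB is a multiple of 16, so SP starts aligned
    assert_eq!(top - base, STACK_SIZE);
    println!("initial SP sits {STACK_SIZE} bytes above the allocation base");
}
```

Because STACK_SIZE is itself a multiple of 16, aligning the base automatically aligns the top — the very first push from the initial SP lands inside the array, never outside it.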
11. The context switch in assembly
As we touched on earlier, this is the heart of preemption. Every register must be saved and restored perfectly, or threads silently corrupt each other. The exc_irq handler in boot.S does this. It saves all registers to the current context, calls the Rust handler to decide which context to run next, and then restores registers from the returned context. The assembly code is a bit verbose because we have to save 31 registers, plus SP, ELR, and SPSR, and we have to be careful not to clobber any registers before we save them.
Here’s the actual exc_irq handler from boot.S. We’ll walk through it section by section.
11.1 The scratch save trick
exc_irq:
// We need a working register to read TPIDR_EL1, but we can't
// touch any register without saving it first. Chicken-and-egg.
// Solution: use the stack as scratch space for x9 and x10.
sub sp, sp, #0x20
str x9, [sp, #0x00]
str x10, [sp, #0x08]
// Now x9 is free. Load the current Context pointer.
mrs x9, tpidr_el1
Here's the problem: to save registers to the Context struct, we need a pointer to that struct. But reading TPIDR_EL1 requires a destination register, which would clobber the interrupted thread's value in that register. Solution: we temporarily push x9 and x10 onto the stack (which doesn't need a general-purpose register because SP is implicit in str), then use x9 to hold the context pointer. We'll retrieve the original x9/x10 from the stack later and store them in the proper Context slots.
11.2 Saving all registers
As we called out earlier, every register must be saved, but the implementation is straightforward: we use stp (Store Pair) to save registers in pairs, and str for the odd one out (x8). We also save SP, ELR, and SPSR at the end. The offsets into the Context struct are carefully calculated from the struct layout.
// Save x0..x8 to Context
stp x0, x1, [x9, #0x00]
stp x2, x3, [x9, #0x10]
stp x4, x5, [x9, #0x20]
stp x6, x7, [x9, #0x30]
str x8, [x9, #0x40]
// Retrieve original x9/x10 from scratch and save to Context
ldr x0, [sp, #0x00]
str x0, [x9, #0x48] // x9's slot in Context
ldr x0, [sp, #0x08]
str x0, [x9, #0x50] // x10's slot in Context
// Save x11..x30
stp x11, x12, [x9, #0x58]
stp x13, x14, [x9, #0x68]
stp x15, x16, [x9, #0x78]
stp x17, x18, [x9, #0x88]
stp x19, x20, [x9, #0x98]
stp x21, x22, [x9, #0xA8]
stp x23, x24, [x9, #0xB8]
stp x25, x26, [x9, #0xC8]
stp x27, x28, [x9, #0xD8]
stp x29, x30, [x9, #0xE8]
// Save SP (original, before our scratch allocation), ELR, SPSR
add x0, sp, #0x20
str x0, [x9, #0xF8] // SP
mrs x0, elr_el1
str x0, [x9, #0x100] // ELR (where to resume)
mrs x0, spsr_el1
str x0, [x9, #0x108] // SPSR (CPU flags)stp (Store Pair) saves two 8-byte registers in one instruction. The offsets (#0x00, #0x10, #0x20, …) are byte offsets into the Context struct. x0 lives at offset 0, x1 at offset 8, x2 at offset 16 (0x10), and so on. After saving x0-x8, we restore x9 and x10 from the stack and save them in their proper slots in the Context. Finally, we save SP, ELR, and SPSR.
Notice how we recover SP: add x0, sp, #0x20 computes the original SP before we subtracted 0x20 for the scratch area. We store that original SP in the Context. This way, when we restore the context later, we can set SP back to where the interrupted thread expects it, not where we temporarily moved it for our scratch space.
11.3 Calling the Rust handler
The Rust handler needs the current context pointer to decide which thread to run next. We pass it in x0, and it returns the next context pointer in x0. We then store that pointer back in TPIDR_EL1 for the next interrupt and also keep it in x19 for the restore sequence. This is the point where we can switch threads: by returning a different context pointer, the Rust handler causes the assembly code to restore a different thread's state. When the timer fires again, it saves the current thread's state to its context, calls the Rust handler, and restores the next thread's state, switching between them on each tick.
// Call rust_irq_handler(current_context) -> next_context
mov x0, x9
bl rust_irq_handler
// x0 now points to the next Context (might be the same, might be different)
msr tpidr_el1, x0
mov x19, x0 // Keep as base register for restore
We pass the current context pointer in x0 (the first argument in the C calling convention). The Rust handler returns the next context pointer in x0 (the return value). We stash this in both TPIDR_EL1 (so the next interrupt knows where to save) and x19 (so we can use it as a base for the restore sequence).
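The decision logic inside the handler can be sketched on the host. This is a hypothetical outline — the real rust_irq_handler also acknowledges the GIC and rearms the timer, and names like SWITCH_EVERY and tick_and_should_switch are illustrative, not the kernel's actual identifiers:

```rust
// Hypothetical sketch of the per-tick decision rust_irq_handler makes:
// bump the tick counter, and switch contexts every SWITCH_EVERY ticks.
use core::sync::atomic::{AtomicU64, Ordering};

static TICKS: AtomicU64 = AtomicU64::new(0);
const SWITCH_EVERY: u64 = 5; // switch contexts every 5 ticks (500 ms)

/// Bump the tick counter; return true when this tick should switch threads.
fn tick_and_should_switch() -> bool {
    let t = TICKS.fetch_add(1, Ordering::Relaxed) + 1;
    t % SWITCH_EVERY == 0
}

fn main() {
    // Over 20 ticks at "switch every 5", exactly 4 ticks trigger a switch.
    let switches = (0..20).filter(|_| tick_and_should_switch()).count();
    assert_eq!(switches, 4);
    println!("{switches} switches in 20 ticks");
}
```

When tick_and_should_switch returns false, the handler would simply return the same context pointer it was given, and the assembly restores the interrupted thread — a switch is just the special case where a different pointer comes back.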
11.4 Restoring the next context
On the flip side, we restore all the registers from the context pointer returned by the Rust handler. This is essentially the reverse of the save sequence: we load SP, ELR, and SPSR first, then all the general-purpose registers. Finally, we execute eret to resume the next thread wherever it left off, with the correct CPU state. This is quite literally the "magic" moment where we switch from one thread to another. The CPU doesn't know or care that we're switching threads; it just loads the new PC and flags and jumps. The interrupted thread is completely unaware that it was paused and later resumed.
// Restore SP, ELR, SPSR for the next thread
ldr x1, [x19, #0xF8]
mov sp, x1
ldr x1, [x19, #0x100]
msr elr_el1, x1
ldr x1, [x19, #0x108]
msr spsr_el1, x1
// Restore x0..x18 (skip x19 for now, we're using it)
ldp x0, x1, [x19, #0x00]
ldp x2, x3, [x19, #0x10]
ldp x4, x5, [x19, #0x20]
ldp x6, x7, [x19, #0x30]
ldp x8, x9, [x19, #0x40]
ldp x10, x11, [x19, #0x50]
ldp x12, x13, [x19, #0x60]
ldp x14, x15, [x19, #0x70]
ldr x16, [x19, #0x80]
ldp x17, x18, [x19, #0x88]
// Restore x20..x30
ldp x20, x21, [x19, #0xA0]
ldp x22, x23, [x19, #0xB0]
ldp x24, x25, [x19, #0xC0]
ldp x26, x27, [x19, #0xD0]
ldp x28, x29, [x19, #0xE0]
ldr x30, [x19, #0xF0]
// Restore x19 LAST (we were using it as the base pointer)
ldr x19, [x19, #0x98]
eret
The critical detail: x19 is restored last because we're using it as the base address for all the loads. If we restored it earlier, we'd lose our pointer and couldn't load the remaining registers. Then we execute eret (Exception Return), which atomically loads the PC from ELR_EL1 and the CPU flags from SPSR_EL1 and jumps to that address. The next thread starts running as if it were never interrupted. The whole save/restore process is completely transparent to the threads: they just run, get interrupted, and resume later without any awareness of the context switching happening behind the scenes.
12. start_first: launching the first thread
We do have a chicken-and-egg problem at boot: we need to start the first thread so that the timer can preempt it, but we don’t have a “previous context” to save because no thread is running yet. The solution is a special start_first function that loads the first thread’s context and jumps to it using eret. This function is called from Rust after we initialize the contexts but before we enable interrupts. It sets up the CPU state for the first thread and then “returns” to it, effectively launching our multitasking system.
.global start_first
start_first:
msr tpidr_el1, x0 // Store context pointer for future IRQs
mov x19, x0 // Use as base register
// Restore SP, ELR, SPSR
ldr x1, [x19, #0xF8]
mov sp, x1
ldr x1, [x19, #0x100]
msr elr_el1, x1
ldr x1, [x19, #0x108]
msr spsr_el1, x1
// Restore all GPRs
ldp x0, x1, [x19, #0x00]
ldp x2, x3, [x19, #0x10]
// ... (same pattern as exc_irq restore)
ldr x19, [x19, #0x98]
eret
It's nearly identical to the restore half of exc_irq. The eret jumps to whatever address was in elr, which we set to thread_a_entry during init(). The CPU doesn't know this isn't a "real" exception return. It just loads and jumps.
preempt::init();
extern "C" {
fn start_first(ctx: *const preempt::Context) -> !;
}
unsafe { start_first(preempt::first_context()) }
13. The scheduler
As we mentioned earlier, the scheduler’s job is to decide which thread to run next. In our simple demo, we have only two threads, so the scheduler just toggles between them. After all that assembly and GIC configuration, the scheduler itself is almost comically simple:
pub fn switch_next(_current_ctx: *mut Context) -> *const Context {
let cur = CURRENT.load(Ordering::Relaxed);
let next = cur ^ 1; // XOR: 0 becomes 1, 1 becomes 0
CURRENT.store(next, Ordering::Relaxed);
unsafe { &CTX[next] as *const Context }
}That’s it. XOR toggle for two threads. Compare this with the cooperative scheduler from Part 2, which was a loop that called poll() on each task. That scheduler was the one making decisions about when to run each task. Here, the scheduler doesn’t decide when to switch — the timer interrupt does. The scheduler only decides who to switch to. This separation of concerns is important: the timer provides the preemption mechanism, and the scheduler provides the policy.
The rust_irq_handler calls switch_next every 5 ticks (500ms). switch_next XOR-toggles the index (0 becomes 1, 1 becomes 0) and returns a pointer to the other thread's saved context; the assembly handler then restores that context instead of the interrupted thread's.
For two threads, the XOR toggle is perfect. For N threads, you'd replace it with (cur + 1) % N or a proper run queue with priorities. But the underlying mechanism — save the old context, pick the next thread, restore the next context — stays the same no matter how sophisticated your scheduling policy becomes. Linux's CFS scheduler and our XOR toggle both ultimately drive the same kind of context switch.
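The N-thread generalization can be sketched in a few lines. This is a host-side illustration, not the kernel's code: indices stand in for the Context pointers the real scheduler returns, and NUM_THREADS is an assumed value.

```rust
// Generalizing the XOR toggle: round-robin over N contexts using
// (cur + 1) % N. Indices stand in for Context pointers here.
use core::sync::atomic::{AtomicUsize, Ordering};

const NUM_THREADS: usize = 4;
static CURRENT: AtomicUsize = AtomicUsize::new(0);

fn switch_next() -> usize {
    let cur = CURRENT.load(Ordering::Relaxed);
    let next = (cur + 1) % NUM_THREADS; // wraps N-1 back to 0
    CURRENT.store(next, Ordering::Relaxed);
    next
}

fn main() {
    // Starting at 0, four switches visit 1, 2, 3 and wrap back to 0.
    let order: Vec<usize> = (0..4).map(|_| switch_next()).collect();
    assert_eq!(order, vec![1, 2, 3, 0]);
}
```

Note that with NUM_THREADS = 2, (cur + 1) % 2 computes exactly the same sequence as cur ^ 1 — the XOR is just the two-thread special case.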
14. Running the preemptive demo
OK, let’s see it in action. Build and run the demo-preempt feature. You should see the system boot, print a message about the preemptive demo, and then threads A and B alternating every second without ever yielding. This is the key proof that preemption works: both threads are running and switching without either of them ever calling yield() or returning. The OS is in control.
At face value, this looks boring — just two threads printing letters. But it’s actually a huge deal. We have a timer interrupt that can preempt running code, a scheduler that selects the next thread to run, and a context-switch mechanism that saves and restores CPU state perfectly. This is the core of any preemptive multitasking OS.
./scripts/build-aarch64-virt.sh demo-preempt
./scripts/run-aarch64-virt.sh
The proof: Neither thread ever calls yield() or returns. They both spin in infinite loops. Yet they alternate. The OS is in control.
Timing breakdown: The timer ticks every 100ms. Threads switch every 5 ticks (500ms). Each thread prints every 10 ticks (1 second), with B's prints offset 5 ticks from A's. So you see A, then about half a second later B, then half a second later A again, alternating.
Press Ctrl-A then X to exit QEMU.
15. Cooperative vs preemptive: a comparison
Now that we’ve built both models, let’s put them side by side:
| Property | Cooperative (Part 2) | Preemptive (this part) |
|---|---|---|
| Who decides when to switch | The task itself | The OS (via timer) |
| Can a task hog the CPU? | Yes | No |
| Overhead per switch | ~10 cycles (function return) | ~300 cycles (full register save/restore) |
| Worst-case latency | Unbounded (task might never yield) | Bounded (at most one tick period) |
| Shared state safety | Simple (no concurrent access) | Complex (need atomics, critical sections) |
| Best for | Trusted code, embedded, coroutines | Untrusted code, general-purpose OS |
The overhead difference is worth understanding. In cooperative scheduling, a “switch” is just a function that returns, with the scheduler calling the next function. That’s a few instructions. In preemptive scheduling, we save 34 fields to memory, call the Rust handler, pick the next thread, restore 34 fields from memory, and execute eret. That’s roughly 30x more work per switch. At 10 switches per second (our current rate), that’s negligible. At 10,000 switches per second (a typical Linux configuration), it becomes a concern.
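Here's the back-of-the-envelope math behind that claim, using the rough cycle counts from the text. The 1 GHz clock speed is an assumption for the sketch:

```rust
// Back-of-the-envelope cost of preemptive switching, using the rough
// ~300 cycles/switch figure from the text; 1 GHz is an assumed clock.
fn main() {
    let cycles_per_switch: u64 = 300;
    let cpu_hz: f64 = 1_000_000_000.0; // assume a 1 GHz core

    for switches_per_sec in [10u64, 10_000] {
        let cycles = switches_per_sec * cycles_per_switch;
        let pct = 100.0 * cycles as f64 / cpu_hz;
        println!("{switches_per_sec} switches/s -> {cycles} cycles/s ({pct:.4}% of the CPU)");
    }
    // 10 switches/s costs 3,000 cycles/s (~0.0003% of the core);
    // 10,000/s costs 3,000,000 cycles/s (~0.3%) -- still small, until
    // TLB flushes and FP state push the per-switch cost up 10x.
}
```

Even at Linux-like switch rates the raw save/restore is well under 1% of a core; the real costs in a full OS come from the cache and TLB effects layered on top.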
The shared state trade-off is also significant. With cooperative scheduling, you know exactly when your task might be interrupted: only when you return from poll(). So you can safely modify shared data between poll() calls without locks. With preemption, an interrupt can arrive between any two instructions. That’s why we use AtomicU64 for TICKS and AtomicUsize for CURRENT — even on a single core, the interrupt handler and the interrupted code are logically concurrent.
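The atomic tick counter pattern looks like this on the host. A minimal sketch assuming the same AtomicU64 approach as the kernel; the function names are illustrative:

```rust
// Why TICKS must be atomic: the IRQ handler increments it while thread
// code reads it, and the two are logically concurrent even on one core.
// fetch_add is a single indivisible read-modify-write; a plain
// `ticks += 1` compiles to separate load/add/store instructions that an
// interrupt could land between, losing an increment.
use core::sync::atomic::{AtomicU64, Ordering};

static TICKS: AtomicU64 = AtomicU64::new(0);

fn irq_tick() {
    // Called from the timer interrupt handler.
    TICKS.fetch_add(1, Ordering::Relaxed);
}

fn ticks() -> u64 {
    // Called from thread code, like thread_a_entry's polling loop.
    TICKS.load(Ordering::Relaxed)
}

fn main() {
    for _ in 0..3 {
        irq_tick();
    }
    assert_eq!(ticks(), 3);
}
```

Relaxed ordering suffices here because the counter is the only shared state; if the handler also published data that threads read after observing a new tick, you'd need Release on the store side and Acquire on the load side.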
Most real systems use both approaches. The kernel itself typically uses cooperative scheduling internally (kernel code is trusted, and the performance benefit matters). User processes get preemptive scheduling (user code is untrusted, and fairness matters more than raw throughput). We’ll get there eventually — once we add user/kernel separation.
16. What we built
In this part, we took our cooperative kernel from Part 2 and gave the OS the power to enforce fairness by implementing preemptive scheduling. We went from “tasks voluntarily yield” to “the OS takes control whether tasks like it or not.” Here’s what we added:
- Timer interrupts: The ARM Generic Timer fires every 100ms, giving us a real notion of time instead of meaningless loop iteration counts.
- GIC configuration: The Distributor and CPU Interface route IRQ 30 (the physical timer) to our handler, with proper acknowledge/end-of-interrupt protocol.
- Exception vectors: A 16-entry vector table catches IRQs and dispatches them to Rust code, bridging the hardware-software boundary.
- Context switching: Assembly saves and restores all 34 fields (31 GPRs + SP + ELR + SPSR) per switch, ensuring threads can’t corrupt each other’s state.
- Preemptive scheduling: Timer-driven round-robin means threads can’t monopolize the CPU, even if they spin in infinite loops.
The context switch takes about 300 cycles on our setup. That’s fast, partly because we don’t flush the TLB (we’re in a single address space), don’t save NEON/FP registers, and don’t switch page tables. A real OS doing all of those things might take 3,000-5,000 cycles per switch. But the mechanism is identical: save state, pick next, restore state, eret.
The two threads in our demo never yield, never cooperate, never even know the other exists. Yet they alternate perfectly. That’s the fundamental promise of preemptive multitasking, and it’s why every general-purpose OS uses it.
In Part 4, we'll tackle the last major piece of the puzzle: virtual memory. We'll build a frame allocator, construct page tables, and enable the MMU so that each memory access goes through address translation. This is the foundation for process isolation — giving each task its own view of memory so they can't stomp on each other.
17. References
- ARM Architecture Reference Manual, Section D1 (Exception Handling) and Section D11 (Generic Timer)
- ARM GICv2 Specification
- QEMU virt machine memory map
- ARM Procedure Call Standard (AAPCS64), register usage
- Operating Systems: Three Easy Pieces, Chapter 6: Limited Direct Execution