x86_64 OS

Before you start, this implementation is educational and evolving. I'll document my progress as I go. Some subsystems are simplified. Terminology here is Linux inspired even when semantics are lighter.

Introduction
From Phil-Opp's Blog
My Work
- Disk
- File System

Introduction

I came across this when I surfing for things people made in Rust, when I somewhat finished reading the book (It was a lot of stuff). Initially, it was just me following the popular Writing as OS in Rust series by Philipp Oppermann. Most parts of this project are similiar to the code in the series as of now.

Why do this? It sounds cool.

Of course, I did not just finish his tutorial and call it a day. I wanted to make a file system. I went through a few sources primarily OSTEP.

I mainly talk about the File System in detail here. If you want to read the other topics as well, I would recommend to follow his blog. He explains it better than I ever could. I'll talk about the other things in brief detail to keep it smooth.

From Phil-Opp's Blog

So, an OS runs on bare metal, meaning, no underlying engine. It needs to get itself support for display, accepting keystrokes and all of it. We can't even rely on the standard library since they have functions that depend on the system calls that the kernel allows them to make. We add a #![no_std] to main.rs.

Panic

To set things up in Rust, we need to define our own Panic Handlers. We can't really use stack unwinding to get a stack trace because that is dependent on OS specific libraries like libunwind on Linux. This stack unwinding occurs in Rust when there is a panic. We make the CPU stop on panic. Well, we can't stop execution now, the CPU needs to jump somewhere. Those functions would be Panic Handlers that we define ourselves. These functions are diverging functions, meaning they never return. You can check the code in Cargo.toml that I have a panic = "abort" there.

Entry Point

Now, let's talk about the code. You might think that the entry point for a program would be the main function. That's true, but we need something to call it. Normally, the standard library provides a runtime that calls main. Since we don't have that, we define our own entry point _start. The linker looks for this symbol to know where the execution should begin after the bootloader hands off control.

To form something that runs (executable) we need to combine all of our code along with the libraries used for our specific system. This is the job for the linker. The linker has a target that would build on your machine. If you're on linux, the target would look like x86_64-unknown-linux-gnu. Rust uses a target triple to have other information along with the target. This would be some sort of a json file. We need to define our own target and give it to the linker.

BIOS

When you turn on a computer, it starts by executing some code stored in your ROM. You would call that firmware. There are 2 firmware standard: BIOS and UEFI. UEFI is the more modern one. so, it loads the BIOS from the ROM and tries to perform checks and looks for bootable disks. If it finds one, the control is transferred to a small portion of code stored in the bootable disk at the beginning, called the bootloader. It is a 512 byte portion. I haven't written a bootloader yet since it involves assembly code so the project uses the bootimage tool that automatically prepends a bootloader to our kernel (the code we're going to write).

VGA Buffer

Let's get some visuals. The easiest way to display anything at this stage is to use the VGA text buffer. It consists of 25 lines and 80 character cells. To write into this buffer, we need to know it's location: 0xb8000. Each cell consists of an ASCII byte and a color byte. There are more details to this that you can find in the original series. So, we basically use this VGA text buffer, to make our own print and println macros. We would make a Writer with a lock around it, and for each print invocation, we would hold the lock, write whatever we want and then release the lock for use by others. What if multiple processes try to write something? Having a lock around it helps to deal with situations like that.

Testing

When making something, you would want it to behave as you want it to and then test the code according to it. So, we define some tests for our code. Our code must behave as per our tests. Rust has a default testing framework that depends on external libraries. It allows us to replace this a custom_test_frameworks feature. We use this to define tests for our own kernel. We don't need to run these everytime, we have the option of conditional compilation.

Integration tests: you would put all your tests in a tests directory in project root and the compiler would pick them up when asked to test. These tests run separate from your internal code. They have to import your code from the project like an external user and test how various functions of yours work without knowing the implementation details of each.

CPU Exceptions

Let's say you try to divide by 0. We can't do that, so the CPU has to stop and deal with those situations somehow. What if you're trying to access a memory address that doesn't exist? The CPU needs to be told how to deal with these situations. So, we define functions for each of these situations called as exception handlers. These handlers are defined in a table called the Interrupt Descriptor Table or IDT. When an exception occurs, the CPU interrupts its current work and calls the handler.

When making these interrupt handlers, the CPU needs to be able to return to the state before calling these handlers incase we want to resume our work. Interrutps are not just errors, they are used for other stuff too. Normally, when function calls are involved, there are changes in heap, stack memory and register values. The most crucial ones are the register values. We need to somehow preserve the register values when we call our handler and restore the values once we're done. The x86-interrupt is such a calling convention that guarantees this restoration. The CPU needs to save the values somewhere, that stack is called InterruptStackFrame. Again, better details can be found in the original series.

What if we encounter some error in the panic handlers? Or a panic handler is missing? We would call this a double fault. We need to define the handler for this which is a diverging function with error code 0.

Stack Overflow? We could keep pushing onto a stack until we hit a guard page. The Guard Page is a special memory page at the bottom of the stack that is not mapped into a physical frame, so accessing this would cause a page fault. So a stack overflow causes a page fault.

The x86_64 architecture is able to switch to a predefined, known-good stack when an exception occurs. This switch happens at hardware level, so it can be performed before the CPU pushes the exception stack frame. How does the CPU know where this stack is?

GDT: To fix the stack overflow, we need to switch stacks. The mechanism for this is the Interrupt Stack Table. This IST lives inside Task State Segment (TSS). The only way to load a TSS is via the Global Descriptor Table (GDT). The GDT was primarily used for memory segmentation before paging became the standard. However, it is still used to define different privelege levels and loading the TSS. Check out the gdt.rs file.

Hardware Interrupts

How do external hardware devices interact with the CPU? Does the keyboard wait for the CPU to ask if any key was pressed? Nope. We send our hardware signals to a PIC. The PIC then interrupts the CPU to run a handler function for the specific hardware. This is how a keyboard would send keystrokes. Our system uses 2 PICs in Master Slave configuration to support 15 interrupt lines.

There is a physical hardware oscillator on the motherboard called the Programmable Interval Timer. When we enable interrupts, we're kinda unmuting this clock. This ticking forms the heartbeat of the OS. This allows us to step into multitasking, we can switch between different tasks when the timer ticks.

The PIC blocks any future interrupts if we don't send a End of Interrupt for the current interrupt. The keyboard interrupt works in a similar way to send scancodes.

Memory Allocation and Paging

The actual post about this is really informative and easy to understand. To say it in simpler terms, it's just mapping our memory for easier access. Pages are just virtual blocks of memory that are mapped to some physical frame (range of memory addresses). We use a multi-level tree structure (4-level) to map these pages efficiently.

This concept of pages, helps us to implement heap memory and allocation. We use the alloc crate and define the GlobalAlloc trait. This contains methods for allocation and deallocation, which are directly called by the compiler. We can hence create a heap memory for our kernel. In the blog series, he makes it 100KiB. You can check out the allocator.rs file.

Now, there are different types of allocators depending on how we manage the available pages and allocate them. The simplest one is a Bump Allocator, we just linearly allocate memory. There are others too namely LinkedList Allocator and Fixed-Size Block Allocator.

Multitasking

Pre-emptive Multitasking : The CPU interrupts a task and switches context.
Cooperative Multitasking : The CPU waits for a process to voluntarily give up control.

Cooperative Multitasking is primarily used with asynchronous operations. The Executor gives up control when the task is not ready and the CPU was focus on other tasks. Language-supported implementations of cooperative tasks are often even able to backup the required parts of the call stack before pausing.

For having multitasking here, we try to replicate Rust's default system for this using Futures and Executors and Wakers. See the original series for more details on this.

Well, that was it from the original blog post. After that, I tried my hand at making a file system on this and create, read, write and delete files. The OSTEP was especially helpful for this.

My Work

Disk

The first that would come to mind when trying to write a file system is where would we store it. The entire thing currently runs on QEMU. So I had to attach some kind of disk into it. You can find a file called disk.img in my repo. It is a 64MiB raw data file. It functions exactly like a disk.

You can create the disk.img file with: dd if=/dev/zero of=disk.img bs=1M count=64.

To work with this disk, we need some definitions. A standard kernel detects this as a block device, meaning it divides the memory space provided by this disk into blocks. The blocks are usually 512-bytes and we interact with them. Our code expects a BLOCK_SIZE of 512 bytes. The block size on the disk is specified by the bs=1M parameter. 1MiB is a multiple of 512 bytes. Why a multiple? Because the structures in my code use buffer sizes of 512 bytes. If we don't have a multiple of 512 here, the memory won't map cleanly.

But still, it is just a separation as of now. How do we actually use it? To interact with block devices there are drivers for them. They are commonly called VirtIO drivers and the disk would be called as a VirtIO Block Device. The VirtIO is a standardized interface for communication between hypervisors (QEMU) and a guest (this OS).

This VirtIO disk doesn't appear magically at a fixed address. It sits on the Peripheral Component Interconnect (PCI) bus. There could also be other components on this PCI bus, so we need to scan it and find our block device. The scanner I have in pci.rs is a brute force scanner that goes through every bus from 0 to 255. Yes, it is a combination of 256 buses in a single system. We go through every device, and function reading the Vendor ID. I look for 0x1AF4 which is signature of the VirtIO device. Once we find it, we read its Base Address Register (BAR) to map its internal registers into our memory.

A BAR is a configuration register that tells the CPU where a device's internal resources such as memory of I/O ports are mapped into the system's overall address space. Every PCI function has upto sx BARs (BAR0 - BAR5), each being 32 bits wide.

You can see in my code that I use the virtio_drivers crate for this. The scan() function we defined only discovers the device. The PciTransport is the layer that allows the VirtIO device drivers to the device discovered on the PCI bus. The PciTransport queries the BARs to discover the device's register region. It internally chooses the access method (MMIO or PIO) and exposes a transport object that the virtio device can use to read/write device registers.

Now, we got the address of the device internals. These are physical addresses. Our kernel uses virtual addresses. We need a translation layer between these. It is called a Hardware Abstraction Layer (HAL). You can find the implementation here.

The OsHal struct in my code performs the job of HAL. It has methods like dma_alloc() that allocate memory when the driver asks for memory. "Allocating memory" for a device is radically different from allocating memory for a variable. When the virtio device asks for a buffer, it needs the physical address because it has no idea about our page tables. The dma_alloc() function fetches a frame from the allocator we made earlier and returns the physical and virtual addresses of it. This can be used by the CPU and the virtio driver.

Continuously allocating and deallocating frames is costly, so when we deallocate using dma_dealloc() we store the frames in a DMA_FREE_LIST. When allocating we try to look at this list first. This frame was returned earlier, so we would have the liberty to overwrite it.

In simpler systems, to read 1KiB of data, the CPU would have to loop 1024 times, reading a byte from the disk port and writing it to RAM. This is called PIO (Programmed I/O). You may think of it a 'while' loop. This wastes a lot of CPU cycles, doesn't it?

When driver calls dma_alloc(). You grab physical frame and return its physical and virtual address. This is an empty bucket where the data will eventually land.

Let's say the CPU wants to read some data from the virtio block device. The driver running on the CPU writes a descriptor into a special region of memory (RAM) called a VirtQueue which is a large ring buffer. Simply writing here isn't enough, the CPU wakes up the device by writing to a Queue Notify Register often called the Doorbell. As the name suggests, this is a special tiny register on the device. There is a piece of hardware besides the actual disk called the Disk Controller.

Since we're in QEMU, everything is virtual, our disk controller would be virtio-blk and the disk is disk.img. In actual systems, you would call the disk controller as SATA or NVMe.

The descriptor consists of the source, destination and command. Once the doorbell is hit, this descriptor is read by the disk controller and it moves the data from the disk to the specified destination in RAM by the CPU without any further intervention from the CPU. The CPU can do whatever it wants after writing the descriptor, the rest is the job of the disk controller. If we want to write something into the disk, the CPU would write another descriptor with the source being some address in RAM and destination being some address in the disk. When the controller finishes, it raises an interrupt. This interrupt lets the CPU know that the job is done.

Note that the addresses on the disk side is specified by sectors. They use Logical Block Addressing which is just sector numbers.

So, the CPU has minimal work and the main transfer is handled by the disk controller. This is called Direct Memory Access (DMA).

File System

Now that are we done dealing with the disk, we can look at our file system called SFS. In simpler terms, a file system is an interface over the "dealings" with the block device that lets us create, delete, write into or read from files.

To make things more intuitive, I have a BlockDevice trait. It has methods called read_blocks(), write_blocks() and capacity(). The VirtIOBlk implements this trait. So, we can just take a virtio block device into our file system as input and directly work with it. Our file system, SFS, would handle all the business logic.

The OSTEP book has an amazing section on this.

Let's see some details. So, our disk is viewed as a blocks right? Here, let's say the blocks in my 512 bytes. This makes it easier because it would map 1 block of our file system to exactly 1 Physical Disk Sector (512 bytes).

So, there would be many blocks in the 64MB. We reserve some special blocks.

Superblock (Block 0) : This is the anchor of the file system. It contains a magic number which tells us if it is a valid SFS file system along with the locations of inodes and data bitmap blocks.
Inode Bitmap (Block 1) : You can view an inode as the representation of a file. This holds a bitmap table which tells about the inode table located at Block 3.
Data Bitmap (Block 2) : So the file is represented by an inode, but what about the data inside it? It is represented by data blocks.
Inode Table (Block 3) : This starts at Block 3 and I reserve 10% of the disk for this table. This occupies multiple other contiguous blocks. Each 512 byte block holds exactly 4 inodes, each being 128 bytes. You can refer the Inode struct.
Data Region (remaining blocks) : The remaining blocks after the inode table are all data tables.

We use bitmaps to represent the inode and data regions because bitmaps are efficient at representing if they are occupied or not. When we want to create a file, the file system looks for a free inode and returns that inode. Same goes for the data region.

Now, you would already have guessed that we need to somehow map each inode to the data blocks it uses. This is handled by the Inode struct itself inside layout.rs. You would see two fields called direct_pointers and indirect_pointer. For small files, we store the data block indices in direct_pointers which is 10 * 512 bytes. For larger files, instead of pointing to data, we point to Block of Pointers. The indirect_pointer is a u64 which is 8 bytes. So it can hold 512/8 = 64 new pointers.

Alright, so that was about files. What about directories? Well, we can just treat it as a file... I mean why not? We can just add a flag to a file denoting the file type: FileType::Directory. Instead of storing user data, it would store the filenames and their inode numbers. This is represented by a DiskDirEntry struct. When creating a directory, we add 2 entries: . and .. which represent itself and the directory parent respectively. This allows us to have path traversal.

Now, how would we traverse a path that we have? Let's take /home/me/test.txt for example. We start at the root (Inode 0). We read the data block to find the entry named home. Get its inode (say 5). We read 5's data block to find 'me', say 12. We read 12's data block to find a file named test.txt, say Inode 20. We read 20's data block to get the actual file content. You can see this logic in find_inode_by_path() function.

To interact with this, we can create something like a shell that would take our input and perform operations. It is a simple read-eval-print-loop. You can find the code for the shell commands here.

Thanks for reading till here. The code is not perfect. I have a lot of stale functions, code cleanups pending, features pending, a lot of things under development. The testing framework no longer works at the time of writing this. But this has been a great experience to see how things work. Will update this when I add something more. You can see the demo through the link at the top.

commit at this point: 72cce74

Hope you had a good read.

Table of Contents