
A peek into the world of microVMs.

Written on Dec 19, 2024.

MicroVMs are the new kid on the virtualization block. Some call them the love child of containers and virtual machines, while others call them the future of efficient virtualization. The magic is that a microVM carries the ruthless efficiency of containers, with startup times in the milliseconds, yet runs a full guest kernel just like a regular VM.

The architecture and terminology of “microVMs” originate from the Firecracker project. Back in 2014 when AWS Lambda was launched, serverless workloads were actually delegated to dedicated, per-customer EC2 instances (full VMs) for isolation [1], which of course was not very efficient. [2] As serverless demand grew, the team at AWS decided to invest in a new virtualization stack, with a focus on efficiency and container compatibility.

Firecracker’s architecture

Firecracker is written in Rust and minimal by design. The lightweight virtual machine monitor (VMM) emulates a minimal device model without any unnecessary legacy devices. The essential x86 interrupt controller (PIC) and interval timer (PIT) are emulated by KVM itself, and the rest of the emulated devices are just the PS/2 keyboard controller, the serial console and the virtio block and network devices. [3]

Basically, the essential devices are emulated in-kernel, while the rest are kept deliberately simple to emulate in userspace, all in the name of performance.

The boot process is greatly simplified as well. Firecracker doesn’t go through firmware or a bootloader like BIOS or UEFI; instead it bootstraps directly into the guest kernel with the initial stack and command line arguments already set up. [4]

That also explains why Firecracker does not support booting from an .iso and requires you to provide a guest kernel upfront. We also don’t use a whole disk image like .qcow2 or .vmdk; instead we supply just the rootfs partition as an ext4 image.

json
{
  "boot-source": {
    "kernel_image_path": "vmlinux.bin",                     
    "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",  
    ...
  "drives": [
    {
      "drive_id": "rootfs",
      "path_on_host": "bionic.rootfs.ext4",                 
      "io_engine": "Sync",
    ...

The above is taken from an example JSON config in the Firecracker repo. [5] We will get to it in the demo.

You could almost say Firecracker is tailor-made for serverless workloads, especially for running Linux virtual machines on Linux hypervisors, because (a) it depends on KVM, which means the host can only be Linux; and (b) the lack of ACPI and firmware means it is hardly suitable for booting most other standard operating systems.

But it’s worth mentioning that efforts have been made in FreeBSD [6] [7], NetBSD [8] [9] and OSv [10] to boot as guests on Firecracker, which really demonstrates the versatility of the FOSS community, and that microVMs aren’t limited to Linux only.

Demo run with Firecracker

When started, Firecracker runs an HTTP server in the background, bound to a Unix socket that defaults to /run/firecracker.socket. You are then supposed to curl the socket with some pretty nerve-racking HTTP PUT requests. [11]
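
For a taste, here is a sketch of one such request, modeled on the boot-source endpoint from the Firecracker getting-started guide; the socket path and kernel path are just the documented defaults, and the snippet checks for the socket first so it fails gracefully when Firecracker isn’t running:

```shell
# Configure the guest kernel through Firecracker's API socket.
# PUT /boot-source is one of the documented endpoints; vmlinux.bin
# and boot_args below mirror the getting-started example.
sock=/run/firecracker.socket
if [ -S "$sock" ]; then
  curl --unix-socket "$sock" -i \
    -X PUT 'http://localhost/boot-source' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
          "kernel_image_path": "vmlinux.bin",
          "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
        }'
else
  echo "no Firecracker socket at $sock"
fi
```

You would repeat this dance for the drives, network interfaces and so on before finally starting the instance.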

I will spare you the crazy gibberish because Firecracker also gives you the option to initialize declaratively with a JSON config, which in my opinion is much nicer.

md
--config-file <config-file>
    Path to a file that contains the microVM configuration in JSON format.

So, say you have a machine with write access to KVM, i.e. test -w /dev/kvm succeeds; you then only need four things to run an Ubuntu 24.04 (Noble) microVM: the firecracker binary, a kernel image, a rootfs image and a JSON config.

We can prepare all of that with a script; this is what I adapted from the Firecracker repo. All the downloading and preparation can be done separately on another machine.

bash
# Download and extract the Firecracker binary.
arch="$(uname -m)" # either x86_64 or aarch64
release_url='https://github.com/firecracker-microvm/firecracker/releases'
version="$(basename $(curl -sLo /dev/null -w %{url_effective} "${release_url}/latest"))"
curl -L "${release_url}/download/${version}/firecracker-${version}-${arch}.tgz" | tar -xz
cp -a "release-${version}-${arch}/firecracker-${version}-${arch}" firecracker

# Download the kernel image, ext4 image and vm_config.json.
curl -LO "https://s3.amazonaws.com/spec.ccfc.min/firecracker-ci/v1.11/${arch}/vmlinux-6.1.102"
curl -LO "https://s3.amazonaws.com/spec.ccfc.min/firecracker-ci/v1.11/${arch}/ubuntu-24.04.squashfs"
curl -LO "https://raw.githubusercontent.com/firecracker-microvm/firecracker/master/tests/framework/vm_config.json"

# Convert the squashfs image to ext4.
sqfs2tar ubuntu-24.04.squashfs > ubuntu-24.04.tar     # requires squashfs-tools-ng
dd if=/dev/zero of=ubuntu-24.04.ext4 bs=400M count=1
mkfs.ext4 -d ubuntu-24.04.tar ubuntu-24.04.ext4       # requires e2fsprogs with libarchive

# Change the image names in vm_config.json.
sed -i "s|vmlinux.bin|vmlinux-6.1.102|; s|bionic.rootfs.ext4|ubuntu-24.04.ext4|" vm_config.json
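
Before booting, a quick sanity check that everything is in place never hurts; the file names below simply follow the script above:

```shell
# Verify the four prerequisites: binary, kernel image, rootfs and config.
for f in firecracker vmlinux-6.1.102 ubuntu-24.04.ext4 vm_config.json; do
  if [ -e "$f" ]; then echo "ok: $f"; else echo "missing: $f"; fi
done
# And make sure we can actually talk to KVM.
if [ -w /dev/kvm ]; then echo "ok: /dev/kvm is writable"; else echo "missing: write access to /dev/kvm"; fi
```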

Note that the Firecracker repo only provides squashfs images, so we need to convert them to ext4 manually. The above is my “optimal” solution, because it neither requires root nor mangles the ownership and permissions inside the rootfs.

The networking setup is optional, but requires you to create a tap device, enable IPv4 forwarding and set up NAT with nft on the host. Then attach a network interface to the guest in vm_config.json and configure eth0 after boot. [12] Alternatively, you can configure eth0 directly from the guest boot_args. [13]
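
To make the host side concrete, here is a rough sketch of that setup. The device name (tap0), the guest subnet and the uplink interface (eth0) are assumptions for illustration, and since these commands need root, the snippet only prints them as a dry run unless you set APPLY=1:

```shell
# Host-side networking for a Firecracker guest: a tap device, IPv4
# forwarding and NAT with nft. Dry run by default; APPLY=1 executes.
run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "+ $*"; fi; }

run ip tuntap add tap0 mode tap
run ip addr add 172.16.0.1/24 dev tap0
run ip link set tap0 up
run sysctl -w net.ipv4.ip_forward=1
run nft add table ip nat
run nft 'add chain ip nat postrouting { type nat hook postrouting priority 100 ; }'
run nft add rule ip nat postrouting ip saddr 172.16.0.0/24 oifname eth0 masquerade
```

On the guest side, instead of configuring eth0 after boot, the kernel’s ip= boot parameter can do it during boot, e.g. by appending something like ip=172.16.0.2::172.16.0.1:255.255.255.0::eth0:off to boot_args. [13]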

Anyway, after getting everything ready, just do:

bash
# Start the microVM.
./firecracker --api-sock ./firecracker.socket --config-file ./vm_config.json

This is what it looks like when I booted it on my Pixel 8 with the stock GKI kernel (plus KernelSU), where KVM is fully functional with root access.

The boot process is so fast that it doesn’t feel like booting a full VM at all.

Besides, from the perspective of an Android modder/enthusiast, this is something I had never dreamed of until this [14] came out. Back then, I was still working on a custom kernel that barely supported containers with ugly patches. [15]

And now with Firecracker, almost every root user on a GKI kernel can do it, with just a compact binary and a ~400MiB partition image; no custom kernels, no large, static, impossible-to-compile QEMU. A few things to download and you are good to go.

The benefit here is obvious. The design of microVMs is simple enough that the VMM behind them can be made just as lightweight, and that means portability, more than just speed.

It’s not general-purpose

Still, a common criticism of this minimal device model is that eventually you need the very things the model excluded for speed in the first place, at which point you end up needing something full-blown like QEMU again. Besides, Firecracker has a very limited feature set, eliminates guest interaction with the host kernel, and is not designed to be general-purpose. [16]

Firecracker lacks support for other networking stacks and for shared host-to-guest directories, both common in QEMU, Cloud Hypervisor and other hypervisors. Maintainers have more or less shot down suggestions like macvtap, vhost-net and vhost-user-net, as well as p9fs and virtio-fs, citing reasons like a large attack surface and the features not being on Firecracker’s roadmap. [17] [18] [19]

Another thing is that Firecracker doesn’t yet have PCIe, and hence GPU, support, even as hardware-accelerated inference becomes common in serverless workloads. [20] While the Firecracker team is willing to shepherd the changes, it is again not a priority, and they have “outsourced” the implementation to the community. [21]

Inspired by crosvm

Firecracker isn’t the first VMM to implement a microVM-like virtual machine. Coming from ChromiumOS, crosvm is the original implementation that concentrates on paravirtualized devices, and it is used to run Linux/Android guests on ChromeOS devices to this day. It is the predecessor of Firecracker, is also written in Rust, and uses the same Rust VMM crates. [22]

The Firecracker team still collaborates with the crosvm maintainers and many others to build Rust virtualization components in the rust-vmm community. [23]

QEMU’s adoption of microVM

The QEMU Project adopted the microVM architecture as a special machine model around 2019, a year after Firecracker’s initial release. [24] QEMU is a full-blown hypervisor that supports both virtualization and emulation, and likely benefits less from a minimal model, since its purpose is more or less “to cover all there is”.

Nevertheless, the patch author said that it “establishes a baseline for benchmarking and optimizing both QEMU and guest operating systems, since it is optimized for both boot time and footprint.” [24:1]

While similar in concept, as QEMU’s implementation also emulates fewer devices, it uses a custom firmware, qboot, to boot Linux directly [25], only supports i386 and x86_64, and seemingly does not depend on KVM (i.e. it can use QEMU’s own TCG emulation) [26]. It also supports shared directories with virtiofsd magic [27].

My failed attempt on Apple Silicon

I was hoping to try it out on my MacBook Air M2, but to no avail. I am not sure whether QEMU’s implementation depends on KVM, but booting vmlinux without it results in a kernel panic. Obviously, we can’t use Apple Silicon’s HVF to virtualize an x86_64 target, so I am stuck here.

If you are curious, here is the command I used to boot the kernel:

bash
qemu-system-x86_64 \
  -M microvm -m 512m -smp 2 \
  -bios ./bios-microvm.bin \
  -kernel ./vmlinux-6.1.102 -append "earlyprintk=ttyS0 console=ttyS0 root=/dev/vda" \
  -nodefaults -no-user-config -nographic -serial stdio \
  -drive if=none,file=./debian-sid-nocloud-amd64-daily.raw,format=raw,id=test \
  -device virtio-blk-device,drive=test

And the kernel crash log that follows:

md
[    0.056725] divide error: 0000 [#1] PREEMPT SMP NOPTI
[    0.056891] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.1.102 #1
[    0.057035] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    0.057218] RIP: 0010:kmem_cache_init_late+0x0/0x2b
[    0.057712] Code: 89 e7 e8 fe 06 e6 fe 4c 8b 25 37 bc 2f 00 4d 85 e4 75 bd 48 c7 c7 c0 b8 2b 82 e8 c6 e0 61 ff 31 c0 5b 41 5c 5d e9 4b dc 88 ff <55> 31 d2 be 08 00 00 00 48 c7 c7 9d 21
[    0.058075] RSP: 0000:ffffffff82203ee0 EFLAGS: 00010246
[    0.058170] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000020
[    0.058268] RDX: 00000000000009a0 RSI: ffff88801ffd5600 RDI: 0000000000000000
[    0.058364] RBP: ffffffff82203f18 R08: 0000000000000002 R09: 0000000000000000
[    0.058460] R10: ffffffff82301798 R11: 0000000000000270 R12: 0000000000000000
[    0.058558] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
[    0.058680] FS:  0000000000000000(0000) GS:ffff88801f400000(0000) knlGS:0000000000000000
[    0.058794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.058867] CR2: ffff888002801000 CR3: 0000000002210000 CR4: 00000000000000b0
[    0.059010] Call Trace:
[    0.059328]  <TASK>
[    0.059507]  ? show_trace_log_lvl+0x1c5/0x28b
[    0.059612]  ? show_trace_log_lvl+0x1c5/0x28b
[    0.059677]  ? x86_64_start_reservations+0x24/0x2a
[    0.059739]  ? show_regs.part.0+0x1e/0x24
[    0.059803]  ? __die+0x59/0x99
[    0.059855]  ? die+0x2b/0x50
[    0.059907]  ? do_trap+0xca/0x120
[    0.059963]  ? do_error_trap+0x6b/0x90
[    0.060021]  ? slab_sysfs_init+0xf7/0xf7
[    0.060070]  ? exc_divide_error+0x3a/0x60
[    0.060130]  ? slab_sysfs_init+0xf7/0xf7
[    0.060189]  ? asm_exc_divide_error+0x1b/0x20
[    0.060268]  ? slab_sysfs_init+0xf7/0xf7
[    0.060328]  ? start_kernel+0x352/0x499
[    0.060388]  x86_64_start_reservations+0x24/0x2a
[    0.060516]  x86_64_start_kernel+0x72/0x7c
[    0.060576]  secondary_startup_64_no_verify+0xcd/0xdb
[    0.060661]  </TASK>
[    0.061000] ---[ end trace 0000000000000000 ]---
[    0.061120] RIP: 0010:kmem_cache_init_late+0x0/0x2b
[    0.061189] Code: 89 e7 e8 fe 06 e6 fe 4c 8b 25 37 bc 2f 00 4d 85 e4 75 bd 48 c7 c7 c0 b8 2b 82 e8 c6 e0 61 ff 31 c0 5b 41 5c 5d e9 4b dc 88 ff <55> 31 d2 be 08 00 00 00 48 c7 c7 9d 21
[    0.061394] RSP: 0000:ffffffff82203ee0 EFLAGS: 00010246
[    0.061463] RAX: 0000000000000046 RBX: 0000000000000000 RCX: 0000000000000020
[    0.061546] RDX: 00000000000009a0 RSI: ffff88801ffd5600 RDI: 0000000000000000
[    0.061628] RBP: ffffffff82203f18 R08: 0000000000000002 R09: 0000000000000000
[    0.061710] R10: ffffffff82301798 R11: 0000000000000270 R12: 0000000000000000
[    0.061793] R13: 00000000000000a0 R14: 0000000000000000 R15: 0000000000000000
[    0.061876] FS:  0000000000000000(0000) GS:ffff88801f400000(0000) knlGS:0000000000000000
[    0.061969] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.062036] CR2: ffff888002801000 CR3: 0000000002210000 CR4: 00000000000000b0
[    0.062230] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.062706] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

I have also tried it inside a QEMU VM without KVM, and this time it just got stuck with no kernel panic. Maybe the microvm machine is missing the PIT without KVM, but I don’t know.

md
[    0.000000] tsc: Unable to calibrate against PIT
[    0.000000] tsc: No reference (HPET/PMTIMER) available
[    0.000000] tsc: Marking TSC unstable due to could not calculate TSC khz

Cloud Hypervisor looks great, despite the name

Besides Firecracker and QEMU, there are other notable projects that use the microVM architecture, like Cloud Hypervisor from Intel. Boring name aside, it is also written in Rust and is based on the same Rust VMM crates that Firecracker uses.

It bears the spirit of Firecracker by providing lightweight, fast virtualization with a minimal device model. The main difference is that it aims to be general-purpose for cloud workloads, not limited to container/serverless or client workloads. [28]


The Microsoft Hypervisor (MSHV)

But a new thing I noticed in the Cloud Hypervisor repo is that, other than running on KVM, it also supports running on the Microsoft Hypervisor. [28:1]

I did find the name “Microsoft Hypervisor” a bit confusing, but it turns out to be the core hypervisor of the entire Hyper-V stack [29], said to be a standalone binary that is conceptually untied to Windows. [30] It is the only part of the Hyper-V stack ported to Linux, while the rest of the stack, like the Windows kernel and userspace, stays behind, perhaps replaced by Linux counterparts. You can access the hypervisor via an IOCTL interface at /dev/mshv, similar to /dev/kvm. [29:1]

Public development of the Microsoft Hypervisor is messy; on one hand you see related kernel patches being submitted to LKML, but on the other hand you can’t actually obtain the hypervisor itself. Most of the research and development is happening behind closed Microsoft doors [31]; the required components are not publicly available and are currently unlicensed, though it has been said that they will eventually be released in whole. [32]

I guess if anyone wants to work on this, now is probably not the time unless you are sponsored. I will stick with KVM for the time being.


Conclusion

The advancement of microVMs is slowly revolutionizing software deployment, cloud computing and container ecosystems. It’s a new way of thinking about native virtualization: not just better performance, but also simplicity and portability. MicroVMs also help with security by reducing the attack surface. As their adoption grows, microVMs are set to shape a more efficient and secure technological landscape.

Thanks for making it through the first public blog post I have written on the internet! I hope you picked up something new about microVMs and the projects around them, as I did while writing this. Take a look at the references below if you are interested in digging deeper.


  1. Why did you develop Firecracker? | Firecracker FAQ ↩︎

  2. Firecracker – Lightweight Virtualization for Serverless Computing ↩︎

  3. The Firecracker guest model | How AWS Firecracker works: a deep dive ↩︎

  4. Modern kernel booting | How AWS Firecracker works: a deep dive ↩︎

  5. vm_config.json | firecracker-microvm/firecracker ↩︎

  6. Announcing the FreeBSD/Firecracker platform ↩︎

  7. FreeBSD on Firecracker ↩︎

  8. Fakecracker: NetBSD as a Function Based MicroVM ↩︎

  9. SmolBSD: make your own BSD UNIX MicroVM ↩︎

  10. Making OSv Run on Firecracker ↩︎

  11. Starting Firecracker | Getting Started with Firecracker ↩︎

  12. Getting Started Firecracker Network Setup / On the Host | firecracker-microvm/firecracker ↩︎

  13. Getting Started Firecracker Network Setup / Advanced: Guest network configuration using kernel command line | firecracker-microvm/firecracker ↩︎

  14. Full-blown virtual machines with the KVM hypervisor (near-native performance) on Pixel 6 + Android 13 DP1 | Twitter ↩︎

  15. usertam/dumpling-lineage-kernel ↩︎

  16. Deep dive into firecracker-containerd ↩︎

  17. https://github.com/firecracker-microvm/firecracker/issues/4364 ↩︎

  18. Host Filesystem Sharing · Issue #1180 | firecracker-microvm/firecracker ↩︎

  19. https://github.com/firecracker-microvm/firecracker/pull/1351#issuecomment-667085798 ↩︎

  20. https://github.com/firecracker-microvm/firecracker/issues/1179 ↩︎

  21. https://github.com/firecracker-microvm/firecracker/discussions/4845 ↩︎

  22. crosvm - The ChromeOS Virtual Machine Monitor ↩︎

  23. rust-vmm ↩︎

  24. hw/i386: Introduce the microvm machine type ↩︎ ↩︎

  25. https://gist.github.com/sameo/746d8dacfa03f58ee565d8728a2a7dfe ↩︎

  26. https://news.ycombinator.com/item?id=21473657 ↩︎

  27. “Microvm, qboot and feature reduced qemu in Ubuntu” ↩︎

  28. Cloud Hypervisor ↩︎ ↩︎

  29. Introduce /dev/mshv drivers ↩︎ ↩︎

  30. MSHV on Windows · Issue #3453 | cloud-hypervisor/cloud-hypervisor ↩︎

  31. Hyper-V on Linux (yes, this way around) ↩︎

  32. Document if this is usable outside of Microsoft · Issue #155 | rust-vmm/mshv ↩︎