Building a Linux Container Runtime from Scratch

68 comments

·March 26, 2025

pss314

I loved this hands-on presentation, Containers From Scratch by Liz Rice, from a few years ago: https://www.youtube.com/watch?v=8fi7uSYlOdc.

Today, Linux containers in (less than) 100 lines of shell by Michael Kerrisk was published https://www.youtube.com/watch?v=4RUiVAlJE2w.

seungwoolee518

Michael Kerrisk wrote a series of articles about Linux namespaces on lwn.net [0]

[0]: https://lwn.net/Articles/531114/#series_index

Brian_K_White

That bash/busybox demo is awesome. The code is at: https://man7.org/tlpi/code/ (/tlpi-dist/consh/ in the tar)

I still used lxc-utils in my rc script, which now seems positively like cheating; I may as well have used docker.

Brian_K_White

On my birthday while attending Arisia January 2010 I wrote a single rc script with about 30 non-boilerplate lines of bash (the 3 functions) that does:

  * start all enabled containers on boot
  * stop all running containers at shutdown (ie gracefully wait for them all to shut themselves down before letting the host proceed to shut itself down)
  * start/stop/status any specified container on command
  * list all containers (known/configured, running or not)
  * every container has a gnu screen console
  * simple config file per container to define network & root dir etc.
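To make the "simple config file per container" idea concrete, here is a minimal Python sketch. It is not the rclxc script (which is bash); the `key=value` file format and the `enabled`, `rootfs`, and `network` keys are assumptions invented for this illustration.

```python
# Hypothetical sketch: the key=value format and the key names are
# assumptions, not the format used by the rclxc package.

def parse_container_config(text):
    """Parse 'key=value' lines, skipping blank lines and '#' comments."""
    config = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        config[key.strip()] = value.strip()
    return config


def enabled_containers(configs):
    """Return the sorted names of containers whose config has enabled=yes."""
    return sorted(
        name
        for name, cfg in configs.items()
        if cfg.get("enabled", "no").lower() == "yes"
    )


if __name__ == "__main__":
    configs = {
        "web": parse_container_config(
            "enabled=yes\nrootfs=/srv/lxc/web\nnetwork=br0\n"
        ),
        "dev": parse_container_config(
            "# parked for now\nenabled=no\nrootfs=/srv/lxc/dev\n"
        ),
    }
    # A boot-time "start all" would iterate over this list.
    print(enabled_containers(configs))
```

A boot hook would loop over `enabled_containers(...)` and start each one; the shutdown hook would walk the running set and wait for each to exit.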

(these are the latest versions of the wiki page and the referenced rclxc package, but I created the wiki page and the script on Jan 18 2010, despite the wiki history. The weird link for the rpm is because home:aljex no longer exists on the opensuse build service)

https://en.opensuse.org/SDB:LXC

https://anna.lysator.liu.se/pub/opensuse/repositories/home%3...

Whopping 3 files in the package, and one is just a symlink, and the other is just a single rmdir command. No daemon, the script only runs to do something. Not even systemd, just plain old sysv init.

I never developed it beyond essentially proof of concept because my company's owner listened to vmware salespeople, but I did use it in quasi-production for a year or two. (some developer vms, a few internal services, 20 or so customers)

But to me it did prove the concept and I would have liked to just work on that instead of using vmware or anything else. I completely gag when I look at kubernetes or even just podman when I had this so long ago and got so much function out of so little code and complication.

I mean it would obviously get larger and more complicated as it grew to handle more cases and supply more features. I think I just always want to stop at the 90/10 place where you get 90% of the functionality with 10% of the code, and the remaining 10% of the functionality requires 10x the initial code. I feel like once you cross that point you have wandered off the track and are now doing bad engineering in some way and need to go back and figure out where you started driving in your sleep and get back on track solving the problem of getting the necessary job done in some sensible way.

kubafu

> I think I just always want to stop at the 90/10 place where you get 90% of the functionality with 10% of the code, and the remaining 10% of the functionality requires 10x the initial code.

And that should be the right approach 90% of the time. Thanks for your comment!

Joker_vD

> Importantly, we designed Styrolite with full awareness that Linux namespaces were never intended as hard security boundaries—a fact that explains why container escape vulnerabilities continue to emerge. Our approach acknowledges these limitations while providing a more robust foundation.

So what do you do, exactly?

klysm

Say “it’s probably fine” and hope that the people building the foundational systems are protecting us

Joker_vD

No, I mean, what do the Edera developers do differently in order to provide a more robust foundation with this new container runtime called Styrolite? They still use Linux namespaces, as far as I can tell from TFA.

denhamparry

Edera developer here, we use Styrolite to run containers with Edera Protect. Edera Protect creates Zones to isolate processes from other Zones so that if someone were to break out of a container, they'd only see the zone processes. Not the host operating system or the hardware on the machine. The key difference here between us and other isolation implementations is that there is no performance degradation, you don't have to rebuild your container images, and that we don't require specific hardware (e.g. you can run Edera Protect on bare metal or on public cloud instances and everything else in-between).

flkenosad

Anyone know if it's possible to update the Linux kernel so that namespaces are hard security boundaries? I wonder what that would entail.

eyberg

When we speak of 'hard security boundaries', most people in this space are comparing to existing hardware-backed isolation such as virtual machines. There are many container escapes each year because the chunk of api that they are required to cover is so large, but more importantly containers don't have isolation at the CPU level (e.g. Intel VT-x, with instructions such as VMREAD, VMWRITE, VMLAUNCH, VMXOFF, VMXON).

This is what the entire public cloud is built on. You don't really read articles that often where someone is talking about breaking vm isolation on AWS and spying on the other tenants on the server.

vaylian

> There are many container escapes each year because the chunk of api that they are required to cover is so large

What API? The kernel syscall API?

If we assume for a moment, that there are no bugs in the Linux namespace implementation, would containers be as safe as virtual machines?

flaminHotSpeedo

> This is what the entire public cloud is built on.

Well... The entire public cloud except Azure. They've been caught multiple times for vulnerabilities stemming from the lack of hardware backed isolation between tenants.

GardenLetter27

A lot of use cases don't want that though. It's nice having lightweight network namespaces for example, just to separate the network stack for tunneling but still have X and Wayland working fine with the applications running there.
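A small Python sketch of how one might check which namespaces two processes actually share, for example to confirm that only the network namespace differs. It relies on the documented `/proc/<pid>/ns/*` symlinks, whose targets look like `net:[4026531840]`; two processes are in the same namespace when the inode numbers match. The function names are made up for this example, and reading another process's `ns` directory may require extra privileges.

```python
import os


def parse_ns_link(link):
    """Split a /proc/<pid>/ns/* link target such as 'net:[4026531840]'."""
    kind, _, rest = link.partition(":[")
    if not rest.endswith("]"):
        raise ValueError(f"unexpected namespace link: {link!r}")
    return kind, int(rest[:-1])


def read_namespaces(pid="self"):
    """Map each entry in /proc/<pid>/ns to its namespace inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in os.listdir(ns_dir):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result


def isolated_namespaces(a, b):
    """Namespace entries present in both maps whose inodes differ."""
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])


if __name__ == "__main__":
    if os.path.isdir("/proc/self/ns"):
        mine = read_namespaces()
        parent = read_namespaces(os.getppid())
        # An empty list means this process shares every namespace
        # with its parent.
        print(isolated_namespaces(mine, parent))
```

For the tunneling setup described above, one would expect only `net` to show up in the result, with `mnt`, `ipc`, and the rest still shared so that X and Wayland sockets remain reachable.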

fulafel

Have a look at gVisor for one approach.

z3t4

Once you have set up the namespaces you drop all capabilities so if the program gets hacked while it's running it can do very little.
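One way to verify that the drop actually took effect is to read the `CapEff` line of `/proc/<pid>/status` and decode the hex bitmask. The Python sketch below does only the checking side, not the dropping; the function names are invented for this example, and bits newer than the table are silently ignored.

```python
# Capability names indexed by bit number, following linux/capability.h.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask):
    """Return the names of the capabilities set in an integer bitmask."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def effective_caps(status_text):
    """Extract and decode the CapEff line from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(int(line.split()[1], 16))
    raise ValueError("no CapEff line found")


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as fh:
            # An empty list means the process holds no effective capabilities.
            print(effective_caps(fh.read()))
    except FileNotFoundError:
        print("no /proc/self/status here (not Linux?)")
```

An empty result for the contained process is what "drop all capabilities" should look like; anything such as `CAP_SYS_ADMIN` remaining is worth a second look.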

denhamparry

Edera developer here. I agree! But there are instances we need to run with additional capabilities, and we’re also dependent on people knowing how to do the right thing. We’re trying to improve this by setting this by default, but also improving the overall performance and efficiency of running containers

znpy

honest question: how is this any better than running non-root containers?

They can do very little anyway, that way.

sys_call

Non-root containers still operate under a shared kernel. Non-root containers that run under a vulnerable kernel can lead to privilege escalation and container escapes.

Styrolite is a container runtime engine that runs containers in a virtual machine guest environment with no shared kernel state. It uses a type 1 hypervisor to fully isolate a running container from the node and other containers. It's similar to Firecracker or Kata containers, but doesn't require bare metal instances (runs on standard EC2, etc) and utilizes paravirtualization.

seungwoolee518

When I was digging into containers (i.e. how they use Linux namespaces and capabilities), lwn.net's series of articles helped me a lot.

https://lwn.net/Articles/531114/#series_index

shortrounddev2

I've seen many examples of people creating containers for Linux; I wish it were comparably easier to create containers for Windows. The fundamental software exists on Windows (AppContainers are how UWP apps work) but the documentation around AppContainers is very sparse/opaque because Microsoft doesn't want you to use AppContainers to make a general purpose sandbox environment like Snap or Flatpak; they want you to write UWP apps. It would be immensely helpful if you could run any arbitrary win32 or higher application in a sandboxed AppContainer where the NT System calls only had access to, say, the application's local folder and its %APPDATA% folder.

Alas, I think that Microsoft has simply given up on Native application support on Windows. Currently the only good way to write native apps for windows is still Win32/MFC and Winforms.

In fact, I think that secretly even Microsoft knows that everyone hates their UI frameworks/runtimes (and the fact that Microsoft deprecates them 2 years into their lifespan) because Microsoft STILL provides modern .Net 8/9 bindings for Winforms in 2025. If only they would just replace the GDI renderer with Direct2D, it would be literally perfect

pjmlp

Windows containers exist; they are based on job objects, and Microsoft took the approach of using the same APIs the Docker world expects, as a means to integrate with the DevOps container world's expectations.

https://learn.microsoft.com/en-us/virtualization/windowscont...

You missed GDI+, Direct2D API is a COM mess that we only put up with because DirectX, and DirectX team doesn't like .NET, thus nothing like XNA or Managed DirectX will ever happen again.

WPF also exists, and since Build 2025 has regained parity with WinUI as an official Windows GUI framework that isn't in maintenance mode, unlike Forms and MFC.

However, WinUI 3.0 with WinAppSDK has been a mess of a project since Project Reunion was announced back in 2021, and after almost four years it is still a shadow of UWP tooling. This is where I agree with you: it was so badly managed that nowadays only the Windows development team really cares about it, most likely because their job depends on having to use WinUI.

But if you so wish to go through the pains of WinUI, there is Win2D.

shortrounddev2

While windows containers exist, the documentation surrounding them at the API level is sparse. Anything from Azure just tells you to use docker.

As far as I can tell GDI+ is still software rendered? DirectX COM objects aren't difficult to work with at all; I've never understood why people hate them so much. The point of using Direct2D would be to provide hardware rendering for Winforms.

WPF is OK compared to WinUI 3 but it still suffers from XAML.

pjmlp

Because the API was designed to be compatible with Docker tooling.

GDI and GDI+ have been hardware accelerated for years now:

https://learn.microsoft.com/en-us/windows-hardware/drivers/d...

Maybe because COM tooling sucks. In C++ land, Microsoft re-invents its approach to COM every couple of years, and it is too much C/C++ style instead of being a proper modern C++ approach to handling COM.

While on .NET land, DirectX team couldn't care less, and leaves the community the work to make the interop work without issues.

The XAML hate comes mostly from outside traditional Windows developer circles.

m00dy

We are an algorithmic trading company [0], and our trading strategies are primarily built as pure Rust libraries. We've been searching for a way to sandbox the strategies we host, as not all of them are signed or open source for verification. Styrolite seems like a promising solution to address this issue, so we’re planning to give it a try.

[0]: https://cycletop.xyz

denhamparry

Edera developer here! Thank you for sharing and any feedback you have would be great! Edera Protect is written in Rust too, and our focus is also performance as well as isolation.

pzmarzly

Why not use any of the existing OCI Runtimes? They take well-defined[0] JSON description as input, and are pretty well-contained (single static binary). And because they are separate binaries, not libraries, you don't need to worry about things like thread safety or FD leaking.

[0] https://github.com/opencontainers/runtime-spec/blob/main/con...
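For readers who have not seen that JSON description, here is a minimal Python sketch that builds a bare-bones `config.json` in the shape the OCI runtime spec defines (`ociVersion`, `process`, `root`, `linux.namespaces`). The helper name, the `rootfs` default, and the hostname are assumptions for illustration; the result has not been validated against any particular runtime, and real workloads typically also need a `mounts` section (for `/proc` and friends).

```python
import json

# Empty capability sets: the process starts with no capabilities at all.
NO_CAPS = {
    "bounding": [],
    "effective": [],
    "inheritable": [],
    "permitted": [],
    "ambient": [],
}


def minimal_oci_config(args, rootfs="rootfs"):
    """Build a minimal OCI-style runtime config as a plain dict (sketch)."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            "args": list(args),
            "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
            "cwd": "/",
            "capabilities": {k: list(v) for k, v in NO_CAPS.items()},
            "noNewPrivileges": True,
        },
        "root": {"path": rootfs, "readonly": True},
        "hostname": "sketch",  # hypothetical name for this example
        "linux": {
            "namespaces": [
                {"type": t}
                for t in ("pid", "network", "ipc", "uts", "mount")
            ],
        },
    }


if __name__ == "__main__":
    # An OCI runtime reads this as config.json from the bundle directory.
    print(json.dumps(minimal_oci_config(["/bin/sh"]), indent=2))
```

Because the runtime is a separate binary, the caller's whole job is to write this file next to a root filesystem and exec the runtime against that bundle.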

zamalek

"I don't need the full capabilities of OCI." In my (now very much stagnating) Nix-like pet project[1] I merely want a hermetic build environment. Rolling my own container runtime was no more difficult than, what would likely be, a nightmare of emulating a complete OCI container for the simple purpose that I'm after.

Simple problems need simple solutions, and OCI is really complex. I was initially overjoyed by the prospect of deleting my code, but it looks like this project doesn't have rootless/shadowutils support yet (which is solely useful for not having to worry about su or caps during development).

[1]: https://github.com/porkg/porkg/tree/rs

r3trohack3r

I’m currently exploring this for an AI context because I haven’t found a better solution for letting K8S manage AI workloads that need direct GPU access on OSx

denhamparry

Edera developer here. Edera Protect is being developed to manage access to the GPU hardware on a Node with the containers running your workloads. We talk a lot about isolation between containers, but we're also focused on adding this isolation throughout the stack, from containers/processes down to hardware.

pm90

You're running a kubernetes cluster with nodes that are running OSx?

brcmthrowaway

Why are you building AI anything

harha_

The beginning of the article answers your question.

infogulch

How does this compare to recently discussed Landrun?

https://news.ycombinator.com/item?id=43445662

cedws

Isn’t the gold standard of containerisation gVisor? Can’t get much more restrictive than proxying and filtering syscalls. As far as I remember it’s the default runtime on GKE.

denhamparry

Edera developer here. gVisor is restrictive, but it comes at a cost in performance. Personally, I'd say Edera Protect is one level deeper. We create Edera Protect Zones to provide isolation, so we create a Zone that is isolated from the OS and hardware of the machine running the container. So we don't proxy or filter syscalls, as the isolation is a layer deeper. We are also focused on ensuring that Edera Protect is as performant as (if not better than) running a container today with containerd.

Finally, if you wanted to, you could run gVisor within Edera Protect, but we feel that Edera Protect would already provide the security benefits that gVisor offers.

cedws

Thanks, but what is a “Protect Zone” at a technical level? Why does it provide stronger isolation than syscall filtering?

raesene9

How would you say it compares to Firecracker?

raesene9

If you want better isolation than is provided by Linux namespaces et al, then yep something like gVisor or Firecracker (https://firecracker-microvm.github.io/) provide a likely better level of isolation.

sys_call

gVisor runs a userspace kernel that proxies syscalls to a shared host kernel. Running an "application kernel" in userspace impacts performance because it goes through two schedulers. Virtual machine isolation is more restrictive because it doesn't share any kernel state with other containers. We have a whitepaper that compares the performance of gVisor and Styrolite/Edera if you want to see the differences: http://arxiv.org/abs/2501.04580

TechDebtDevin

Cookie consent card won't disappear. Brave mobile.

elboulangero

Same with Firefox on Android...

shellwizard

No problem here. FF Android + uBO hard-mode

asicsp

HN