Mini container runtime in Go

10/02/2024 - Estimated reading time: 4 minutes — Building a tool to build OCI container images

This is a direct continuation of the post on Linux kernel namespaces. Since most of the theory has already been covered there, this part focuses on the code.

We start by getting the busybox filesystem.

mkdir busybox
cd busybox
wget https://github.com/jpetazzo/docker-busybox/raw/master/rootfs.tar
tar xvf rootfs.tar

To run the mini runtime, execute `sudo go run ./cmd/minicr/ /home/user/busybox`, replacing `/home/user/busybox` with the path to your extracted rootfs. The runtime will execute /bin/bash inside busybox's filesystem.

Code is located here: https://github.com/pkorzh/container-build-tool/tree/v0

Code stability: it works on my machine.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"runtime"
	"syscall"
	"unsafe"

	"github.com/spf13/cobra"
	"golang.org/x/sys/unix"
)

The C code is posted at the end of the page.

/*
#include <stdlib.h>

void exec_ps();
void create_argv(int len);
void set_argv(int pos, char *arg);
*/
import "C"

var rootCmd = &cobra.Command{
	Use:          "minicr",
	Long:         "Mini containers runtime",
	SilenceUsage: true,
	Args:         cobra.ExactArgs(1),
	RunE: func(cmd *cobra.Command, args []string) error {

We re-execute ourselves, effectively running a bootstrap process inside the new namespaces. For an alternative implementation, see Docker's reexec package.

		reexec := exec.Command("/proc/self/exe", "bootstrap", args[0])
		reexec.Stderr = os.Stderr
		reexec.Stdin = os.Stdin
		reexec.Stdout = os.Stdout
		reexec.SysProcAttr = &unix.SysProcAttr{
			Cloneflags: unix.CLONE_NEWUTS | unix.CLONE_NEWNS | unix.CLONE_NEWPID,
		}
		return reexec.Run()
	},
}

This is the bootstrap process. The command is hidden so that end users don't see it.

var bootstrapCmd = &cobra.Command{
	Use:    "bootstrap",
	Long:   "Configure namespaces",
	Hidden: true,
	Args:   cobra.ExactArgs(1),
	RunE: func(cmd *cobra.Command, args []string) error {
		if err := unix.Sethostname([]byte("inside-container")); err != nil {
			return err
		}

Since the parent's mounts may be `shared`, we need to make them private in our namespace.

		if err := unix.Mount("", "/", "", unix.MS_PRIVATE|unix.MS_REC, ""); err != nil {
			return err
		}
		if err := pivotRoot(args[0]); err != nil {
			return err
		}

Mount `/proc` so that we can use `ps aux`.

		if err := unix.Mount("proc", "/proc", "proc", 0, ""); err != nil {
			return err
		}
		args = []string{"/bin/bash"}

`execve` (called from `exec_ps` in the C code) replaces the program currently being run by the calling process with a new program, with a newly initialized stack, heap, and (initialized and uninitialized) data segments.

		C.create_argv(C.int(len(args)))
		for i, arg := range args {
			cArg := C.CString(arg)
			C.set_argv(C.int(i), cArg)
			defer C.free(unsafe.Pointer(cArg))
		}
		C.exec_ps()
		return nil
	},
}

func pivotRoot(root string) error {

Bind-mounting the rootfs onto itself satisfies a restriction of `PivotRoot`: `new_root` must be a mount point, and `new_root` and `put_old` must not be on the same mount as the current root.

	if err := unix.Mount(root, root, "bind", unix.MS_BIND|unix.MS_REC, ""); err != nil {
		return fmt.Errorf("mount rootfs to itself: %v", err)
	}
	pivotDir := filepath.Join(root, ".pivot_root")
	if err := os.Mkdir(pivotDir, 0777); err != nil {
		return err
	}

The `syscall.PivotRoot` call changes the root mount in the mount namespace of the calling process. It moves the root mount to the directory `.pivot_root` and makes `root` the new root mount. Afterwards we can unmount `.pivot_root`, that is, the old root.

	if err := syscall.PivotRoot(root, pivotDir); err != nil {
		return fmt.Errorf("pivot_root %v", err)
	}
	if err := syscall.Chdir("/"); err != nil {
		return fmt.Errorf("chdir / %v", err)
	}
	pivotDir = filepath.Join("/", ".pivot_root")
	if err := syscall.Unmount(pivotDir, syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmount pivot_root dir %v", err)
	}
	return os.Remove(pivotDir)
}

func main() {

`LockOSThread` wires the calling goroutine to its current operating system thread. The calling goroutine will always execute in that thread, and no other goroutine will execute in it.

	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	rootCmd.AddCommand(bootstrapCmd)
	if err := rootCmd.Execute(); err != nil {
		fmt.Println(err)
		os.Exit(1)
	}
}
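
To isolate the re-exec idea from the cobra wiring, here is a minimal standard-library sketch. It is an illustration under assumptions, not the code from the repository: the helper name `newBootstrapCmd`, the constant `cloneFlags`, and the `/tmp/rootfs` path are hypothetical, and it uses `syscall` where the post uses `golang.org/x/sys/unix`. It only builds and prints the command; actually calling `Run()` on it would need root on Linux.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// cloneFlags lists the namespaces the child is created in:
// UTS (hostname), mount, and PID. Same set as in the post.
const cloneFlags = syscall.CLONE_NEWUTS | syscall.CLONE_NEWNS | syscall.CLONE_NEWPID

// newBootstrapCmd (hypothetical name) builds, but does not start, a command
// that re-executes the current binary with a "bootstrap" argument.
func newBootstrapCmd(rootfs string) *exec.Cmd {
	cmd := exec.Command("/proc/self/exe", "bootstrap", rootfs)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{Cloneflags: cloneFlags}
	return cmd
}

func main() {
	// Child branch: this is where the namespace setup would happen.
	if len(os.Args) > 2 && os.Args[1] == "bootstrap" {
		fmt.Println("bootstrap: would configure namespaces for", os.Args[2])
		return
	}

	// Parent branch: describe the command instead of running it,
	// because cmd.Run() with these clone flags requires root.
	cmd := newBootstrapCmd("/tmp/rootfs") // hypothetical path
	fmt.Println("would run:", cmd.Path, cmd.Args[1:])
}
```

The dispatch on `os.Args[1]` plays the role of the hidden `bootstrap` subcommand: the same binary acts as parent on the first run and as the namespaced child on the second.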

If we execute the code we can poke around in a container:

$ sudo go run ./cmd/minicr/ /home/platon/p/busybox
/ # echo $$
1
/ # hostname
inside-container
/ # ps aux
PID   USER     COMMAND
    1 root     /bin/bash
    8 root     ps aux
/ # ls
bin      etc      lib      linuxrc  mnt      proc     run      sys      usr
dev      home     lib64    media    opt      root     sbin     tmp      var
/ #
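
The bootstrap step remounts `/` as private because the parent's mounts may be `shared`. One way to check this on a given host is to read `/proc/self/mountinfo`, where propagation shows up as optional fields such as `shared:N` or `master:N` between the mount options and the `-` separator. The sketch below is a small illustration of that check, not part of the runtime; the function name `propagation` is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// propagation (hypothetical helper) parses one line of /proc/self/mountinfo
// and returns the mount point and its propagation type. Optional fields sit
// between the mount options (7th field onward) and the "-" separator.
func propagation(line string) (mountPoint, mode string, ok bool) {
	fields := strings.Fields(line)
	if len(fields) < 7 {
		return "", "", false
	}
	mode = "private" // no optional field means a private mount
	for _, f := range fields[6:] {
		if f == "-" {
			return fields[4], mode, true
		}
		switch {
		case strings.HasPrefix(f, "shared:"):
			mode = "shared"
		case strings.HasPrefix(f, "master:") && mode != "shared":
			mode = "slave"
		case f == "unbindable":
			mode = "unbindable"
		}
	}
	return "", "", false // separator missing: malformed line
}

func main() {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		fmt.Println("cannot read mountinfo:", err)
		return
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if mp, mode, ok := propagation(sc.Text()); ok && mp == "/" {
			fmt.Printf("%s is %s\n", mp, mode)
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Println("read error:", err)
	}
}
```

If the host reports `/` as shared, mount changes made in a new mount namespace could propagate back to the host, which is what the `MS_PRIVATE|MS_REC` remount is there to prevent.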

C code

The code below is a modified version of container_top_linux.c.

//go:build !remote


#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

/* keep special_exit_code in sync with container_top_linux.go */
int special_exit_code = 255;
char **argv = NULL;

void
create_argv (int len)
{
  /* allocate one extra element because we need a final NULL in c */
  argv = malloc (sizeof (char *) * (len + 1));
  if (argv == NULL)
    {
      fprintf (stderr, "failed to allocate ps argv");
      exit (special_exit_code);
    }
  /* add final NULL */
  argv[len] = NULL;
}

void
set_argv (int pos, char *arg)
{
  argv[pos] = arg;
}

void
exec_ps ()
{
  if (argv == NULL)
    {
      fprintf (stderr, "argv not initialized");
      exit (special_exit_code);
    }
  execve (argv[0], argv, NULL);
  fprintf (stderr, "execve: %m");
  exit (special_exit_code);
}
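
For comparison, Go's standard library exposes the same system call as `syscall.Exec`, so a cgo-free variant may be possible. The sketch below is an assumption about how that could look, not what the repository does: `execInto` is a hypothetical helper, and the empty environment mirrors the `NULL` envp passed to `execve` in the C code above.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// execInto (hypothetical helper) replaces the current process image with the
// program at path. On success it never returns; on failure it returns the
// error reported by execve.
func execInto(path string, argv, env []string) error {
	return syscall.Exec(path, argv, env)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: execdemo /absolute/path/to/program [args...]")
		return
	}

	// os.Args[1:] starts with the program path, which becomes argv[0].
	// A nil environment mirrors the NULL envp in the C code.
	err := execInto(os.Args[1], os.Args[1:], nil)

	// Only reached when execve failed.
	fmt.Fprintln(os.Stderr, "execve:", err)
	os.Exit(255) // same value as special_exit_code in the C code
}
```

Like `execve`, `syscall.Exec` does not search `PATH`, so the program has to be given as a path, for example `/bin/echo hello`.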

This post is part of a series.

Part 1: Container build tool
Part 2: How-to build OCI Image by hands
Part 3: Building OCI images with Go. No run command yet
Part 4: How to Tar/Untar container layers in Go
Part 5: Linux kernel namespaces
Part 6: Mini container runtime in Go
Part 7: Union mount