shithub: plan9front

ref: 83b569cbc7e772ec454f87ccd71104b23faebbc8
parent: bd2ef36d976db0a842e31af309f0ef1439f30a89
author: Jacob Moody <>
date: Fri Apr 28 03:11:37 EDT 2023

/sys/doc: "Namespaces as Security Domains"

--- a/sys/doc/mkfile
+++ b/sys/doc/mkfile
@@ -41,6 +41,7 @@
+	nssec\
 HTML=${ALL:%=%.html} release3.html release4.html
--- /dev/null
+++ b/sys/doc/
@@ -1,0 +1,414 @@
+.HTML "Namespaces as Security Domains"
+Namespaces as Security Domains
+Jacob Moody
+We explore the use of Plan 9 namespaces
+as a way of building isolated processes. We present
+code that increases both the ability and the granularity
+with which a process may isolate itself from others
+on the system.
+First presented in a slightly different form at the 9th International Workshop on Plan 9.
+In a Plan 9 system the kernel exposes hardware and system
+interfaces through a myriad of filesystem trees. These trees, or
+sharp devices, replace the functionality of many would-be system calls
+with standard file system operations. A standard Plan 9 environment
+is composed of these individual devices, the collection
+of which forms the process's namespace.
+With these principles it is quite easy for a process to build a slim namespace containing only
+what it needs for operation. Doing so can reduce the "blast radius"
+of errant or malicious code. But to be fully effective, a process must also be able
+to give up the ability to bootstrap these capabilities back. We will explore different ways of
+building isolated namespaces, their pitfalls, and ways to address those issues, along with new solutions.
+Outside World
+There have been many solutions for sandboxing within the UNIX™
+world. There are more classical approaches such as
+.CW zones
+[Price04] and
+.CW jails ,
+which provide an abstraction for building some number of
+smaller full UNIX boxes out of a single physical host. However, these
+interfaces are presented more as a systems management tool; the mechanisms
+by which an administrator creates and manages these resources are unergonomic
+to use on a per-process basis. Instead it seems more the fashion now to isolate
+specific pieces of the system, and to expect that each process on the system
+may choose to manage its own environment. The most successful execution of this idea in the
+wild is the OpenBSD project's
+.CW unveil
+and
+.CW pledge
+[Beck18] system calls, which allow a process to cut off specific parts of the filesystem or
+system call interfaces. Linux namespaces [Biederman06] implement this idea by allowing a process
+to fork off private versions of specific global resources. In both cases the sandboxing
+of a process proceeds in gradual steps, removing potentially dangerous tools one by one.
+Existing Work
+Let us first define the resources we are restricting access to. The aforementioned gradual solutions
+provide ways in which a process can remove itself from specific kernel interfaces. In Plan 9 the kernel
+exposes almost all of its functionality through individual filesystems. These devices are accessed
+globally by prefixing a path with a sharp ('#'), and have conventional places where they are bound within the
+namespace.
+A process's namespace in Plan 9 is typically constructed using a namespace file. These files
+are a collection of namespace operations formatted as one would expect to see them in a shell script.
+They typically begin by binding some number of sharp devices into their expected locations.
+bind #d /fd
+bind -c #e /env
+bind #p /proc
+bind -c #s /srv
+They then use the globals provided, in particular /srv, to bring in the rest of the root filesystem.
+A process can at any point choose to construct itself a new namespace, but it must do so when changing
+users. This is done in part to ensure that each filesystem that the program would like to use has
+its chance to authenticate and be notified. Because this information is only exchanged on attach,
+the new user must construct a namespace from scratch.
+Many programs, like network services, wish to drop their current user and become the special user
+.CW none
+on startup, and in doing so must rebuild their namespace. The conventional default namespace
+file is /lib/namespace, but most programs allow the user to specify an alternative with a
+flag. Here we can already approximate a chroot-style environment by changing the root
+filesystem used in a namespace file.
+bind #s /srv
+mount /srv/myboot /root
+bind -a /root /
+By exposing another filesystem in /srv/myboot and modifying the provided namespace file,
+we've allowed this process to work within an entirely separate root filesystem.
+The issue in using these namespaces as security barriers is that there is nothing preventing
+a process from bootstrapping a resource back. While our example places a different root filesystem
+in the namespace, nothing prevents that process or its children from rebootstrapping
+the real root filesystem. For this issue there is a special rfork flag,
+.CW RFNOMNT ,
+that prevents a process from accessing almost any sharp device of consequence. This is done by
+preventing a process from walking to a device by its location within '#'. This allows existing
+binds of resources to continue working within the namespace but restricts a process from binding
+in new resources from the kernel.
+While effective, we found this to be too large a hammer in practice. As its name implies,
+.CW RFNOMNT
+also prevents a process from performing any mounts or binds. This creates a single
+point in time at which a process gives up all of its control, instead of the idealized gradual
+process. This makes it quite hard to use in practice: only a single program in a chain
+may be the one to invoke
+.CW RFNOMNT ,
+and it must hope that no program further down the chain wants to make use of its namespace.
+The interface itself feels clunky: there is a nice gradual addition of these kernel devices
+to the namespace, so why must the removal be all at once?
+We propose a new write interface through /dev/drivers
+that functionally replaces
+.CW RFNOMNT .
+/dev/drivers now accepts writes in the form of
+chdev op devmask
+Devmask is a string of sharp device characters. Op specifies how
+devmask is interpreted. Op is one of
+.TS
+lw(1i) lw(4.5i).
+\f(CW&\fP	T{
+Permit access to just the devices specified in devmask.
+T}
+\f(CW&~\fP	T{
+Permit access to all but the devices specified in devmask.
+T}
+\f(CW~\fP	T{
+Remove access to all devices.  Devmask is ignored.
+T}
+.TE
+This allows a process to selectively remove access to
+individual sharp devices with quite a bit of control.
+In order to mimic all of
+.CW RFNOMNT 's
+features, removing access to
+.CW devmnt ,
+which is not normally accessible directly,
+disables the process's ability to perform mount
+and bind operations.
+For the implementation, we extended the existing
+.CW RFNOMNT
+flag attached to the process namespace group
+into a bit vector, each bit representing an index into
+.CW devtab .
+The following function illustrates how this vector is set.
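+void
+devmask(Pgrp *pgrp, int invert, char *devs)
+{
+	int i, t, w;
+	char *p;
+	Rune r;
+	u64int mask[nelem(pgrp->notallowed)];
+
+	/* invert: start from "everything removed" and clear the named devices */
+	if(invert)
+		memset(mask, 0xFF, sizeof mask);
+	else
+		memset(mask, 0, sizeof mask);
+
+	w = sizeof mask[0] * 8;
+	for(p = devs; *p != 0;){
+		p += chartorune(&r, p);
+		t = devno(r, 1);
+		if(t == -1)
+			continue;	/* unknown device character */
+		/* shift a 64-bit one; a plain int 1 overflows once t%w reaches 31 */
+		if(invert)
+			mask[t/w] &= ~((u64int)1<<t%w);
+		else
+			mask[t/w] |= (u64int)1<<t%w;
+	}
+
+	/* restrictions only accumulate: bits are set, never cleared */
+	wlock(&pgrp->ns);
+	for(i=0; i < nelem(pgrp->notallowed); i++)
+		pgrp->notallowed[i] |= mask[i];
+	wunlock(&pgrp->ns);
+}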
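+The three ops described earlier reduce to the single invert argument of devmask.
+The following is a minimal sketch of that mapping in portable C, not the 9front
+write handler: the helper name parsechdev, its signature and the buffer sizes are
+assumptions made for illustration. It follows from devmask that "&" and "~" start
+from a full mask (invert set), while "&~" starts from an empty one.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: split a command of the form "chdev op devmask"
 * and report how the mask should be interpreted.
 * Returns 0 on success, -1 on a malformed command.
 */
int
parsechdev(const char *cmd, int *invert, char *devs, size_t ndevs)
{
	char verb[16], op[4], mask[64];
	int n;

	mask[0] = '\0';
	n = sscanf(cmd, "%15s %3s %63s", verb, op, mask);
	if(n < 2 || strcmp(verb, "chdev") != 0)
		return -1;

	if(strcmp(op, "&") == 0)
		*invert = 1;		/* permit only the listed devices */
	else if(strcmp(op, "&~") == 0)
		*invert = 0;		/* remove only the listed devices */
	else if(strcmp(op, "~") == 0){
		*invert = 1;		/* remove everything */
		mask[0] = '\0';		/* devmask is ignored for this op */
	}else
		return -1;

	if(strlen(mask) >= ndevs)
		return -1;
	strcpy(devs, mask);
	return 0;
}
```

+Under these assumptions, "chdev &~ s" yields invert 0 with the device string "s",
+and "chdev ~" yields invert 1 with an empty device string.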
+Devmask is called from the write handler for /dev/drivers. This
+bitmask is then consulted any time a name is resolved that begins
+with '#'. This is done from within the
+.CW namec ()
+function using the following function to check
+if a particular device
+.CW r
+is permitted.
+int
+devallowed(Pgrp *pgrp, int r)
+{
+	int t, w, b;
+
+	t = devno(r, 1);
+	if(t == -1)
+		return 0;	/* unknown devices are never allowed */
+
+	w = sizeof(u64int) * 8;
+	rlock(&pgrp->ns);
+	b = !(pgrp->notallowed[t/w] & (u64int)1<<t%w);
+	runlock(&pgrp->ns);
+	return b;
+}
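+Because devmask only ever ORs bits into the vector, access that has been removed
+cannot be regained by a later write. The following user-space model sketches that
+property in portable C. It is not kernel code: the Model type, the mdev* names,
+the 128-entry table and the use of the character value as the table index are all
+assumptions standing in for Pgrp, devno and devtab.

```c
#include <stdint.h>
#include <string.h>

enum { NDEV = 128 };	/* assumed size of the model device table */

typedef struct Model Model;
struct Model {
	uint64_t notallowed[NDEV/64];	/* a set bit means access is removed */
};

/* Stand-in for devno(): ASCII device characters index the table directly. */
static int
mdevno(int c)
{
	return (c > 0 && c < NDEV) ? c : -1;
}

void
mdevmask(Model *m, int invert, const char *devs)
{
	uint64_t mask[NDEV/64];
	int i, t;

	memset(mask, invert ? 0xFF : 0, sizeof mask);
	for(; *devs != '\0'; devs++){
		t = mdevno((unsigned char)*devs);
		if(t == -1)
			continue;
		if(invert)
			mask[t/64] &= ~((uint64_t)1 << t%64);
		else
			mask[t/64] |= (uint64_t)1 << t%64;
	}
	/* OR-ing means a later call can only add restrictions */
	for(i = 0; i < NDEV/64; i++)
		m->notallowed[i] |= mask[i];
}

int
mdevallowed(const Model *m, int c)
{
	int t = mdevno(c);

	if(t == -1)
		return 0;
	return !(m->notallowed[t/64] & ((uint64_t)1 << t%64));
}
```

+In this model, removing 's' and then asking to permit only 's' and 'I' leaves 'I'
+usable but does not bring 's' back.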
+We found that once removal is made a core verb of these sharp
+devices it becomes easy to view access to them
+as capabilities. This is aided by system functionality already being neatly
+organized into the various devices themselves. For example, one could
+say a process is capable of accessing the broader internet if it has access
+to the
+.CW devip
+device. This access can either be direct via its path under '#' or through a
+location in the namespace where this device has already been bound. With these
+changes, the entire capability list of a process is on display through just its
+/proc/$pid/ns file. This
+.CW ns
+file indicates whether a particular device is bound and now also includes
+the list of devices a process has access to.
+In practice, this results in a pattern of binding
+in a sharp device, making use of it, and removing
+it when no longer needed. A namespace file for
+a web server could now look like
+bind #s /srv
+# /srv/www created by srvfs www /lib/www
+mount /srv/www /lib/
+unmount /srv
+chdev -r s # chdev &~ s
+In this example we have created a new root for the process by
+using exportfs to expose a little piece of the boot namespace.
+We unmount
+.CW devsrv
+and remove access to it with
+.CW chdev ,
+ensuring there is no way for our process to talk to the real
+.CW /srv/boot .
+This provides a nice succinct lifetime of access to
+.CW devsrv
+and makes the removal of these sharp devices as easy as
+it is to use them in the first place. 
+.CW chdev
+does not restrict access to sharp devices that have already been mounted.
+This also allows a process to keep just a subsection or single piece of
+a sharp device. One example is restricting a process
+to just a single network stack:
+bind '#I1' /net
+chdev -r I
+With this
+.CW chdev
+mechanism, the ability of a device to provide isolation of its
+own becomes more powerful. This was partially illustrated in the previous
+.CW devip
+example.
+.CW Devsrv ,
+the sharp device providing named pipes, was an ideal target for
+adding isolation. Devsrv provides a bulletin board of all posted 9P services
+for a given host. We wanted to provide a mechanism for a process, or
+family tree of processes, to share a private
+.CW devsrv
+between themselves.
+The design for this was borrowed from devip, in which a process opens a
+.CW clone
+file to read its newly allocated slot number. This new 'board' appears as a sibling directory
+to the
+.CW clone
+it was spawned from. This new board is itself a fully functioning
+.CW devsrv
+with its own clone file, making the nesting of full trees of
+.CW srvs
+quite easy and completely transparent. The following illustrates
+how one could replace the global
+.CW /srv
+with a freshly allocated one.
+</srv/clone {
+	s='/srv/'^`{read}
+	bind -c $s /srv
+	exec p
+}
+Also like devip, once the last reference to the file descriptor returned by opening
+.CW clone
+is closed, the board is closed and posters to that board receive an EOF. It is important
+to bake this kind of ownership into the design, as self-referential users of
+.CW /srv
+are quite common in current code.
+This along with chdev can be used to create a sandbox for /srv quite easily:
+the process allocates itself a new /srv, then removes access to the global
+root srv. This allows potentially untrusted processes to still make use of the interface
+without needing to worry about their access to the global state. The practice of having
+new boards appear as subdirectories allows the entire state to be seen easily by inspecting the
+root of devsrv itself.
+Restricting Within a Mount
+As shown earlier with the use of
+.CW srvfs ,
+an intermediate file server can be used to serve only a small subsection of a larger
+namespace. In that example we used this to expose only /lib/www from the host to processes
+running a web server. This can be limiting, as the invocation of
+.CW exportfs
+becomes more complicated if the user wishes to use multiple pieces from completely
+separate places within the file tree. To address this, a utility program,
+.CW protofs ,
+was written to easily create convincing mimics of the filesystem it was run from.
+.CW protofs
+accepts a
+.CW proto
+file, a text file containing a description of a file tree, and uses it to provide
+dummy files mimicking the structure. These dummies can then be used by a process as targets
+for bind mounts of its current namespace, providing the illusion of trimming all but select
+pieces. This new root cannot simply be bound over the real one, since that still allows an unmount
+to escape back to the real system, but re-exporting the namespace still works. To illustrate a
+more involved setup than before:
+# We want to provide our web server
+# with /bin, /lib/www and /lib/git
+; cat >>/tmp/proto <<.
+bin	d775
+lib	d775
+	www	d775
+	git	d775
+.
+; protofs -m /mnt/proto /tmp/proto
+; bind /bin /mnt/proto/bin
+; bind /lib/www /mnt/proto/lib/www
+; bind /lib/git /mnt/proto/lib/git
+# A private srv could be used, omitted for brevity
+; srvfs webbox /mnt/proto
+# Namespace file for using our new mini-root
+; cat >>/tmp/ns <<.
+mount #s/webbox /root
+bind -b /root /
+chdev -r s
+.
+; auth/newns -n /tmp/ns ls /
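+The proto file above nests entries by indentation. As a rough illustration of how
+such a description maps to paths, the following portable C sketch turns one proto
+line at a time into the full path it names. It is not the protofs implementation:
+the function name protopath, the limits, and the assumption that each leading tab
+is one level of nesting are ours, based only on the example shown.

```c
#include <stdio.h>
#include <string.h>

enum { MAXDEPTH = 16, MAXNAME = 64 };

/*
 * Hypothetical helper: record the entry named on one proto line in
 * stack[depth] and write its full path to out. The caller zeroes the
 * stack once and feeds lines in file order.
 * Returns the nesting depth, or -1 on error.
 */
int
protopath(const char *line, char stack[MAXDEPTH][MAXNAME], char *out, size_t nout)
{
	int depth, i, n;
	size_t len, used;

	/* leading tabs give the nesting depth */
	for(depth = 0; line[depth] == '\t'; depth++)
		;
	if(depth >= MAXDEPTH)
		return -1;

	/* the name runs up to the next blank; mode fields are ignored */
	len = strcspn(line + depth, " \t\n");
	if(len == 0 || len >= MAXNAME)
		return -1;
	memcpy(stack[depth], line + depth, len);
	stack[depth][len] = '\0';

	/* join the parents and the entry itself into one path */
	used = 0;
	for(i = 0; i <= depth; i++){
		n = snprintf(out + used, nout - used, "/%s", stack[i]);
		if(n < 0 || (size_t)n >= nout - used)
			return -1;
		used += n;
	}
	return depth;
}
```

+Fed the four lines of the example in order, this sketch produces /bin, /lib,
+/lib/www and /lib/git, which are the bind targets used under /mnt/proto.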
+Future Work
+While we think these changes bring us closer to namespaces as security boundaries,
+there is still plenty of work and understanding to be done. One particular
+item of interest is attempting some kind of isolation of
+.CW devproc ,
+possibly in a similar fashion to the
+.CW /srv/clone
+implementation, but attempts have yet to be made. The exact nature of
+.CW namespace
+files and how they relate to sandboxing as a whole has yet to be fully
+worked out. There is clear potential, but it is likely that additional abilities will
+be required. It is somewhat difficult to synthesize a namespace entirely
+from nothing, which is something we found ourselves reaching for when building
+alternative roots to run processes within. There is potential for some merger of
+.CW proto
+and
+.CW namespace
+files to provide a template of the current namespace to graft on to the next one.
+Both
+.CW chdev
+and
+.CW /srv/clone
+are merged into 9front and their implementations are freely available as part of the base system.
+References
+[Beck18]
+Bob Beck,
+``Pledge, and Unveil, in OpenBSD'',
+.I "BSDCan Slides" ,
+July, 2018.
+[Price04]
+Daniel Price,
+Andrew Tucker,
+``Solaris Zones: Operating System Support for Consolidating Commercial Workloads'',
+.I "Proceedings of the 18th Large Installation System Administration Conference" ,
+pp. 241-254,
+November, 2004.
+[Biederman06]
+Eric W. Biederman,
+``Multiple Instances of the Global Linux Namespaces'',
+.I "Proceedings of the 2006 Linux Symposium Volume One" ,
+pp. 102-112,
+Ottawa, Ontario,
+July, 2006.