.HTML "The Use of Name Spaces in Plan 9
.TL
The Use of Name Spaces in Plan 9
.AU
Rob Pike
Dave Presotto
Ken Thompson
Howard Trickey
Phil Winterbottom
.AI
.MH
USA
.AB
.FS
Appeared in
.I
Operating Systems Review,
.R
Vol. 27, #2, April 1993, pp. 72-76
(reprinted from
.I
Proceedings of the 5th ACM SIGOPS European Workshop,
.R
Mont Saint-Michel, 1992, Paper nº 34).
.FE
Plan 9 is a distributed system built at the Computing Sciences Research
Center of AT&T Bell Laboratories (now Lucent Technologies, Bell Labs) over the last few years.
Its goal is to provide a production-quality system for software
development and general computation using heterogeneous hardware
and minimal software.  A Plan 9 system comprises CPU and file
servers in a central location connected together by fast networks.
Slower networks fan out to workstation-class machines that serve as
user terminals.  Plan 9 argues that given a few carefully
implemented abstractions
it is possible to
produce a small operating system that provides support for the largest systems
on a variety of architectures and networks. The foundations of the system are
built on two ideas: a per-process name space and a simple message-oriented 
file system protocol.
.AE
.PP
The operating system for the CPU servers and terminals is
structured as a traditional kernel: a single compiled image
containing code for resource management, process control,
user processes,
virtual memory, and I/O.  Because the file server is a separate
machine, the file system is not compiled in, although the management
of the name space, a per-process attribute, is.
The entire kernel for the multiprocessor SGI Power Series machine
is 25000 lines of C,
the largest part of which is code for four networks including the
Ethernet with the Internet protocol suite.
Fewer than 1500 lines are machine-specific, and a
functional kernel with minimal I/O can be put together from
source files totaling 6000 lines. [Pike90]
.PP
The system is relatively small for several reasons.
First, it is all new: it has not had time to accrete as many fixes
and features as other systems.
Also, other than the network protocol, it adheres to no
external interface; in particular, it is not Unix-compatible.
Economy stems from careful selection of services and interfaces.
Finally, wherever possible the system is built around
two simple ideas:
every resource in the system, either local or remote,
is represented by a hierarchical file system; and
a user or process
assembles a private view of the system by constructing a file
.I
name space
.R
that connects these resources. [Needham]
.SH
File Protocol
.PP
All resources in Plan 9 look like file systems.
That does not mean that they are repositories for
permanent files on disk, but that the interface to them
is file-oriented: finding files (resources) in a hierarchical
name tree, attaching to them by name, and accessing their contents
by read and write calls.
There are dozens of file system types in Plan 9, but only a few
represent traditional files.
At this level of abstraction, files in Plan 9 are similar
to objects, except that files are already provided with naming,
access, and protection methods that must be created afresh for
objects.  Object-oriented readers may approach the rest of this
paper as a study in how to make objects look like files.
.PP
The interface to file systems is defined by a protocol, called 9P,
analogous but not very similar to the NFS protocol.
The protocol talks about files, not blocks; given a connection to the root
directory of a file server,
the 9P messages navigate the file hierarchy, open files for I/O,
and read or write arbitrary bytes in the files.
9P contains 17 message types: three for
initializing and
authenticating a connection and fourteen for manipulating objects.
The messages are generated by the kernel in response to user- or
kernel-level I/O requests.
Here is a quick tour of the major message types.
The
.CW auth
and
.CW attach
messages authenticate a connection, established by means outside 9P,
and validate its user.
The result is an authenticated
.I channel
that points to the root of the
server.
The
.CW clone
message makes a new channel identical to an existing channel,
which may be moved to a file on the server using a
.CW walk
message to descend each level in the hierarchy.
The
.CW stat
and
.CW wstat
messages read and write the attributes of the file pointed to by a channel.
The
.CW open
message prepares a channel for subsequent
.CW read
and
.CW write
messages to access the contents of the file, while
.CW create
and
.CW remove
perform, on the files, the actions implied by their names.
The
.CW clunk
message discards a channel without affecting the file.
None of the 9P messages considers caching; file caches are provided,
when needed, either within the server (centralized caching)
or by implementing the cache as a transparent file system between the
client and the 9P connection to the server (client caching).
.PP
For efficiency, the connection to local
kernel-resident file systems, misleadingly called
.I devices,
is by regular rather than remote procedure calls.
The procedures map one-to-one with 9P message types.
Locally each channel has an associated data structure
that holds a type field used to index
a table of procedure calls, one set per file system type,
analogous to selecting the method set for an object. 
One kernel-resident file system, the
.I
mount device,
.R
translates the local 9P procedure calls into RPC messages to
remote services over a separately provided transport protocol
such as TCP or IL, a new reliable datagram protocol, or over a pipe to
a user process.
Write and read calls transmit the messages over the transport layer.
The mount device is the sole bridge between the procedural
interface seen by user programs and remote and user-level services.
It does all associated marshaling, buffer
management, and multiplexing and is
the only integral RPC mechanism in Plan 9.
The mount device is in effect a proxy object.
There is no RPC stub compiler; instead the mount driver and
all servers just share a library that packs and unpacks 9P messages.
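.PP
The table of procedure calls indexed by channel type can be sketched in portable C.
This is an illustration under stated assumptions, not the kernel's code:
the names
.CW Chan ,
.CW Dev ,
.CW devtab ,
and
.CW chanread ,
and the two toy devices, are invented for the example,
and only a read procedure is shown where a real table
would hold one procedure per 9P message type.
.P1
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical channel: the type field selects a row of devtab. */
typedef struct Chan {
	int type;	/* index into devtab */
} Chan;

/* One set of procedures per file system type (only read shown). */
typedef struct Dev {
	const char *name;
	long (*read)(Chan *c, char *buf, long n);
} Dev;

/* Toy device: reads always report end of file. */
static long
nullread(Chan *c, char *buf, long n)
{
	(void)c; (void)buf; (void)n;
	return 0;
}

/* Toy device: reads fill the buffer with zero bytes. */
static long
zeroread(Chan *c, char *buf, long n)
{
	(void)c;
	memset(buf, 0, (size_t)n);
	return n;
}

static Dev devtab[] = {
	{ "null", nullread },
	{ "zero", zeroread },
};

/* Generic read: index the table by channel type and call through. */
long
chanread(Chan *c, char *buf, long n)
{
	return devtab[c->type].read(c, buf, n);
}
```
.P2
In such a scheme the mount device would simply be one more row of the table,
whose procedures send messages instead of doing the work locally.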
.SH
Examples
.PP
One file system type serves
permanent files from the main file server,
a stand-alone multiprocessor system with a
350-gigabyte
optical WORM jukebox that holds the data, fronted by a two-level
block cache comprising 7 gigabytes of
magnetic disk and 128 megabytes of RAM.
Clients connect to the file server using any of a variety of
networks and protocols and access files using 9P.
The file server runs a distinct operating system and has no
support for user processes; other than a restricted set of commands
available on the console, all it does is answer 9P messages from clients.
.PP
Once a day, at 5:00 AM,
the file server sweeps through the cache blocks and marks dirty blocks
copy-on-write.
It creates a copy of the root directory
and labels it with the current date, for example
.CW 1995/0314 .
It then starts a background process to copy the dirty blocks to the WORM.
The result is that the server retains an image of the file system as it was
early each morning.
The set of old root directories is accessible using 9P, so a client
may examine backup files using ordinary commands.
Several advantages stem from having the backup service implemented
as a plain file system.
Most obviously, ordinary commands can access the backup files.
For example, to see when a bug was fixed
.P1
grep 'mouse bug fix' 1995/*/sys/src/cmd/8½/file.c
.P2
The owner, access times, permissions, and other properties of the
files are also backed up.
Because it is a file system, the backup
still has protections;
it is not possible to subvert security by looking at the backup.
.PP
The file server is only one type of file system.
A number of unusual services are provided within the kernel as
local file systems.
These services are not limited to I/O devices such
as disks.  They include network devices and their associated protocols,
the bitmap display and mouse,
a representation of processes similar to
.CW /proc
[Killian], the name/value pairs that form the `environment'
passed to a new process, profiling services,
and other resources.
Each of these is represented as a file system \(em
directories containing sets of files \(em
but the constituent files do not represent permanent storage on disk.
Instead, they are closer in properties to UNIX device files.
.PP
For example, the
.I console
device contains the file
.CW /dev/cons ,
similar to the UNIX file
.CW /dev/console :
when written,
.CW /dev/cons
appends to the console typescript; when read,
it returns characters typed on the keyboard.
Other files in the console device include
.CW /dev/time ,
the number of seconds since the epoch,
.CW /dev/cputime ,
the computation time used by the process reading the device,
.CW /dev/pid ,
the process id of the process reading the device, and
.CW /dev/user ,
the login name of the user accessing the device.
All these files contain text, not binary numbers,
so their use is free of byte-order problems.
Their contents are synthesized on demand when read; when written,
they cause modifications to kernel data structures.
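.PP
Because the contents are text, a client can convert them with ordinary
string routines regardless of the byte order of either machine.
A minimal sketch in portable C follows;
.CW parsetime
is a name assumed for the example, and the exact layout of the text
(leading blanks, trailing fields) is also an assumption,
since it is not specified here.
.P1
```c
#include <stdlib.h>

/*
 * Convert the decimal text read from a file such as /dev/time
 * into a number.  Returns 0 on success, -1 if no digits were found.
 */
int
parsetime(const char *text, unsigned long *secs)
{
	char *end;

	*secs = strtoul(text, &end, 10);
	return end == text ? -1 : 0;
}
```
.P2
The caller is expected to have read the file into a
NUL-terminated buffer before calling the routine.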
.PP
The
.I process
device contains one directory per live local process, named by its numeric
process id:
.CW /proc/1 ,
.CW /proc/2 ,
etc.
Each directory contains a set of files that access the process.
For example, in each directory the file
.CW mem
is an image of the virtual memory of the process that may be read or
written for debugging.
The
.CW text
file is a sort of link to the file from which the process was executed;
it may be opened to read the symbol tables for the process.
The
.CW ctl
file accepts textual messages such as
.CW stop
or
.CW kill
that control the execution of the process.
The
.CW status
file contains a fixed-format line of text containing information about
the process: its name, owner, state, and so on.
Text strings written to the
.CW note
file are delivered to the process as
.I notes,
analogous to UNIX signals.
By providing these services as textual I/O on files rather
than as system calls (such as
.CW kill )
or special-purpose operations (such as
.CW ptrace ),
the Plan 9 process device simplifies the implementation of
debuggers and related programs.
For example, the command
.P1
cat /proc/*/status
.P2
is a crude form of the
.CW ps
command; the actual
.CW ps
merely reformats the data so obtained.
.PP
The
.I bitmap
device contains three files,
.CW /dev/mouse ,
.CW /dev/screen ,
and
.CW /dev/bitblt ,
that provide an interface to the local bitmap display (if any) and pointing device.
The
.CW mouse
file returns a fixed-format record containing
1 byte of button state and 4 bytes each of
.I x
and
.I y
position of the mouse.
If the mouse has not moved since the file was last read, a subsequent read will
block.
The
.CW screen
file contains a memory image of the contents of the display;
the
.CW bitblt
file provides a procedural interface.
Calls to the graphics library are translated into messages that are written
to the
.CW bitblt
file to perform bitmap graphics operations.  (This is essentially a nested
RPC protocol.)
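.PP
The mouse record described above (1 byte of button state followed by
4 bytes each of
.I x
and
.I y )
might be decoded as follows.
The byte order is not stated, nor whether any bytes precede the button state;
this sketch assumes a bare 9-byte record with little-endian,
two's-complement coordinates.
The names
.CW Mouse ,
.CW get32 ,
and
.CW decodemouse
are hypothetical.
.P1
```c
#include <stdint.h>

/* Hypothetical decoded form of one mouse record. */
typedef struct Mouse {
	unsigned buttons;
	int32_t x;
	int32_t y;
} Mouse;

/*
 * Assemble 4 bytes, least significant first (assumed order).
 * The final conversion to int32_t relies on the usual
 * two's-complement behavior of common compilers.
 */
static int32_t
get32(const unsigned char *p)
{
	uint32_t v;

	v = (uint32_t)p[0] | (uint32_t)p[1] << 8 |
		(uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
	return (int32_t)v;
}

/* Decode an assumed 9-byte record: buttons, x, y. */
Mouse
decodemouse(const unsigned char rec[9])
{
	Mouse m;

	m.buttons = rec[0];
	m.x = get32(rec + 1);
	m.y = get32(rec + 5);
	return m;
}
```
.P2
Building the values byte by byte keeps the decoder independent of the
byte order of the machine that runs it.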
.PP
The various services being used by a process are gathered together into the
process's
.I
name space,
.R
a single rooted hierarchy of file names.
When a process forks, the child process shares the name space with the parent.
Several system calls manipulate name spaces.
Given a file descriptor
.CW fd
that holds an open communications channel to a service,
the call
.P1
mount(int fd, char *old, int flags)
.P2
authenticates the user and attaches the file tree of the service to
the directory named by
.CW old .
The
.CW flags
specify how the tree is to be attached to
.CW old :
replacing the current contents or appearing before or after the
current contents of the directory.
A directory with several services mounted is called a
.I union
directory and is searched in the specified order.
The call
.P1
bind(char *new, char *old, int flags)
.P2
takes the portion of the existing name space visible at
.CW new ,
either a file or a directory, and makes it also visible at
.CW old .
For example,
.P1
bind("1995/0301/sys/include", "/sys/include", REPLACE)
.P2
causes the directory of include files to be overlaid with its
contents from the dump on March first.
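.PP
The search order of a union directory can be modeled in a few lines of portable C.
This is a toy model, not the kernel's mount table:
the flag names
.CW Replace ,
.CW Before ,
and
.CW After ,
the types
.CW Dir
and
.CW Union ,
and the fixed limit of eight members are all assumptions made for the sketch.
.P1
```c
#include <stddef.h>
#include <string.h>

enum { Replace, Before, After };	/* assumed flag names */
enum { Maxunion = 8 };			/* arbitrary limit for the sketch */

/* A toy directory: a name and a NULL-terminated list of entries. */
typedef struct Dir {
	const char *name;
	const char **files;
} Dir;

/* The directories unioned at one mount point, in search order. */
typedef struct Union {
	const Dir *dirs[Maxunion];
	int n;
} Union;

/* Attach d to u as the flag directs; returns -1 if the union is full. */
int
unionbind(Union *u, const Dir *d, int flag)
{
	int i;

	if(flag == Replace){
		u->dirs[0] = d;
		u->n = 1;
		return 0;
	}
	if(u->n >= Maxunion)
		return -1;
	if(flag == Before){
		for(i = u->n; i > 0; i--)
			u->dirs[i] = u->dirs[i-1];
		u->dirs[0] = d;
	}else
		u->dirs[u->n] = d;
	u->n++;
	return 0;
}

/* Return the first directory, in search order, that holds file. */
const Dir*
unionlookup(const Union *u, const char *file)
{
	int i;
	const char **f;

	for(i = 0; i < u->n; i++)
		for(f = u->dirs[i]->files; *f != NULL; f++)
			if(strcmp(*f, file) == 0)
				return u->dirs[i];
	return NULL;
}
```
.P2
A name present in several members resolves to the member that comes
first in the order, which is why the flag given when attaching matters.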
.PP
A process is created by the
.CW rfork
system call, which takes as argument a bit vector defining which
attributes of the process are to be shared between parent
and child instead of copied.
One of the attributes is the name space: when shared, changes
made by either process are visible in the other; when copied,
changes are independent.
.PP
Although there is no global name space,
for a process to function sensibly the local name spaces must adhere
to global conventions. 
Nonetheless, the use of local name spaces is critical to the system.
Both these ideas are illustrated by the use of the name space to
handle heterogeneity.
The binaries for a given architecture are contained in a directory
named by the architecture, for example
.CW /mips/bin ;
in use, that directory is bound to the conventional location
.CW /bin .
Programs such as shell scripts need not know the CPU type they are
executing on to find binaries to run.
A directory of private binaries
is usually unioned with
.CW /bin .
(Compare this to the
.I
ad hoc
.R
and special-purpose idea of the
.CW PATH
variable, which is not used in the Plan 9 shell.)
Local bindings are also helpful for debugging, for example by binding
an old library to the standard place and linking a program to see
if recent changes to the library are responsible for a bug in the program.
.PP
The window system,
.CW 8½
[Pike91], is a server for files such as
.CW /dev/cons
and
.CW /dev/bitblt .
Each client sees a distinct copy of these files in its local
name space: there are many instances of
.CW /dev/cons ,
each served by
.CW 8½
to the local name space of a window.
Again,
.CW 8½
implements services using
local name spaces plus the use
of I/O to conventionally named files.
Each client just connects its standard input, output, and error files
to
.CW /dev/cons ,
with analogous operations to access bitmap graphics.
Compare this to the implementation of
.CW /dev/tty
on UNIX, which is done by special code in the kernel
that overloads the file, when opened,
with the standard input or output of the process.
Special arrangement must be made by a UNIX window system for
.CW /dev/tty
to behave as expected;
.CW 8½
instead uses the provision of the corresponding file as its
central idea, which to succeed depends critically on local name spaces.
.PP
The environment
.CW 8½
provides its clients is exactly the environment under which it is implemented:
a conventional set of files in
.CW /dev .
This permits the window system to be run recursively in one of its own
windows, which is handy for debugging.
It also means that if the files are exported to another machine,
as described below, the window system or client applications may be
run transparently on remote machines, even ones without graphics hardware.
This mechanism is used for Plan 9's implementation of the X window
system: X is run as a client of
.CW 8½ ,
often on a remote machine with lots of memory.
In this configuration, using Ethernet to connect
MIPS machines, we measure only a 10% degradation in graphics
performance relative to running X on
a bare Plan 9 machine.
.PP
An unusual application of these ideas is a statistics-gathering
file system implemented by a command called
.CW iostats .
The command encapsulates a process in a local name space, monitoring 9P
requests from the process to the outside world \(em the name space in which
.CW iostats
is itself running.  When the command completes,
.CW iostats
reports usage and performance figures for file activity.
For example
.P1
iostats 8½
.P2
can be used to discover how much I/O the window system
does to the bitmap device, font files, and so on.
.PP
The
.CW import
command connects a piece of name space from a remote system
to the local name space.
Its implementation is to dial the remote machine and start
a process there that serves the remote name space using 9P.
It then calls
.CW mount
to attach the connection to the name space and finally dies;
the remote process continues to serve the files.
One use is to access devices not available
locally.  For example, to write a floppy one may say
.P1
import lab.pc /a: /n/dos
cp foo /n/dos/bar
.P2
The call to
.CW import
connects the file tree from
.CW /a:
on the machine
.CW lab.pc
(which must support 9P) to the local directory
.CW /n/dos .
Then the file
.CW foo
can be written to the floppy just by copying it across.
.PP
Another application is remote debugging:
.P1
import helix /proc
.P2
makes the process file system on machine
.CW helix
available locally; commands such as
.CW ps
then see
.CW helix 's
processes instead of the local ones.
The debugger may then look at a remote process:
.P1
db /proc/27/text /proc/27/mem
.P2
allows breakpoint debugging of the remote process.
Since
.CW db
infers the CPU type of the process from the executable header on
the text file, it supports
cross-architecture debugging, too.
Care is taken within
.CW db
to handle issues of byte order and floating point; it is possible to
breakpoint debug a big-endian MIPS process from a little-endian i386.
.PP
Network interfaces are also implemented as file systems [Presotto].
For example,
.CW /net/tcp
is a directory somewhat like
.CW /proc :
it contains a set of numbered directories, one per connection,
each of which contains files to control and communicate on the connection.
A process allocates a new connection by accessing
.CW /net/tcp/clone ,
which evaluates to the directory of an unused connection.
To make a call, the process writes a textual message such as
.CW 'connect
.CW 135.104.53.2!512'
to the
.CW ctl
file and then reads and writes the
.CW data
file.
An
.CW rlogin
service can be implemented in a few lines of shell code.
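.PP
The textual nature of the interface means that placing a call is mostly
a matter of formatting strings.
The following portable C sketch builds the name of a file in a numbered
connection directory and the control message shown above.
The helper names
.CW connpath
and
.CW connectmsg
are hypothetical, and how a program learns the number of the directory
allocated through
.CW clone
is left out, since it is not described here.
.P1
```c
#include <stdio.h>

/*
 * Format the path of a file inside connection directory conn of a
 * network directory such as /net/tcp.  Returns 0, or -1 if buf is
 * too small.
 */
int
connpath(char *buf, size_t n, const char *net, int conn, const char *file)
{
	int len;

	len = snprintf(buf, n, "%s/%d/%s", net, conn, file);
	return (len < 0 || (size_t)len >= n) ? -1 : 0;
}

/* Format the textual control message that places a call. */
int
connectmsg(char *buf, size_t n, const char *addr, int port)
{
	int len;

	len = snprintf(buf, n, "connect %s!%d", addr, port);
	return (len < 0 || (size_t)len >= n) ? -1 : 0;
}
```
.P2
A program would write the message to the
.CW ctl
file and then read and write the
.CW data
file; those calls are omitted because they run only on Plan 9.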
.PP
This structure makes network gatewaying easy to provide.
We have machines with Datakit interfaces but no Internet interface.
On such a machine one may type
.P1
import helix /net
telnet tcp!ai.mit.edu
.P2
The
.CW import
uses Datakit to pull in the TCP interface from
.CW helix ,
which can then be used directly; the
.CW tcp!
notation is necessary because we routinely use multiple networks
and protocols on Plan 9\(emit identifies the network in which
.CW ai.mit.edu
is a valid name.
.PP
In practice we do not use
.CW rlogin
or
.CW telnet
between Plan 9 machines.  Instead a command called
.CW cpu
in effect replaces the CPU in a window with that
on another machine, typically a fast multiprocessor CPU server.
The implementation is to recreate the
name space on the remote machine, using the equivalent of
.CW import
to connect pieces of the terminal's name space to that of
the process (shell) on the CPU server, making the terminal
a file server for the CPU.
CPU-local devices such as fast file system connections
are still local; only terminal-resident devices are
imported.
The result is unlike UNIX
.CW rlogin ,
which moves into a distinct name space on the remote machine,
or file sharing with
.CW NFS ,
which keeps the name space the same but forces processes to execute
locally.
Bindings in
.CW /bin
may change because of a change in CPU architecture, and
the networks involved may be different because of differing hardware,
but the effect feels like simply speeding up the processor in the
current name space.
.SH
Position
.PP
These examples illustrate how the ideas of representing resources
as file systems and per-process name spaces can be used to solve
problems often left to more exotic mechanisms.
Nonetheless there are some operations in Plan 9 that are not
mapped into file I/O.
An example is process creation.
We could imagine a message to a control file in
.CW /proc
that creates a process, but the details of
constructing the environment of the new process \(em its open files,
name space, memory image, etc. \(em are too intricate to
be described easily in a simple I/O operation.
Therefore new processes on Plan 9 are created by fairly conventional
.CW rfork
and
.CW exec
system calls;
.CW /proc
is used only to represent and control existing processes.
.PP
Plan 9 does not attempt to map network name spaces into the file
system name space, for several reasons.
The different addressing rules for various networks and protocols
cannot be mapped uniformly into a hierarchical file name space.
Even if they could be,
the various mechanisms to authenticate,
select a service,
and control the connection would not map consistently into
operations on a file.
.PP
Shared memory is another resource not adequately represented by a
file name space.
Plan 9 takes care to provide mechanisms
to allow groups of local processes to share and map memory.
Memory is controlled
by system calls rather than special files, however,
since a representation in the file system would imply that memory could
be imported from remote machines.
.PP
Despite these limitations, file systems and name spaces offer an effective
model around which to build a distributed system.
Used well, they can provide a uniform, familiar, transparent
interface to a diverse set of distributed resources.
They carry well-understood properties of access, protection,
and naming.
The integration of devices into the hierarchical file system
was the best idea in UNIX.
Plan 9 pushes the concepts much further and shows that
file systems, when used inventively, have plenty of scope
for productive research.
.SH
References
.LP
[Killian] T. Killian, ``Processes as Files'', USENIX Summer Conf. Proc., Salt Lake City, 1984
.br
[Needham] R. Needham, ``Names'', in
.I
Distributed Systems,
.R
S. Mullender, ed.,
Addison Wesley, 1989
.br
[Pike90] R. Pike, D. Presotto, K. Thompson, H. Trickey,
``Plan 9 from Bell Labs'',
UKUUG Proc. of the Summer 1990 Conf.,
London, England,
1990
.br
[Presotto] D. Presotto, ``Multiprocessor Streams for Plan 9'',
UKUUG Proc. of the Summer 1990 Conf.,
London, England,
1990
.br
[Pike91] R. Pike, ``8.5, The Plan 9 Window System'', USENIX Summer
Conf. Proc., Nashville, 1991