code: 9ferno

ref: 4142b62431afba148dba3473c32f7063dce861e4
dir: /doc/styx.ms/

View raw version
.ds TM \u\s-2TM\s+2\d
.nr dT 6
.nr XT 6
.TL
The Styx Architecture for Distributed Systems
.AU
Rob Pike
Dennis M. Ritchie
.AI
Computing Science Research Center
Lucent Technologies, Bell Labs
Murray Hill, New Jersey
USA
.FS
.FA
Originally appeared in
.I "Bell Labs Technical Journal" ,
Vol. 4,
No. 2,
April-June 1999,
pp. 146-152.
.br
Copyright © 1999 Lucent Technologies Inc.  All rights reserved.
.FE
.AB
A distributed system is constructed from a set of relatively
independent components that form a unified, but geographically and
functionally diverse entity.  Examples include networked operating
systems, Internet services, the national telephone
switching system, and in general
all the technology using today's diverse digital
networks.  Nevertheless, distributed systems remain difficult
to design, build, and maintain, primarily because of the lack
of a clean, perspicuous interconnection model for the
components.
.LP
Our experience with two distributed operating systems,
Plan 9 and Inferno, encourages us to propose such a model.
These systems depend on, advocate, and generally push to the
limit a fruitful idea: to present their
resources as files in a hierarchical name space.
The objects appearing as files may represent stored data, but may
also be devices, dynamic information sources, interfaces to services,
and control points.  The approach unifies and provides basic naming,
structuring, and access control mechanisms for all system resources.
A simple underlying network protocol, Styx, forms
the core of the architecture by presenting a common
language for communication within the system.
.LP
Even within non-distributed systems, the presentation of services
as files advantageously extends a familiar scheme for naming, classifying,
and connecting to system resources.
More important, the approach provides a natural way to build
distributed systems, by using well-known technology for attaching
remote file systems.
If resources are represented as files,
and there are remote file systems, one has
a distributed system: resources available in one place
are usable from another.
.AE
.SH
Introduction
.LP
The Styx protocol is a variant of a protocol called
.I 9P
that
was developed for the Plan 9 operating system[9man].
For simplicity, we will use the name
Styx throughout this paper; the difference concerns only the initialization of
a connection.
.LP
The original idea behind Styx was to encode file operations between
client programs and the file system,
to be translated into messages for transmission on a computer network.
Using this technology,
Plan 9 separates the file server\(ema central repository for
permanent file storage\(emboth from the CPU server\(ema large
shared-memory multiprocessor\(emand from the user terminals.
This physical separation of function was central to the original
design of the system;
what was unexpected was how well the model could be used to
solve a wide variety of problems not usually thought of as
file system issues.
.LP
The breakthrough was to realize that by representing
a computing resource as a form of file system,
many of the difficulties of making that resource available
across the network would disappear naturally, because
Styx could export the resource transparently.
For example,
the Plan 9 window system,
.CW 8½
[Pike91],
is implemented as a dynamic file server that publishes
files with names like
.CW /dev/mouse
and
.CW /dev/screen
to provide access to the local hardware.
The
.CW /dev/mouse
file, for instance,
may be opened and read like a regular file, in the manner of UNIX\*(TM device
files, but under
.CW 8½
it is multiplexed: each client program has a private
.CW /dev/mouse
file that returns mouse events only when the client's window
is the active one on the display.
This design provides a clean, simple mechanism for controlling
access to the mouse.
Its real strength, though, is that the representation of the window system's
resources as files allows Styx to make those resources available across the
network.
For example, an interactive graphics program may be run on a CPU server
simply by having
.CW 8½
serve the appropriate files to that machine.
.LP
Note that although the resources published by Styx behave like files\(emthey
have file names, file permissions, and file access methods\(emthey do not
need to exist as standard files on disk.
The
.CW /dev/mouse
file is accessed by standard file I/O mechanisms but is nonetheless a
transient object fabricated dynamically by a running program;
it has no permanent existence.
.LP
By following this approach throughout the system, Plan 9 achieves
a remarkable degree of transparency in the distribution of resources[PPTTW93].
Besides interactive graphics, services such as debugging, maintenance,
file backup, and even access to the underlying network hardware
can be made available across the network using Styx, permitting
the construction of distributed applications and services
using nothing more sophisticated than file I/O.
.SH
The Styx protocol
.LP
Styx's place in the world is analogous to
Sun NFS[RFC][NFS] or Microsoft CIFS[CIFS], although it is simpler and easier to implement
[Welc94].
Furthermore, NFS and CIFS are designed for sharing regular disk files; NFS in particular
is intimately tied to the implementation and caching strategy
of the underlying UNIX file system.
Unlike Styx, NFS and CIFS are clumsier at exporting dynamic device-like
files such as
.CW /dev/mouse .
.LP
Styx provides a view of a hierarchical, tree-shaped
file system name space[Nee89], together with access information about
the files (permissions, sizes, dates) and the means to read and write
the files.
Its users (that is, the people who write application programs),
don't see the protocol itself; instead they see files that they
read and write, and that provide information or change information.
.LP
In use, a Styx
.I client
is an entity on one machine that establishes communication with
another entity, the
.I server ,
on the same or another machine.
The client mechanisms may be built into the operating system, as they
are in Plan 9 or Inferno[INF1][INF2], or into application libraries;
a server may be part of the operating system, or just as often
may be application code on a separate server machine.  In any case, the
client and server entities
communicate by exchanging messages, and the effect is that the client
sees a hierarchical file system that exists on the server.
The Styx protocol is the specification of the messages that are exchanged.
.LP
At one level, Styx consists of messages of 13 types for
.RS
.IP \(bu
Starting communication (attaching to a file system);
.IP \(bu
Navigating the file system (that is, specifying and
gaining a handle for a named file);
.IP \(bu
Reading and writing a file; and
.IP \(bu
Performing file status inquiries and changes
.RE
.LP
However, application writers simply code requests to open, read, or write
files; a library or the operating system translates the requests
into the necessary byte sequences transmitted over a communication
channel.  The Styx protocol proper specifies the interpretation of these
byte sequences.  It fits, approximately, at the OSI Session Layer level
of the ISO standard classification.
Its specification is independent of most details of machine architecture
and it has been successfully used among machines of varying instruction
sets and data layout.
The protocol is summarized in Table 1.
.KF
.TS
center box;
l l
--
lfCW l.
Name	Description
attach	Authenticate user of connection; return FID
clone	Duplicate FID
walk	Advance FID one level of name hierarchy
open	Check permissions for file I/O
create	Create new file
read	Read contents of file
write	Write contents of file
close	Discard FID
remove	Remove file
stat	Report file state: permissions, etc.
wstat	Modify file state
error	Return error condition for failed operation
flush	Disregard outstanding I/O requests
.TE
.ce 100
.ps -1
Table 1. Summary of Styx messages.
.ps
.ce 0
.KE
.LP
In use, an operation such as
.P1
open("/usr/rob/.profile", O_READ);
.P2
is translated by the underlying system into a sequence of Styx messages.
After establishing the initial connection to the
file server, an
.CW attach
message authenticates the user (the person or agent accessing the files) and
returns an object called a
.CW FID
(file ID) that represents the root of the hierarchy on the server.
When the
.CW open()
operation is executed, it proceeds as follows.
.RS
.IP \(bu
A
.CW clone
message duplicates the root
.CW FID ,
returning a new
.CW FID
that can navigate the hierarchy without losing the connection to the root.
.IP \(bu
The new
.CW FID
is then moved to the file
.CW /usr/rob/.profile
by a sequence of
.CW walk
messages that step along, one path component at a time
.CW usr , (
.CW rob ,
.CW .profile ).
.IP \(bu
Finally, an
.CW open
message checks that the user has permission to read the file,
permitting subsequent
.CW read
and
.CW write
operations (messages) on the
.CW FID .
.IP \(bu
Once I/O is completed, the
.CW close
message will release the
.CW FID .
.RE
.LP
At a lower level, implementations of Styx depend only on a reliable,
byte-stream Transport communications layer. For example, it runs over either
TCP/IP, the standard transmission control protocol
and Internet protocol,
or Internet link (IL), which is a sequenced, reliable datagram protocol
using IP packets.
It is worth emphasizing, though, that the model does not require the
existence of a network to join the components; Styx runs fine
over a Unix pipe or even using shared memory.
The strength of the approach is not so much how it works over a network
as that its behavior over a network is identical to its behavior locally.
.SH 
Architectural approach
.LP
Styx, as a file system protocol, is merely a component in a
more encompassing approach
to system design: the presentation of resources as files.
This approach will be discussed using a sequence of examples.
.SH
.I "Example: networking
.LP
As an example, access to a TCP/IP network in Inferno and Plan 9 systems
appears as a piece of a file system, with (abbreviated) structure
as follows[PrWi93]:
.P1
/net/
	dns/
	tcp/
		clone
		stats
		0/
			ctl
			status
			data
			listen
		1/
			...		
		...
	ether0/
		0/
			ctl
			status
			...
		1/
			...
	...
.P2
This represents a file system structure in which one can name, read, and write `files' with
names like
.CW /net/dns ,
.CW /net/tcp/clone ,
.CW /net/tcp/0/ctl
and so on;
there are directories of files
.CW /net/tcp
and
.CW /net/ether0 .
On the machine that actually has the network interface, all of these
things that look like files are constructed by the kernel drivers that maintain
the TCP/IP stack; they are not real files on a disk.
Operations on the `files' turn into operations sent to the device drivers.
.LP
Suppose an application wishes to establish a connection over TCP/IP to
.CW www.bell-labs.com .
The first task is to translate the domain name
.CW www.bell-labs.com
to a numerical internet address; this is a complicated process, generally
involving communicating with local and remote Domain Name Servers.
In the Styx model, this is done by opening the file
.CW /dev/dns
and writing the literal string
.CW www.bell-labs.com
on the file; then the same file is read.
It will return the string
.CW 204.178.16.5
as a sequence of 12 characters.
.LP
Once the numerical Internet address is acquired, the connection must be established;
this is done by opening
.CW /net/tcp/clone
and reading from it a string that specifies a directory like
.CW /net/tcp/43 ,
which represents a new, unique TCP/IP channel.
To establish the connection,
write a message like
.CW "connect 204.178.16.5
on the control file for that connection,
.CW /net/tcp/43/ctl .
Subsequently, communication with
.CW www.bell-labs.com
is done by reading and
writing on the file
.CW /net/tcp/43/data .
.LP
There are several things to note about this approach.
.RS
.IP \(bu
All the interface points look like files, and are
accessed by the same I/O mechanisms already available in
programming languages like C, C++, or Java. However, they do not
correspond to ordinary data files on disk, but instead are creations
of a middleware code layer.
.IP \(bu
Communication across the interface, by convention, uses printable character strings where
feasible instead of binary information.  This means that the syntax
of communication does not depend on CPU architecture or language details.
.IP \(bu
Because the interface, as in this example with
.CW /net
as the interface with networking facilities, looks like a piece of a
hierarchical file system, it can easily and nearly automatically
be exported to a remote machine and used from afar.
.RE
.LP
In particular, the Styx implementation encourages a natural way of providing
controlled access to networks.
Lucent, like many organizations, has an internal network not
accessible to the international Internet, and has a few
gateways between the inside and outside networks.
Only the gateway machines are connected to both, and they implement
the administrative controls for safety and security.
The advantage of the Styx model is the ease with which
the outside Internet can be used from inside.
If the
.CW /net
file tree described above is provided on a gateway machine,
it can be used as a remote file system from machines on the
inside.  This is safe, because this connection is one-way:
inside machines can see the external network interfaces,
but outside machines cannot see the inside.
.SH
.I "Example: debugging
.LP
A similar approach, borrowed and generalized from the UNIX
system [Kill], is useful for controlling and discovering the status
of the running processes in the operating system.
Here a directory
.CW /proc
contains a subdirectory for each process running on the
system; the names of the subdirectories correspond to
process IDs:
.P1
/proc/
	1/
		status
		ctl
		fd
		text
		mem
		...
	2/
		status
		ctl
		...
	...
.P2
The file names in the process directories refer to various aspects
of the corresponding process:
.CW status
contains information about the state of the process;
.CW ctl ,
when written, performs operations like pausing, restarting,
or killing the process;
.CW fd
names and describes the files open in the process;
.CW text
and
.CW mem
represent the program code and the data respectively.
.LP
Where possible, the information and control are again
represented as text strings.  For example, one line
from the
.CW status
file of a typical process might be
.DS
.CW "samterm dmr Read 0 20 2478910 0 0 ...
.DE
which shows the name of the program, the owner, its state, and several numbers
representing CPU time in various categories.
.LP
Once again, the approach provides several payoffs.
Because process information is represented in file form,
remote debugging (debugging programs on another machine)
is possible immediately by remote-mounting the
.CW /proc
tree on another machine.
The machine-independent representation of information means
that most operations work properly even if the remote machine
uses a different CPU architecture from the one doing the
debugging.
Most of the programs that deal
with status and control contain no machine-dependent parts
and are completely portable.
(A few are not, however: no attempt is made to render the
memory data or instructions in machine-independent form.)
.SH
.I "Example: PathStar\*(TM Access Server
.LP
The data shelf of Lucent's PathStar Access Server[PATH] uses Styx to connect
the line cards and other devices on the shelf to the control computer.
In fact, Styx is the protocol for high-level communication on the backplane.
.LP
The file system hierarchy served by the control computer includes a structure
like this:
.P1
/trip/
	config
	admin/
		ospfctl
		...
	boot/
		0/
			ctl
			eeprom
			memory
			msg
			pack
			alarm
			...
		1/
			...
/net/
	...
.P2
The directories under
.CW /net
are similar to those in Plan 9 or Inferno; they form the interface to the
external IP network.
The
.CW /trip
hierarchy represents the control structure of the shelf.
.LP
The subdirectories under
.CW /trip/boot
each provide access to one of the line cards or other devices in the shelf.
For example, to initialize a card one writes the text string
.CW reset
to the
.CW ctl
file of the card, while bootstrapping is done by copying the control
software for the card into the
.CW memory
file and writing a
.CW reset
message to
.CW ctl .
Once the line card is running,
the other files present an interface to the higher-level structure of the device:
.CW pack
is the port through which IP packets are transferred to and from the card,
.CW alarm
may be read to discover outstanding conditions on the card, and so on.
.LP
All this structure is exported from the shelf using Styx.
The external element management software (EMS) controls and monitors the
shelf using Styx operations.
For example, the EMS may read
.CW /trip/boot/7/alarm
and discover a diagnostic condition.
By reading and writing the other files under
.CW /trip/boot/7/ ,
the card may be taken off line, diagnosed, and perhaps reset or substituted,
all from the system running the EMS, which may be elsewhere in the network.
.LP
Another example is the implementation of SNMP in the PathStar Access Server.
The functionality of SNMP is usually distributed through the various components
of a network, but here it is a straightforward adaption process,
running anywhere in the network, that translates SNMP requests to Styx
operations in the network element.
Besides dramatically simplifying the implementation, the natural
ability for aggregation permits
a single process to provide SNMP access to an arbitrarily complex network subsystem.
Yet the structure is secure: the file-oriented nature of the operations make it
easy to establish standard authentication and security controls to guarantee
that only trusted parties have access to the SNMP operations.
.LP
There are local benefits to this architecture, as well.
Styx provides a single point in the design where control can be separated
from the details of the underlying fabric, isolating both from changes in the
other.  Components become more adaptable: software can be upgraded
without worrying about hidden dependencies on the hardware,
and new hardware may be installed without updating the control
software above.
.SH
Security issues
.LP
Styx provides several security mechanisms for
discouraging hostile or accidental actions that injure the integrity
of a system.
.LP
The underlying file-communication protocol includes
user and group identifiers that a server may check against
other authentication.
For example, a server may check, on a request to open a file,
that the user ID associated with the request is permitted to
perform the operation.
This mechanism is familiar from general-purpose operating
systems, and its use is well-known.
It depends on passwords or stronger mechanisms for authenticating
the identity of clients.
.LP
The Styx approach of providing remote resources
as file systems over a network encourages genuinely secure access
to the resources in a way transparent to applications, so that
authentication transactions need not be provided as part of each.
For example, in Inferno, the negotiation of an initial connection
between client and server may include installation of any of
several encrypting or message-digesting protocols that
supervise the channel.
All application use of the resources provided by the server
is then protected against interference, and the server
has strong assurance that its facilities are being used in
an authorized way.
This is relevant both for general-purpose file servers,
and, in the telephony field, is especially useful for safe
remote administration.
.SH
Summary
.LP
Presentation of resources as a piece of a possibly remote file system
is an attractive way of creating distributed systems that treads a
path between two extremes:
.IP 1
All communication with other parts of the system is by
explicit messages sent between components.
This communication differs in style from applications' use
of local resources.
.IP 2
All communication is by means of
closely shared resources: the CPU-addressable memory in
various parts is made directly available across a big network;
applications can read and write far-away objects exactly as
they do those on the same motherboard as their own CPU.
.LP
Something like the first of these extremes is usually more evident
in today's systems, although either the operating system or software
layered upon it usually paper over some of the rough spots.
The second remains more difficult to approach, because
networks (especially big ones like the Internet) are not very
reliable, and because
the machines on them are diverse in processor architecture
and in installed software.
.LP
The design plan described and advocated in this paper
lies between the two extremes.
It has these advantages:
.IP \(bu
.I "A simple, familiar programming model for reading and writing named files" .
File systems have well-defined naming, access, and permissions structures.
.IP \(bu
.I "Platform and language independence" .
Underlying access to resources is
at the file level, which is provided nearly everywhere, instead
of depending on facilities available only with particular languages
or operating systems.
C++ or Java classes, and C libraries can be constructed
to access the facilities.
.IP \(bu
.I "A hierarchical naming and access control structure" .
This encourages clean
and well-structured design of resource naming and access.
.IP \(bu
.I "Easy testing and debugging" .
By using well-specified, narrow interfaces
at the file level, it is straightforward to observe the communication
between distributed entities.
.IP \(bu
.I "Low cost" .
Support software, at both client and server,
can be written in a few thousand lines
of code, and will occupy only small space in products.
.LP
This approach to building systems is successful in the general-purpose
systems Plan 9 and Inferno;
it has also been used to construct systems specialized for telephony, such
as Mantra[MAN] and the PathStar Access Server.
It supplies a coherent, extensible structure both to the internal communications
within a single system and external communication between heterogeneous
components of a large digital network.
.LP
.SH
References
.nr PS -1
.nr VS -1
.IP [NFS] 11
R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and
B. Lyon,
``Design and Implementation of the Sun Network File System'',
.I "Proc. Summer 1985 USENIX Conf." ,
Portland, Oregon, June 1985,
pp. 119-130.
.IP [RFC] 11
Internet RFC 1094.
.IP [9man] 11
.I "Plan 9 Programmer's Manual" ,
Second Edition,
Vol. 1 and 2,
Bell Laboratories,
Murray Hill, N.J.,
1995.
.IP [Kill84] 11
T. J. Killian,
``Processes as Files'',
.I "Proc. Summer 1984 USENIX Conf." ,
June 1984, Salt Lake City, Utah, June 1984, pp. 203-207.
.IP [Pike91] 11
R. Pike,
``8½, the Plan 9 Window System'',
.I "Proc. Summer 1991 USENIX Conf." ,
Nashville TN, June 1991, pp. 257-265.
.IP "[PPTTW93] " 11
R. Pike, D.L. Presotto, K. Thompson, H. Trickey, and P. Winterbottom, ``The Use of Name Spaces in Plan 9'',
.I "Op. Sys. Rev." ,
Vol. 27, No. 2, April 1993, pp. 72-76.
.IP [PrWi93] 11
D. L. Presotto and P. Winterbottom,
``The Organization of Networks in Plan 9'',
.I "Proc. Winter 1993 USENIX Conf." ,
San Diego, Calif., Jan. 1993, pp. 43-50.
.IP [Nee89] 11
R. Needham, ``Names'', in
.I "Distributed systems" ,
edited by S. Mullender,
Addison-Wesley,
Reading, Mass., 1989, pp. 89-101.
.IP [CIFS]
Paul Leach and Dan Perry, ``CIFS: A Common Internet File System'', Nov. 1996,
.I "http://www.microsoft.com/mind/1196/cifs.htm" .
.IP [INF1]
.I "Inferno Programmer's Manual",
Third Edition,
Vol. 1 and 2, Vita Nuova Holdings Limited, York, England, 2000.
.IP [INF2]
S.M. Dorward, R. Pike, D. L. Presotto, D. M. Ritchie, H. Trickey,
and P. Winterbottom, ``The Inferno Operating System'',
.I "Bell Labs Technical Journal"
Vol. 2,
No. 1,
Winter 1997.
.IP [MAN]
R. A. Lakshmi-Ratan,
``The Lucent Technologies Softswitch\-Realizing the Promise of Convergence'',
.I "Bell Labs Technical Journal" ,
Vol. 4,
No. 2,
April-June 1999,
pp. 174-196.
.IP [PATH]
J. M. Fossaceca, J. D. Sandoz, and P. Winterbottom,
``The PathStar Access Server: Facilitating Carrier-Scale Packet Telephony'',
.I "Bell Labs Technical Journal" ,
Vol. 3,
No. 4,
October-December 1998,
pp. 86-102.
.IP [Welc94]
B. Welch,
``A Comparison of Three Distributed File System Architectures: Vnode, Sprite, and Plan 9'',
.I "Computing Systems" ,
Vol. 7, No. 2, pp. 175-199 (1994).
.nr PS +1
.nr VS +1