code: pyhg

ref: 5513dae7d9eee46f7e1843c8eeeaf9066d0ec632
dir: /sys/src/cmd/python/Doc/lib/compiler.tex/

View raw version
\chapter{Python compiler package \label{compiler}}

\sectionauthor{Jeremy Hylton}{jeremy@zope.com}


The Python compiler package is a tool for analyzing Python source code
and generating Python bytecode.  The compiler contains libraries to
generate an abstract syntax tree from Python source code and to
generate Python bytecode from the tree.

The \refmodule{compiler} package is a Python source to bytecode
translator written in Python.  It uses the built-in parser and
standard \refmodule{parser} module to generated a concrete syntax
tree.  This tree is used to generate an abstract syntax tree (AST) and
then Python bytecode.

The full functionality of the package duplicates the builtin compiler
provided with the Python interpreter.  It is intended to match its
behavior almost exactly.  Why implement another compiler that does the
same thing?  The package is useful for a variety of purposes.  It can
be modified more easily than the builtin compiler.  The AST it
generates is useful for analyzing Python source code.

This chapter explains how the various components of the
\refmodule{compiler} package work.  It blends reference material with
a tutorial.

The following modules are part of the \refmodule{compiler} package:

\localmoduletable


\section{The basic interface}

\declaremodule{}{compiler}

The top-level of the package defines four functions.  If you import
\module{compiler}, you will get these functions and a collection of
modules contained in the package.

\begin{funcdesc}{parse}{buf}
Returns an abstract syntax tree for the Python source code in \var{buf}.
The function raises \exception{SyntaxError} if there is an error in the
source code.  The return value is a \class{compiler.ast.Module} instance
that contains the tree.  
\end{funcdesc}

\begin{funcdesc}{parseFile}{path}
Return an abstract syntax tree for the Python source code in the file
specified by \var{path}.  It is equivalent to
\code{parse(open(\var{path}).read())}.
\end{funcdesc}

\begin{funcdesc}{walk}{ast, visitor\optional{, verbose}}
Do a pre-order walk over the abstract syntax tree \var{ast}.  Call the
appropriate method on the \var{visitor} instance for each node
encountered.
\end{funcdesc}

\begin{funcdesc}{compile}{source, filename, mode, flags=None, 
			dont_inherit=None}
Compile the string \var{source}, a Python module, statement or
expression, into a code object that can be executed by the exec
statement or \function{eval()}. This function is a replacement for the
built-in \function{compile()} function.

The \var{filename} will be used for run-time error messages.

The \var{mode} must be 'exec' to compile a module, 'single' to compile a
single (interactive) statement, or 'eval' to compile an expression.

The \var{flags} and \var{dont_inherit} arguments affect future-related
statements, but are not supported yet.
\end{funcdesc}

\begin{funcdesc}{compileFile}{source}
Compiles the file \var{source} and generates a .pyc file.
\end{funcdesc}

The \module{compiler} package contains the following modules:
\refmodule[compiler.ast]{ast}, \module{consts}, \module{future},
\module{misc}, \module{pyassem}, \module{pycodegen}, \module{symbols},
\module{transformer}, and \refmodule[compiler.visitor]{visitor}.

\section{Limitations}

There are some problems with the error checking of the compiler
package.  The interpreter detects syntax errors in two distinct
phases.  One set of errors is detected by the interpreter's parser,
the other set by the compiler.  The compiler package relies on the
interpreter's parser, so it get the first phases of error checking for
free.  It implements the second phase itself, and that implementation is
incomplete.  For example, the compiler package does not raise an error
if a name appears more than once in an argument list: 
\code{def f(x, x): ...}

A future version of the compiler should fix these problems.

\section{Python Abstract Syntax}

The \module{compiler.ast} module defines an abstract syntax for
Python.  In the abstract syntax tree, each node represents a syntactic
construct.  The root of the tree is \class{Module} object.

The abstract syntax offers a higher level interface to parsed Python
source code.  The \ulink{\module{parser}}
{http://www.python.org/doc/current/lib/module-parser.html}
module and the compiler written in C for the Python interpreter use a
concrete syntax tree.  The concrete syntax is tied closely to the
grammar description used for the Python parser.  Instead of a single
node for a construct, there are often several levels of nested nodes
that are introduced by Python's precedence rules.

The abstract syntax tree is created by the
\module{compiler.transformer} module.  The transformer relies on the
builtin Python parser to generate a concrete syntax tree.  It
generates an abstract syntax tree from the concrete tree.  

The \module{transformer} module was created by Greg
Stein\index{Stein, Greg} and Bill Tutt\index{Tutt, Bill} for an
experimental Python-to-C compiler.  The current version contains a
number of modifications and improvements, but the basic form of the
abstract syntax and of the transformer are due to Stein and Tutt.

\subsection{AST Nodes}

\declaremodule{}{compiler.ast}

The \module{compiler.ast} module is generated from a text file that
describes each node type and its elements.  Each node type is
represented as a class that inherits from the abstract base class
\class{compiler.ast.Node} and defines a set of named attributes for
child nodes.

\begin{classdesc}{Node}{}
  
  The \class{Node} instances are created automatically by the parser
  generator.  The recommended interface for specific \class{Node}
  instances is to use the public attributes to access child nodes.  A
  public attribute may be bound to a single node or to a sequence of
  nodes, depending on the \class{Node} type.  For example, the
  \member{bases} attribute of the \class{Class} node, is bound to a
  list of base class nodes, and the \member{doc} attribute is bound to
  a single node.
  
  Each \class{Node} instance has a \member{lineno} attribute which may
  be \code{None}.  XXX Not sure what the rules are for which nodes
  will have a useful lineno.
\end{classdesc}

All \class{Node} objects offer the following methods:

\begin{methoddesc}{getChildren}{}
  Returns a flattened list of the child nodes and objects in the
  order they occur.  Specifically, the order of the nodes is the
  order in which they appear in the Python grammar.  Not all of the
  children are \class{Node} instances.  The names of functions and
  classes, for example, are plain strings.
\end{methoddesc}

\begin{methoddesc}{getChildNodes}{}
  Returns a flattened list of the child nodes in the order they
  occur.  This method is like \method{getChildren()}, except that it
  only returns those children that are \class{Node} instances.
\end{methoddesc}

Two examples illustrate the general structure of \class{Node}
classes.  The \keyword{while} statement is defined by the following
grammar production: 

\begin{verbatim}
while_stmt:     "while" expression ":" suite
               ["else" ":" suite]
\end{verbatim}

The \class{While} node has three attributes: \member{test},
\member{body}, and \member{else_}.  (If the natural name for an
attribute is also a Python reserved word, it can't be used as an
attribute name.  An underscore is appended to the word to make it a
legal identifier, hence \member{else_} instead of \keyword{else}.)

The \keyword{if} statement is more complicated because it can include
several tests.  

\begin{verbatim}
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
\end{verbatim}

The \class{If} node only defines two attributes: \member{tests} and
\member{else_}.  The \member{tests} attribute is a sequence of test
expression, consequent body pairs.  There is one pair for each
\keyword{if}/\keyword{elif} clause.  The first element of the pair is
the test expression.  The second elements is a \class{Stmt} node that
contains the code to execute if the test is true.

The \method{getChildren()} method of \class{If} returns a flat list of
child nodes.  If there are three \keyword{if}/\keyword{elif} clauses
and no \keyword{else} clause, then \method{getChildren()} will return
a list of six elements: the first test expression, the first
\class{Stmt}, the second text expression, etc.

The following table lists each of the \class{Node} subclasses defined
in \module{compiler.ast} and each of the public attributes available
on their instances.  The values of most of the attributes are
themselves \class{Node} instances or sequences of instances.  When the
value is something other than an instance, the type is noted in the
comment.  The attributes are listed in the order in which they are
returned by \method{getChildren()} and \method{getChildNodes()}.

\input{asttable}


\subsection{Assignment nodes}

There is a collection of nodes used to represent assignments.  Each
assignment statement in the source code becomes a single
\class{Assign} node in the AST.  The \member{nodes} attribute is a
list that contains a node for each assignment target.  This is
necessary because assignment can be chained, e.g. \code{a = b = 2}.
Each \class{Node} in the list will be one of the following classes: 
\class{AssAttr}, \class{AssList}, \class{AssName}, or
\class{AssTuple}. 

Each target assignment node will describe the kind of object being
assigned to:  \class{AssName} for a simple name, e.g. \code{a = 1}.
\class{AssAttr} for an attribute assigned, e.g. \code{a.x = 1}.
\class{AssList} and \class{AssTuple} for list and tuple expansion
respectively, e.g. \code{a, b, c = a_tuple}.

The target assignment nodes also have a \member{flags} attribute that
indicates whether the node is being used for assignment or in a delete
statement.  The \class{AssName} is also used to represent a delete
statement, e.g. \class{del x}.

When an expression contains several attribute references, an
assignment or delete statement will contain only one \class{AssAttr}
node -- for the final attribute reference.  The other attribute
references will be represented as \class{Getattr} nodes in the
\member{expr} attribute of the \class{AssAttr} instance.

\subsection{Examples}

This section shows several simple examples of ASTs for Python source
code.  The examples demonstrate how to use the \function{parse()}
function, what the repr of an AST looks like, and how to access
attributes of an AST node.

The first module defines a single function.  Assume it is stored in
\file{/tmp/doublelib.py}. 

\begin{verbatim}
"""This is an example module.

This is the docstring.
"""

def double(x):
    "Return twice the argument"
    return x * 2
\end{verbatim}

In the interactive interpreter session below, I have reformatted the
long AST reprs for readability.  The AST reprs use unqualified class
names.  If you want to create an instance from a repr, you must import
the class names from the \module{compiler.ast} module.

\begin{verbatim}
>>> import compiler
>>> mod = compiler.parseFile("/tmp/doublelib.py")
>>> mod
Module('This is an example module.\n\nThis is the docstring.\n', 
       Stmt([Function(None, 'double', ['x'], [], 0,
                      'Return twice the argument', 
                      Stmt([Return(Mul((Name('x'), Const(2))))]))]))
>>> from compiler.ast import *
>>> Module('This is an example module.\n\nThis is the docstring.\n', 
...    Stmt([Function(None, 'double', ['x'], [], 0,
...                   'Return twice the argument', 
...                   Stmt([Return(Mul((Name('x'), Const(2))))]))]))
Module('This is an example module.\n\nThis is the docstring.\n', 
       Stmt([Function(None, 'double', ['x'], [], 0,
                      'Return twice the argument', 
                      Stmt([Return(Mul((Name('x'), Const(2))))]))]))
>>> mod.doc
'This is an example module.\n\nThis is the docstring.\n'
>>> for node in mod.node.nodes:
...     print node
... 
Function(None, 'double', ['x'], [], 0, 'Return twice the argument',
         Stmt([Return(Mul((Name('x'), Const(2))))]))
>>> func = mod.node.nodes[0]
>>> func.code
Stmt([Return(Mul((Name('x'), Const(2))))])
\end{verbatim}

\section{Using Visitors to Walk ASTs}

\declaremodule{}{compiler.visitor}

The visitor pattern is ...  The \refmodule{compiler} package uses a
variant on the visitor pattern that takes advantage of Python's
introspection features to eliminate the need for much of the visitor's
infrastructure.

The classes being visited do not need to be programmed to accept
visitors.  The visitor need only define visit methods for classes it
is specifically interested in; a default visit method can handle the
rest. 

XXX The magic \method{visit()} method for visitors.

\begin{funcdesc}{walk}{tree, visitor\optional{, verbose}}
\end{funcdesc}

\begin{classdesc}{ASTVisitor}{}

The \class{ASTVisitor} is responsible for walking over the tree in the
correct order.  A walk begins with a call to \method{preorder()}.  For
each node, it checks the \var{visitor} argument to \method{preorder()}
for a method named `visitNodeType,' where NodeType is the name of the
node's class, e.g. for a \class{While} node a \method{visitWhile()}
would be called.  If the method exists, it is called with the node as
its first argument.

The visitor method for a particular node type can control how child
nodes are visited during the walk.  The \class{ASTVisitor} modifies
the visitor argument by adding a visit method to the visitor; this
method can be used to visit a particular child node.  If no visitor is
found for a particular node type, the \method{default()} method is
called. 
\end{classdesc}

\class{ASTVisitor} objects have the following methods:

XXX describe extra arguments

\begin{methoddesc}{default}{node\optional{, \moreargs}}
\end{methoddesc}

\begin{methoddesc}{dispatch}{node\optional{, \moreargs}}
\end{methoddesc}

\begin{methoddesc}{preorder}{tree, visitor}
\end{methoddesc}


\section{Bytecode Generation}

The code generator is a visitor that emits bytecodes.  Each visit method
can call the \method{emit()} method to emit a new bytecode.  The basic
code generator is specialized for modules, classes, and functions.  An
assembler converts that emitted instructions to the low-level bytecode
format.  It handles things like generator of constant lists of code
objects and calculation of jump offsets.