DESIGN OF THE RSCHEME COMPILER

Paul R. Wilson and Donovan Kolbly

The RScheme compiler is designed to be relatively straightforward but relatively complete. It is intended to be easy enough to understand that it can be modified for a variety of purposes, but complete enough to generate reasonably good code and serve as an illustration of basic issues in compiling dynamic languages like Scheme and Dylan. These features, combined with reasonable documentation, should make it a good system for teaching and experimenting with techniques in programming language design and implementation.

Parts of the design are intended to allow experimentation with particular implementation strategies we find interesting, but mostly the design has been guided by an attempt to make things simple, "right" and "clean", as long as that didn't sacrifice portability and performance. (Obviously, this is a juggling act that involves the exercise of great taste and exquisite judgement, and it's faintly possible that someone might disagree with some of our design decisions.) One of our design principles has been to attempt to do the 20% of the work that gets 80% of the performance benefit, and to try not to slide down the slippery slope into a very complex compiler design.

One of the eventual goals of RScheme is to be a vehicle for implementing multiple languages. Its syntactic extension mechanism is integrated into the compiler in a principled way, and we intend to extend this mechanism to turn the RScheme compiler into a highly programmable "open" compiler. The front-end of the compiler will be reprogrammable according to a relatively simple and clear protocol, exploiting a simple model of the front end of the compiler. (If you're familiar with the notion of metaobject protocols, you can view this as a similar notion, but allowing the programmer to control things like scope and typing, rather than just things like inheritance and dispatching. The goal is to let programmers extend or change the language in a principled way, or to reimplement parts of the language and perform a variety of high-level optimizations without changing the guts of the compiler.) Given this mechanism, it is also trivial to implement a facility for defining inline procedures, because most of the work is just a simple special case of the work done for syntax extension.

(Our lexically-scoped syntactic extension mechanism can achieve the functionality of Scheme's "hygienic macros", but it does it by a very different mechanism, integrated with the normal scoping mechanisms of a conventional Scheme compiler. We believe that this is the "right" way to do it, since most of the work done by conventional hygienic macroexpansion is replication of the work done by the normal scoping mechanisms in the compiler. In principle, our system should require less redundant code than a conventional macroexpansion prepass, and provide a more powerful framework for compiler extensibility. It can be viewed as an extension of the "syntactic closures" concept of Bawden and Rees, which we believe was a fundamentally superior approach to the earlier "hygienic macroexpansion" work by Kohlbecker et al. and later work by Clinger and Rees.)

The compiler has two back ends: one that generates C code which can be compiled by a C compiler to get machine language, and one that generates CLOSURE-THREADED code.
Closure-threading is just a means of constructing executable procedures by creating closures of existing code; it avoids the need to actually generate machine code, because it just ``glues together'' existing code by creating closures in properly structured environments.

Compiling to C is attractive for two main reasons:

  * it is extremely portable. We generate "nearly" machine-independent C, using typedefs and #define's to allow us to hide differences between C compilers. (For example, we define our own type "int32" that represents a 32-bit integer, which may be mapped onto an int or a long int, whichever is 32 bits for that C compiler.)

  * it allows us to exploit the C compiler's optimizations to some degree, especially register allocation and instruction scheduling. We can benefit from the C register allocator by compiling nested Scheme expressions to nested C expressions in many common circumstances. The temporary values will then be handled by the C compiler, and allocated to machine registers in most cases.

There are also two disadvantages of compiling to C, however:

  * it's not the ideal intermediate language for compiling languages like Scheme, because of the mismatch between the semantics of the two languages, and

  * there is no portable, lightweight mechanism for compiling C code and linking the result into a running system. This is a problem in an interactive system, especially one that is intended to be extremely portable.

The first problem is the price we pay for having a relatively simple, highly portable compiler. It would be significantly more work to use a conventional back end and retarget it for different platforms, even given modern semi-automatic back end generators. We settle for getting decent performance, rather than striving for maximum performance.

The second problem can be solved reasonably well by having two back ends, one that generates machine code via C, and one that compiles to some interpreted or threaded representation. Threaded code is easy to generate without dynamic linking of machine code, so it's suitable for quick interactive response. This is valuable in a system like ours, where expressions typed by the user at the interaction prompt (in the "read-eval-print" loop) are actually compiled and then executed rather than interpreted. (RScheme appears to the casual user to be interpreted, but it's not---the "eval" of the read-eval-print loop compiles expressions and executes the resulting code, rather than interpreting it. This is common in modern interactive systems. In RScheme, interactively-typed code is compiled to threaded code, and it takes some extra doing to compile to C and link that in.)

We chose closure-threaded code as a representation just because it's very easy to implement the back end. In the future, we intend to replace the closure-threaded back end with one that generates something like bytecodes, which will be interpreted by a more conventional virtual machine interpreter. This representation is intended to be very compact and not terribly slow. In normal use, programmers will compile the performance-critical parts of their applications to machine code via C for maximum speed, and the other parts of their code to this interpreted representation to save space.


STARTING RSCHEME

From the point of view of the operating system, RScheme is just an executable program, linked together by combining object files written in C and C++ and compiled with an appropriate compiler.
Some of these object files (containing things like the garbage collector and some low-level I/O code) are handwritten C or C++. Others are C files generated by using the RScheme compiler to generate C and a C compiler to generate object code.

Scheme procedures contain more than executable code, however, and a running Scheme system must have environments and other data structures in order to function. (From the point of view of the executable machine code, these are just data.) The RScheme binary therefore takes as input a file which contains the data structures the running system will need. This is called a HEAP IMAGE, that is, a "snapshot" of the heap of a running Scheme system that has been saved to a file.

The RScheme binary reads this information from the file and reconstructs the data structures on the heap. Once this is done, it can execute one of those procedures (specified when the image was saved to a file) as the "startup" procedure. It takes the startup closure's environment and template values and puts them in appropriate registers, then extracts the appropriate code pointer from the template, and jumps to that code. From that moment on, it is executing Scheme.

For a normal image used for interactive programming, the startup procedure will do some initializations, and then call the read-eval-print loop procedure. You can also save an image that executes some other procedure at startup time. For example, if you've finished a whole application and are happy with it, and want RScheme to come up running that application, you can save an image where the startup procedure does the usual initializations, and then calls the "main" procedure of your application.

-------------------------------------------------------------------------

THE RSCHEME EXECUTION MODEL

RScheme is based on an abstract machine defined by a set of registers, data structuring conventions, and PRIMITIVE OPERATIONS that are implemented as small fragments of C code. This is not a virtual machine in the sense of many Lisp and Smalltalk systems, which interpret their primitive instructions at run time. It is an abstract machine much like a high-level assembly language.

In one important sense, this abstract machine is lower level than most virtual machines for dynamically-typed languages; the operations are statically typed using a very simple type system, and the compiler in many cases can glue them together without type-checking code. The compiler inserts type checking primitives as necessary when values are truly dynamically typed, but it can often omit type checking when it can be statically determined that the values returned by one expression are of the type expected by the enclosing expression. (We will discuss type declarations and simple bottom-up type inference later.)

In another important sense, our abstract machine is higher-level than a conventional virtual machine. Our abstract machine primitives are expressions that can be nested, rather than just statements that are glued together.

The RScheme REGISTERS are implemented as C global variables. Depending on what C compiler is used, some of these variables may be mapped to hardware registers in the underlying machine (GNU C supports this).

For most of the following, we will assume that we are executing RScheme code that has been compiled to machine code via C. The threaded code observes the same conventions, because threaded code is really just a bunch of closures of simple RScheme procedures.
When and if we switch to something like a bytecode interpreter, the calling convention will not change---a bytecoded procedure will be represented as a closure of a normal RScheme procedure, which simply interprets the bytecodes. The calling convention is optimized for machine code, because we believe that most programs will spend most of their time in machine code rather than interpreting bytecodes. (If not, they'll be slow anyway, so it won't matter much.)

The basic register set of RScheme is pretty much the same as the one for the simple compiler described in Paul Wilson's notes on Scheme compilers (available via ftp from cs.utexas.edu in pub/garbage/schnotes). The main difference is that rather than using an eval stack to hold intermediate values computed by nested expressions, extra registers are used to hold those values. This additional set of registers is also used for argument passing and for register-allocating local variables that the compiler determines do not need to be heap-allocated.

--- The basic registers

The basic registers are the ENVIRONMENT register, the CONTINUATION register, and REG0, which serves the function of the VALUE register in the notes on basic compiler design. (REG0 is actually part of the auxiliary register file, to be discussed later, but its role is essentially the equivalent of a VALUE register.) Conceptually, the program counter is another basic register, but we use the underlying machine's program counter for this; our strategy for compiling to C makes this simple.

The environment, continuation, and value registers all hold tagged Scheme values. (They never hold raw integers or raw pointers.)

The ENVIRONMENT register holds a pointer to the current heap-allocated lexical environment. (If the compiler does its job well, many environments won't be allocated on the heap, but this register will point to the chain of environments representing variable bindings that couldn't be register allocated.)

The CONTINUATION register holds a pointer to the chain of partial continuations (of suspended callers) of the currently executing procedure.

REG0 (the value register) is primarily used for returning values from procedure calls, but it also has a special role in procedure calling.

There are a few other registers, notably the ARG_COUNT register. At a procedure call, the arg_count register is set by a caller to signify the number of arguments being passed. On entry to the procedure, the callee checks to see if it received the right number of arguments. (If the procedure can take extra arguments, it pushes those into an argument list.) The arg_count register always holds a raw integer, so there is no need for the garbage collector to know about it.

--- The auxiliary register file

The auxiliary register file is the set of other registers, which can be used for argument passing, holding temporary values of intermediate expressions, and for register-allocating local variables that do not need to be heap-allocated.

The auxiliary register file is conceptually an unbounded array of registers, and registers are used starting at one end and moving toward the other. At any given point in the execution of a program, the registers in use are a contiguous range starting at register 0. Typically, only the first few registers are used, and the ones nearest 0 are most heavily used. This allows us to allocate the first few registers in hardware registers (if the underlying C compiler supports that) and get a large benefit from a few registers.
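To fix ideas, here is a minimal sketch of how this register set might be declared as C globals. The type and variable names are invented for illustration and are not RScheme's actual declarations; the TEMPLATE register shown here is discussed along with the calling convention later on.

    typedef unsigned long obj;      /* one tagged 32-bit Scheme value */

    obj      envt_reg;              /* ENVIRONMENT: current heap-allocated environment */
    obj      cont_reg;              /* CONTINUATION: chain of partial continuations */
    obj      literals_reg;          /* TEMPLATE: the running procedure's template */
    unsigned arg_count_reg;         /* ARG_COUNT: a raw integer, invisible to the GC */

    #define NUM_REGS 256            /* conceptually unbounded; a big array in practice */
    obj      reg[NUM_REGS];         /* auxiliary register file; reg[0] is REG0, the
                                       value register.  In practice the first few might
                                       be individual variables so that a compiler like
                                       GNU C can keep them in hardware registers. */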
The fact that we use a contiguous, zero-based range of registers makes the interface between compiled code and the garbage collector simple. When garbage collection occurs, it is only necessary for the GC to know how many registers are in use, and it can trivially determine which ones those are. This lets it find the root pointers for garbage collection easily. (This constraint is also exploited for thread switching---it is easy to find the registers that are in use so that they can be saved.)

Auxiliary Register usage

Register usage in RScheme is currently fairly simple, and we want to keep it that way if possible. We only intend to make it more complex if there are major performance benefits, and we don't intend for it ever to be very complex.

All auxiliary registers hold tagged Scheme values. This simplifies garbage-collection and thread-switching code, and we think that it can provide reasonably good performance given a careful selection of basic object representations.

Argument Passing

Arguments are passed in registers, and any reasonable number (up to several hundred) of arguments can be passed there. At the moment of a procedure call, the only values in the auxiliary register file are the arguments to the call. Any other register-allocated values will be saved, as necessary, in a partial continuation before the call. (This is very much analogous to the calling convention described in Paul Wilson's notes on Scheme compilation, with the main difference being that instead of saving values from an eval stack into a partial continuation, they are saved from registers.)

For a tail call, no values need to be saved. For a non-tail call, a continuation is saved before computing the argument values for the call. (Actually, at this writing (11/94) the register allocator is even more conservative---it heap-allocates any variable in whose scope any out-of-line procedure call occurs. This should be changed so that only bindings in whose scope a NON-TAIL call occurs are heap-allocated.)

In the general case, the arguments to a procedure must be copied from the argument passing registers into a heap-allocated environment when the procedure is entered, so that closures can capture that environment and keep it live indefinitely. For procedures that create no closures, this is not necessary, and the argument variables can be left in the registers where they were passed. We expect the latter to be a very common case, as roughly half of all calls are to small procedures at the leaves of the call graph.

After entering the procedure, then, the arguments will be in the first few registers of the auxiliary register file, or they will have been copied to the heap and no auxiliary registers will be in use. Whichever is true, the remaining registers can be allocated consecutively to hold temporary values or local variable bindings.

The auxiliary register file behaves in much the same way as the eval stack in the simple example compiler (in Paul Wilson's compiler notes). As mentioned above, the auxiliary file contains ONLY values for the currently-executing procedure, and any values needed by the caller will already have been saved in the continuation chain. Despite this stack-like pattern of usage (within a procedure), the compiler can statically resolve which register is used for which temporary or local variable, and there is no need to use a stack pointer at run time. If the registers are mapped to hardware registers by the C compiler, accesses to them are extremely fast.
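Below is a minimal sketch, using the names from the earlier register sketch, of the kind of entry code the compiler might emit for a procedure of two arguments. The helper names are made up, and the return convention of the generated C functions is ignored here (it is described under STRATEGIES FOR COMPILING TO C below).

    typedef unsigned long obj;

    extern obj      envt_reg, reg[];
    extern unsigned arg_count_reg;
    extern void wrong_number_of_args( void );               /* hypothetical */
    extern obj  make_envt_frame( obj scope, int n, ... );   /* hypothetical */

    void example_procedure_entry( void )
    {
        if (arg_count_reg != 2)
            wrong_number_of_args();          /* arity check on entry */

        /* If no closure can capture the arguments, they simply stay in
           reg[0] and reg[1], and further temporaries use reg[2], reg[3],
           and so on, consecutively. */

        /* Otherwise, copy them into a fresh heap-allocated environment
           frame whose scope link is the current environment: */
        envt_reg = make_envt_frame( envt_reg, 2, reg[0], reg[1] );

        /* ... body of the procedure ... */
    }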
Local Variables

In the general case, local variable bindings (including argument bindings) are allocated on the heap, in conventional environment frames. An environment frame contains a scope link, pointing to the lexically-enclosing environment frame.

In favorable cases, the compiler can determine that variable bindings don't really need to be allocated on the heap; it can allocate them in registers, much like temporary values. We currently use a simple rule for this. For each binding contour, we construct a compile-time binding environment frame, and if we allocate any of those variables on the heap, we allocate all of them on the heap.

------------------------------------------------------------------------

SUPPORTING GC, THREADS, and INTERRUPTS

To support garbage collection, it must be possible for the collector to find the ROOT SET of pointers from which it traces to find all reachable data. (If this is not clear, see Paul Wilson's survey on garbage collection, available via anonymous ftp from cs.utexas.edu as pub/garbage/gcsurvey.ps.) If variables were allocated in a stack, we would need to be able to find all of the pointer variables in the stack so that we could trace outward from them.

Our use of registers and a heap-allocated continuation chain means that we only need to start at the registers, and trace outward from there. The tracer will be able to find the first partial continuation by starting at the continuation register, and trace the continuation chain like any normal data structure. From there it will find all of the saved temporary values, and pointers to saved binding environments. Then it can continue tracing, to find all of the values of variables and follow those outward until all live data are traced.

(For the moment, when we talk about finding the registers that are in use, we are talking about the abstract machine registers. The C compiler may also compile our code so that it uses other registers, and we assume for the moment that we can ignore those. Later we will explain how certain restrictions on our code generator do in fact allow us to ignore the registers allocated by the C compiler.)

The collector must also be able to scan from the other registers, including the auxiliary registers holding temporaries. At any point where GC occurs, it must be able to determine how many registers are in use, and decode their representations. This is why we ensure that the values in these registers are always in a tagged format, so that the collector can tell whether they are immediate values or pointers that must be traced.

There are two basic strategies for supporting GC, which we call "GC anytime" and "safe points". We use safe points, but we will describe the GC-anytime strategy and explain why we do NOT use it.

GC-anytime

The GC-anytime strategy ensures that at any arbitrary machine instruction, the running program can be stopped and the GC can recognize all registers that are in use. This requires that the compiler keep enough information around that at any instruction boundary, it is known which registers have been allocated; this complicates the interface between the GC and compiled code, especially when compiling to C. (More on that later.)

An alternative is to use so-called "conservative" garbage collection (more properly called "conservative pointer-finding"); in such a system, the GC looks at all registers that MIGHT be in use, and treats anything that MIGHT be a pointer as a pointer.
This has two drawbacks:

  * It may retain excess storage, because the collector might mistake an integer or other nonpointer value for a pointer, and retain data that are in fact garbage. In practice this is not usually a big problem, but in some cases very large amounts of data may be mistakenly retained and memory will be prematurely exhausted.

  * It's not safe in the presence of copying garbage collection or pointer swizzling (a technique used for transparently manipulating addresses when faulting data in from a large persistent store). These techniques may change addresses used as pointers when the objects they point to are moved. Here conservative pointer-finding is likely to fail, because an integer or other nonpointer value may be updated to reflect that its (apparent) referent has moved. Naturally, this makes no sense if the value is not a pointer, and programs will fail disastrously.

Safe Points

We have chosen the "safe points" strategy, which is simple and reliable. Rather than ensuring that it's safe to GC at any point in a program's execution, we ensure that GC can only happen at certain safe points. For this to work, there must be a bound on the amount of allocation that can occur between safe points, so that a safe point can always be reached before memory is completely exhausted. We keep some memory in reserve, so that if memory is "exhausted" (excluding this reserve memory), we can set a flag and continue to operate until the next safe point, cutting into this reserved space. At each safe point, we check the flag to see if memory is exhausted; if so, we garbage collect, restore our reserve space, and reset the flag before continuing execution.

To a first approximation, our strategy for ensuring that safe points occur frequently enough is to make every procedure call and backward branch a safe point. A program cannot run indefinitely without calling another procedure or branching backward---the longest path of execution between safe points is the longest forward path through a single procedure. If a program sits in a loop, for example, at the end of the loop it will reach a safe point before branching backward to begin the next iteration. (If a loop is coded as tail recursion, the tail calls that iterate the loop will also be safe points because they are procedure calls.)

(In the current runtime system (as of 11/94), we actually have a safe point at every monotone boundary, which has certain advantages and disadvantages. This may change in the future.)

Supporting Threads

Supporting preemptive threads requires two things:

  * making sure that every thread yields control frequently enough that it can't "lock out" other threads for too long, and

  * making sure that the GC still functions in the presence of suspended threads.

In the RScheme system, support for GC safe points and support for preemption are combined into a single mechanism. Safe points are also potential thread preemption points. In effect, the compiler turns preemptive multitasking into cooperative multitasking, by emitting extra instructions at safe points so that a thread will voluntarily yield control if its time slice has expired.

Actually suspending a thread is easy, since we've already implemented support for first-class continuations. The code for thread switching begins by saving a continuation, which encapsulates the state of the thread. This can be saved in a thread-control block used by the scheduler.
After the thread has been suspended, another thread is chosen, and it is resumed by calling the escape procedure that represents its saved state.

In order for it to be possible to garbage collect in the presence of threads, it must be possible for the collector to decode the thread control blocks of all suspended threads, and trace outward from their saved state. We therefore ensure that threads are only suspended at safe points, so that if GC occurs, all threads are in a safely traceable state. We need only trace the scheduler's list of suspended threads (itself a normal, traceable Scheme data structure), and it follows that we will reach all live data.

Interrupts

Interrupts are much like thread switches, in that the currently executing thread must be suspended and control dispatched elsewhere---by calling the interrupt handler. As with thread switches, we must ensure that this only happens at safe points.

For actual hardware interrupts, this isn't really possible, because the hardware dispatches to the interrupt handler without the Scheme system having a say in the matter. To get around this problem, we write low-level interrupt handlers in C that are dispatched in this way, and use that to implement a higher-level notion of Scheme interrupts.

Generally, when a hardware interrupt occurs, a C routine is invoked to deal with the most urgent issue---noting what's going on. This C routine then sets the flag for the corresponding Scheme-level interrupt, as well as the combined flag that is checked at each safe point. At the next safe point, the fact that this flag is set is noticed, and a C routine is invoked to see why. It will find out that the interrupt-pending flag is set, and dispatch to the appropriate Scheme interrupt handler.

(For example, at a timer interrupt, all that happens is the C interrupt handler records the fact that a timer expired, and sets flags so that a Scheme interrupt will occur. The Scheme-level interrupt handler can then do all of the interesting processing.)

The cost of safe points

In principle, safe points probably slow a system down significantly. For a very fast system that compiles to machine code using a custom back-end, we would expect safe points to cost 5-10% in execution time (at a guess), due to frequent checking overhead: at each procedure call or backward branch, a flag must be checked and a conditional branch (not taken, in the usual case) executed.

(This overhead can be kept fairly low, even for small inner loops, by unrolling the loop a couple of times to eliminate most of the backward branches. For a high-performance system, such loop unrolling is desirable anyway, to enhance conventional optimizations.)

For a system like RScheme, which is not likely to ever be that fast, the overhead is much smaller. If RScheme is a factor of two or so slower than an all-out high-performance system, the overhead will only be a few percent. We consider this well worth it, because it preserves the ability to use copying collection and to layer RScheme atop persistent object stores.

------------------------------------------------------------------------

STRATEGIES FOR COMPILING TO C

Compiling Scheme to C poses several special problems, because C compilers do not support garbage collection, and because C's stack discipline does not correspond to Scheme's continuations. When compiling a conventional language (such as Pascal or C++) to C, it is fairly straightforward to map the source language's procedures onto C procedures, and the source language's variables onto C variables.
In the general case, Scheme variables cannot be mapped to C variables, because they must have indefinite extent to support closures. We therefore must implement our own binding environments in terms of low-level C datatypes.

It would also be awkward to use C's stack, for two reasons:

  * continuation saving would require copying C's stack, which may be expensive, and

  * C does not support stack-scanning to find local variables.

These problems might be surmountable if we do not expect continuations to be heavily used (so that it would be acceptable for them to be slow) and if we used conservative garbage collection. Instead, we have chosen to implement our own register set and a conventional continuation chain. In the general case, we use C pretty much as a portable assembler. (In certain important special cases, however, we will take advantage of features of C to let us generate good code.) We implement our own procedure-calling convention in terms of our registers and continuation chain; this convention is very similar to those of high-performance compilers for Scheme and ML.

Representing Code Addresses

At a procedure call, we will take a Scheme closure pointer, extract the parts of the closure, and put them in our basic registers. The environment pointer is just a pointer to an environment frame, which we put in the environment register so that the procedure will see the right variables when we start executing its code. Similarly, the template pointer is a pointer to a vector of information saved by the compiler for the procedure's use; we extract that from the closure and put it into the template register. We then extract the pointer to the actual code from the template, and start executing that.

In RScheme, the usual representation of the code pointer is a pointer to a C procedure. To "branch" to that code and start executing the procedure, we simply execute a C procedure call. The C code generated for a Scheme procedure call is thus a few dereferences to accomplish the "setup" of the Scheme environment, and a call to the C procedure that implements the code.

It's not really quite that simple, though, because a Scheme procedure can't generally be represented as a single C procedure. For example, if the called procedure itself does a call, it must have a code pointer to use as a return address, i.e., the address of the point in the procedure to return to. This must be saved in a continuation before a (non-tail) procedure call. There is no way in (portable) C to take the address of that return point within the corresponding C procedure.

We therefore break Scheme procedures up into multiple parts, each represented by a C procedure. Each return point in the Scheme procedure is represented as a C function pointer, and "returning" into a Scheme procedure is implemented as a call to the procedure that represents the next part of the Scheme procedure.

We refer to the resulting sections of a Scheme procedure as "monotones", because they represent the forward-only execution of code between safe points. They only execute in the forward direction. (Given the description so far, this is not strictly necessary, because backward branches don't require segmenting the Scheme procedure in this way. As we will explain later, it is convenient to do so.)

This has two drawbacks:

  * What should be a simple branch to executable code is implemented as a C function call, which is more expensive both in instructions executed and in code size.
    (A simple C procedure call that passes no arguments is only a few instructions, but that's more than one instruction, and we do this frequently.)

  * C procedures do TOO MUCH. In particular, calling a C procedure saves a return address on the C stack, because C doesn't let you specify that these are tail calls and don't need to save their callers' state.

This latter point is particularly annoying, because if we implement our Scheme naively, we will keep performing C calls without performing C returns, and the return addresses will pile up until the stack segment becomes huge and we exhaust memory (or swap space) and crash. We therefore need an extra little trick to pop a return address off of the stack at each call. (This basic trick was invented by Guy Steele at MIT back in the 70's, and recently refined and used in a C back end for Standard ML of New Jersey by Tarditi et al. at CMU.)

The basic trick to accomplish this is to have a single C procedure which acts sort of like an interpretive loop, and calls one monotone at a time. The monotone executes AND RETURNS, and it is this single looping procedure that actually calls the next monotone. In this way, we ensure that the C stack just oscillates up and down between two levels.

When a C procedure representing a Scheme monotone ends with a (Scheme) procedure call, it performs the setup part of the procedure call sequence (extracting the environment and template pointers and putting them in the environment and template registers). It then extracts the code pointer from the template and returns that value to its caller, the quasi-interpretive loop. The loop procedure accepts this return value, and does the (C) call to the pointed-to C procedure. (A concrete sketch of this arrangement appears below.)

Avoiding the Overhead of the Quasi-Interpretive Loop

There are ways of avoiding the quasi-interpretive loop, with different degrees of unportability.

One technique (used by the Glasgow Haskell compiler) is to use the C compiler to generate assembly code, and then postprocess the assembly output to strip out the unneeded (C) call and return instruction sequences, and replace them with labels. The result can then be assembled, using the labels to denote the addresses of the actual code sequences for the monotones. This works very well and avoids both runtime overhead and code bloat. (For SPARCs, it also has the advantage that you can strip out the pesky register-window stuff and just use one register window as though it were a normal set of general-purpose registers.) Unfortunately, this requires customizing the postprocessor for each platform, to recognize the calling and return sequences.

Another solution, which we have an experimental implementation of for RScheme, is to take advantage of the GNU C compiler's support for taking the addresses of labels. GNU C is an extended version of C, where the address of a label can be taken as a kind of pointer value, and jumped to directly without performing a normal calling sequence. We simply insert labels and expressions that take the addresses of labels, and use those to implement the saving of return addresses in continuations, etc. This avoids the stacking of return addresses, and the need for the quasi-interpretive loop. Unfortunately, it does not eliminate all of the code bloat, because the call and return sequences are still there, even if they're bypassed at run time.
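To make the monotone-and-trampoline arrangement concrete, here is a minimal sketch of the closure and template representations, a monotone, and the quasi-interpretive loop. The register names follow the earlier sketches; the structure layouts, helper details, casts, and the omitted tag handling are all illustrative assumptions, not RScheme's actual code.

    typedef unsigned long obj;
    typedef void *codeptr;                   /* "address of a monotone" */
    typedef codeptr (*monotone)( void );     /* a monotone runs forward, then
                                                returns the next codeptr */

    extern obj      envt_reg, literals_reg, reg[];
    extern unsigned arg_count_reg;

    struct template { monotone code; /* ... literals follow ... */ };
    struct closure  { obj envt; struct template *tmpl; };

    /* A monotone that ends in a (Scheme) tail call: set up the callee's
       registers, then RETURN its code pointer rather than calling it,
       so that the C stack does not grow. */
    static codeptr example_monotone( void )
    {
        /* ... the forward-only work of this monotone ... */

        struct closure *fn = (struct closure *)reg[1];   /* callee closure,
                                                            untagging omitted */
        reg[0]        = reg[2];              /* pass one argument in reg[0] */
        arg_count_reg = 1;
        envt_reg      = fn->envt;            /* callee's environment */
        literals_reg  = (obj)fn->tmpl;       /* callee's template */
        return (codeptr)fn->tmpl->code;      /* return it; the loop will call it */
    }

    /* The quasi-interpretive loop: the C stack just oscillates between
       this frame and the monotone currently being run. */
    void quasi_interp_loop( codeptr start )
    {
        codeptr next = start;
        while (next != NULL)
            next = ((monotone)next)();
    }

With the label-address technique just described, the explicit loop and the C call and return sequences are bypassed at run time, but the monotone structure itself, and hence much of the generated code, remains.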
(We expect to be able to get rid of most of this code bloat by compiling many monotones as a single procedure, but there's still some bloat because we need some executable code that extracts the labels from those large procedures. We may be able to eliminate most of the remaining bloat by being careful how we link C code, but we're not yet sure.) Also, it doesn't seem to work on some platforms (notably AIX), and we're not sure yet exactly why not.

(Note: the above discussion is just a little bit out of date. See the RScheme README file for a slightly different discussion. As it turns out, the GNU C compiler's support for label addresses is weaker than we thought. The README file proposes a workable way around the problem, and also alternatives that are possible if the GNU project enhances their compiler in certain likely ways.)

--- Abstract Machine Instructions and C expressions

The abstract machine that we compile for is a language of nested expressions. These expressions include things like fetching the value of a variable, performing an integer multiply, performing a floating-point multiply, pushing the current state onto the continuation chain as a partial continuation, and so on.

We believe that it is very advantageous to have an abstract instruction set of TYPED, nested expressions. The compiler can recognize that the return value type of an expression is of the preferred type for the expression that it's nested in, and omit type checking. In many cases, nested expressions that do not involve procedure calls can be compiled directly to nested expressions in C, with no runtime overhead for decoding or dispatching.

The type system used by the back end of the compiler is concerned with low-level concrete types, such as raw integers, tagged integers, raw floats, tagged floats, pointers to objects that consist of tagged slots, pointers to objects that consist of untagged bits, etc. It is not the same as the type system that is seen by the casual programmer, and it is supposed to be fairly independent of what language is actually being implemented. (E.g., you should be able to put a fairly efficient implementation of Pascal on top of our compiler backend, with only minor changes.) It is not complicated---just complicated enough that simple operations can be mapped onto simple C expressions that map onto simple (sequences of) machine instructions. (Actually, it's also independent of the fact that the primitives are implemented in C---you should be able to plug in a more conventional low-level code generator.)

In attempting to compile to nested C expressions, the compiler uses a simple rule. Any value that clearly only exists during the execution of a monotone can be represented in the most efficient format. Values that must be saved across monotones must be represented as full-fledged, tagged Scheme values.

Naturally, this simple, bottom-up analysis and representation selection is very sensitive to whether types are declared and whether procedures are inlined. Expressions can only be compiled to nested C expressions within a single monotone, so it pays to inline small procedures and to declare the types of variables referenced within those expressions. For example, relatively complex floating-point expressions in the body of a loop may be compiled into code very similar to floating-point code written directly in C, but calls to small, non-inlined procedures will incur both procedure calling and type-checking overhead.
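As an illustration of the flavor of C this is meant to produce, here is what generated code for a small declared-float expression might look like. The macro names RAW_FLOAT and MAKE_FLOAT are hypothetical stand-ins for "unbox a tagged float" and "box a raw IEEE double"; the actual primitives and the code the compiler emits differ.

    /* Scheme source, with x, y, and z declared to be floats:
           (* (+ x y) z)                                                 */

    double raw_x = RAW_FLOAT( reg[0] );   /* unbox each variable once    */
    double raw_y = RAW_FLOAT( reg[1] );
    double raw_z = RAW_FLOAT( reg[2] );

    /* the nested Scheme expression becomes a nested C expression; only
       the final value, which survives the monotone, is boxed again:    */
    reg[0] = MAKE_FLOAT( (raw_x + raw_y) * raw_z );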
Literals

Our compiler is smart enough to recognize that many simple literals can be represented as C literals that will be encoded in the instruction stream, rather than fetched from the procedure template at run time. These include immediate values such as small integers, booleans, the empty list, and ASCII characters.

The binary representations of these values are encoded so that the upper bits are generally zeroes, increasing the chances that a C compiler will be able to represent them as an operand to a single instruction, which requires no extra instructions or memory fetches. (The C #defines which yield them are in terms of small integers, rather than full 32-bit values, and it is expected that the C compiler will sign-extend these small values rather than representing them as full 32-bit constants.)

Type Inference

Our compiler performs very simple bottom-up type inference, and the back end uses a simple static type system for concrete types. Each expression has a type; for literals, the type is obvious to the compiler, and for local variables, type declarations can make it trivial. (Currently, if you don't declare the type of a variable, our compiler won't infer it; once variables' types are known, however, it can often infer the types of expressions operating on those variables. It therefore pays to declare the types of procedure arguments and let variables.)

Compound abstract machine expressions also may have several variants for different types of arguments, so for example there is an integer + primitive and a floating-point + primitive. If the argument types match one of these definitions, the compiler chooses the appropriate opcode for that variant, and the return type is known. A simple framework for coercions of primitive types is also defined. If the compiler can see that the arguments to + are an integer and a float, for example, it can compile that as a float + operation with a coercion of the integer argument to a float.

For integers and floats, the compiler is biased towards efficient representations---in its bottom-up processing of the expressions, it attempts to keep the representations as close to the native machine as possible. Floating point values are preferentially kept in a raw IEEE double representation, not a boxed value on the heap referred to via a pointer. Integer values are preferentially kept in a tagged format, because the low-tagging representation does not cost much for common integer operations. To do this analysis, the compiler makes a distinction between simple nested expressions within a monotone, and values that can "escape" from a monotone due to a procedure call or backward branch.

------------------------------------------------------------------------

OBJECT REPRESENTATIONS

RScheme data values are represented as two-bit, low-tagged 32-bit quantities. This means that RScheme values are generally 32 bits, where the low two bits are a primary tag that says what basic kind of value it is, and the upper 30 bits hold either the actual value, or a pointer to a heap-allocated object that is the actual value.

Every heap-allocated object, whether of a built-in type or a user-defined type, has a header that says exactly what kind of object it is. This header includes a CLASS POINTER, which is a pointer to a class object that represents that kind of object.
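A sketch of what this representation might look like in C is given below. Only the fact that the tag occupies the low two bits, and (as described below) that small integers use tag 00, comes from the text; the other tag assignments, the macro names, and the header layout are illustrative assumptions.

    typedef unsigned long obj;               /* a 32-bit tagged value */

    #define PRIMARY_TAG(x)    ((x) & 3)      /* the low two bits */
    #define TAG_FIXNUM        0              /* small integers: tag 00 (see below) */
    #define TAG_POINTER       1              /* heap-allocated objects (assumed) */
    #define TAG_IMMEDIATE     2              /* other immediate values (assumed) */

    #define RAW_TO_FIXNUM(n)  ((obj)((n) << 2))
    #define FIXNUM_TO_RAW(x)  ((long)(x) >> 2)

    /* Because the fixnum tag is 00, addition needs no untagging:
       the 00 tags of the operands add to a 00 tag in the result. */
    #define FIXNUM_ADD(a,b)   ((obj)((a) + (b)))

    /* Every heap-allocated object begins with a header containing a
       class pointer that says exactly what kind of object it is. */
    struct heap_object {
        obj class_ptr;                       /* pointer to the class object */
        /* ... the object's own slots or raw bits follow ... */
    };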
Plain, standard Scheme objects such as vectors and strings have class pointers---unlike some systems, there is NO DISTINCTION between objects that are plain Scheme values and objects that are instances of classes. In fact, EVERY value in RScheme has a class, and there's a primitive in the abstract instruction set that will give you back a pointer to the class of an object, even if the object is a built-in immediate value. For immediate values, this primitive looks in a special table to find the class pointer, preserving the illusion that every object is allocated on the heap and has a header with a class pointer. Notice that this small feature is important, in that it will allow us to implement a good object system with a metaobject protocol efficiently with little or no compiler cooperation.

---

IMMEDIATE VALUES are the values that are represented directly within the upper 30 bits of a word. Two different primary tags are used for immediate values---one just for short integers, and another for all other kinds of immediate values.

Integers have a primary tag of 00, which is convenient because it turns out not to get in the way of most arithmetic operations. (You don't actually have to strip out the tag to do integer addition, for example, because the 00 tags of the operands will add together and form a 00 tag in the low bits of the result.)

Other (non-integer) immediate values have a secondary tag saying what kind of noninteger immediate value they are---e.g., boolean, empty list, ASCII character. This secondary tag is in the next-to-lowest range of a few bits, just above the primary tag. The remaining bits are available to store the actual value, e.g., the ASCII code representing a character.

------------------------------------------------------------------------

BASIC STRUCTURE OF THE COMPILER

The RScheme compiler is structured as two passes. This does not include reading (which we view as prior to actual compilation) or generating machine code from C code (which we view as an assembly-like postpass); it refers only to the operation of the compiler proper. Each pass does some work on the way down the expression graph, and different work on the way up, so there are effectively four distinct phases. (The phases are conceptually distinct, but are interleaved at run time because the compiler may return back up one part of the expression graph before calling itself recursively to go down the next part.)

The first phase of the compiler strongly resembles the operation of the example compiler in Paul Wilson's notes on Scheme compilation, and generates intermediate code called icode as output. Icode is structured essentially as nested expressions, decorated with compile-time environments that make scopes explicit, and the results of some simple analyses.

Icode is essentially a more horizontal (explicit) representation of the source program, and does not make much commitment to low-level implementation. For example, local variables are represented as fields of compile-time environment frames that represent the binding contours of the source program, without committing to whether the runtime bindings of those variables will be allocated in registers or in heap-allocated binding environment frames. (Similarly, the representation of a reference to a variable within an expression does not commit to whether the reference will be implemented as a reference to a register or as an indirection through the environment register and environment frame chain.)
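For example, the two implementations that the later pass can choose between might look something like this in generated C, using the obj type and registers from the earlier sketches; the environment-frame layout, the register numbers, and the omitted tag handling are invented for illustration.

    struct envt_frame {                      /* object header omitted for brevity */
        obj scope_link;                      /* lexically enclosing frame */
        obj slot[1];                         /* the bindings (really variable-length) */
    };

    /* (a) the binding was register-allocated: the reference is simply a
           use of the auxiliary register the binding lives in ...        */
    reg[3] = reg[1];

    /* (b) ... or the binding was heap-allocated: the reference indirects
           through the environment register, following one scope link for
           each intervening binding contour.                              */
    reg[3] = ((struct envt_frame *)
                  ((struct envt_frame *)envt_reg)->scope_link)->slot[0];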
The second phase of the compiler (called code-gen in the implementation) traverses the icode graph, making decisions about register- and heap-allocation of variables, etc. on the way down, and generating abstract machine expression code on the way up.

------------------------------------------------------------------------

A NEW INTERPRETED BACK END?

While our current threaded-code back-end is simple, and handy for interactive use, it would probably be better to provide a more compact interpreted representation; then programmers could compile performance-critical code to C, and the rest of their code would be compact.

Our system is structured so that the primitives are organized into a simple framework, i.e., encoded in a Scheme data structure that indexes the operations by their names and types, and holds the C fragments that implement the primitives as Scheme strings. This is a convenient framework for semi-automatically generating the C code for an interpreter.

Several alternatives exist for interpreted representations. Many Scheme systems (and most Smalltalk systems) compile to "bytecodes", which is a virtual machine instruction format where every opcode is a byte, and some opcodes take additional bytes as arguments. Another alternative is threaded code, where each "opcode" is an address of a piece of machine code to be executed. "Direct" threaded code uses actual machine addresses for this, so that the code for a complex operation consists of a sequence of addresses of the sub-operations that implement it. The suboperations may be sequences of machine code instructions, or other "threaded" code sequences.

A variant of threaded code which we find very appealing is "indirect" threaded code. In indirect-threaded code, a "pointer" to the code for an operation is really an index into a table of pointers to the actual code sequences. This is slightly slower than direct-threaded code, but can be much more compact. If there are no more than 64 thousand pieces of code, then these code "pointers" can be 16-bit offsets rather than 32-bit pointers.

It might seem that these 16-bit pointers would be less compact than bytecodes, but that's not necessarily so. As Chuck Duff (implementor of the Actor object-oriented programming system, not to be confused with Carl Hewitt's Actor languages) has pointed out, the ability to have more than 256 instructions makes it possible to encode more in each instruction, so that most instructions do not need to take operands. The elimination of operands can save in space and especially in execution time. This is especially true if most of the operands are generated automatically from definitions of abstract operations. We could write a simple program to generate individual opcodes and C code for all 64 combinations of lexical addresses with frame numbers from 0 to 7 and slot numbers from 1 to 8. (This isn't really the best example... but that's the general idea.)

On the other hand, fast decoding may not matter as much in a system like ours, where we normally expect the most time-critical code to be compiled to machine code (via C). In our initial implementation of an interpreter, we'll probably go with something like a conventional bytecoded instruction set, where many of the bytecodes (1-byte opcodes) take operands that are one or two bytes. We can then CISCify this instruction set by constructing new opcodes with fewer or smaller arguments, which encode common cases of the more general opcodes.
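A hypothetical sketch of what such an interpreter might look like is given below. The opcode names, their encodings, and the envt_ref() helper are all invented; the second opcode shows the kind of "CISCified" special case just mentioned, encoding a common combination of operands into an operand-free instruction.

    #include <stdint.h>

    typedef unsigned long obj;

    extern obj reg[];
    extern obj envt_ref( unsigned frame, unsigned slot );    /* hypothetical */

    enum {
        OP_LOCAL_REF = 1,     /* general form: frame and slot as operand bytes */
        OP_LOCAL_REF_0_0,     /* "CISCified" case: frame 0, slot 0, no operands */
        OP_RETURN
    };

    void interpret( const uint8_t *pc )
    {
        for (;;) {
            switch (*pc++) {
            case OP_LOCAL_REF: {              /* 1-byte opcode, two operand bytes */
                unsigned frame = *pc++;
                unsigned slot  = *pc++;
                reg[0] = envt_ref( frame, slot );
                break;
            }
            case OP_LOCAL_REF_0_0:            /* encodes a common special case */
                reg[0] = envt_ref( 0, 0 );
                break;
            case OP_RETURN:
                return;
            }
        }
    }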
As we'll discuss later, we can also construct opcodes that are equivalent to combinations of opcodes.

--- ADDING (AN) INTERPRETIVE VIRTUAL MACHINE(S) TO THE CURRENT SYSTEM

In adding interpreted code to our compiler-oriented system, we want to avoid adding any overhead at procedure call time for native-code procedures. We don't want to have to check to see if something is native-coded or interpreted before calling it. (We're probably assuming that we spend most of our time in native-coded procedures, so we'd rather keep native-code calling fast even if it costs a little bit for interpreted code.)

It's straightforward to add interpreted representations to the basic system, allowing interpreted procedures to call native-code procedures and vice versa. All that's necessary is that we make sure that interpreted procedures look like compiled ones, with a template and a code pointer that points to native code (e.g., a C function pointer in the usual setup). For interpreted procedures, we simply make sure that the code pointer always points to a special routine that invokes the interpreter, and that the pointer to the actual interpreted code is stored in the template. Thus calling an interpreted procedure IS calling a native code procedure, whose action is to interpret some bytecodes (or whatever interpreted representation we're using) stashed in the template as a literal.

Calling out of (and returning into) interpreted code can work the same way. Whenever an interpreted routine saves a continuation, the return address is a similar little routine that will resume interpreting the interpreted representation. To save the interpretive return address (i.e., the point within the interpreted code to resume at), an integer is saved in the continuation as well, as though it were just some normal intermediate value that needed to be saved. The normal calling and returning conventions work fine for this. From the "normal" point of view, every interpreted procedure looks pretty similar, and the differences between them are just in the literal data.

Notice that this works even if different interpreted routines use different interpreters. There's no problem combining routines that use different interpreters---they're just procedures whose code pointers point to different interpretive routines.

--- A FANCIER VIRTUAL MACHINE STRATEGY: Tree Code, Automatic macro-opcode construction, and automatic generation of peephole optimizers to exploit macro opcodes.

(NOTE: the following is medium- to long-term stuff, not expected to be acted on soon. We need a simple interpreted backend that's fairly compact before we worry too much about this stuff. On the other hand, the design of the short-term backend might be inspired by the long-term ideas.)

A more interesting possibility is the generation of opcodes (semi-)automatically based on profiling data, where a macro-opcode would represent a combination of several basic opcodes, based on observed usage patterns over a bunch of runs of a bunch of different programs. For a simple bytecode-like instruction set, this might just be based on observed patterns of arguments to bytecodes, and observed sequences of bytecodes. In essence, a process of profiling, analysis, and synthesis will generate a CISCy opcode set from observed patterns of usage of a RISCy opcode set. If the instruction set is properly designed, this could work very well, not just by reducing interpretive overhead, but also by allowing the resulting macro-opcodes to be optimized.
If the operations are nestable expressions (as in our current system) rather than sequential statements, they can be aggregated into macro-operations by nesting the C fragments and compiling the result. This may give the C compiler enough to chew on that it can do useful optimizations, eliminating common subexpressions and the like. Better yet, the nesting can be done by our basic compiler, which already knows how to do some simple bottom-up analyses based on its type system.

---

If we want to support runtime profiling and/or dynamic compilation, it seems likely that a good representation for this would be some kind of packed-tree representation. Rather than being sequential statements, the opcodes would be nestable expressions much like the abstract machine's primitive instructions, but the tree representing the code for a procedure would be packed into the code object for that procedure. (I think m-code for Halstead's MultiLisp was something like this, only dynamically typed rather than typed like our abstract machine language.) The idea here is that the runtime representation would preserve most of the properties of our abstract machine language, supporting later optimization.

This only really matters, though, if we want to do dynamic compilation to machine code, or do RUNTIME profiling of interpretive execution, and use the results to guide the construction of macro-opcodes that increase speed. (That may be important for situations where we really need a portable virtual machine across platforms, and can't rely on generating machine code for safety or footprint reasons.)

For more conventional situations, where the interpreted representation is to save space more than time, we can do static analysis of generated abstract machine code, and extend the opcode set based on what will save code size, rather than execution time. (Typically it will save both.) This makes the actual runtime interpreted format less critical to the analysis stuff; the analysis would happen on the abstract machine code, not the interpretive virtual machine code.

16-bit instructions (tree nodes) might be right for a runtime-analysis system---pretty easy and fast to decode, pretty compact, and able to represent offsets within relatively large procedures without resorting to fetching another operand. A variable-length instruction format might be better for a static analysis system, if the overall strategy works so well that you seldom execute long instructions, or when you do they do enough work that the decoding overhead isn't a big deal. (Or if you assume that the time-critical code will be compiled to native code, and the stuff that's getting compiled to interpreted code is worth making a little slower to make it smaller.) A conventional bytecode instruction set would work fine, with the most important byte codes often being one byte, and others taking two-byte operands.

Initially, the system would be built with a roughly one-to-one correspondence between the abstract machine primitives and the primitives of the interpretive virtual machine. (This prototype would be slow.) Then, on the basis of profiling and pattern recognition, we'd recognize common patterns of nested expressions, and automatically construct equivalent macro-opcodes by glomming together the equivalent primitives using our compiler and optimizing them. This would be a mostly bottom-up iterative process, first constructing macro-opcodes that glom together an expression and its common argument expressions, then glomming those together, and so on.
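A hypothetical illustration of how such a macro-opcode might be built by nesting C fragments is sketched below, reusing the made-up FIXNUM_ADD and envt_ref names from the earlier sketches. The fragment texts are invented; the point is only that because the primitives are nestable C expressions, their composition can be generated and compiled mechanically.

    /* primitive fragments, one per basic opcode (as they might be stored):
         local-ref(f,s):    envt_ref( (f), (s) )
         fixnum-add(a,b):   FIXNUM_ADD( (a), (b) )                          */

    /* a macro-opcode for the observed pattern
       (fixnum-add (local-ref 0 0) <free argument>), produced by nesting
       the first fragment inside the second; one argument is fixed and the
       other is left free (here it is taken from reg[0]):                   */
    reg[0] = FIXNUM_ADD( envt_ref( 0, 0 ), reg[0] );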
(Notice that it's no problem to construct macro opcodes where one argument is of a commonly-occurring type, but the other is free, like currying a function.)

Linking Macro-Opcodes Dynamically

The above scheme, taken straightforwardly, makes sense for developing a single version of the system---after sufficient profiling, a set of macro-opcodes is settled on and "hardened" into the system. Then people use the system and nobody can add new opcodes without becoming incompatible with everybody else. This is okay, in that it provides a clean way to develop a good compiler and virtual machine, but it's not as good as if we could modularize the instruction set, and allow different compilers to be used and have different macro-opcode sets.

Let's call the opcode set that's built into the standard virtual machine the "base opcode set" and some application- or language-specific opcode set an "auxiliary opcode set". One way of allowing different macro-opcode sets is to provide a simple segmented space of opcodes, and to use a level of indirection for the decoding of opcodes that aren't common across all systems.

Suppose that an auxiliary opcode set is compiled into a vector of code fragments, so that to execute an opcode you only need to index into this object and jump to that address. To allow a procedure to use that opcode set, you just make sure that it has a pointer to that code vector in its template. Calling an auxiliary opcode just means indirecting through the template and code vector pointer, indexing into the code vector, and jumping. These steps must be accomplished by some opcode(s) in the base set, and we want it to be fairly fast.

Suppose we say that any procedure can use at most a few (say, two) auxiliary opcode sets in addition to the base set, and that if it uses them the pointers to the code vectors will be at particular slots in the template. Suppose also that we restrict the number of opcodes in an auxiliary set to some reasonable number like 100. Then we just need a bunch of base set opcodes that indirectly execute opcodes in the auxiliary sets---in this case, we need 200 opcodes, so that we have 100 that will execute the 100 opcodes in the first auxiliary set, and 100 that will execute the 100 opcodes in the second auxiliary set. For most applications, this is overkill---they only need a handful of extra opcodes, and the functions within a particular application may only need to use one auxiliary set.

(QUESTION: what's the space cost of a pointer in each template? If procedures are small, that could negate some of the space benefit of the enlarged instruction set. I suspect that's okay, because most of the space benefit will come from the automatically-enlarged base opcode set, and the dynamic linking of specialized opcode sets will be mostly for speed. If you don't need the speed, don't do it.)

-------------------------------------------------------------------------