CS386L (Advanced Programming Languages) NOTES, CONT'D.
(by Paul Wilson, U. of Texas at Austin  wilson@cs.utexas.edu)

---
Copyright 1994 Paul R. Wilson

Until 1996, you may use these notes freely for noncommercial purposes, as long as this notice is preserved in all copies.  After January 1, 1996, you must obtain permission from the author to copy or redistribute.

-------------------------------------------------------------------
INTERPRETATION AND COMPILATION STRATEGIES

Most interpreters do not interpret program text directly; they actually do some amount of preprocessing of the text, and interpret some higher-level representation.

Crummy old BASICs may interpret text fairly directly---actually reading the text character by character most of the time.  Better BASICs TOKENIZE the text and interpret data structures consisting of sequences of tokens.  This is like interpreting the output of a lexical analyzer.  More advanced systems, such as Logo or LISP interpreters, often process the tokenized input into syntax trees (e.g., LISP S-expressions), and the interpreter interprets those.

Still more advanced systems COMPILE syntax trees into VIRTUAL MACHINE instructions.  A virtual machine is much like an emulator for a particular architecture---but the architecture is never implemented in hardware.  Instead, a software program INTERPRETS the virtual machine instructions, decoding them and dispatching to sequences of actual hardware machine instructions which implement each virtual machine instruction.

Such VIRTUAL MACHINES may be fairly high-level.  Smalltalk systems and many Lisp and Scheme systems (such as Scheme-48) use virtual machines in which stack operations, procedure calls, method dispatch, and garbage collection are primitive instructions.  That is, things like the GC are part of the simulated target machine, rather than being implemented on top of it.

Virtual machines are typically written in assembly language, or in a procedural language like C, which is then compiled to machine code.  Compiling to a virtual machine has advantages of simplicity and portability.  If the virtual machine is written in C, for example, it can be compiled with little change to run on various hardware architectures.  Code that has been compiled to virtual machine instructions can be moved from one implementation to another---on a different hardware platform---with no change.

One problem with virtual machines is that they are typically slower than hardware machines.  Instruction dispatching takes time, and complex operations compile into sequences of these fairly slow virtual machine operations.

One way of speeding up virtual machines is to INCREASE THE GRANULARITY of the work done by the VM.  Fairly complex operations may be implemented directly by the virtual machine, and programs need only execute a single virtual machine instruction to accomplish a fairly large amount of work.  Examples of such "big" primitives are matrix operations in an APL virtual machine, or BitBlt (Bit Block Transfer) operations in a Smalltalk virtual machine.

If there's a good "fit" between the operations provided by the virtual machine and the actual operations required by a particular program, the cost of interpretation may be negligible.  For example, in an APL program, most of the time may be spent in operations on large arrays, which the virtual machine implements very efficiently.
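To make the dispatch overhead mentioned above concrete, here is a minimal sketch (in Scheme, since that's what the rest of these notes use) of a virtual machine's fetch-decode-dispatch loop.  The instruction set (push-const, add, halt) is made up for illustration and doesn't correspond to any particular system.  The point is that every trip around the loop pays for the fetch and the dispatch, whether the instruction does a trivial amount of work or a "big primitive's" worth.

   (define (run-vm code)
     ;; CODE is a vector of instructions; each instruction is a list
     ;; whose head names the operation.  PC indexes the next instruction.
     (let loop ((pc 0) (stack '()))
       (let ((instr (vector-ref code pc)))              ; fetch
         (case (car instr)                              ; decode & dispatch
           ((push-const)                                ; tiny primitive
            (loop (+ pc 1) (cons (cadr instr) stack)))
           ((add)                                       ; tiny primitive
            (loop (+ pc 1)
                  (cons (+ (car stack) (cadr stack)) (cddr stack))))
           ((halt)
            (car stack))
           (else
            (error "Unknown VM instruction:" instr))))))

   ;; (run-vm (vector '(push-const 2) '(push-const 3) '(add) '(halt)))  =>  5

With "big" primitives, each arm of the dispatch would do far more work per iteration, so the per-instruction fetch-and-dispatch cost is amortized away.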
Another example of a good fit is shell programming.  A UNIX shell can be viewed as a virtual machine whose basic datatypes are character streams (files, sockets, etc.) and whose primitives are entire UNIX programs, which are typically written in C.  Interpreting a few lines of shell script may not cost much, compared to executing the primitives---e.g., awk, grep, cc---which may execute millions of instructions and operate on thousands or even millions of bytes of data.

In the limiting case, the big primitives approach consists of putting entire programs into the virtual machine, and calling them using a single virtual machine instruction.  Of course, the virtual machine doesn't get you much in that case.

The problem with the "big primitives" approach is that there are too many possible "big" primitives, so it only works for reasonably narrow ranges of applications.  Beyond that, you spend too much time coding too many "big" primitives, and it's easier to simply write a native-code compiler.  In effect, compiling to native code lets application programmers write their own "primitives"---specialized routines that operate at full hardware speed.

(Of course, you need a good compiler to get maximum speed, but a decent compiler will usually get you nice speedups over a variety of programs.  You may still want to hand-hack some important "big" primitives to tweak them for maximum performance, like Smalltalk BitBlt.  On the other hand, it may be more effective to hack your compiler to perform more optimizations and generate better code, because the improvements will apply to other routines as well.)

-----------------------------------------------------------------
INTERPRETERS, COMPILERS, AND DEBUGGERS

In traditional Lisp systems, the user typically interacts with an interpreter, which has more support for debugging than a compiler.  Once part of a program has been "fully" debugged [ha!], it may be compiled so that it will run faster.  Compiled code is usually harder to debug, and in fact in some systems it may not perform all of the runtime checks that are required by the language; in other systems, these checks may be disabled by the programmer to get maximum speed.

Lisp systems support this mixed approach to debugging and compiling by letting compiled and interpreted routines call each other freely.  (One way of doing this is to have all procedures represented as "compiled" code, except that some code is only compiled in the most trivial sense---as a stub routine that calls the interpreter to interpret the actual expression in question.  This allows all procedure calls to use the same (fast) procedure calling convention, so that compiled parts of the program don't suffer any checking overhead deciding whether a call is to interpreted code or compiled code.)

The problem with this approach is that programs are never really fully debugged, and inevitably bugs crop up in compiled code.  Ideally, you should be able to run programs in a compiled form, and still debug them easily.

Writing a debugger for compiled code is harder than writing one for interpreted code.  Consider an evaluator that interprets S-expressions---the S-expressions usually correspond fairly directly to the source code (modulo macroexpansion, etc.), so displaying the current expression is easy and understandable.  Allowing the user to change values in the debugger is also easy, because the code for the whole routine is there in a fully general form.  With compiled code, on the other hand, a good debugger must map backward from the machine instructions (possibly virtual machine instructions) to the source expressions.
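In the simplest case, the compiler can record, as it emits code, which source expression each instruction came from, and the debugger can consult that record when execution stops.  The following is just a sketch under that assumption---the table layout and the names (record-source-mapping, source-expr-at) are made up for illustration, not taken from any real system.

   (define (make-debug-table) '())

   (define (record-source-mapping table code-offset source-expr)
     ;; Called by the compiler as it emits code: remember that the
     ;; instruction at CODE-OFFSET came from SOURCE-EXPR.
     (cons (cons code-offset source-expr) table))

   (define (source-expr-at table code-offset)
     ;; Called by the debugger: find the source expression to display
     ;; for the instruction where execution stopped.
     (cond ((assv code-offset table) => cdr)
           (else #f)))

   ;; (define tbl (record-source-mapping (make-debug-table) 12 '(car x)))
   ;; (source-expr-at tbl 12)   =>  (car x)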
If the compiler does lots of transformations and optimizations before generating code, this may be difficult, or even impossible in the general case.  (Consider a compiler that does constant propagation and dead code elimination---entire branches of routines may get optimized away because the compiler can tell at compile time which way a branch will go.  If you change a value at debug time, you may take a branch that doesn't *exist* in the compiled code.)

Optimizations make it more difficult for the debugger to support debug-time changes---if a user changes a value in the debugger, that may violate the normal assumptions that the compiler uses to perform optimizations.  In effect, any place in the program where the user is allowed to affect the computation via the debugger becomes an I/O point---the execution becomes dependent on the data the user gives the program at debug time.  Supporting full debugging of compiled code therefore requires inhibiting certain optimizations.  Very advanced systems may still allow most common optimizations, however, by ensuring that the compiler records what optimizations were done so that they can be undone at debug time.

-----------------------------------------------------------------
SCHEME INTERPRETATION AND COMPILATION

The interpreter in eval.scm and the compiler in compiler.scm are intended to clarify how Scheme evaluation and compilation work---in particular, that it's pretty easy because the syntax and semantics of Scheme are simple.

NOTE THAT THE PROGRAMMING STYLE I USE IN EVAL.SCM IS A BIT WEIRD.  For example, I use "let" a lot to give names to intermediate values, where a real Scheme programmer would probably just nest expressions.  E.g., where I say

   (let ((intermediate-value (some-intermediate-computation)))
     (compute-a-value intermediate-value))

it would be more Schemish to say

   (compute-a-value (some-intermediate-computation))

I've used extra named variables so that people who aren't used to Lisp or Scheme can follow what's going on more easily.  I also overuse set! a bit---if I were writing in real Scheme style, most of the set!s would go away because I'd use letrec (instead of let), or something like that.  I'd also use let* instead of some of the nested lets.

I've omitted a bunch of low-level data-structure-manipulating routines, partly just because they're tedious, and partly because it doesn't matter much how they're implemented.  For example, binding environments could be implemented lots of ways, and compiled code could be byte codes, or native code, or whatever you choose---the overall structure of the interpreter or compiler shouldn't change.

-----------------------------------------------------------------
A SKELETAL SCHEME INTERPRETER

The code in eval.scm is an interpreter for an interesting subset of Scheme, written in Scheme, showing how basic recursive expression evaluation works.  It assumes that there are already low-level routines to perform things like binding environment frame construction and variable lookup.

(NOTE: the code in eval.scm is intended to be looked at... it's got comments, etc., and you can see very concretely how things work.  Reading the following explanation is most useful in conjunction with looking at the code.)
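To give the flavor of that recursive evaluation, here is a highly condensed sketch of the kind of dispatch such an evaluator performs.  The names (micro-eval, micro-apply, lookup-variable-value) are illustrative, not the actual identifiers in eval.scm, and environments are flattened to simple association lists here.  Note also that eval.scm handles special forms more uniformly, by looking them up in the binding environment (as described below); this sketch hard-codes quote and if just to show the shape of the recursion.

   ;; Environments are sketched here as association lists of
   ;; (symbol . value) pairs.

   (define (lookup-variable-value sym env)
     (cond ((assq sym env) => cdr)
           (else (error "Unbound variable:" sym))))

   (define (micro-apply proc args)
     ;; For this sketch, procedures are just host-Scheme procedures.
     (apply proc args))

   (define (micro-eval expr env)
     (cond ((or (number? expr) (string? expr) (boolean? expr))
            expr)                                    ; self-evaluating
           ((symbol? expr)
            (lookup-variable-value expr env))        ; variable reference
           ((pair? expr)
            (case (car expr)                         ; special form?
              ((quote) (cadr expr))
              ((if) (if (micro-eval (cadr expr) env)
                        (micro-eval (caddr expr) env)
                        (micro-eval (cadddr expr) env)))
              (else                                  ; ordinary call
               (micro-apply (micro-eval (car expr) env)
                            (map (lambda (arg) (micro-eval arg env))
                                 (cdr expr))))))
           (else (error "Unknown expression type:" expr))))

   ;; (micro-eval '(if (< x 10) (* x 2) 0)
   ;;             (list (cons 'x 3) (cons '< <) (cons '* *)))   =>  6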
In the interpreter, binding environments are assumed to be an associative data structure, structured as a list of environment frames; each frame is itself a vector or list (it doesn't really matter which---it's just a mapping) of pairs that map symbols (variable names) to cells that can hold values (variable bindings).

From the point of view of the interpreter, top-level environments are handled the same way as lexical (local) environments.  The low-level function lexical-lookup will return an object that may represent a top-level variable or a lexically scoped variable; in particular, the returned object understands variable-ref and variable-set! either way.

Also note that special forms are handled analogously to top-level variables: they are stored in variable-like objects, which are returned as the result of a lexical-lookup just like variables.  However, instead of holding a binding (a place to store a value), such an object stores an underlying procedure which knows how to evaluate that kind of special form.  This highly regular structure makes some parts of the evaluator simpler, and quite a bit more extensible.

You can assume that a closure is just a structure that has fields for the actual S-expression representation of a procedure, for the binding environment it's defined ("closed") in, and for the list of (formal) arguments it expects.

In this interpreter, we represent environments explicitly, but several other things about the interpreter state are represented implicitly.  For example, we don't implement an expression evaluation stack or a continuation chain explicitly---we "snarf" that from the language we're writing the interpreter IN (in this case, it happens to be Scheme, but it could be C or whatever).  The "stack" of the language we're writing in implements our stack for us: the stack of expressions being evaluated is simply all of the "expr" arguments of the chain of recursive calls to eval.  NOTE THAT THIS IS JUST AN IMPLEMENTATION CHOICE---we could have chosen to represent our activation stack explicitly, and implemented the interpreter as a simple iterative loop.

Note that this interpreter even snarfs tail-call optimization from whatever Scheme it executes in---calls to EVAL reduce to calls to routines that evaluate particular kinds of expressions, and those may in turn reduce to calls to EVAL.
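To see that concretely, here is the if case of the micro-eval sketch from above, pulled out into a separate helper the way eval.scm factors its special-form handling (again, the names are illustrative, not eval.scm's own).  The recursive call for the chosen branch is the last thing the helper does, so it is a tail call; if the Scheme we're running IN implements proper tail calls, tail calls in the interpreted program cost no stack growth either.

   (define (if-test expr)        (cadr expr))
   (define (if-consequent expr)  (caddr expr))
   (define (if-alternative expr) (cadddr expr))

   (define (eval-if expr env)
     ;; The test is NOT in tail position: we need its value back here.
     ;; The chosen branch IS in tail position: the call below is the
     ;; last thing eval-if does, so a host Scheme with proper tail
     ;; calls runs it without growing the stack.
     (if (micro-eval (if-test expr) env)
         (micro-eval (if-consequent expr) env)
         (micro-eval (if-alternative expr) env)))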