--- Copyright 1994 Paul R. Wilson. Until 1996, you may use these notes freely for noncommercial purposes, as long as this notice is preserved in all copies. After Jan 1, 1996, you must ask for permission to copy or redistribute. ---

--- INTRODUCTORY NOTES ON SCHEME ---

ABSTRACT

(Don't expect to understand all of this part yet... come back to it later.)

Scheme is a descendant of Lisp and Algol. From Lisp, it gets its basic syntax (fully parenthesized prefix expressions), dynamic typing, garbage collection, and built-in list processing features. From Algol, it gets lexical scope and block structure. These features combine to make a very elegant and powerful language.

Scheme has first-class procedures, which are an extremely elegant and powerful mechanism when combined with block structure and garbage collection, allowing a wide variety of programming language paradigms to be implemented easily within the language. Scheme also has an extremely powerful feature for controlling flow of control, called call-with-current-continuation, or call/cc for short. Call/cc allows the implementation of a variety of advanced control structures, such as backtracking and (when combined with timer interrupts) multithreading.

Scheme is very much a language designer's language. It is intended to be as simple and elegant as possible while still being extremely powerful. This contrasts sharply with some other languages (notably Common Lisp and C++), which are heavily influenced by a desire for backward compatibility with other dialects, and by a desire to standardize many features. By and large, the Scheme designers' goal is to provide a small language that can easily be extended within the language, rather than a large one with a plethora of features built in. Scheme makes an excellent base for experimenting with language features.

Scheme is also very much a teacher's language---it embodies many important concepts in elegant ways, so that they're easy to teach.
Once you understand Scheme, it's easy to understand things like Common Lisp.

---------------------------------

Like LISP, Scheme is written as simple prefix expressions, fully parenthesized. In C or Pascal, a call to procedure foo with arguments bar and baz is written

    foo(bar,baz);

but in Scheme it's written

    (foo bar baz)

Note that the procedure name goes INSIDE the parentheses, along with the arguments. Get used to it. Just as in C, expressions can be nested:

    (foo (bar x) (baz y))

is pretty much equivalent to C's

    foo(bar(x),baz(y));

--- EXPRESSIONS RETURN VALUES, BUT MAY HAVE SIDE EFFECTS

Scheme expressions combine the features of expressions and statements. They return values, but they can also have SIDE EFFECTS---i.e., change the state of a variable or an object. The assignment operation in Scheme is called SET!. If we want to assign the value 3 to the variable foo, we write

    (set! foo 3)

which is pretty much equivalent to C's

    foo = 3;

Note that this looks like a function call, because everything uses prefix notation.

--- MOST EXPRESSIONS ARE PROCEDURE CALLS

In conventional programming languages like C and Pascal, there's an awkward distinction between procedures and other kinds of expressions. So, for example, (a + b) is an expression, but foo(a,b) is a procedure call. In Scheme, things are much more uniform. Basic operations such as addition are procedures, and there is a unified syntax for writing expressions---parenthesized prefix notation. So rather than saying (a + b) in Scheme, you say (+ a b). And rather than saying foo(a,b), you say (foo a b). Either way, it's just an operation followed by its operands, all inside parentheses.

While most things in Scheme are procedure calls, there are a few other kinds of expressions you need to know, which behave differently. They are called SPECIAL FORMS.

--- CONTROL STRUCTURES ARE EXPRESSIONS

Scheme control structures are expressions, and return values.
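Before looking at particular control structures, here is a tiny sketch pulling the preceding sections together---prefix notation, nesting, and SET! (the variable count is made up purely for illustration):

```scheme
(define count 0)           ; create a variable count, initially 0
(set! count (+ count 1))   ; assignment is prefix too, and looks like a call
(+ count (* 2 3))          ; nested prefix expressions => 7
```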
An IF expression is a lot like a C IF-THEN statement, but the THEN branch and the ELSE branch are also expressions that return values, and whichever one is executed has its value returned as the value of the IF statement as a whole. For example,

    (if (< a b)
        a
        b)

returns a, or b, whichever is less. (This is like the ternary expression "(a < b) ? a : b" in C.)

Note that even though every expression returns a value, not all values are used. The enclosing expression may not use the returned value. The uniformity of value returning means that we never have to explicitly use a return statement. Suppose we wanted to write a MIN function. In C, we might do it this way:

    int min(a,b)
    int a, b;
    {
       if (a < b)
          return a;
       else
          return b;
    }

In Scheme, we can just do this:

    (define (min a b)
      (if (< a b)
          a
          b))

Whichever branch is taken, the value of the variable will be returned as the value of that branch of the if, which is returned as the value of the whole if expression, and that is returned as the return value of the procedure.

--- The BOOLEAN VALUES #t (TRUE) and #f (FALSE)

There are two special objects in Scheme, written #t and #f. These are just the written representations of two special objects that happen to have a special role in the evaluation of conditional expressions. Conceptually, at least, #f (the false value) and #t (the canonical true value) are just objects like any others in Scheme---there is an object #f that is conceptually stored on the heap and referred to via a pointer, and another object #t which is stored on the heap and referred to via a pointer.

The only special thing about #f (but it's a very special thing) is that if the first subexpression in an IF (i.e., the condition expression) evaluates to that value, the third subexpression (i.e., the else branch) will be evaluated and its value will be returned as the value of the whole IF expression.
If ANY OTHER value is returned by the first expression, then the second subexpression (the then branch) will be evaluated and its value will be returned. In Scheme, the only value that counts as a false result in a condition is the unique object #f. ALL OTHER VALUES ARE CONSIDERED TO BE TRUE, but there's one predefined true value that is often used for clarity, and that's the unique object #t.

(Note to Lisp programmers: in this respect, Scheme is different from LISP. In Lisp, the empty list value NIL is the same thing as the false value. In Scheme, the empty list object (written '()) is a different object, and it counts as a TRUE value. Be careful about that.)

--- EVERYTHING'S A POINTER

Conceptually, all Scheme objects are allocated on the heap, and referred to via pointers. This actually makes life simple, because you don't have to worry about whether you should dereference a pointer when you want to use a value---you always do. Since pointer dereferencing is uniform, procedures ALWAYS dereference a pointer to a value when they really use the value, and you never have to explicitly force the dereferencing.

For example, the predefined Scheme procedure + adds two (pointers to) numbers, and automatically dereferences both pointers before doing the addition. It returns a pointer to the number that's the result of the addition. So when we evaluate the expression (+ 2 3) to add two to three, we are taking a pointer to the integer two and a pointer to the integer three, and passing those as arguments to the procedure +. + returns a pointer to the integer five. We can nest expressions, e.g., (* (+ 2 3) 6), so that the pointer to five is passed, in turn, to the procedure *. Since these functions all accept pointers as arguments and return pointers as values, they nest just fine. After a while, this starts seeming perfectly natural, and you stop thinking about the pointers.
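The truth-value rules above are easy to check at a Scheme prompt:

```scheme
(if #f 'then 'else)    ; => else   (#f is the only false value)
(if '() 'then 'else)   ; => then   (the empty list counts as TRUE in Scheme)
(if 0 'then 'else)     ; => then   (so does 0, unlike in C)
```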
As we'll see later, an implementation is free to optimize away these pointers if it doesn't affect the programmer's view of things---but when you're trying to understand a program, you should always think of values as pointers to objects.

The uniform use of pointers makes lots of things simpler. In C or Pascal, you have to be careful whether you're dealing with a raw value or a pointer. If you have a pointer and you need the actual value, you have to explicitly dereference the pointer (e.g., with C's * operator or Pascal's postfix ^). If you have a value and you need a pointer, you have to take its address (e.g., with C's & operator, or Pascal's prefix ^). In Scheme, none of that mess is necessary. User-defined routines pass pointers around, consistently, and when they bottom out into predefined routines (like the built-in + or SET! operations), those low-level built-in operations do any dereferencing that's necessary. (Of course, when traversing lists and the like, the programmer has to ask for pointers to be dereferenced, but from the programmer's point of view, that just means grabbing another pointer value out of a field of an object he or she already has a pointer to.)

--- ALL OBJECTS LIVE ON THE GARBAGE-COLLECTED HEAP

In languages like C or Pascal, data objects (NOTE: by "objects" I just mean records or structs or whatever) may be allocated statically, or on an activation stack as part of a procedure activation record, or dynamically allocated on the heap at run time. Scheme is simpler---all objects are allocated on the heap, and referred to via pointers. The Scheme heap is GARBAGE COLLECTED, meaning that the Scheme system automatically cleans up after you. Every now and then, the system figures out which objects aren't in use anymore, and reclaims their storage. (This determination is very conservative and safe---the collector will never take back any object that your program holds a pointer to, or might reach via any path of pointer traversals.
Don't be afraid that the collector will eat objects you still care about while you're not looking!)

--- INDEFINITE EXTENT

The use of garbage collection supports the abstraction of INDEFINITE EXTENT. That means that all objects conceptually live forever---there's no concept (at the language level) of reusing memory. From the point of view of a running program, memory is infinite---it can keep allocating objects indefinitely, without ever reusing their space. Of course, this abstraction breaks down if the garbage collector fails to reclaim enough space---if you really try to create data structures that are bigger than the available memory, you'll run out.

--- ALL VARIABLES ARE POINTER VARIABLES, DYNAMICALLY TYPED

In Scheme, all variables have the same type: "pointer to anything." They're therefore the same size, i.e., one pointer-sized machine word. (We'll generally assume that pointers are 32 bits, as is true in most modern Scheme systems.)

Scheme is DYNAMICALLY TYPED, meaning that variables don't have fixed types, BUT OBJECTS DO. An object carries its type around with it---an integer is an integer forever, but a variable may refer to an integer at some times, and a string (or something else) at other times. The language provides type-checking at run time to ensure that you don't perform the wrong operations on objects---if you attempt to add two strings, for example, the system will detect the error and notify you.

--- POINTERS VS. IMMEDIATE VALUES

Most Scheme implementations optimize away a lot of pointers. For example, it's inefficient to actually represent integer values as pointers to integer objects on the heap. Scheme implementations therefore use tricks to represent integers without really using pointers. (Again, keep in mind that this is just an implementation trick that's hidden from the programmer. Integer values have the semantics of pointers, even if they're represented differently from other things.)
Rather than putting integer values on the heap, and then passing around pointers to them, most implementations put the actual integer bit pattern directly into variables---after all, a reasonable-sized integer will fit in a machine word. A short value (like an integer) stored directly into a variable is called an IMMEDIATE VALUE, in contrast to pointers, which are used to refer to objects indirectly.

The problem with putting integers or other short values into variables is that you have to tell them apart from each other, and from pointers which might have the same bit patterns. The solution to this is TAGGING. The value in each variable actually has a few bits devoted to a TYPE TAG, which says what kind of thing it is---e.g., whether it's a pointer or not. The use of a few bits for a tag reduces the amount of storage available for the rest of the value, slightly, but as we'll see next, that usually isn't a problem.

--- WHY IMMEDIATE VALUES DON'T BREAK THE POINTER ILLUSION

It might seem that storing integer bit patterns directly in variables would break the abstraction that Scheme is supposed to present---the illusion that all values are pointers to objects on the heap. That's not so, though, because the language enforces restrictions that keep programmers from seeing the difference.

In the case of integers, you can't change the value of an integer. If integers were actually allocated on the heap and referred to via pointers, and if you could change the integer's value, then that change would be visible through other pointers to the integer. In Scheme, you CAN'T CHANGE the value of an integer---integers are IMMUTABLE (not changeable). (That doesn't mean that a variable's value can't be one integer at one time, and another integer at another---the variable's value is really a pointer to an integer, not the integer itself, and you're really just replacing a pointer to one integer with a pointer to another integer.)
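Combined with dynamic typing, this means a variable's binding may point to different kinds of objects over time, while the objects themselves keep their identities and types. A small sketch (the name thing is made up for illustration):

```scheme
(define thing 42)        ; thing's binding holds (a pointer to) an integer
(set! thing (+ thing 1)) ; we didn't change 42---we now point at 43
(set! thing "hello")     ; now the binding holds (a pointer to) a string
```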
Another restriction imposed by the language is that there is only one integer object with a given integer value---you can't make copies of an integer with the same integer value; you can only make copies of a pointer to the integer, and there's only ever one integer object with that value.

These restrictions may seem peculiar, but once you grasp their significance, they're deeply cool, and loaded with truth, beauty, and righteousness. When you think about it, it doesn't make any sense to change the value of an integer, in a mathematical sense. For example, what would it mean to change the integer 6's value to be 7? It wouldn't mean anything sensible, for sure. 6 is a unique, abstract mathematical object that doesn't have any state that can be changed---6 is 6, and behaves like 6, forever.

What's going on in conventional programming languages is NOT really changing the value of an integer---it's replacing one (copy of an) integer value with (a copy of) another. That's because most programming languages have both pointer semantics (for pointer variables) and VALUE semantics (for nonpointer variables, like integers). You make multiple copies of values, and then clobber the copies when you perform an assignment. In Scheme, we don't need to clobber the value of an integer, because we get the effect we want by replacing pointers with other pointers. An integer in Scheme is a unique entity, just as it is in mathematics. We don't have multiple copies of a particular number, just multiple references to it.

Representing integers as tagged immediate values makes arithmetic much faster than it would be if integers were unique objects on the heap---when we add two integers to get another integer, we don't have to go and find the unique object with that integer value, and use a pointer to it as a value. We simply use an equivalent unique value (the actual integer value) in lieu of the unique pointer value.
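The uniqueness of integers is visible through the standard predicate eqv?, which asks whether two values are, in effect, pointers to the very same object:

```scheme
(eqv? 3 3)         ; => #t   there is only one integer 3
(eqv? 3 (+ 1 2))   ; => #t   arithmetic yields that same unique integer
(eqv? 3 4)         ; => #f
```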
The right way to think about this is probably to view a tagged integer value as a funny kind of pointer, where the object at the end of the pointer has been optimized entirely away, because the information about the object can be derived from the pointer itself. (You can make this more concrete by visualizing a huge array holding all of the integers in increasing order, with the integer 1 in the first array element, 2 in the second, and so on. In this case, the offset of an integer in the array contains exactly the same information as the value of the integer---we can use the array indices in place of the integer values they "point to", and that means we can throw the array itself away.)

--- LOW TAGGING

The most popular tagging scheme is called LOW TAGGING, which means stealing the least significant few bits out of a word to hold type information. (The other popular alternative is high tagging, which uses the most significant bits.)

TWO-BIT LOW TAGGING

Our Scheme system uses the low-order two bits of a word to hold the type tag. ("Low-order" just means that if you view the word as a binary integer, they're the least significant bits.) If the low bits are 00, that signifies that the stored value is an integer; the actual bit pattern for the integer is stored in the remaining high-order 30 bits. (The two-bit tag therefore decreases the possible range of integer values by a factor of four, but few programs care.)

If the low bits are 11, that signifies that the value is a pointer, and the upper 30 bits hold the machine address. Addresses are treated somewhat differently than integers, however---instead of being shifted left two bits, addresses simply have their low two bits removed and replaced with a tag value. This is possible because all popular architectures use BYTE ADDRESSING, but our Scheme system aligns all objects on 4-byte boundaries. The low two bits of an object address are therefore always 00.
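The two-bit scheme can be simulated with ordinary Scheme arithmetic. This is only a sketch of the idea---a real implementation manipulates raw machine words with shifts and masks, and the procedure names here are made up:

```scheme
;; Integers get tag 00: shifting left two bits (times 4) leaves 00
;; in the low two bits.
(define (tag-int n) (* n 4))
(define (untag-int t) (quotient t 4))

;; Pointers get tag 11: object addresses are 4-byte aligned, so their
;; low two bits are already 00 and can simply be replaced by adding 3.
(define (tag-pointer addr) (+ addr 3))
(define (untag-pointer p) (- p 3))

(untag-pointer (tag-pointer 1024))        ; => 1024
;; Tagged addition needs no untagging: a 00 tag plus a 00 tag is 00.
(untag-int (+ (tag-int 3) (tag-int 4)))   ; => 7
```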
To dereference a tagged pointer (whose low bits are 11 because of the tag), it is only necessary to mask or subtract out the tag value, restoring the original address.

--- WHY LOW TAGGING IS CHEAP

Given a two-low-order-bit tag of 00, integer arithmetic on tagged values turns out to be almost as fast as raw 32-bit integer arithmetic. Addition and subtraction are very fast because the same instructions can be used---the low-order bits participate in the arithmetic in the usual way, but adding a 00 tag to a 00 tag results in a 00 tag. Similarly, subtracting a 00 tag from a 00 tag results in a 00 tag. In effect, a 32-bit addition or subtraction acts as a 30-bit addition or subtraction, AND computes a new tag value, in parallel, for free. (Notice that with 30-bit values in the high 30 bits of a 32-bit word, the sign bit is in the right place, so everything works right, including overflow detection hardware.)

Multiplication and division of low-tagged integers is slightly slower---it is necessary to shift one of the operands two bits before doing the multiply or divide. (In effect, a 30-bit value in the high-order bits looks to the hardware like a 32-bit integer shifted left two bits, i.e., multiplied by four. Multiplying two such numbers results in a number that's shifted left by FOUR bits---so we shift one of them down to the 30 low-order bit positions before the multiplication, and the result is only shifted left two bits. That gives us exactly what we want---a 30-bit value in the high-order bits, with zeroes in the low-order bits.)

Pointer operations are also very fast given the low-tagged representation. It is necessary to remove the tag before using an address, to restore the two low-order bits to 00 and address a word-aligned object. On some machines, this may require an extra instruction, but on many it does not. On some machines, the two low-order bits of an address are ignored, because all loads and stores must be on word boundaries.
On some other machines, there is an addressing mode that allows the addition or subtraction of a small constant while computing an address, without a time penalty. On these machines, the subtraction of the tag from the address can be done by the address calculation hardware, for free, using the small constant encoded in an instruction.

Most Scheme objects only have fields that are general-purpose VALUE CELLS, which are just like variable bindings---they can hold any Scheme value, whether it's a tagged immediate value or a tagged pointer to another heap-allocated object. So, for example, a PAIR (also known in Lisp terminology as a CONS CELL) is a heap-allocated object with two fields. Either field can hold any kind of value, such as a number, an ASCII character, a boolean, or a pointer to another heap object.

A heap-allocated object usually has a hidden HEADER field that you're not supposed to know about. This extra field holds type information, saying exactly what kind of heap-allocated object it is. So, laid out in memory, a PAIR looks something like this:

            +-----------+
     header |           |
            +===========+
        car |        22 |
            +-----------+
        cdr |        15 |
            +-----------+

The first field of a pair is called the CAR field, and the second field is called the CDR field. These are among the dumbest names for anything in all of computer science, and are just a historical artifact. In this case, the car field of the cons cell holds the integer 22, and the cdr field holds the integer 15. (Of course, those are really 30-bit binary numbers with a two-bit low tag of 00, but you usually draw things in the obvious way shown here.)

Scheme provides a built-in procedure CAR to get the value of the car field of a cons cell, and SET-CAR! to set that field's value. Likewise, there are functions CDR and SET-CDR! to get and set the CDR field's values.

Suppose we have a global variable binding for the variable FOO, and its value is a pointer to the above cons cell.
We would draw that situation something like this:

                          +-----------+
                   header |           |
     +---------+          +===========+
FOO  |    +----+--------->|        22 |  car
     +---------+          +-----------+
                          |        15 |  cdr
                          +-----------+

By convention, we draw the pointer as though it points to the first data field, not the header. That's because the header is not supposed to be visible at the language level---it's only there to help the implementation (compiler or interpreter) manage dynamic typing. When we're drawing data structures, and we're not worried about grubby implementation-level details, we often leave the headers out.

--- DEFINING VARIABLES AND PROCEDURES

If you want to define a variable in Scheme, you always have to give it an initial value. There is no way to create a variable without giving it a value, as there is in C. To define a variable named foo with the initial value 10, just write

    (define foo 10)

To define a procedure, you use a similar syntax that has a more complicated meaning. Suppose I want to define a procedure that will return the value 10. I can just write

    (define (foo) 10)

The parentheses around the name tell the interpreter or compiler that you're defining a procedure. The expressions following the parentheses are taken to be the body of the procedure, i.e., expressions that will be evaluated when the procedure is called. In this case, there is only one expression, which is just the number 10. This procedure will just return the number 10 when it is called.

Procedures can take arguments, and the arguments go inside the parentheses after the name of the procedure.
If we want to define a procedure IDENTITY that just takes one argument and returns it, we can write

    (define (identity x)
      x)

Notice that the parentheses around the name of the procedure and the argument(s) resemble the parentheses around the name of the procedure when it's called, e.g., calling identity with the argument 20:

    (identity 20)

The procedure definition syntax was designed so that you can look at the definition of a procedure and see how to call it. If we want a procedure that doubles its argument, we can write

    (define (double x)
      (+ x x))

As in most reasonable languages, different procedures can have argument variables with the same name:

    (define (quadruple x)
      (+ (double x) (double x)))

Notice that double has an argument variable x, and quadruple also has an argument variable x. This is okay, because ARGUMENT variables are LOCAL variables---they can't be seen outside the procedures where they're defined. There is no conflict, because there's a SCOPE RULE that says which variables are visible to which expressions. (In Scheme, the rule is LEXICAL SCOPE, which we'll discuss in detail later.)

--- VARIABLES

Consider the picture above, where we have a variable FOO that refers to a pair. So far, we've casually talked about variables holding values, but that's not quite right. Variables are bound to storage, and storage holds values. When speaking precisely, we say that the variable name FOO is BOUND to the memory location represented by the box on the left. BINDING just means making an association between a name and a piece of storage. For brevity, we refer to the location as the variable's BINDING. (Even this is not precisely correct, because literally the BINDING is the ASSOCIATION between the name FOO and the word of storage, not the actual storage itself.)

Why make this distinction? Why not just say that the variable holds a value, i.e., why not call the unit of storage a variable? Because that's not right. Consider the following short program.
    (define x 10)          ; create a binding of x holding the value 10

    (define (double x)     ; define a procedure that doubles its argument
      (+ x x))

    (define (quadruple x)  ; define a procedure that quadruples
      (+ (double x)        ; its argument
         (double x)))

    (define (foo x)        ; define a recursive procedure that calls
      (if (> x 0)          ; itself if its argument is more than 0,
          (foo (- x 1))))  ; handing the recursive call an argument
                           ; that's one less

    (foo (quadruple x))    ; look up the top-level value of x, hand
                           ; it to a call to quadruple, and hand the
                           ; result to a call to foo

Notice that we've defined a top-level (non-local) variable x, and we've defined three procedures, double, quadruple, and foo, each of which has a local variable x. There are therefore FOUR DIFFERENT VARIABLES NAMED X in this code. At the top level, x means one thing, but in each of the procedures, it means something else. Each procedure defines a different meaning for the name x, and each separate MEANING is a different VARIABLE. This becomes obvious when we talk about "the variable x defined at the top level" versus "the variable x in the procedure double" versus "the variable x in the procedure quadruple" versus "the variable x in the procedure foo".

Notice also that when we define the procedures, there is no storage allocated for their local variables; the variables named x in the procedures are just declarations---no space will be allocated for them until the procedures are called.

--- BINDINGS

When we call something a VARIABLE, that's NOT because we can assign to it and change its value. None of the above variables has a value that's variable in that sense; none of these procedures happens to modify the values they're given as arguments. The term "variable" here means pretty much what it means in mathematics---at different times we invoke the same procedure and the variable will refer to something different. For example, I may call double with the argument 10, and for the duration of that call the value of x will be 10.
Later, I may call double with the value 500, and for the duration of that call the value of x will be 500.

Consider how similar this is to variables in logic. I may have a logical statement that for all x, if x is a person then x is mortal. (Forall x, person(x) -> mortal(x)). I can use that same logical statement and apply it to lots of people, e.g., if Socrates is a person then Socrates is mortal, and if Bill Clinton is a person then Bill Clinton is mortal, and so on. Each time I use it, x may refer to a different thing, and that's why it's called a variable. Just because it's a variable doesn't mean that when I use it I change the state of the thing I use it on---for example, Bill Clinton is probably not modified much by the fact that I'm inferring something about him, and I'm pretty sure Socrates isn't changed much at all by the experience.

The point here is that the same variable can refer to different things at different times. These different things are called BINDINGS, because the variable is associated with (BOUND TO) something different each time.

Consider the recursive procedure foo, above. In a recursive procedure, the same variable may be bound to DIFFERENT things at the SAME TIME. Suppose I call foo with the argument 15, and it binds its argument x with the value 15. Then it examines that value, and calls itself with the argument 14. The recursive call binds its argument x with the value 14, then examines that and calls itself with the value 13, and so on. At each recursive call, a new BINDING of x is created, even if the old bindings still exist because the earlier calls haven't finished yet---they're suspended for the duration of the recursion.

But what is a variable BOUND TO, i.e., to what does a variable refer? In Scheme, it refers to a piece of STORAGE. When you call a procedure, for example, the argument variable is bound to a piece of storage that can hold a value.
Inside that call to that procedure, that variable name will refer to that piece of memory. A single binding of a variable may hold different values over time, because of assignments, as in most procedural languages. So not only may the same variable be bound to different pieces of storage, but each piece of storage may hold different values over time.

--- TOP-LEVEL VARIABLES AND BINDINGS

Top-level variables are a little bit confusing, because there is a one-to-one mapping from variables to bindings. Local variables clearly show you the difference between a variable and a binding, because you can see that the same expression can bind a variable many times if it's executed many times. Top-level variables tend to hide this distinction, because when you define the variable, you also create exactly one binding of it. (When you create that binding, you also initialize it to a particular value.)

It's important to notice that there are really three separate and important concepts here: declaring a variable, binding that variable to a particular piece of storage, and putting a value in that storage. Consider a top-level variable named x.

    (define x 22)

There is the variable NAME x, which may be the name of many different variables, including local variables of procedures. Thus the NAME x can be used for many variables. Then there's the top-level VARIABLE x, which is a particular program variable; it's different from any other variables named x. Then there's the BINDING of the top-level variable x, i.e., the storage to which the top-level variable is bound. And then there's the value stored there, in this case 22. These are all different things.
---

Be clear on the following:

 * a variable name may be used as the name of many different variables (e.g., the various variables named x in different procedures in the earlier example)

 * a particular variable may have many different bindings (i.e., refer to many different pieces of storage), as in the case of multiple calls to the same procedure, each of which binds the same variable to a different piece of storage within that call to the procedure

 * a particular variable binding (e.g., the storage allocated for a particular variable within a particular call to a procedure) may hold different values, because assignments may change the value stored in that piece of storage.

These distinctions will become important later, when we take advantage of some of Scheme's powerful abstraction features.

---

Going back to the picture with the binding of FOO holding a pointer to a pair... We say that the name foo is bound to a word of storage, represented by the box in the picture. We also say that the word of storage HOLDS A VALUE, in this case, the pointer to the pair. WE DO *NOT* SAY THAT THE VARIABLE IS BOUND TO THE CONS CELL, unless we are being unforgivably sloppy. Variables can be bound to storage, and storage can hold values, but variables cannot be bound to values!

This distinction is very important in Scheme, because of the possibility of SIDE EFFECTS, i.e., assignments. When we store a new value into the variable binding, we are NOT REBINDING THE VARIABLE. The variable name FOO still names the same piece of storage. (If Scheme were a purely functional language, we might be able to slur over the difference between storage and values, because the value in a piece of storage would never change. But Scheme has side effects, so we must always be aware that variables are bound to storage, not to values.)
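Here is the storage/value distinction in action, using the pair operations introduced earlier (foo and bar are just illustrative names; cons is the standard pair-allocating procedure):

```scheme
(define foo (cons 22 15)) ; foo's binding holds a pointer to a fresh pair
(define bar foo)          ; bar's binding holds a pointer to the SAME pair
(set-car! foo 99)         ; side-effect the pair object itself
(car bar)                 ; => 99   visible through bar's binding too
(set! foo 0)              ; store a new value into foo's BINDING...
(car bar)                 ; => 99   ...the pair, and bar's binding, are untouched
```

Note the contrast: SET-CAR! changes the storage inside the pair, while SET! changes the value held in the storage foo is bound to; neither one rebinds anything.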
---

LISTS

Scheme, like Lisp, has built-in procedures for dealing with a particularly flexible kind of list---a list of pairs, whose CDR fields string them together, and whose CAR fields hold the values. We'll call these Lisp-style lists, or Lisp lists for short.

Notice that in Lisp you don't typically string objects together into a list by giving each one a next field that points right to the next object. Instead, you create a list of pairs that hold the objects (or pointers to them)---only the pairs have next fields. A major advantage of this is that you don't have to modify an object to put it on a list---an object can easily be in many lists at once, because a list is really just a spine of cons cells that holds pointers to the items in the list. This is much cleaner than the way people are typically taught to create simple lists in a beginning programming class.

Dynamic typing also helps make Lisp lists useful. A list of cons cells can hold ANY KIND OF OBJECT, or even a bunch of different kinds of objects. So, for example, a cons cell list can be a list of integers, a list of lists, a list of ASCII characters, or a list of any of the kinds of objects we haven't gotten to yet. It can also be a mixed list of integers, other lists, and whatnot. A few list routines can therefore be useful in a variety of situations---a single list search routine can search any kind of list for a particular target object, for example.

This picture shows two variable bindings, for the variables BAR and FOO.
BAR's binding holds a list:

   +---------+     +========+
BAR|    +----+---->|     10 |
   +---------+     +--------+
                   |    +---+-------+
                   +--------+       |
                                    v
   +---------+     +========+    +========+    +========+
FOO|    +----+---->|     22 |    |     15 |    |      6 |
   +---------+     +--------+    +--------+    +--------+
                   |    +---+--->|    +---+--->|    *   |
                   +--------+    +--------+    +--------+

A more common way of drawing this data structure is:

   +---+      +---+---+
BAR| +-+----->| + | +-+---------+
   +---+      +-+-+---+         |
                |               |
                10              |
                                v
   +---+      +---+---+      +---+---+      +---+---+
FOO| +-+----->| + | +-+----->| + | +-+----->| + | * |
   +---+      +-+-+---+      +-+-+---+      +-+-+---+
                |              |              |
                22             15             6

This emphasizes the idea that everything's a pointer---conceptually, the cons cells hold pointers to the integers. Pairs are drawn as pairs of little boxes, and they're typically drawn with the boxes side by side---that's just handy because cons cells are often used for linear lists, which you want to display horizontally---it's easy to draw the spine of the list horizontally if the CDR field (used as a "next" pointer) is on the right. We leave off the headers because they're a low-level detail anyway, and because everybody recognizes this kind of two-box drawing of a pair.

TERMINOLOGY

In Lisp and Scheme parlance, we often talk about storing an object in a variable binding, rather than storing a pointer to the object into the variable binding. This is just sloppy talk, in two important ways. You should always remember that variables don't hold values, bindings do. And bindings generally don't hold objects, they hold pointers, so the fact that an object is "stored in" one variable doesn't prevent it from being stored in another at the same time---all objects are conceptually stored uniquely on the heap, and variable bindings hold pointers to them.

---

Scheme has built-in TYPE PREDICATES that allow you to dynamically check the types of values.
So, for example, if you want to know if a value is a (pointer to a) pair, you use the procedure PAIR?. (PREDICATES are just functions that return a boolean value---true or false, written #t or #f. By convention, the names of Scheme predicates end with a question mark. This is equivalent to the Lisp convention of ending predicates with the letter p: NUMBER? in Scheme corresponds to NUMBERP in Lisp.)

---

EQUALITY TESTS AND IDENTITY

Lisp and Scheme have POINTER SEMANTICS, meaning that you can tell the difference between two objects that are of the same type and have the same values in their fields. The predicate EQ? tests to see if two objects are the very same ("identical") object. The predicate EQUAL? tests to see if two objects are structurally identical, but maybe not the same object. So if I have two cons cells whose CAR values are 5 and whose CDR values are 6, they will be EQUAL?, but not EQ?.

In some functional programming languages, there is no EQ? primitive---you're only supposed to care whether objects have the same VALUES, not whether they have the same OBJECT IDENTITY. Lisp and Scheme aren't pure functional languages, although you can do some pretty nifty things with functions. (More on that later.)

--------------------------------------------------------------------

Scheme is a PROCEDURAL language, in that you write procedures which explicitly control the flow of program execution. Scheme is also an EXPRESSIONAL language. Every procedure returns a value, as does every control construct. For example, IF controls the flow of execution, just like in C. But IF is also an expression, and returns the value of its "true" branch if the condition evaluates to true, or of its "false" branch if it's false. (This is like the ternary expression in C.)

The syntax of an IF in Scheme is

   (if <condition> <true-branch> <false-branch>)

(Note that there's no syntactic sugar like "then" or "else" keywords.)
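For instance, here's a minimal IF expression as it might be typed at the interactive Scheme prompt (described in more detail later), with the value it returns:

   Scheme> (if (< 1 2) "smaller" "bigger")
   "smaller"

The condition (< 1 2) evaluates to true, so the IF returns the value of its "true" branch, the string "smaller".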
So, for example, you can compute the MIN of two variables as an if expression:

   (if (< a b) a b)

If you want the IF to put the MIN value in (the currently visible binding of) the variable FOO, you can do this:

   (if (< a b)
       (set! foo a)
       (set! foo b))

If you want both side effects and return values, you can use a begin expression for each branch, which will perform the assignment and return the value as well:

   (if (< a b)
       (begin (set! foo a)
              a)
       (begin (set! foo b)
              b))

BEGIN expressions just group several expressions together and force them to be evaluated sequentially; the value of the last one is returned as the value of the begin expression. We'll come back to that in a while.

(In some implementations of Scheme, the last two versions are equivalent, because set! also returns the value being assigned. That's not standardized, however, and in some implementations, set! returns the OLD value.)

---

TEXTUAL REPRESENTATIONS OF LISTS

Lisp and Scheme have a standard way of writing a textual representation of a list structure, and of reading such textual representations to construct lists. These can be used for writing list data to files, and reading them back in to construct new lists. In the example used earlier, the list that's the value of the binding of FOO is written (22 15 6). The value in BAR's binding is (10 15 6).

The procedure WRITE takes a value and writes its textual representation to a file. This can be a complicated list, and multiple levels of structure will be represented by nested parentheses. The procedure READ reads such a representation and reconstructs the list structure---sort of. Notice that the textual representation doesn't explicitly represent the spine of the list---that's just implied by the parentheses enclosing the list elements.
This way of writing lists loses information about whether lists share structure; there's no way to tell from the printed representation that the two lists above actually share the same structure, and only differ by one pair. So if we say (write foo) and (write bar), this will be written to the file:

   (22 15 6) (10 15 6)

If we then call READ twice, we'll get back two NEW lists, which DON'T share structure, but which hold the same sequences of integers.

---

THE SCHEME READER

We just said that when we read in two textual representations of lists using READ, we get two distinct list structures. This is true, up to a point. The PAIRs that make up the lists will be different, but if their contents are integers, they'll be the same object. Suppose, for example, that we have a file containing the characters "(6 6 6) (6 6 6)" and we call READ twice to read the data in as two lists, like so:

Scheme> (define foo (read))
FOO
Scheme> (define bar (read))
BAR
Scheme> foo
(6 6 6)
Scheme> bar
(6 6 6)

If we draw the lists using box-and-arrow notation, representing integers as separate objects in accordance with the language abstractions, we'll get this:

   +---+      +---+---+      +---+---+      +---+---+
FOO| +-+----->| + | +-+----->| + | +-+----->| + | * |
   +---+      +-+-+---+      +-+-+---+      +-+-+---+
                |              |              |
                +-------+      |      +-------+
                        |      |      |
                        v      v      v
                       +---------------+
                       |       6       |
                       +---------------+
                        ^      ^      ^
                        |      |      |
                +-------+      |      +-------+
                |              |              |
   +---+      +-+-+---+      +-+-+---+      +-+-+---+
BAR| +-+----->| + | +-+----->| + | +-+----->| + | * |
   +---+      +---+---+      +---+---+      +---+---+

Remember that integers are represented specially, though, and multiple copies of an (immediate) integer bit pattern have the same semantics.
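Given the two lists FOO and BAR just read in, a quick sketch of how the equality predicates described earlier report this structure (assuming an implementation with immediate integers, as discussed below):

   Scheme> (equal? foo bar)
   #t
   Scheme> (eq? foo bar)
   #f
   Scheme> (eq? (car foo) (car bar))
   #t

The two lists are structurally identical, so they're EQUAL?, but READ built them out of distinct pairs, so they're not EQ?; the integer 6 in their CAR fields, however, is effectively the same object.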
If we go down to a lower level of abstraction, we'll see this:

   +---+      +---+---+      +---+---+      +---+---+
FOO| +-+----->| 6 | +-+----->| 6 | +-+----->| 6 | * |
   +---+      +---+---+      +---+---+      +---+---+

   +---+      +---+---+      +---+---+      +---+---+
BAR| +-+----->| 6 | +-+----->| 6 | +-+----->| 6 | * |
   +---+      +---+---+      +---+---+      +---+---+

This picture reflects the low-level fact about the implementation that multiple references to a unique integer are represented by multiple copies of a unique bit pattern. The reader therefore preserves the uniqueness of integers across reads and writes, but not the uniqueness of cons cells---it creates new cons cells whenever it reads a list. (Besides integers, there is only one other kind of object whose uniqueness READ preserves: symbols, which will be discussed later.)

---

COMMON MISCONCEPTIONS ABOUT CODE AND DATA

You may have noticed that the printed representation of lists is very similar to the printed representation of programs. You may have also heard loose talk that in Lisp, "programs are data," that you can create "self-modifying code," or that "expressions are just lists." DON'T YOU BELIEVE IT!!!!!!!

Lisp and Scheme COULD have a different syntax for code than for data---and some versions DO. For now, you should consider programs and data to be entirely distinct kinds of things, and in general they actually are. In high-performance implementations, programs get compiled into machine code before they're run, and that machine code is pretty much like machine code for Pascal or C. It's not lists. It's not interpreted.

As we'll see later, the list-like expression syntax of Scheme DOES make it easy to write interpreters, and we'll look at one. But the actual Scheme language does not REQUIRE an interpreter, and there's no standardized way to create self-modifying code. There are powerful features that can give you most of the desirable effects of self-modifying code (in a much cleaner way), but they're NOT based on treating code as data.
---

AN INTERACTIVE PROGRAMMING ENVIRONMENT

The normal way of interacting with a Scheme system is via an interactive loop, often called a READ-EVAL-PRINT loop. The Scheme system is just a program that sits and waits for you to type in an expression. When you do, it evaluates the expression to get a value, prints the value so you can see it, and loops to wait for the next input. That's all it does, over and over and over.

Of course, evaluating an expression can have arbitrarily hairy effects, depending on what expression you type in. It can create variable bindings, create new procedures, call procedures, and whatnot---all the stuff you'd expect from a programming language.

The nice thing about such an INTERACTIVE PROGRAMMING ENVIRONMENT is that you're INSIDE the system. With typical "batch" implementations of typical programming languages, a program is a bunch of stuff in a file which you run through a compiler, and then RUN all at once. You just sit there watching it, waiting for it to complete a run, or maybe blow up. When the program dies, its whole heap and all of its variables disappear.

In an interactive programming environment, the variables just hang around, and you can define new variables and procedures one at a time, change procedure definitions or variable values, run procedures, etc. There doesn't have to be a "main" program that orchestrates a single coherent execution---the "main" program is just the read-eval-print loop, and it does whatever you want at each iteration---you're right there, able to query the values of variables after any procedure you tell it to execute, etc.

Scheme systems are typically robust---if your program does something stupid because of a bug, you generally won't get a segmentation fault or other fatal error. Most basic errors (e.g., exceeding array bounds, dereferencing a null pointer) are checked, rather than allowing your program to go off into the ozone.
When an error is detected, the Scheme system will try to recover to some reasonable state, and throw you into a debugger to fix the problem, or just give you another iteration of the read-eval-print loop.

---

INTERACTIVE VS. BATCH COMPILATION

In an interactive programming environment, there's a different notion of "compile time" than in a conventional batch environment. In a batch environment, you submit a whole program to the compiler and compile the whole thing before running it. In an interactive environment, you can compile some functions and run some functions, interleaved any old way. Scheme is usually used in this interactive way, but most systems also let you compile a whole program in a batch-style way. This may generate better code than compiling little bits at a time. It is common to develop a program by interactively building and testing little bits of it, then recompile the whole thing with a different compiler mode once it's finished.

---

TRANSPARENT COMPILATION and "COMPILE-TIME"

Scheme may be interpreted or compiled, and interactive systems may use either an interpreter or a compiler to execute things typed at the Scheme prompt. When we want to explain the meaning of expressions, we'll usually talk as though they were being interpreted. This is just for simplicity in explaining the semantics of expressions.

When we talk about defining procedures, however, we'll generally assume that they are compiled as soon as the Scheme system reads them in. This is in fact how many Scheme systems work, including the one we're developing. This changes the meaning of the word "compile-time" from what most people are used to. "Compile-time" is simply the time at which an expression is compiled, i.e., when it is first read in by the Scheme system. Consider this:

Scheme> (define (double x) (+ x x))
DOUBLE
Scheme> (double 10)
20

When the first expression is read and evaluated, that's COMPILE-TIME for the procedure double.
When the second expression is evaluated, by calling double, that's RUN-TIME for that procedure. It's important to remember that COMPILE TIME just means the time when a procedure is DEFINED, and run time is when the procedure is called.

---

EXPRESSION EVALUATION

When you type an expression to the read-eval-print loop and hit return, the characters you typed are converted into a structured representation, and evaluated. This evaluation can take two forms: interpretation, or compilation and execution. In the interpretation approach, the data structure that represents the expression is traversed in a simple recursive way, with actions taken at each step. In the compilation approach, the whole expression is analyzed, and a piece of code is generated that, when executed, will have the same effect as interpreting the expression---then this compiled code is executed so that those effects will actually happen. Many modern Scheme systems (including the one we're building) use the compile-and-execute approach. It's easier to understand an interpreter, though, so we'll talk about interpreters first.

---

The read-eval-print loop prints a prompt when it's ready to accept input. The prompt usually looks something like ">" or "Scheme>" or just "?". We'll use "Scheme>" for clarity.

Suppose we get the Scheme prompt and we type in the expression "(+ (- 3 2) (- 5 4))". The read-eval-print loop will READ that expression (from the buffer your typing goes into), translating it from a sequence of characters to a hierarchical data structure representing the nesting of expressions. It will then call the interpreter with that representation as an argument. The interpreter will evaluate it and return a result. The read-eval-print loop will print that result, and then display another prompt signifying that it's ready for another expression. We'll show what the Scheme system displays, interleaved with what the user types in, pretty much as it would appear on the user's screen.
Scheme> (+ (- 3 2) (- 5 4))
2
Scheme>

When the read-eval-print loop reads the expression and converts it to a data structure, it typically converts it to a Lisp-style list in the obvious way, by using the normal Scheme procedure READ or something pretty much like it. Keep in mind that this is ONLY A CONVENIENT WAY OF DOING THINGS---other representations could be used.

---

STRINGS

When we type things to the read-eval-print loop, and it uses READ to read them, new Scheme objects are created on the heap to represent what we type. Suppose we just type a character string, in double quotes.

Scheme> "This is a string"
"This is a string"

When the reader reads the string (including the quotes), it recognizes that as signifying a string object---the expression typed in was not a list, but just a single object of type string, storing the character sequence T-h-i-s- -i-s- -a- -s-t-r-i-n-g. So the reader simply creates a string object, which is a sort of array of characters.

But now what? The expression has been read, and now it's supposed to be evaluated and printed. What's the value of a string? In Scheme, strings aren't really interesting to treat as expressions and evaluate---they don't do anything when treated as expressions. So when given a string as an expression to evaluate, Scheme just returns the same string. This turns out to be convenient, and a lot of built-in types (including integers) are treated this way. (You'll see later why this is so convenient, but it's nothing deep.) Objects (like strings and integers) that evaluate to themselves are called SELF-EVALUATING objects, for obvious reasons. Once the expression has been read and (trivially) evaluated, the read-eval-print loop just prints it and gives the next prompt.

---

BOOLEANS and TRUTH

Scheme provides a special unique object, whose written representation is #f, called FALSE. This object counts as false if it's the result of a condition expression in an IF (or COND) expression.
This is the ONLY value that counts as false, and all others count as true. For convenience and clarity, Scheme also provides another boolean value, written #t, which can be used as a true value. Note that any value other than false is true, but the special boolean object #t is a good one to use when all you want to say is that something is true---returning the true boolean makes it clear that all you're returning is a true value, not some other value that conveys more useful information.

(Warning to Lisp programmers: the false object #f is NOT the same thing as the empty list object '(). The empty list counts as a TRUE value in Scheme.)

Like strings, booleans are self-evaluating---if you type #f to the read-eval-print loop, it will be evaluated to #f, and that's what will get printed.

---

SYMBOLS

There's another, more interesting data type in Scheme that is related to strings. These are SYMBOLS. Symbols are like strings, in that they have a sequence of characters associated with them. But symbols are special objects, in that they are UNIQUE. There may be any number of strings with the same sequence of characters, but there is only one symbol with any particular sequence of characters.

If we type "foo" (with the double quotes) at the Scheme prompt, over and over, the reader will keep creating new strings, all with the character sequence f-o-o. If we just type foo, though, with no quotes, something quite different happens. The Scheme system maintains a special table called the SYMBOL TABLE, which holds all of the symbols that have been encountered by the reader. The reader checks this table every time it reads a symbol. If a symbol by that name is already in the table, a pointer to that symbol is returned. If no symbol exists for that character sequence, one is created. In this way, no duplicates are ever created, and all pointers to a symbol named foo will always point to the SAME symbol, named foo.
(The symbol table is typically structured as a hash table, keyed by the name of the symbol.) The symbol table is not visible at the language level---it's just a hack that the system uses to ensure the uniqueness of symbols (which IS visible at the language level). Other implementation strategies could be used, but the use of a symbol table is almost universal.

---

Symbols and strings are typically used in quite different ways. When you compare strings, you typically use the EQUAL? test (or a more specific test, STRING=?) to see if two strings have the same character sequence. When you compare symbols, however, you know that if two symbols are EQUAL? they are also EQ?, because the symbol table trick enforces that. You can therefore compare symbols very, very quickly by simply comparing pointers---if the bit patterns are the same, they're the same symbol. If not, they're not, and must have different character sequences as well.

---

SYMBOLS are NOT VARIABLES

One thing to notice is that a SYMBOL is NOT a variable. Just because there's a symbol x, for example, doesn't mean there's a variable named x. A SYMBOL IS JUST A SPECIAL KIND OF DATA OBJECT, WHOSE UNIQUE PROPERTY IS ITS UNIQUENESS. DO NOT CONFUSE SYMBOLS WITH VARIABLES!

---

IMPLEMENTING EQ?

In principle, we don't need the symbol table trick to enforce the language-level uniqueness of symbols. We could simply implement EQ? to do a much more expensive (character-wise) comparison on symbols. EQ? is a very important primitive, however, and most implementations are careful to make sure that it is just a very few instructions which compare bit patterns. Any two Scheme values can be compared as simple bit patterns, whether they're pointers, or immediates, or whatever. If they're the same bit pattern, they're the same Scheme object.
Notice how this is true for immediates---each copy of an immediate (such as the integer 22) is conceptually a pointer, but comparing the bits of the copies gives the same result as if both had been pointers to a unique heap-allocated object holding the value 22. Neat, huh?

---

But now back to evaluating expressions. Suppose we type 133 at the Scheme prompt:

Scheme> 133
133

The character sequence is recognized as an integer by the reader, and the expression is converted to a Scheme integer, which the evaluator notices is a self-evaluating object (all integers are, like strings), and that value is returned and printed. Likewise, if we type "foo" at the prompt, the reader will create a string object, which will be evaluated to itself and printed.

But now suppose we type foo, without quotes. The evaluator treats an expression consisting of a symbol very differently from one consisting of a string. Symbols are NOT self-evaluating---they're generally treated as variable references. Here's what happens:

Scheme> foo
ERROR---UNBOUND SYMBOL FOO

Why does this happen? When we type foo at the prompt, we are in effect asking the Scheme system to look up the value of the variable foo and print the result. Unless we have defined a variable named foo, however, that's meaningless. Foo is just a name, but it is not yet the name OF anything. To make the name foo be a global variable name, we use a DEFINE form, like so:

Scheme> (define foo 1)
FOO

What we've done here is really THREE separate things: 1) we've declared the symbol foo to be a global variable name, 2) we've BOUND that name to a piece of storage, and 3) we've put the value 1 into that piece of storage. The distinction between 1 and 2 (declaring and binding) is not obvious in the case of a "top-level" variable---that's Scheme's equivalent of a global variable, and it's what we just created---but it will be clearer in the case of local variables, which we'll get to later.
Note that the read-eval-print loop printed FOO, the name of the variable we defined. That's because a definition is an expression, and must return a value. Definitions aren't usually done for value, though---they're done for side effects, namely the creation of the binding and the initialization of that binding's value. Most implementations of Scheme simply return the name of the symbol being bound as the value of the define expression. That's handy, because it gives the user some feedback---seeing FOO lets you know that the variable got defined. (Some implementations return some other value from a definition expression, though, so you shouldn't count on what a define returns if you want your code to be portable.)

(By convention, when we type a symbol name, we will use lower case. Scheme systems typically convert all symbol names to upper case, or to lower case, when they're read. This makes them case insensitive---it doesn't matter if you type a symbol name in upper or lower case, or a mix, because the reader makes them all the same. It is not standardized whether symbols are all uppercase or all lowercase, however. We'll assume they're all uppercase. That's handy for our examples, because what we type will use lower case, and what Scheme prints out will use upper case, making it easier to follow what's going on. Strings are NOT converted to all one case, however---only symbols are.)

Having defined the top-level variable foo, we can now ask for its value:

Scheme> foo
1

Now that there's a piece of storage named foo, we can put any kind of thing we want in it, replacing the value 1. To change the value stored in a variable binding, we use a SET! form:

Scheme> (set! foo "Some string or other.")
"Some string or other."

A SET! expression looks up the binding (storage) associated with a variable name---NOT THE VALUE---and then stores the given value into it. Note that this means that SET!
does NOT evaluate its first argument before operating on it---if it did, the value of the first argument (in this example) would be the integer 1, and we would be trying to store a string into an integer, which makes no sense. SET! DOES evaluate its second argument---in this case, the string "Some string or other." Since the argument is a string, though, it evaluates to itself, so it doesn't matter in this case that the second argument is evaluated.

---

TOP-LEVEL BINDING ENVIRONMENTS

In Scheme, there's no such thing as a global variable, but for the time being we'll consider TOP-LEVEL variables to be global. They're variables that exist outside of any particular procedure. In understanding top-level binding, you should keep in mind that the SYMBOL TABLE holds only symbols, and it's only a hack to ensure uniqueness. Top-level variable bindings are kept in a different table, called the TOP-LEVEL environment. This table associates symbols with actual pieces of storage. These tables may be very similar, but their functions are crucially different.

When you interact with the Scheme system, asking it to evaluate expressions and print the results, these requests are interpreted relative to a VARIABLE BINDING ENVIRONMENT, or binding environment (or just environment) for short. All that means is that when a variable is evaluated, the system has to know how to find the piece of storage the variable name refers to. Typically, when you're just typing stuff at the read-eval-print loop, the system just looks in the top-level environment to look up the bindings of variables. The read-eval-print loop keeps a pointer to the binding environment it uses to interpret variable names. For the moment, assume that this just points to the top-level environment, and searches for variables just search that table. (As we'll see later, declaring and using procedure arguments and local variables changes this a bit.
We'll also eventually see that you can have multiple top-level environments, and change the meanings of any number of variable names by simply telling the system to use a different table to look things up.)

---

A NOTE ON IDENTIFIERS

In Scheme, symbols are used as identifiers, and symbols can have fairly weird names. The main requirement is that they start with a letter and don't have spaces in them. They CAN have a variety of other characters in them, like hyphens, <, >, =, etc. So don't be surprised if you see a symbol named something like big-long-name. It's common practice to use hyphens to separate the words in a long name consisting of a phrase jammed together. This is possible because of the extremely simple syntax of Scheme, where you put parentheses and/or spaces around subexpressions rather than relying on a fancy tokenizer and parser. [In C, you might write a*b-c/d, and expect that to be parsed as (a * b) - (c / d). In Scheme, you have to use spaces and parentheses: (- (* a b) (/ c d)).] It's something that Lisp folks love and some other people despise.

---

EXPRESSION EVALUATION

So far, we've seen how a few simple kinds of expressions are evaluated---strings, symbols, integers, definitions, and set! expressions. Now we can look at how more complex expressions are constructed from these, and how they are evaluated.

A very common kind of expression is a COMBINATION, also known as a procedure call expression. Usually, when you see an expression of the form (<procedure> <argument> ... ), it's a combination (procedure call). So, for example, (foo bar baz) means to call the procedure named foo with the arguments bar and baz. Bar and baz are just symbols, which are evaluated in the usual way, namely by treating them as variable names and fetching the values stored in their bindings. Expression evaluation is thus a recursive process---to compute the values of the arguments to a procedure call, we just evaluate those subexpressions.
If they, in turn, are complicated expressions, we'll have to evaluate their arguments, and so on. So if we evaluate an expression

Scheme> (+ (- b 3) (- a 6))

the first argument expression, (- b 3), is evaluated, which requires evaluating ITS arguments, the variable b and the integer 3. Suppose the value fetched from variable b is 7---then the procedure - is called with 7 and 3 as its arguments. It returns the value 4, which will be used as the first argument to +. We next evaluate the expression (- a 6), which likewise looks up the value of a, and evaluates 6. Supposing a's value is 8, we'll pass 8 and 6 as arguments to the procedure -, and get back 2. Now we pass the values 4 and 2 to the procedure +, which returns 6 as the value of the original expression.

ORDER OF EVALUATION

The above description of evaluating a combination is not quite right, in that there's no guarantee that the first argument expression is evaluated first. A Scheme implementation is allowed to evaluate arguments left to right, right to left, or in any old order it feels like at the time. Don't write code that depends on the order in which procedure arguments are evaluated. (Note that some special forms like IF and AND do guarantee to evaluate their subexpressions left to right; these are not combinations, however.)

PROCEDURE NAMES, VARIABLES, and FIRST-CLASS PROCEDURES

One thing we've glossed over here is where the procedures come from, and how the FIRST subexpression of a combination is evaluated. In most programming languages you're probably used to, procedure names are just procedure names, and they're very different from variable names. Not so in Scheme. In Scheme, procedures are FIRST-CLASS. That means that a procedure is just an object on the heap, like a pair or a symbol, but with the very special property that it can be APPLIED, i.e., called with argument values to operate on.
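As a quick sketch of what first-class means, a procedure can be stored in an ordinary variable and applied later (my-op is just an illustrative name):

   Scheme> (define my-op +)
   MY-OP
   Scheme> (my-op 2 3)
   5

The value of the variable + (a procedure object) is copied into my-op's binding, so applying my-op applies that very same procedure.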
Procedure variables also share a UNIFIED NAMESPACE, meaning there's no difference at all between a procedure name and a variable name---procedure names just happen to name variables whose values are procedures. (This is different from most dialects of LISP, and it's one of the things that makes Scheme cleaner than most LISPs.) When we evaluate an expression like (+ foo bar), + is really just a variable name, like foo and bar. When we evaluate a combination, the procedure name is evaluated just like the arguments, presumably returning a procedure object as its value. This procedure is applied to the argument values computed by evaluating the other subexpressions.

Note that it's possible to screw up if you try to apply a variable's value to some arguments, and that value happens NOT to be a procedure. This seldom happens in practice, but it's easy to do if you want to:

Scheme> (set! + "Some string value.")
"Some string value."
Scheme> (+ 2 3)
ERROR: Attempt to apply non-procedure: "Some string value." 2 3

What this illustrates is that + is just a variable name, and the value of its binding can be changed with SET!. An attempt to apply a non-procedure object (in this case a string) is a run-time error. Notice that (the usual value of) + is a built-in Scheme procedure, but you can still redefine + just by changing the value of the variable +. The only thing that's special about the variable (and procedure) is that it exists when the system starts up. You can clobber it just like you can a variable that you create with a define expression.

COMPLEX EXPRESSIONS AS PROCEDURE "NAMES"

Notice that we said that the procedure name position is treated as a subexpression, just like the arguments, and it just has to return a procedure value when evaluated. This is really true, and one of the consequences is that a procedure "name" needn't be a name at all---it can be an arbitrary expression that finds or creates a procedure and returns it to be applied.
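As a small sketch of this, here's a hypothetical procedure pick-op that returns one of two built-in procedures, used directly in the procedure position of a combination:

   Scheme> (define (pick-op add?) (if add? + -))
   PICK-OP
   Scheme> ((pick-op #t) 10 3)
   13
   Scheme> ((pick-op #f) 10 3)
   7

The inner expression (pick-op #t) is evaluated first, yielding a procedure, and that procedure is then applied to 10 and 3.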
This is common in implementing novel control structures, where you may want a fancy lookup procedure to find the procedure that's most appropriate for a particular situation, depending on run-time information:

   ((lookup-appropriate-procedure) some-argument some-other-argument)

Here, lookup-appropriate-procedure is presumably a procedure that returns the procedure we really want; the combination then applies THAT procedure to the arguments.

--- SPECIAL FORMS AND QUOTE

Earlier we said that USUALLY when you see an expression of the form ( ... ), it's a combination, and that combinations are evaluated in a very regular way---by evaluating each subexpression, and applying the result of the first to the results of the rest. There are certain kinds of expressions that look syntactically like combinations, but aren't, and a major part of learning Scheme is getting the hang of which things are NOT procedure calls. These expressions are called SPECIAL FORMS.

Recall that SET! does not evaluate its first argument, which has to be treated specially. SET! is not a procedure, like + or CONS---it is a special form. Special forms are just expressions with certain keywords in the "procedure name" position, which do something different from the normal argument evaluation and calling.

Other special forms we've already seen are BEGIN, DEFINE, and IF. Notice that a BEGIN forces its subexpressions to be evaluated in order (unlike the arguments to a procedure call), and just returns the last value. DEFINE does not evaluate its first argument (the variable name), because the variable doesn't exist yet when the define is executed. IF always evaluates its first subexpression, and only evaluates one of the other two. If IF were a procedure, and therefore its arguments were always evaluated, it would not implement the proper semantics---both the "then" branch and the "else" branch would be executed, perhaps with unintended side effects as well as increased run times.

Another important special form is QUOTE.
Quote is a very simple special form, which just returns its one subexpression without evaluating it. For example, suppose we have a symbol foo, but no variable named foo---we're just using the symbol foo as a data object, not as a name for a piece of storage. If we want to create a variable bar whose value is the symbol foo, we can't do this:

   Scheme> (define bar foo)
   ERROR: unbound symbol foo

Instead, we do this:

   Scheme> (define bar (quote foo))
   BAR

This way, when the define expression's second argument (quote foo) is evaluated, quote just returns the symbol foo, and the evaluation stops there. Now if we query the value of the variable bar, we get back the symbol foo:

   Scheme> bar
   FOO

QUOTE is very important, in that it allows you to determine when evaluation stops, and when values are to simply be interpreted as LITERAL data, i.e., not standing for something else.

Suppose we want to create a list of the integers 1, 2, and 3, and store it in the variable foo. The Lisp-style list containing those numbers can be written (1 2 3), but that also looks like a combination. (It looks like an attempt to treat 1 as a procedure and apply it to 2 and 3---not a great idea.) So rather than doing this:

   Scheme> (set! foo (1 2 3))
   ERROR: attempt to apply non-procedure: 1 2 3

we do this:

   Scheme> (set! foo (quote (1 2 3)))
   (1 2 3)

This idiom is extremely common, so there's a little bit of assistance from the reader to let you avoid writing (quote ...) over and over again. The single quote character ' can be used as a shorthand, just before a quoted expression, i.e., we can write '(1 2 3) instead of (quote (1 2 3)).

SELF-EVALUATING objects are also a nice hack to avoid needing to write QUOTE all the time. There's no fundamental semantic reason for things like integers and strings to evaluate to themselves, or to anything at all.
It's just convenient that we don't have to put quote in front of every integer or string we use as a literal. We could write

   (+ (quote 1) (quote 2))

or equivalently

   (+ '1 '2)

but self-evaluation of integers lets us write things the obvious way:

   (+ 1 2)

--- SPECIAL FORMS ARE NOT PROCEDURES

In Scheme, procedures are first-class objects, and the arguments to a procedure are always evaluated before the procedure is entered. This makes life easier for the compiler---it always knows how to compile a call to a procedure, because they all work the same way. (If you're familiar with fexprs from older dialects of Lisp, forget that stuff for now.) This makes it possible for Scheme compilers to generate good, fast code---they don't have to compile more general and slower code that can handle different patterns of argument evaluation.

Special forms are NOT FIRST-CLASS. You can't take a pointer to IF, for example---there's no variable named IF, so you can't evaluate it like a variable. Every special form name is a special keyword understood in a special way by the compiler. (In some implementations, you can shadow special form names by defining a variable with the same name, but that's not portable, and it's usually not a good idea.) Note that this doesn't change the fact that there's a symbol IF that can be used as data---just don't try to use it as the name of a variable. By the same token, you can't redefine a special form by assigning a new value to its name's binding---no such binding exists.

In Scheme, there are only a few special forms, so compilers only have to understand a few constructs.

--- MACROS

This is just for people who know Lisp macros (if you don't know about macros, don't worry about it yet): If you've used Lisp before, you're probably familiar with macros. Forget that stuff for now. We'll see later how a macro facility allows you to effectively define new special forms in terms of built-in special forms, but in a way that doesn't make life difficult for the compiler.
This is not something to be done casually, because you're changing the syntax of the language---a more profound (and error-prone) thing to do than defining a new procedure. We'll also see why Lisp-style macros are dangerous and error-prone. The Scheme standard defines an optional macro facility that avoids the dangers of Lisp-style macros.

--- THINGS TO BE VERY CLEAR ABOUT

When trying to understand expression evaluation in Scheme, it's important to remember that the evaluation of combinations is very simple and uniform---all subexpressions are evaluated and then the procedure is applied to the arguments (called). You need to keep in mind, however, that this regularity is made somewhat less obvious by the presence of self-evaluating objects. (It sometimes looks like some arguments aren't being evaluated, but they are---they're just evaluated trivially to themselves.)

It's also important to know a few special forms, and how they treat their arguments. Besides BEGIN, DEFINE, IF, QUOTE, and SET!, you only need to know a couple more to really understand how Scheme works: LET and LAMBDA. (There are another few that turn out to be handy, but they can easily be understood in terms of this small set.) Other than a handful of special forms, just about everything else in Scheme is a procedure, and built-in procedures are called in the same way as user-defined procedures.

Special forms introduce irregularity into the evaluation of expressions---some subexpressions are evaluated, some aren't. In the case of IF, this is necessary to implement the intended functionality---it would not work otherwise.

--- LAMBDA

Procedures in Scheme are created with the LAMBDA special form.

   (define double
     (lambda (x) (+ x x)))

Here we're DEFINEing a variable binding named double, and we're executing a lambda special form to create a procedure which will be the binding's initial value.
The expression (lambda (x) (+ x x)) means "create a procedure which will take an argument, call it x, and return the value of the expression (+ x x)." The first subform of a lambda expression (in this case, (x)) is a list of argument names. The rest of the expressions are the BODY of the lambda---a sequence of expressions to be evaluated. The value of the last expression in the sequence is returned as the return value of the procedure. (In this case, that's the value of (+ x x), which is the only expression in the sequence.)

--- PROCEDURE DEFINITION SYNTAX

There's another syntax for procedure definitions, which we used before, and which is handy for creating bindings and procedures. This is a variation of the DEFINE syntax that lets you leave out the "lambda" form:

   (define (double x)
     (+ x x))

Note that unlike a simple variable declaration, the first subform is a LIST (in this case (double x)), not just a symbol. This list has the procedure name first, and then the argument names, in order. The idea behind this syntax is that it's similar to the way you call a procedure, e.g., (double 22). Even if there are no arguments to the procedure, there must be parentheses around the procedure name, to indicate that it's a procedure. That distinguishes it from a plain variable definition. (We said all that earlier.)

Note that THESE TWO SYNTAXES ARE EQUIVALENT. Either way, you're creating a binding for double, and a first-class procedure with an argument x whose body is (+ x x). When you're just writing simple procedures, it's convenient to use the DEFINE variant syntax. You should always be aware, however, that you're defining a regular variable AND evaluating a lambda to get an initial value for it. You can still clobber the binding's value (just like any other variable's) by assigning a new value into it, etc.

--- BEGIN allows you to write SEQUENCE expressions.
BEGIN is a simple but handy special form, which allows you to write one expression that evaluates several subexpressions. For example,

   (begin (foo)
          (bar)
          (baz))

is an expression that sequentially calls foo, then bar, then baz. After evaluating the subexpressions, the value of the last subexpression is returned. In the above example, the value of the call to baz is returned as the value of the whole BEGIN expression.

(In terms of sequencing, (begin ...) is a lot like Pascal's begin ... end, or C's more terse {...}. We don't need to say END because Scheme's syntax is simple prefix expressions. The closing parenthesis (the one that matches the opening parenthesis before the word BEGIN) ends the block.)

Note that sequences show up in other special forms. The expressions in the body of a procedure are executed sequentially, too. And just as a BEGIN expression returns the value of its last subexpression, a procedure returns the value of its last subexpression.

--- COND is another way of writing conditional expressions

COND is another special form, which you can view as a fancy version of IF. Recall that an if expression has three subexpressions, so that (if a b c) returns the value of b if a is true, and c if a is false. (As with IF, only one value counts as false, and that's the unique boolean object #f; all other values, including the empty list, count as true values.)

Sometimes you want to have a sequence of tests, where you do one test and take an action if it's true, but if it's false you want to do another test, and so on. In many languages that's done by saying something like

   if test1 then action1
   else if test2 then action2
   else if test3 then action3
   else action4
   endif

In Scheme, we do that with COND, where we pair each test and its action by enclosing them in parentheses:

   (cond (test1 action1)
         (test2 action2)
         (test3 action3)
         (#t action4))

Notice the last test is the true value #t, which is always true.
That ensures that action4 will always be evaluated if all of the other tests fail, which is like an ELSE clause.

One of the nice things about COND is that instead of grouping one test with one expression, we can group a test with a SEQUENCE of expressions, sort of like a begin, e.g.,

   (cond (test1 action1a action1b)
         (test2 action2)
         (test3 action3a action3b action3c))

In this case, if test1 returns any true value (i.e., anything but the false object #f), then action1a and action1b will be evaluated in sequence, and the value of action1b will be returned as the value of the COND expression. If test1 fails (returns #f), then test2 will be evaluated; if it returns a true value, action2 will be evaluated and its result returned as the value of the COND expression. Likewise, if test2 fails, test3 will be evaluated, and if it returns a true value, action3a, action3b, and action3c will all be evaluated, and the last value will be returned as the value of the COND expression.

(In this case, we didn't put in a final else clause whose test is just #t. If all of the tests evaluate to #f, then NO action will be taken, and the return value of the COND is UNDEFINED. That is, it'll be some value, but you should not count on what value it is---don't use the result in any further computation.)

Notice that a COND is exactly equivalent to a set of NESTED IF expressions. The last example could be written this way:

   (if test1
       (begin action1a action1b)
       (if test2
           action2
           (if test3
               (begin action3a action3b action3c))))

COND therefore doesn't let you do anything you can't do with nested IFs (and BEGINs to make sequences). The advantage of COND is that it gives you a nice syntax where you can express a sequence of tests in a clear way, and don't have to write a lot of BEGINs to group sequential actions together.

There's a name for this kind of construction that could be written in terms of simpler constructions: SYNTACTIC SUGAR. COND doesn't let you do anything new, but it provides a nicer (sweeter) syntax for the same thing, so that your code is easier to read.
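Here is a small worked example of the equivalence. The procedure names sign and sign2 are invented for illustration.

```scheme
;; A three-way conditional written with COND:
(define (sign n)
  (cond ((< n 0) -1)
        ((> n 0) 1)
        (#t 0)))

;; The same thing written as nested IFs:
(define (sign2 n)
  (if (< n 0)
      -1
      (if (> n 0)
          1
          0)))

(sign -5)    ; => -1
(sign 0)     ; => 0
(sign2 3)    ; => 1
```

Both definitions behave identically; the COND version just avoids the rightward drift of the nested IFs.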
So for simple two-way conditionals where each action is a single expression, use IF. For multiway conditionals, or conditionals where the actions are sequences of expressions, COND is a much clearer way to say the same thing.

--- A NOTE ABOUT PARENTHESES

When you're writing an expression like a COND expression, and each of the subexpressions is itself an expression that's wrapped in parentheses, you end up with lots of parentheses. That's the price we pay for Scheme's very simple syntax. Consider a COND where there are two tests, each of which is a call to a procedure:

   (cond ((foo) ...)
         ((bar) ...))

and consider what happens if each of the actions is a sequence of procedure calls:

   (cond ((foo) (baz) (quux))
         ((bar) (bleen) (fubar)))

That's a lot of parentheses, AND EVERY ONE OF THEM IS REQUIRED. But don't despair of getting them all in there in the right places. Learn how to use a text editor that can match parentheses for you. All decent editors have some feature for recognizing which parentheses match. LEARN TO USE THAT FEATURE, or you will not like Scheme (or Lisp).

(In the vi editor that comes with UNIX, the command to find matching parentheses is %, also known as shift-5 on most keyboards. If you position the cursor over a left parenthesis, it will search forward to find the matching right parenthesis and highlight it. If you position the cursor over a right parenthesis, it will likewise search backward to find the matching left one. This is very handy for checking to see if you've actually nested expressions the way you think you have. Other reasonable editors designed for programming have similar features.)

Most good editors designed for programming have more advanced features to support writing Lisp or Scheme programs, but parenthesis matching is absolutely essential, or you'll spend half your life trying to figure out where you left out a parenthesis.
--- A NOTE ABOUT INDENTING

When you write your code, it's important that you put the parentheses in the right place so that the Scheme interpreter or compiler can understand what you wrote. It's equally important that YOU and other people looking at your code can understand what you wrote. There are some well-established rules for indenting your code so that the structure is obvious to anyone familiar with the language.

Look at the file indent.txt for an explanation of Scheme indenting rules. Right now, go look and see how you're supposed to indent COND and IF and DEFINE expressions. Later, as you learn about LET and LAMBDA and LETREC, go back to indent.txt and see how to indent those kinds of expressions, too.

--- NOW IT'S TIME TO DO SOME SCHEME PROGRAMMING

This would be a good point to work through the first few Scheme programming examples in prgex1.txt, up through #8.

--- LET

Let is a special form whose function is to create LOCAL VARIABLE BINDINGS, and evaluate expressions in the resulting LOCAL BINDING ENVIRONMENTS. You've seen local binding environments in other languages before. When you enter a procedure, for example, the arguments' local names SHADOW global variable names---the mapping from names to chunks of storage changes. Similarly, in C or Pascal you've probably seen blocks with local variables of their own, e.g., in C:

   ...
   {
     int x = 10;
     int y = 20;
     foo(x,y);
   }
   ...

Here we've got a block (inside curly braces) where local variables named x and y are created. (The same thing can be done with begin...end blocks in Pascal.) Inside the block, all references to the variables x and y refer to these new local variables. When execution reaches the end of the block, these variable bindings cease to exist, and references to x or y will again refer to whatever they did outside the block (perhaps global variables, or block variables of some intermediate-level block).
In this case, all that happens inside the block is a call to the procedure foo, using the values of the block variables, i.e., 10 and 20. In C or Pascal, these temporary variables might be allocated by growing the stack when the block is entered, and shrinking it again when the block is exited.

In Scheme, things are pretty similar. Blocks can be created with LET expressions, like so:

   ...
   (let ((x 10)
         (y 20))
     (foo x y))
   ...

The first subexpression of the LET is the variable list, which in this case has two clauses, (x 10) and (y 20). The first clause says that the LET will create a variable named x whose initial value is 10. A let variable list can contain any number of clauses, creating any number of LET variables. These are very much like the name and value parts of a define form.

The rest of the LET is a sequence of expressions, called the LET BODY. The body of the let is very much like a BEGIN expression---the expressions are simply evaluated in order, and the value of the last expression is returned as the value of the whole LET expression. (The fact that this value is returned is very handy, and will be important in examples we use later.)

One difference between C or Pascal blocks and Scheme LETs is that the variable bindings don't necessarily cease to exist when the LET is exited, and the bindings therefore can't be allocated on a stack in the general case. (The reason for this will become clear when we talk about LAMBDA and CLOSURES.)

One way to visualize the creation of block variables is to see it as the creation of a new table mapping names to values. Except for the new variables, the table is the same as the one that was in use when the block was entered. The LET expression EXTENDS this environment with mappings for the let variables. Suppose we typed the above expression at the Scheme prompt, when we were just doing the usual expression evaluation in the usual top-level environment.
The interpreter maintains a pointer to the "current environment" when evaluating an expression. This pointer always points to the environment the currently-executing code must execute in, i.e., the variable bindings it must see for the variables it uses.

                +-----+         +------+-----+
   [curr-envt]  |  +--+-------->| CAR  |  +--+----> ##
                +-----+         +------+-----+
                                | CONS |  +--+----> ##
                                +------+-----+
                                | +    |  +--+----> ##
                                +------+-----+
                                   *  *  *
                                +------+-----+
                                | FOO  |  +--+----> ##
                                +------+-----+

After entering the let and creating the bindings for x and y, the interpreter changes the current-environment pointer to point to the resulting new environment. This is typically implemented by representing the environment as a chain of tables, rather than a simple table. The newest table is searched first, and so on down the chain, to find the appropriate bindings. This environment chain is used as a pointer-linked stack, for the most part, with new environments being pushed onto the stack when a LET is entered, and popped off the stack when a LET is exited.

                +-----+         +-------+-----+
   [curr-envt]  |  +--+-------->|[scope]|  +--+----+
                +-----+         +-------+-----+    |
                                |   x   |  10 |    |
                                +-------+-----+    |
                                |   y   |  20 |    |
                                +-------+-----+    |
                                                   |
                  +--------------------------------+
                  |
                  +----->+-------+-----+
                         | CAR   |  +--+----> ##
                         +-------+-----+
                         | CONS  |  +--+----> ##
                         +-------+-----+
                         | +     |  +--+----> ##
                         +-------+-----+
                            *  *  *
                         +-------+-----+
                         | FOO   |  +--+----> ##
                         +-------+-----+

The link that connects the two tables is called a SCOPE link. It reflects the nesting of naming scopes in the program. In this case, when a variable is referenced inside the let, the search for a binding begins at the new (small) table. If it is not found, the search follows the scope link to the next table and looks there. This can continue for as many levels as there are nested scopes in the program.

--- LAMBDA AGAIN

In Scheme, procedures are often created by using the LAMBDA special form. Lambda creates an ANONYMOUS FIRST-CLASS procedure, which is called a CLOSURE.
The procedure is ANONYMOUS because lambda simply returns a pointer to the procedure (closure) object on the heap. Scheme procedures are called CLOSURES because a procedure includes an environment pointer along with the code---a Scheme procedure remembers the environment where it was defined, and when it executes, that environment is used to resolve variable names. The procedure is CLOSED in the environment, meaning that the variable names are fixed once and for all when the procedure object is created. (Scheme procedure values are therefore more than just function pointers.)

Recall that binding environments are created on the heap. When we create a closure, it keeps a pointer to the binding environment where it was created, so that binding environment is guaranteed to continue to exist as long as the procedure does.

Suppose we type the following expression at the Scheme prompt, to be interpreted in a top-level environment:

   Scheme> (let ((count 0))
             (lambda ()
               (begin (set! count (+ count 1))
                      count)))
   ##

Evaluating this LET expression first creates a binding environment with a binding for count. The initial value of this binding is 0. In this environment, the lambda expression creates a closure. When executed, this procedure will increment the count, and then return its value. (Note that the procedure is not executed yet, however---it's just created.)

This procedure, returned by the lambda expression, is also returned as the value of the let expression, because a LET returns the value of its last body expression. The read-eval-print loop therefore prints a representation of the (anonymous) procedure. Unfortunately, we didn't DO anything with the value, like give it a name, so we can't refer to it anymore, and the garbage collector will just reclaim it.

Now suppose we want to do something with it. We'll create a new variable, and use the above let expression to create a new environment and procedure, just like before.
   Scheme> (define my-counter
             (let ((count 0))
               (lambda ()
                 (set! count (+ count 1))
                 count)))
   MY-COUNTER

Now we have a top-level binding of my-counter, whose value is the procedure we created. The procedure keeps a pointer to the environment created by the LET, which in turn has a pointer to the top-level environment, thus:

                                 closure
                               +------+------+
               +---------------+ code | envt |
               |               +------+---+--+
               |                          |
               v                          v   environment
      +-------------------+        +-------+-----+
      | (set! count       |        |[scope]|  +--+----+
      |   (+ count 1))    |        +-------+-----+    |
      | count             |        | count |  0  |    |
      +-------------------+        +-------+-----+    |
                                                      |
               +--------------------------------------+
               |
               v   top-level environment
        +------------+-----+
        | CAR        |  +--+--> ...
        +------------+-----+
        | CONS       |  +--+--> ...
        +------------+-----+
           *  *  *
        +------------+-----+
        | my-counter |  +--+--> (the closure above)
        +------------+-----+

Now if we call the procedure my-counter, it will execute in its own "captured" environment (created by the let). It will increment the binding of count in that environment, and return the result. The environment will continue to exist as long as the procedure does, and will store the latest value until next time my-counter is called:

   Scheme> (my-counter)
   1
   Scheme> (my-counter)
   2
   Scheme> (my-counter)
   3

Notice that if we evaluate the let form again, we will get a NEW environment, and a new procedure that will increment and return its count value---in effect, each procedure has its own little piece of state which only it can see (because only it was created in that particular environment).

If we want, we can define a procedure that will create new environments, and new procedures that capture those environments---we can generate new counter procedures just by calling that "higher-order" procedure. (A HIGHER-ORDER PROCEDURE is just a procedure that manipulates other procedures.)
Each time make-counter is called, it will execute a let, creating an environment, and inside that it will use lambda to create a counter procedure.

   Scheme> (define (make-counter)
             (let ((count 0))
               (lambda ()
                 (set! count (+ count 1))
                 count)))
   MAKE-COUNTER

Each of the resulting procedures will have its own captured count variable, and keep it independently of the other procedures.

Make sure you understand that the above procedure definition could have used an explicit lambda to create the procedure make-counter, rather than the special procedure definition syntax:

   Scheme> (define make-counter
             (lambda ()
               (let ((count 0))
                 (lambda ()
                   (set! count (+ count 1))
                   count))))
   MAKE-COUNTER

You may actually find this easier to understand, because it shows you exactly what's going on: binding make-counter, and creating a procedure (with the outer lambda) that, WHEN EXECUTED, will evaluate a let to create an environment, and a lambda (the inner one) to create a procedure that captures it.

Now we'll call the procedure created by the above definition, three times, and each time it will create a new procedure:

   Scheme> (define c1 (make-counter))
   C1
   Scheme> (define c2 (make-counter))
   C2
   Scheme> (define c3 (make-counter))
   C3

Now we'll call those procedures and look at their return values, to illustrate that they're independent counters:

   Scheme> (c1)
   1
   Scheme> (c1)
   2
   Scheme> (c2)
   1
   Scheme> (c2)
   2
   Scheme> (c1)
   3
   Scheme> (c1)
   4
   Scheme> (c3)
   1

Neat, huh? The combination of block structure (local environments) with first-class procedures (closures) allows us to associate state with procedures. Garbage collection makes this very convenient, because we know that the environments will hang around as long as the procedures do.

If you're familiar with object-oriented programming, you may notice a resemblance between closures and OO "objects". A closure associates data with a procedure, whereas an object associates data with multiple procedures.
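Closure generators can also take arguments, so that each closure captures a different binding. The following make-adder is a classic example in the same style as make-counter; it is not defined elsewhere in these notes.

```scheme
;; Each call to make-adder creates a new environment binding n,
;; and returns a closure that captures that environment.
(define (make-adder n)
  (lambda (x) (+ x n)))

(define add3  (make-adder 3))
(define add10 (make-adder 10))

(add3 4)     ; => 7
(add10 4)    ; => 14
(add3 100)   ; => 103
```

Here add3 and add10 are independent closures over different bindings of n, just as c1, c2, and c3 are independent closures over different bindings of count.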
After we get to object-oriented programming, we'll explain how OO facilities can be implemented in Scheme using closures.

--- LAMBDA IS CHEAP, AND CLOSURES ARE FAST

It may seem that lambda is an expensive operation---after all, it creates procedure objects on the fly. At first glance, you might think that executing lambda would require a call to the compiler each time. This is not the case, though, and lambda is actually a fairly cheap constant-time operation.

Notice that the code of a lambda expression is known at compile time---each time the lambda is executed at run time, it will create a new closure, and may capture a new environment, but the expression closed in that environment is determined solely by the body of the lambda expression. A compiler for Scheme will therefore compile the lambda like any other procedure, when it compiles the enclosing procedure. So, for example, when our example procedure make-counter is compiled, the compiler will also compile the code for the lambda body. This code will be kept around for use by make-counter.

The actual run-time code for lambda just consists of fetching the address of the code, and the current environment pointer, and putting them in a closure object on the heap. Lambda is therefore about as fast as CONS---all that's really happening is the creation of the closure object itself, not anything expensive like calling the compiler at run time.

--- SCHEME HAS NO "EVAL"

If you've used Lisp before, you may be familiar with a function called EVAL. EVAL takes an arbitrary expression, represented as a list, and evaluates it. The interesting thing about this capability is that the expression does not have to exist at compile time---it can be computed at run time and handed to eval, which will run it. This is much like the evaluator used by the top-level loop in an incremental programming environment.
While almost all Scheme implementations have an EVAL function, it is not standardized, and programs aren't supposed to use it. The evaluator used by the read-eval-print loop is only supposed to be used for incremental program development (i.e., letting you type in expressions, etc.)---it's NOT supposed to be used by programs themselves. A program that contains explicit calls to EVAL is NOT A LEGAL SCHEME PROGRAM. There are very good reasons for this restriction, which we'll get to in a while.

Suppose for the moment, however, that the Scheme system exposes eval to the user (rather than just calling it itself in the read-eval-print loop).

   Scheme> (eval '(+ 2 3))
   5
   Scheme> (+ 2 3)
   5
   Scheme> (define (return2+3) (list '+ 2 3))
   RETURN2+3
   Scheme> (eval (return2+3))
   5
   Scheme> (eval (cons '* (cdr (return2+3))))
   6

If the above is not clear, here's what we did: First, we evaluated the expression (+ 2 3), represented as a list, by explicitly calling eval. Then we evaluated the same expression by just typing it at the read-eval-print loop, which implicitly calls eval. Then we defined a procedure return2+3 that creates a list of the form (+ 2 3) each time it's called. Then we called eval, with a call to return2+3 to compute its argument. Then we called eval with a more complicated expression to compute its argument---return2+3 is used to create a list, but then the head of the list is thrown away (using CDR to get the rest of the list) and the symbol * is prepended, creating the list (* 2 3), which is handed to eval for evaluation.

--- WHY EVAL IS THE WRONG THING (when used in normal programs)

Many people believe that EVAL is one of the greatest and most wonderful features of Lisp---it lets you do "anything you want" by constructing expressions at run time. Why did the Scheme designers leave it out? EVAL is exceedingly dangerous, and it makes it hard for compilers to generate good code. It's also usually slow.
The presence of EVAL implies that every program using it has to have an interpreter or compiler available at RUN TIME. It also means that the compiler can't predict what can happen at run time and what can't---ANYTHING can happen.

For example, if there's no EVAL, a compiler may be able to look at the source code for an entire program and notice that the variable + is never assigned to, and therefore the meaning of + never changes. In that case, it may generate much better code for +, compiling it "inline" like a C expression, rather than as a real procedure call. If the program uses EVAL, however, the compiler probably won't be able to do this---it can't be sure at compile time that the program won't construct a list of the form (set! + -) and feed it to the evaluator---redefining + and making the optimized code incorrect.

Scheme programs are supposed to be usable in a batch-style environment. After you develop and debug your program, you should be able to feed it to a compiler that compiles the whole thing, doing aggressive optimizations, and generates an efficient standalone executable file. EVAL undermines this by requiring part of the development system in application programs, and by inhibiting global optimizations.

These arguments aren't airtight---it's certainly possible to build a Common Lisp compiler that notices whether eval is EVER used in a program---if not, it can do global optimizations and leave the evaluator out of the executable file. (Uglier strategies are also possible.) There are other reasons for avoiding eval, however.

Another reason for avoiding eval is that it makes it impossible to examine the source code of a program and know how all of the parts interact---any function can be called, indirectly, by calling the evaluator with an arbitrary list as its argument. This goes against all the principles of modularity that make it possible to comprehend large programs.

Yet another reason for avoiding eval is that it's SLOW and UNPREDICTABLE.
Evaluating an arbitrary expression at run time may take much longer than evaluating the same expression by including it textually in a program. If the system uses an interpreter for eval, eval may evaluate expressions an order of magnitude (or more) slower than executing the equivalent compiled code. If it calls a compiler to compile the expression, the call to the compiler may be much more expensive than the actual execution of the expression.

One of the things that people have learned over the last twenty or so years is that MOST of the time when people use EVAL, they'd be better off doing something else, like creating closures. Closures have some of the nice characteristics of EVAL---you can create procedures at run time, customized for particular situations. Since closures capture environments, they can be customized by having them depend on the data in those environments. This is enough flexibility for most situations where people used to use EVAL. If that's sufficient, it's preferable to eval because you can still look at the code for the closures and have some idea what kinds of things they're likely to do. The compiler can look at the code, too, and generate optimized machine code, at compile time, once and for all.

--- IMPLEMENTING EVAL

Despite the fact that normal programs shouldn't use eval, you should understand how a simple interpretive evaluator can be implemented. This will ensure that you really understand what Scheme programs mean, and is a first step toward understanding how Scheme is compiled.

We will be implementing a subset of Scheme in Scheme. Scheme is a handy language for implementing Scheme, because we can implement the parts we want to implement ourselves, and steal the other features from the underlying Scheme system. This is also called SNARFING---we can take a feature from the underlying Scheme system and just use it, if it's not one of the features we care about implementing ourselves.
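As a small illustration of this point (a sketch of our own, not code from these notes): instead of building an expression list and handing it to EVAL, we can build a closure that captures the data it needs:

```scheme
; EVAL-style thinking: construct the list (+ x 3) and evaluate it later.
; Closure-style: construct a procedure that does the addition directly.
(define (make-adder n)
  (lambda (x) (+ x n)))        ; the closure captures N from its environment

(define add3 (make-adder 3))   ; a procedure "customized" at run time
(add3 2)                       ; => 5, with no EVAL anywhere
```

The closure gives us run-time customization, but the compiler saw the LAMBDA at compile time and can generate ordinary optimized code for it.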
It lets us illustrate the parts we want to illustrate, by doing things in our interpreter, and ignore the parts we don't care about at the moment.

--- A METACIRCULAR INTERPRETER

We'll explain a simple interpreter, which is just a procedure that accepts an expression as an argument and returns the value of the expression. Our interpreter will accept expressions in the usual Lisp-like list representation, so we can dispense with writing our own reader. For the time being, we'll steal a lot of features from the underlying Scheme system we're writing our interpreter in---for example, when our interpreter evaluates an addition expression, it will call the + procedure of the underlying Scheme system.

This is known as a METACIRCULAR INTERPRETER, which just means using an implementation of a language as a host environment to run another implementation of the same language, or a similar language. This is an important technique. It's often used as the first step in bootstrapping a new language; subsequent steps remove dependencies on the underlying system until the new implementation is free-standing. (For example, the system might eventually compile basic procedures like + to machine code, and cease calling the underlying system's + procedure to implement addition.)

Metacircular interpreters are also useful for prototyping new languages, because they make it very easy to change various aspects of the language. Since they can be written in portable languages, it's easy to send somebody the source for your language implementation, and let them experiment with it. Lisp is an excellent language for writing metacircular interpreters, and most Lisp interpreters start out being written in other Lisps. (The first implementation of Scheme was written in Lisp, and the first compiler for Scheme compiled to Lisp.)
--- EVALUATING SIMPLE ARITHMETIC EXPRESSIONS

The first version of our interpreter will just handle simple nested arithmetic expressions---the procedures +, -, /, and * applied to integers.

; First we create a BINDING ENVIRONMENT---we'll use an association list
; keyed by name symbol.  The values will be the equivalent procedures
; snarfed from the underlying language.  In this simple interpreter,
; the binding environment just maps a few variable names to places
; we can store a few pointers to procedures.  We SNARF the procedures
; from the underlying Scheme implementation.

(define our-binding-envt
  (list (list '+ +)
        (list '- -)
        (list '* *)
        (list '/ /)))

; To look up a binding, we use ASSQ, which searches an association list
; for the pair with the given key.

(define (our-lookup-value symbol)
  (assq symbol our-binding-envt))

; APPLYing a function consists of calling the equivalent procedure snarfed
; from the underlying Scheme.

(define (our-apply function arg1 arg2)
  (function arg1 arg2))

; To evaluate a combination, we'll call OUR-EVAL to evaluate the arguments,
; and we'll look up the function and apply that.  (Note that we assume
; the argument to OUR-EVAL-COMBO is a combination, i.e., a pair.)

(define (our-eval-combo expr)
  ; make sure it's got exactly two arguments
  (if (not (= (length expr) 3))
      (error "ARG COUNT ERROR IN EXPR" expr)
      ; evaluate subexpressions
      (let ((function   (our-eval (car expr)))
            (arg1-value (our-eval (car (cdr expr))))
            (arg2-value (our-eval (car (cdr (cdr expr))))))
        (our-apply function arg1-value arg2-value))))

; OUR-EVAL is the main dispatch for the evaluator.  Given an expression,
; it checks to see if it's self-evaluating.  (In this case, that just
; means checking to see if it's an integer.)  If so, it just returns that
; same value.  If not, it checks to see if it's a symbol; if so, it makes
; sure it's bound, and looks up the value.  Otherwise, it should be a
; pair representing a combination---if so, eval-combo is called to
; evaluate it recursively.
(define (our-eval expr)
  (if (integer? expr)
      expr
      (if (symbol? expr)
          (let ((binding (our-lookup-value expr)))
            (if binding                           ; if we found a binding for the var
                (cadr binding)                    ; extract the value from the binding
                (error "UNBOUND VARIABLE" expr))) ; else signal an error
          (if (pair? expr)
              (our-eval-combo expr)
              (error "ILLEGAL EXPRESSION" expr)))))

---

Notice how many features we've stolen from the underlying Scheme. Our evaluator expects a Scheme data structure as its input---rather than writing our own reader in Scheme, which would read a sequence of characters from the user's terminal (or a file) and turn it into a Scheme list, we just rely on the fact that the user is typing expressions into a Scheme system, which automatically constructs that data structure before handing it to OUR-EVAL. We rely on the fact that the underlying Scheme has its own reader and its own read-eval-print loop, so that we can just type an expression in that language and call our own EVAL procedure with an argument expression that's already been read. Now we can focus on what happens when the data structure is handed to OUR-EVAL. Later we can implement our own reader if we want to.

Similarly, we didn't bother to write our own procedures +, -, *, and /. We just arrange it so that when we want to perform those operations, we can just call the underlying Scheme system's procedures.

What we HAVE done is explicitly write a recursive procedure to evaluate nested expressions. OUR-EVAL looks at the expression it's given (in the form of a Scheme data structure) and analyzes it to decide what to do. If it's a simple expression (like a self-evaluating object), we do the simple thing to return the right value (like returning the value of a self-evaluating object as its value). In this very simple subset of Scheme, the only kind of self-evaluating object is integers.
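To make the dispatch concrete, here is how the interpreter above behaves on a few representative inputs (a usage sketch of our own; each expression is already in list form by the time OUR-EVAL receives it):

```scheme
(our-eval 7)                     ; => 7, integers are self-evaluating
(our-eval '+)                    ; => the addition procedure snarfed from
                                 ;    the underlying Scheme
(our-eval '(+ 2 3))              ; => 5, a simple combination
(our-eval '(* (+ 1 2) (- 9 5))) ; => 12, combinations nest recursively
```

Note that the last example exercises the whole recursive structure: OUR-EVAL-COMBO evaluates (+ 1 2) and (- 9 5) by recursive calls to OUR-EVAL before applying the multiplication procedure.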
If it's a more complicated expression, we take it apart and evaluate its subexpressions as necessary, and do whatever we're supposed to do for that kind of expression. In this very simple subset of Scheme, the only kind of complex expression is a COMBINATION, i.e., a procedure call expression, so we evaluate the argument subexpressions recursively and then apply the procedure to the argument values.

We've also put in error checking---if we find out that we're given an expression that isn't one of the kinds of things this interpreter understands, we call an ERROR routine to tell the user they entered an illegal expression.

---

The above interpreter is very simple, but it's a real interpreter. It does the main things any interpreter does---analyzing expressions and taking the appropriate actions. This is a recursive process that eventually bottoms out in simple actions like returning the value of a self-evaluating object or variable binding, or calling a procedure. An interpreter for the whole Scheme language can have the same basic structure---it just needs to be extended to recognize more different kinds of expressions and take more kinds of actions.

This simple little interpreter could be extended to handle more and more of the Scheme language, just by extending the main dispatch procedure OUR-EVAL to recognize more kinds of expressions, and do the appropriate actions for each one. The file eval.scm contains just such an interpreter. Like this one, it's a metacircular interpreter that steals some functionality from the underlying Scheme. Unlike this one, it handles all of the major interesting constructs in Scheme, including LET and LAMBDA. Have a look at eval.scm and see how it does more complicated things. The interpreter in eval.scm is also written in Scheme, and snarfs features like the reading of data structures and little operations like addition.
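For example (a sketch of our own, not the code in eval.scm), the pair-handling branch of OUR-EVAL could be routed through a procedure that recognizes the special form IF before treating a pair as an ordinary combination:

```scheme
; A hypothetical extension: OUR-EVAL would call OUR-EVAL-PAIR on pairs
; instead of calling OUR-EVAL-COMBO directly.
(define (our-eval-pair expr)
  (if (eq? (car expr) 'if)
      ; (if <test> <then> <else>): evaluate the test, then evaluate
      ; ONLY ONE of the two branches---this is exactly why IF must be
      ; a special form rather than a procedure call.
      (if (our-eval (cadr expr))
          (our-eval (caddr expr))
          (our-eval (cadddr expr)))
      ; otherwise it's an ordinary combination
      (our-eval-combo expr)))
```

This is the basic pattern for all special forms: the dispatcher checks the head of the list for a known keyword, and only falls through to the evaluate-everything-and-apply rule for ordinary combinations.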
The basic operations snarfed from the underlying system are called PRIMITIVES. That is, they are the basic starting point that we build up from to define more complicated things.

If we wanted to, though, we could write an interpreter with very much the same structure in C or assembly language. In that case, we'd have to do a lot more tedious coding of the little things; we'd have to write our PRIMITIVES ourselves rather than just stealing them wholesale. On the other hand, that's not really hard, either; for example, the code to add two tagged integers is just a few machine instructions.

If we wrote our interpreter in machine language, we'd also have to implement our own stack to hold onto environments and things during recursive calls to the evaluator. Notice that our recursive evaluator here is, in effect, SNARFING the stack of the underlying Scheme system---we don't have to explicitly use a stack to implement the traversal of the call graph, because we can implicitly use the underlying Scheme's stack by writing our interpreter in the obvious recursive way. Like implementing our own primitives, implementing our own stack would be a little tedious, but not difficult.

The important thing to notice is that we can divide the language into complex expressions, which we interpret with our recursive evaluator, and simple PRIMITIVES, which we just execute (either by calling procedures in the underlying Scheme or by executing little hand-coded operations we write in assembler or C). Given a small handful of primitives, we can glue them together by writing complicated expressions that are interpretable, and we can implement a whole programming language.

Later, we'll talk about how a similar division of labor can be used in a COMPILER. A compiler understands certain small, primitive operations, and knows how to glue them together to make bigger expressions.
--- INTERPRETERS AND PROGRAMS

It is important to understand interpreters, because many programs are really interpreters of some kind. For example, consider a database program that people can type queries (requests for information) into and get answers back. This is an interpreter that reads expressions (in the query language), analyzes them, and takes the appropriate action.

Now consider a database program that's not interactive, and just reads through a file of transactions to be performed (e.g., "move $1500 from account 216 to account 512"). Most people would probably think of the file (of transactions to be performed) as data, but you can also view the database program as an interpreter, and the transactions as expressions in its language---a language of database transactions.

Viewing programs in this way is often helpful---it helps you think about the important basic operations, and the ways that they can be composed to make larger, more interesting operations. By thinking of the input data format as a language, and trying to design it to be expressive and flexible, you often end up with a simpler, more effective framework for building your application.

As we just showed above with the arithmetic evaluator, building an interpreter doesn't have to be hard---especially if the interpreter doesn't have to be fast.
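A tiny sketch of that view (all names here are hypothetical, not part of these notes): a "transaction language" interpreter has exactly the same dispatch-and-act shape as OUR-EVAL:

```scheme
; Each transaction is a list, e.g. (move 1500 216 512) or (deposit 100 216).
; WITHDRAW! and DEPOSIT! are assumed helpers that update the account records.
(define (run-transaction txn)
  (let ((op (car txn)))
    (cond ((eq? op 'move)
           (withdraw! (caddr txn)  (cadr txn))   ; take amount from source
           (deposit!  (cadddr txn) (cadr txn)))  ; add amount to destination
          ((eq? op 'deposit)
           (deposit! (caddr txn) (cadr txn)))
          (else
           (error "UNKNOWN TRANSACTION" txn)))))
```

The dispatcher examines the head of each "expression," takes the appropriate action, and signals an error on anything it doesn't understand---the same skeleton as the arithmetic evaluator above.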