An Introduction to Scheme and its Implementation

Go to the first, previous, next, last section, table of contents.

The Reader

We won't write a whole reader for our interpreter, but I'll sketch how the reader works, and show a simplified reader.

(Our interpreter will just "cheat" the reader from the underlying Scheme system we're implementing it in, but it's good to know how we could write a reader, and it's a nice example of recursive programming.)

The reader is just the procedure read, which is written in terms of a few lower-level procedures that read individual characters and construct tokens, which read puts together into nested data structures. A token is just a fairly simple item that doesn't have a nested structure. For example, lists nest, but symbol names don't, strings don't, and numbers don't.

The low-level routines that read uses just read individual tokens from the input (a stream of characters). These tokens include symbols, strings, numbers, and parentheses. Parentheses are special, because they tell the reader when recursion is needed to read nested data structures.

(I haven't explained about character I/O, but don't worry--there are Scheme procedures for reading a character of input at a time, testing characters for equality, etc. For now, we'll ignore those details and I'll just sketch the overall structure of the reader.)

Lets assume we have a simple reader that only reads symbols, integers, and strings, and (possibly nested) lists made up of those things. It'll be pretty clear how to extend it to read other kinds of things.

Implementing `read`

read uses recursion to construct nested data structures while reading through the character input from left to right.

For example, the input character sequence

(foo 20 (baz))

will be read as a three-element list, whose first two elements are symbols foo and the number 20; its third element is another list, whose single element is the symbol bar.

read can also read simple things, like symbols and numbers, by themselves.

The data structures that read constructs are called s-expressions. An s-expression may be something simple like a string or a number, or a list of s-expressions. (Notice that this recursive definition covers arbitrarily deeply nested lists.)

The traditional term s-expression is very unfortunate. Technically an expression is a piece of a program, independent of how it's represented. An s-expression isn't really an expression at all--it's just a data structure, which we can choose to use as a representation of an expression in a program, or not.(6) Remember that the reader's job is only to convert textual expressions into handy data structures, not to interpret those data structures as programs. It's the evaluator that actually interprets data structures as programs, not the reader. That's why the read-eval-print loop hands the s-expressions returned from read to eval for evaluation.

I'll show a slightly oversimplified version of read, called micro-read. The main simplifications are that micro-read only handles a few basic types--symbols, positive integers, and lists--and we've left out most error-checking code. We assume that what we're reading is a legal textual representation of a Scheme data structure. We also haven't dealt with reading from files, instead of the standard input, or what to do when reaching the end of a file.

To make it easier to implement read, we'll use a helper procedure that reads a single token at a time, called read-token. Intuitively, calling read-token repeatedly will chop the input into "words." Then read can group these "words" together to form "phrases," which may describe complex data structures.

For example the following input character sequence

  (foo 1 (a "bar"))

will be chopped into the following tokens, one at a time, in a left-to-right scan of the input by repeated calls to read-token

  (
  foo
  1
  (
  a
  "bar"
  )
  )

Notice that left and right parentheses are tokens, even though they're written as single characters. You can think of them as special words that tell read where a new list starts and where it ends.

Given read-token, read must recognize nested structures---intuitively, where read-token recognizes individual words, read must recognize phrases, which may be nested. Each phrase corresponds to an s-expression that read must construct, and nested phrases correspond to nested s-expressions.

Most of the work of reading is actually done by read-token, which reads a single input token--e.g., a symbol, a literal string, a number, or a left or right parenthesis. That is, read-token performs lexical analysis (also known as scanning). That is, read-token reads a sequence of characters from the input until it recognizes a "word."

(Our little scanner will use the standard Scheme procedure read-char to read one character of input at a time, and also the predicate procedures char-alphabetic? and char-numeric?; these tell whether a character represents a letter or a number. We'll also use the Scheme character literal objects #\", #\(, and #\), which represent the double quote character, left parenthesis character, and right parenthesis character, respectively.)

;;; a scanner for a simple subset of the lexical syntax of Scheme
(define (read-token)
   (let ((first-char (read-char)))
      (cond ;; if first-char is a space or line break, just skip it
            ;; by calling self recursively
            ((whitespace? first-char)
             (micro-read))
            ;; else if it's a left paren, return the special
            ;; object that represents left parenthesis tokens
            ((eq? first-char #\( )
             '*left-paren-token*)
            ;; likewise for right parens
            ((eq? first-char #\) )
             '*right-paren-token*)
            ;; else if it's a letter, we assume it's the first char
            ;; of a symbol and call read-symbol to read the rest of
            ;; the chars in the symbol and return a symbol object
            ((char-alphabetic? first-char)
             (read-symbol first-char))
            ;; else if it's a digit, we assume it's the first digit
            ;; of a number and call read-number to read the rest of
            ;; the number and return a number object
            ((char-numeric? first-char)
             (read-number first-char))
            ;; else it's something this little reader can't handle,
            ;; so signal an error
            (else
             (error "illegal lexical syntax")))))

[ see handout with discussion of lexical analysis, state machines, etc. ]

The basic operation of read-token is to read a character from the input, and use that to determine what kind of token is being read. Then a specialized routine for that kind of token is called to read the rest of the characters that make up the token, and return a Scheme object to represent it. We represent identifiers tokens like foo as Scheme symbols, and digit sequences like 122 as the obvious Scheme number objects.

We represent left and right parenthesis tokens specially, because there's not an obvious Scheme object to represent them. (We could use the Scheme left and right parenthesis character objects, but that could cause trouble if we add the ability to read character literals. For now, we'll just use the Scheme symbol objects *left-parenthesis-token* and *right-parenthesis-token*, which is tacky but good enough for this example code.

[ see handout with complete code for the little lexer and reader ] So that you can use any number of whitespace characters between tokens, read-token skips any whitespace that occurs at the beginning of the input.

The helper routines for reading particular kinds of tokens are [see handout for actual code]:

read-symbol. If the character we read is a letter, we're reading a symbol, so we call read-symbol to finish reading it. (We pass it the character we read, since it's the first character of the symbol's print name.) read-symbol (not shown) just reads through more characters, saving them until it hits a character that can't be part of an identifier, e.g., whitespace or a parenthesis. Once it has read the characters that make up the symbol printname, read-symbol must obtain a pointer to the unique symbol object with that name; if there isn't one, it must be created. [put code here] When it finishes reading the whole print name of the symbol, read-symbol passes the list of characters to the built-in Scheme procedure list->string to create a Scheme string object with that sequence of characters. Then it passes that string object to the built-in Scheme procedure string->symbol. string->symbol checks the table of existing symbols to see if there's already a symbol with that printname. If so, it just returns a pointer to it. (This ensures that it never creates two symbol objects with the same name, and always returns the same symbol for a string with the same sequence of characters.) If a symbol with that printname doesn't exist, it constructs a symbol by that name, adds it to the table, and returns a pointer to that. (string->symbol ensures that there is only ever one symbol with a given printname.) Either way, the pointer to the unique symbol with that name is returned as the value from read-symbol.
read-number. If the character we read is a digit, we're reading a number, so we call read-number. (We pass it the first character we read, since that's the first digit of the number.) read-number just reads through successive characters, accumulating a list of character objects that represent digits. It stops when it encounters a character that can't be part of a number. (For our simple little subset, that's anything that's not a digit.) Then it passes this list to the standard Scheme procedure list->string, which returns a Scheme string object with that sequence of characters. That's passed in turn to string->number, which returns a Scheme number object that represents the corresponding number.

[put actual code from micrread.scm here?]

Implementing the `read` procedure

Given read-token, it's easy to implement read. read uses recursion to recognize nested data structures. It calls read-token to read the next token of input. If this is a normal token, e.g., a symbol or string, read just returns that. If it's a left parenthesis token, however, read constructs a list, reading all of the elements of the list up to the matching right parenthesis. This is done by another helper procedure, read-list.

To avoid confusion with the standard Scheme read procedure, we'll call our simplified version micro-read.

;;; Simplified version of read for subset of Scheme s-expression syntax
(define (micro-read)
   (let ((next-token (read-token))
      (cond ((eq? next-token '*left-parenthesis-token*)
             (read-list '()))
            (else
             next-token))))

Here's a slightly oversimplified version of read-list, (The main oversimplification is that we don't check for illegal syntax, like extra closing parentheses.)

The following code is off the top of my head. Zhu Qing has a nice working version... get it.

(define (read-list list-so-far)
   (let ((next-item (read-token)))
      (cond ;; if we hit a right parenthesis, return the list we've read,
            ;; reversing it into proper order
            ((eq? next-token '*right-parenthesis-token*)                 
             (reverse list-so-far))
            ;; otherwise, read next item and call self recursively
            ;; to read rest of list
            (else                                                  
             (read-list (cons next-token list-so-far))))))

Here I've coded read-list recursively in two ways.

The iteration that reads successive items in the list is implemented as tail recursion, passing the list so far as an argument to the recursive call. Intuitively, this iterates "rightward" in the list structure we're creating. Each list element is consed onto the list so far, and the new list is passed to a the tail-recursive call that performs iteration. (At the first call to read-list, we pass the empty list, because we've read zero elements so far.)

This constructs a list that's backwards, because we push later elements onto the front of the list. When we hit a right parenthesis and end a recursive call, we reverse the backwards list we've accumulated, to put it in the proper order, and return that.

Each list element is read by simply calling micro-read, which is what allows a list to contain arbitrary s-expressions, including other lists. Intuitively, this recurses downward through the nested data structures we're creating. The mutual recursion between micro-read and read-list is the key to the structure of the reader.

Comments on the Reader

The reader is a simple kind of recursive descent parser for normal Scheme data structures. (A parser converts a sequence of tokens into a syntax tree that describes the nesting of expressions or statements.) It is a "top-down" parser, because it recognizes high-level structures before lower-level ones--e.g., it recognizes the beginning of a list before reading and recognizing the items in the list. (That is, on seeing a left parenthesis, it "predicts" that it will see sequence of list elements followed by a matching right parenthesis.)(7) (8)

The reader converts a linear sequence of characters into a simple parse tree.

(If you're familiar with standard compiler terminology, you should recognize that read-token performs lexical analysis (a.k.a. scanning or tokenization) using read-string, read-symbol, and read-number. read performs simple predictive recursive-descent ("top down") parsing via the mutual recursion of read and read-list.)

Unlike most parsers, the data structure read generates is a data structure in the Scheme language, rather than a data structure internal to a compiler or interpreter. This is one of the nice things about Scheme--there's a simple but flexible parser you can use in your own programs. You can use it for parsing normal data as well as to help parse programs.

When implementing the Scheme language, that's not all there is to doing parsing of Scheme programs. The reader does the first part of parsing, translating input into s-expressions. The rest of parsing is done during interpretation or compilation, in a very straightforward way. The reader only recognizes nesting of expressions, and basic syntactic distinctions between tokens, e.g., whether they are parentheses, identifiers, or numeric literals. Later parsing must detect what kind of Scheme expressions the s-expressions represent, e.g., that a particular list represents a procedure call or a special form, and if it's a special form, which kind of special form.

The rest of the parsing isn't much more complicated than reading, and is also done recursively.(9)

Go to the first, previous, next, last section, table of contents.

The Reader

Implementing read

Implementing the read procedure

Comments on the Reader

Implementing `read`

Implementing the `read` procedure