Proposed Tagging and Generic Dispatching Scheme
            for RScheme and/or the GNU Extension Language
Goals

  A tagging scheme that allows both fast type checking the conventional
  Scheme sense and fast generic function dispatching for OO code.

  Minimal space cost, in particular, not requiring that pairs (cons cells)
  have headers.

Design

  Tagging:

     1. All heap allocated objects are allocated on 8-byte boundaries.
        This means that raw addresses always end in 000, i.e., you
        you can use 3-bit low tagging.

     2. Pairs have no headers, hence must have their own 3-bit tag.

     3. All other heap-allocated objects have a 2-word header which
        includes a pointer to the appropriate class object.

     4. Integers are 30 bits (NOT 29) with a two-bit low tag of 00.
        This keeps array/struct indexing fast: you can add a tagged
        int to the address of the array with no masks or shifts.  (The
        two-bit 00 tag in effect means that integers are pre-shifted 
        to be word offsets...).  A 3-bit tag of 000 is really a
        two bit-0 tag plus the low bit of an (even) integer.  A 3-bit
        tag of 100 is likewise a two-bit integer tag plus the low ibt
        of an odd integer.  Integer tag checks can thus be implemented
        as though two-bit tagging was used, and we also maintain the
        usual benefit of 00 tags for ints---we can add or subtract ints
        without stripping the tags out, because the 00 tags will add
        or subtract to yield 00 tags.

     5. Noninteger immediate values (e.g., ASCII characters, booleans)
        have a secondary tag field immediately above the primary tag.
        This will probably be three bits.
     
  Dispatching:

    There are two basic kinds of dispatching:

        * simple type checks for most built-in classes, and

        * general dispatching, which requires fetching a class pointer
          of the kind of object in question.

    To a first approximation, we can do type checks for immediate
    (nonpointer) values just by checking the tag bits, but we'll use
    a simple trick to allow the same fast code to work for pairs.

    For non-pair pointer types, either built-in or user-defined, we
    can do simple type checks (e.g., vector?, port?, some-user-defined-class?)
    by checking that the tag is a nonpair pointer tag, extracting the
    class pointer from the header of the object, and comparing that to
    the class pointer of the type we're checking for.

    General dispatching requires a little bit of cleverness to make
    fast.  To a first approximation here's what we do:

       We extract the primary tag bits and see if we've got an
       immediate value.  If so, we use primary and secondary tag bits
       as an index into a table of class pointers for immediate values.
       If not, we grab the class pointer from the header of the object.

    We have to enhance this a little bit to make it fast and to enable
    it to deal with pair pointers:

       We generalize the table lookup to use the primary and secondary
       tag bits UNCONDITIONALLY, even if the value in question may not
       really be an immediate value with a secondary tag.  No matter
       what kind of value it is, we take the low six bits and use them
       as an index into a 64-element array of class pointers.

       For immediate values with secondary tags, there will be a unique
       entry in the 64-element table.  For other kinds of values, with
       shorter tags, we will put multiple copies of the class pointer
       in the relevant slots of the table.  E.g., for integers, we'll
       have a copy of the pointer to the integer class object in every
       slot of the table whose offset ends in 00.  For pairs, we'll
       have a copy of the pair class pointer in every slot of the
       table whose offset ends in the 3-bit pair tag. 

       For "other pointer" types, we provide an escape mechanism.
       In every slot of the table whose offset ends in the 3-bit
       "other pointer" (non-pair pointer) type, we put a not-in-table
       flag of some kind. (Probably the integer 0, which has a bit
       pattern of all zeroes, which is easy to test against in one
       instruction on most machines)

    Now all we have to do to get a class pointer for any kind of
    value is this:

       Mask in the primary and secondary tag bits, irrespective of
       whether the value really has both primary and secondary tags.
       (Masking those 6 low bits in is faster than masking the upper
       bits out, because the mask will probably be an immediate bit
       pattern in the operand field of some instruction.)

       Use the resulting bits as an index into the table of class
       pointers, and load the class pointer.

       Check to see that we didn't get the not-in-table flag.  If
       we did, load the class pointer from the head of the object
       (an indexed load).

    Assuming that most checks are for immediate values or pairs,
    the usual instruction sequence will be something like this:

       mask-imm 111111    ; get primary+secondary tag bits
       shift 2            ; shift it 2 bit positions to make it an index
       indexed load       ; get table entry
       cmp 0              ; compare to 0 to see if we missed table
       branch (taken)     ;

    If we miss the table, we won't take the branch, and we'll do an
    indexed load too.  (To make that load fast, we have to ensure that
    the table of class pointers we're checking against is in some
    place we can load from quickly.)

    Depending on the instruction set, we may be able to branch-on-zero
    instead of doing a compare and a branch.

    We'll probably lose a cycle for the branch, but maybe not if
    branch prediction works.

    Bottom line: about 5 instructions if we're lucky with branch
    prediction and have single-cycle loads from a small first-level
    cache.  Up to about 8 if we're unlucky.  We can write three or
    four minor variations on this in C, e.g. reversing the branches
    of the IF statement, and see what works best, maybe on a
    platform-by-platform basis with an #ifdef or equivalent.

    Another thing to notice is that we've got some spare 3-bit
    primary tag values.  So far we've used these:

        EvenInt
        OddInt
        OtherImmediate
        PairPtr
        OtherPtr

    That leaves 3 more values we can use for specialized pointer types
    that we want types for in our 64-element table.  Based on
    profiling and thought, those might be allocated to closures
    (to speed procedure call checks) or vectors or strings.  I'd be
    very cautious in using up the remaining tag values, though.

    Another possibility (maybe relevant to GNU ext. lang.) is to make
    sure that #f and '() have different primary tags, but tags which
    differ in exactly one bit.  Then we can check for #f-or-'() by
    extracting those tag bits they have in common and checking to see
    that they're what we expect.

    We might use one of the tag values for headless double floats, but
    it's not clear that's worthwhile.  If we're heap-allocating a lot
    of floats, we're probably losing already, and need to work on
    type declarations.  (E.g., homogenous float arrays, type-declared
    float expressions.)  Cutting the size of a float down to 2 words
    from 4 is probably not going to help, because the main costs will be
    the time cost of boxing and unboxing floats, and maybe the generational
    GC cost of tracking pointers to short-lived float objects from long-lived
    objects (e.g., general arrays).

---

  Supporting advanced (e.g., MOP-based) object systems

    Advanced language systems based on metalevel programming (e.g. metaobject
    protocols) typically divide compilation into high-level and low-level
    components.  The goal is to divide the language into a BASE LANGUAGE
    which can be compiled to fast code using relatively conventional
    compiler techniques, and a META-LANGUAGE that allows higher-level
    constructs to be mapped onto the base language.  This modularizes
    the language and its implementation, allowing the separation of 
    high-level concerns from low-level concerns.

    The art of designing such a language system is to identify the critical
    issues that must be dealt with in the (relatively conventional) base
    language compiler, and to allow higher-level issues to be addressed
    cleanly in higher-level modules.   (See Kiczales et al.,
    esp. the book "The Art of the Metaobject Protocol".)  The key step
    is to design the interface (e.g., a metaobject protocol) so that
    things that really matter to performance are exposed and made clear,
    while abstracting away from unnecessary detail.

    We think that for a fairly conventional OO-Lisp or Smalltalk-like
    system, the only real support we need from the basic language
    compiler is the ability to quickly get a class pointer for any
    object (see above) and support for FUNCALLABLE INSTANCES (see
    below).  (Thanks to Gregor Kiczales for pointing out the latter.)

------

Supporting user-defined procedure types ("Funcallable instances" in
CLOS MOP terminology)

It seems to be a very good idea to leave two similar 3-bit tags
free to represent different kinds of procedures.

The basic idea is that along with being able to find a class pointer
(quickly) for any arbitrary value, there is one more little bit of
support that's useful to high-level object systems like metaobject
protocols:  being able to use a uniform calling discipline for either
plain procedures or odd user-defined kinds of procedures.

In CLOS terminology, the latter are known as "funcallable instances",
i.e., things that are instances of some class, but also happen to
be structured so that you can call them as procedures.  An example
of this is a CLOS generic function (or an instance of any subclass of
generic-fucnction), which has environment and code pointers like any
other function, but which may have other slots defined for its class.

As far as the basic language goes, all that's necessary is that there's
some layout common to plain closures and funcallable instances, and
that it be able to check very quickly that something is one or the
other.  (Any other stuff will be handled by the high-level object
system, so the basic compiler doesn't need to know about it.)

In terms of defining a tagging scheme, it's desirable to reserve a
primary tag for pointers to callable objects.  Then the header of
the object can be used to discriminate between different kinds of
callable objects (e.g., plain closures vs. funcallable instances).
The check for callability can just extract the primary tag and
check it against a constant.

---

Other uses of the basic dispatch mechanism (probably not worth it)

Another possibility is to use the basic dispatch mechanism to do
some other handy dispatching.  If we use quad-word alignment, and have
a few tag values to burn, we could refine the representation of
procedure pointers to encode the most common argument patterns as
well as the fact that a value is a procedure. E.g., we could have a
different tag value for procedure-that-takes-exactly-two-integer-arguments
or for procedure-that-takes-a-list-argaument.

It's not clear that this is worth the hair in the interface between the
high-level compiler and the low-level compiler, however, especially since
the biggest wins will come from static determination (and inlining) of
methods.