[The twelfth in a series of posts on the evolution of TransForth]

 

It’s been quite fun playing with this Universal Machine from the Cult of the Bound Variable. In this post we’re going to continue the journey toward building a full Forth for this machine by assembling a Forth inner interpreter, turning this register machine into a stack machine.

 

Last time we finished up a single-pass macro assembler for the UM-32 (in 15 lines of F# and 40 lines of Forth!) but did nothing more than a simple “Hello, world” with it. If you remember when we built an inner interpreter for TransForth, you know that this is the heart of Forth. If you’re unclear on the mechanics of a direct threaded inner interpreter, you’d probably enjoy the video from that earlier post:

 

 

Register Machines (what a pain)

 

Being so used to dealing with a stack machine and essentially zero-operand instructions, it can be a bit annoying to deal with a register machine. But that’s what we have with the UM-32. Let’s assign purposes to the remaining registers (remember, z is already the zero constant and registers 1 and 2 are temp registers). We’ll allocate one more temp along with the standard inner interpreter and stack pointers:

 

: x 3 ;  \  Temp register

: w 4 ;  \  Working register

: i 5 ;  \  Interpreter register

: s 6 ;  \  Stack (data) register

: r 7 ;  \  Return stack register

 

In a way you can think of what we’re doing as implementing stack machine mechanics on top of a register machine. There are, of course, many real stack machines implemented in hardware. Koopman’s book is an excellent resource (free online) by the way. For the UM-32, we’ll need to handle the stacks ourselves along with the mechanics of walking nested sequences of stack operations.

 

Given the data or return stack pointer and a value register, these words will handle the stacks in a few instructions:

 

: push,  ( ab-m ) \  b.push(a)

    DUP

    dec,   \  b--

  z store, \  M[b] = a

;

 

: pop,  ( ab-m ) \  b = a.pop()

    OVER SWAP

  z SWAP      \  aazb

    fetch,    \  b = M[a]

    inc,      \  a++

;

 

These don’t emit any UM-32 code at this point. Instead they are assembler “macros” used to inline these instructions as needed later.

Inner Interpreter

 

Now we’re going to emit the inner interpreter and dictionary structure. Following that, we’ll initialize and kick things off. We’ll start by jumping to the initialization code (yet to be defined):

 

forward branch,  \  over dictionary

 

It happens to be patched to address 60 later and the following ends up in the image:

 

00000     60 y literal

00001      z y loadjump

 

The triumvirate of an inner interpreter is enter, next and exit (in the past I’ve referred to them by their more classic names docol, next and dosemi).

 

The job of enter is to go into word definitions. It pushes the interpreter (i) register onto the return stack so we can get back out upon exit. It then points i at the body of the word and does a next.

 

VARIABLE &enter  address &enter !

i r push,                 \ r.push(i)

2 t literal, t w i add,   \ i = w + 2 (skip over enter,)

\ falls through to next,

 

It emits the following UM-32 code to the image (falling through to next):

 

00002      0 t literal

00003    t t t nand

00004    r t r add

00005    z r i store

00006      2 t literal

00007    i w t add

 

To make it easy to jump to this, here’s a convenience word:

 

: enter,  &enter @ x literal,  x jump, ;
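A quick aside on the 0 t literal / t t t nand / r t r add pattern in the listing above (the expansion of dec, on the return stack pointer): the UM-32 has no subtract instruction, but NANDing zero with itself yields all ones, which is -1 in 32-bit two’s complement, so adding it decrements. A small Python sketch of the trick:

```python
MASK = 0xFFFFFFFF   # 32-bit cells

def nand(a, b):
    """The UM-32 nand operation."""
    return ~(a & b) & MASK

def dec(r):
    """The three-instruction decrement: 0 t literal; t t t nand; r t r add."""
    t = 0                   # 0 t literal
    t = nand(t, t)          # t = ~(0 & 0) = all ones = -1 (mod 2^32)
    return (r + t) & MASK   # r = r + t = r - 1 (mod 2^32)

print(dec(16383))   # 16382
```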

 

Next comes next, whose job is to jump to the next word in a sequence and advance the interpreter to the one following.

 

VARIABLE &next  address &next !

i z w fetch,  \  w = M[i]

    i inc,    \  i++

    w jump,

 

Emitting:

 

00008    w z i fetch

00009      1 t literal

00010    i t i add

00011      z w loadjump

 

And the same kind of convenience macro:

 

: next,  &next @ x literal,  x jump, ;

 

Remember that all primitive words are plain machine code followed by a next. All secondary words begin with enter (that is, with machine code to jump to &enter) and are terminated with exit. The job of exit is to leave and rejoin a parent sequence; simply recovering the return address from the return stack and doing next from there.

 

VARIABLE &exit  address &exit !

r i pop,  \  i = r.pop()

    next,

 

Emitting an inlined pop and a jump to next:

 

00012    i z r fetch

00013      1 t literal

00014    r t r add

00015      8 x literal

00016      z x loadjump

 

That’s it for the inner interpreter! Just fifteen instructions (plus two for the jump at the start).

 

Hand-packed Primitives

 

Now we want to pack a dictionary with primitive and secondary words to exercise things a bit. Using the same example as in the direct threading demo video, we’ll need lit, pick and add primitives; then we can define a secondary dup (in terms of lit and pick) and double (in terms of dup and add).

 

Starting with lit, we push the literal value that’s packed inline alongside the word addresses, then advance the interpreter to skip over it:

 

VARIABLE &lit  address &lit !

i z y fetch,  \  y = M[i]

  y s push,   \  s.push(y)

    i inc,    \  i++

      next,

 

Emitting:

 

00017    y z i fetch

00018      0 t literal

00019    t t t nand

00020    s t s add

00021    z s y store

00022      1 t literal

00023    i t i add

00024      8 x literal

00025      z x loadjump

 

The pick word is used to pull a copy of the nth element on the stack to the top:

 

VARIABLE &pick  address &pick !

    s y pop,    \  y = s.pop()

    s x move,   \  x = s

  y x x add,    \  x = x + y

  x z x fetch,  \  x = M[x]

    x s push,   \  s.push(x)

        next,

 

Emitting:

 

00026    y z s fetch

00027      1 t literal

00028    s t s add

00029      1 t literal

00030    x s t cmove

00031    x x y add

00032    x z x fetch

00033      0 t literal

00034    t t t nand

00035    s t s add

00036    z s x store

00037      8 x literal

00038      z x loadjump

 

The add word essentially pops two values and pushes back their sum. It could be implemented as:

 

  s y pop,   \  y = s.pop()

  s x pop,   \  x = s.pop()

y x x add,   \  x = x + y

  x s push,  \  s.push(x)

      next,

 

But a more efficient implementation just fetches the top of the stack for the second operand (without popping) and pokes back the result (without pushing):

 

VARIABLE &add  address &add !

    s y pop,    \  y = s.pop()

  s z x fetch,  \  x = M[s]

  y x x add,    \  x = x + y

  x s z store,  \  M[s] = x

        next,

 

Emitting:

 

00039    y z s fetch

00040      1 t literal

00041    s t s add

00042    x z s fetch

00043    x x y add

00044    z s x store

00045      8 x literal

00046      z x loadjump
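The saving is easier to see side by side. Here’s a Python sketch (illustrative, not the emitted code) of both variants operating on the same downward-growing stack:

```python
MASK = 0xFFFFFFFF   # 32-bit cells

def add_naive(mem, s):
    """pop, pop, add, push -- four full stack operations."""
    y = mem[s]; s += 1                # y = s.pop()
    x = mem[s]; s += 1                # x = s.pop()
    s -= 1; mem[s] = (x + y) & MASK   # s.push(x + y)
    return s

def add_in_place(mem, s):
    """Pop once, then fetch and poke the (new) top cell directly."""
    y = mem[s]; s += 1                # y = s.pop()
    mem[s] = (mem[s] + y) & MASK      # M[s] = M[s] + y
    return s
```

Both leave the same stack behind; the in-place version simply skips a redundant pointer decrement/increment pair.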

 

For our final primitive in this exercise, we need a halt. It’s a single instruction terminating execution (so it doesn’t need to be followed by a next). Later we’ll seed the process by pointing the interpreter here so that once all the nested sequences are complete the machine halts:

 

VARIABLE &halt  address &halt !

halt,

 

Emitting:

 

00047          halt

 

Hand-packed Secondary Words

 

Now on to the secondary words, which are quite different: a branch to enter followed by a sequence of word addresses rather than machine code. It’s starting to gain a Forth-like feel; a little more comfortable:

 

VARIABLE &dup  address &dup !

enter,

&lit  @ m,

0       m,

&pick @ m,

&exit @ m,

Emitting:

 

00048      2 x literal

00049      z x loadjump

00050      17

00051      0

00052      26

00053      12

 

Notice how the literal zero is packed in there but otherwise it’s just a list of addresses. This is the essence of threaded code. If you think of the word addresses as op codes for a zero-operand stack machine, you’re not far off!

 

The double word is another secondary, this time defined in terms of another secondary (the dup from above) and primitives:

 

VARIABLE &double  address &double !

enter,

&dup  @ m,

&add  @ m,

&exit @ m,

 

Emitting:

 

00054      2 x literal

00055      z x loadjump

00056      48

00057      39

00058      12

 

That’s it for the dictionary.

 

Sample Program

 

Now we want to prime the mechanics to run a sample program. We’ll push a 42 to the stack and execute double. This will dup the 42 by doing a 0 pick, then add the two copies together. If all goes well, we’ll end up with an 84 on the stack and halt.
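Before wiring it up in the image, here’s the whole scenario modeled in Python (a sketch of the threading mechanics, not the UM-32): secondaries are lists of words, entering one saves the caller’s position on a return stack, and running off the end of a list is exit.

```python
# A Python model of direct threading (a sketch of the mechanics, not the UM-32).
stack = []

def run(thread):
    rstack = []
    ip = 0
    while True:
        if ip == len(thread):        # exit: resume the caller (or finish)
            if not rstack:
                return
            thread, ip = rstack.pop()
            continue
        w = thread[ip]; ip += 1
        if isinstance(w, list):      # enter a secondary word
            rstack.append((thread, ip))
            thread, ip = w, 0
        elif w == "lit":             # push the literal packed in the thread
            stack.append(thread[ip]); ip += 1
        elif w == "pick":            # copy the nth element to the top
            n = stack.pop()
            stack.append(stack[-1 - n])
        elif w == "add":
            y = stack.pop()
            stack.append(stack.pop() + y)

dup = ["lit", 0, "pick"]    # dup == 0 pick
double = [dup, "add"]       # double == dup add

stack.append(42)
run(double)
print(stack)    # [84]
```

Replace the Python lists with cell addresses and the isinstance test with a branch to enter, and you have essentially what we’ve assembled above.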

 

To kick off the interpreter, we’ll point it at a location in memory containing the address of the halt word in the dictionary. This will be pushed to the return stack upon initially entering double and will be popped and executed upon completion just as we want:

 

VARIABLE terminate  address terminate !

&halt @ m,

 

Emitting:

 

00059      47

 

Remember the jump at the very start of the image? That is patched to the current address and so here finally begins execution of our sample:

 

tohere

 

We’ll partition memory much as we did before with return and data stacks in high memory and the dictionary (already packed) in low memory:

 

16383 r literal,  \  top of return stack, 3FFF

12287 s literal,  \  top of data stack, 2FFF

 

And, as mentioned, seed the interpreter pointer with the address of the cell containing the halt address:

 

terminate @ i literal,

 

Emitted so far:

 

00060  16383 r literal

00061  12287 s literal

00062     59 i literal

 

To set up our little scenario, we manually push a 42 to the stack, point the word register (w) at double and branch into it. The inner interpreter takes it from there, walking the nested primitive and secondary words to completion and halting.

 

42 x literal,  x s push,

&double @ w literal,

&double @ branch,

 

Emitting:

 

00063     42 x literal

00064      0 t literal

00065    t t t nand

00066    s t s add

00067    z s x store

00068     54 w literal

00069     54 y literal

00070      z y loadjump

 

We’re done!

 

We can’t quite just load up this 71-cell image in the UM-32 and go, however, because we’re using memory at much higher addresses for the stacks. We could add code to allocate and copy over the image, but I think I’ll just take the easy route and pad the rest of the image with zeros, making a nice little 64KB (16K 32-bit cells) image file:

 

: pad,  16384 address DO 0 m, LOOP ;

pad, msave
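For reference, what pad, plus msave amounts to can be sketched in Python (save_image and the file name are hypothetical; per the UM spec, each cell of the “program” scroll is stored as a big-endian 32-bit word):

```python
import struct

def save_image(cells, path, total=16384):
    """Zero-pad the assembled cells to `total` and write each one as a
    big-endian 32-bit word (the UM-32 'program' scroll format)."""
    padded = cells + [0] * (total - len(cells))
    with open(path, "wb") as f:
        for cell in padded:
            f.write(struct.pack(">I", cell))

# e.g. save_image(image, "forth.um")  # 16K cells -> a 64KB file
```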

 

Tracing

 

I would never have been able to get this all working without some visibility into the UM-32 in action. To get some tracing, I added the following to Luke’s UM-32 implementation (inside his cycle function):

//(* Debugging
let name = function 0 -> "z" | 1 -> "t" | 2 -> "y" | 3 -> "x" | 4 -> "w" | 5 -> "i" | 6 -> "s" | 7 -> "r" | _ -> "?"
let an, bn, cn, a2n = (name a), (name b), (name c), (name a2)
printf "%05i  " finger
match code with
| 0u  -> printfn "  %s %s %s cmove       %s = %s:%i if %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 1u  -> printfn "  %s %s %s fetch       %s = M[%s:%i][%s:%i]" an bn cn an bn reg.[b] cn reg.[c]
| 2u  -> printfn "  %s %s %s store       M[%s:%i][%s:%i] = %s:%i" an bn cn an reg.[a] bn reg.[b] cn reg.[c]
| 3u  -> printfn "  %s %s %s add         %s = %s:%i + %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 4u  -> printfn "  %s %s %s mult        %s = %s:%i * %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 5u  -> printfn "  %s %s %s div         %s = %s:%i / %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 6u  -> printfn "  %s %s %s nand        %s = %s:%i ~& %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 7u  -> printfn "        halt"
| 8u  -> printfn "    %s %s alloc       new(%s:%i) -> %s" bn cn cn reg.[c] bn
| 9u  -> printfn "      %s free        %s:%i" cn cn reg.[c]
| 10u -> printfn "      %s echo        %s:%i" cn cn reg.[c]
| 11u -> printfn "      %s key         %s:%i" cn cn reg.[c]
| 12u -> printfn "    %s %s loadjump    load(%s:%i), jump(%s:%i)" bn cn bn reg.[b] cn reg.[c]
| 13u -> printfn "%5i %s literal     %s = %i" value a2n a2n value
| _   -> ()  // unreachable for valid op codes
//Console.ReadLine() |> ignore
//*)

Now we can run our image and see all the gloriously gory details!

 

00000     60 y literal     y = 60

00001      z y loadjump    load(z:0), jump(y:60)

00060  16383 r literal     r = 16383

00061  12287 s literal     s = 12287

00062     59 i literal     i = 59

00063     42 x literal     x = 42

00064      0 t literal     t = 0

00065    t t t nand        t = t:0 ~& t:0

00066    s t s add         s = t:4294967295 + s:12287

00067    z s x store       M[z:0][s:12286] = x:42

00068     54 w literal     w = 54

00069     54 y literal     y = 54

00070      z y loadjump    load(z:0), jump(y:54)

00054      2 x literal     x = 2

00055      z x loadjump    load(z:0), jump(x:2)

00002      0 t literal     t = 0

00003    t t t nand        t = t:0 ~& t:0

00004    r t r add         r = t:4294967295 + r:16383

00005    z r i store       M[z:0][r:16382] = i:59

00006      2 t literal     t = 2

00007    i w t add         i = w:54 + t:2

00008    w z i fetch       w = M[z:0][i:56]

00009      1 t literal     t = 1

00010    i t i add         i = t:1 + i:56

00011      z w loadjump    load(z:0), jump(w:48)

00048      2 x literal     x = 2

00049      z x loadjump    load(z:0), jump(x:2)

00002      0 t literal     t = 0

00003    t t t nand        t = t:0 ~& t:0

00004    r t r add         r = t:4294967295 + r:16382

00005    z r i store       M[z:0][r:16381] = i:57

00006      2 t literal     t = 2

00007    i w t add         i = w:48 + t:2

00008    w z i fetch       w = M[z:0][i:50]

00009      1 t literal     t = 1

00010    i t i add         i = t:1 + i:50

00011      z w loadjump    load(z:0), jump(w:17)

00017    y z i fetch       y = M[z:0][i:51]

00018      0 t literal     t = 0

00019    t t t nand        t = t:0 ~& t:0

00020    s t s add         s = t:4294967295 + s:12286

00021    z s y store       M[z:0][s:12285] = y:0

00022      1 t literal     t = 1

00023    i t i add         i = t:1 + i:51

00024      8 x literal     x = 8

00025      z x loadjump    load(z:0), jump(x:8)

00008    w z i fetch       w = M[z:0][i:52]

00009      1 t literal     t = 1

00010    i t i add         i = t:1 + i:52

00011      z w loadjump    load(z:0), jump(w:26)

00026    y z s fetch       y = M[z:0][s:12285]

00027      1 t literal     t = 1

00028    s t s add         s = t:1 + s:12285

00029      1 t literal     t = 1

00030    x s t cmove       x = s:12286 if t:1

00031    x x y add         x = x:12286 + y:0

00032    x z x fetch       x = M[z:0][x:12286]

00033      0 t literal     t = 0

00034    t t t nand        t = t:0 ~& t:0

00035    s t s add         s = t:4294967295 + s:12286

00036    z s x store       M[z:0][s:12285] = x:42

00037      8 x literal     x = 8

00038      z x loadjump    load(z:0), jump(x:8)

00008    w z i fetch       w = M[z:0][i:53]

00009      1 t literal     t = 1

00010    i t i add         i = t:1 + i:53

00011      z w loadjump    load(z:0), jump(w:12)

00012    i z r fetch       i = M[z:0][r:16381]

00013      1 t literal     t = 1

00014    r t r add         r = t:1 + r:16381

00015      8 x literal     x = 8

00016      z x loadjump    load(z:0), jump(x:8)

00008    w z i fetch       w = M[z:0][i:57]

00009      1 t literal     t = 1

00010    i t i add         i = t:1 + i:57

00011      z w loadjump    load(z:0), jump(w:39)

00039    y z s fetch       y = M[z:0][s:12285]

00040      1 t literal     t = 1

00041    s t s add         s = t:1 + s:12285

00042    x z s fetch       x = M[z:0][s:12286]

00043    x x y add         x = x:42 + y:42

00044    z s x store       M[z:0][s:12286] = x:84

00045      8 x literal     x = 8

00046      z x loadjump    load(z:0), jump(x:8)

00008    w z i fetch       w = M[z:0][i:58]

00009      1 t literal     t = 1

00010    i t i add         i = t:1 + i:58

00011      z w loadjump    load(z:0), jump(w:12)

00012    i z r fetch       i = M[z:0][r:16382]

00013      1 t literal     t = 1

00014    r t r add         r = t:1 + r:16382

00015      8 x literal     x = 8

00016      z x loadjump    load(z:0), jump(x:8)

00008    w z i fetch       w = M[z:0][i:59]

00009      1 t literal     t = 1

00010    i t i add         i = t:1 + i:59

00011      z w loadjump    load(z:0), jump(w:47)

00047          halt

 

And voilà! We indeed end up with 84 on the stack. Not a particularly impressive feat, but it shows that all seems to be working as expected.

 

To be continued…

 

Next, we’ll work on building an outer interpreter to process Forth source code and get out of this assembly language business. However, we’ll always keep the assembler handy and remain free to dip down to defining Forth words at this lowest level, for performance and to take advantage of machine-level functionality as needed. This span from bare metal all the way up to very high levels of abstraction is a unique beauty of Forth.