[The twelfth in a series of posts on the evolution of TransForth]

It’s been quite fun playing with this Universal Machine from the Cult of the Bound Variable. In this post we’re going to continue the journey toward building a full Forth for this machine by assembling a Forth inner interpreter; turning this register machine into a stack machine.

Last time we finished up a single-pass macro assembler for the UM-32 (in 15 lines of F# and 40 lines of Forth!) but did nothing more than a simple “Hello, world” with it. If you remember when we built an inner interpreter for TransForth, you know that this is the heart of Forth. If you’re unclear on the mechanics of a direct threaded inner interpreter, you’d probably enjoy the video from that earlier post.

# Register Machines (what a pain)

Being so used to dealing with a stack machine and essentially zero-operand instructions, it can be a bit annoying to deal with a register machine. But that’s what we have with the UM-32. Let’s assign purposes to the remaining registers (remember, z is already the zero constant and registers 1 and 2 are temp registers). Let’s allocate one more temp along with the standard inner interpreter and stack pointers.

```
: x 3 ;  \  Temp register
: w 4 ;  \  Working register
: i 5 ;  \  Interpreter register
: s 6 ;  \  Stack (data) register
: r 7 ;  \  Return stack register
```

In a way you can think of what we’re doing as implementing stack machine mechanics on top of a register machine. There are, of course, many real stack machines implemented in hardware. Koopman’s book is an excellent resource (free online) by the way. For the UM-32, we’ll need to handle the stacks ourselves along with the mechanics of walking nested sequences of stack operations.

Given the data or return stack pointer and a value register, these words will handle the stacks in a few instructions:

```
: push,  ( ab-m ) \  b.push(a)
  DUP
  dec,     \  b--
  z store, \  M[b] = a
;
```

```
: pop,  ( ab-m ) \  b = a.pop()
  OVER SWAP
  z SWAP    \  aazb
  fetch,    \  b = M[a]
  inc,      \  a++
;
```

These don’t emit any UM-32 code at this point. Instead they are assembler “macros” used to inline these instructions as needed later.

# Inner Interpreter

Now we’re going to emit the inner interpreter and dictionary structure. Following that, we’ll initialize and kick things off. We’ll start by jumping to the initialization code (yet to be defined):

```
forward branch,  \  over dictionary
```

It happens to be patched to address 60 later and the following ends up in the image:

```
00000     60 y literal
```

The triumvirate of an inner interpreter is enter, next and exit (in the past I’ve referred to them by their more classic names docol, next and dosemi).

The job of enter is to go into word definitions. It pushes the interpreter (i) register onto the return stack so we can get back out upon exit. It then points i at the body of the word and does a next.

```
i r push,                 \ r.push(i)
2 t literal, t w i add,   \ i = w + 2 (skip over enter,)
\ falls through to next,
```

It emits the following UM-32 code to the image (falling through to next):

```
00002      0 t literal
00003    t t t nand
00004    r t r add
00005    z r i store
00006      2 t literal
00007    i w t add
```

To make it easy to jump to this, here’s a convenience word:

```
: enter,  &enter @ x literal,  x jump, ;
```

Next comes next, whose job is to jump to the next word in a sequence and advance the interpreter to the one following.

```
i z w fetch,  \  w = M[i]
i inc,        \  i++
w jump,
```

Emitting:

```
00008    w z i fetch
00009      1 t literal
00010    i t i add
```

And the same kind of convenience macro:

```
: next,  &next @ x literal,  x jump, ;
```

Remember that all primitive words are plain machine code followed by a next. All secondary words begin with enter (that is, with machine code to jump to &enter) and are terminated with exit. The job of exit is to leave and rejoin a parent sequence; simply recovering the return address from the return stack and doing next from there.

```
r i pop,  \  i = r.pop()
next,
```

```
00012    i z r fetch
00013      1 t literal
00014    r t r add
00015      8 x literal
```

That’s it for the inner interpreter! Just fifteen instructions (plus two for the jump at the start).
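The control flow of the trio is easier to see stripped of the UM-32 encoding. Here’s a Python sketch (the EMIT primitive, the tag values and the memory layout are all invented for illustration): each thread cell holds the address of a word, and a word’s first cell is a tag standing in for its machine code.

```python
# Toy direct-threaded image in a Python list. Thread cells hold word
# addresses; a word's first cell is a tag standing in for machine code.
ENTER, EXIT, EMIT, HALT = -1, -2, -3, -4

def run(mem, start):
    out = []      # values collected by the toy EMIT primitive
    r = []        # return stack (kept in Python rather than in mem)
    i = start     # interpreter register: points into threaded code
    while True:
        w = mem[i]; i += 1           # next: w = M[i]; i++
        if mem[w] == ENTER:          # enter: save i, descend into the body
            r.append(i); i = w + 1
        elif mem[w] == EXIT:         # exit: resume the caller's thread
            i = r.pop()
        elif mem[w] == EMIT:         # a primitive with an inline operand
            out.append(mem[i]); i += 1
        elif mem[w] == HALT:
            return out

#       0:halt 1:emit 2:exit  3-6:inner  7-13:outer  14-15:boot thread
mem = [HALT, EMIT, EXIT,
       ENTER, 1, 66, 2,            # inner: emit 66 ; exit
       ENTER, 1, 65, 3, 1, 67, 2,  # outer: emit 65 ; inner ; emit 67 ; exit
       7, 0]                       # boot thread: outer then halt
print(run(mem, 14))                # [65, 66, 67]
```

The nested call into `inner` emits its value between `outer`’s two emits, showing `enter` and `exit` weaving in and out of word bodies exactly as the UM-32 version does.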

# Hand-packed Primitives

Now we want to pack a dictionary with primitive and secondary words to exercise things a bit. Using the same example as in the direct threading demo video, we’ll need lit, pick and add primitives; then we can define secondary dup (in terms of lit and pick) and double (in terms of dup and add).

Starting with lit, we push the literal, which is packed inline alongside the word addresses, and advance the interpreter to skip over it:

```
i z y fetch,  \  y = M[i]
y s push,     \  s.push(y)
i inc,        \  i++
next,
```

Emitting:

```
00017    y z i fetch
00018      0 t literal
00019    t t t nand
00020    s t s add
00021    z s y store
00022      1 t literal
00023    i t i add
00024      8 x literal
```

The pick word is used to pull a copy of the nth element on the stack to the top:

```
s y pop,      \  y = s.pop()
s x move,     \  x = s
y x x add,    \  x = x + y
x z x fetch,  \  x = M[x]
x s push,     \  s.push(x)
next,
```

Emitting:

```
00026    y z s fetch
00027      1 t literal
00028    s t s add
00029      1 t literal
00030    x s t cmove
00031    x x y add
00032    x z x fetch
00033      0 t literal
00034    t t t nand
00035    s t s add
00036    z s x store
00037      8 x literal
```

The add word essentially pops two values and pushes back their sum. It could be implemented as:

```
s y pop,    \  y = s.pop()
s x pop,    \  x = s.pop()
y x x add,  \  x = x + y
x s push,   \  s.push(x)
next,
```

But a more efficient implementation just fetches the top of the stack for the second operand (without popping) and pokes back the result (without pushing):

```
s y pop,      \  y = s.pop()
s z x fetch,  \  x = M[s]
y x x add,    \  x = x + y
x s z store,  \  M[s] = x
next,
```
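The saving is easy to see in a toy model (the `mem`/`s` names are illustrative, as before): the naive version moves the stack pointer down and back up for nothing, while the optimized one leaves the second operand in place and overwrites it.

```python
# Toy model of the optimized add: pop one operand, combine it with the
# still-in-place top of stack, and poke the result back without a push.
mem = [0] * 8
s = 8
s -= 1; mem[s] = 40   # push 40
s -= 1; mem[s] = 2    # push 2

y = mem[s]; s += 1    # y = s.pop()
mem[s] = mem[s] + y   # M[s] = M[s] + y  (no pop/push round trip)
print(mem[s])         # 42
```

That’s one fewer pointer decrement/increment pair, which matters when every primitive is inlined machine code.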

Emitting:

```
00039    y z s fetch
00040      1 t literal
00041    s t s add
00042    x z s fetch
00043    x x y add
00044    z s x store
00045      8 x literal
```

For our final primitive in this exercise, we just need a halt. It’s just a single instruction terminating execution (so it doesn’t need to be followed by a next). Later we’ll seed the process by pointing the interpreter here so that once all the nested sequences are complete the machine halts:

```
halt,
```

Emitting:

```
00047          halt
```

# Hand-packed Secondary Words

Now on to the secondary words, which are quite different. They’re a branch to enter followed by a sequence of addresses rather than of machine code. It’s starting to gain a Forth-like feel; a little more comfortable:

```
enter,
&lit  @ m,
0       m,
&pick @ m,
&exit @ m,
```

Emitting:

```
00048      2 x literal
00050      17
00051      0
00052      26
00053      12
```

Notice how the literal zero is packed in there but otherwise it’s just a list of addresses. This is the essence of threaded code. If you think of the word addresses as op codes for a zero-operand stack machine, you’re not far off!

The double word is another secondary, this time defined in terms of another secondary (the dup from above) and primitives:

```
enter,
&dup  @ m,
&add  @ m,
&exit @ m,
```

Emitting:

```
00054      2 x literal
00056      48
00057      39
00058      12
```

That’s it for the dictionary.

# Sample Program

Now we want to prime the mechanics to run a sample program. We’ll push a 42 to the stack and execute double. This will dup the 42 by doing a 0 pick and add them together. With success we’ll end up with an 84 on the stack and halt.

To kick off the interpreter, we’ll point it at a location in memory containing the address of the halt word in the dictionary. This will be pushed to the return stack upon initially entering double and will be popped and executed upon completion just as we want:

```
&halt @ m,
```

Emitting:

```
00059      47
```

Remember the jump at the very start of the image? That is patched to the current address and so here finally begins execution of our sample:

```
tohere
```

We’ll partition memory much as we did before with return and data stacks in high memory and the dictionary (already packed) in low memory:

```
16383 r literal,  \  top of return stack, 3FFF
12287 s literal,  \  top of data stack, 2FFF
```

And like we said, seed the interpreter pointer with a cell containing the halt address:

```
terminate @ i literal,
```

Emitted so far:

```
00060  16383 r literal
00061  12287 s literal
00062     59 i literal
```

To set up our little scenario, we manually push a 42 to the stack, point the word (w) register at double and branch into it. The inner interpreter takes it from there; walking the nested primitive and secondary words to completion and halting.

```
42 x literal,  x s push,
&double @ w literal,
&double @ branch,
```

Emitting:

```
00063     42 x literal
00064      0 t literal
00065    t t t nand
00066    s t s add
00067    z s x store
00068     54 w literal
00069     54 y literal
```

We’re done!

We can’t quite just load up this 71-cell image in the UM-32 and go, however, because we’re using memory at much higher addresses for the stacks. We could add code to allocate and copy over the image, but I think I’ll just take the easy route and pad the rest of the image with zeros; making a nice little 64KB (16K 32-bit cells) image file.
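The padding step itself is trivial. Here’s a Python sketch (assuming the UM “scroll” format of big-endian 32-bit words; the `image` list here is a stand-in for the 71 assembled cells):

```python
import struct

CELLS = 16 * 1024                    # 16K cells = a 64KB image

image = [0] * 71                     # stand-in for the 71 assembled cells
image += [0] * (CELLS - len(image))  # zero-pad up to the full 16K cells

# UM scrolls are sequences of big-endian 32-bit words
blob = b"".join(struct.pack(">I", cell) for cell in image)
print(len(blob))                     # 65536 bytes = 64KB
```

Writing `blob` to a file gives an image the UM-32 can load directly, with the stack regions already present as zeroed memory.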

# Tracing

I would never have been able to get this all working without some visibility into the UM-32 in action. To get some tracing I added the following to Luke’s UM-32 implementation (inside his cycle function):

```
//(* Debugging
let name = function 0 -> "z" | 1 -> "t" | 2 -> "y" | 3 -> "x" | 4 -> "w" | 5 -> "i" | 6 -> "s" | 7 -> "r" | _ -> "?"
let an, bn, cn, a2n = (name a), (name b), (name c), (name a2)
printf "%05i  " finger
match code with
| 0u  -> printfn "  %s %s %s cmove       %s = %s:%i if %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 1u  -> printfn "  %s %s %s fetch       %s = M[%s:%i][%s:%i]" an bn cn an bn reg.[b] cn reg.[c]
| 2u  -> printfn "  %s %s %s store       M[%s:%i][%s:%i] = %s:%i" an bn cn an reg.[a] bn reg.[b] cn reg.[c]
| 3u  -> printfn "  %s %s %s add         %s = %s:%i + %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 4u  -> printfn "  %s %s %s mult        %s = %s:%i * %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 5u  -> printfn "  %s %s %s div         %s = %s:%i / %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 6u  -> printfn "  %s %s %s nand        %s = %s:%i ~& %s:%i" an bn cn an bn reg.[b] cn reg.[c]
| 7u  -> printfn "        halt"
| 8u  -> printfn "    %s %s alloc       new(%s:%i) -> %s" bn cn cn reg.[c] bn
| 9u  -> printfn "      %s free        %s:%i" cn cn reg.[c]
| 10u -> printfn "      %s echo        %s:%i" cn cn reg.[c]
| 11u -> printfn "      %s key         %s:%i" cn cn reg.[c]
| 12u -> printfn "    %s %s loadjump    load(%s:%i), jump(%s:%i)" bn cn bn reg.[b] cn reg.[c]
| 13u -> printfn "%5i %s literal     %s = %i" value a2n a2n value
//*)
```

Now we can run our image and see all the gloriously gory details!

```
00000     60 y literal     y = 60
00060  16383 r literal     r = 16383
00061  12287 s literal     s = 12287
00062     59 i literal     i = 59
00063     42 x literal     x = 42
00064      0 t literal     t = 0
00065    t t t nand        t = t:0 ~& t:0
00066    s t s add         s = t:4294967295 + s:12287
00067    z s x store       M[z:0][s:12286] = x:42
00068     54 w literal     w = 54
00069     54 y literal     y = 54
00054      2 x literal     x = 2
00002      0 t literal     t = 0
00003    t t t nand        t = t:0 ~& t:0
00004    r t r add         r = t:4294967295 + r:16383
00005    z r i store       M[z:0][r:16382] = i:59
00006      2 t literal     t = 2
00007    i w t add         i = w:54 + t:2
00008    w z i fetch       w = M[z:0][i:56]
00009      1 t literal     t = 1
00010    i t i add         i = t:1 + i:56
00048      2 x literal     x = 2
00002      0 t literal     t = 0
00003    t t t nand        t = t:0 ~& t:0
00004    r t r add         r = t:4294967295 + r:16382
00005    z r i store       M[z:0][r:16381] = i:57
00006      2 t literal     t = 2
00007    i w t add         i = w:48 + t:2
00008    w z i fetch       w = M[z:0][i:50]
00009      1 t literal     t = 1
00010    i t i add         i = t:1 + i:50
00017    y z i fetch       y = M[z:0][i:51]
00018      0 t literal     t = 0
00019    t t t nand        t = t:0 ~& t:0
00020    s t s add         s = t:4294967295 + s:12286
00021    z s y store       M[z:0][s:12285] = y:0
00022      1 t literal     t = 1
00023    i t i add         i = t:1 + i:51
00024      8 x literal     x = 8
00008    w z i fetch       w = M[z:0][i:52]
00009      1 t literal     t = 1
00010    i t i add         i = t:1 + i:52
00026    y z s fetch       y = M[z:0][s:12285]
00027      1 t literal     t = 1
00028    s t s add         s = t:1 + s:12285
00029      1 t literal     t = 1
00030    x s t cmove       x = s:12286 if t:1
00031    x x y add         x = x:12286 + y:0
00032    x z x fetch       x = M[z:0][x:12286]
00033      0 t literal     t = 0
00034    t t t nand        t = t:0 ~& t:0
00035    s t s add         s = t:4294967295 + s:12286
00036    z s x store       M[z:0][s:12285] = x:42
00037      8 x literal     x = 8
00008    w z i fetch       w = M[z:0][i:53]
00009      1 t literal     t = 1
00010    i t i add         i = t:1 + i:53
00012    i z r fetch       i = M[z:0][r:16381]
00013      1 t literal     t = 1
00014    r t r add         r = t:1 + r:16381
00015      8 x literal     x = 8
00008    w z i fetch       w = M[z:0][i:57]
00009      1 t literal     t = 1
00010    i t i add         i = t:1 + i:57
00039    y z s fetch       y = M[z:0][s:12285]
00040      1 t literal     t = 1
00041    s t s add         s = t:1 + s:12285
00042    x z s fetch       x = M[z:0][s:12286]
00043    x x y add         x = x:42 + y:42
00044    z s x store       M[z:0][s:12286] = x:84
00045      8 x literal     x = 8
00008    w z i fetch       w = M[z:0][i:58]
00009      1 t literal     t = 1
00010    i t i add         i = t:1 + i:58
00012    i z r fetch       i = M[z:0][r:16382]
00013      1 t literal     t = 1
00014    r t r add         r = t:1 + r:16382
00015      8 x literal     x = 8
00008    w z i fetch       w = M[z:0][i:59]
00009      1 t literal     t = 1
00010    i t i add         i = t:1 + i:59
00047          halt
```

And voilà! We indeed end up with 84 on the stack. Not a particularly impressive feat, but it shows that all seems to be working as expected.

# To be continued…

Next, we’ll work on building an outer interpreter to process Forth source code and get out of this assembly language business. However, we’ll always keep the assembler handy and we’ll always be free to dip down to defining Forth words at this lowest level; for performance and to take advantage of machine-level functionality as needed. This spanning from bare metal all the way up to very high levels of abstraction is a unique beauty of Forth.