the forth memory model


many articles, tutorials, blogposts, etc about forth focus exclusively on its nature as a stack based language. in fact, this is all derivative or 'inspired' languages learn from it - the likes of Factor, Cat and Play only bother with the postfix. many other aspects of forth are sorely underappreciated, such as its metaprogramming, its 2nd stack, and the topic of this article, the memory model. this article aims to explain and explore forth's memory model, by implementing a counted array datastructure.

this article assumes an understanding of how postfix works, and a somewhat basic understanding of forth, and of the von neumann architecture.

forth is a language with 'raw memory access'. unlike python, go, javascript, or any similar languages, forth allows you to directly access and control the ram your program is given, via a set of core words:

! - write cell to memory location
@ - read cell from memory location
, - compile cell
here - pointer to pointer to latest free cell
allot - allocate bytes

these words commonly have the following variants:

-! - decrement contents of memory location by
+! - increment contents of memory location by
c! - write byte to memory location
c@ - read byte from memory location
c, - compile byte

most, if not all, of your forth-memory interactions will be via these words, the vast majority using ! and @. so, what is a cell?

cells

chances are, you've heard the phrase 64 bit, 32 bit, etc. what this refers to is the 'wordsize' of the cpu. 32 bits have 'words' 32 bits wide, meaning their registers can store 32 bits(4 bytes) at once. in forth, the word 'word' is taken to refer to the subroutines, so we use cell instead. this means on a 16 bit cpu, your forth has cells that store 16 bits. put shortly, a cell is a block of memory.

forth memory is entirely composed of cells: your data and return stacks are made of cells, and your dictionary space is also made of cells.

dictionary space

in forth, all the words and variables and constants that you define go somewhere in memory. this specific somewhere in memory is called the 'dictionary space'. when you run the code

variable a

you are reserving (allocting) a piece of that dictionary space for use later, which you can access by writing a. you can thus think of the forth dictionary space as your c 'heap', however the mechanics of it are far simpler to understand.

dictionary space can be modelled as a sequence of cells right next to each other, like a tape.

+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+---+

just like a tape, the dictionary space has a head, called here. here is a variable that can be accessed and controlled by the user. here points to the latest 'free' (unused) cell of memory.

+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+---+
  ^ here

if you want to 'allocate' (use) some memory, then all you need to do is move here forward. you can put a value in that memory that you just used.

+---+---+---+---+---+---+---+---+---+
| 1 |   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+---+
      ^ here

if you want to make some memory usable again, you can simply move here backwards.

+---+---+---+---+---+---+---+---+---+
| 1 |   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+---+
  ^ here

moving where here points to is analogus to the malloc and free functions in C. luckily, you don't ever have to set here manually in forth. you have two words that do this for you: , and allot.

comma and allot

comma and allot are the main two ways you'll allocate memory. both of them work by moving here forward. comma was referred to as 'compile' earlier, because it is often used in the context of creating new definitions. forth dictionary definitions are sequences of pointers in memory. the process of compilation is turning words into pointers, then calling comma on them. comma works by copying a cell of memory from the stack, and placing it into the latest free cell of memory(where here points to), then incrementing here.

+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+---+
  ^ here
becomes
+---+---+---+---+---+---+---+---+---+
| 1 |   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+---+
      ^ here

allot is our second word for dealing with the dictionary space. it allocates by bytes, and returns a pointer to the region you allocated. it is often used with the word cells, which multiplies by the amount of bytes in a cell. the word cells allows you to write something like 4 cells allot, which is understandable, even if you don't know the underlying meaning of each word. allot works like this:

+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+---+
  ^ here
becomes
+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+---+
  ^ pointer   ^ here

now that we know what the words do, we can implement and use them.

the words that we kinda just forgot about

the implementation of comma is pretty simple.

: , here @ ! 1 cells here +! ;

you'll notice that this really isn't that much code - less than a line, in fact! but what do they mean?

the first part of the definition is the colon (:) followed by a comma. this starts a definition for the word comma. the next 'segment' of code is the here @. as explained previously, here is a pointer that contains a pointer to the latest free cell of memory. forth variables are pointers to values, and ! @ are the operations for them. ! takes a value and a pointer on the top of the stack, and writes the value to the pointer. as forth variables are simply words that place pointers on the stack, you can try this for yourself using the following code:

variable a
1 a ! \ Sets  a to 1.
a @ \ Fetches the value of a (1)

so, here @ fetches a pointer to the latest free cell of memory. our stack has a pointer on top, and a value beneath it. we then call ! to put the value into the pointer. this leaves our memory looking like this:

+---+---+---+---+---+---+---+---+---+
| 1 |   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+---+
  ^ here

you'll notice that here is pointing at a piece of memory that is being used already! this isn't something we want, as if we call comma again, it'll just overwrite our value. we need to increment here. luckily for us, we have a word for that - +!. +! acts identically to !, but rather than setting a value, it increments the value stored in the pointer by the amount. if we go back to our code example with a, the following will set a to 3, as we've incremented a by 2, like the operator +=.

2 a +!

with this knowledge in hand, we can understand the rest of the definition of comma. we get the byte value of a single cell, and increment the value stored in here by that, so that here now points to the next free cell of memory, and all is good in the world.

+---+---+---+---+---+---+---+---+---+
| 1 |   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+---+
      ^ here

the next tool at our disposal in the memory allocation toolkit is allot. allot works somewhat similarly to C's malloc - it takes a size, and returns a pointer to a region of that size. however, unlike malloc, our implementation can also fit in a single line.

: allot here @ swap here +! ;

we already know what each of these words do individually, but not as a whole! lets break it apart. allot takes a single argument - a size, and returns one element - a pointer. to allocate some memory, we increment here by that size argument. so, we save the value of here

+---+---+---+---+---+---+---+---+---+
|   |   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+---+
  ^ here, our pointer

then place the saved value beneath our size, via swapping. our stack looks like so:

size
pointer

to complete the allocation, we need to move here forward, so we place here on top of the stack, and increment it via +!. this leaves only a pointer to a region of 'free' memory on the stack! with this toolkit, we can start to write an array data structure.

the array

for us, an array is a sequence of cells in memory. this is, as you can expect, somewhat simple to implement. in fact, creating an array in any forth is as simple as writing the code

n cells allot

which just leaves a pointer to an array of n cells. want to index into the array? you just add that many cells to the pointer! remeber, this is forth. we don't have data types here. however, manual pointer arithmetic and manually allocating quickly gets boring. lets write some words to help us out. the first word we will write is a 'defining word' - one that creates arrays for us, so we don't have to do it ourselves.

: array cells allot constant ;

pretty simple so far! the word constant is one that takes a value from the stack, and reads the next space seperated token in from the source file (the next word). it then defines a word that pushes the value to the stack. the reason it isn't reading ; is because we are compiling it, rather than executing it. when we call array, the word constant gets called, which reads in the name for our array, and defines a word of that name that uses the pointer we just pushed. you use the word like this:

n array name

even with this word, we're still doing pointer arithmetic manually. lets address this by making words to read and write from the array:

: index ( ptr index -- value )  cells + @ ;
: write ( value ptr index  -- ) cells + ! ;

you'll notice that i've added stack comments (denoted by the brackets) - these tell you what arguments the word takes and returns. i've violated the typical standard of stack comments, so that you can understand what mine mean more easily.

the index word simply fetches an element from the array. it can be pretty trivially visualised:

+---+---+---+---+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |   |   |
+---+---+---+---+---+---+---+---+---+
  ^ pointer

the cells + bit simply does the following to the pointer, in both cases.

+---+---+---+---+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | 6 | 7 |   |   |
+---+---+---+---+---+---+---+---+---+
      ^ pointer

and that's basically all we need, at least for our very limited array model. there's a large set of limitations here - it doesn't have a length, so we can't iterate over it. without knowing it's length, we can't bounds check it either! you can just overflow into all the memory. this is of course, solveable.

arrays part two: tokyo arrays

it's time for a refactor of our code. the first issue we're going to tackle is storing the length of the array. if we were so inclined, we could create a variable to store it ourselves, and use that for our operations. we however, are not so inclined.

: array dup , cells allot constant ;

only two words added, and we're storing metadata - we've implemented a proper 'data structure' now. but what does this actually do? if you remember, the word comma compiles a value into a cell. the array word takes a single argument from the stack: the length. we store the length in the memory before the pointer! calling array with 7 will thus leave our memory looking something like this:

+---+---+---+---+---+---+---+---+---+
| 7 |   |   |   |   |   |   |   |   |
+---+---+---+---+---+---+---+---+---+
  ^   ^ pointer               ^ here
  length

the length of the array is one cell away from our array's pointer, meaning we can pretty easily write a word to get the length of the array.

: length cell- @ ;

now that all the work is done on the side of the creation of the array, lets turn to words that access the array. the only difference between the two words is the final operation they use - @ or !. this is an oppourtunity for factorisation, a key tenet of writing forth. we can hide our bounds check in a word that does cells + for us.

: (index) ( ptr index -- ) over length over < if abort" array index exceeded!" then cells + ;

the first thing you'll notice about this word is the brackets around the name. this is yet another forth convention, wherein 'internal' words are surrounded in brackets. this works because in forth, brackets have no special meaning. they're a word like anything else. the next thing you'll notice is probably the cells +. the code hasn't really changed, it's just been moved about. the core operation of our (index) method is still doing the arithmetic on the pointer.

of course, the bounds checking happens in the rest of the code. the word 'over' takes the 2nd stack element, and duplicates it on top. so, we 'over' the pointer to the array, and calculate it's length, using the word we wrote earlier. this prevents repetition, and generally makes code less a soup of symbols. we then 'over' the index, as our length is now on top of it. we compare the index against the length, leaving a boolean value on the stack.

at this point, if you aren't familiar with forth, the if statement will definitely confuse you. in forth, the if statement takes its boolean from the stack, and uses then to conclude the body, rather than something like end or fi. our if's body contains another new word. 'abort"'. abort" is simply a word that takes a string from the input stream, presents it as an error message, then exits the program. the rest of the word can be summarised as presenting the user an error if the index is greater than the length. after the if condition, we see the familiar code: cells +. this word just checks before it does the addition. we can now rewrite our index and write words in terms of (index), as it acts the same as the previous cells + code.

: index (index) @ ;
: write (index) ! ;

with this, we have implemented a bounds checked array in forth, in 5 lines of code!

errata

there's still a few things to note about our array implementation: it doesn't handle negative indexes, it doesn't handle data types other than the cell, it's error handling approach could be improved, etc. however, i'm confident that the readers of the article should be able to implement these things, and possibly more! with all main article concluded with, we can have a tiny bit of fun with this array implementation. for instance, we can make it one-indexed:

: th 1- index ;

we can have our cake, and eat it too! all this code does is make the index 0 based, then invoke the existing 0-indexed method. unlike other languages, we don't have to pick between the two! the syntax for both is identical, and easily changed - the power goes to the user here.

our array implementation sadly doesn't do a good job of setting default values. the observant among you may've realised that our 'cells allot' call does the exact same thing as writing , and a number that many times. this presents another approach to making arrays where the syntax looks somewhat like this:

array name 1 , 2 , 3 , 4 , 5 , 6 ,

the implementation of array in this context is left to the reader, as it is outside the scope of this article, as it'd require dealing with the various idiosyncracies in forth implementations to do correctly, nor is it bounds-checked.

if there are any mistakes you notice in the article, or any questions, please feel free to contact me.

back