The Blinking Computer will be a reasonably easy-to-build educational computer. As the name suggests, the idea is that you can write code and see it execute at every level: that means discrete components and lots of LEDs. The design has evolved many times since the 1980s. If you like this sort of thing you'll probably also like the Homebrew Computers Website and howcpuworks.
The objectives are:
The Blinking CPU uses a very simple design. There are eight registers, one of which is the program counter. On every instruction cycle the program counter is incremented, two registers are passed through the ALU and the result is stored in a third register. Each register access has an indirect mode, meaning that the value at the address held in the register can be used as an input to the ALU, and the ALU result can be stored at the address held in the destination register. Everything is 16 bits wide: the instructions, the registers, the address space and the words in memory.
The ALU only implements addition, subtraction and a few basic logic operations; there is no multiplication or division. There are carry, zero and negative flags. The addition and subtraction operations can be conditional, which implements control flow by conditionally writing to the program counter.
The CPU is predominantly built using Resistor-Transistor Logic (RTL) in a straightforward manner. However, there is a need to limit build time and parts cost, and so two areas err on the side of minimalism over understandability, even though the primary aim is education. The memory is a two-transistor DRAM cell using a MOSFET as a storage capacitor, and the ALU uses a novel adder to minimise the ripple carry delay. The intention is to simulate all parts with CircuitLab and so be able to explain them without the need for an oscilloscope.
The RAM uses a Raspberry Pi Pico. In some senses it would be ideal to build the RAM from discrete parts as well; however, the implementation of the registers already provides the educational aspect of this, and the goals on assembly time and cost override the ideal of using discrete transistors and LEDs everywhere.
As the assembly language has only one instruction, it's easy to learn and maps directly to what happens on the hardware. A minimal FORTH is provided so that complex systems may be thrown together quickly, as well as providing 32 bit operations and missing functionality such as multiply and divide. A very minimal language, compatible with Python syntax, is provided that compiles to FORTH. This is again kept very simple so that the process of compiling may be understood. All software, including the emulator, is open source.
The rest of this document starts at the transistor level and works through to high level programming. It's equally possible for students to start by writing some simple code in the emulator, inspect it at every level down to machine code, then see that machine code execute as Blinking LEDs.
At present the only writeup is on hackaday.io
The ALU implements the standard AND, OR, XOR bitwise operations, plus Logical Shift Right (LSR) by one.
Only addition and subtraction are supported as arithmetic operations; a carry input may be used.
Subtraction is just inversion and then addition with the appropriate carry-in. Addition/subtraction not only takes up the largest number of gates per bit, but is also the only operation where the bits are not processed independently. Ripple carry is implemented for simplicity (faster algorithms exist), which means that addition/subtraction is the slowest ALU operation. The worst case is where the carry has to propagate the whole length of the word. In every instruction cycle the program counter must be incremented, thus the ripple carry adder is the main constraint on the system speed.
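The inversion trick can be checked directly in Python. This is a sketch of the arithmetic only (the function names and the 16 bit width follow the text; this is not the project's emulator):

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1  # 0xFFFF for the 16-bit data path

def adc(a, b, carry_in):
    """16-bit add with carry; returns (result, carry_out)."""
    total = (a & MASK) + (b & MASK) + carry_in
    return total & MASK, total >> WIDTH

def sbc(a, b, carry_in):
    """Subtraction is inversion then addition: a - b = a + ~b + 1,
    where the '+1' comes from presenting carry_in = 1 to the adder."""
    return adc(a, (~b) & MASK, carry_in)

assert sbc(1000, 42, 1) == (958, 1)             # carry out set: no borrow
assert sbc(5, 10, 1) == ((5 - 10) & MASK, 0)    # carry out clear: borrow
```

Note that the same adder hardware serves both operations; only the second operand is inverted and the carry-in changed.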
Pass Transistor Logic can be used to reduce the complexity. For example, XNOR can be implemented with only two transistors. A full adder is then only two cascaded XNORs plus the carry mechanism. For inputs A and B with carry C we have:
|Input A||Input B||CarryIn C||D = A XNOR B||Out = C XNOR D||CarryOut|
|0||0||0||1||0||0|
|0||0||1||1||1||0|
|0||1||0||0||1||0|
|0||1||1||0||0||1|
|1||0||0||0||1||0|
|1||0||1||0||0||1|
|1||1||0||1||0||1|
|1||1||1||1||1||1|
CarryOut = A IF D ELSE C; that is, if D is zero then the carry is propagated, otherwise it's set to the value of A (or equivalently B, since A and B are equal whenever D is 1). This is the idea behind ''diode carry'', where the carry signal propagates only through fast diodes and so improves the worst case performance of the ALU. The diode carry idea is from 2017 but has yet to be implemented.
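As a sanity check, the XNOR adder and its carry mux can be modelled in Python and rippled across a word (an illustrative sketch of the logic only; the transistor-level behaviour is not modelled):

```python
def full_adder(a, b, c):
    # One bit: two cascaded XNORs plus the carry mechanism from the table.
    d = 1 - (a ^ b)             # D = A XNOR B
    out = 1 - (c ^ d)           # Out = C XNOR D
    carry_out = a if d else c   # CarryOut = A IF D ELSE C
    return out, carry_out

def ripple_add(x, y, width=16):
    # Chain the full adders; the carry ripples from bit 0 upwards.
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry

assert ripple_add(1234, 4321) == (5555, 0)
assert ripple_add(0xFFFF, 1) == (0, 1)   # worst case: the carry ripples the whole word
```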
All instructions are of the form
cnd [*]DST = [*]SRC0 OP [*]SRC1 where:
DST, SRC0 and SRC1 are registers or constants. There are 8 in total consisting of:
[*]A means either direct addressing, A, or indirect addressing, *A. Load immediate is X = *P. Jump absolute is P = *P, which loads P with the word following the instruction being executed.
|A = B AND C||bitwise AND|
|A = B OR C||bitwise OR|
|A = B XOR C||bitwise XOR|
|A = B ADC C||addition with carry|
|A = B SBC C||subtraction with carry|
|A = B LSR C||Logical Shift Right - currently only implemented for C = 1|
|A = HCF||WHATEVER HAPPENS, DO NOT CALL HCF!|
|A = B ADD C||addition without carry - conditionally executed|
|A = B SUB C||subtraction without carry - conditionally executed|
cnd is conditional execution, which may be applied to the ADD and SUB OPs; the four cases are given below. In practice, eq0 and ne0 are equally used, although one could be eliminated with the XOR instruction. Of gt0, ge0, lt0 and le0, only lt0 is implemented as it's the most common. The others tend to occur after a subtraction: reversing the subtraction order gives ge0, and setting the carry bit and using SBC gives gt0 and le0.
|(none)||always execute - this is the default case|
|lt0||execute if the previous ALU result was negative|
|eq0||execute if the previous ALU result was zero|
|ne0||execute if the previous ALU result was not zero|
OP is 4 bits in total; [*]DST, [*]SRC0 and [*]SRC1 are 4 bits each (3 bits to select one of the 8 registers plus 1 bit for indirection), giving a total of 16 bits.
|[*]DST = [*]SRC0||[*]DST = [*]SRC0 + 0||this runs the ALU and effectively ignores the second argument|
|[*]DST = word||[*]DST = *P, word||load immediate. The next word is the data to be loaded|
|JMP word||P = *P, word||loads P with the next word|
|CLC||0 = 0 ADC 0||clears the carry flag|
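A model of one instruction cycle makes the single-instruction format concrete. Everything below is an illustrative assumption (the register numbering, the operation names used as dictionary keys, and the omission of cnd and the flags are not the project's actual encoding):

```python
MASK = 0xFFFF
P = 0  # assumed index of the program counter register

def step(regs, mem, dst, src0, src1, op,
         ind_dst=False, ind_src0=False, ind_src1=False):
    """One cycle of cnd [*]DST = [*]SRC0 OP [*]SRC1 (cnd and flags omitted)."""
    regs[P] = (regs[P] + 1) & MASK                     # program counter increment
    a = mem[regs[src0]] if ind_src0 else regs[src0]    # [*]SRC0 operand
    b = mem[regs[src1]] if ind_src1 else regs[src1]    # [*]SRC1 operand
    result = {'AND': a & b, 'OR': a | b, 'XOR': a ^ b,
              'ADD': a + b, 'SUB': a - b, 'LSR': a >> 1}[op] & MASK
    if ind_dst:
        mem[regs[dst]] = result    # store at the address held in DST
    else:
        regs[dst] = result

regs, mem = [0] * 8, [0] * 65536
regs[1], regs[2] = 5, 7
step(regs, mem, dst=3, src0=1, src1=2, op='ADD')
assert regs[3] == 12 and regs[P] == 1
```

Writing to register P via the same path is what gives the unified instruction set: a jump is just an instruction whose destination happens to be the program counter.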
There are two levels to the state logic in the CPU. At the top level, the CPU executes an infinite loop of:
program counter increment and instruction fetch
execute from instruction register
Most processors ensure that instruction fetch and program counter increment happen concurrently with instruction execution. The Blinking CPU favours simplicity over speed: making this loop explicit allows the increment hardware to be reused. The instruction register is hard-wired as the write destination in the first phase of this loop.
For each instruction the low level pipeline is:
check the condition (cnd); if it is not met then skip the rest of this list (MCRESET)
put SRC0 on the register address bus (RADDR = SRC0)
if the instruction specifies *SRC0 then the Pico overrides the value with the word read from memory
put SRC1 on the register address bus (RADDR = SRC1)
if the instruction specifies *SRC1 then the Pico overrides the value with the word read from memory
The instruction pipeline is long. This is because the conventional READ, WRITE and ALU instructions have been merged into one single instruction. Speed is not an objective here; getting the design simple enough to understand and build is what counts. In contrast, here is the 6502 architecture and this is what it looks like when running.
Here the Pico is only acting as memory. Conventionally memory has an address bus and a data bus; however, the Pico has 24 GPIO pins available and so does not have enough pins for both. Hence, the address is read first, followed by a read/write on the data bus using the same GPIO pins. The Blinking CPU is set up for the Pico to read at every opportunity, whether the information is needed or not. This allows the Pico to run an emulation of the CPU for debugging and fault finding.
The next word in the instruction sequence can be loaded with
*P and used as a general 16 bit value. This needs a bit of hardware support to ensure that this 'load immediate' word is not also executed as an instruction in the next cycle. The processor contains a NOP flag which is set whenever an executed instruction contains *P, except when the destination is P. The NOP flag clears on the load of the next instruction. This doesn't skip the word that has been read; the processor still reads it but treats it as a conditional instruction that doesn't execute, which is an acceptable overhead.
The CPU has a 16 bit address space and 16 bit data space. A Raspberry Pi Pico is used to store the 65,536 words of memory as it is low cost and very well supported for educational use. The Pico doesn't have enough GPIO pins for separate address and data buses, so the pins are shared.
In future the Pico may:
FORTH is a very lightweight language, popular as a first language after assembler on very low resource hardware - so it's a perfect fit.
This project has a partial and fairly mutilated implementation of FORTH. No apology is made for this; FORTH is not a good language for learning to code. At this stage its use is to stress test the instruction set and the hardware, to find out what doesn't work, and to find out what features are missing or unused.
FORTH is normally self-hosted: there's an 'outer interpreter' written in FORTH that reads characters, splits them into tokens, looks up tokens in a dictionary and writes FORTH code. This project doesn't do that; it uses a far more conventional parse and compile approach. The compiler tools generate a memory image which is loaded into the memory, and the processor is reset to run the code.
Also, FORTH is normally implemented as threaded code with an inner interpreter, the details of which are often quite hard to understand. Hence we first present an interpreted implementation which sets out the data structures; the extra complexity of threading is discussed later. In both cases, a function call pushes the execution pointer onto the call stack (*C = A) and the execution pointer is then set to the first item in the called function (A = *A).
FORTH uses the term 'word' for a function/subroutine. We reserve the term 'word' for the more conventional usage of the natural data size, whatever it contains, and use the term 'function' for function/subroutine. In standard FORTH all functions take their arguments from the stack (B) and return results on the stack. We divide functions into primitive functions, which are written in assembler, and compound functions, which are lists of primitive and compound functions.
This is the easiest case to understand and so is presented first. Here is the interpreter, all the primitive functions (written in assembler) live below
asmForthBoundary, all the compound functions live above it:
*3 = asmForthBoundary
interpreter:
A = A ADD 1       # point to the next function to be executed
0 = *A SUB *3     # set flags to say if this is assembler or a compound function
lt0 P = *A        # if assembler then call it directly
*C = A            # if compound then store the current function on the call stack
C = C SUB 1       # make some space
A = *A SUB 1      # and start on the new list of functions
P = interpreter
Primitive functions must preserve the A, B and C registers (as they are needed as above) but may freely use *0, *1, *2 and *3. Primitive functions simply return by jumping to the
interpreter label as A and C keep track of what to do next. For example, in this scheme we can define
dup:              # duplicate the top of stack
*0 = *B           # copy top of data stack
B = B ADD 1       # increase data stack
*B = *0           # write duplicate
P = interpreter
A compound function looks like:
: dup2 dup dup ;
which compiles very simply to a list of addresses of functions:
dup2: :dup :dup :return
The return function just pops the call stack and sets A:
return:           # return from a list of functions
C = C ADD 1
A = *C
P = interpreter
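The whole interpreter scheme can be modelled in a few lines of Python (an illustrative sketch: the addresses, the dictionary-as-memory and the None sentinel are assumptions, not the project's memory layout):

```python
ASM_BOUNDARY = 100     # primitives live below this address, compound lists above
DUP, RETURN = 1, 2     # hypothetical primitive addresses

def dup(stack):        # primitive: duplicate the top of the data stack
    stack.append(stack[-1])

def run(mem, primitives, start, data):
    a = start - 1              # A: execution pointer (pre-incremented below)
    call_stack = []            # C: the call stack
    while True:
        a += 1                          # point to the next function
        target = mem.get(a)
        if target is None:              # sentinel used here to stop the model
            return data
        if target == RETURN:            # return: pop the saved execution pointer
            a = call_stack.pop()
        elif target < ASM_BOUNDARY:     # assembler: call it directly
            primitives[target](data)
        else:                           # compound: save A and descend into the list
            call_stack.append(a)
            a = target - 1              # -1 because of the pre-increment

mem = {200: DUP, 201: DUP, 202: RETURN,   # dup2: a list of function addresses
       300: 200, 301: None}               # top level: call dup2, then stop
assert run(mem, {DUP: dup}, 300, [5]) == [5, 5, 5]
```

Stepping through this with dup2 mirrors exactly what the assembler interpreter does with A, B and C.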
Most FORTHs use some form of threaded code where the tail of one function looks up where to go next and jumps directly to that. This avoids jumping back into the interpreter just to increment the execution pointer and then jumping out again. This is easiest to understand in a primitive function as all that is needed is to increment the execution pointer to point to the next function and jump to it.
dup:              # duplicate the top of stack
*0 = *B           # copy top of data stack
B = B + 1         # increase data stack
*B = *0           # write duplicate
A = A + 1         # NEXT: increment the instruction pointer
P = *A            # NEXT: jump directly to the next function
Here the code has a boilerplate of two instructions at the end whereas the interpreter code only had one; however, there were also three instructions to call the primitive function. Consequently the threaded version is faster, as its 2 instruction overhead is less than the 4 for the interpreter. In practice, some of the
A = A + 1 can be eliminated so threading primitives has very low overhead for this instruction set.
For compound functions we must start up the 'inner interpreter' on every function:
dup2:
*C = A + 1        # docolon: push the incremented execution pointer onto the call stack
C = C - 1         # docolon: increase the (descending) call stack
A = P + 1         # docolon: execution pointer to start of function list
P = *A            # docolon: start executing the first function
:dup
:dup
:return
The return function is almost the same and executes in the same number of instructions:
return:           # return from a compound function
C = C + 1
A = *C
P = *A
Comparing the two implementations for compound functions, the interpreter has 7 instructions in the loop whereas the threaded implementation only takes 4 instructions to start; therefore the threaded implementation is faster for both primitive and compound functions. As there is more boilerplate code for the threaded implementation the memory usage will be greater, but hey, there's a whole 64k words to play with and that's massive.
While a primitive function is executing, *C is empty, so it's possible to keep the instruction pointer in *C and not use A. However, an indirect through the instruction pointer is needed in the NEXT code, so not using A increases this code from two lines to three (*C = *C ADD 1; A = *C; P = *A). If A is needed in a primitive function it can always be stored in *C and *C used in NEXT without penalty, so that's what's done as it results in shorter or same length code.
Most functions operate on a word and are independent of what that word contains. For integers, S1, S2, U1 and U2 are supported, where S/U denotes signed and unsigned and the digit indicates the number of words used; U4, for example, would be a 64 bit unsigned integer. The floating point types are F2 and F3. F2 is IEEE single precision and F3 is a faster, unpacked format with extended range. Operations are implemented in F3, with F2 only supporting conversion to/from F3.
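How a multi-word type maps onto 16 bit memory cells can be sketched in Python; the low-word-first ordering here is an assumption for illustration, not the project's definition:

```python
MASK = 0xFFFF

def to_s2(value):
    """Pack a 32 bit signed integer (S2) into two 16 bit words, low word first."""
    v = value & 0xFFFFFFFF                 # two's complement bit pattern
    return [v & MASK, (v >> 16) & MASK]

def from_s2(words):
    """Unpack two 16 bit words back into a signed 32 bit integer."""
    v = words[0] | (words[1] << 16)
    return v - 0x100000000 if v & 0x80000000 else v

assert to_s2(-123456) == [0x1DC0, 0xFFFE]
assert from_s2(to_s2(-123456)) == -123456
```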
SIMPL is a SIMPL IMperative Procedural Language. It's deliberately minimal and very very simple - the aim is to show how languages and compilation work in an easily digestible way. It looks like a tiny subset of Python.
The language implements the absolute minimum and so can be mastered very quickly. Variables can be global or local to functions, and only fixed-size arrays are supported. Types are as in FORTH, with the default being a 32 bit signed integer (S2). There are ten infix operators (+, -, *, /, ==, !=, <, >, <= and >=) for familiarity and readability, but no operator precedence, so evaluation order must be explicitly stated using parentheses.
while/break/continue exist for control flow, using indentation for scope, and that's it. The code below was the first real use of the compiler (04Aug23):
# algorithm from https://bisqwit.iki.fi/story/howto/bitmath/#DivAndModDivisionAndModulo
def divmodU2(dividend, divisor):
    multiple = 1
    while divisor < dividend:
        divisor = divisor + divisor
        multiple = multiple + multiple
    result = 0
    while multiple != 0:
        if dividend >= divisor:
            dividend = dividend - divisor
            result = result + multiple
        divisor = lsr2(divisor)
        multiple = lsr2(multiple)
    return result # , dividend
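Because SIMPL stays close to Python syntax, the function above can be exercised directly in Python by supplying the one missing primitive; lsr2 is assumed here to be a logical shift right by one, matching the LSR description earlier:

```python
def lsr2(x):
    # assumed semantics: logical shift right by one bit
    return x >> 1

def divmodU2(dividend, divisor):
    multiple = 1
    while divisor < dividend:
        divisor = divisor + divisor
        multiple = multiple + multiple
    result = 0
    while multiple != 0:
        if dividend >= divisor:
            dividend = dividend - divisor
            result = result + multiple
        divisor = lsr2(divisor)
        multiple = lsr2(multiple)
    return result  # the final value of dividend is the remainder

assert divmodU2(100, 7) == 14
assert divmodU2(65535, 255) == 257
```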
A variable is declared and its type assigned on first usage. Subsequent usages should not specify the type. Global variables are visible in functions and must not conflict with local variable names or function argument names. Function arguments are assumed to be S2 unless a type is provided.
The target audience for the Blinking Computer is learners in Electronics and Computer Science, and Python is regarded as the best first language to learn. Therefore SIMPL was designed to be as compatible with Python as possible in order to minimise the conflicts in learning both of them. However, SIMPL also has the goal of being a very small self-contained language where it is possible to understand all the steps in compilation to bytecode (FORTH), machine code, and instruction execution down to individual states and blinking lights. There are inevitable differences:
Therefore it is worth listing the pros and cons of keeping the two aligned:
It will be open for review whether this 'almost a subset of Python' is the right way to go.
Q: Why a general purpose register based architecture not an accumulator based architecture?
A: The accumulator based architecture was tempting; it keeps the complexity down, which keeps the build time and cost down. However, this only happens if the number of registers is reduced to the accumulator, the program counter, the status register and pretty much nothing else. It was important that the assembly level should feel like a modern processor with general purpose registers. The current design only has four, which should be roughly in balance with the complexity of the ALU; furthermore, general purpose registers allow for writing to the program counter and thus a unified instruction set.
The Blinking Computer is designed to be minimal with lots of holes to fill. It's a starting point for other work, some of which could be:
[*]DST = [*]([*]SRC0 OP [*]SRC1)