Contents > General-purpose Computer


The assembler for the general-purpose computer converts an assembly language program into a binary file. It consists of a lexer and a code generator. As the lexer translates the program into a sequence of tokens, it builds a table that maps constants to values, and labels to addresses. The table enables the code generator to replace symbolic operands with their numerical equivalents as it transforms the tokens into machine code bytes. The code generator appends two final bytes to the binary file containing the program’s entry point address.
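As a rough illustration, the symbol table can be modeled as a dictionary that the code generator consults whenever it meets a symbolic operand. The function names and entries below are hypothetical, not the assembler's actual data structures:

```python
# Hypothetical sketch of symbol substitution during code generation.
# The lexer records constants and label addresses in a table; the code
# generator later replaces symbolic operands with their numeric values.

symbols = {}

def define_constant(name, value):
    symbols[name] = value

def define_label(name, address):
    symbols[name] = address

def resolve(operand):
    """Return a numeric operand, substituting from the symbol table."""
    if isinstance(operand, str):
        return symbols[operand]
    return operand

define_constant("drawFrame", 0x00FD)   # example constant from the text
define_label("main", 0x0300)           # hypothetical label address

print(resolve("drawFrame"))  # 253
print(resolve("main"))       # 768
print(resolve(42))           # literal operands pass through unchanged
```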

A separate process generates CYCLE_LEFT and CYCLE_RIGHT MC programs based on the binary file’s size.

The simulator for the general-purpose computer loads the binary file (the starting memory state), the generated and hand-coded MC programs (the microcode), and the TS programs (the logic circuits). It writes the machine code bytes onto the floor of an empty playfield, followed by two padding zeros, and the 21-byte state register. It initializes the state register to all zeros, except for a and P, to which it assigns the maximum address and the entry point address, respectively.

For the example program, the assembler produces a binary file containing 1,021 bytes. The simulator loads the file, and it allocates a playfield with:

(1,021 bytes − 2 bytes + 2 bytes + 21 bytes) × 8 bits/byte × 10 columns/bit = 83,360 columns.
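As a quick check of that arithmetic:

```python
# Playfield width for the example program: the machine code bytes, minus
# the two entry point bytes, plus two padding zeros, plus the 21-byte
# state register, at 8 bits per byte and 10 columns per bit.
binary_size = 1021
columns = (binary_size - 2 + 2 + 21) * 8 * 10
print(columns)  # 83360
```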

The simulator pops open a window to display the program’s graphics and to listen for keyboard events. Then it enters an infinite loop, where it perpetually alternates between actualizations of CYCLE_LEFT and CYCLE_RIGHT. At the end of each cycle, the simulator examines the surface of the pile in the columns corresponding to memory address 00FD, the location of the drawFrame flag. When the flag indicates a frame is ready for display, the simulator copies the visible playfield from the pile to the window, and it copies the key statuses from the window to the pile.
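That outer loop can be sketched as follows, with the cycle actualization and the window reduced to stubs. The class, the method names, and the schedule on which the stub raises the flag are all assumptions for illustration:

```python
# Hypothetical sketch of the simulator's outer loop: alternate between the
# two cycle programs, then check the drawFrame flag at address 00FD.
DRAW_FRAME = 0x00FD

class Simulator:
    def __init__(self):
        self.memory = bytearray(1024)
        self.cycles = 0
        self.frames_shown = 0

    def actualize(self, cycle_program):
        # Stub: a real cycle drops tetrominoes onto the pile; here we just
        # raise the flag every second cycle so the loop has work to do.
        self.cycles += 1
        if self.cycles % 2 == 0:
            self.memory[DRAW_FRAME] = 1

    def service_io(self):
        if self.memory[DRAW_FRAME]:       # frame ready for display?
            self.frames_shown += 1        # copy playfield to the window...
            self.memory[DRAW_FRAME] = 0   # ...and key statuses to the pile

sim = Simulator()
for cycle in ["CYCLE_LEFT", "CYCLE_RIGHT"] * 4:  # bounded for illustration
    sim.actualize(cycle)
    sim.service_io()
print(sim.frames_shown)  # 4
```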

Instead of executing every IL instruction, the simulator employs a memoized algorithm that computes the long-term evolution of the pile. On startup, the algorithm builds the structures described by the TS programs on a private playfield with all possible inputs, and it records the outputs into a set of lookup tables. The tables allow the algorithm to emulate the effects of building structures without actually building them. For example, here is the swap circuit for all possible inputs:


Rather than manipulating all those pieces, the algorithm inspects the input cells, it performs a table lookup, and it assigns the output cells, omitting everything in between:

swap memoized

Despite their seeming levitation, the output cells support structures built on top of them as if the swap circuit were really there.
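The technique amounts to replacing a structure build with an array index. A toy version for a 1-byte function: precompute the output for all 256 inputs once, then emulate with a lookup. The `simulate_circuit` body here is an invented nibble swap standing in for actually dropping tetrominoes; it is not the real swap circuit's behavior.

```python
# Toy memoization: build a lookup table for a 1-byte function over all
# 256 possible inputs, then emulate the circuit with array indexing.

def simulate_circuit(byte):
    # Hypothetical stand-in for building the structure piece by piece.
    return ((byte & 0x0F) << 4) | (byte >> 4)   # swap high and low nibbles

LUT = bytes(simulate_circuit(b) for b in range(256))  # computed once

def emulate(byte):
    return LUT[byte]    # no building, just a table lookup

print(hex(emulate(0xA3)))  # 0x3a
```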

To efficiently compute the lookup tables, the algorithm orders the TS programs into a dependency tree. It starts from those that have no dependencies, and it works outward through the tree. That order enables the algorithm to respond to a dependency with a lookup in a table it just computed. And it uses the looked-up value to skip the build via the technique above. To finish even faster, the algorithm parallelizes the process.
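One way to realize that ordering is a topological sort of the dependency graph: each program's table is computed only after the tables of everything it depends on exist. The graph below is hypothetical.

```python
# Compute lookup tables in dependency order. The TS program names and
# their dependencies are invented for illustration.
from graphlib import TopologicalSorter

depends_on = {
    "swap":  [],              # no dependencies: computed first
    "copy":  ["swap"],
    "adder": ["swap", "copy"],
}

tables = {}
for program in TopologicalSorter(depends_on).static_order():
    # Every dependency of `program` is already in `tables`, so building
    # its own table can respond to a dependency with a fast lookup.
    assert all(d in tables for d in depends_on[program])
    tables[program] = f"LUT for {program}"

print(list(tables))  # ['swap', 'copy', 'adder']
```

`TopologicalSorter` also exposes `prepare()`, `get_ready()`, and `done()`, which suit the parallelized variant: independent programs become ready at the same time and can be computed concurrently.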

The algorithm does not compute lookup tables for intermediate gates because the side effects of leaking tetrominoes cannot be determined in advance. However, the algorithm does compute lookup tables for circuits that employ intermediate gates because those circuits internally catch the leaked tetrominoes.

Rather than accommodating an ever-taller rectangular pile, the algorithm only keeps track of the pile's surface. For each MC instruction, the algorithm reads input from the surface, it performs a table lookup, and it stores output back to the surface, overwriting what it just read. The effect is the same as if the algorithm grew the pile, and it truncated everything below the surface.

The algorithm implements the surface as a one-dimensional byte array, s, where each byte contains the data normally spaced across eighty columns of the top two rows of the pile. For each MC instruction, f x, where f is the name of an n-byte function, the algorithm applies f's lookup table, LUT_f, to n consecutive bytes of the surface, starting at index x:

s[x‥x+n−1] ← LUT_f(s[x‥x+n−1])
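In code, that read-lookup-write step is a slice assignment. The sketch below uses a hypothetical 2-byte function, a byte swap, whose table maps each 16-bit input to a 16-bit output; any 2-byte function works the same way.

```python
# Apply an n-byte function's lookup table to n consecutive surface bytes,
# starting at index x, overwriting what was just read.

N = 2
# Hypothetical 2-byte function: exchange the two bytes of the input.
LUT = [((v & 0xFF) << 8) | (v >> 8) for v in range(2 ** (8 * N))]

def apply(surface, x):
    value = int.from_bytes(surface[x:x + N], "big")   # read
    result = LUT[value]                               # lookup
    surface[x:x + N] = result.to_bytes(N, "big")      # write

s = bytearray([0x11, 0x22, 0x33, 0x44])
apply(s, 1)
print(s.hex())  # 11332244
```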

That read-lookup-write operation is performant, but an n-byte function requires a lookup table of size (2^8)^n bytes; i.e., 1-, 2-, and 3-byte functions require 256 B, 64 KiB, and 16 MiB tables, respectively. However, as the algorithm computes a table for a 3-byte function, it monitors the output for a byte constrained to 0 and 1. If it finds such a byte, it assumes the corresponding input parameter is a Boolean, effectively reducing a 24-bit function to a 17-bit function. In the general case, that is a risky assumption. But in this particular situation, the first or third parameter of every 3-byte function is a Boolean, enabling the algorithm to safely cache them in 128 KiB tables.
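The arithmetic behind those sizes, including the Boolean reduction from 24 input bits to 17:

```python
# Table sizes from the text: an n-byte function has (2**8)**n possible
# inputs, and a Boolean byte contributes 1 input bit instead of 8.
sizes = {n: (2 ** 8) ** n for n in (1, 2, 3)}
print(sizes[1], sizes[2] // 2 ** 10, sizes[3] // 2 ** 20)  # 256 64 16

reduced = 2 ** (2 * 8 + 1)      # two full bytes plus one Boolean bit
print(reduced // 2 ** 10)       # 128 (KiB)
```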

The simulator’s speed—the number of machine cycles it simulates per unit time—is proportional to the total operations the machine carries out per cycle, or more specifically, to the lengths of CYCLE_LEFT and CYCLE_RIGHT. Since a process generates those MC programs based on the machine code size, the program will run faster if it can be made smaller.

The program’s playfield data region consumes nearly a quarter of the program's size at 253 bytes. If the program adopts boundary checks, it eliminates the need for solid row 22 and solid column 10 (the floor and walls, respectively), as well as rows 0–1 (the vanish zone). That brings the size down to 200 bytes. If the program packs each playfield cell into a nibble rather than a full byte, then it halves the region size. If the program reduces the palette to two colors, then each cell fits in a single bit, and the region drops to a mere 25 bytes, less than a tenth of its current size.
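The figures above follow from the playfield dimensions (rows 0–22 by columns 0–10, one byte per cell):

```python
# Playfield data region sizes under each representation discussed above.
full = 23 * 11                       # rows 0-22, columns 0-10, 1 byte/cell
trimmed = 20 * 10                    # drop floor, wall, and vanish zone
nibbles = trimmed // 2               # 2 cells per byte
bits = trimmed // 8                  # 1 bit per cell, two-color palette
print(full, trimmed, nibbles, bits)  # 253 200 100 25
```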

However, due to the computer's minimal instruction set, the additional code required to support an alternate playfield representation may increase the program size to the point that it negates the benefit of a smaller data region. Worse, such a change could decrease the frame rate because it requires extra code in the drawOrTestTetromino loop. At five subroutine calls per frame and four blocks per tetromino, the program amplifies the computational cost of anything in the loop twenty-fold.

On the other hand, if the program employed a smaller playfield data region, it would likely speed up line clears. There is currently a noticeable delay when clearing lines due to the amount of data copied. But if the program packed multiple cells into each byte, then it would copy cells in parallel, significantly hastening the process.
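A rough sense of the parallelism: with a two-color palette, each copied byte moves eight cells at once. The row counts below are assumptions for illustration, not measurements of the actual line-clear routine.

```python
# Copy operations to shift the rows above a cleared line down by one.
ROWS_ABOVE, COLUMNS = 19, 10

unpacked_copies = ROWS_ABOVE * COLUMNS            # one copy per cell
packed_copies = ROWS_ABOVE * ((COLUMNS + 7) // 8) # one copy per packed byte
print(unpacked_copies, packed_copies)  # 190 38
```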

At 112 bytes, the tetrominoes table is another possible target for program size reduction. Nearly a third of its rows are duplicates. Deleting them would save 36 bytes. However, the indirection table required to compensate for the missing rows demands 28 bytes. And indirection increases access time.

The program prioritizes simplicity over space. This text leaves to future researchers the optimal balance between data representation, instruction path lengths, and program size that maximizes the frame rate.
