So far the monkeys have been able to produce simple code to solve simple problems using a very simple machine language of just 24 instructions. I want to see what happens if we give them a more complex, real-world processor to play with.
Processor Choice
When deciding which processor to use I considered the following:
- It must have a relatively simple instruction set
- There must be existing code available for future ‘training’ experiments
- It should be well documented
- Bonus points if I have existing knowledge of it
Some of the options considered were:
Arm – This would have been a strong contender due to the widespread adoption of Arm processors, but the instruction set, even on the much older chips, just seemed a little too complex for this first attempt.
x86 – See above!
Atmel AVR 8 bit – This ticked most of the boxes. It’s relatively simple and was used in earlier Arduino boards, which means it probably has a diverse community associated with it, but I’m completely unfamiliar with the processor itself.
PIC 8 bit – Another simple microcontroller that’s apparently used in a lot of embedded devices. As with the AVR, though, I’m just not at all familiar with the processor.
Z80 – Like the PIC and the AVR, this easily ticks the first three boxes. This classic makes an appearance because of its use in the good ol’ Sinclair Spectrum. I have *some* knowledge of the processor itself and of Z80 assembler, which will help, but ultimately it wins because of the nostalgia trip that I’ll get to enjoy along the way!
Something tells me this is going to be a lot of work though….
Virtual Machine
A good place to start is to dive right into our new Z80 virtual machine and try to get some code running. I considered writing an emulator from scratch because it would’ve been a fun challenge, but it’s a LOT of work and thankfully someone introduced me to the ‘chips’ emulation project instead. This includes a full Z80 emulator, all contained in a single header file.
I wanted to keep the existing Arm-like support and potentially support more processors in future, so I created a base VM class with derived ARM-ish and Z80 versions. This is pretty trivial because the interface is so simple (Reset the machine, load a program, execute from a location in memory).
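The shape of it, as a rough sketch (the class and method names here are mine, not the project's actual ones):

```cpp
// Minimal sketch of the VM split (names are illustrative, not the project's).
#include <cstdint>
#include <vector>

class VirtualMachine {
public:
    virtual ~VirtualMachine() = default;
    virtual void Reset() = 0;
    virtual void LoadProgram(const std::vector<uint8_t>& code, uint16_t address) = 0;
    virtual void Execute(uint16_t startAddress) = 0;
};

class ArmishVM : public VirtualMachine { /* existing fixed 32-bit interpreter */ };
class Z80VM : public VirtualMachine { /* wraps the 'chips' Z80 emulator */ };
```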
The Z80 version then just wraps the ‘chips’ emulator. The most complex part is how we handle memory. In the emulator, read and write requests are made using pins and the address/data buses, and there’s no separation between code and data memory, but until now we’ve been handling program and data memory as two completely separate sets of memory (Harvard architecture, I believe).
To work around this we’ll hack the VM to ‘pretend’ that the program resides at the end of our data ROM+RAM. We’re basically using our Harvard architecture internally but hiding that implementation from the emulator.
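A minimal sketch of that mapping, with made-up sizes and names; the emulator just sees one flat address space:

```cpp
// Hypothetical read/write handlers: the emulator sees one flat address space,
// but internally the data ROM+RAM comes first and the program bytes are
// appended after it. Sizes and member names are made up for illustration.
#include <array>
#include <cstdint>

class Z80MemoryMap {
public:
    static constexpr uint16_t kDataSize = 0x4000;      // data ROM+RAM visible to the program
    static constexpr uint16_t kProgramBase = kDataSize;

    uint8_t Read(uint16_t addr) const {
        if (addr < kDataSize)
            return m_data[addr];                       // data ROM+RAM
        return m_program[addr - kProgramBase];         // program hidden after the data
    }

    void Write(uint16_t addr, uint8_t value) {
        if (addr < kDataSize)                          // program memory stays read-only
            m_data[addr] = value;
    }

private:
    std::array<uint8_t, kDataSize> m_data{};
    std::array<uint8_t, 0x10000 - kDataSize> m_program{};
};
```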
The whole implementation was actually pretty trivial and I was able to hack together a simple two-instruction program to confirm that things are working. Hurrah!
Disassembler
The next priority is the disassembler. If we’re going to use generators to produce machine code, we need an easy way to verify that the instructions are encoded correctly and to be able to read the solutions that our monkeys produce. Thankfully chips came to the rescue again: z80dasm.h
I don’t have support for labels yet, but this is a quick and easy starting point. The project also included an ImGui panel for displaying the emulator registers so I went with the same format for now.
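For reference, driving the disassembler is mostly a matter of feeding bytes in and collecting characters out through callbacks. This is my understanding of the z80dasm_op interface; the wrapper around it is purely illustrative:

```cpp
// Sketch of disassembling one instruction with z80dasm.h: the function pulls
// opcode bytes through an input callback and emits the text one character at
// a time through an output callback. The wrapper struct and function here
// are invented; only z80dasm_op and its callbacks come from the header,
// as I understand it.
#include "z80dasm.h"   // from the 'chips' project
#include <cstdint>
#include <string>

struct DasmContext {
    const uint8_t* memory;   // full address space
    uint16_t pc;             // current fetch address
    std::string text;        // disassembled output
};

static uint8_t DasmIn(void* user) {
    auto* ctx = static_cast<DasmContext*>(user);
    return ctx->memory[ctx->pc++];
}

static void DasmOut(char c, void* user) {
    static_cast<DasmContext*>(user)->text.push_back(c);
}

std::string DisassembleOne(const uint8_t* memory, uint16_t pc) {
    DasmContext ctx{memory, pc, {}};
    z80dasm_op(pc, DasmIn, DasmOut, &ctx);   // returns the address of the next instruction
    return ctx.text;
}
```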
Most of the work at this point, though, was refactoring parts of the VM system to move from a fixed 32-bit instruction length to the variable lengths required by the Z80. With those changes we can now run our simple, hand-coded example for the Write One Byte exercise and display the final state of the Z80 CPU:

Very cool. Thanks, Chips!
Code Generation
Now we need the monkeys to start generating Z80 code. This is the most complex part. Our previous Arm-ish instruction set used a very simple 32-bit encoding of an opcode and two operands. A real-world processor has more complex instruction encoding, which is nowhere near as consistent.
We need to be able to generate random instructions with random opcodes and operands, but then also modify those operands later using the genetic algorithm.
For the Z80, it wasn’t clear to me what exactly is considered an “instruction”. For example, DEC_A, DEC_B, DEC_C etc. encode the register (A, B, C…) into the opcode in a consistent manner, but not in the same way that CP_A, CP_B, CP_C etc. do; they use different bits in the opcode. Then in some sources I’m seeing CP_r treated as a single instruction with an encoded register operand, but DEC_r treated as one instruction per register. That seems inconsistent, but I don’t know the language well enough yet to understand the reasoning behind it.
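To make the difference concrete, this is my reading of the two encodings:

```cpp
// Both families put the register in the opcode byte, but in different bit
// positions (rrr = B, C, D, E, H, L, (HL), A):
//
//   DEC r : 0b00 rrr 101  -> register in bits 5..3
//   CP  r : 0b10 111 rrr  -> register in bits 2..0
#include <cstdint>

constexpr uint8_t DEC_B = 0x05, DEC_C = 0x0D, DEC_A = 0x3D;
constexpr uint8_t CP_B  = 0xB8, CP_C  = 0xB9, CP_A  = 0xBF;
```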
After going round in circles for days I finally just settled on this list of operations, which gives us around 320 instructions in total, compared to the 24 from the original Arm-ish instruction set, and seems to cover every possible opcode available on the CPU.
We also need to efficiently handle variable-length instructions. The genetic algorithm generator currently relies on them being a fixed 32-bit length, which is much more efficient when, for example, inserting or modifying an instruction in the middle of the code (i.e. knowing at which address in the code that instruction resides and how much space it uses).
For this reason I chose to retain the existing generic, 32-bit system for defining and storing instructions, but then encode those into the target platform code before execution. This requires fewer changes to the genetic algorithm generator, and the cost of encoding will be lower than decoding+encoding in the GA operators themselves.
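In sketch form (all names invented), the GA keeps chewing on fixed-size records and a Z80-specific encoder turns them into real bytes just before the program reaches the emulator:

```cpp
// Illustrative only: the GA manipulates fixed-size records; a Z80 encoder
// expands them into real, variable-length bytes before execution.
#include <cstdint>
#include <optional>
#include <vector>

struct GenericInstruction {
    uint16_t opcodeId;   // index into the ~320-entry Z80 operation table
    uint8_t  operand0;   // register index, immediate value, or instruction index for branches
    uint8_t  operand1;
};

// Returns the encoded program, or nothing if any instruction fails to encode.
std::optional<std::vector<uint8_t>> EncodeZ80(const std::vector<GenericInstruction>& program);
```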
The next step is to figure out a way to define each of the 320 instructions, including the information required to encode it:

For a quick test I added just 4 instructions. It’s wonderful to see it work, even if all it’s doing is shuffling data between the accumulator and memory:

Then I had to manually add every other instruction definition and test the encoding for each one to make sure that it’s working properly. This was ridiculously time-consuming and painfully tedious, and I feel that there must’ve been an easier way!
One discovery I made along the way is that branch destinations are more complex with variable-length instructions. When we’re inserting/removing instructions during code generation, we don’t want an existing branch to jump to an invalid location and execute garbage code. Instead, we store the destination operand as a higher-level instruction index and then resolve that to the correct memory address at encoding time. If we discover during encoding that the branch jumps to an invalid address then we just report it as an encoding failure.
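A minimal sketch of that resolution step (names invented, not the project's actual code):

```cpp
// Branch operands are stored as an instruction index and only resolved to a
// byte address during encoding, once every instruction's address is known.
#include <cstdint>
#include <optional>
#include <vector>

std::optional<uint16_t> ResolveBranchTarget(
    uint32_t targetInstructionIndex,
    const std::vector<uint16_t>& instructionAddresses)   // byte address of each encoded instruction
{
    if (targetInstructionIndex >= instructionAddresses.size())
        return std::nullopt;   // branch points past the program -> report an encoding failure
    return instructionAddresses[targetInstructionIndex];
}
```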

Virtual Machine Console
For the Arm-ish VM we faked a system call in order to emulate logging to a console. For the Z80 I figured we’d use a port. Basically, if anything is output to port 1 we’ll add it to the console.
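In the pin-based tick loop that means watching for an I/O write to port 1. Roughly like this, based on my understanding of the chips pin interface, with the surrounding VM code invented:

```cpp
// Illustrative console hook, assuming the pin-based interface from the
// 'chips' z80.h header (z80_tick, Z80_IORQ, Z80_WR, Z80_GET_ADDR,
// Z80_GET_DATA); the VM members here are made up.
#include "z80.h"
#include <string>

struct Z80ConsoleVM {
    z80_t       m_cpu{};
    uint64_t    m_pins = 0;
    std::string m_console;

    void Tick() {
        m_pins = z80_tick(&m_cpu, m_pins);
        // An I/O write has both IORQ and WR active.
        if ((m_pins & (Z80_IORQ | Z80_WR)) == (Z80_IORQ | Z80_WR)) {
            const uint8_t port = Z80_GET_ADDR(m_pins) & 0xFF;   // low address byte = port number
            if (port == 1) {
                m_console.push_back(static_cast<char>(Z80_GET_DATA(m_pins)));
            }
        }
        // ... memory requests etc. handled elsewhere ...
    }
};
```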
I also made a small improvement to the Print String Exercise scoring to reduce the score given for each character that is printed beyond the desired string length. This penalises solutions that print unnecessary characters, including white space, and helps us to avoid incorrect/hacky solutions.
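The tweak is roughly this shape (the numbers and names are placeholders, not the actual scoring code):

```cpp
// Toy version of the scoring change: matching characters add to the score,
// and every character printed beyond the target length subtracts a little.
#include <algorithm>
#include <cstddef>
#include <string>

float ScorePrintedString(const std::string& printed, const std::string& target)
{
    float score = 0.0f;
    const std::size_t common = std::min(printed.size(), target.size());
    for (std::size_t i = 0; i < common; ++i) {
        if (printed[i] == target[i]) score += 1.0f;               // existing per-character reward
    }
    if (printed.size() > target.size()) {
        score -= 0.5f * float(printed.size() - target.size());    // penalty for extra output
    }
    return score;
}
```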
Stack
All of the Z80 stack operations are implemented but I haven’t added any explicit stack support to the VM. In theory, there’s nothing to be done except perhaps adding a bit more memory to the VM configuration for any Exercise that might want to make use of a stack.
In future it would be interesting and relatively simple to pre-populate the stack with parameters, such as the desired sequence length, before executing the program. That would get us closer to supporting subroutines, but that might be a whole future post in itself.
Metrics
Exercises used to score solutions based on the number of instructions executed. Since the Z80 emulator operates in cycles and each instruction requires a different number of cycles, I figured it would be more accurate to measure those instead. The Arm-ish VM just treats one instruction as one cycle, though.
Similarly, program size was previously scored based on the number of instructions in the solution. Now, because of the variable-length instructions, we score based on the actual number of bytes the program occupies.
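Put together, a solution's cost now looks something like the sketch below. This is completely illustrative, but it's also roughly where a size-versus-speed bias (like the "Code Size Bias" setting mentioned later) would plug in:

```cpp
// Illustrative only: the two measurements described above, blended by a
// size-vs-speed bias. The real weighting in the project may differ.
#include <cstdint>

struct ExecutionResult {
    uint64_t cyclesExecuted;    // Z80: real cycles; Arm-ish: one cycle per instruction
    uint32_t programSizeBytes;  // actual encoded size, not instruction count
};

float SolutionCost(const ExecutionResult& r, float codeSizeBias /* 0 = fastest, 1 = smallest */)
{
    return (1.0f - codeSizeBias) * float(r.cyclesExecuted)
         +  codeSizeBias         * float(r.programSizeBytes);
}
```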
Parlez-vous Z80?
All the pieces are in place, so let’s see how the monkeys perform.
Write Numeric Byte Sequence
This works pretty well! It takes longer to find a correct solution than our Arm-ish version (~10 seconds compared to the previous sub-second timing), but we’re dealing with 13 times the number of instructions, the additional encoding step and some cycle-based emulator overhead. Also, the Z80 version doesn’t have any of the ‘hints’ in the random code generator that were helping to reduce the number of garbage solutions produced by the Arm-ish version. It’s a promising start!
Write Char Byte Sequence
This one caused a lot of problems on the Arm-ish implementation so I’m not surprised that it took a whopping 31 minutes for the Z80. They *did* manage to find a solution though!

Copy Char Byte Sequence
Like the Arm-ish version, the solution ended up being a sloppy sequence of copies but the Z80 version took much longer (2+ mins compared with a few seconds).

Copy Char String
In this case we failed to find a solution, hitting the same evolutionary dead-end as the Arm-ish version. It might look like a success in the screenshot, but the code actually fails when running Test 2 (the Bonjour…. string).

Print Char String
Unlike the Arm-ish version, which suffered because of the very specific setup for the fake OS print call, the Z80 “Port 1” version was successful and the monkeys found a correct solution in under 2 seconds.
Given more time it will find a more optimised solution. Here’s a case with the “Code Size Bias” setting set to produce smaller code, which results in an 8-byte program that takes 512 cycles to execute:

Changing the setting to produce faster code results in a larger program of 28 bytes, but one which only takes 234 cycles to execute:

I don’t fully understand this solution though. It’s basically unrolled some of the loop in the previous example and then it’s relying on the parity flag to determine whether to loop further, but that’s an undocumented feature of the OUTI and OTIR instructions. I think there’s some combination of the B loop counter value and the character stored at a particular location in the string that’s causing the flag to be set and it just happens to work with the two sets of test data used in this Exercise.
Write Star Pattern
I had to give this one a try. As you might expect, we have a solution but it was just a matter of how long it would take….over 6 hours in this case!

Conclusion
Overall I’m pretty happy with the way this turned out. The implementation is reasonably tidy and the performance isn’t nearly as bad as I thought it would be given the added complexity. I find it mesmerising to watch the different approaches the monkeys take and how the code evolves as they work towards an optimised, correct solution. It also makes me confident that it won’t be too difficult to add support for additional processors in future.
The cool part now is that we’re working with a real, complete processor and that opens up new possibilities. Next, I want to look into replacing the random code generator with something more intelligent to see if we can improve the monkeys’ overall performance. It might involve playing some Spectrum games…yay!