2026/03/30

16-bit CPU from Scratch in Kotlin (Part 2): Building the CPU

Translating CPU theory into a 16-bit Kotlin emulator: ISA decisions, execution loop, memory model, and testing.

This is Part 2 of the series.

In Part 1 we built the mental model of a CPU. In this post we turn that model into real Kotlin code from kotlin-cpu. If you enjoy the project, a ⭐ star on GitHub goes a long way and helps others find it.

It is highly recommended that you take a look at the repository code before you start reading, and keep it open while you go through this post. Seeing the real files next to the explanations makes the whole build process much easier to follow.

This is a full step-by-step build story: from assembly language design, to instruction encoding, to CPU state, to execution loop, to tests.

Step 1: Define The CPU Model

We start with a tiny and explicit machine model:

  • 64KB byte-addressed RAM
  • Little-endian word layout in memory (low byte first, high byte second)
  • 8 registers (R0..R7), each 16-bit
  • PC and SP special registers
  • Flags: zero, carry, negative
  • A running flag to stop on HALT

Core structure:

interface Memory<T> {
    operator fun get(address: Int): T
}
 
interface MutableMemory<T> : Memory<T> {
    operator fun set(address: Int, value: T)
}
 
@OptIn(ExperimentalUnsignedTypes::class)
class RAM(size: Int) : MutableMemory<UByte> {
    private val data = UByteArray(size)
 
    override operator fun get(address: Int): UByte {
        return data[address]
    }
 
    override operator fun set(address: Int, value: UByte) {
        data[address] = value
    }
 
    fun toUByteArray(): UByteArray {
        return data
    }
}
 
@OptIn(ExperimentalUnsignedTypes::class)
class RegisterFile(size: Int) : MutableMemory<UShort> {
    private val data = UShortArray(size)
 
    override operator fun get(address: Int): UShort {
        return data[address]
    }
 
    override operator fun set(address: Int, value: UShort) {
        data[address] = value
    }
 
    fun toUShortArray(): UShortArray {
        return data
    }
}
 
class Cpu {
	val memory: Memory<UByte> field = RAM(65536)
	val regs: Memory<UShort> field = RegisterFile(8)
 
	private var PC: UShort = 0u
	private var SP: UShort = 0xFFFEu
 
	private var zeroFlag: Boolean = false
	private var carryFlag: Boolean = false
	private var negativeFlag: Boolean = false
 
	private var running: Boolean = true
 
	...
}

Two details matter a lot:

  • R0 is hard-wired to zero (reads always return 0, writes are ignored).
  • Stack starts at 0xFFFE and grows downward in 2-byte steps.

Why make R0 always zero?

  • It gives you a free constant 0 in every instruction, without loading it first.
  • It simplifies instruction patterns. For example, MOV R3, R0 clears R3, and ADD R5, R2, R0 behaves like a copy from R2 to R5.
  • Ignoring writes keeps behavior deterministic: even if code does LI R0, 123, later reads from R0 are still 0.

This is a common CPU design trick: one special register makes both assembly code and hardware logic simpler.
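The R0 rule is easy to sketch in isolation. The class and method names below are illustrative (the real Cpu.kt folds this into its register read/write path), but the two guards are exactly the behavior described above:

```kotlin
// Illustrative sketch of a hard-wired zero register (hypothetical names,
// not the exact repo code).
@OptIn(ExperimentalUnsignedTypes::class)
class ZeroRegisterFile(size: Int) {
    private val data = UShortArray(size)

    // Reads of R0 always return 0, no matter what was written before.
    fun read(index: Int): UShort =
        if (index == 0) 0.toUShort() else data[index]

    // Writes to R0 are silently dropped, keeping behavior deterministic.
    fun write(index: Int, value: UShort) {
        if (index != 0) data[index] = value
    }
}
```

With this in place, `write(0, …)` followed by `read(0)` still yields 0, while every other register behaves normally.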

If you want to see the complete code for this step, check Cpu.kt.

Step 2: Decide The .kasm Language

We want a tiny assembly language that is easy to read and easy to parse.

Rules:

  • One instruction per line.
  • Labels end with :.
  • Comments start with ;.
  • Register names are R0 to R7.

Example:

; sum two numbers and branch
LI R1, 5
LI R2, 7
ADD R3, R1, R2
BEQ R3, R2, skip
HALT
skip:
HALT

Reader stage strips comments and empty lines first:

private fun readKasm(kasm: String): String {
	return kasm
		.lineSequence()
		.map { line -> line.substringBefore(';').trim() }
		.filter { it.isNotEmpty() }
		.joinToString("\n")
}

If you want to see more .kasm examples, check src/main/resources.

Step 3: Define ISA Encoding (16-bit)

The CPU uses fixed 16-bit instructions.

Base idea:

  • opcode = bits[15..12]
  • Remaining bits depend on instruction format.

Formats:

R-type: opcode[15..12] rd[11..9] rs1[8..6] rs2[5..3] aluOp[2..0]
I-type: opcode[15..12] rd[11..9] rs1[8..6] imm6[5..0]
J-type: opcode[15..12] imm12[11..0]
Stack : opcode[15..12] reg[11..9]

Instruction Encoding Explorer

Real `.kasm` lines and the exact 16-bit instruction words they encode to. As an R-type example, ADD R3, R1, R2 encodes to:

bits: 0000 011 001 010 000
hex: 0x0650

Field by field:

  • opcode, bits 15..12 = 0000 -> ALU family
  • rd, bits 11..9 = 011 -> R3 (destination)
  • rs1, bits 8..6 = 001 -> R1
  • rs2, bits 5..3 = 010 -> R2
  • aluOp, bits 2..0 = 000 -> ADD

Immediate behavior:

  • imm6 is sign-extended (-32..31)
  • imm12 is sign-extended (-2048..2047)
  • Branch/jump offsets are word-based (offset * 2)

Step 4: Implement The ISA Instruction Set

I kept the ISA small but complete enough for programs with arithmetic, memory, branching, and calls.

Mnemonic                 Opcode            Format  Semantics
ADD rd, rs1, rs2         0x0 + aluOp=000   R       rd = rs1 + rs2
SUB rd, rs1, rs2         0x0 + aluOp=001   R       rd = rs1 - rs2
AND rd, rs1, rs2         0x0 + aluOp=010   R       rd = rs1 & rs2
OR rd, rs1, rs2          0x0 + aluOp=011   R       rd = rs1 | rs2
XOR rd, rs1, rs2         0x0 + aluOp=100   R       rd = rs1 ^ rs2
MOV rd, rs1              0x0 + aluOp=101   R       rd = rs1
SHL rd, rs1, rs2         0x0 + aluOp=110   R       rd = rs1 << (rs2 & 0xF)
SHR rd, rs1, rs2         0x0 + aluOp=111   R       rd = rs1 >> (rs2 & 0xF)
ADDI rd, rs1, imm6       0x1               I       rd = rs1 + imm6
LI rd, imm6              0x2               I       rd = imm6
LUI rd, imm6             0x3               I       rd = imm6 << 8
LOAD rd, base, off6      0x4               I       rd = mem16[(base + off6) & 0xFFFE]
STORE rs, base, off6     0x5               I       mem16[(base + off6) & 0xFFFE] = rs
BEQ rs1, rs2, off6       0x6               I       if equal, PC += off6 * 2
BNE rs1, rs2, off6       0x7               I       if not equal, PC += off6 * 2
JMP off12                0x8               J       PC += off12 * 2
RET                      0x9               J       PC = pop()
PUSH rs                  0xC               Stack   push(rs)
POP rd                   0xD               Stack   rd = pop()
CALL off12               0xE               J       push(returnPC); PC += off12 * 2
HALT                     0xF               J       stop execution

Real Isa Code Layers (from kotlin-cpu)

The instruction table above is the contract. The Isa implementation is the engine that enforces that contract.

Layer 1 - wiring dependencies (Cpu, IsaDecoder, Alu, IsaDebugger):

class Isa private constructor(
	val cpu: Cpu,
	private val decoder: IsaDecoder,
	private val alu: Alu,
	private val debugger: IsaDebugger
) : IsaExecution {
 
	companion object {
		fun create(
			cpu: Cpu,
			debugger: IsaDebugger = IsaDebugger.None
		): Isa {
			return Isa(
				cpu = cpu,
				decoder = IsaDecoder,
				alu = Alu,
				debugger = debugger
			)
		}
	}
}

This layer is composition: it assembles the CPU state with decode logic, ALU logic, and optional debugging.

Layer 2 - fetch/decode/execute loop:

fun run() {
	context(decoder, alu) {
		while (cpu.isRunning()) {
			val inst = cpu.fetch()
			val opcode = decoder.decodeOpcode(inst)
			with(debugger) {
				cpu.debug(opCode = opcode, inst = inst, step = { op, inst ->
					executeInstruction(opCode = op, inst = inst)
				})
			}
		}
	}
}

This layer is control flow: fetch one word, decode opcode, execute, repeat until HALT flips running to false.

Layer 3 - opcode dispatch table:

context(decoder: IsaDecoder, alu: Alu)
private fun executeInstruction(opCode: OpCode, inst: UShort) {
	when (opCode) {
		OpCode.ALU -> cpu.alu(inst = inst)
		OpCode.ADDI -> cpu.addi(inst = inst)
		OpCode.LI -> cpu.li(inst = inst)
		OpCode.LUI -> cpu.lui(inst = inst)
		OpCode.LOAD -> cpu.load(inst = inst)
		OpCode.STORE -> cpu.store(inst = inst)
		OpCode.BEQ -> cpu.beq(inst = inst)
		OpCode.BNE -> cpu.bne(inst = inst)
		OpCode.JMP -> cpu.jmp(inst = inst)
		OpCode.RET -> cpu.ret()
		OpCode.PUSH -> cpu.push(inst = inst)
		OpCode.POP -> cpu.pop(inst = inst)
		OpCode.CALL -> cpu.call(inst = inst)
		OpCode.HALT -> cpu.halt()
	}
}

This layer is routing: one opcode maps to one CPU behavior.

Layer 4 - per-instruction semantics (example handlers):

context(decoder: IsaDecoder, alu: Alu)
fun Cpu.alu(inst: UShort) {
	val rd = decoder.decodeRd(inst)
	val rs1 = decoder.decodeRs1(inst)
	val rs2 = decoder.decodeRs2(inst)
	val a = readReg(rs1)
	val b = readReg(rs2)
	val result = alu.execute(
		operation = decoder.decodeAluInstruction(inst),
		a = a,
		b = b,
		updateCarryFlag = { carry -> updateCarry(carry = carry) }
	)
 
	updateZeroNegative(value = result)
	writeReg(index = rd, value = result)
}
 
context(decoder: IsaDecoder)
fun Cpu.call(inst: UShort) {
	val offset = decoder.decodeImm12(inst)
	push(pc { it })
	pc { pc -> (pc.toInt() + offset * 2).toUShort() }
}
 
fun Cpu.halt() {
	stopRunning()
}

This layer is the actual behavior of each instruction: register reads/writes, flags, stack, memory, and PC updates.

Pseudo instruction:

NOP ; expands to ADD R0, R0, R0

If you want to see the complete code for this step, check Isa.kt.

Step 5: Build The Encoder And Decoder

Encoder writes bit-fields into a UShort:

private fun encodeRegister(opcode: OpCode, rd: Int, rs1: Int, rs2: Int, aluOperation: AluInstruction): UShort {
	return (
		(opcode.code shl 12) or
		(rd shl 9) or
		(rs1 shl 6) or
		(rs2 shl 3) or
		aluOperation.operation
	).toUShort()
}
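Assuming opcode code 0x0 for the ALU family and sub-opcode 000 for ADD, the same bit packing can be checked in isolation with plain Ints (a standalone version of the packer above, not the repo's typed API):

```kotlin
// Standalone R-type packer; field values mirror ADD R3, R1, R2.
fun encodeRType(opcode: Int, rd: Int, rs1: Int, rs2: Int, aluOp: Int): UShort {
    return (
        (opcode shl 12) or  // bits 15..12
        (rd shl 9) or       // bits 11..9
        (rs1 shl 6) or      // bits 8..6
        (rs2 shl 3) or      // bits 5..3
        aluOp               // bits 2..0
    ).toUShort()
}

// ADD R3, R1, R2 -> 0x0650
val word = encodeRType(opcode = 0x0, rd = 3, rs1 = 1, rs2 = 2, aluOp = 0b000)
```

Because each field is shifted into its own bit range, the ORs never collide.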

Decoder extracts fields back out:

fun decodeOpcode(inst: UShort): OpCode {
	val code = (inst.toInt() shr 12) and 0xF
	return OpCode.entries.first { it.code == code }
}
 
fun decodeRd(inst: UShort) = (inst.toInt() shr 9) and 0x7
fun decodeRs1(inst: UShort) = (inst.toInt() shr 6) and 0x7
fun decodeRs2(inst: UShort) = (inst.toInt() shr 3) and 0x7

Sign extension for immediates:

fun decodeImm6(inst: UShort): Int {
	val imm = inst.toInt() and 0x3F
	return if (imm and 0x20 != 0) imm or -0x40 else imm
}
 
fun decodeImm12(inst: UShort): Int {
	val imm = inst.toInt() and 0xFFF
	return if (imm and 0x800 != 0) imm or -0x1000 else imm
}
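A quick way to convince yourself the sign extension is right: in a 6-bit field, 0b011111 is the largest positive value (31), 0b100000 is the most negative (-32), and 0b111111 is -1. A self-contained copy of the logic, checked against those boundary values:

```kotlin
// Same sign-extension logic as above, standalone so it can be checked directly.
fun imm6Of(inst: UShort): Int {
    val imm = inst.toInt() and 0x3F
    // If bit 5 (the sign bit of the 6-bit field) is set, fill the upper bits.
    return if (imm and 0x20 != 0) imm or -0x40 else imm
}

// 0b011111 -> 31, 0b100000 -> -32, 0b111111 -> -1
```

The imm12 case works identically, just with bit 11 as the sign bit and -0x1000 as the fill mask.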

Bitwise Encode / Decode

Building ADD R3, R1, R2 -> 0x0650 (binary 0000 011 001 010 000) one field at a time. Field layout: opcode [15..12], rd [11..9], rs1 [8..6], rs2 [5..3], aluOp [2..0].

  1. Place opcode in bits 15..12: 0x0 shl 12. shl 12 shifts the opcode value all the way to the top 4 bits. ADD belongs to the ALU opcode family, so opcode = 0x0 and all four bits are 0. Word so far: 0000 0000 0000 0000.
  2. Place rd in bits 11..9: 3 shl 9. shl 9 moves register index 3 (R3) into bit positions 11..9; OR-ing it with the running word fills those three bits without touching the others. Word so far: 0000 0110 0000 0000.
  3. Place rs1 in bits 8..6: 1 shl 6 moves register index 1 (R1) into bit positions 8..6. Word so far: 0000 0110 0100 0000.
  4. Place rs2 in bits 5..3: 2 shl 3 moves register index 2 (R2) into bit positions 5..3. Word so far: 0000 0110 0101 0000.
  5. Place aluOp in bits 2..0: the lowest field sits at positions 2..0 and needs no shift. The ADD sub-opcode is 000, so the word is unchanged: 0000 0110 0101 0000.
  6. Merge all fields with OR: 0x0000 or 0x0600 or 0x0040 or 0x0010 or 0x0000 = 0x0650, the final 16-bit instruction. Each field occupies a different, non-overlapping group of bits, so OR safely merges them all into one word with no collisions.

Step 6: Parse .kasm Into Encoded Program Words

In the repository, this step is implemented by KasmParser with a clean two-pass assembler flow.

The public entry keeps it simple:

class KasmParser private constructor(private val encoder: KasmEncoder) {
 
	companion object {
		operator fun invoke(program: String): List<UShort> {
			val encoder = KasmEncoder
			return KasmParser(encoder).parse(program)
		}
	}
 
	fun parse(program: String): List<UShort> {
		return context(encoder) { assemble(program = program) }
	}
}

Pass 0 tokenizes lines into either Label or Instruction:

private fun parse(program: String): List<AsmLine> {
	return program.lines()
		.map { it.trim() }
		.filter { it.isNotEmpty() }
		.map { line ->
			if (line.endsWith(":")) {
				Label(line.dropLast(1))
			} else {
				val parts = line.split(" ", ",")
					.map { it.trim() }
					.filter { it.isNotEmpty() }
 
				Instruction(parts[0].uppercase(), parts.drop(1))
			}
		}
}

Pass 1 resolves label addresses in bytes (pc += 2 per instruction):

private fun resolveLabels(lines: List<AsmLine>): Map<String, Int> {
	val labels = mutableMapOf<String, Int>()
	var pc = 0
 
	for (line in lines) {
		when (line) {
			is Label -> labels[line.name] = pc
			is Instruction -> pc += 2
		}
	}
 
	return labels
}

Pass 2 encodes instructions using KasmEncoder, while pc tracks the current instruction address:

context(encoder: KasmEncoder)
private fun assemble(program: String): List<UShort> {
	val lines = parse(program)
	val labels = resolveLabels(lines)
 
	val output = mutableListOf<UShort>()
	var pc = 0
 
	for (line in lines) {
		if (line is Instruction) {
			val inst = when (line.name) {
				"ADD" -> encoder.ADD(parseReg(line.args[0]), parseReg(line.args[1]), parseReg(line.args[2]))
				"JMP" -> encoder.JMP(parseImm(line.args[0], labels, pc))
				"BEQ" -> encoder.BEQ(parseReg(line.args[0]), parseReg(line.args[1]), parseImm(line.args[2], labels, pc))
				"HALT" -> encoder.HALT()
				...
				else -> error("Unknown instruction: ${line.name}")
			}
 
			output.add(inst)
			pc += 2
		}
	}
 
	return output
}

The critical part is immediate parsing for labels (PC-relative, word-based offsets):

private fun parseImm(value: String, labels: Map<String, Int>, pc: Int): Int {
	return when {
		value.matches(Regex("-?\\d+")) -> value.toInt()
 
		labels.containsKey(value) -> {
			val target = labels[value]!!
			(target - (pc + 2)) / 2
		}
 
		else -> error("Unknown immediate: $value")
	}
}

Why this formula works:

  • pc + 2 means "the next instruction after the current one".
  • target - (pc + 2) gives byte distance from next instruction to label.
  • dividing by 2 converts bytes into 16-bit instruction-word offsets.

Label Offset

We want to encode this line: BEQ R1, R2, loop. The assembler must convert loop into a small number called an offset.

1) Where Are We In The Program?

Each instruction is 2 bytes, so addresses move by +2.

0x0000: LI R1, 3

0x0002: LI R2, 3

0x0004: BEQ R1, R2, loop <-- current line (pc)

0x0006: HALT

0x0008: loop:

0x0008: ADD R3, R3, R1

2) Step-By-Step Math

The branch offset is computed from the next instruction, not the current one.

1. pc = 0x0004 (address of BEQ)

2. target = 0x0008 (address of label loop)

3. pc + 2 = 0x0006 (next instruction)

4. target - (pc + 2) = 0x0002 bytes

5. 0x0002 / 2 = 1 instruction word

Result: immediate offset = 1

3) Why pc + 2?

The CPU has already fetched the branch instruction, so the natural next position is pc + 2. Branch offsets are measured from that next position.

Offsets are stored in instruction words, not bytes. That is why we divide by 2 in (target - (pc + 2)) / 2.

Quick intuition: from 0x0006 to 0x0008 is one instruction ahead, so offset = 1.

Mini glossary: pc = current instruction address, target = label address, offset = how many instructions to jump.
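The whole formula fits in one small function, checked against the walkthrough above (BEQ at 0x0004 targeting loop at 0x0008 gives offset 1):

```kotlin
// PC-relative, word-based branch offset, measured from the *next* instruction.
fun labelOffset(pc: Int, target: Int): Int = (target - (pc + 2)) / 2

// Forward:  BEQ at 0x0004, loop at 0x0008 -> offset 1
// Backward: a label at 0x0000 seen from pc 0x0004 -> offset -3
```

Backward branches simply come out negative, which is why the immediate fields are sign-extended.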

If you want to see the complete code for this step, check KasmParser.kt.

Step 7: Implement Fetch, Stack, And Word Memory Rules

Memory is byte-addressed in this CPU, which means:

  • PC points to a byte address (not a word index).
  • each instruction is 16-bit = 2 bytes.

So one fetch must read two bytes:

  1. low byte at PC
  2. high byte at PC + 1

Then PC advances by 2, because we consumed 2 bytes (one full instruction).

Real fetch code:

fun fetch(): UShort {
	val low = memory[PC.toInt()].toUInt()
	val high = memory[(PC + 1u).toInt()].toUInt()
	PC = (PC + 2u).toUShort()
	return ((high shl 8) or low).toUShort()
}

Why ((high shl 8) or low) works:

  • high shl 8: moves the high byte into bits [15..8]
  • low: already occupies bits [7..0]
  • or: merges both halves into one 16-bit value

Example:

  • low = 0x34
  • high = 0x12
  • (high shl 8) = 0x1200
  • 0x1200 or 0x0034 = 0x1234

push and pop use the exact same byte-splitting and byte-joining logic, just in reverse order.

Push and pop are also 16-bit and little-endian:

fun push(value: UShort) {
	SP = (SP - 2u).toUShort()
	val addr = SP.toInt() and 0xFFFF
	memory[addr] = (value.toInt() and 0xFF).toUByte()
	memory[addr + 1] = ((value.toInt() shr 8) and 0xFF).toUByte()
}
 
fun pop(): UShort {
	val addr = SP.toInt() and 0xFFFF
	val low = memory[addr].toUInt()
	val high = memory[addr + 1].toUInt()
	val value = ((high shl 8) or low).toUShort()
	SP = (SP + 2u).toUShort()
	return value
}

Bitwise details inside push:

  • (value.toInt() and 0xFF): keeps only the low 8 bits (low byte)
  • (value.toInt() shr 8) and 0xFF: shifts high byte down, then masks to 8 bits

Bitwise details inside pop:

  • read low and high bytes from memory
  • high shl 8 puts high byte back in [15..8]
  • or low restores full 16-bit word

In short: fetch, push, and pop are consistent because they all follow the same little-endian rule:

  • low byte at lower address
  • high byte at next address
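Those rules can be exercised together with a tiny self-contained model of the stack (a plain UByteArray standing in for the 64KB RAM; this mirrors the push/pop code above but is not the repo class):

```kotlin
// Minimal stack model: SP starts at 0xFFFE and grows downward in 2-byte steps.
@OptIn(ExperimentalUnsignedTypes::class)
class MiniStack {
    private val mem = UByteArray(65536)
    private var sp = 0xFFFE

    fun push(value: UShort) {
        sp -= 2
        mem[sp] = (value.toInt() and 0xFF).toUByte()              // low byte first
        mem[sp + 1] = ((value.toInt() shr 8) and 0xFF).toUByte()  // high byte second
    }

    fun pop(): UShort {
        val low = mem[sp].toUInt()
        val high = mem[sp + 1].toUInt()
        sp += 2
        return ((high shl 8) or low).toUShort()
    }
}
```

Pushing 0x1234 and then 0xBEEF pops them back in reverse order, byte-exact, which is the property CALL/RET relies on later.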

Fetch And Bitwise Visualization

`PC` is a byte address. One instruction is 2 bytes, so fetch reads two bytes and then moves `PC` by `+2`.

A) Why PC + 2?

PC = 0x0010

memory[0x0010] = 0x34 (low byte)

memory[0x0011] = 0x12 (high byte)

fetch() consumes both bytes, then PC = 0x0012

Because each instruction is exactly 2 bytes, the next instruction starts at the next byte pair.

B) Join Two Bytes Into One Word

low = 0x34 = 0011 0100

high = 0x12 = 0001 0010

(high shl 8) = 0x1200 = 0001 0010 0000 0000

or low = 0x0034 = 0000 0000 0011 0100

result = 0x1234 = 0001 0010 0011 0100

C) Push/Pop Use The Same Rule

Push (split word)

low = value and 0xFF

high = (value shr 8) and 0xFF

Mask `and 0xFF` keeps only one byte.

Pop (join word)

value = (high shl 8) or low

Shift high byte into the top half, then merge with low byte.

Step 8: Build The ALU And Flag Updates

In the repository, all ALU behavior is centralized in Alu.execute, and the CPU instruction handler just feeds it operands and stores the result.

The operation enum is small and explicit:

enum class AluInstruction(val operation: Int) {
	ADD(0b000), SUB(0b001), AND(0b010), OR(0b011),
	XOR(0b100), MOV(0b101), SHL(0b110), SHR(0b111)
}

The ALU-backed CPU handler looks like this:

context(decoder: IsaDecoder, alu: Alu)
fun Cpu.alu(inst: UShort) {
	val rd = decoder.decodeRd(inst)
	val rs1 = decoder.decodeRs1(inst)
	val rs2 = decoder.decodeRs2(inst)
	val a = readReg(rs1)
	val b = readReg(rs2)
	val result = alu.execute(
		operation = decoder.decodeAluInstruction(inst),
		a = a,
		b = b,
		updateCarryFlag = { carry -> updateCarry(carry = carry) }
	)
 
	updateZeroNegative(value = result)
	writeReg(index = rd, value = result)
}

That structure is important:

  • decoder picks which ALU operation to run
  • CPU reads the source registers
  • ALU computes the result and reports carry behavior
  • CPU updates zeroFlag and negativeFlag
  • CPU writes the final value back to rd

Real Alu.execute examples from the project:

ADD and SUB:

AluInstruction.ADD -> {
	val r = a.toUInt() + b.toUInt()
	updateCarryFlag(r > 0xFFFFu)
	r.toUShort()
}
 
AluInstruction.SUB -> {
	val r = a.toUInt() - b.toUInt()
	updateCarryFlag(a.toUInt() < b.toUInt())
	r.toUShort()
}

ADD uses UInt so the code can detect overflow beyond 0xFFFF before converting back to UShort. SUB uses the same idea for borrow detection: if a < b, carry/borrow is recorded.
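Two boundary cases make the carry rules concrete. This standalone sketch uses the same UInt widening trick (hypothetical helper names, returning the result paired with the carry/borrow flag):

```kotlin
// ADD: widen to UInt, detect overflow past 0xFFFF, then truncate back.
fun addWithCarry(a: UShort, b: UShort): Pair<UShort, Boolean> {
    val r = a.toUInt() + b.toUInt()
    return r.toUShort() to (r > 0xFFFFu)
}

// SUB: borrow is simply "a < b" on the unsigned values.
fun subWithBorrow(a: UShort, b: UShort): Pair<UShort, Boolean> {
    val r = a.toUInt() - b.toUInt()
    return r.toUShort() to (a.toUInt() < b.toUInt())
}

// 0xFFFF + 0x0001 wraps to 0x0000 with carry = true
// 0x0001 - 0x0002 wraps to 0xFFFF with borrow = true
```

The truncation back to UShort is exactly the 16-bit wraparound a real register would perform; the flag is the only record that it happened.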

Bitwise ops are even simpler:

AluInstruction.AND -> {
	updateCarryFlag(false)
	a and b
}
 
AluInstruction.OR -> {
	updateCarryFlag(false)
	a or b
}
 
AluInstruction.XOR -> {
	updateCarryFlag(false)
	a xor b
}
 
AluInstruction.MOV -> {
	updateCarryFlag(false)
	a
}

These operations do not produce arithmetic carry in this CPU, so carry is explicitly reset to false.

Shift operations are slightly more interesting:

AluInstruction.SHL -> {
	val shift = (b.toInt() and 0xF)
	val r = a.toUInt() shl shift
	updateCarryFlag((a.toUInt() shl (shift - 1)) and 0x8000u != 0u)
	r.toUShort()
}
 
AluInstruction.SHR -> {
	val shift = (b.toInt() and 0xF)
	updateCarryFlag((a.toUInt() shr (shift - 1)) and 1u != 0u)
	val r = a.toUInt() shr shift
	r.toUShort()
}

Why b.toInt() and 0xF?

  • only the low 4 bits of the shift amount are used
  • that keeps shifts in the range 0..15
  • which makes sense for a 16-bit value
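The masking is easy to demonstrate with a couple of illustrative shift amounts: 17 (binary 10001) is masked down to 1, and 16 is masked down to 0:

```kotlin
// Only the low 4 bits of the shift amount are honored, so shifts stay in 0..15.
fun maskedShl(a: UShort, b: UShort): UShort {
    val shift = b.toInt() and 0xF
    return (a.toUInt() shl shift).toUShort()
}

// shift 17 acts like shift 1: 0x0001 << 17 -> 0x0002
// shift 16 acts like shift 0: the value passes through unchanged
```

Without the mask, a UInt shift by 16 or more would produce a value that makes no sense once truncated back to 16 bits.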

After execute returns, the CPU updates the remaining flags with real code from Cpu.kt:

fun updateCarry(carry: Boolean) {
	carryFlag = carry
}
 
fun updateZeroNegative(value: UShort) {
	zeroFlag = value == 0.toUShort()
	negativeFlag = (value.toInt() and 0x8000) != 0
}

So the flag rules are:

  • zeroFlag: result equals 0
  • negativeFlag: bit 15 of the 16-bit result is 1
  • carryFlag: reported by arithmetic and shift logic

Example intuition:

  • ADD 0x0001 + 0x0001 = 0x0002 -> zero=false, negative=false
  • SUB 0x0001 - 0x0001 = 0x0000 -> zero=true
  • a result like 0x8001 -> negative=true because bit 15 is set
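The zero/negative rules can be checked directly as two one-line predicates (standalone mirrors of updateZeroNegative):

```kotlin
// zeroFlag: the 16-bit result is exactly 0.
fun zeroOf(value: UShort): Boolean = value == 0.toUShort()

// negativeFlag: bit 15 (the sign bit of a 16-bit value) is set.
fun negativeOf(value: UShort): Boolean = (value.toInt() and 0x8000) != 0

// 0x0002 -> zero=false, negative=false
// 0x0000 -> zero=true
// 0x8001 -> negative=true
```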

ALU Operations & Flags

A concrete run through the ALU with inputs R1 = 0x0005 (5) and R2 = 0x0007 (7) and operation ADD:

  • Result: 0x000C (12)
  • Flags: zero=0, negative=0, carry=0

What happened: 5 + 7 = 12. The result is positive, non-zero, and fits in 16 bits, so no flags are set. Most instructions produce this clean outcome.

If you want to see the complete code for this step, check Alu.kt.

Step 9: Implement Instruction Execution Dispatch

At this point, the CPU already knows how to:

  • fetch a 16-bit instruction word
  • decode fields out of that word
  • run individual instruction handlers like alu, load, store, and call

So the executor loop can stay intentionally small. Its only job is to repeat the same cycle:

  1. fetch one instruction
  2. decode its top-level opcode
  3. route execution to the correct handler
  4. keep going until HALT stops the CPU

Core loop:

while (cpu.isRunning()) {
	val inst = cpu.fetch()
	val opcode = decoder.decodeOpcode(inst)
	executeInstruction(opCode = opcode, inst = inst)
}

This is the heart of the emulator: everything else plugs into this loop.

The nice design choice here is separation of responsibility:

  • fetch() knows memory layout and PC movement
  • decodeOpcode() knows instruction encoding
  • each CPU handler knows the semantics of one instruction family
  • the dispatch layer only decides where control goes next

Dispatch table:

when (opCode) {
	OpCode.ALU -> cpu.alu(inst = inst)
	OpCode.ADDI -> cpu.addi(inst = inst)
	OpCode.LI -> cpu.li(inst = inst)
	OpCode.LUI -> cpu.lui(inst = inst)
	OpCode.LOAD -> cpu.load(inst = inst)
	OpCode.STORE -> cpu.store(inst = inst)
	OpCode.BEQ -> cpu.beq(inst = inst)
	OpCode.BNE -> cpu.bne(inst = inst)
	OpCode.JMP -> cpu.jmp(inst = inst)
	OpCode.RET -> cpu.ret()
	OpCode.PUSH -> cpu.push(inst = inst)
	OpCode.POP -> cpu.pop(inst = inst)
	OpCode.CALL -> cpu.call(inst = inst)
	OpCode.HALT -> cpu.halt()
}

This table is basically the router of the CPU. If the opcode is ALU, execution goes to cpu.alu(inst). If it is LOAD, it goes to cpu.load(inst). If it is HALT, it goes to cpu.halt(), which flips the running flag and ends the loop.

That keeps the system easy to extend: adding a new instruction usually means two clear steps:

  1. define how it is encoded and decoded
  2. add one handler and one dispatch case

Step 10: Control Flow Details (BEQ, BNE, JMP, CALL, RET)

Control flow is where the CPU stops following the normal "next instruction, next instruction, next instruction" pattern.

In this CPU, all branch and jump offsets are word-based, but PC is still a byte address. That is why the handlers do offset * 2: one instruction word is 2 bytes.

Real branch/jump handlers from the project:

context(decoder: IsaDecoder)
fun Cpu.beq(inst: UShort) {
	// I-type branch: the first source register lives in the rd field
	val rs1 = decoder.decodeRd(inst)
	val rs2 = decoder.decodeRs1(inst)
	val offset = decoder.decodeImm6(inst)
 
	if (readReg(rs1) == readReg(rs2)) {
		pc { pc -> (pc.toInt() + offset * 2).toUShort() }
	}
}
 
context(decoder: IsaDecoder)
fun Cpu.bne(inst: UShort) {
	val rs1 = decoder.decodeRd(inst)
	val rs2 = decoder.decodeRs1(inst)
	val offset = decoder.decodeImm6(inst)
 
	if (readReg(rs1) != readReg(rs2)) {
		pc { pc -> (pc.toInt() + offset * 2).toUShort() }
	}
}
 
context(decoder: IsaDecoder)
fun Cpu.jmp(inst: UShort) {
	val offset = decoder.decodeImm12(inst)
	pc { pc -> (pc.toInt() + offset * 2).toUShort() }
}

The important idea is simple:

  • BEQ changes PC only when the registers are equal
  • BNE changes PC only when the registers are different
  • JMP always changes PC

If a branch is not taken, the CPU does nothing special here, because fetch() already advanced PC to the next instruction.

Branching in this implementation compares register values directly:

if (readReg(rs1) == readReg(rs2)) {
	pc { pc -> (pc.toInt() + offset * 2).toUShort() }
}

CALL and RET behavior:

context(decoder: IsaDecoder)
fun Cpu.call(inst: UShort) {
	val offset = decoder.decodeImm12(inst)
	push(pc { it })
	pc { pc -> (pc.toInt() + offset * 2).toUShort() }
}
 
fun Cpu.ret() {
	pc { pop() }
}

Why do we push(pc { it }) on CALL?

Because a function call needs to remember where execution should continue after the function finishes.

At the moment call() runs, fetch() has already consumed the CALL instruction and advanced PC to the next instruction. So the current PC is already the correct return address.

That means:

  • push(pc { it }) saves "where to come back to later"
  • pc { pc -> (pc.toInt() + offset * 2).toUShort() } jumps into the function body

Why does RET do pc { pop() }?

Because returning from a function means restoring the saved return address. pop() reads that address from the stack, and pc { pop() } makes execution continue exactly where the caller left off.

This push/pop pattern matters for two reasons:

  1. It lets the CPU come back to the instruction after CALL.
  2. It supports nested calls naturally.

Example intuition:

  • main calls function A -> push return address for main
  • function A calls function B -> push return address for A
  • RET in B pops back to A
  • RET in A pops back to main

Without the stack, one call would overwrite the previous return point, and nested calls would break.

So CALL + RET is really a small protocol:

  1. save return address on stack
  2. jump to callee
  3. callee eventually executes RET
  4. restore saved address and continue

This gives a clean and predictable function call model.

Step 11: Load Program Into RAM

After the parser finishes, the .kasm source is no longer text. At that point, it has been transformed into a List<UShort>, where each item is one fully encoded 16-bit instruction word.

The loader's job is to take that list and place it into RAM in the exact byte layout the CPU expects.

Loader writes encoded instruction words at address 0x0000:

operator fun invoke(program: List<UShort>): Cpu {
	val cpu = Cpu()
	var address = 0
	for (inst in program) {
		cpu.writeMemory(address = address, value = (inst.toInt() and 0xFF).toUByte())
		cpu.writeMemory(address = address + 1, value = ((inst.toInt() shr 8) and 0xFF).toUByte())
		address += 2
	}
	return cpu
}

What happens inside the loop:

  • program is the parsed and encoded output from KasmParser
  • each inst is one 16-bit instruction
  • the low byte is written first
  • the high byte is written second
  • address += 2 moves to the next instruction slot in RAM

So the flow is:

  1. KasmReader cleans the source text
  2. KasmParser turns the .kasm file into a List<UShort>
  3. KasmLoader writes that instruction list into RAM
  4. Isa.run() starts fetching from RAM at PC = 0

That means the entry point is naturally PC = 0, because the first encoded instruction is loaded at address 0x0000.
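Loading and fetching should be exact inverses of each other. A minimal check of that little-endian roundtrip (a hypothetical buffer standing in for the repo's Cpu/loader pair):

```kotlin
// Write each word low-byte-first, then join the byte pairs back into words.
@OptIn(ExperimentalUnsignedTypes::class)
fun loadAndFetchBack(program: List<UShort>): List<UShort> {
    val ram = UByteArray(65536)
    var address = 0
    for (inst in program) {                                   // loader side
        ram[address] = (inst.toInt() and 0xFF).toUByte()      // low byte
        ram[address + 1] = ((inst.toInt() shr 8) and 0xFF).toUByte() // high byte
        address += 2
    }
    // fetch side: read byte pairs starting at PC = 0
    return (0 until program.size * 2 step 2).map { pc ->
        ((ram[pc + 1].toUInt() shl 8) or ram[pc].toUInt()).toUShort()
    }
}
```

Feeding in encoded words and getting the identical list back confirms the loader and fetch agree on the byte order.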

Step 12: Add Debugging Trace

A small debugger hook makes development much easier, especially when you are trying to understand:

  • Which instruction was just fetched
  • Where PC is moving
  • Whether a branch or call changed control flow
  • What happened to the registers after each step

This is implemented with an IsaDebugger interface that wraps one execution step:

interface IsaDebugger {
 
	context(decoder: IsaDecoder)
	fun Cpu.debug(opCode: OpCode, inst: UShort, step: (OpCode, UShort) -> Unit)
 
	data object None : IsaDebugger {
 
		context(decoder: IsaDecoder)
		override fun Cpu.debug(opCode: OpCode, inst: UShort, step: (OpCode, UShort) -> Unit) {
			step(opCode, inst)
		}
	}
 
	data object Logger : IsaDebugger {
 
		context(decoder: IsaDecoder)
		override fun Cpu.debug(opCode: OpCode, inst: UShort, step: (OpCode, UShort) -> Unit) {
			println(
				"[DEBUGGER] PC=${pc { it }.toString(16).padStart(4, '0')} " +
						"INST=0x${inst.toString(16).padStart(4, '0')} " +
						decoder.disassemble(opCode = opCode, inst = inst)
			)
			step(opCode, inst)
			println("   → ${dumpRegs()}")
		}
	}
}

This design is useful because debugging is optional.

  • IsaDebugger.None just runs the instruction normally
  • IsaDebugger.Logger prints trace information before and after that instruction

So you can turn tracing on without changing CPU semantics.

The debugger is plugged directly into the main run loop:

fun run() {
	context(decoder, alu) {
		while (cpu.isRunning()) {
			val inst = cpu.fetch()
			val opcode = decoder.decodeOpcode(inst)
			with(debugger) {
				cpu.debug(opCode = opcode, inst = inst, step = { op, inst ->
					executeInstruction(opCode = op, inst = inst)
				})
			}
		}
	}
}

That means every instruction goes through the same tracing hook:

  1. print current PC, raw instruction, and disassembly
  2. execute the instruction
  3. print registers after execution

Logger prints:

  • Current PC
  • Raw instruction word
  • Human-readable disassembly
  • Register dump after execution

The disassembly text is also generated inside IsaDebugger, so the trace is much easier to read than raw hex alone:

private fun IsaDecoder.disassemble(opCode: OpCode, inst: UShort): String {
	return when (opCode) {
		OpCode.ALU -> {
			val aluInstruction = decodeAluInstruction(inst)
			val rd = decodeRd(inst)
			val rs1 = decodeRs1(inst)
			val rs2 = decodeRs2(inst)
 
			when (aluInstruction) {
				AluInstruction.ADD -> "ADD R$rd, R$rs1, R$rs2"
				AluInstruction.SUB -> "SUB R$rd, R$rs1, R$rs2"
				AluInstruction.AND -> "AND R$rd, R$rs1, R$rs2"
				AluInstruction.OR -> "OR  R$rd, R$rs1, R$rs2"
				AluInstruction.XOR -> "XOR R$rd, R$rs1, R$rs2"
				AluInstruction.MOV -> "MOV R$rd, R$rs1"
				AluInstruction.SHL -> "SHL R$rd, R$rs1, R$rs2"
				AluInstruction.SHR -> "SHR R$rd, R$rs1, R$rs2"
			}
		}
		...
	}
}

One detail to notice: the logged PC is the value after fetch() has already advanced it, so it points to the next instruction slot, not the byte address the instruction was fetched from.

Representative log output looks like this:

[DEBUGGER] PC=0002 INST=0x2045 LI R1, 5
   → R0=0 R1=5 R2=0 R3=0 R4=0 R5=0 R6=0 R7=0
[DEBUGGER] PC=0004 INST=0x2087 LI R2, 7
   → R0=0 R1=5 R2=7 R3=0 R4=0 R5=0 R6=0 R7=0
[DEBUGGER] PC=0006 INST=0x0650 ADD R3, R1, R2
   → R0=0 R1=5 R2=7 R3=12 R4=0 R5=0 R6=0 R7=0

This is useful for understanding control flow bugs quickly, because you can see both the instruction stream and the machine state evolve one step at a time.

If you want to see the complete code for this step, check IsaDebugger.kt.

Conclusion

Thanks for reading this far.

In this part, we took the CPU model from Part 1 and turned it into a working machine: a small ISA, a .kasm parser, an encoder/decoder, a RAM-backed CPU state, a fetch-decode-execute loop, stack-based calls, and debugger-friendly execution tracing.

That is already enough to make the system feel real. We are no longer talking only about CPU ideas in the abstract. We now have a concrete emulator that can load instructions into RAM, execute them step by step, branch, call functions, return through the stack, and expose its internal state while it runs.

At the same time, this project is intentionally small. It is a learning CPU, not a full operating-system-grade machine. For example, we did not implement things like:

  • a heap allocator or dynamic memory management
  • processes or concurrency
  • interrupts or exception handling
  • caches, pipelines, or speculative execution
  • virtual memory or privilege levels

That is exactly why this project is useful: it isolates the core ideas first. Once fetch, decode, execute, memory layout, stack discipline, and control flow make sense at this scale, the bigger topics become much easier to learn later.

Thank you! ❤️