2026/03/30

16-bit CPU from Scratch in Kotlin (Part 2): Building the CPU

Translating CPU theory into a 16-bit Kotlin emulator: ISA decisions, execution loop, memory model, and testing.

This is Part 2 of the series.

In Part 1 we built the mental model of a CPU. In this post we turn that model into real Kotlin code from kotlin-cpu. If you enjoy the project, a ⭐ star on GitHub goes a long way and helps others find it.

It is highly recommended that you take a look at the repository code before you start reading, and keep it open while you go through this post. Seeing the real files next to the explanations makes the whole build process much easier to follow.

This is a full step-by-step build story: from assembly language design, to instruction encoding, to CPU state, to execution loop, to tests.

Step 1: Define The CPU Model

We start with a tiny and explicit machine model:

  • 64KB byte-addressed RAM
  • Little-endian word layout in memory (low byte first, high byte second)
  • 8 registers (R0..R7), each 16-bit
  • PC and SP special registers
  • Flags: zero, carry, negative
  • A running flag to stop on HALT

Core structure:

interface Memory<T> {
    operator fun get(address: Int): T
}
 
interface MutableMemory<T> : Memory<T> {
    operator fun set(address: Int, value: T)
}
 
@OptIn(ExperimentalUnsignedTypes::class)
class RAM(size: Int) : MutableMemory<UByte> {
    private val data = UByteArray(size)
 
    override operator fun get(address: Int): UByte {
        return data[address]
    }
 
    override operator fun set(address: Int, value: UByte) {
        data[address] = value
    }
 
    fun toUByteArray(): UByteArray {
        return data
    }
}
 
@OptIn(ExperimentalUnsignedTypes::class)
class RegisterFile(size: Int) : MutableMemory<UShort> {
    private val data = UShortArray(size)
 
    override operator fun get(address: Int): UShort {
        return data[address]
    }
 
    override operator fun set(address: Int, value: UShort) {
        data[address] = value
    }
 
    fun toUShortArray(): UShortArray {
        return data
    }
}
 
class Cpu {
	val memory: Memory<UByte> field = RAM(65536)
	val regs: Memory<UShort> field = RegisterFile(8)
 
	private var PC: UShort = 0u
	private var SP: UShort = 0xFFFEu
 
	private var zeroFlag: Boolean = false
	private var carryFlag: Boolean = false
	private var negativeFlag: Boolean = false
 
	private var running: Boolean = true
 
	...
}

Two details matter a lot:

  • R0 is hard-wired to zero (reads always return 0, writes are ignored).
  • Stack starts at 0xFFFE and grows downward in 2-byte steps.

Why make R0 always zero?

  • It gives you a free constant 0 in every instruction, without loading it first.
  • It simplifies instruction patterns. For example, MOV R3, R0 clears R3, and ADD R5, R2, R0 behaves like a copy from R2 to R5.
  • Ignoring writes keeps behavior deterministic: even if code does LI R0, 123, later reads from R0 are still 0.

This is a common CPU design trick: one special register makes both assembly code and hardware logic simpler.
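The R0 rule is easy to sketch in isolation. The class and method names below are illustrative (the real Cpu.kt folds this into its register read/write path), but the two guards are exactly the behavior described above:

```kotlin
// Illustrative sketch of a hard-wired zero register (hypothetical names,
// not the exact repo code).
@OptIn(ExperimentalUnsignedTypes::class)
class ZeroRegisterFile(size: Int) {
    private val data = UShortArray(size)

    // Reads of R0 always return 0, no matter what was written before.
    fun read(index: Int): UShort =
        if (index == 0) 0.toUShort() else data[index]

    // Writes to R0 are silently dropped, keeping behavior deterministic.
    fun write(index: Int, value: UShort) {
        if (index != 0) data[index] = value
    }
}
```

With this in place, `write(0, …)` followed by `read(0)` still yields 0, while every other register behaves normally.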

If you want to see the complete code for this step, check Cpu.kt.

Step 2: Decide The .kasm Language

We want a tiny assembly language that is easy to read and easy to parse.

Rules:

  • One instruction per line.
  • Labels end with :.
  • Comments start with ;.
  • Register names are R0 to R7.

Example:

; sum two numbers and branch
LI R1, 5
LI R2, 7
ADD R3, R1, R2
BEQ R3, R2, skip
HALT
skip:
HALT

Reader stage strips comments and empty lines first:

private fun readKasm(kasm: String): String {
	return kasm
		.lineSequence()
		.map { line -> line.substringBefore(';').trim() }
		.filter { it.isNotEmpty() }
		.joinToString("\n")
}

If you want to see more .kasm examples, check src/main/resources.

Step 3: Define ISA Encoding (16-bit)

The CPU uses fixed 16-bit instructions.

Base idea:

  • opcode = bits[15..12]
  • Remaining bits depend on instruction format.

Formats:

R-type: opcode[15..12] rd[11..9] rs1[8..6] rs2[5..3] aluOp[2..0]
I-type: opcode[15..12] rd[11..9] rs1[8..6] imm6[5..0]
J-type: opcode[15..12] imm12[11..0]
Stack : opcode[15..12] reg[11..9]

Instruction Encoding Explorer

Real `.kasm` lines and the exact 16-bit instruction words they encode to. As an R-type example, ADD R3, R1, R2 encodes to:

bits: 0000 011 001 010 000
hex: 0x0650

Field by field:

  • opcode, bits 15..12 = 0000 -> ALU family
  • rd, bits 11..9 = 011 -> R3 (destination)
  • rs1, bits 8..6 = 001 -> R1
  • rs2, bits 5..3 = 010 -> R2
  • aluOp, bits 2..0 = 000 -> ADD

Immediate behavior:

  • imm6 is sign-extended (-32..31)
  • imm12 is sign-extended (-2048..2047)
  • Branch/jump offsets are word-based (offset * 2)

Step 4: Implement The ISA Instruction Set

I kept the ISA small but complete enough for programs with arithmetic, memory, branching, and calls.

Mnemonic                 Opcode            Format  Semantics
ADD rd, rs1, rs2         0x0 + aluOp=000   R       rd = rs1 + rs2
SUB rd, rs1, rs2         0x0 + aluOp=001   R       rd = rs1 - rs2
AND rd, rs1, rs2         0x0 + aluOp=010   R       rd = rs1 & rs2
OR rd, rs1, rs2          0x0 + aluOp=011   R       rd = rs1 | rs2
XOR rd, rs1, rs2         0x0 + aluOp=100   R       rd = rs1 ^ rs2
MOV rd, rs1              0x0 + aluOp=101   R       rd = rs1
SHL rd, rs1, rs2         0x0 + aluOp=110   R       rd = rs1 << (rs2 & 0xF)
SHR rd, rs1, rs2         0x0 + aluOp=111   R       rd = rs1 >> (rs2 & 0xF)
ADDI rd, rs1, imm6       0x1               I       rd = rs1 + imm6
LI rd, imm6              0x2               I       rd = imm6
LUI rd, imm6             0x3               I       rd = imm6 << 8
LOAD rd, base, off6      0x4               I       rd = mem16[(base + off6) & 0xFFFE]
STORE rs, base, off6     0x5               I       mem16[(base + off6) & 0xFFFE] = rs
BEQ rs1, rs2, off6       0x6               I       if equal, PC += off6 * 2
BNE rs1, rs2, off6       0x7               I       if not equal, PC += off6 * 2
JMP off12                0x8               J       PC += off12 * 2
RET                      0x9               J       PC = pop()
PUSH rs                  0xC               Stack   push(rs)
POP rd                   0xD               Stack   rd = pop()
CALL off12               0xE               J       push(returnPC); PC += off12 * 2
HALT                     0xF               J       stop execution

Real Isa Code Layers (from kotlin-cpu)

The instruction table above is the contract. The Isa implementation is the engine that enforces that contract.

Layer 1 - wiring dependencies (Cpu, IsaDecoder, Alu, IsaDebugger):

class Isa private constructor(
	val cpu: Cpu,
	private val decoder: IsaDecoder,
	private val alu: Alu,
	private val debugger: IsaDebugger
) : IsaExecution {
 
	companion object {
		fun create(
			cpu: Cpu,
			debugger: IsaDebugger = IsaDebugger.None
		): Isa {
			return Isa(
				cpu = cpu,
				decoder = IsaDecoder,
				alu = Alu,
				debugger = debugger
			)
		}
	}
}

This layer is composition: it assembles the CPU state with decode logic, ALU logic, and optional debugging.

Layer 2 - fetch/decode/execute loop:

fun run() {
	context(decoder, alu) {
		while (cpu.isRunning()) {
			val inst = cpu.fetch()
			val opcode = decoder.decodeOpcode(inst)
			with(debugger) {
				cpu.debug(opCode = opcode, inst = inst, step = { op, inst ->
					executeInstruction(opCode = op, inst = inst)
				})
			}
		}
	}
}

This layer is control flow: fetch one word, decode opcode, execute, repeat until HALT flips running to false.

Layer 3 - opcode dispatch table:

context(decoder: IsaDecoder, alu: Alu)
private fun executeInstruction(opCode: OpCode, inst: UShort) {
	when (opCode) {
		OpCode.ALU -> cpu.alu(inst = inst)
		OpCode.ADDI -> cpu.addi(inst = inst)
		OpCode.LI -> cpu.li(inst = inst)
		OpCode.LUI -> cpu.lui(inst = inst)
		OpCode.LOAD -> cpu.load(inst = inst)
		OpCode.STORE -> cpu.store(inst = inst)
		OpCode.BEQ -> cpu.beq(inst = inst)
		OpCode.BNE -> cpu.bne(inst = inst)
		OpCode.JMP -> cpu.jmp(inst = inst)
		OpCode.RET -> cpu.ret()
		OpCode.PUSH -> cpu.push(inst = inst)
		OpCode.POP -> cpu.pop(inst = inst)
		OpCode.CALL -> cpu.call(inst = inst)
		OpCode.HALT -> cpu.halt()
	}
}

This layer is routing: one opcode maps to one CPU behavior.

Layer 4 - per-instruction semantics (example handlers):

context(decoder: IsaDecoder, alu: Alu)
fun Cpu.alu(inst: UShort) {
	val rd = decoder.decodeRd(inst)
	val rs1 = decoder.decodeRs1(inst)
	val rs2 = decoder.decodeRs2(inst)
	val a = readReg(rs1)
	val b = readReg(rs2)
	val result = alu.execute(
		operation = decoder.decodeAluInstruction(inst),
		a = a,
		b = b,
		updateCarryFlag = { carry -> updateCarry(carry = carry) }
	)
 
	updateZeroNegative(value = result)
	writeReg(index = rd, value = result)
}
 
context(decoder: IsaDecoder)
fun Cpu.call(inst: UShort) {
	val offset = decoder.decodeImm12(inst)
	push(pc { it })
	pc { pc -> (pc.toInt() + offset * 2).toUShort() }
}
 
fun Cpu.halt() {
	stopRunning()
}

This layer is the actual behavior of each instruction: register reads/writes, flags, stack, memory, and PC updates.

Pseudo instruction:

NOP ; expands to ADD R0, R0, R0

If you want to see the complete code for this step, check Isa.kt.

Step 5: Build The Encoder And Decoder

Encoder writes bit-fields into a UShort:

private fun encodeRegister(opcode: OpCode, rd: Int, rs1: Int, rs2: Int, aluOperation: AluInstruction): UShort {
	return (
		(opcode.code shl 12) or
		(rd shl 9) or
		(rs1 shl 6) or
		(rs2 shl 3) or
		aluOperation.operation
	).toUShort()
}
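Assuming opcode code 0x0 for the ALU family and sub-opcode 000 for ADD, the same bit packing can be checked in isolation with plain Ints (a standalone version of the packer above, not the repo's typed API):

```kotlin
// Standalone R-type packer; field values mirror ADD R3, R1, R2.
fun encodeRType(opcode: Int, rd: Int, rs1: Int, rs2: Int, aluOp: Int): UShort {
    return (
        (opcode shl 12) or  // bits 15..12
        (rd shl 9) or       // bits 11..9
        (rs1 shl 6) or      // bits 8..6
        (rs2 shl 3) or      // bits 5..3
        aluOp               // bits 2..0
    ).toUShort()
}

// ADD R3, R1, R2 -> 0x0650
val word = encodeRType(opcode = 0x0, rd = 3, rs1 = 1, rs2 = 2, aluOp = 0b000)
```

Because each field is shifted into its own bit range, the ORs never collide.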

Decoder extracts fields back out:

fun decodeOpcode(inst: UShort): OpCode {
	val code = (inst.toInt() shr 12) and 0xF
	return OpCode.entries.first { it.code == code }
}
 
fun decodeRd(inst: UShort) = (inst.toInt() shr 9) and 0x7
fun decodeRs1(inst: UShort) = (inst.toInt() shr 6) and 0x7
fun decodeRs2(inst: UShort) = (inst.toInt() shr 3) and 0x7

Sign extension for immediates:

fun decodeImm6(inst: UShort): Int {
	val imm = inst.toInt() and 0x3F
	return if (imm and 0x20 != 0) imm or -0x40 else imm
}
 
fun decodeImm12(inst: UShort): Int {
	val imm = inst.toInt() and 0xFFF
	return if (imm and 0x800 != 0) imm or -0x1000 else imm
}
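A quick way to convince yourself the sign extension is right: in a 6-bit field, 0b011111 is the largest positive value (31), 0b100000 is the most negative (-32), and 0b111111 is -1. A self-contained copy of the logic, checked against those boundary values:

```kotlin
// Same sign-extension logic as above, standalone so it can be checked directly.
fun imm6Of(inst: UShort): Int {
    val imm = inst.toInt() and 0x3F
    // If bit 5 (the sign bit of the 6-bit field) is set, fill the upper bits.
    return if (imm and 0x20 != 0) imm or -0x40 else imm
}

// 0b011111 -> 31, 0b100000 -> -32, 0b111111 -> -1
```

The imm12 case works identically, just with bit 11 as the sign bit and -0x1000 as the fill mask.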

Bitwise Encode / Decode

Building ADD R3, R1, R2 -> 0x0650 (binary 0000 011 001 010 000) one field at a time. Field layout: opcode [15..12], rd [11..9], rs1 [8..6], rs2 [5..3], aluOp [2..0].

  1. Place opcode in bits 15..12: 0x0 shl 12. shl 12 shifts the opcode value all the way to the top 4 bits. ADD belongs to the ALU opcode family, so opcode = 0x0 and all four bits are 0. Word so far: 0000 0000 0000 0000.
  2. Place rd in bits 11..9: 3 shl 9. shl 9 moves register index 3 (R3) into bit positions 11..9; OR-ing it with the running word fills those three bits without touching the others. Word so far: 0000 0110 0000 0000.
  3. Place rs1 in bits 8..6: 1 shl 6 moves register index 1 (R1) into bit positions 8..6. Word so far: 0000 0110 0100 0000.
  4. Place rs2 in bits 5..3: 2 shl 3 moves register index 2 (R2) into bit positions 5..3. Word so far: 0000 0110 0101 0000.
  5. Place aluOp in bits 2..0: the lowest field sits at positions 2..0 and needs no shift. The ADD sub-opcode is 000, so the word is unchanged: 0000 0110 0101 0000.
  6. Merge all fields with OR: 0x0000 or 0x0600 or 0x0040 or 0x0010 or 0x0000 = 0x0650, the final 16-bit instruction. Each field occupies a different, non-overlapping group of bits, so OR safely merges them all into one word with no collisions.

Step 6: Parse .kasm Into Encoded Program Words

In the repository, this step is implemented by KasmParser with a clean two-pass assembler flow.

The public entry keeps it simple:

class KasmParser private constructor(private val encoder: KasmEncoder) {
 
	companion object {
		operator fun invoke(program: String): List<UShort> {
			val encoder = KasmEncoder
			return KasmParser(encoder).parse(program)
		}
	}
 
	fun parse(program: String): List<UShort> {
		return context(encoder) { assemble(program = program) }
	}
}

Pass 0 tokenizes lines into either Label or Instruction:

private fun parse(program: String): List<AsmLine> {
	return program.lines()
		.map { it.trim() }
		.filter { it.isNotEmpty() }
		.map { line ->
			if (line.endsWith(":")) {
				Label(line.dropLast(1))
			} else {
				val parts = line.split(" ", ",")
					.map { it.trim() }
					.filter { it.isNotEmpty() }
 
				Instruction(parts[0].uppercase(), parts.drop(1))
			}
		}
}

Pass 1 resolves label addresses in bytes (pc += 2 per instruction):

private fun resolveLabels(lines: List<AsmLine>): Map<String, Int> {
	val labels = mutableMapOf<String, Int>()
	var pc = 0
 
	for (line in lines) {
		when (line) {
			is Label -> labels[line.name] = pc
			is Instruction -> pc += 2
		}
	}
 
	return labels
}

Pass 2 encodes instructions using KasmEncoder, while pc tracks the current instruction address:

context(encoder: KasmEncoder)
private fun assemble(program: String): List<UShort> {
	val lines = parse(program)
	val labels = resolveLabels(lines)
 
	val output = mutableListOf<UShort>()
	var pc = 0
 
	for (line in lines) {
		if (line is Instruction) {
			val inst = when (line.name) {
				"ADD" -> encoder.ADD(parseReg(line.args[0]), parseReg(line.args[1]), parseReg(line.args[2]))
				"JMP" -> encoder.JMP(parseImm(line.args[0], labels, pc))
				"BEQ" -> encoder.BEQ(parseReg(line.args[0]), parseReg(line.args[1]), parseImm(line.args[2], labels, pc))
				"HALT" -> encoder.HALT()
				...
				else -> error("Unknown instruction: ${line.name}")
			}
 
			output.add(inst)
			pc += 2
		}
	}
 
	return output
}

The critical part is immediate parsing for labels (PC-relative, word-based offsets):

private fun parseImm(value: String, labels: Map<String, Int>, pc: Int): Int {
	return when {
		value.matches(Regex("-?\\d+")) -> value.toInt()
 
		labels.containsKey(value) -> {
			val target = labels[value]!!
			(target - (pc + 2)) / 2
		}
 
		else -> error("Unknown immediate: $value")
	}
}

Why this formula works:

  • pc + 2 means "the next instruction after the current one".
  • target - (pc + 2) gives byte distance from next instruction to label.
  • dividing by 2 converts bytes into 16-bit instruction-word offsets.

Label Offset

We want to encode this line: BEQ R1, R2, loop. The assembler must convert loop into a small number called an offset.

1) Where Are We In The Program?

Each instruction is 2 bytes, so addresses move by +2.

0x0000: LI R1, 3

0x0002: LI R2, 3

0x0004: BEQ R1, R2, loop <-- current line (pc)

0x0006: HALT

0x0008: loop:

0x0008: ADD R3, R3, R1

2) Step-By-Step Math

The branch offset is computed from the next instruction, not the current one.

1. pc = 0x0004 (address of BEQ)

2. target = 0x0008 (address of label loop)

3. pc + 2 = 0x0006 (next instruction)

4. target - (pc + 2) = 0x0002 bytes

5. 0x0002 / 2 = 1 instruction word

Result: immediate offset = 1

3) Why pc + 2?

The CPU has already fetched the branch instruction, so the natural next position is pc + 2. Branch offsets are measured from that next position.

Offsets are stored in instruction words, not bytes. That is why we divide by 2 in (target - (pc + 2)) / 2.

Quick intuition: from 0x0006 to 0x0008 is one instruction ahead, so offset = 1.

Mini glossary: pc = current instruction address, target = label address, offset = how many instructions to jump.
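The whole formula fits in one small function, checked against the walkthrough above (BEQ at 0x0004 targeting loop at 0x0008 gives offset 1):

```kotlin
// PC-relative, word-based branch offset, measured from the *next* instruction.
fun labelOffset(pc: Int, target: Int): Int = (target - (pc + 2)) / 2

// Forward:  BEQ at 0x0004, loop at 0x0008 -> offset 1
// Backward: a label at 0x0000 seen from pc 0x0004 -> offset -3
```

Backward branches simply come out negative, which is why the immediate fields are sign-extended.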

If you want to see the complete code for this step, check KasmParser.kt.

Step 7: Implement Fetch, Stack, And Word Memory Rules

Memory is byte-addressed in this CPU, which means:

  • PC points to a byte address (not a word index).
  • each instruction is 16-bit = 2 bytes.

So one fetch must read two bytes:

  1. low byte at PC
  2. high byte at PC + 1

Then PC advances by 2, because we consumed 2 bytes (one full instruction).

Real fetch code:

fun fetch(): UShort {
	val low = memory[PC.toInt()].toUInt()
	val high = memory[(PC + 1u).toInt()].toUInt()
	PC = (PC + 2u).toUShort()
	return ((high shl 8) or low).toUShort()
}

Why ((high shl 8) or low) works:

  • high shl 8: moves the high byte into bits [15..8]
  • low: already occupies bits [7..0]
  • or: merges both halves into one 16-bit value

Example:

  • low = 0x34
  • high = 0x12
  • (high shl 8) = 0x1200
  • 0x1200 or 0x0034 = 0x1234

push and pop use the exact same byte-splitting and byte-joining logic, just in reverse order.

Push and pop are also 16-bit and little-endian:

fun push(value: UShort) {
	SP = (SP - 2u).toUShort()
	val addr = SP.toInt() and 0xFFFF
	memory[addr] = (value.toInt() and 0xFF).toUByte()
	memory[addr + 1] = ((value.toInt() shr 8) and 0xFF).toUByte()
}
 
fun pop(): UShort {
	val addr = SP.toInt() and 0xFFFF
	val low = memory[addr].toUInt()
	val high = memory[addr + 1].toUInt()
	val value = ((high shl 8) or low).toUShort()
	SP = (SP + 2u).toUShort()
	return value
}

Bitwise details inside push:

  • (value.toInt() and 0xFF): keeps only the low 8 bits (low byte)
  • (value.toInt() shr 8) and 0xFF: shifts high byte down, then masks to 8 bits

Bitwise details inside pop:

  • read low and high bytes from memory
  • high shl 8 puts high byte back in [15..8]
  • or low restores full 16-bit word

In short: fetch, push, and pop are consistent because they all follow the same little-endian rule:

  • low byte at lower address
  • high byte at next address
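Those rules can be exercised together with a tiny self-contained model of the stack (a plain UByteArray standing in for the 64KB RAM; this mirrors the push/pop code above but is not the repo class):

```kotlin
// Minimal stack model: SP starts at 0xFFFE and grows downward in 2-byte steps.
@OptIn(ExperimentalUnsignedTypes::class)
class MiniStack {
    private val mem = UByteArray(65536)
    private var sp = 0xFFFE

    fun push(value: UShort) {
        sp -= 2
        mem[sp] = (value.toInt() and 0xFF).toUByte()              // low byte first
        mem[sp + 1] = ((value.toInt() shr 8) and 0xFF).toUByte()  // high byte second
    }

    fun pop(): UShort {
        val low = mem[sp].toUInt()
        val high = mem[sp + 1].toUInt()
        sp += 2
        return ((high shl 8) or low).toUShort()
    }
}
```

Pushing 0x1234 and then 0xBEEF pops them back in reverse order, byte-exact, which is the property CALL/RET relies on later.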

Fetch And Bitwise Visualization

`PC` is a byte address. One instruction is 2 bytes, so fetch reads two bytes and then moves `PC` by `+2`.

A) Why PC + 2?

PC = 0x0010

memory[0x0010] = 0x34 (low byte)

memory[0x0011] = 0x12 (high byte)

fetch() consumes both bytes, then PC = 0x0012

Because each instruction is exactly 2 bytes, the next instruction starts at the next byte pair.

B) Join Two Bytes Into One Word

low = 0x34 = 0011 0100

high = 0x12 = 0001 0010

(high shl 8) = 0x1200 = 0001 0010 0000 0000

or low = 0x0034 = 0000 0000 0011 0100

result = 0x1234 = 0001 0010 0011 0100

C) Push/Pop Use The Same Rule

Push (split word)

low = value and 0xFF

high = (value shr 8) and 0xFF

Mask `and 0xFF` keeps only one byte.

Pop (join word)

value = (high shl 8) or low

Shift high byte into the top half, then merge with low byte.

Step 8: Build The ALU And Flag Updates

In the repository, all ALU behavior is centralized in Alu.execute, and the CPU instruction handler just feeds it operands and stores the result.

The operation enum is small and explicit:

enum class AluInstruction(val operation: Int) {
	ADD(0b000), SUB(0b001), AND(0b010), OR(0b011),
	XOR(0b100), MOV(0b101), SHL(0b110), SHR(0b111)
}

The ALU-backed CPU handler looks like this:

context(decoder: IsaDecoder, alu: Alu)
fun Cpu.alu(inst: UShort) {
	val rd = decoder.decodeRd(inst)
	val rs1 = decoder.decodeRs1(inst)
	val rs2 = decoder.decodeRs2(inst)
	val a = readReg(rs1)
	val b = readReg(rs2)
	val result = alu.execute(
		operation = decoder.decodeAluInstruction(inst),
		a = a,
		b = b,
		updateCarryFlag = { carry -> updateCarry(carry = carry) }
	)
 
	updateZeroNegative(value = result)
	writeReg(index = rd, value = result)
}

That structure is important:

  • decoder picks which ALU operation to run
  • CPU reads the source registers
  • ALU computes the result and reports carry behavior
  • CPU updates zeroFlag and negativeFlag
  • CPU writes the final value back to rd

Real Alu.execute examples from the project:

ADD and SUB:

AluInstruction.ADD -> {
	val r = a.toUInt() + b.toUInt()
	updateCarryFlag(r > 0xFFFFu)
	r.toUShort()
}
 
AluInstruction.SUB -> {
	val r = a.toUInt() - b.toUInt()
	updateCarryFlag(a.toUInt() < b.toUInt())
	r.toUShort()
}

ADD uses UInt so the code can detect overflow beyond 0xFFFF before converting back to UShort. SUB uses the same idea for borrow detection: if a < b, carry/borrow is recorded.
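Two boundary cases make the carry rules concrete. This standalone sketch uses the same UInt widening trick (hypothetical helper names, returning the result paired with the carry/borrow flag):

```kotlin
// ADD: widen to UInt, detect overflow past 0xFFFF, then truncate back.
fun addWithCarry(a: UShort, b: UShort): Pair<UShort, Boolean> {
    val r = a.toUInt() + b.toUInt()
    return r.toUShort() to (r > 0xFFFFu)
}

// SUB: borrow is simply "a < b" on the unsigned values.
fun subWithBorrow(a: UShort, b: UShort): Pair<UShort, Boolean> {
    val r = a.toUInt() - b.toUInt()
    return r.toUShort() to (a.toUInt() < b.toUInt())
}

// 0xFFFF + 0x0001 wraps to 0x0000 with carry = true
// 0x0001 - 0x0002 wraps to 0xFFFF with borrow = true
```

The truncation back to UShort is exactly the 16-bit wraparound a real register would perform; the flag is the only record that it happened.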

Bitwise ops are even simpler:

AluInstruction.AND -> {
	updateCarryFlag(false)
	a and b
}
 
AluInstruction.OR -> {
	updateCarryFlag(false)
	a or b
}
 
AluInstruction.XOR -> {
	updateCarryFlag(false)
	a xor b
}
 
AluInstruction.MOV -> {
	updateCarryFlag(false)
	a
}

These operations do not produce arithmetic carry in this CPU, so carry is explicitly reset to false.

Shift operations are slightly more interesting:

AluInstruction.SHL -> {
	val shift = (b.toInt() and 0xF)
	val r = a.toUInt() shl shift
	updateCarryFlag((a.toUInt() shl (shift - 1)) and 0x8000u != 0u)
	r.toUShort()
}
 
AluInstruction.SHR -> {
	val shift = (b.toInt() and 0xF)
	updateCarryFlag((a.toUInt() shr (shift - 1)) and 1u != 0u)
	val r = a.toUInt() shr shift
	r.toUShort()
}

Why b.toInt() and 0xF?

  • only the low 4 bits of the shift amount are used
  • that keeps shifts in the range 0..15
  • which makes sense for a 16-bit value
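The masking is easy to demonstrate with a couple of illustrative shift amounts: 17 (binary 10001) is masked down to 1, and 16 is masked down to 0:

```kotlin
// Only the low 4 bits of the shift amount are honored, so shifts stay in 0..15.
fun maskedShl(a: UShort, b: UShort): UShort {
    val shift = b.toInt() and 0xF
    return (a.toUInt() shl shift).toUShort()
}

// shift 17 acts like shift 1: 0x0001 << 17 -> 0x0002
// shift 16 acts like shift 0: the value passes through unchanged
```

Without the mask, a UInt shift by 16 or more would produce a value that makes no sense once truncated back to 16 bits.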

After execute returns, the CPU updates the remaining flags with real code from Cpu.kt:

fun updateCarry(carry: Boolean) {
	carryFlag = carry
}
 
fun updateZeroNegative(value: UShort) {
	zeroFlag = value == 0.toUShort()
	negativeFlag = (value.toInt() and 0x8000) != 0
}

So the flag rules are:

  • zeroFlag: result equals 0
  • negativeFlag: bit 15 of the 16-bit result is 1
  • carryFlag: reported by arithmetic and shift logic

Example intuition:

  • ADD 0x0001 + 0x0001 = 0x0002 -> zero=false, negative=false
  • SUB 0x0001 - 0x0001 = 0x0000 -> zero=true
  • a result like 0x8001 -> negative=true because bit 15 is set
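The zero/negative rules can be checked directly as two one-line predicates (standalone mirrors of updateZeroNegative):

```kotlin
// zeroFlag: the 16-bit result is exactly 0.
fun zeroOf(value: UShort): Boolean = value == 0.toUShort()

// negativeFlag: bit 15 (the sign bit of a 16-bit value) is set.
fun negativeOf(value: UShort): Boolean = (value.toInt() and 0x8000) != 0

// 0x0002 -> zero=false, negative=false
// 0x0000 -> zero=true
// 0x8001 -> negative=true
```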

ALU Operations & Flags

A concrete run through the ALU with inputs R1 = 0x0005 (5) and R2 = 0x0007 (7) and operation ADD:

  • Result: 0x000C (12)
  • Flags: zero=0, negative=0, carry=0

What happened: 5 + 7 = 12. The result is positive, non-zero, and fits in 16 bits, so no flags are set. Most instructions produce this clean outcome.

If you want to see the complete code for this step, check Alu.kt.

Step 9: Implement Instruction Execution Dispatch

At this point, the CPU already knows how to:

  • fetch a 16-bit instruction word
  • decode fields out of that word
  • run individual instruction handlers like alu, load, store, and call

So the executor loop can stay intentionally small. Its only job is to repeat the same cycle:

  1. fetch one instruction
  2. decode its top-level opcode
  3. route execution to the correct handler
  4. keep going until HALT stops the CPU

Core loop:

while (cpu.isRunning()) {
	val inst = cpu.fetch()
	val opcode = decoder.decodeOpcode(inst)
	executeInstruction(opCode = opcode, inst = inst)
}

This is the heart of the emulator: everything else plugs into this loop.

The nice design choice here is separation of responsibility:

  • fetch() knows memory layout and PC movement
  • decodeOpcode() knows instruction encoding
  • each CPU handler knows the semantics of one instruction family
  • the dispatch layer only decides where control goes next

Dispatch table:

when (opCode) {
	OpCode.ALU -> cpu.alu(inst = inst)
	OpCode.ADDI -> cpu.addi(inst = inst)
	OpCode.LI -> cpu.li(inst = inst)
	OpCode.LUI -> cpu.lui(inst = inst)
	OpCode.LOAD -> cpu.load(inst = inst)
	OpCode.STORE -> cpu.store(inst = inst)
	OpCode.BEQ -> cpu.beq(inst = inst)
	OpCode.BNE -> cpu.bne(inst = inst)
	OpCode.JMP -> cpu.jmp(inst = inst)
	OpCode.RET -> cpu.ret()
	OpCode.PUSH -> cpu.push(inst = inst)
	OpCode.POP -> cpu.pop(inst = inst)
	OpCode.CALL -> cpu.call(inst = inst)
	OpCode.HALT -> cpu.halt()
}

This table is basically the router of the CPU. If the opcode is ALU, execution goes to cpu.alu(inst). If it is LOAD, it goes to cpu.load(inst). If it is HALT, it goes to cpu.halt(), which flips the running flag and ends the loop.

That keeps the system easy to extend: adding a new instruction usually means two clear steps:

  1. define how it is encoded and decoded
  2. add one handler and one dispatch case

Step 10: Control Flow Details (BEQ, BNE, JMP, CALL, RET)

Control flow is where the CPU stops following the normal "next instruction, next instruction, next instruction" pattern.

In this CPU, all branch and jump offsets are word-based, but PC is still a byte address. That is why the handlers do offset * 2: one instruction word is 2 bytes.

Real branch/jump handlers from the project:

context(decoder: IsaDecoder)
fun Cpu.beq(inst: UShort) {
	// I-type branch: the first source register lives in the rd field
	val rs1 = decoder.decodeRd(inst)
	val rs2 = decoder.decodeRs1(inst)
	val offset = decoder.decodeImm6(inst)
 
	if (readReg(rs1) == readReg(rs2)) {
		pc { pc -> (pc.toInt() + offset * 2).toUShort() }
	}
}
 
context(decoder: IsaDecoder)
fun Cpu.bne(inst: UShort) {
	val rs1 = decoder.decodeRd(inst)
	val rs2 = decoder.decodeRs1(inst)
	val offset = decoder.decodeImm6(inst)
 
	if (readReg(rs1) != readReg(rs2)) {
		pc { pc -> (pc.toInt() + offset * 2).toUShort() }
	}
}
 
context(decoder: IsaDecoder)
fun Cpu.jmp(inst: UShort) {
	val offset = decoder.decodeImm12(inst)
	pc { pc -> (pc.toInt() + offset * 2).toUShort() }
}

The important idea is simple:

  • BEQ changes PC only when the registers are equal
  • BNE changes PC only when the registers are different
  • JMP always changes PC

If a branch is not taken, the CPU does nothing special here, because fetch() already advanced PC to the next instruction.

Branching in this implementation compares register values directly:

if (readReg(rs1) == readReg(rs2)) {
	pc { pc -> (pc.toInt() + offset * 2).toUShort() }
}

CALL and RET behavior:

context(decoder: IsaDecoder)
fun Cpu.call(inst: UShort) {
	val offset = decoder.decodeImm12(inst)
	push(pc { it })
	pc { pc -> (pc.toInt() + offset * 2).toUShort() }
}
 
fun Cpu.ret() {
	pc { pop() }
}

Why do we push(pc { it }) on CALL?

Because a function call needs to remember where execution should continue after the function finishes.

At the moment call() runs, fetch() has already consumed the CALL instruction and advanced PC to the next instruction. So the current PC is already the correct return address.

That means:

  • push(pc { it }) saves "where to come back to later"
  • pc { pc -> (pc.toInt() + offset * 2).toUShort() } jumps into the function body

Why does RET do pc { pop() }?

Because returning from a function means restoring the saved return address. pop() reads that address from the stack, and pc { pop() } makes execution continue exactly where the caller left off.

This push/pop pattern matters for two reasons:

  1. It lets the CPU come back to the instruction after CALL.
  2. It supports nested calls naturally.

Example intuition:

  • main calls function A -> push return address for main
  • function A calls function B -> push return address for A
  • RET in B pops back to A
  • RET in A pops back to main

Without the stack, one call would overwrite the previous return point, and nested calls would break.

So CALL + RET is really a small protocol:

  1. save return address on stack
  2. jump to callee
  3. callee eventually executes RET
  4. restore saved address and continue

This gives a clean and predictable function call model.

Step 11: Load Program Into RAM

After the parser finishes, the .kasm source is no longer text. At that point, it has been transformed into a List<UShort>, where each item is one fully encoded 16-bit instruction word.

The loader's job is to take that list and place it into RAM in the exact byte layout the CPU expects.

Loader writes encoded instruction words at address 0x0000:

operator fun invoke(program: List<UShort>): Cpu {
	val cpu = Cpu()
	var address = 0
	for (inst in program) {
		cpu.writeMemory(address = address, value = (inst.toInt() and 0xFF).toUByte())
		cpu.writeMemory(address = address + 1, value = ((inst.toInt() shr 8) and 0xFF).toUByte())
		address += 2
	}
	return cpu
}

What happens inside the loop:

  • program is the parsed and encoded output from KasmParser
  • each inst is one 16-bit instruction
  • the low byte is written first
  • the high byte is written second
  • address += 2 moves to the next instruction slot in RAM

So the flow is:

  1. KasmReader cleans the source text
  2. KasmParser turns the .kasm file into a List<UShort>
  3. KasmLoader writes that instruction list into RAM
  4. Isa.run() starts fetching from RAM at PC = 0

That means the entry point is naturally PC = 0, because the first encoded instruction is loaded at address 0x0000.
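Loading and fetching should be exact inverses of each other. A minimal check of that little-endian roundtrip (a hypothetical buffer standing in for the repo's Cpu/loader pair):

```kotlin
// Write each word low-byte-first, then join the byte pairs back into words.
@OptIn(ExperimentalUnsignedTypes::class)
fun loadAndFetchBack(program: List<UShort>): List<UShort> {
    val ram = UByteArray(65536)
    var address = 0
    for (inst in program) {                                   // loader side
        ram[address] = (inst.toInt() and 0xFF).toUByte()      // low byte
        ram[address + 1] = ((inst.toInt() shr 8) and 0xFF).toUByte() // high byte
        address += 2
    }
    // fetch side: read byte pairs starting at PC = 0
    return (0 until program.size * 2 step 2).map { pc ->
        ((ram[pc + 1].toUInt() shl 8) or ram[pc].toUInt()).toUShort()
    }
}
```

Feeding in encoded words and getting the identical list back confirms the loader and fetch agree on the byte order.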

Step 12: Add Debugging Trace

A small debugger hook makes development much easier, especially when you are trying to understand:

  • Which instruction was just fetched
  • Where PC is moving
  • Whether a branch or call changed control flow
  • What happened to the registers after each step

This is implemented with an IsaDebugger interface that wraps one execution step:

interface IsaDebugger {
 
	context(decoder: IsaDecoder)
	fun Cpu.debug(opCode: OpCode, inst: UShort, step: (OpCode, UShort) -> Unit)
 
	data object None : IsaDebugger {
 
		context(decoder: IsaDecoder)
		override fun Cpu.debug(opCode: OpCode, inst: UShort, step: (OpCode, UShort) -> Unit) {
			step(opCode, inst)
		}
	}
 
	data object Logger : IsaDebugger {
 
		context(decoder: IsaDecoder)
		override fun Cpu.debug(opCode: OpCode, inst: UShort, step: (OpCode, UShort) -> Unit) {
			println(
				"[DEBUGGER] PC=${pc { it }.toString(16).padStart(4, '0')} " +
						"INST=0x${inst.toString(16).padStart(4, '0')} " +
						decoder.disassemble(opCode = opCode, inst = inst)
			)
			step(opCode, inst)
			println("   → ${dumpRegs()}")
		}
	}
}

This design is useful because debugging is optional.

  • IsaDebugger.None just runs the instruction normally
  • IsaDebugger.Logger prints trace information before and after that instruction

So you can turn tracing on without changing CPU semantics.

The debugger is plugged directly into the main run loop:

fun run() {
	context(decoder, alu) {
		while (cpu.isRunning()) {
			val inst = cpu.fetch()
			val opcode = decoder.decodeOpcode(inst)
			with(debugger) {
				cpu.debug(opCode = opcode, inst = inst, step = { op, inst ->
					executeInstruction(opCode = op, inst = inst)
				})
			}
		}
	}
}

That means every instruction goes through the same tracing hook:

  1. print current PC, raw instruction, and disassembly
  2. execute the instruction
  3. print registers after execution

Logger prints:

  • Current PC
  • Raw instruction word
  • Human-readable disassembly
  • Register dump after execution

The disassembly text is also generated inside IsaDebugger, so the trace is much easier to read than raw hex alone:

private fun IsaDecoder.disassemble(opCode: OpCode, inst: UShort): String {
	return when (opCode) {
		OpCode.ALU -> {
			val aluInstruction = decodeAluInstruction(inst)
			val rd = decodeRd(inst)
			val rs1 = decodeRs1(inst)
			val rs2 = decodeRs2(inst)
 
			when (aluInstruction) {
				AluInstruction.ADD -> "ADD R$rd, R$rs1, R$rs2"
				AluInstruction.SUB -> "SUB R$rd, R$rs1, R$rs2"
				AluInstruction.AND -> "AND R$rd, R$rs1, R$rs2"
				AluInstruction.OR -> "OR  R$rd, R$rs1, R$rs2"
				AluInstruction.XOR -> "XOR R$rd, R$rs1, R$rs2"
				AluInstruction.MOV -> "MOV R$rd, R$rs1"
				AluInstruction.SHL -> "SHL R$rd, R$rs1, R$rs2"
				AluInstruction.SHR -> "SHR R$rd, R$rs1, R$rs2"
			}
		}
		...
	}
}

One detail to notice: the logged PC is the value after fetch() has already advanced it, so it points to the next instruction slot, not the byte address the instruction was fetched from.

Representative log output looks like this:

[DEBUGGER] PC=0002 INST=0x2045 LI R1, 5
   → R0=0 R1=5 R2=0 R3=0 R4=0 R5=0 R6=0 R7=0
[DEBUGGER] PC=0004 INST=0x2087 LI R2, 7
   → R0=0 R1=5 R2=7 R3=0 R4=0 R5=0 R6=0 R7=0
[DEBUGGER] PC=0006 INST=0x0650 ADD R3, R1, R2
   → R0=0 R1=5 R2=7 R3=12 R4=0 R5=0 R6=0 R7=0

This is useful for understanding control flow bugs quickly, because you can see both the instruction stream and the machine state evolve one step at a time.

If you want to see the complete code for this step, check IsaDebugger.kt.

Conclusion

Thanks for reading this far.

In this part, we took the CPU model from Part 1 and turned it into a working machine: a small ISA, a .kasm parser, an encoder/decoder, a RAM-backed CPU state, a fetch-decode-execute loop, stack-based calls, and debugger-friendly execution tracing.

That is already enough to make the system feel real. We are no longer talking only about CPU ideas in the abstract. We now have a concrete emulator that can load instructions into RAM, execute them step by step, branch, call functions, return through the stack, and expose its internal state while it runs.

At the same time, this project is intentionally small. It is a learning CPU, not a full operating-system-grade machine. For example, we did not implement things like:

  • a heap allocator or dynamic memory management
  • processes or concurrency
  • interrupts or exception handling
  • caches, pipelines, or speculative execution
  • virtual memory or privilege levels

That is exactly why this project is useful: it isolates the core ideas first. Once fetch, decode, execute, memory layout, stack discipline, and control flow make sense at this scale, the bigger topics become much easier to learn later.

Thank you! ❤️