2026/03/30
16-bit CPU from Scratch in Kotlin (Part 2): Building the CPU
Translating CPU theory into a 16-bit Kotlin emulator: ISA decisions, execution loop, memory model, and testing.
This is Part 2 of the series.
In Part 1 we built the mental model of a CPU. In this post we turn that model into real Kotlin code from kotlin-cpu. If you enjoy the project, a ⭐ star on GitHub goes a long way and helps others find it.
It is highly recommended that you take a look at the repository code before you start reading, and keep it open while you go through this post. Seeing the real files next to the explanations makes the whole build process much easier to follow.
This is a full step-by-step build story: from assembly language design, to instruction encoding, to CPU state, to execution loop, to tests.
Step 1: Define The CPU Model
We start with a tiny and explicit machine model:
- 64KB byte-addressed RAM
- Little-endian word layout in memory (low byte first, high byte second)
- 8 registers (`R0..R7`), each 16-bit
- `PC` and `SP` special registers
- Flags: zero, carry, negative
- A `running` flag to stop on `HALT`
Core structure:
interface Memory<T> {
    operator fun get(address: Int): T
}

interface MutableMemory<T> : Memory<T> {
    operator fun set(address: Int, value: T)
}

@OptIn(ExperimentalUnsignedTypes::class)
class RAM(size: Int) : MutableMemory<UByte> {
    private val data = UByteArray(size)

    override operator fun get(address: Int): UByte {
        return data[address]
    }

    override operator fun set(address: Int, value: UByte) {
        data[address] = value
    }

    fun toUByteArray(): UByteArray {
        return data
    }
}

@OptIn(ExperimentalUnsignedTypes::class)
class RegisterFile(size: Int) : MutableMemory<UShort> {
    private val data = UShortArray(size)

    override operator fun get(address: Int): UShort {
        return data[address]
    }

    override operator fun set(address: Int, value: UShort) {
        data[address] = value
    }

    fun toShortArray(): UShortArray {
        return data
    }
}
class Cpu {
    val memory: Memory<UByte>
        field = RAM(65536)
    val regs: Memory<UShort>
        field = RegisterFile(8)

    private var PC: UShort = 0u
    private var SP: UShort = 0xFFFEu

    private var zeroFlag: Boolean = false
    private var carryFlag: Boolean = false
    private var negativeFlag: Boolean = false

    private var running: Boolean = true
    ...
}

Two details matter a lot:

- `R0` is hard-wired to zero (reads always return `0`, writes are ignored).
- Stack starts at `0xFFFE` and grows downward in 2-byte steps.
Why make R0 always zero?
- It gives you a free constant `0` in every instruction, without loading it first.
- It simplifies instruction patterns. For example, `MOV R3, R0` clears `R3`, and `ADD R5, R2, R0` behaves like a copy from `R2` to `R5`.
- Ignoring writes keeps behavior deterministic: even if code does `LI R0, 123`, later reads from `R0` are still `0`.
This is a common CPU design trick: one special register makes both assembly code and hardware logic simpler.
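A minimal sketch of how that rule can be enforced. The class below is hypothetical and standalone; the repository implements this behavior inside `Cpu`/`RegisterFile`, and its details may differ:

```kotlin
// Sketch of the R0-is-zero rule over a plain register array.
@OptIn(ExperimentalUnsignedTypes::class)
class ZeroRegisterSketch {
    private val data = UShortArray(8)

    // R0 always reads as 0, regardless of what was written
    fun readReg(index: Int): UShort =
        if (index == 0) 0.toUShort() else data[index]

    // Writes to R0 are silently dropped
    fun writeReg(index: Int, value: UShort) {
        if (index != 0) data[index] = value
    }
}

fun main() {
    val regs = ZeroRegisterSketch()
    regs.writeReg(0, 123.toUShort())  // LI R0, 123 has no lasting effect
    regs.writeReg(3, 7.toUShort())
    println(regs.readReg(0))          // 0
    println(regs.readReg(3))          // 7
}
```

Two `if` checks are the whole trick: the read path fabricates the constant, and the write path drops the store, so no program can ever observe a non-zero `R0`.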
If you want to see the complete code for this step, check Cpu.kt.
Step 2: Decide The .kasm Language
We want a tiny assembly language that is easy to read and easy to parse.
Rules:
- One instruction per line.
- Labels end with `:`.
- Comments start with `;`.
- Register names are `R0` to `R7`.
Example:
; sum two numbers and branch
LI R1, 5
LI R2, 7
ADD R3, R1, R2
BEQ R3, R2, skip
HALT
skip:
HALT

Reader stage strips comments and empty lines first:
private fun readKasm(kasm: String): String {
    return kasm
        .lineSequence()
        .map { line -> line.substringBefore(';').trim() }
        .filter { it.isNotEmpty() }
        .joinToString("\n")
}

If you want to see more .kasm examples, check src/main/resources.
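To see what the reader stage actually produces, the same stripping logic can be run standalone (a top-level copy of `readKasm` for illustration; in the repository it is a private function):

```kotlin
// Standalone copy of the reader stage: drop comments, trim, remove blanks.
fun readKasm(kasm: String): String =
    kasm.lineSequence()
        .map { line -> line.substringBefore(';').trim() }
        .filter { it.isNotEmpty() }
        .joinToString("\n")

fun main() {
    val source = """
        ; sum two numbers
        LI R1, 5   ; load 5

        LI R2, 7
    """.trimIndent()
    println(readKasm(source))
    // LI R1, 5
    // LI R2, 7
}
```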
Step 3: Define ISA Encoding (16-bit)
The CPU uses fixed 16-bit instructions.
Base idea:
- `opcode = bits[15..12]`
- Remaining bits depend on instruction format.
Formats:
R-type: opcode[15..12] rd[11..9] rs1[8..6] rs2[5..3] aluOp[2..0]
I-type: opcode[15..12] rd[11..9] rs1[8..6] imm6[5..0]
J-type: opcode[15..12] imm12[11..0]
Stack : opcode[15..12] reg[11..9]

Worked example: the real `.kasm` line `ADD R3, R1, R2` and the exact 16-bit instruction word it encodes to:

bits: 0000 011 001 010 000
hex: 0x0650

- opcode, bits 15..12 = `0000` (ALU family)
- rd, bits 11..9 = `011` (R3, destination)
- rs1, bits 8..6 = `001` (R1)
- rs2, bits 5..3 = `010` (R2)
- aluOp, bits 2..0 = `000` (ADD)
Immediate behavior:
- `imm6` is sign-extended (-32..31)
- `imm12` is sign-extended (-2048..2047)
- Branch/jump offsets are word-based (`offset * 2`)
Step 4: Implement The ISA Instruction Set
I kept the ISA small but complete enough for programs with arithmetic, memory, branching, and calls.
| Mnemonic | Opcode | Format | Semantics |
|---|---|---|---|
| ADD rd, rs1, rs2 | 0x0 + aluOp=000 | R | rd = rs1 + rs2 |
| SUB rd, rs1, rs2 | 0x0 + aluOp=001 | R | rd = rs1 - rs2 |
| AND rd, rs1, rs2 | 0x0 + aluOp=010 | R | rd = rs1 & rs2 |
| OR rd, rs1, rs2 | 0x0 + aluOp=011 | R | rd = rs1 \| rs2 |
| XOR rd, rs1, rs2 | 0x0 + aluOp=100 | R | rd = rs1 ^ rs2 |
| MOV rd, rs1 | 0x0 + aluOp=101 | R | rd = rs1 |
| SHL rd, rs1, rs2 | 0x0 + aluOp=110 | R | rd = rs1 << (rs2 & 0xF) |
| SHR rd, rs1, rs2 | 0x0 + aluOp=111 | R | rd = rs1 >> (rs2 & 0xF) |
| ADDI rd, rs1, imm6 | 0x1 | I | rd = rs1 + imm6 |
| LI rd, imm6 | 0x2 | I | rd = imm6 |
| LUI rd, imm6 | 0x3 | I | rd = imm6 << 8 |
| LOAD rd, base, off6 | 0x4 | I | rd = mem16[(base + off6) & 0xFFFE] |
| STORE rs, base, off6 | 0x5 | I | mem16[(base + off6) & 0xFFFE] = rs |
| BEQ rs1, rs2, off6 | 0x6 | I | if equal, PC += off6 * 2 |
| BNE rs1, rs2, off6 | 0x7 | I | if not equal, PC += off6 * 2 |
| JMP off12 | 0x8 | J | PC += off12 * 2 |
| RET | 0x9 | J | PC = pop() |
| PUSH rs | 0xC | Stack | push(rs) |
| POP rd | 0xD | Stack | rd = pop() |
| CALL off12 | 0xE | J | push(returnPC); PC += off12 * 2 |
| HALT | 0xF | J | stop execution |
Real Isa Code Layers (from kotlin-cpu)
The instruction table above is the contract.
The Isa implementation is the engine that enforces that contract.
Layer 1 - wiring dependencies (Cpu, IsaDecoder, Alu, IsaDebugger):
class Isa private constructor(
    val cpu: Cpu,
    private val decoder: IsaDecoder,
    private val alu: Alu,
    private val debugger: IsaDebugger
) : IsaExecution {
    companion object {
        fun create(
            cpu: Cpu,
            debugger: IsaDebugger = IsaDebugger.None
        ): Isa {
            return Isa(
                cpu = cpu,
                decoder = IsaDecoder,
                alu = Alu,
                debugger = debugger
            )
        }
    }
}

This layer is composition: it assembles the CPU state with decode logic, ALU logic, and optional debugging.
Layer 2 - fetch/decode/execute loop:
fun run() {
    context(decoder, alu) {
        while (cpu.isRunning()) {
            val inst = cpu.fetch()
            val opcode = decoder.decodeOpcode(inst)
            with(debugger) {
                cpu.debug(opCode = opcode, inst = inst, step = { op, inst ->
                    executeInstruction(opCode = op, inst = inst)
                })
            }
        }
    }
}

This layer is control flow: fetch one word, decode opcode, execute, repeat until HALT flips running to false.
Layer 3 - opcode dispatch table:
context(decoder: IsaDecoder, alu: Alu)
private fun executeInstruction(opCode: OpCode, inst: UShort) {
    when (opCode) {
        OpCode.ALU -> cpu.alu(inst = inst)
        OpCode.ADDI -> cpu.addi(inst = inst)
        OpCode.LI -> cpu.li(inst = inst)
        OpCode.LUI -> cpu.lui(inst = inst)
        OpCode.LOAD -> cpu.load(inst = inst)
        OpCode.STORE -> cpu.store(inst = inst)
        OpCode.BEQ -> cpu.beq(inst = inst)
        OpCode.BNE -> cpu.bne(inst = inst)
        OpCode.JMP -> cpu.jmp(inst = inst)
        OpCode.RET -> cpu.ret()
        OpCode.PUSH -> cpu.push(inst = inst)
        OpCode.POP -> cpu.pop(inst = inst)
        OpCode.CALL -> cpu.call(inst = inst)
        OpCode.HALT -> cpu.halt()
    }
}

This layer is routing: one opcode maps to one CPU behavior.
Layer 4 - per-instruction semantics (example handlers):
context(decoder: IsaDecoder, alu: Alu)
fun Cpu.alu(inst: UShort) {
    val rd = decoder.decodeRd(inst)
    val rs1 = decoder.decodeRs1(inst)
    val rs2 = decoder.decodeRs2(inst)
    val a = readReg(rs1)
    val b = readReg(rs2)
    val result = alu.execute(
        operation = decoder.decodeAluInstruction(inst),
        a = a,
        b = b,
        updateCarryFlag = { carry -> updateCarry(carry = carry) }
    )
    updateZeroNegative(value = result)
    writeReg(index = rd, value = result)
}

context(decoder: IsaDecoder)
fun Cpu.call(inst: UShort) {
    val offset = decoder.decodeImm12(inst)
    push(pc { it })
    pc { pc -> (pc.toInt() + offset * 2).toUShort() }
}

fun Cpu.halt() {
    stopRunning()
}

This layer is the actual behavior of each instruction: register reads/writes, flags, stack, memory, and PC updates.
Pseudo instruction:
NOP ; expands to ADD R0, R0, R0

If you want to see the complete code for this step, check Isa.kt.
Step 5: Build The Encoder And Decoder
Encoder writes bit-fields into a UShort:
private fun encodeRegister(opcode: OpCode, rd: Int, rs1: Int, rs2: Int, aluOperation: AluInstruction): UShort {
    return (
        (opcode.code shl 12) or
            (rd shl 9) or
            (rs1 shl 6) or
            (rs2 shl 3) or
            aluOperation.operation
        ).toUShort()
}

Decoder extracts fields back out:
fun decodeOpcode(inst: UShort): OpCode {
    val code = (inst.toInt() shr 12) and 0xF
    return OpCode.entries.first { it.code == code }
}

fun decodeRd(inst: UShort) = (inst.toInt() shr 9) and 0x7
fun decodeRs1(inst: UShort) = (inst.toInt() shr 6) and 0x7
fun decodeRs2(inst: UShort) = (inst.toInt() shr 3) and 0x7

Sign extension for immediates:
fun decodeImm6(inst: UShort): Int {
    val imm = inst.toInt() and 0x3F
    return if (imm and 0x20 != 0) imm or -0x40 else imm
}

fun decodeImm12(inst: UShort): Int {
    val imm = inst.toInt() and 0xFFF
    return if (imm and 0x800 != 0) imm or -0x1000 else imm
}

Bitwise Encode / Decode
Building ADD R3, R1, R2 field by field:

1. Place opcode in bits 15..12: `0x0 shl 12` places `0000`. `shl 12` shifts the opcode value all the way to the top 4 bits. ADD belongs to the ALU opcode family, so opcode = 0x0 and all four bits are 0.
2. Place rd in bits 11..9: `3 shl 9` places `011` (rd = R3). OR-ing it with the running word fills those three bits without touching others.
3. Place rs1 in bits 8..6: `1 shl 6` places `001` (rs1 = R1).
4. Place rs2 in bits 5..3: `2 shl 3` places `010` (rs2 = R2).
5. Place aluOp in bits 2..0: the lowest field sits at positions 2..0 and needs no shift; the ADD sub-opcode is `000`.
6. Merge all fields with OR: `0x0000 or 0x0600 or 0x0040 or 0x0010 or 0x0000` gives `0000 011 001 010 000` = `0x0650`, the final 16-bit instruction.

Each field occupies a different non-overlapping group of bits, so OR safely merges them all into one 16-bit word with no collisions.
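The whole walkthrough can be checked with a small round trip. This is a standalone re-implementation of the bit layout from Step 3, not the repository's `KasmEncoder`/`IsaDecoder`:

```kotlin
// Encode an R-type word from its fields, then pull the register fields back out.
fun encodeRType(opcode: Int, rd: Int, rs1: Int, rs2: Int, aluOp: Int): UShort =
    ((opcode shl 12) or (rd shl 9) or (rs1 shl 6) or (rs2 shl 3) or aluOp).toUShort()

fun decodeRd(inst: UShort) = (inst.toInt() shr 9) and 0x7
fun decodeRs1(inst: UShort) = (inst.toInt() shr 6) and 0x7
fun decodeRs2(inst: UShort) = (inst.toInt() shr 3) and 0x7

fun main() {
    // ADD R3, R1, R2: opcode=0x0, rd=3, rs1=1, rs2=2, aluOp=0b000
    val inst = encodeRType(opcode = 0x0, rd = 3, rs1 = 1, rs2 = 2, aluOp = 0b000)
    println("0x" + inst.toString(16).padStart(4, '0'))  // 0x0650
    println(decodeRd(inst))   // 3
    println(decodeRs1(inst))  // 1
    println(decodeRs2(inst))  // 2
}
```

Because every field lands in its own bit range, decoding is just the mirror image: shift right to the bottom, then mask with `0x7`.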
Step 6: Parse .kasm Into Encoded Program Words
In the repository, this step is implemented by KasmParser with a clean two-pass assembler flow.
The public entry keeps it simple:
class KasmParser private constructor(private val encoder: KasmEncoder) {
    companion object {
        operator fun invoke(program: String): List<UShort> {
            val encoder = KasmEncoder
            return KasmParser(encoder).parse(program)
        }
    }

    fun parse(program: String): List<UShort> {
        return context(encoder) { assemble(program = program) }
    }
}

Pass 0 tokenizes lines into either Label or Instruction:
private fun parse(program: String): List<AsmLine> {
    return program.lines()
        .map { it.trim() }
        .filter { it.isNotEmpty() }
        .map { line ->
            if (line.endsWith(":")) {
                Label(line.dropLast(1))
            } else {
                val parts = line.split(" ", ",")
                    .map { it.trim() }
                    .filter { it.isNotEmpty() }
                Instruction(parts[0].uppercase(), parts.drop(1))
            }
        }
}

Pass 1 resolves label addresses in bytes (pc += 2 per instruction):
private fun resolveLabels(lines: List<AsmLine>): Map<String, Int> {
    val labels = mutableMapOf<String, Int>()
    var pc = 0
    for (line in lines) {
        when (line) {
            is Label -> labels[line.name] = pc
            is Instruction -> pc += 2
        }
    }
    return labels
}

Pass 2 encodes instructions using KasmEncoder, while pc tracks the current instruction address:
context(encoder: KasmEncoder)
private fun assemble(program: String): List<UShort> {
    val lines = parse(program)
    val labels = resolveLabels(lines)
    val output = mutableListOf<UShort>()
    var pc = 0
    for (line in lines) {
        if (line is Instruction) {
            val inst = when (line.name) {
                "ADD" -> encoder.ADD(parseReg(line.args[0]), parseReg(line.args[1]), parseReg(line.args[2]))
                "JMP" -> encoder.JMP(parseImm(line.args[0], labels, pc))
                "BEQ" -> encoder.BEQ(parseReg(line.args[0]), parseReg(line.args[1]), parseImm(line.args[2], labels, pc))
                "HALT" -> encoder.HALT()
                ...
                else -> error("Unknown instruction: ${line.name}")
            }
            output.add(inst)
            pc += 2
        }
    }
    return output
}

The critical part is immediate parsing for labels (PC-relative, word-based offsets):
private fun parseImm(value: String, labels: Map<String, Int>, pc: Int): Int {
    return when {
        value.matches(Regex("-?\\d+")) -> value.toInt()
        labels.containsKey(value) -> {
            val target = labels[value]!!
            (target - (pc + 2)) / 2
        }
        else -> error("Unknown immediate: $value")
    }
}

Why this formula works:
- `pc + 2` means "the next instruction after the current one".
- `target - (pc + 2)` gives the byte distance from the next instruction to the label.
- Dividing by `2` converts bytes into 16-bit instruction-word offsets.
Label Offset
We want to encode this line: BEQ R1, R2, loop. The assembler must convert loop into a small number called an offset.
1) Where Are We In The Program?
Each instruction is 2 bytes, so addresses move by +2.
0x0000: LI R1, 3
0x0002: LI R2, 3
0x0004: BEQ R1, R2, loop <-- current line (pc)
0x0006: HALT
0x0008: loop:
0x0008: ADD R3, R3, R1
2) Step-By-Step Math
The branch offset is computed from the next instruction, not the current one.
1. pc = 0x0004 (address of BEQ)
2. target = 0x0008 (address of label loop)
3. pc + 2 = 0x0006 (next instruction)
4. target - (pc + 2) = 0x0002 bytes
5. 0x0002 / 2 = 1 instruction word
Result: immediate offset = 1
3) Why pc + 2?
The CPU has already fetched the branch instruction, so the natural next position is pc + 2. Branch offsets are measured from that next position.
Offsets are stored in instruction words, not bytes. That is why we divide by 2 in (target - (pc + 2)) / 2.
Quick intuition: from 0x0006 to 0x0008 is one instruction ahead, so offset = 1.
Mini glossary: pc = current instruction address, target = label address, offset = how many instructions to jump.
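The offset math can be captured as a tiny helper (a hypothetical standalone function; the repository computes the same expression inside `parseImm`):

```kotlin
// Word-based, PC-relative branch offset: measured from the next
// instruction (pc + 2), in 16-bit words (divide bytes by 2).
fun labelOffset(target: Int, pc: Int): Int = (target - (pc + 2)) / 2

fun main() {
    // BEQ at 0x0004, label `loop` at 0x0008 (the example above)
    println(labelOffset(target = 0x0008, pc = 0x0004))  // 1
    // A backward jump: instruction at 0x0006, label at 0x0000
    println(labelOffset(target = 0x0000, pc = 0x0006))  // -4
}
```

Negative results fall out naturally: a label behind the current instruction produces a negative word offset, which the sign-extended immediates from Step 3 can hold.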
If you want to see the complete code for this step, check KasmParser.kt.
Step 7: Implement Fetch, Stack, And Word Memory Rules
Memory is byte-addressed in this CPU, which means:
- `PC` points to a byte address (not a word index).
- Each instruction is 16-bit = 2 bytes.
So one fetch must read two bytes:
- low byte at `PC`
- high byte at `PC + 1`
Then PC advances by 2, because we consumed 2 bytes (one full instruction).
Real fetch code:
fun fetch(): UShort {
    val low = memory[PC.toInt()].toUInt()
    val high = memory[(PC + 1u).toInt()].toUInt()
    PC = (PC + 2u).toUShort()
    return ((high shl 8) or low).toUShort()
}

Why `((high shl 8) or low)` works:
- `high shl 8`: moves the high byte into bits `[15..8]`
- `low`: already occupies bits `[7..0]`
- `or`: merges both halves into one 16-bit value
Example:
- `low = 0x34`
- `high = 0x12`
- `(high shl 8) = 0x1200`
- `0x1200 or 0x0034 = 0x1234`
push and pop use the exact same byte-splitting and byte-joining logic, just in reverse order.
Push and pop are also 16-bit and little-endian:
fun push(value: UShort) {
    SP = (SP - 2u).toUShort()
    val addr = SP.toInt() and 0xFFFF
    memory[addr] = (value.toInt() and 0xFF).toUByte()
    memory[addr + 1] = ((value.toInt() shr 8) and 0xFF).toUByte()
}

fun pop(): UShort {
    val addr = SP.toInt() and 0xFFFF
    val low = memory[addr].toUInt()
    val high = memory[addr + 1].toUInt()
    val value = ((high shl 8) or low).toUShort()
    SP = (SP + 2u).toUShort()
    return value
}

Bitwise details inside push:
- `(value.toInt() and 0xFF)`: keeps only the low 8 bits (low byte)
- `(value.toInt() shr 8) and 0xFF`: shifts the high byte down, then masks to 8 bits
Bitwise details inside pop:
- read `low` and `high` bytes from memory
- `high shl 8` puts the high byte back in `[15..8]`
- `or low` restores the full 16-bit word
In short: fetch, push, and pop are consistent because they all follow the same little-endian rule:
- low byte at lower address
- high byte at next address
Fetch And Bitwise Visualization
`PC` is a byte address. One instruction is 2 bytes, so fetch reads two bytes and then moves `PC` by `+2`.
A) Why PC + 2?
PC = 0x0010
memory[0x0010] = 0x34 (low byte)
memory[0x0011] = 0x12 (high byte)
fetch() consumes both bytes, then PC = 0x0012
Because each instruction is exactly 2 bytes, the next instruction starts at the next byte pair.
B) Join Two Bytes Into One Word
low = 0x34 = 0011 0100
high = 0x12 = 0001 0010
(high shl 8) = 0x1200 = 0001 0010 0000 0000
or low = 0x0034 = 0000 0000 0011 0100
result = 0x1234 = 0001 0010 0011 0100
C) Push/Pop Use The Same Rule
Push (split word)
low = value and 0xFF
high = (value shr 8) and 0xFF
Mask `and 0xFF` keeps only one byte.
Pop (join word)
value = (high shl 8) or low
Shift high byte into the top half, then merge with low byte.
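The same split/join rule can be exercised in isolation. This is a standalone sketch over a plain byte array, standing in for the repository's RAM and `SP` bookkeeping:

```kotlin
// Little-endian word round trip: split a 16-bit value into two bytes
// (low byte first), then join them back, exactly as push/pop do.
@OptIn(ExperimentalUnsignedTypes::class)
fun main() {
    val mem = UByteArray(2)
    val value: UShort = 0x1234u

    mem[0] = (value.toInt() and 0xFF).toUByte()          // low byte 0x34
    mem[1] = ((value.toInt() shr 8) and 0xFF).toUByte()  // high byte 0x12

    val restored = ((mem[1].toUInt() shl 8) or mem[0].toUInt()).toUShort()
    println("0x" + restored.toString(16))  // 0x1234
}
```

If the split wrote high byte first instead, the join would reconstruct `0x3412`; the low-byte-first convention is what makes fetch, push, and pop agree.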
Step 8: Build The ALU And Flag Updates
In the repository, all ALU behavior is centralized in Alu.execute, and the CPU instruction handler just feeds it operands and stores the result.
The operation enum is small and explicit:
enum class AluInstruction(val operation: Int) {
    ADD(0b000), SUB(0b001), AND(0b010), OR(0b011),
    XOR(0b100), MOV(0b101), SHL(0b110), SHR(0b111)
}

The ALU-backed CPU handler looks like this:
context(decoder: IsaDecoder, alu: Alu)
fun Cpu.alu(inst: UShort) {
    val rd = decoder.decodeRd(inst)
    val rs1 = decoder.decodeRs1(inst)
    val rs2 = decoder.decodeRs2(inst)
    val a = readReg(rs1)
    val b = readReg(rs2)
    val result = alu.execute(
        operation = decoder.decodeAluInstruction(inst),
        a = a,
        b = b,
        updateCarryFlag = { carry -> updateCarry(carry = carry) }
    )
    updateZeroNegative(value = result)
    writeReg(index = rd, value = result)
}

That structure is important:
- decoder picks which ALU operation to run
- CPU reads the source registers
- ALU computes the result and reports carry behavior
- CPU updates `zeroFlag` and `negativeFlag`
- CPU writes the final value back to `rd`
Real Alu.execute examples from the project:
ADD and SUB:
AluInstruction.ADD -> {
    val r = a.toUInt() + b.toUInt()
    updateCarryFlag(r > 0xFFFFu)
    r.toUShort()
}

AluInstruction.SUB -> {
    val r = a.toUInt() - b.toUInt()
    updateCarryFlag(a.toUInt() < b.toUInt())
    r.toUShort()
}

ADD uses UInt so the code can detect overflow beyond 0xFFFF before converting back to UShort.
SUB uses the same idea for borrow detection: if a < b, carry/borrow is recorded.
Bitwise ops are even simpler:
AluInstruction.AND -> {
    updateCarryFlag(false)
    a and b
}

AluInstruction.OR -> {
    updateCarryFlag(false)
    a or b
}

AluInstruction.XOR -> {
    updateCarryFlag(false)
    a xor b
}

AluInstruction.MOV -> {
    updateCarryFlag(false)
    a
}

These operations do not produce arithmetic carry in this CPU, so carry is explicitly reset to false.
Shift operations are slightly more interesting:
AluInstruction.SHL -> {
    val shift = (b.toInt() and 0xF)
    val r = a.toUInt() shl shift
    updateCarryFlag((a.toUInt() shl (shift - 1)) and 0x8000u != 0u)
    r.toUShort()
}

AluInstruction.SHR -> {
    val shift = (b.toInt() and 0xF)
    updateCarryFlag((a.toUInt() shr (shift - 1)) and 1u != 0u)
    val r = a.toUInt() shr shift
    r.toUShort()
}

Why `b.toInt() and 0xF`?
- only the low 4 bits of the shift amount are used
- that keeps shifts in the range `0..15`
- which makes sense for a 16-bit value
After execute returns, the CPU updates the remaining flags with real code from Cpu.kt:
fun updateCarry(carry: Boolean) {
    carryFlag = carry
}

fun updateZeroNegative(value: UShort) {
    zeroFlag = value == 0.toUShort()
    negativeFlag = (value.toInt() and 0x8000) != 0
}

So the flag rules are:
- `zeroFlag`: result equals `0`
- `negativeFlag`: bit 15 of the 16-bit result is `1`
- `carryFlag`: reported by arithmetic and shift logic
Example intuition:
- `ADD 0x0001 + 0x0001 = 0x0002` -> zero=false, negative=false
- `SUB 0x0001 - 0x0001 = 0x0000` -> zero=true
- a result like `0x8001` -> negative=true, because bit 15 is set
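Those rules can be written down as two tiny predicates (hypothetical standalone helpers mirroring what `updateZeroNegative` computes):

```kotlin
// Flag predicates over a 16-bit result, mirroring the flag rules above.
fun isZero(value: UShort): Boolean = value == 0.toUShort()
fun isNegative(value: UShort): Boolean = (value.toInt() and 0x8000) != 0

fun main() {
    println(isZero(0x0000.toUShort()))      // true  (SUB 1 - 1)
    println(isNegative(0x8001.toUShort()))  // true  (bit 15 set)
    println(isNegative(0x0002.toUShort()))  // false (ADD 1 + 1)
}
```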
ALU Operations & Flags example: `ADD R3, R1, R2` with `R1=5`, `R2=7` computes `5 + 7 = 12`. The result is positive, non-zero, and fits in 16 bits, so no flags are set. Most instructions produce this clean outcome.
If you want to see the complete code for this step, check Alu.kt.
Step 9: Implement Instruction Execution Dispatch
At this point, the CPU already knows how to:
- fetch a 16-bit instruction word
- decode fields out of that word
- run individual instruction handlers like `alu`, `load`, `store`, and `call`
So the executor loop can stay intentionally small. Its only job is to repeat the same cycle:
- fetch one instruction
- decode its top-level opcode
- route execution to the correct handler
- keep going until `HALT` stops the CPU
Core loop:
while (cpu.isRunning()) {
    val inst = cpu.fetch()
    val opcode = decoder.decodeOpcode(inst)
    executeInstruction(opCode = opcode, inst = inst)
}

This is the heart of the emulator: everything else plugs into this loop.
The nice design choice here is separation of responsibility:
- `fetch()` knows memory layout and `PC` movement
- `decodeOpcode()` knows instruction encoding
- each CPU handler knows the semantics of one instruction family
- the dispatch layer only decides where control goes next
Dispatch table:
when (opCode) {
    OpCode.ALU -> cpu.alu(inst = inst)
    OpCode.ADDI -> cpu.addi(inst = inst)
    OpCode.LI -> cpu.li(inst = inst)
    OpCode.LUI -> cpu.lui(inst = inst)
    OpCode.LOAD -> cpu.load(inst = inst)
    OpCode.STORE -> cpu.store(inst = inst)
    OpCode.BEQ -> cpu.beq(inst = inst)
    OpCode.BNE -> cpu.bne(inst = inst)
    OpCode.JMP -> cpu.jmp(inst = inst)
    OpCode.RET -> cpu.ret()
    OpCode.PUSH -> cpu.push(inst = inst)
    OpCode.POP -> cpu.pop(inst = inst)
    OpCode.CALL -> cpu.call(inst = inst)
    OpCode.HALT -> cpu.halt()
}

This table is basically the router of the CPU.
If the opcode is ALU, execution goes to cpu.alu(inst).
If it is LOAD, it goes to cpu.load(inst).
If it is HALT, it goes to cpu.halt(), which flips the running flag and ends the loop.
That keeps the system easy to extend: adding a new instruction usually means two clear steps:
- define how it is encoded and decoded
- add one handler and one dispatch case
Step 10: Control Flow Details (BEQ, BNE, JMP, CALL, RET)
Control flow is where the CPU stops following the normal "next instruction, next instruction, next instruction" pattern.
In this CPU, all branch and jump offsets are word-based, but PC is still a byte address.
That is why the handlers do offset * 2: one instruction word is 2 bytes.
Real branch/jump handlers from the project:
fun Cpu.beq(inst: UShort) {
    val rs1 = decoder.decodeRd(inst)
    val rs2 = decoder.decodeRs1(inst)
    val offset = decoder.decodeImm6(inst)
    if (readReg(rs1) == readReg(rs2)) {
        pc { pc -> (pc.toInt() + offset * 2).toUShort() }
    }
}

fun Cpu.bne(inst: UShort) {
    val rs1 = decoder.decodeRd(inst)
    val rs2 = decoder.decodeRs1(inst)
    val offset = decoder.decodeImm6(inst)
    if (readReg(rs1) != readReg(rs2)) {
        pc { pc -> (pc.toInt() + offset * 2).toUShort() }
    }
}

fun Cpu.jmp(inst: UShort) {
    val offset = decoder.decodeImm12(inst)
    pc { pc -> (pc.toInt() + offset * 2).toUShort() }
}

The important idea is simple:
- `BEQ` changes `PC` only when the registers are equal
- `BNE` changes `PC` only when the registers are different
- `JMP` always changes `PC`
If a branch is not taken, the CPU does nothing special here, because fetch() already advanced PC to the next instruction.
Branching in this implementation compares register values directly:
if (readReg(rs1) == readReg(rs2)) {
    pc { pc -> (pc.toInt() + offset * 2).toUShort() }
}

CALL and RET behavior:
fun Cpu.call(inst: UShort) {
    val offset = decoder.decodeImm12(inst)
    push(pc { it })
    pc { pc -> (pc.toInt() + offset * 2).toUShort() }
}

fun Cpu.ret() {
    pc { pop() }
}

Why do we push(pc { it }) on CALL?
Because a function call needs to remember where execution should continue after the function finishes.
At the moment call() runs, fetch() has already consumed the CALL instruction and advanced PC to the next instruction.
So the current PC is already the correct return address.
That means:
- `push(pc { it })` saves "where to come back to later"
- `pc { pc -> (pc.toInt() + offset * 2).toUShort() }` jumps into the function body
Why does RET do pc { pop() }?
Because returning from a function means restoring the saved return address.
pop() reads that address from the stack, and pc { pop() } makes execution continue exactly where the caller left off.
This push/pop pattern matters for two reasons:
- It lets the CPU come back to the instruction after `CALL`.
- It supports nested calls naturally.
Example intuition:
- main calls function A -> push return address for main
- function A calls function B -> push return address for A
- `RET` in B pops back to A
- `RET` in A pops back to main
Without the stack, one call would overwrite the previous return point, and nested calls would break.
So CALL + RET is really a small protocol:
- save return address on stack
- jump to callee
- callee eventually executes `RET`
- restore saved address and continue
This gives a clean and predictable function call model.
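The nested-call intuition can be simulated with a plain stack. Here an `ArrayDeque` stands in for RAM, and the two addresses are hypothetical return points; the real CPU pushes actual return addresses through `SP`:

```kotlin
// CALL pushes a return address; RET pops the most recent one.
// Nested calls work because the stack is last-in, first-out.
fun main() {
    val returnAddresses = ArrayDeque<Int>()

    returnAddresses.addLast(0x0006)  // main calls A: save main's resume point
    returnAddresses.addLast(0x0020)  // A calls B: save A's resume point

    // RET in B resumes A; RET in A resumes main
    println(returnAddresses.removeLast().toString(16))  // 20
    println(returnAddresses.removeLast().toString(16))  // 6
}
```

The last-in, first-out order is exactly what makes the protocol compose: the most recent `CALL` is always the first one to be returned from.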
Step 11: Load Program Into RAM
After the parser finishes, the .kasm source is no longer text.
At that point, it has been transformed into a List<UShort>, where each item is one fully encoded 16-bit instruction word.
The loader's job is to take that list and place it into RAM in the exact byte layout the CPU expects.
Loader writes encoded instruction words at address 0x0000:
operator fun invoke(program: List<UShort>): Cpu {
    val cpu = Cpu()
    var address = 0
    for (inst in program) {
        cpu.writeMemory(address = address, value = (inst.toInt() and 0xFF).toUByte())
        cpu.writeMemory(address = address + 1, value = ((inst.toInt() shr 8) and 0xFF).toUByte())
        address += 2
    }
    return cpu
}

What happens inside the loop:
- `program` is the parsed and encoded output from `KasmParser`
- each `inst` is one 16-bit instruction
- the low byte is written first
- the high byte is written second
- `address += 2` moves to the next instruction slot in RAM
So the flow is:
- `KasmReader` cleans the source text
- `KasmParser` turns the `.kasm` file into a `List<UShort>`
- `KasmLoader` writes that instruction list into RAM
- `Isa.run()` starts fetching from RAM at `PC = 0`
That means the entry point is naturally `PC = 0`, because the first encoded instruction is loaded at address `0x0000`.
Step 12: Add Debugging Trace
A small debugger hook makes development much easier, especially when you are trying to understand:
- Which instruction was just fetched
- Where `PC` is moving
- What happened to the registers after each step
This is implemented with an IsaDebugger interface that wraps one execution step:
interface IsaDebugger {
    context(decoder: IsaDecoder)
    fun Cpu.debug(opCode: OpCode, inst: UShort, step: (OpCode, UShort) -> Unit)

    data object None : IsaDebugger {
        context(decoder: IsaDecoder)
        override fun Cpu.debug(opCode: OpCode, inst: UShort, step: (OpCode, UShort) -> Unit) {
            step(opCode, inst)
        }
    }

    data object Logger : IsaDebugger {
        context(decoder: IsaDecoder)
        override fun Cpu.debug(opCode: OpCode, inst: UShort, step: (OpCode, UShort) -> Unit) {
            println(
                "[DEBUGGER] PC=${pc { it }.toString(16).padStart(4, '0')} " +
                    "INST=0x${inst.toString(16).padStart(4, '0')} " +
                    decoder.disassemble(opCode = opCode, inst = inst)
            )
            step(opCode, inst)
            println(" → ${dumpRegs()}")
        }
    }
}

This design is useful because debugging is optional.
- `IsaDebugger.None` just runs the instruction normally
- `IsaDebugger.Logger` prints trace information before and after that instruction
So you can turn tracing on without changing CPU semantics.
The debugger is plugged directly into the main run loop:
fun run() {
    context(decoder, alu) {
        while (cpu.isRunning()) {
            val inst = cpu.fetch()
            val opcode = decoder.decodeOpcode(inst)
            with(debugger) {
                cpu.debug(opCode = opcode, inst = inst, step = { op, inst ->
                    executeInstruction(opCode = op, inst = inst)
                })
            }
        }
    }
}

That means every instruction goes through the same tracing hook:
- print current `PC`, raw instruction, and disassembly
- print registers after execution
Logger prints:
- Current `PC`
- Human-readable disassembly
- Register dump after execution
The disassembly text is also generated inside IsaDebugger, so the trace is much easier to read than raw hex alone:
private fun IsaDecoder.disassemble(opCode: OpCode, inst: UShort): String {
    return when (opCode) {
        OpCode.ALU -> {
            val aluInstruction = decodeAluInstruction(inst)
            val rd = decodeRd(inst)
            val rs1 = decodeRs1(inst)
            val rs2 = decodeRs2(inst)
            when (aluInstruction) {
                AluInstruction.ADD -> "ADD R$rd, R$rs1, R$rs2"
                AluInstruction.SUB -> "SUB R$rd, R$rs1, R$rs2"
                AluInstruction.AND -> "AND R$rd, R$rs1, R$rs2"
                AluInstruction.OR -> "OR R$rd, R$rs1, R$rs2"
                AluInstruction.XOR -> "XOR R$rd, R$rs1, R$rs2"
                AluInstruction.MOV -> "MOV R$rd, R$rs1"
                AluInstruction.SHL -> "SHL R$rd, R$rs1, R$rs2"
                AluInstruction.SHR -> "SHR R$rd, R$rs1, R$rs2"
            }
        }
        ...
    }
}

Representative log output looks like this. One detail to notice: the logged `PC` is the value after `fetch()` has already advanced it, so it points to the next instruction slot, not the original byte address before fetch.
[DEBUGGER] PC=0002 INST=0x2045 LI R1, 5
→ R0=0 R1=5 R2=0 R3=0 R4=0 R5=0 R6=0 R7=0
[DEBUGGER] PC=0004 INST=0x2087 LI R2, 7
→ R0=0 R1=5 R2=7 R3=0 R4=0 R5=0 R6=0 R7=0
[DEBUGGER] PC=0006 INST=0x0650 ADD R3, R1, R2
→ R0=0 R1=5 R2=7 R3=12 R4=0 R5=0 R6=0 R7=0

This is useful for understanding control flow bugs quickly, because you can see both the instruction stream and the machine state evolve one step at a time.
If you want to see the complete code for this step, check IsaDebugger.kt.
Conclusion
Thanks for reading this far.
In this part, we took the CPU model from Part 1 and turned it into a working machine: a small ISA, a .kasm parser, an encoder/decoder, a RAM-backed CPU state, a fetch-decode-execute loop, stack-based calls, and debugger-friendly execution tracing.
That is already enough to make the system feel real. We are no longer talking only about CPU ideas in the abstract. We now have a concrete emulator that can load instructions into RAM, execute them step by step, branch, call functions, return through the stack, and expose its internal state while it runs.
At the same time, this project is intentionally small. It is a learning CPU, not a full operating-system-grade machine. For example, we did not implement things like:
- a heap allocator or dynamic memory management
- processes or concurrency
- interrupts or exception handling
- caches, pipelines, or speculative execution
- virtual memory or privilege levels
That is exactly why this project is useful: it isolates the core ideas first. Once fetch, decode, execute, memory layout, stack discipline, and control flow make sense at this scale, the bigger topics become much easier to learn later.
Thank you! ❤️