2026/03/29
16-bit CPU from Scratch in Kotlin (Part 1): How a CPU Works
A guide to CPU architecture: ISA, registers, memory, ALU, and the fetch-decode-execute cycle.
This is Part 1 of a 2-part series on building a 16-bit CPU from scratch in Kotlin.
If you have ever wondered what actually happens inside a computer when your code runs — what a register is, how memory works, why there is something called a stack — this is where we start.
This is not an easy topic at first. CPU internals can feel abstract, dense, and even a little intimidating, so give yourself patience as you read. The goal of this post is not to make everything feel instantly obvious, but to help the pieces click one by one.
By the end of this series we will have built a fully working 16-bit CPU in Kotlin: registers, memory, an ALU, a complete instruction set, and a fetch-decode-execute loop. The finished project is kotlin-cpu — if you enjoy it, a ⭐ star on GitHub goes a long way and helps others find it.
Part 1 (this post) is an initial look at how a CPU is structured and why it works the way it does. Think of it as building the mental model first — no Kotlin code yet, just the concepts you will need before we write a single line.
Part 2 is where we take everything here and turn it into real Kotlin code, step by step.
What A CPU Really Is
A CPU is a state machine that keeps transforming state.
The state includes things like:
- Register values
- Memory contents
- Status flags
- The program counter (PC)
Every instruction is a rule for changing that state.
At a big-picture level, the CPU repeats the same simple loop: get the next instruction, understand it, do it, then save the updated state.
[Interactive diagram: CPU Big Picture — the CPU reads the next instruction from memory using the current PC address. The learning path goes in order from 1 to 4; each step builds on the previous one.]

Think of state as the CPU's 'memory' at any single moment. It keeps track of a few important things: values stored in registers (like sticky notes), data in memory (like a filing cabinet), where the next instruction is (PC), and the results of recent operations (flags). This is the 'snapshot' of everything the CPU currently knows.
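This state-plus-loop view can be sketched directly in Kotlin. The sketch below is a toy preview with made-up instruction codes (HALT, INC_R0) and illustrative names — not the real kotlin-cpu design, which we build in Part 2:

```kotlin
// A CPU is a state machine: registers, PC, and memory form its state,
// and each instruction is a rule for transforming that state.
class CpuState(
    val registers: IntArray = IntArray(8),   // R0..R7, one 16-bit word each
    var pc: Int = 0,                         // program counter
    val memory: IntArray = IntArray(256)     // simplified word-addressed memory
)

const val HALT = 0     // toy opcode: stop the loop
const val INC_R0 = 1   // toy opcode: increment R0

// The loop: fetch the next instruction, decode it, execute it, repeat.
fun runCpu(state: CpuState) {
    while (true) {
        val instruction = state.memory[state.pc]   // fetch using PC as the address
        state.pc++                                 // advance PC to the next instruction
        when (instruction) {                       // decode + execute
            HALT -> return
            INC_R0 -> state.registers[0] = (state.registers[0] + 1) and 0xFFFF
        }
    }
}

fun main() {
    val cpu = CpuState()
    cpu.memory[0] = INC_R0
    cpu.memory[1] = INC_R0
    cpu.memory[2] = HALT
    runCpu(cpu)
    println(cpu.registers[0])   // 2
}
```

Every iteration reads the state, transforms it, and writes it back — that is the whole machine in miniature.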
The Internal Blocks
Inside a simple CPU, a few blocks are always involved:
- Control unit: decides what to do for current instruction
- Register file: fast storage used constantly
- ALU: math and logic execution
- Memory interface: read/write to instruction and data memory
- Buses: wires/paths for address, data, and control signals
[Interactive diagram: CPU Architecture]
The key idea: control decides, datapath executes.
What Instructions Look Like: Assembly
Software tells the CPU what to do through instructions. Each instruction is a command that changes CPU state.
Assembly language is the symbolic, human-readable form of instructions. Instead of binary, we write things like ADD R1, R2 which means "add the value in R2 to R1 and store the result in R1."
Every instruction has parts:
- Mnemonic: the name of the operation (ADD, LOAD, JMP, etc.)
- Operands: the location(s) of data (registers or memory addresses)
A typical assembly instruction looks like:
ADD R1, R2

This means:
- Operation: addition
- Source: R2
- Destination: R1
- Effect: R1 ← R1 + R2
Another example:
LOAD R3, 0x1024

This means:
- Operation: load from memory
- Address: 0x1024
- Destination: R3
- Effect: R3 ← MEM[0x1024]
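Both effects are just state updates. Here is a minimal Kotlin sketch of the two register-transfer rules, using simplified register and memory arrays (illustrative only, not the Part 2 implementation):

```kotlin
// Register-transfer view of the two example instructions.
val registers = IntArray(8)      // R0..R7
val memory = IntArray(0x2000)    // large enough to contain address 0x1024

// ADD rd, rs  =>  rd <- rd + rs, truncated to a 16-bit word
fun add(rd: Int, rs: Int) {
    registers[rd] = (registers[rd] + registers[rs]) and 0xFFFF
}

// LOAD rd, address  =>  rd <- MEM[address]
fun load(rd: Int, address: Int) {
    registers[rd] = memory[address]
}

fun main() {
    registers[1] = 3
    registers[2] = 4
    add(1, 2)               // ADD R1, R2
    memory[0x1024] = 99
    load(3, 0x1024)         // LOAD R3, 0x1024
    println(registers[1])   // 7
    println(registers[3])   // 99
}
```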
When the CPU fetches and decodes either of these instructions, the control unit translates the mnemonic into a set of micro-operations: which registers to read, which ALU function to run, where to write the result.
The CPU does not actually understand ADD as text. What it reads is a binary encoding of that instruction. The assembler (a tool) converts mnemonics into that binary form before the program runs.
Here is a simple Kotlin-to-ASM mental model:
val a = 5
val b = 7
val sum = a + b
if (sum == 12) {
    print("ok")
}

A compiler lowers this high-level code into many small instructions that move data through registers, run ALU ops, compare values, and branch.
Kotlin to Hypothetical ASM

With a in R1 and b in R2, the addition could lower to:

MOV R3, R1
ADD R3, R2

Copy a into a destination register, then perform the add with b.
Real compilers generate many more instructions (stack setup, register spills, calling convention rules). This example keeps only the core idea.
Registers, ALU, Flags
Registers are the CPU's fastest storage. They live directly on the CPU chip — not in external RAM — making them orders of magnitude faster than main memory. A typical CPU has between 8 and 32 registers, each holding exactly one word.
A word is the native chunk of data the CPU works with in a single operation. It is the bite size of the machine. Our CPU uses 16-bit words, meaning every register holds exactly 16 bits — a number between 0 and 65,535 unsigned, or −32,768 to +32,767 signed. Real-world CPUs use 32-bit or 64-bit words; the bigger the word, the more values they can represent and address in one step.
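Kotlin has no convenient built-in 16-bit unsigned arithmetic, so a common trick — and the one sketched here — is to keep words in an Int and mask to 16 bits, reinterpreting the same bits as signed when needed:

```kotlin
// 16-bit words in Kotlin: store values in an Int, mask to the low 16 bits.
fun toWord(value: Int): Int = value and 0xFFFF

// Interpret the same 16 bits as a signed number (two's complement).
fun toSigned(word: Int): Int = word.toShort().toInt()

fun main() {
    println(toWord(65535 + 1))   // 0: unsigned wraparound past the maximum
    println(toSigned(0xFFFF))    // -1: same bits, signed interpretation
    println(toSigned(0x7FFF))    // 32767: the largest positive signed value
}
```

Same 16 bits, two interpretations — that is exactly the unsigned/signed range split described above.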
Most instructions follow the same pattern: read one or two registers, pass the values to the ALU, and write the result back to a register.
The Register File
Registers come in two categories.
General-purpose registers (R0, R1, R2, ...) hold any temporary value the program needs. The CPU enforces no meaning on them — that is left entirely to software conventions such as calling conventions used by compilers.
Special-purpose registers have hardware-enforced roles:
- PC (program counter): always holds the address of the next instruction to fetch. After every fetch the CPU increments it automatically. Jump and call instructions overwrite it directly — that is the only mechanism for control flow. There is no "go to next line" built into programs; it is all just PC manipulation.
- SP (stack pointer): tracks the top of the call stack in memory. The stack grows downward: CALL decrements SP and pushes the return address; RET pops that address back into PC. Local variables also live on the stack relative to SP.
- FLAGS (status register): a small register whose bits are set by the ALU after every arithmetic or logic operation.
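CALL and RET really are nothing more than PC and SP bookkeeping. A sketch with word-addressed memory and a downward-growing stack (variable names are illustrative):

```kotlin
// CALL/RET as pure PC + SP manipulation. The stack grows downward in memory.
var pc = 0x0010                  // program counter
var sp = 0x00F8                  // stack pointer (top of stack)
val memory = IntArray(0x100)     // word-addressed memory for this sketch

fun call(target: Int) {
    sp--                         // grow the stack downward
    memory[sp] = pc              // push the return address (PC already points past CALL)
    pc = target                  // jump into the function
}

fun ret() {
    pc = memory[sp]              // pop the return address back into PC
    sp++                         // shrink the stack
}

fun main() {
    pc = 0x0011                  // pretend the CALL at 0x0010 was just fetched
    call(0x0040)
    println("in function: pc=0x%04X sp=0x%04X".format(pc, sp))
    ret()
    println("returned:    pc=0x%04X sp=0x%04X".format(pc, sp))
}
```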
[Interactive: Special Registers in Action — step through how PC advances instruction by instruction and how SP moves when a function is called and returned.]

Before execution begins, PC is pointed at the program's entry point — address 0x0010 here — and the CPU fetches the instruction at that address first.
The ALU
The ALU (arithmetic and logic unit) is the math engine of the CPU. It is a purely combinational circuit — no internal state, no clock of its own. It takes two inputs and an operation code, computes a result instantly, and produces two outputs: the result value and a set of status bits (flags).
Common ALU operations:
- ADD, SUB: integer arithmetic
- AND, OR, XOR, NOT: bitwise logic
- CMP: subtraction that only updates flags, discarding the result — common in many real CPUs before conditional jumps
- SHL, SHR: left and right bit shifts
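A combinational ALU is naturally a pure function: inputs and an operation code in, result out, no state kept. A sketch of the operation list above (the enum and the final masking are my choices for this sketch, not the Part 2 code):

```kotlin
// A purely combinational ALU: no internal state, just input -> output.
enum class AluOp { ADD, SUB, AND, OR, XOR, NOT, SHL, SHR }

fun alu(op: AluOp, a: Int, b: Int): Int = when (op) {
    AluOp.ADD -> a + b
    AluOp.SUB -> a - b
    AluOp.AND -> a and b
    AluOp.OR  -> a or b
    AluOp.XOR -> a xor b
    AluOp.NOT -> a.inv()      // b is ignored for unary NOT
    AluOp.SHL -> a shl b
    AluOp.SHR -> a ushr b     // logical (unsigned) right shift
} and 0xFFFF                  // every result is truncated to a 16-bit word
```

Note how subtraction that "goes negative" still produces a valid 16-bit bit pattern: alu(AluOp.SUB, 5, 7) yields 0xFFFE, the two's-complement encoding of -2.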
Flags
After every ALU operation the status flags are updated. Each is a single bit:
| Flag | Name | Set when |
|---|---|---|
| Z | Zero | Result is exactly 0 |
| N | Negative | Result's sign bit (bit 15) is 1 |
| C | Carry | Unsigned result exceeded 16 bits |
| V | Overflow | Signed result wrapped sign |
Different CPUs expose different flag sets and branch instructions. For example, many real CPUs use instructions like CMP, JE, and JLT, while our project CPU in Part 2 keeps things smaller and uses BEQ / BNE with zero, carry, and negative flags. The core idea is the same: flags give later instructions information about what just happened.
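The four flags from the table can be computed directly from an addition. A sketch for ADD (CMP would run the same logic on a subtraction and keep only the flags):

```kotlin
// Computing the four status flags after a 16-bit ADD.
data class Flags(val z: Boolean, val n: Boolean, val c: Boolean, val v: Boolean)

fun flagsAfterAdd(a: Int, b: Int): Flags {
    val full = a + b                 // intermediate result, may need a 17th bit
    val result = full and 0xFFFF     // truncated 16-bit result
    val signA = a and 0x8000 != 0
    val signB = b and 0x8000 != 0
    val signR = result and 0x8000 != 0
    return Flags(
        z = result == 0,             // Zero: result is exactly 0
        n = signR,                   // Negative: sign bit (bit 15) is 1
        c = full > 0xFFFF,           // Carry: unsigned result exceeded 16 bits
        v = signA == signB && signR != signA  // Overflow: same-sign inputs, flipped sign out
    )
}
```

For example, 0x7FFF + 1 (32767 + 1) sets N and V but not C: the signed value wrapped even though the unsigned one fit.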
[Interactive: ALU Operations & Flags]

Example: ADD R1, R2 with R1 = 5 and R2 = 7. 5 + 7 = 12. The result is positive, non-zero, and fits in 16 bits. No flags are set. Most instructions produce this clean outcome.
If you want the smallest useful mental model: registers hold working values, the ALU transforms them, and flags record what happened.
Memory And Bus Traffic
A CPU talks to memory through three shared "lanes" (buses):
- Address bus: where to read/write
- Control bus: what operation to do (READ or WRITE)
- Data bus: the value moving between CPU and memory
Think of one memory operation in 3 simple steps:
- CPU puts an address on the address bus ("go to this location").
- CPU sets control signals (READ or WRITE) ("what kind of operation").
- Data moves on the data bus ("the actual value or instruction bytes").
This is not just hardware detail. It is the concrete mechanism behind instruction fetch, load, and store.
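The three steps map onto a tiny bus model. This is an illustrative sketch — real buses are wires and timing, not a class — but it makes the where/what/value split concrete:

```kotlin
// One memory operation modeled as the three bus steps described above.
class Bus(private val memory: IntArray) {
    var address = 0        // address bus: where to read/write
    var control = ""       // control bus: READ or WRITE
    var data = 0           // data bus: the value in flight

    fun step(): Int = when (control) {
        "READ"  -> { data = memory[address]; data }   // memory drives the data bus
        "WRITE" -> { memory[address] = data; data }   // CPU drives the data bus
        else    -> error("no operation selected")
    }
}

fun main() {
    val bus = Bus(IntArray(256).also { it[0x10] = 42 })
    bus.address = 0x10     // step 1: put the address on the address bus
    bus.control = "READ"   // step 2: set the control signal
    println(bus.step())    // step 3: the value moves on the data bus -> 42
}
```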
[Interactive diagram: Address, Control, Data Buses — the CPU asks memory for the next instruction: where in memory (address), read or write (control), what value moves (data).]

Use this rule of thumb: address = where, control = read or write, data = the value.
At this point you can think of the CPU as a machine that keeps moving values between registers, ALU, and memory.
Endianness: Byte Order In Memory
When a value is larger than one byte (for example 16-bit, 32-bit, or 64-bit), memory still stores it as separate bytes. Endianness tells us the order of those bytes in memory.
- Little-endian: lowest byte at the lowest address
- Big-endian: highest byte at the lowest address
Example with the 16-bit value 0x1234 stored starting at address 0x1000:
- Little-endian: MEM[0x1000] = 0x34, MEM[0x1001] = 0x12
- Big-endian: MEM[0x1000] = 0x12, MEM[0x1001] = 0x34
Our CPU model in this series uses little-endian layout for multi-byte values.
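The 0x1234 example above can be checked in code. Here is a little-endian read/write sketch over a plain ByteArray standing in for memory:

```kotlin
// Little-endian layout for a 16-bit value, as used by our CPU model.
fun write16LE(memory: ByteArray, address: Int, value: Int) {
    memory[address] = (value and 0xFF).toByte()              // low byte at the low address
    memory[address + 1] = ((value shr 8) and 0xFF).toByte()  // high byte one above it
}

fun read16LE(memory: ByteArray, address: Int): Int {
    val low = memory[address].toInt() and 0xFF
    val high = memory[address + 1].toInt() and 0xFF
    return (high shl 8) or low
}

fun main() {
    val memory = ByteArray(0x2000)
    write16LE(memory, 0x1000, 0x1234)
    println("%02X %02X".format(memory[0x1000], memory[0x1001]))  // 34 12
    println("0x%04X".format(read16LE(memory, 0x1000)))           // 0x1234
}
```

Note the `and 0xFF` in read16LE: Kotlin's Byte is signed, so without the mask a byte like 0x80 would sign-extend and corrupt the reassembled word — exactly the kind of subtle endianness-adjacent bug the section warns about.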
Why this matters in real-world systems:
- Network protocols: internet protocols traditionally use big-endian ("network byte order"). If a little-endian host sends raw integers without conversion, the receiver reads wrong values.
- Binary file parsing: reading a binary header with the wrong byte order can produce nonsense sizes/offsets, which often causes corrupted parsing or crashes.
- Cross-language serialization: one service writes bytes in little-endian, another reads as big-endian; IDs, lengths, timestamps, or checksums become incorrect.
- Debugger confusion: memory dump bytes can look correct but interpreted values look wrong when the expected endianness is not matched.
Endianness bugs are subtle because the data is still there, just interpreted in the wrong order. That is why low-level code, binary protocols, and emulator code must always make byte order explicit.
Fetch Decode Execute, But In Micro-Steps
At a high level, every CPU instruction follows the same loop: fetch, decode, execute, then move to the next instruction.
That sounds simple, but inside the CPU this is broken into very small timed actions (micro-steps). Think of micro-steps as tiny checklist items the hardware performs in order.
For one instruction, a beginner-friendly view is:
Fetch: use PC as an address and read instruction bytes from memory.
Decode: figure out what operation this instruction means and which registers or memory locations are involved.
Execute: do the actual work, such as ALU math, a load or store, or a branch check.
Next: update PC to the next instruction, or overwrite it if a jump or branch is taken.
Why this matters: when code behaves unexpectedly, the bug is usually in one of these tiny steps (wrong address, wrong decode, wrong flag, wrong next PC).
Fetch Decode Execute (Simple Walkthrough)
One instruction is handled in this order: Fetch -> Decode -> Execute -> Next.
Fetch (stage 1 of 4): get the next instruction from memory. The CPU uses PC as an address and asks memory to return the instruction stored there.

Hardware actions:

- Address bus <- PC
- Control <- READ
- IR <- MEM[PC]

Example state: PC=0x0012, IR=ADD R1,R2
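Those hardware actions map almost one to one onto code. A sketch of the fetch stage, where IR is the instruction register (names are illustrative):

```kotlin
// The fetch stage broken into micro-steps, over word-addressed memory.
class FetchUnit(private val memory: IntArray) {
    var pc = 0      // program counter
    var ir = 0      // instruction register: holds the fetched instruction

    fun fetch() {
        val address = pc        // micro-step 1: address bus <- PC
        ir = memory[address]    // micro-steps 2+3: control <- READ, IR <- MEM[PC]
        pc++                    // point PC at the next instruction
    }
}

fun main() {
    val unit = FetchUnit(intArrayOf(11, 22, 33))  // three fake instruction words
    unit.fetch()
    println("pc=${unit.pc} ir=${unit.ir}")        // pc=1 ir=11
}
```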
Thinking in micro-steps helps explain why control logic exists and why timing order matters.
Datapath vs Control Path (Practical View)
This distinction is one of the most useful ways to think about a CPU.
The datapath is the part that moves and changes values.
- Register reads/writes
- ALU inputs/outputs
- Memory data transfers
If you ask, "where is the data going?" or "what value is being changed?", you are thinking about the datapath.
The control path is the part that tells the datapath what to do and when to do it.
- Decode opcode
- Choose ALU function
- Choose source/destination muxes
- Trigger memory read/write
- Choose next PC
If you ask, "which operation should happen now?" or "which register should be written?", you are thinking about the control path.
Here is an easy way to separate them:
- Datapath is the engine
- Control path is the driver
For example, imagine the instruction ADD R1, R2:

- The datapath reads R1 and R2, sends them into the ALU, computes the sum, and writes the result back
- The control path recognizes that this is an ADD, tells the ALU to do addition, tells the register file which inputs to read, and tells the CPU where the result should be written
So the datapath is the machinery that carries values around, while the control path is the decision-making logic that coordinates the machinery.
Both are always working together. A datapath with no control does not know what operation to perform. A control path with no datapath has no hardware to move or transform values.
Once this makes sense, the more abstract software-facing view becomes easier to understand.
ISA: The Contract Between Software And CPU
The instruction set architecture (ISA) is the contract software must follow.
The ISA defines:
- Available instructions (ADD, LOAD, JMP, ...)
- Register model
- Instruction bit encoding
- Addressing modes
- Word size and memory behavior
- Flag semantics
If software and hardware agree on the ISA, programs run correctly.
This is the layer a compiler, assembler, or emulator targets. The compiler does not care how your ALU is wired internally. It cares that ADD R1, R2 means a specific operation with specific rules.
Why Encoding Matters
Internally, the control unit reads bit fields, not keywords.
For example, an instruction word is usually split into fields like:
- Opcode
- Destination register
- Source register or immediate
- Optional mode bits
Instruction Encoding Explorer
Real `.kasm` lines and the exact 16-bit instruction words they encode to.
R-Type Example

.kasm instruction:

ADD R3, R1, R2

bits: 0000 011 001 010 000
hex: 0x0650

| Field | Bits | Value | Meaning |
|---|---|---|---|
| opcode | 15..12 | 0000 | ALU family |
| rd | 11..9 | 011 | R3 (destination) |
| rs1 | 8..6 | 001 | R1 |
| rs2 | 5..3 | 010 | R2 |
| aluOp | 2..0 | 000 | ADD |
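Packing and unpacking those fields is plain bit arithmetic. Here is a sketch matching the field positions in the table (it mirrors the layout, but is not copied from the kotlin-cpu source):

```kotlin
// Encode/decode the R-type layout: opcode(4) | rd(3) | rs1(3) | rs2(3) | aluOp(3).
fun encodeRType(opcode: Int, rd: Int, rs1: Int, rs2: Int, aluOp: Int): Int =
    (opcode shl 12) or (rd shl 9) or (rs1 shl 6) or (rs2 shl 3) or aluOp

// Decoding is the reverse: shift the field down, then mask to its width.
fun decodeRd(word: Int): Int = (word shr 9) and 0b111
fun decodeRs1(word: Int): Int = (word shr 6) and 0b111
fun decodeRs2(word: Int): Int = (word shr 3) and 0b111

fun main() {
    // ADD R3, R1, R2 from the example above
    val word = encodeRType(opcode = 0b0000, rd = 3, rs1 = 1, rs2 = 2, aluOp = 0b000)
    println("0x%04X".format(word))   // 0x0650
}
```

This is exactly what the control unit does in hardware: it never sees the text "ADD", only these bit fields.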
Addressing Modes
Addressing mode tells where an operand comes from.
Most beginner ISAs include at least:
- Immediate: the value is written directly inside the instruction. LOADI R1, 5 means "put the number 5 into R1."
- Register: the value comes from another register. ADD R1, R2 means "take the value already stored in R2."
- Direct: the instruction contains a memory address. LOAD R1, 0x1024 means "read the value stored at memory address 0x1024."
- Register-indirect: a register holds the memory address. LOAD R1, [R3]: if R3 = 0x1024, this means "go to the address stored in R3, then read from there."
The important difference is this:
- Immediate gives you the value itself.
- Register gives you a value already inside the CPU.
- Direct gives you a fixed memory location.
- Register-indirect gives you a memory location stored in a register.
Addressing modes are a design tradeoff between flexibility and encoding complexity.
They are also one of the places where ISA design directly shapes what a compiler can emit efficiently.
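The four modes differ only in how the operand value is found. A Kotlin sketch of that lookup (the sealed class and names are illustrative, not the Part 2 decoder):

```kotlin
// Resolving an operand value by addressing mode.
sealed class Operand {
    data class Immediate(val value: Int) : Operand()          // the value itself
    data class Register(val index: Int) : Operand()           // a register number
    data class Direct(val address: Int) : Operand()           // a fixed memory address
    data class RegisterIndirect(val index: Int) : Operand()   // register holds the address
}

fun resolve(op: Operand, registers: IntArray, memory: IntArray): Int = when (op) {
    is Operand.Immediate -> op.value
    is Operand.Register -> registers[op.index]
    is Operand.Direct -> memory[op.address]
    is Operand.RegisterIndirect -> memory[registers[op.index]]
}
```

Notice that register-indirect is just two lookups chained: first the register, then memory at the address it holds.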
Control Flow And The Stack
Without control flow, a program would only run straight downward, one instruction after another.
Control flow is what lets a program:
- Skip code
- Repeat code in loops
- Choose between two paths with if
- Jump into a function and later come back
At the hardware level, all of this is really just one thing: changing PC.
If PC keeps increasing normally, execution stays linear.
If an instruction writes a new address into PC, execution jumps somewhere else.
That is why these instructions matter:
- JMP: always replace PC with a new address
- Conditional branches: replace PC only if a condition is true
- CALL: jump to a function, but first save where to come back to
- RET: restore the saved return address and continue from there
The stack is what makes CALL and RET practical.
When a function call happens, the CPU needs to remember:
- Where the function should return
- Any local values that belong only to that function
- Sometimes saved register values from the caller
That temporary information is usually stored on the stack.
A simple mental model is:
- CALL = jump away and leave a bookmark on the stack
- RET = read the bookmark and jump back
Control Flow and Stack
Think of control flow as one question: what should PC point to next?
When there is no jump, the CPU simply keeps moving forward one instruction at a time: if PC is now 0x0012, the next PC is 0x0013.
This is why SP is fundamental in most CPU designs: it keeps track of where the current stack data lives while control flow moves in and out of functions.
End-To-End Example: Watch State Change
Program:
LOAD R1, 5
LOAD R2, 7
ADD R1, R2
CMP R1, 12
JE done

What to track at each step:

- PC movement
- Flag updates
- Memory accesses
CPU State Timeline
Before the program runs (initial state):
| Register | Value |
|---|---|
| PC | 0x0010 |
| SP | 0x00F8 |
| R1 | 0x0000 |
| R2 | 0x0000 |
| FLAGS | Z=0 C=0 N=0 V=0 |
Program In Memory
| Address | Instruction |
|---|---|
| 0x0010 | LOAD R1, 5 |
| 0x0011 | LOAD R2, 7 |
| 0x0012 | ADD R1, R2 |
| 0x0013 | CMP R1, 12 |
| 0x0014 | JE done |
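To make the timeline concrete, here is the same five-instruction program executed as plain Kotlin, with each line annotated by the address it corresponds to. The address of the done label (0x0020) is a made-up value for this sketch:

```kotlin
// The example program as explicit state changes: watch PC, R1, R2, and Z.
fun runExample(): Triple<Int, Boolean, Int> {
    var pc = 0x0010
    var r1 = 0
    var r2 = 0
    var z = false
    val done = 0x0020                     // hypothetical address of the `done` label

    r1 = 5; pc++                          // 0x0010: LOAD R1, 5
    r2 = 7; pc++                          // 0x0011: LOAD R2, 7
    r1 = (r1 + r2) and 0xFFFF; pc++       // 0x0012: ADD R1, R2  -> R1 = 12
    z = (r1 - 12) == 0; pc++              // 0x0013: CMP R1, 12  -> result is 0, Z set
    pc = if (z) done else pc + 1          // 0x0014: JE done     -> taken, PC = done
    return Triple(r1, z, pc)              // final R1, Z flag, and PC
}

fun main() {
    val (r1, z, pc) = runExample()
    println("R1=$r1 Z=$z PC=0x%04X".format(pc))
}
```

Every step is one of the four things to track: a PC move, a register write, a flag update, or a memory access.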
This is the same mental model you will use while building an emulator.
Common Confusions
- ISA vs implementation: ISA is the contract; emulator/hardware is an implementation.
- Registers vs memory: both store values, but registers are far fewer and much faster.
- ALU vs control unit: ALU computes, control unit orchestrates.
- Instructions are data: machine code is bytes interpreted through ISA rules.
Part 2 Setup
In Part 2 we turn this model into Kotlin code by defining:
- CPU state model
- Memory model
- Instruction decoder
- Execution loop
- ALU/flags behavior
- Test strategy
Next post: how I designed and built kotlin-cpu.
Read Part 2: Building the CPU →