2026/03/29
16-bit CPU from Scratch in Kotlin (Part 1): How a CPU Works
A guide to CPU architecture: ISA, registers, memory, ALU, and the fetch-decode-execute cycle.
This is Part 1 of a 2-part series on building a 16-bit CPU from scratch in Kotlin.
If you have ever wondered what actually happens inside a computer when your code runs — what a register is, how memory works, why there is something called a stack — this is where we start.
This is not an easy topic at first. CPU internals can feel abstract, dense, and even a little intimidating, so give yourself patience as you read. The goal of this post is not to make everything feel instantly obvious, but to help the pieces click one by one.
By the end of this series we will have built a fully working 16-bit CPU in Kotlin: registers, memory, an ALU, a complete instruction set, and a fetch-decode-execute loop. The finished project is kotlin-cpu — if you enjoy it, a ⭐ star on GitHub goes a long way and helps others find it.
Part 1 (this post) is an initial look at how a CPU is structured and why it works the way it does. Think of it as building the mental model first — no Kotlin code yet, just the concepts you will need before we write a single line.
Part 2 is where we take everything here and turn it into real Kotlin code, step by step.
What A CPU Really Is
A CPU is a state machine that keeps transforming state.
The state includes things like:
- Register values
- Memory contents
- Status flags
- The program counter (PC)
Every instruction is a rule for changing that state.
At a big-picture level, the CPU repeats the same simple loop: get the next instruction, understand it, do it, then save the updated state.
[Interactive diagram: CPU Big Picture — the CPU reads the next instruction from memory using the current PC address. The learning path goes in order from 1 to 4; each step builds on the previous one.]

Think of state as the CPU's 'memory' at any single moment. It keeps track of a few important things: values stored in registers (like sticky notes), data in memory (like a filing cabinet), where the next instruction is (PC), and the results of recent operations (flags). This is the 'snapshot' of everything the CPU currently knows.
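This state-plus-loop view can be sketched directly in Kotlin. The sketch below is a toy preview with made-up instruction codes (HALT, INC_R0) and illustrative names — not the real kotlin-cpu design, which we build in Part 2:

```kotlin
// A CPU is a state machine: registers, PC, and memory form its state,
// and each instruction is a rule for transforming that state.
class CpuState(
    val registers: IntArray = IntArray(8),   // R0..R7, one 16-bit word each
    var pc: Int = 0,                         // program counter
    val memory: IntArray = IntArray(256)     // simplified word-addressed memory
)

const val HALT = 0     // toy opcode: stop the loop
const val INC_R0 = 1   // toy opcode: increment R0

// The loop: fetch the next instruction, decode it, execute it, repeat.
fun runCpu(state: CpuState) {
    while (true) {
        val instruction = state.memory[state.pc]   // fetch using PC as the address
        state.pc++                                 // advance PC to the next instruction
        when (instruction) {                       // decode + execute
            HALT -> return
            INC_R0 -> state.registers[0] = (state.registers[0] + 1) and 0xFFFF
        }
    }
}

fun main() {
    val cpu = CpuState()
    cpu.memory[0] = INC_R0
    cpu.memory[1] = INC_R0
    cpu.memory[2] = HALT
    runCpu(cpu)
    println(cpu.registers[0])   // 2
}
```

Every iteration reads the state, transforms it, and writes it back — that is the whole machine in miniature.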
The Internal Blocks
Inside a simple CPU, a few blocks are always involved:
- Control unit: decides what to do for current instruction
- Register file: fast storage used constantly
- ALU: math and logic execution
- Memory interface: read/write to instruction and data memory
- Buses: wires/paths for address, data, and control signals
[Interactive diagram: CPU Architecture]
The key idea: control decides, datapath executes.
What Instructions Look Like: Assembly
Software tells the CPU what to do through instructions. Each instruction is a command that changes CPU state.
Assembly language is the symbolic, human-readable form of instructions. Instead of binary, we write things like ADD R1, R2 which means "add the value in R2 to R1 and store the result in R1."
Every instruction has parts:
- Mnemonic: the name of the operation (ADD, LOAD, JMP, etc.)
- Operands: the location(s) of data (registers or memory addresses)
A typical assembly instruction looks like:
ADD R1, R2

This means:
- Operation: addition
- Source: R2
- Destination: R1
- Effect: R1 ← R1 + R2
Another example:
LOAD R3, 0x1024

This means:
- Operation: load from memory
- Address: 0x1024
- Destination: R3
- Effect: R3 ← MEM[0x1024]
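Both effects are just state updates. Here is a minimal Kotlin sketch of the two register-transfer rules, using simplified register and memory arrays (illustrative only, not the Part 2 implementation):

```kotlin
// Register-transfer view of the two example instructions.
val registers = IntArray(8)      // R0..R7
val memory = IntArray(0x2000)    // large enough to contain address 0x1024

// ADD rd, rs  =>  rd <- rd + rs, truncated to a 16-bit word
fun add(rd: Int, rs: Int) {
    registers[rd] = (registers[rd] + registers[rs]) and 0xFFFF
}

// LOAD rd, address  =>  rd <- MEM[address]
fun load(rd: Int, address: Int) {
    registers[rd] = memory[address]
}

fun main() {
    registers[1] = 3
    registers[2] = 4
    add(1, 2)               // ADD R1, R2
    memory[0x1024] = 99
    load(3, 0x1024)         // LOAD R3, 0x1024
    println(registers[1])   // 7
    println(registers[3])   // 99
}
```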
When the CPU fetches and decodes either of these instructions, the control unit translates the mnemonic into a set of micro-operations: which registers to read, which ALU function to run, where to write the result.
The CPU does not actually understand ADD as text. What it reads is a binary encoding of that instruction. The assembler (a tool) converts mnemonics into that binary form before the program runs.
Here is a simple Kotlin-to-ASM mental model:
val a = 5
val b = 7
val sum = a + b
if (sum == 12) {
    print("ok")
}

A compiler lowers this high-level code into many small instructions that move data through registers, run ALU ops, compare values, and branch.
Kotlin to Hypothetical ASM

With a in R1 and b in R2, the addition could lower to:

MOV R3, R1
ADD R3, R2

Copy a into a destination register, then perform the add with b.
Real compilers generate many more instructions (stack setup, register spills, calling convention rules). This example keeps only the core idea.
Registers, ALU, Flags
Registers are the CPU's fastest storage. They live directly on the CPU chip — not in external RAM — making them orders of magnitude faster than main memory. A typical CPU has between 8 and 32 registers, each holding exactly one word.
A word is the native chunk of data the CPU works with in a single operation. It is the bite size of the machine. Our CPU uses 16-bit words, meaning every register holds exactly 16 bits — a number between 0 and 65,535 unsigned, or −32,768 to +32,767 signed. Real-world CPUs use 32-bit or 64-bit words; the bigger the word, the more values they can represent and address in one step.
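Kotlin has no convenient built-in 16-bit unsigned arithmetic, so a common trick — and the one sketched here — is to keep words in an Int and mask to 16 bits, reinterpreting the same bits as signed when needed:

```kotlin
// 16-bit words in Kotlin: store values in an Int, mask to the low 16 bits.
fun toWord(value: Int): Int = value and 0xFFFF

// Interpret the same 16 bits as a signed number (two's complement).
fun toSigned(word: Int): Int = word.toShort().toInt()

fun main() {
    println(toWord(65535 + 1))   // 0: unsigned wraparound past the maximum
    println(toSigned(0xFFFF))    // -1: same bits, signed interpretation
    println(toSigned(0x7FFF))    // 32767: the largest positive signed value
}
```

Same 16 bits, two interpretations — that is exactly the unsigned/signed range split described above.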
Most instructions follow the same pattern: read one or two registers, pass the values to the ALU, and write the result back to a register.
The Register File
Registers come in two categories.
General-purpose registers (R0, R1, R2, ...) hold any temporary value the program needs. The CPU enforces no meaning on them — that is left entirely to software conventions such as calling conventions used by compilers.
Special-purpose registers have hardware-enforced roles:
- PC (program counter): always holds the address of the next instruction to fetch. After every fetch the CPU increments it automatically. Jump and call instructions overwrite it directly — that is the only mechanism for control flow. There is no "go to next line" built into programs; it is all just PC manipulation.
- SP (stack pointer): tracks the top of the call stack in memory. The stack grows downward: CALL decrements SP and pushes the return address; RET pops that address back into PC. Local variables also live on the stack relative to SP.
- FLAGS (status register): a small register whose bits are set by the ALU after every arithmetic or logic operation.
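CALL and RET really are nothing more than PC and SP bookkeeping. A sketch with word-addressed memory and a downward-growing stack (variable names are illustrative):

```kotlin
// CALL/RET as pure PC + SP manipulation. The stack grows downward in memory.
var pc = 0x0010                  // program counter
var sp = 0x00F8                  // stack pointer (top of stack)
val memory = IntArray(0x100)     // word-addressed memory for this sketch

fun call(target: Int) {
    sp--                         // grow the stack downward
    memory[sp] = pc              // push the return address (PC already points past CALL)
    pc = target                  // jump into the function
}

fun ret() {
    pc = memory[sp]              // pop the return address back into PC
    sp++                         // shrink the stack
}

fun main() {
    pc = 0x0011                  // pretend the CALL at 0x0010 was just fetched
    call(0x0040)
    println("in function: pc=0x%04X sp=0x%04X".format(pc, sp))
    ret()
    println("returned:    pc=0x%04X sp=0x%04X".format(pc, sp))
}
```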
[Interactive: Special Registers in Action — step through how PC advances instruction by instruction and how SP moves when a function is called and returned.]

Before execution begins, PC is pointed at the program's entry point — address 0x0010 here — and the CPU fetches the instruction at that address first.
The ALU
The ALU (arithmetic and logic unit) is the math engine of the CPU. It is a purely combinational circuit — no internal state, no clock of its own. It takes two inputs and an operation code, computes a result instantly, and produces two outputs: the result value and a set of status bits (flags).
Common ALU operations:
- ADD, SUB: integer arithmetic
- AND, OR, XOR, NOT: bitwise logic
- CMP: subtraction that only updates flags, discarding the result — common in many real CPUs before conditional jumps
- SHL, SHR: left and right bit shifts
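A combinational ALU is naturally a pure function: inputs and an operation code in, result out, no state kept. A sketch of the operation list above (the enum and the final masking are my choices for this sketch, not the Part 2 code):

```kotlin
// A purely combinational ALU: no internal state, just input -> output.
enum class AluOp { ADD, SUB, AND, OR, XOR, NOT, SHL, SHR }

fun alu(op: AluOp, a: Int, b: Int): Int = when (op) {
    AluOp.ADD -> a + b
    AluOp.SUB -> a - b
    AluOp.AND -> a and b
    AluOp.OR  -> a or b
    AluOp.XOR -> a xor b
    AluOp.NOT -> a.inv()      // b is ignored for unary NOT
    AluOp.SHL -> a shl b
    AluOp.SHR -> a ushr b     // logical (unsigned) right shift
} and 0xFFFF                  // every result is truncated to a 16-bit word
```

Note how subtraction that "goes negative" still produces a valid 16-bit bit pattern: alu(AluOp.SUB, 5, 7) yields 0xFFFE, the two's-complement encoding of -2.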
Flags
After every ALU operation the status flags are updated. Each is a single bit:
| Flag | Name | Set when |
|---|---|---|
| Z | Zero | Result is exactly 0 |
| N | Negative | Result's sign bit (bit 15) is 1 |
| C | Carry | Unsigned result exceeded 16 bits |
| V | Overflow | Signed result wrapped sign |
Different CPUs expose different flag sets and branch instructions. For example, many real CPUs use instructions like CMP, JE, and JLT, while our project CPU in Part 2 keeps things smaller and uses BEQ / BNE with zero, carry, and negative flags. The core idea is the same: flags give later instructions information about what just happened.
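The four flags from the table can be computed directly from an addition. A sketch for ADD (CMP would run the same logic on a subtraction and keep only the flags):

```kotlin
// Computing the four status flags after a 16-bit ADD.
data class Flags(val z: Boolean, val n: Boolean, val c: Boolean, val v: Boolean)

fun flagsAfterAdd(a: Int, b: Int): Flags {
    val full = a + b                 // intermediate result, may need a 17th bit
    val result = full and 0xFFFF     // truncated 16-bit result
    val signA = a and 0x8000 != 0
    val signB = b and 0x8000 != 0
    val signR = result and 0x8000 != 0
    return Flags(
        z = result == 0,             // Zero: result is exactly 0
        n = signR,                   // Negative: sign bit (bit 15) is 1
        c = full > 0xFFFF,           // Carry: unsigned result exceeded 16 bits
        v = signA == signB && signR != signA  // Overflow: same-sign inputs, flipped sign out
    )
}
```

For example, 0x7FFF + 1 (32767 + 1) sets N and V but not C: the signed value wrapped even though the unsigned one fit.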
[Interactive: ALU Operations & Flags]

Example: ADD R1, R2 with R1 = 5 and R2 = 7. 5 + 7 = 12. The result is positive, non-zero, and fits in 16 bits. No flags are set. Most instructions produce this clean outcome.
If you want the smallest useful mental model: registers hold working values, the ALU transforms them, and flags record what happened.
Memory And Bus Traffic
A CPU talks to memory through three shared "lanes" (buses):
- Address bus: where to read/write
- Control bus: what operation to do (READ or WRITE)
- Data bus: the value moving between CPU and memory
Think of one memory operation in 3 simple steps:
- CPU puts an address on the address bus ("go to this location").
- CPU sets control signals (READ or WRITE) ("what kind of operation").
- Data moves on the data bus ("the actual value or instruction bytes").
This is not just hardware detail. It is the concrete mechanism behind instruction fetch, load, and store.
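The three steps map onto a tiny bus model. This is an illustrative sketch — real buses are wires and timing, not a class — but it makes the where/what/value split concrete:

```kotlin
// One memory operation modeled as the three bus steps described above.
class Bus(private val memory: IntArray) {
    var address = 0        // address bus: where to read/write
    var control = ""       // control bus: READ or WRITE
    var data = 0           // data bus: the value in flight

    fun step(): Int = when (control) {
        "READ"  -> { data = memory[address]; data }   // memory drives the data bus
        "WRITE" -> { memory[address] = data; data }   // CPU drives the data bus
        else    -> error("no operation selected")
    }
}

fun main() {
    val bus = Bus(IntArray(256).also { it[0x10] = 42 })
    bus.address = 0x10     // step 1: put the address on the address bus
    bus.control = "READ"   // step 2: set the control signal
    println(bus.step())    // step 3: the value moves on the data bus -> 42
}
```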
[Interactive diagram: Address, Control, Data Buses — the CPU asks memory for the next instruction: where in memory (address), read or write (control), what value moves (data).]

Use this rule of thumb: address = where, control = read or write, data = the value.
At this point you can think of the CPU as a machine that keeps moving values between registers, ALU, and memory.
Endianness: Byte Order In Memory
When a value is larger than one byte (for example 16-bit, 32-bit, or 64-bit), memory still stores it as separate bytes. Endianness tells us the order of those bytes in memory.
- Little-endian: lowest byte at the lowest address
- Big-endian: highest byte at the lowest address
Example with the 16-bit value 0x1234 stored starting at address 0x1000:
- Little-endian: MEM[0x1000] = 0x34, MEM[0x1001] = 0x12
- Big-endian: MEM[0x1000] = 0x12, MEM[0x1001] = 0x34
Our CPU model in this series uses little-endian layout for multi-byte values.
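The 0x1234 example above can be checked in code. Here is a little-endian read/write sketch over a plain ByteArray standing in for memory:

```kotlin
// Little-endian layout for a 16-bit value, as used by our CPU model.
fun write16LE(memory: ByteArray, address: Int, value: Int) {
    memory[address] = (value and 0xFF).toByte()              // low byte at the low address
    memory[address + 1] = ((value shr 8) and 0xFF).toByte()  // high byte one above it
}

fun read16LE(memory: ByteArray, address: Int): Int {
    val low = memory[address].toInt() and 0xFF
    val high = memory[address + 1].toInt() and 0xFF
    return (high shl 8) or low
}

fun main() {
    val memory = ByteArray(0x2000)
    write16LE(memory, 0x1000, 0x1234)
    println("%02X %02X".format(memory[0x1000], memory[0x1001]))  // 34 12
    println("0x%04X".format(read16LE(memory, 0x1000)))           // 0x1234
}
```

Note the `and 0xFF` in read16LE: Kotlin's Byte is signed, so without the mask a byte like 0x80 would sign-extend and corrupt the reassembled word — exactly the kind of subtle endianness-adjacent bug the section warns about.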
Why this matters in real-world systems:
- Network protocols: internet protocols traditionally use big-endian ("network byte order"). If a little-endian host sends raw integers without conversion, the receiver reads wrong values.
- Binary file parsing: reading a binary header with the wrong byte order can produce nonsense sizes/offsets, which often causes corrupted parsing or crashes.
- Cross-language serialization: one service writes bytes in little-endian, another reads as big-endian; IDs, lengths, timestamps, or checksums become incorrect.
- Debugger confusion: memory dump bytes can look correct but interpreted values look wrong when the expected endianness is not matched.
Endianness bugs are subtle because the data is still there, just interpreted in the wrong order. That is why low-level code, binary protocols, and emulator code must always make byte order explicit.
Fetch Decode Execute, But In Micro-Steps
At a high level, every CPU instruction follows the same loop: fetch, decode, execute, then move to the next instruction.
That sounds simple, but inside the CPU this is broken into very small timed actions (micro-steps). Think of micro-steps as tiny checklist items the hardware performs in order.
For one instruction, a beginner-friendly view is:
Fetch: use PC as an address and read instruction bytes from memory.
Decode: figure out what operation this instruction means and which registers or memory locations are involved.
Execute: do the actual work, such as ALU math, a load or store, or a branch check.
Next: update PC to the next instruction, or overwrite it if a jump or branch is taken.
Why this matters: when code behaves unexpectedly, the bug is usually in one of these tiny steps (wrong address, wrong decode, wrong flag, wrong next PC).
Fetch Decode Execute (Simple Walkthrough)
One instruction is handled in this order: Fetch -> Decode -> Execute -> Next.
Fetch (stage 1 of 4): get the next instruction from memory. The CPU uses PC as an address and asks memory to return the instruction stored there.

Hardware actions:

- Address bus <- PC
- Control <- READ
- IR <- MEM[PC]

Example state: PC=0x0012, IR=ADD R1,R2
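Those hardware actions map almost one to one onto code. A sketch of the fetch stage, where IR is the instruction register (names are illustrative):

```kotlin
// The fetch stage broken into micro-steps, over word-addressed memory.
class FetchUnit(private val memory: IntArray) {
    var pc = 0      // program counter
    var ir = 0      // instruction register: holds the fetched instruction

    fun fetch() {
        val address = pc        // micro-step 1: address bus <- PC
        ir = memory[address]    // micro-steps 2+3: control <- READ, IR <- MEM[PC]
        pc++                    // point PC at the next instruction
    }
}

fun main() {
    val unit = FetchUnit(intArrayOf(11, 22, 33))  // three fake instruction words
    unit.fetch()
    println("pc=${unit.pc} ir=${unit.ir}")        // pc=1 ir=11
}
```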
Thinking in micro-steps helps explain why control logic exists and why timing order matters.
Datapath vs Control Path (Practical View)
This distinction is one of the most useful ways to think about a CPU.
The datapath is the part that moves and changes values.
- Register reads/writes
- ALU inputs/outputs
- Memory data transfers
If you ask, "where is the data going?" or "what value is being changed?", you are thinking about the datapath.
The control path is the part that tells the datapath what to do and when to do it.
- Decode opcode
- Choose ALU function
- Choose source/destination muxes
- Trigger memory read/write
- Choose next PC
If you ask, "which operation should happen now?" or "which register should be written?", you are thinking about the control path.
Here is an easy way to separate them:
- Datapath is the engine
- Control path is the driver
For example, imagine the instruction ADD R1, R2:

- The datapath reads R1 and R2, sends them into the ALU, computes the sum, and writes the result back
- The control path recognizes that this is an ADD, tells the ALU to do addition, tells the register file which inputs to read, and tells the CPU where the result should be written
So the datapath is the machinery that carries values around, while the control path is the decision-making logic that coordinates the machinery.
Both are always working together. A datapath with no control does not know what operation to perform. A control path with no datapath has no hardware to move or transform values.
Once this makes sense, the more abstract software-facing view becomes easier to understand.
ISA: The Contract Between Software And CPU
The instruction set architecture (ISA) is the contract software must follow.
The ISA defines:
- Available instructions (ADD, LOAD, JMP, ...)
- Register model
- Instruction bit encoding
- Addressing modes
- Word size and memory behavior
- Flag semantics
If software and hardware agree on the ISA, programs run correctly.
This is the layer a compiler, assembler, or emulator targets. The compiler does not care how your ALU is wired internally. It cares that ADD R1, R2 means a specific operation with specific rules.
Why Encoding Matters
Internally, the control unit reads bit fields, not keywords.
For example, an instruction word is usually split into fields like:
- Opcode
- Destination register
- Source register or immediate
- Optional mode bits
Instruction Encoding Explorer
Real `.kasm` lines and the exact 16-bit instruction words they encode to.
R-Type Example

.kasm instruction:

ADD R3, R1, R2

bits: 0000 011 001 010 000
hex: 0x0650

| Field | Bits | Value | Meaning |
|---|---|---|---|
| opcode | 15..12 | 0000 | ALU family |
| rd | 11..9 | 011 | R3 (destination) |
| rs1 | 8..6 | 001 | R1 |
| rs2 | 5..3 | 010 | R2 |
| aluOp | 2..0 | 000 | ADD |
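Packing and unpacking those fields is plain bit arithmetic. Here is a sketch matching the field positions in the table (it mirrors the layout, but is not copied from the kotlin-cpu source):

```kotlin
// Encode/decode the R-type layout: opcode(4) | rd(3) | rs1(3) | rs2(3) | aluOp(3).
fun encodeRType(opcode: Int, rd: Int, rs1: Int, rs2: Int, aluOp: Int): Int =
    (opcode shl 12) or (rd shl 9) or (rs1 shl 6) or (rs2 shl 3) or aluOp

// Decoding is the reverse: shift the field down, then mask to its width.
fun decodeRd(word: Int): Int = (word shr 9) and 0b111
fun decodeRs1(word: Int): Int = (word shr 6) and 0b111
fun decodeRs2(word: Int): Int = (word shr 3) and 0b111

fun main() {
    // ADD R3, R1, R2 from the example above
    val word = encodeRType(opcode = 0b0000, rd = 3, rs1 = 1, rs2 = 2, aluOp = 0b000)
    println("0x%04X".format(word))   // 0x0650
}
```

This is exactly what the control unit does in hardware: it never sees the text "ADD", only these bit fields.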
Addressing Modes
Addressing mode tells where an operand comes from.
Most beginner ISAs include at least:
- Immediate: the value is written directly inside the instruction. LOADI R1, 5 means "put the number 5 into R1."
- Register: the value comes from another register. ADD R1, R2 means "take the value already stored in R2."
- Direct: the instruction contains a memory address. LOAD R1, 0x1024 means "read the value stored at memory address 0x1024."
- Register-indirect: a register holds the memory address. LOAD R1, [R3]: if R3 = 0x1024, this means "go to the address stored in R3, then read from there."
The important difference is this:
- Immediate gives you the value itself.
- Register gives you a value already inside the CPU.
- Direct gives you a fixed memory location.
- Register-indirect gives you a memory location stored in a register.
Addressing modes are a design tradeoff between flexibility and encoding complexity.
They are also one of the places where ISA design directly shapes what a compiler can emit efficiently.
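The four modes differ only in how the operand value is found. A Kotlin sketch of that lookup (the sealed class and names are illustrative, not the Part 2 decoder):

```kotlin
// Resolving an operand value by addressing mode.
sealed class Operand {
    data class Immediate(val value: Int) : Operand()          // the value itself
    data class Register(val index: Int) : Operand()           // a register number
    data class Direct(val address: Int) : Operand()           // a fixed memory address
    data class RegisterIndirect(val index: Int) : Operand()   // register holds the address
}

fun resolve(op: Operand, registers: IntArray, memory: IntArray): Int = when (op) {
    is Operand.Immediate -> op.value
    is Operand.Register -> registers[op.index]
    is Operand.Direct -> memory[op.address]
    is Operand.RegisterIndirect -> memory[registers[op.index]]
}
```

Notice that register-indirect is just two lookups chained: first the register, then memory at the address it holds.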
Control Flow And The Stack
Without control flow, a program would only run straight downward, one instruction after another.
Control flow is what lets a program:
- Skip code
- Repeat code in loops
- Choose between two paths with if
- Jump into a function and later come back
At the hardware level, all of this is really just one thing: changing PC.
If PC keeps increasing normally, execution stays linear.
If an instruction writes a new address into PC, execution jumps somewhere else.
That is why these instructions matter:
- JMP: always replace PC with a new address
- Conditional branches: replace PC only if a condition is true
- CALL: jump to a function, but first save where to come back to
- RET: restore the saved return address and continue from there
The stack is what makes CALL and RET practical.
When a function call happens, the CPU needs to remember:
- Where the function should return
- Any local values that belong only to that function
- Sometimes saved register values from the caller
That temporary information is usually stored on the stack.
A simple mental model is:
- CALL = jump away and leave a bookmark on the stack
- RET = read the bookmark and jump back
Control Flow and Stack
Think of control flow as one question: what should PC point to next?
When there is no jump, the CPU simply keeps moving forward one instruction at a time: if PC is now 0x0012, the next PC is 0x0013.
This is why SP is fundamental in most CPU designs: it keeps track of where the current stack data lives while control flow moves in and out of functions.
End-To-End Example: Watch State Change
Program:
LOAD R1, 5
LOAD R2, 7
ADD R1, R2
CMP R1, 12
JE done

What to track at each step:

- PC movement
- Flag updates
- Memory accesses
CPU State Timeline
Before the program runs (initial state):
| Register | Value |
|---|---|
| PC | 0x0010 |
| SP | 0x00F8 |
| R1 | 0x0000 |
| R2 | 0x0000 |
| FLAGS | Z=0 C=0 N=0 V=0 |
Program In Memory
| Address | Instruction |
|---|---|
| 0x0010 | LOAD R1, 5 |
| 0x0011 | LOAD R2, 7 |
| 0x0012 | ADD R1, R2 |
| 0x0013 | CMP R1, 12 |
| 0x0014 | JE done |
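To make the timeline concrete, here is the same five-instruction program executed as plain Kotlin, with each line annotated by the address it corresponds to. The address of the done label (0x0020) is a made-up value for this sketch:

```kotlin
// The example program as explicit state changes: watch PC, R1, R2, and Z.
fun runExample(): Triple<Int, Boolean, Int> {
    var pc = 0x0010
    var r1 = 0
    var r2 = 0
    var z = false
    val done = 0x0020                     // hypothetical address of the `done` label

    r1 = 5; pc++                          // 0x0010: LOAD R1, 5
    r2 = 7; pc++                          // 0x0011: LOAD R2, 7
    r1 = (r1 + r2) and 0xFFFF; pc++       // 0x0012: ADD R1, R2  -> R1 = 12
    z = (r1 - 12) == 0; pc++              // 0x0013: CMP R1, 12  -> result is 0, Z set
    pc = if (z) done else pc + 1          // 0x0014: JE done     -> taken, PC = done
    return Triple(r1, z, pc)              // final R1, Z flag, and PC
}

fun main() {
    val (r1, z, pc) = runExample()
    println("R1=$r1 Z=$z PC=0x%04X".format(pc))
}
```

Every step is one of the four things to track: a PC move, a register write, a flag update, or a memory access.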
This is the same mental model you will use while building an emulator.
Common Confusions
- ISA vs implementation: ISA is the contract; emulator/hardware is an implementation.
- Registers vs memory: both store values, but registers are far fewer and much faster.
- ALU vs control unit: ALU computes, control unit orchestrates.
- Instructions are data: machine code is bytes interpreted through ISA rules.
Part 2 Setup
In Part 2 we turn this model into Kotlin code by defining:
- CPU state model
- Memory model
- Instruction decoder
- Execution loop
- ALU/flags behavior
- Test strategy
Next post: how I designed and built kotlin-cpu.
Read Part 2: Building the CPU →