Pipelining
- Latency (time it takes to finish a single task) is unchanged
- Throughput (number of jobs finished per hour) increases
- Maximum throughput speedup = number of stages
- Limited by cost of filling and draining pipeline: not all resources used at the start and end
- Pipeline rate limited by slowest pipeline stage
- Faster stages have to wait for slower stages
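The fill-and-drain limit above can be made concrete with a short worked formula. As a sketch, assume $k$ perfectly balanced stages that each take time $t$, and $n$ tasks (these symbols are introduced here for illustration and do not appear in the notes):

$$
T_{\text{seq}} = n\,k\,t, \qquad
T_{\text{pipe}} = (k + n - 1)\,t, \qquad
\text{speedup} = \frac{T_{\text{seq}}}{T_{\text{pipe}}} = \frac{n\,k}{k + n - 1}
$$

For $k = 5$ and $n = 5$ the speedup is only $25/9 \approx 2.8$; as $n$ grows it approaches $k$, the number of stages.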
Pipelining Stages
We can pipeline the datapath by adding registers between the stages, so the clock period can be as short as 200 ps (the time of the slowest stage).
| Stage | IF | ID | EX | MEM | WB |
|---|---|---|---|---|---|
| Device | IMEM | Reg | ALU | DMEM | Reg |
| Time | 200ps | 100ps | 200ps | 200ps | 100ps |
| Event | IMEM read | Reg read | Execute | Memory access | Reg write |
Single Cycle vs. Pipelined
add t0, t1, t2  |IF|ID|EX|  |WB|
lw t0, 8(t3)       |IF|ID|EX|ME|WB|
or t3, t4, t5         |IF|ID|EX|  |WB|
sw t0, 4(t3)             |IF|ID|EX|ME|  |
sll t6, t0, t3              |IF|ID|EX|  |WB|
- Sequential: Resource use by the same instruction over time (multiple clock cycles)
- Simultaneous: Resource use by multiple instructions in the same clock cycle
Latency vs. Processor Throughput:
| | Single Cycle | Pipelined |
|---|---|---|
| Timing of each stage | 200, 100, 200, 200, 100 ps | 200 ps |
| Latency | 800 ps | 1000 ps |
| Clock cycle time | 800 ps | 200 ps |
| Clock rate | 1.25 GHz | 5 GHz |
| CPI | ~1 | ~1 (ideal), >1 once hazards add stalls |
| Relative throughput | 1x | 4x |
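The numbers in this table follow directly from the stage timings. The Python sketch below recomputes them; the function names are made up for illustration, and the pipelined total assumes no hazards or stalls.

```python
# Stage latencies in picoseconds, taken from the stage table earlier in these notes.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_clock_ps(stages):
    # One long cycle has to cover every stage back to back.
    return sum(stages.values())


def pipelined_clock_ps(stages):
    # The clock can be no faster than the slowest stage.
    return max(stages.values())


def clock_rate_ghz(period_ps):
    # 1 / (period in ps) is a rate in THz; multiply by 1000 to get GHz.
    return 1000 / period_ps


def total_time_ps(n_instructions, stages, pipelined):
    if not pipelined:
        return n_instructions * single_cycle_clock_ps(stages)
    # Fill the pipeline once, then finish one instruction per cycle
    # (assumption: no hazards, so no stall cycles).
    cycles = len(stages) + n_instructions - 1
    return cycles * pipelined_clock_ps(stages)


single = single_cycle_clock_ps(STAGE_PS)       # clock period, single cycle
piped = pipelined_clock_ps(STAGE_PS)           # clock period, pipelined
pipelined_latency = len(STAGE_PS) * piped      # one instruction through 5 stages

n = 1_000_000
throughput_ratio = total_time_ps(n, STAGE_PS, False) / total_time_ps(n, STAGE_PS, True)
```

For a long instruction stream `throughput_ratio` approaches 800 / 200 = 4, which is the 4x entry in the table; it is below the 5x stage count because the stages are not balanced.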
Constructing a Pipelined RV32I Datapath
- A pipelined datapath needs to "separate" the five stages of the RV32I datapath.
- Each stage needs to process data from a different instruction
- Use pipeline registers to carry instruction data between stages
IF/ID Pipeline Registers
- IF/ID has two pipeline registers: PC_ID, inst_ID
- Increment PC to PC+4 for next cycle's IF stage
ID/EX Pipeline Registers
- The instruction needs to be piped along with its data so that control operates correctly in each stage
- Five registers: PC_EX, ra1_EX, ra2_EX, imm_EX, inst_EX
EX/MEM Pipeline Registers
- Four registers: PC_MEM, alu_MEM, rs2_MEM, inst_MEM
- rs2 (data to store) needs to be piped through to MEM
MEM/WB Pipeline Registers
- Four registers, 3 before the MUX: PC+4_WB, alu_WB, mem_WB, inst_WB
- The PC from PC_MEM needs +4 before the PC+4_WB register
- The instruction is finally piped back and decoded to rsW to ensure the data write
Pipeline of Control
- Control signals are derived from the instruction
- Computed during ID stage
- Control information for later stages is stored in the pipeline registers and passed forward with the instruction:
- IF/ID: Derive control information
- ID/EX: EX_ctrl, MEM_ctrl, WB_ctrl
- EX/MEM: MEM_ctrl, WB_ctrl
- MEM/WB: WB_ctrl
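The shrinking set of control fields can be sketched in Python. This is a toy model only: the `Control` class, the `decode` table and the string encodings are hypothetical stand-ins for real RV32I control signals; only the names EX_ctrl, MEM_ctrl and WB_ctrl come from the notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Control:
    ex: str   # hypothetical ALU operation select
    mem: str  # hypothetical memory action: "read", "write" or "none"
    wb: str   # hypothetical writeback source: "alu", "mem" or "none"


def decode(inst):
    """ID stage: derive the control for all later stages (toy lookup, not real decode)."""
    table = {
        "add": Control(ex="add", mem="none", wb="alu"),
        "lw": Control(ex="add", mem="read", wb="mem"),
        "sw": Control(ex="add", mem="write", wb="none"),
    }
    return table[inst.split()[0]]


def id_ex(ctrl):
    # Everything the later stages need travels with the instruction.
    return {"EX_ctrl": ctrl.ex, "MEM_ctrl": ctrl.mem, "WB_ctrl": ctrl.wb}


def ex_mem(reg):
    # EX_ctrl was consumed in the EX stage, so it is dropped here.
    return {"MEM_ctrl": reg["MEM_ctrl"], "WB_ctrl": reg["WB_ctrl"]}


def mem_wb(reg):
    # Only the writeback control is still needed.
    return {"WB_ctrl": reg["WB_ctrl"]}


lw_id_ex = id_ex(decode("lw t0, 8(t3)"))
lw_mem_wb = mem_wb(ex_mem(lw_id_ex))
```

Each pipeline register holds only the control bits that some later stage still needs, so the bundle gets narrower as the instruction moves toward WB.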
Hazards
A hazard is a situation that prevents starting the next instruction in the next clock cycle.
Types:
- Structural hazard
- A required resource is busy (e.g. needed in multiple stages)
- e.g., Two memory accesses in one cycle (instruction fetch and data access, if IMEM and DMEM share one memory)
- Data hazard
- Data dependency between instructions
- Need to wait for previous instruction to complete its data read/write
- e.g., The result t3 is written (WB) several stages after a later instruction needs to read it (ID)
- Control hazard
- Flow of execution depends on previous instruction
- e.g., Branching
Structural Hazards
Hardware does not support access across multiple instructions in the same cycle.
- Occurs when multiple instructions compete for access to a single physical resource
Solution 1 (inefficient):
- Instructions take turns using the resource
- Some instructions stall when the resource is busy
Solution 2: Add more hardware
- In current CPUs, structural hazards are not an issue
- RV32I ISA datapath avoids structural hazards via its hardware requirements on RegFile and Memory
FIX: Required RegFile
- Each RV32I instruction:
- Reads up to 2 operands in ID (decode) stage
- Writes up to 1 operand in WB (writeback) stage
- Structural hazard can occur if RegFile HW does not support simultaneous read/write
- RV32I's required RegFile design works:
- Two independent read ports, one independent write port
- Three accesses (2 read, 1 write) can happen in the same cycle
FIX: Separate IMEM, DMEM
- The CPU can access memory twice in the same cycle:
- IF: Instruction memory (IMEM)
- MEM: Data memory (DMEM)
- Structural hazard if IMEM, DMEM were same hardware:
- Without separate memories, instruction fetch would have to stall for a cycle
- RV32I's required separation of IMEM and DMEM works
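To see why a single shared memory would force a stall, the Python sketch below lines up the IF and MEM cycles for the five-instruction example used earlier in these notes. It assumes an ideal five-stage pipeline that issues one instruction per cycle; the function names are invented for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def uses_data_memory(inst):
    # Only loads and stores touch memory in the MEM stage (toy check on the mnemonic).
    return inst.split()[0] in {"lw", "sw"}


def unified_memory_conflicts(program):
    """Return (cycle, mem_inst, fetch_inst) for each cycle where MEM and IF collide."""
    conflicts = []
    mem_offset = STAGES.index("MEM")  # MEM happens 3 cycles after the same instruction's IF
    for i, inst in enumerate(program):
        if not uses_data_memory(inst):
            continue
        fetch_index = i + mem_offset  # the instruction being fetched in that same cycle
        if fetch_index < len(program):
            cycle = fetch_index + 1   # 1-based cycle number
            conflicts.append((cycle, inst, program[fetch_index]))
    return conflicts


program = [
    "add t0, t1, t2",
    "lw t0, 8(t3)",
    "or t3, t4, t5",
    "sw t0, 4(t3)",
    "sll t6, t0, t3",
]
conflicts = unified_memory_conflicts(program)
```

In this example the only collision is in cycle 5, where `lw` is in MEM while `sll` is being fetched. With separate IMEM and DMEM both accesses proceed in the same cycle.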
Instruction and Data Caches
- Two fast, separate on-chip memories, one for instruction and one for data:
+----------------------------------------+     +------------+
| Processor                              |     | Memory     |
|  +-----------+                         |     |            |
|  | Control   |                         |     |            |
|  +--|-----^--+                         |     |            |
|     |     |                            |     |            |
|  +--V-----|--+      +-------------+    |     |            |
|  | Datapath  |      | Instruction |    |     |            |
|  |           <----->| Cache       |<-------->|            |
|  |           |      +-------------+    |     |            |
|  |           |                         |     |            |
|  |           |      +-------------+    |     |            |
|  |           <----->| Data        |<-------->|            |
|  |           |      | Cache       |    |     |            |
|  |           |      +-------------+    |     |            |
|  +-----------+                         |     |            |
+----------------------------------------+     +------------+
Data Hazards
- Instructions have data dependencies
- Need to wait for previous instruction to complete its data read/write
Occurs when an instruction reads a register before a previous instruction has finished writing to that register.
Three cases:
- Register access
- ALU result
- Load data hazard
Data Hazard 1: Register Access
Problem: If the same register is written and read in one cycle:
- WB must write value before ID reads new value
- Not structural hazard! Separate ports allow simultaneous R/W
                        Both RegFile!!
                               V
add >t0<, t1, t2  |IF|ID|EX|  >WB<
lw t0, 8(t3)         |IF|ID|EX|ME|WB|
or t3, t4, t5           |IF|ID|EX|  |WB|
sw >t0<, 4(t3)             |IF>ID<EX|ME|  |
sll t6, t0, t3                |IF|ID|EX|  |WB|
Solution: RegFile HW should write-then-read in same cycle
- Exploits high speed of RegFile (100 ps + 100 ps)
- Might not always be possible in high-frequency designs
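A minimal sketch of the write-then-read behavior, using a toy Python model. The `RegFile` class, its `cycle` method and the `write_first` flag are illustrative names and are not part of any RV32I specification; register numbers follow the usual RISC-V convention in which t0 is x5.

```python
class RegFile:
    """Toy 32-entry register file with two read ports and one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def cycle(self, rs1, rs2, rd=None, wdata=0, write_first=True):
        """Perform all accesses of one clock cycle; return (value_rs1, value_rs2)."""

        def write():
            if rd is not None and rd != 0:  # x0 is hard-wired to zero
                self.regs[rd] = wdata

        if write_first:
            write()                          # first half of the cycle: WB writes
            return self.regs[rs1], self.regs[rs2]  # second half: ID reads
        values = (self.regs[rs1], self.regs[rs2])  # read the old values first
        write()
        return values


T0 = 5  # t0 is register x5

rf = RegFile()
rf.regs[T0] = 5
new_value_seen = rf.cycle(T0, 0, rd=T0, wdata=9, write_first=True)

stale = RegFile()
stale.regs[T0] = 5
old_value_seen = stale.cycle(T0, 0, rd=T0, wdata=9, write_first=False)
```

With `write_first=True` the instruction in ID sees the value 9 in the same cycle that WB writes it; with `write_first=False` it would read the stale 5 and the hazard would remain.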
In one cycle:
|>>> Reg |
| Reg >>>|
Data Hazard 2: ALU Result
Problem: Instruction depends on WB's RegFile write from previous instruction.
- Instructions that read the old value calculate the wrong result
add >s0<, t0, t1  |IF|ID|EX|  >WB<
sub t2, >s0<, t1     |IF>ID<EX|  |WB|
or t6, s0, t3           |IF>ID<EX|  |WB|
xor t5, t1, s0             |IF|ID|EX|  |WB|
sw s0, 4(t4)                  |IF|ID|EX|ME|WB|
s0 value          |5 |5 |5 |5 |5/9|9 |9 |9 |9 |
Solution 1: Stalling
Insert a "bubble", which is effectively a nop
- Affected pipeline stages do nothing during clock cycles
- Stall by preventing the PC and the IF/ID pipeline register from being written (see textbook)
add s0, t0, t1  |IF|ID|EX|ME|WB|
sub -> nop         |IF|()|()|()|()|
sub -> nop            |IF|()|()|()|()|
sub t2, s0, t0           |IF|ID|EX|ME|WB|
Stalls reduce performance
- Compiler could rearrange code/insert nops to avoid hazards, but this requires knowledge of the pipeline structure
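The number of bubbles needed can be worked out mechanically. The Python sketch below assumes a five-stage pipeline with no forwarding and a RegFile that writes before it reads within a cycle, so a consumer's ID may share a cycle with the producer's WB. The parser is a toy that handles only three-register instructions; all function names are made up for illustration.

```python
def parse(inst):
    """Toy parser: return (dest, sources) for three-register instructions only."""
    _op, rest = inst.split(maxsplit=1)
    regs = [r.strip() for r in rest.split(",")]
    return regs[0], regs[1:]


def schedule_with_stalls(program):
    """Return the 1-based IF cycle of each instruction when hazards are solved by stalling."""
    if_cycles = []
    wb_cycle = {}   # register name -> cycle in which its newest value is written back
    next_if = 1     # earliest cycle in which the next instruction could be fetched
    for inst in program:
        dest, sources = parse(inst)
        # ID comes one cycle after IF and must not be earlier than the producer's WB.
        earliest_id = max([wb_cycle.get(s, 0) for s in sources] + [next_if + 1])
        start = earliest_id - 1
        if_cycles.append(start)
        wb_cycle[dest] = start + 4   # IF, ID, EX, MEM, WB
        next_if = start + 1
    return if_cycles


def stall_count(program):
    # Without stalls, instruction number i (1-based) would be fetched in cycle i.
    return schedule_with_stalls(program)[-1] - len(program)


pair = ["add s0, t0, t1", "sub t2, s0, t0"]
chain = ["add s0, t0, t1", "sub t2, s0, t1", "or t6, s0, t3", "xor t5, t1, s0"]
```

For `pair` the `sub` is fetched in cycle 4 instead of cycle 2, which is the two-bubble case drawn above. For `chain` the same two stall cycles are enough, because `or` and `xor` are pushed back along with `sub`.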
Solution 2: Forwarding
Forwarding, aka bypassing, uses the result when it is computed.
- Don't wait for value to be stored into RegFile
- Instead, grab the operand from the pipeline stage
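One common way to express the forwarding decision is a priority check against the destination registers held in the later pipeline registers. The Python sketch below is an assumption about how such a selector could look, not the datapath from these notes: the function name, its parameters and the returned labels are hypothetical, and it covers ALU results only (a load's data is not yet available in EX/MEM).

```python
def forward_select(rs, ex_mem_rd, ex_mem_writes, mem_wb_rd, mem_wb_writes):
    """Choose the source of an operand needed by the instruction in EX.

    rs            -- source register number the EX-stage instruction needs
    ex_mem_rd     -- destination register of the instruction one stage ahead
    mem_wb_rd     -- destination register of the instruction two stages ahead
    *_writes      -- whether that instruction writes the RegFile at all
    """
    if ex_mem_writes and ex_mem_rd != 0 and ex_mem_rd == rs:
        return "EX/MEM"   # newest value: ALU result of the previous instruction
    if mem_wb_writes and mem_wb_rd != 0 and mem_wb_rd == rs:
        return "MEM/WB"   # value that is about to be written back
    return "RegFile"      # no dependency in flight: use the value read in ID


S0 = 8  # s0 is register x8

# add s0, t0, t1 followed directly by sub t2, s0, t1: sub takes s0 from EX/MEM.
adjacent = forward_select(S0, ex_mem_rd=S0, ex_mem_writes=True, mem_wb_rd=0, mem_wb_writes=False)
# One unrelated instruction in between: the value has moved on to MEM/WB.
one_apart = forward_select(S0, ex_mem_rd=9, ex_mem_writes=True, mem_wb_rd=S0, mem_wb_writes=True)
```

The EX/MEM check comes first so that, when both later stages target the same register, the most recent result wins. Register x0 is excluded because it always reads as zero.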