ETC INSTRUCTIN603EWP

Optimizing Instruction Execution in the
PowerPC™ 603e Superscalar
Microprocessor
by
Top Changwatchai
Skipper Smith
Nasr Ullah
Motorola RISC Microprocessor Division
System Performance Modeling and Simulation
Document 1998-014-1
Version 1.0
WTC, SS, NU
Page 1
RISC Microprocessor Division
Outline
• Superscalar Microprocessor Architectures
• The PowerPC 603e Superscalar Microprocessor
• Stall Conditions
–
–
–
–
Dispatch and Completion Stalls
Execution Unit Stalls
Load and Store Stalls
Instruction Interaction Stalls
• Summary
Page 2
RISC Microprocessor Division
This presentation discusses techniques for optimizing instruction execution in a superscalar
microprocessor architecture such as the PowerPC™ 603e microprocessor.
Instruction execution in a superscalar processor is enhanced by allowing the parallel execution of
multiple instructions. In order to enable the maximum potential of most superscalar processors, one
needs to be aware of their instruction flow and execution mechanisms. Optimal performance in a
microprocessor can be attained by ensuring a continuous flow of instructions through the instruction
pipeline.
Being aware of the dependencies and constraints of the instruction flow mechanisms allows one to
generate code that can most effectively and optimally take advantage of all the capabilities of a
superscalar processor such as the PowerPC 603e microprocessor.
The 603e is a low-power implementation of the PowerPC family of reduced instruction set computer
(RISC) microprocessors. The 603e is a superscalar processor capable of issuing and retiring as many
as three instructions per clock. Instructions can “execute” out-of-order for increased performance, but
they “retire” in-order to ensure functional correctness and well-ordered behavior.
In this paper, we first discuss the instruction flow mechanism of the PowerPC 603e microprocessor and
then describe dependencies and constraints that should be avoided to reduce stalls in the instruction
pipeline and maximize performance.
By closely examining the instruction flow mechanism of the 603e, a software developer will not only be
able to optimize code for the 603e, but will also be able to understand some of the general principles
behind superscalar microprocessors that can impact performance.
Page 3
RISC Microprocessor Division
Common Superscalar Characteristics
• Multiple instruction dispatch
– ability to fetch and dispatch more than one instruction at a time
• Multiple functional units
– ability to execute in parallel more than one instruction
– out of order execution
• Multiple instruction retirement
– Multiple instruction completion of the instructions
• Multiple read/write ports to register file set
• Mechanisms to avoid false dependencies
Page 4
RISC Microprocessor Division
There are many characteristics found in common among contemporary superscalar processors. One
characteristic is the ability to execute multiple instructions in parallel. To enable this, superscalar
processors contain multiple functional units, and they allow the fetching, issuing (dispatching), and
retiring of multiple instructions in one clock cycle.
Superscalar processors typically contain a register set with multiple read/write ports to allow multiple
instructions in execution to access data simultaneously. Most superscalar processors also have
separate floating-point and integer register sets.
The key focus of superscalar processors is to increase overall instruction throughput by keeping the
instruction pipeline free of stalls. Mechanisms exist to avoid false dependencies between instructions;
instructions should be allowed to dispatch and execute until they are forced to stall due to change in
instruction flow, lack of resources, or true data dependencies.
To allow the free flow of execution, most superscalar processors allow out-of-order execution of
instructions. However, a mechanism must exist for bringing the instructions back in program order
when they complete executing.
The primary reasons the instruction flow can stall within a superscalar processor are changes in
instruction flow, resource constraints, and data dependencies. By understanding the flow mechanism,
and being aware of the situations that can cause stalls, one can write code that avoids these situations
and thereby executes faster.
Page 5
RISC Microprocessor Division
Block Diagram of the 603e
64-bit
Sequential
Fetcher
Instruction flow
64-bit
Signalling
CTR
CR
LR
Instruction
Queue
System
Register
Unit
Branch
Processing
Unit
64-bit
Instruction Dispatch
INSTRUCTION
UNIT
64-bit
Integer
Unit
GPR File
5 Renames
3
32-bit
buses
Load/
Store
Unit
64bit
FPR File
4 Renames 64bit
FP
Unit
Completion Unit
Page 6
RISC Microprocessor Division
Like most other superscalar processors, the 603e features pipelined execution flow, in which the
processing of an instruction is split into discrete stages. Each stage is able to handle a different
instruction, allowing multiple instructions to be in execution at once. For example, it may take three
cycles for a floating-point instruction to complete (three-cycle latency), but if there are no stalls in the
floating-point pipeline, then a series of floating-point instructions can have a throughput of one
instruction per cycle.
The 603e processor core consists of a fetcher, a dispatcher, and five execution units: an integer unit
(IU), a floating-point unit (FPU), a branch processing unit (BPU), a load/store unit (LSU), and a system
register unit (SRU). An instruction queue (IQ) holds up to six instructions that are fetched in and
waiting for dispatch. A completion queue (CQ) holds up to five instructions that have dispatched and
are waiting to be finished and retired.
Page 7
RISC Microprocessor Division
Registers and Execution Units
• Execution Units
–
–
–
–
–
Integer Unit
System Register Unit
Floating-Point Unit
Branch Unit
Load/Store Unit
• Registers
– Integer Registers
– Floating Point Registers
• Execution Unit/Register Interaction
– Rename Registers
– Data Forwarding
Page 8
RISC Microprocessor Division
Execution units
• The Integer Unit accepts all integer instructions.
• The System Register Unit accepts all synchronizing, condition register, and system register
instructions. Since these instructions appear infrequently, the SRU also accepts basic add and
compare instructions.
• The Floating-Point Unit accepts all instructions utilizing the FP registers (other than loads and
stores).
• The Branch Processing Unit redirects instruction fetches, performs prediction and helps control
speculative execution, and folds appropriate branches out of the pipeline to permit an effective branch
cycle time of zero.
• The Load/Store Unit accepts instructions accessing data cache and memory.
Registers
The 603e supports 32 general-purpose registers (GPRs) that are 32 bits wide, and 32 floating-point
registers (FPRs) that are 64 bits wide. These two register files are supported by rename registers
which allow quick forwarding of data, in order to reduce stalls based on data dependencies. There are
five GPR rename registers and four FPR rename registers.
As an example of usage, suppose we have an integer divide instruction which is computing the value of
a given GPR, followed by a store instruction which must store this value to memory. When the divide is
dispatched to the IU, the GPR is assigned a GPR rename register. The store is then dispatched to the
LSU and must wait for the value to become valid. When the value is computed, the store immediately
gets the value through the GPR rename bus, and can begin storing the value at the same time it is
being written back to the GPR file.
Page 9
RISC Microprocessor Division
Instruction Pipeline
Fetch
Dispatch
In
Program
Order
Execute
Out-of-order
Finish
Retire
Page 10
RISC Microprocessor Division
The fetcher fetches up to two instructions per clock into the IQ, where they appear in the lowest
available slots. The dispatcher then dispatches up to two instructions from the IQ to the execution
units (excluding the BPU, which scans instructions as they are brought in by the fetcher). The
dispatcher also performs source and destination dependency checking and determines dispatch
serializations. Each instruction dispatched has an entry created for it in the completion queue.
When an execution unit has finished processing an instruction, it signals the completion unit, and the
instruction’s entry in the completion queue is marked “finished.” Up to two finished instructions per
clock may be retired (removed) from the completion queue. When an instruction is retired, the
architectural registers are updated.
Note that fetching, dispatching, and retiring of instructions is done in program order, but executing and
finishing can be done out-of-order and in parallel.
In a pipelined architecture, anything that prevents an instruction from moving from one stage of a
pipeline to the next is known as a stall. Resource checks must be performed to see if stalls will occur.
The rest of this paper discusses how and where stalls can occur in the instruction pipeline.
Page 11
RISC Microprocessor Division
Dispatch & Completion
64-bit
Sequential
Fetcher
Instruction flow
64-bit
Signalling
CTR
CR
LR
Instruction
Queue
System
Register
Unit
Branch
Processing
Unit
64-bit
Instruction Dispatch
INSTRUCTION
UNIT
64-bit
Integer
Unit
GPR File
5 Renames
3
32-bit
buses
Load/
Store
Unit
64bit
FPR File
4 Renames 64bit
FP
Unit
Completion Unit
Page 12
RISC Microprocessor Division
The dispatcher and completion unit control the execution of instructions. Interactions between the
dispatcher and completion unit and the various execution units can reduce potential stalls in the
instruction pipeline. In the next few slides, we discuss these interactions.
The dispatcher is capable of buffering up to 6 instructions (in the instruction queue). However,
instructions must dispatch in-order out of the dispatcher and only from the bottom two slots (an
exception to this rule occurs with branch folding, discussed later). If the instruction in the bottom slot is
not capable of dispatching, then the instruction in the second slot cannot, either. It is the job of the
dispatcher to determine whether or not an execution unit is capable of accepting an instruction. The
dispatcher will stall instruction dispatch when the instructions awaiting dispatch requires an execution
unit that is unavailable, or will stall the second instruction if both instructions awaiting dispatch need the
same execution resource.
The completion unit is capable of buffering up to 5 instructions (in the completion queue). The
completion unit records the proper order of dispatch to enforce in-order completion. While instructions
are being tracked, the completion unit also keeps a record of exceptions generated, speculation, out-oforder finishing, etc. All instructions except folded branches must be tracked in the completion unit. The
completion unit assigns rename registers (up to 5 integer and 4 floating point) to the instructions as
they dispatch. The completion unit will stall the dispatcher if no appropriate rename register resources
are available. Additionally, if there are no slots available in the completion queue, the completion unit
will order the dispatcher to stall the dispatching of instructions.
Page 13
RISC Microprocessor Division
Branching
Opcode
Mnemonic
Addressing
Range
Branch Always
b
ba
Relative
Absolute
+/- 32MB
0 +/- 32MB
Branch Conditional
bc (condition field(s))
bca (condition field(s))
Relative
Absolute
+/- 32KB
0 +/- 32KB
Branch Conditional
to Count Register
bcctr (condition field(s))
count reg.
4 GB
Branch Conditional
to Link Register
bclr (condition field(s))
link reg.
4 GB
Page 14
RISC Microprocessor Division
The PowerPC Architecture instruction set includes of two types of branches, unconditional (branch
always) and conditional. Conditional branches can depend on the contents of the condition register
(CR), which can be set by compare and arithmetic instructions; on the contents of the count register
(CTR), which is typically used when executing looping instructions; or on both the CR and the CTR.
Branch instructions can specify an absolute or relative target address, or they can branch to the link
register (LR) or CTR. The LR is typically used for subroutine calls, and the CTR (if specified as a
destination address) is typically used for absolute jumps.
Page 15
RISC Microprocessor Division
Branch Processing Unit
Instruction Cache
Fetcher
5
branch folding
4
Branch
Processing
Unit
3
2
1
0
Dispatcher
Instruction
Queue
Execution
Units
Completion
Queue
Page 16
RISC Microprocessor Division
When the fetcher fetches instructions into the instruction queue (IQ), it also forwards them to the
branch processing unit (BPU), which scans these instructions for branches. The BPU immediately
begins address calculation for branches found and attempts to fold certain branches out of the
instruction queue (discussed later).
Because branch instructions can change the instruction flow, they can potentially cause stalls in the
instruction pipeline when new instructions must be fetched from the target address. The 603e includes
two mechanisms for reducing the impact of branch instructions: branch folding and branch prediction.
Page 17
RISC Microprocessor Division
Branch Folding
Instruction
Queue
Completion
Queue
5
5
5
5
4
4
4
4
3
3
2
3
D
3
2
C
2
F
2
1
B
1
E
1
B
1
D
0
A
0
D
0
A
0
C
4
4
4
4
3
3
3
3
2
2
2
2
1
1
B
1
1
B
0
0
A
0
0
A
Page 18
RISC Microprocessor Division
The branch processing unit (BPU) can fold certain branches out of the instruction queue. They are
removed from the IQ before being dispatched, allowing the dispatcher to handle other instructions, and
freeing space in the instruction queue and completion queue for other instructions. Frequently,
instruction flow can continue as if the branch had not occurred.
The BPU can fold all unconditional branches, as well as conditional branches that do not involve the
CTR or LR. Conditional branches that do involve these registers cannot be folded because the CTR
and LR have corresponding rename registers which can only be tracked if branches using them get
recorded in the completion queue by being dispatched.
Consider the left two columns of diagrams. We start with four instructions in the instruction queue.
Instruction C is a branch. In the second column, we see that instructions A and B have been
dispatched and have entries in the completion queue, and that instruction C has been folded out by the
BPU. Instructions E and F have also been fetched in.
Because superscalar processors feature multiple units that are attempting to flow instructions through
their pipelines as quickly as possible, race conditions between various resources can occasionally
arise. One race condition occurs in the instruction queue: if the dispatcher can tag a branch for
dispatch before the BPU can fold it out of the instruction queue, then the branch will not be folded; it will
be dispatched and an entry created for it in the completion queue. This situation typically occurs if the
IQ is empty or near-empty and the foldable branch is fetched directly into one of the bottom two slots
(i.e. the slots from which instructions are dispatched). However, the performance impact of this race
condition is negligible.
The right two columns illustrate the branch race condition. Instructions A and B have just been fetched
into the instruction queue, with A being a branch. In this case, the dispatcher grabs A before it can be
folded, and we see it in the completion queue in the next cycle.
Page 19
RISC Microprocessor Division
Branch Code Stall Example
FOR...NEXT w/ bdnz
li
mtspr
FOR...NEXT w/ subi./bgt
r13,COUNT
CTR,r13
LOOP:
li
LOOP:
;Do some
;useful work
bdnz
LOOP
Page 20
r13,COUNT
subi.
r13,0x0001
;Do some
;useful work
bgt
LOOP
RISC Microprocessor Division
In this slide, we depict a potential stall that can occur with branches. The code fragments demonstrate
how, in some cases, one can use branches that are foldable to attain better performance than using
non-foldable branches.
The two loops repeat for COUNT iterations. The first code fragment initializes the CTR and uses only
one instruction to control the looping, bdnz. (bdnz is a simplified mnemonic for a conditional branch
which decrements the CTR and branches if CTR is not zero.) This branch cannot be folded and must
be dispatched. Since branches that dispatch are required to retire from the last stage of the completion
unit, any loop involving a branch that dispatches may need an extra clock (in addition to the loop body
time) to complete execution.
It is possible to avoid the additional latency by using a foldable branch instead of the bdnz. The bgt
and the subi. instructions in the second code fragment can be used to obtain the same functionality
as the bdnz. The subi. instruction is a single cycle instruction that can retire paired with almost any
other instruction; thus in most loops, subi. adds no time to the execution of that loop. The bgt is also
capable of being folded out of the pipeline and not dispatching at all. Therefore, code that uses the
subi./bgt combination will likely be a clock faster each time through the loop then bdnz. However,
the exact timing difference, if any, would depend on the actual composition of the loop body.
Page 21
RISC Microprocessor Division
Branch Prediction and
Speculative Execution
Instruction
Queue
Completion
Queue
5
5
5
5
4
4
4
4
3
D
3
F
3
H
3
2
C
2
E
2
G
2
1
B
1
D
1
F
1
0
A
0
B
0
E
0
4
4
4
4
3
3
3
3
2
2
2
D
2
1
1
1
B
1
B
0
0
0
A
0
A
A
Page 22
RISC Microprocessor Division
Each conditional branch instruction includes a prediction bit, which is set by the compiler or an
assembly language programmer. This bit helps specify whether the branch is predicted to be taken or
not taken. This is known as static prediction because the prediction behavior is encoded in the
instruction. While the branch condition is waiting to be resolved, execution continues down the
predicted path, and these subsequent instructions are marked as speculative instructions.
(Speculative instructions are not allowed to change the programming model, such as update register
files or memory, and may stall until the branch is resolved and they become non-speculative.)
When the branch condition is resolved, if the prediction was correct, then the speculative instructions
are marked non-speculative, and no penalty is assessed. If the prediction was incorrect, then the
speculative instructions are flushed (removed from the instruction pipeline) and execution resumes
along the correct execution path.
The 603e has one level of prediction, meaning that a conditional branch encountered along a
speculative path cannot itself be executed speculatively. Instead, it will stall in the pipeline until the
previous branch is resolved.
In the leftmost diagram, we have instructions A, B, C, and D in the instruction queue. Instruction C is a
branch. In the next diagram (next cycle), instruction A was dispatched and C folded out by the BPU.
However, assume that branch C cannot be resolved (perhaps it is dependent on the results of
instruction A). All subsequent instructions are then marked speculative: D and the newly fetched
instructions E and F.
In the next diagram, we see that B and D were dispatched to the CQ and G and H fetched into the IQ.
In our example, branch C is now resolved and it turns out the branch was mispredicted. In the final
diagram, the speculative instructions are flushed, and the fetcher is ready to fetch instructions from the
correct input stream. If branch C had been correctly predicted, the speculative instructions would
simply be marked non-speculative and no stall would occur.
Page 23
RISC Microprocessor Division
Performance Impact of
Branch Prediction
• Speculative execution allows instruction flow to
proceed before branch conditionals have been
resolved
• Correct predictions incur no performance penalty
• Incorrect predictions only incur significant performance
penalties when mispredicted paths result in instruction
cache misses
• Incorrect predictions may be avoided by separating
the instruction that sets a branch condition from the
branch that uses it
Page 24
RISC Microprocessor Division
Speculative execution allows the fetcher to fetch instructions without stalling while the branch is being
resolved. Prediction does not cause any pipeline stalls unless the prediction is deemed to be incorrect.
If the prediction is incorrect, it is the function of the BPU to perform the necessary tasks to recover from
speculation.
Branch prediction of the type used by the 603e is correct approximately 86% of the time. Due to the
603e’s ability to invert the normal prediction mechanism, a smart programmer or compiler can attain
greater prediction accuracy.
Mispredicted branches, which occur infrequently even using only the default speculation mechanism,
only incur significant performance penalties when speculative branches also result in cache misses on
the mispredicted path.
Since incorrect predictions can potentially cause many stalls, it is possible to improve performance by
avoiding prediction in some code fragments. By separating the instruction that is setting the branch
condition from the branch that uses it, it is possible to prevent the processor from executing
speculatively altogether.
In the 603e, we can calculate the approximate separation distance by using worst case analysis for a
conditional branch dependent on the CR register. Assuming that the processor dispatches 3
instructions per clock (2 instructions and a unconditional branch or nop), and assuming a worst case
conditional register update time of 3 clocks, we calculate that by separating the branch condition from
the condition register update instruction by 9 instructions, we will avoid speculative execution. For most
code fragments, the 603e can dispatch instructions at a peak rate of 2 instructions. Additionally, most
instructions (such as the COMPARE instruction) take only 1 clock to update the Condition Register.
Under these conditions, one can prevent speculative execution by separating the branch condition from
the condition register update instruction by only 3 instructions.
Page 25
RISC Microprocessor Division
Integer Unit and
System Register Unit
• Integer Unit
– No stalls caused by single-cycle instructions
– Multi-cycle instructions keep the integer unit busy
– Possible stalls due to dependencies minimized by allowing
access to operands as soon as the source data is valid
• System Register Unit
– Handles access to system registers
– Assists the Integer unit by handling some integer unit operations
Page 26
RISC Microprocessor Division
Integer Unit
Most integer instructions only take one cycle to execute; thus the integer unit does not usually stall.
The only times that the integer unit stalls is if it is executing multiple-clock integer instructions such as
trap, multiply, and divide, or if the instruction cannot execute because it is dependent on the results of
another operation. The internal bus structure of the 603e allows an integer instruction to immediately
access any operand as soon as it becomes valid.
System Register Unit
The SRU handles all of the special purpose register instructions, context synchronizing instructions,
and certain integer add/compare operations. Some special purpose register instructions are also
inherently context synchronizing. Context synchronization will always cause some instruction stall, but
this is almost always critical to guarantee correct operation. Integer operations in the SRU take only
one cycle to execute, thereby causing no stalls.
Page 27
RISC Microprocessor Division
Floating Point Unit
Fetcher/Dispatch Unit
IQ5
IQ4
IQ3
IQ2
IQ1
IQ0
32 64 bit Floating
Point Register File
Multiplier
Adder
Exceptions Monitor
Normalizer
Completion Unit
CQ4
CQ3
CQ2
CQ1
CQ0
4 64 bit Floating Point
Rename Registers
Page 28
RISC Microprocessor Division
The Floating-Point Unit consists of four stages: Multiplier, Adder, Normalizer, and Exception, which are
organized conceptually as shown. The Multiplier stage is a single precision multiplier that every FP
instruction must pass through. No instruction can enter the FP unit if the Multiplier is occupied. Double
precision operations will cause the Multiplier to be occupied for two consecutive clocks. The Adder
always takes a single clock. Typically, instructions will flow through these stages without stalling unless
a stall in the Normalize or Exception stage blocks the instruction pipeline flow. The FP register file only
supports a single write-back port from the rename registers.
The Normalizer stage can cause delays of up to several clocks. The number of clocks that
normalization takes is data-dependent. When the normalizer stalls, it prevents instructions in the
Multiplier or Adder stages from stepping through.
To prevent potential speed path problems, an additional stage exists after the normalization stage. This
stage is simply a holding stage and floating-point instructions use up one clock cycle to pass through it.
Page 29
RISC Microprocessor Division
FPU Stall Conditions
Fetcher/Dispatch Unit
IQ5
IQ4
IQ3
IQ2
IQ1
IQ0
Disable
Dispatch
32 64 bit Floating
Point Register File
• Stall Conditions
– Normalization blocking
dispatch
– Late-release of FP rename
registers
– Enabling exceptions
Multiplier
V
Adder
V
V
Normalizer
4 64 bit Floating Point
Rename Registers
Page 30
RISC Microprocessor Division
As previously mentioned, the Normalizer stage can take multiple cycles, thereby stalling the flow of
instructions within the FPU. When FP instructions occupy the Normalizer, Multiplier, and Adder stages
at the same time, a signal will be sent to the dispatcher, halting dispatch of instructions to the floatingpoint unit. Even if normalization doesn’t stall the pipeline, the distance between the Normalizer and
dispatcher prevents the FPU from informing the dispatcher to resume dispatching until after it is too late
to dispatch an instruction on that clock. This causes a stall after every third consecutive single cycle
FP instruction.
The wait stage that exists after the normalization stage also contributes to potential stalls. The
additional wait stage causes FPR rename registers to be released one cycle after the FP operation is
complete. This causes a stall if a series of single-cycle FP instructions are executing in the FPU. After
every fourth single cycle FP instruction, a stall will occur due to lack of FPR rename registers.
These two stall scenarios cause a series of single-cycle FP instructions to dispatch in clocks 1, 2, 3, 5,
7, 8, 9, 11, 13, 14, 15, etc. The next slide depicts the stall scenarios described above.
Finally, if exception checking is enabled in the FPU, the instruction may have to wait in the Normalizer
while exceptions are checked. One can enhance performance by pre-qualifying data prior to running it
and polling for possible exceptions at the last reasonable instant.
Page 31
RISC Microprocessor Division
FPU Code Stalls
clock
Instruction Queue
Completion Queue
renames
D
0
B
A
D
1
D
C
A
A
B
A
A, B
B
A
A, B, C
B
D
2
H
3
F
E
D
C
G
F
E
D
C
F
D
4
I
H
G
F
E
C
D
B
A
A, B, C
F
I
5
H
G
F
D
E
C
B
D
6
J
I
H
G
F
A, B, C, D
F
E
D
C
B, C, D
E
D
C, D, E
D
7
J
I
H
G
F
F
D
8
K
J
I
H
F
G
D = Marked for dispatch
E
D
D, E, F
F = Marked “finished”
Page 32
RISC Microprocessor Division
Series of single-cycle FP instructions
Clock 0: First two instructions (A and B) are brought into the instruction queue (IQ). A is marked for
dispatch.
Clock 1: A is dispatched to the Multiplier stage of the FPU and is allocated FP rename register 0. B is
marked for dispatch. C and D are brought into the IQ.
Clock 2: A steps to the Adder stage in the FPU. B is dispatched to the Multiplier stage and is
allocated FP rename register 1. C is marked for dispatch. E is brought into the IQ. (Anything after E is
ignored for this discussion.)
Clock 3: A steps to the Normalizer stage in the FPU. B steps to the Adder stage. C is dispatched to
the Multiplier stage and is allocated FP rename register 2. At this point, a signal is sent to the
dispatcher indicating that no instruction may be dispatched to the FPU until a stage has been freed up.
This signal is negated as soon as the Normalizer stage is finished, but this will be too late to actually
permit an instruction to dispatch on the next clock. D, therefore, stalls in the IQ (is not marked for
dispatch).
Clock 4: A steps to a wait stage in the FPU. A signal has been sent to the completion unit indicating
that A is finished and, since it is the oldest instruction in the completion queue, it is permitted to retire.
B steps to the Normalizer stage. C steps to the Adder stage. D is given permission to dispatch on the
next clock.
Continued on next slide
Page 33
RISC Microprocessor Division
Clock 5: A is gone from the completion queue, but a delay on FP rename register deallocation
prevents FP rename register 0 from being re-allocated. B is finished and permitted to retire. C steps to
the Normalizer stage. D is dispatched to the FP Multiplier stage and is allocated FP rename register 3.
At this point, all four FP rename registers are in use, which means E cannot be marked for dispatch this
cycle. E stalls in the IQ.
Clock 6: FP rename register 0 is deallocated. B is gone, but its FP rename register deallocation is
delayed for one clock. C is finished and permitted to retire. D moves to the Adder stage. E is marked
for dispatch.
Clock 7: FP rename register 1 is deallocated. C steps the FP wait stage. D steps to the Normalizer
stage. E is dispatched to the Multiplier stage and is allocated FP rename register 0. At this point, the
pattern of stalls repeats.
Again, note the dispatch stall during clock 3. This is caused by all of M/A/N stages being in use.
Also note the dispatch stall during clock 5. This is caused by all of the rename registers being tied up
(a rename register must be deallocated for one clock before it can be reused).
Page 34
RISC Microprocessor Division
FPU and Completion Unit
loop:
lfsu
f22,4(r20)
; A
fmadd
f15,f16,f13,f28
; B
lfsu
f23,4(r21)
; C
fmadd
f18,f19,f14,f29
; D
lfsu
f13,4(r20)
; E
fmadd
f25,f24,f22,f30
; F
lfsu
f14,4(r21)
; G
bdnz
loop
; H
cycles
1
3
4
F
G
D
E
F
B
C
D
E
A
B
C
D
Completion
Queue
Page 35
2
RISC Microprocessor Division
Completion of floating-point unit instructions is a potential source of stalls. Due to the single write-back
port on the floating point register file, multiple instructions trying to write back floating-point results will
have to do so in a sequential manner. This will typically happen in matrix math where math operations
occur in parallel with loads that initialize registers for subsequent math operations.
The code segment above depicts such a scenario. Adjacent load and fmadd instructions have no
register dependencies nor do they require the same execution unit. Therefore, each pair can dispatch
together, execute in parallel, and even finish (update rename registers) in parallel. However, due to the
single write back port, this code has an effective throughput of only a single instruction per clock. If this
code is part of a larger code segment that includes integer instructions, then it is possible to achieve a
greater instruction throughput by intermixing integer instructions (from elsewhere in the code sequence)
with these floating-point instructions. This will allow the integer execution and write-back to overlap
with the floating-point write-back, thereby improving the overall instruction throughput on the entire
code segment.
Page 36
RISC Microprocessor Division
Load/Store Hierarchy
Dispatcher
Reservation Station
LSU EIB
STORES
EA Calculation
Store Queue 1
LSU Store
Queue
LOADS
Store Queue 0
Load Miss
LSU
DC
Store Miss
BIU
Data Load
Store 1
Store 0
BIU Queues
Bus
Page 37
RISC Microprocessor Division
The load/store hierarchy within the PowerPC chip consists of the load/store unit (LSU), data cache
(DC), and the bus interface unit (BIU). The LSU stages consist of a two-element EIB, to receive
dispatched instructions and calculate effective addresses, and a two-element store queue, to hold
stores waiting for the data cache. The data cache stages consist of slots for a load miss and a store
miss. Only one miss can be handled at a time. The BIU stages consist of a number of one-element
queues, such as the data load and store queues. Each queue can hold a separate instruction waiting
for access to memory.
Instructions are first dispatched from the instruction queue (IQ) to the LSU EIB, which has two slots:
the “reservation station” slot (LSU RS) and an “effective address calculation” slot (LSU EA). An
instruction is held in the LSU EA slot until its address operand is available.
Normally if the LSU is available for dispatch (see below), then the instruction is dispatched directly to
the LSU EA slot, if both slots are empty. If the LSU EA slot is occupied, then the instruction is
dispatched to the LSU RS slot.
Once the instruction’s effective address has been calculated, its progress through the pipeline depends
on whether it is a load or a store. A load would then access the data cache (DC), as described later.
The load’s entry in the completion queue (CQ) is marked “finished” when the data for the load returns.
A store would pass to the first LSU store queue slot, and its entry in the CQ would be marked
“finished.” Thus, a store can be considered finished and even retired from the completion queue long
before its data is actually written to cache or to memory. On the next clock cycle, the store passes to
the second LSU store queue slot and, on the subsequent clock, it is free to access the data cache.
Note that because a store must traverse two additional slots than a load before accessing the data
cache, a load instruction may bypass preceding stores within the LSU. Also, if both a load (in the LSU
EA slot) and a store (in the second LSU store queue slot) are free to access the data cache, then the
load will take precedence.
Page 38
RISC Microprocessor Division
Data Cache Miss Stall
LSU RS
LSU EA
LSU
SQ
DC load
miss
DC store
miss
LSU RS
LSU EA
C
A
Figure 1
DC load
miss
LSU
SQ
DC store
miss
B
LSU RS
LSU EA
C
D
B
A stw
B lwz
C stw
r3, 0(r4)
r5, 0(r6)
r7, 0(r8)
Figure 2 - Store Stalls
DC load
miss
A
B
C
D
A
stw
stw
stw
lwz
r3,
r5,
r7,
r9,
LSU
SQ
DC store
miss
0(r4)
0(r6)
0(r8)
0(r10)
Figure 3 - Load Stall
Page 39
RISC Microprocessor Division
Although superscalar architectures feature multiple execution paths, resource limitations can stall full
utilization of these paths. Data cache misses are the primary cause of stalls in the LSU. The example
above demonstrates how stalls can occur because the 603e data cache can only handle one miss at a
time.
When a load or store misses in the data cache, the data cache asserts a busy signal that stalls
subsequent instructions in the LSU, as shown in Figure 1. While the data cache is busy, no other
instructions can access the data cache, and instructions are blocked from leaving the LSU EA stage.
This prevents a store from propagating from the LSU EA stage to the LSU store queue (LSU SQ), even
if the store queue is available.
For a load miss access, the data cache is busy until the data comes back from the BIU. For a store
miss access, the data cache is busy until the store is able to propagate to the BIU.
Figure 2 demonstrates store stalls. While load B is waiting for its data to come back, store A may not
access the data cache, and store C may not propagate to the LSU store queue. Note that load B
bypassed store A in the LSU.
Figure 3 demonstrates a load stall. Load D may not access the data cache until store A propagates to
the BIU. When it does, the data cache is no longer busy, and load D will bypass stores B and C.
Page 40
RISC Microprocessor Division
Address Alias Stall
Figure 5
LSU RS
LSU EA
LSU RS
C
B
LSU
SQ
A
DC load
miss
B
LSU EA
DC store
miss
DC load
miss
Figure 1
A
LSU EA
B
A
B
A
LSU RS
B
A
DC load
miss
LSU
SQ
LSU RS
LSU EA
LSU EA
DC store
miss
LSU RS
Figure 3
Figure 2
LSU
SQ
LSU
SQ
DC load
miss
B
LSU
SQ
DC store
miss
DC load
miss
DC store
miss
Figure 4
DC store
miss
Page 41
RISC Microprocessor Division
To understand the flow within a superscalar architecture, one cannot ignore instruction-specific details.
For example, consider Figure 1, in which load C would ordinarily bypass stores A and B. However, if
the data address of C can potentially collide (alias) with the data address of A or B, then C will stall in
the LSU EA slot until the aliasing store passes out of the LSU store queue.
Address translation may occur after alias checking. Since only the lower 12 bits remain constant
through translation, these are the only bits that can be checked. In addition, the addresses are
checked with word granularity (four bytes, mask = 0xffc) if the sizes of both load and store are less than
or equal to four bytes, or with double-word granularity (eight bytes, mask = 0xff8) otherwise. For
instance, 0x2000 and 0x3003 would alias to each other, but 0x2000 and 0x2020 would not.
Note that it is possible to have an alias stall even if the load and store do not actually access the same
location, because only the lower 12 bits of the address can be compared.
In a superscalar architecture, other stalls may occur due to timing considerations. For example, if a
load which aliases a store has spent only one cycle in the LSU EA stage, then the LSU circuitry is not
fast enough to prevent the load from bypassing the store in accessing the data cache. Since this
aliased load should not access the cache before the store, the LSU must cancel the load in the
subsequent cycle. Figures 2-5 depict this situation.
In Figure 2, load B and store A have aliasing addresses. If B has been in the LSU EA stage for more
than one cycle (due to some other stall), then there is time to prevent it from accessing the data cache,
and the next cycle A will access the data cache. However, if B has only been in the LSU EA for one
cycle, the alias check comes too late to prevent the cache access shown in Figure 3. A is stalled and
cannot access the cache.
In the next cycle (Figure 4), the load is canceled, and in Figure 5 the store propagates to the data
cache. Note that in this example, the store also misses in the cache and blocks the load from
accessing the data cache the next cycle.
Page 42
RISC Microprocessor Division
Completion Queue Stall
LSU RS
LSU RS
LSU EA
A
DC load
miss
LSU
SQ
DC store
miss
LSU EA
DC load
miss
LSU
SQ
DC store
miss
A
Figure 1
Figure 2
Page 43
RISC Microprocessor Division
Superscalar architectures frequently signal between different parts of the architecture, in order to
coordinate various aspects of the units. Diagrams may not always show all the signals that are shared
between units with the system. These interactions can also cause stalls. We discuss a case in which
the state of the completion queue can affect instruction flow in the LSU.
Since the 603e allows out-of-order execution, instructions will frequently dispatch to the LSU (as well as
other execution units) before previous instructions have finished executing. If one of these previous
instructions generates an exception, then all subsequent instructions (including the LSU instruction)
must be canceled from the instruction flow (flushed). Various parts of the processor, including the
LSU, must be careful to stall instructions that could be canceled before they permanently change the
processor state.
On the 603e, if a load or store’s entry in the completion queue is not in the bottom slot, then there are
preceding instructions that could potentially generate exceptions which may cancel the load or store.
The instruction must be stalled before it reaches a state that cannot be canceled.
Figures 1-2 depict this situation, in which instruction A is stalled because its entry in the completion
queue is not in the bottom slot. In Figure 1, store A is stalled in the second slot of the LSU store
queue, since writing to the data cache would incur too much of a penalty to undo.
In Figure 2, load A is stalled in the data cache miss slot if it is accessing guarded memory. Guarded
memory is typically used to prevent out-of-order loads to I/O devices, which may produce undesired
results otherwise. Note that even if load A were at the bottom of the completion queue, the 603e would
stall the load for one cycle before making its request to the BIU.
Page 44
RISC Microprocessor Division
LSU EA Stall
Figure 1
LSU RS
B
LSU EA
A
DC load
miss
Figure 3
LSU
SQ
DC store
miss
LSU RS
C
LSU EA
B
DC load
miss
A
LSU RS
C
LSU EA
B
DC load
miss
LSU
SQ
DC store
miss
Figure 2
LSU
SQ
DC store
miss
LSU RS
C
LSU EA
B
DC load
miss
B
LSU
SQ
DC store
miss
Figure 4
Page 45
RISC Microprocessor Division
The fast timing requirements of superscalar processors sometimes lead to unusual types of stalls.
If a load has spent only one cycle in the LSU EA slot before accessing the data cache, then it is
removed from the LSU after this access (assuming that the access is not canceled). However, if a load
spends more than one cycle in the LSU EA slot, then it will appear to remain in this slot (blocking
subsequent LSU instructions) even after the load has accessed the data cache. This block will remain
until the data becomes available (and the load is marked “finished” in the completion queue).
In Figure 1, A flows into the LSU EA slot and flows out in Figure 2. This allows C to be dispatched to
the LSU. However, because B is stalled in the LSU EA slot, in Figure 3 when B accesses the data
cache it keeps its entry in the LSU EA slot in Figure 4. This stalls C and stalls dispatch of any
subsequent load/store instruction until the data for B returns from the BIU.
Page 46
RISC Microprocessor Division
Misaligned Address Stall
LSU RS
LSU EA
LSU RS
A
A1
DC load
miss
LSU
SQ
DC store
miss
LSU RS
LSU EA
DC load
miss
LSU EA
A
DC load
miss
A1
LSU
SQ
DC store
miss
LSU RS
A2
LSU
SQ
DC store
miss
Figure 1 - Misaligned Store
LSU EA
A
DC load
miss
A2
LSU
SQ
DC store
miss
Figure 2 - Misaligned Load
Page 47
RISC Microprocessor Division
Accessing misaligned addresses can result in significant performance penalties in most RISC
superscalar microarchitectures.
On the 603e, a data address is aligned if it falls on a multiple of the access size. Thus a word (4-byte)
access is aligned at 0x0000 and 0x0004 but not at 0x0002; a doubleword (8-byte) access is aligned at
0x0000 and 0x0008 but not at 0x0004.
If the data address of a load or store in the LSU EA slot is not aligned, then it is split into two aligned
accesses. Figure 1 shows a misaligned store. A is first split into the aligned store A1, then on the next
clock it is split into the aligned store A2 and the LSU EA entry removed. In Figure 2 we have a
misaligned load, in which A is split into aligned loads A1 and A2. Note that because the load stayed in
the LSU EA slot for more than one cycle, it remains in this slot until its data comes back (see LSU EA
stall above).
Page 48
RISC Microprocessor Division
Instruction Interactions
Rename Register Stall
H
G
H
F
F
G
D
E
E
F
B
C
D
D
E
A
B
C
C
D
Instruction
Queue
Completion
Queue
B
Figure 1
C
A
A
B
B
Figure 2
Figure 3
Figure 4
Figure 5
Page 49
RISC Microprocessor Division
As with the other execution units, there may also be stalls due to contention for the rename registers.
Figures 1-5 show the interaction between the IQ and CQ for a series of lwzu’s which are fetched into
the instruction queue two at a time. Each lwzu uses two general purpose register (GPR) rename
registers, one for the address operand and one for the data operand. The 603e has five GPR rename
registers available (and four FPR rename registers).
In Figure 3, instruction C cannot dispatch because A and B have already taken four GPR rename
registers and there is only one available.
Later, when A retires and releases its rename registers (Figure 4), C has the resources it needs to
dispatch. It dispatches the following cycle (Figure 5).
Page 50
RISC Microprocessor Division
Instruction Interactions
Dependency Stall
ORIGINAL SEQUENCE
A lwzx
B add
C lis
D stwu
E ori
F cmpw
REORDERED SEQUENCE
r13,r14,r15
r26,r27,r13
r20,0xDEAD
r16,4(r17)
r20,r20,0xBEEF
r26,r20
A lwzx
B lis
C stwu
D ori
E add
F cmpw
clock cycle
1
2
3
stw
ori
lis
stw
cmp
add
add
lis
ori
lwz
lwz
add
stw
Completion
Queue
4
r13,r14,r15
r20,0xDEAD
r16,4(r17)
r20,r20,0xBEEF
r26,r27,r13
r26,r20
clock cycle
1
5
Page 51
3
4
5
ori
Completion
Queue
cmp
2
stw
add
lis
lis
ori
cmp
lwz
lwz
stw
add
RISC Microprocessor Division
The mix of instructions in an instruction sequence can result in a variety of stalls. Dependency stalls
are the most common. A dependency occurs if one instruction uses as its source data the results from
another instruction. Such a dependency will cause a stall if the two instructions are placed right next to
each other. The 603e reduces the impact of most of these situations through use of the rename
registers and forwarding of results. However, in some situations, stalls can happen as follows.
Two orderings of a code sequences are shown. In both sequences, the add instruction uses as its
source the results of the lwzx load instruction. In the original code, the add occurs right after the
lwzx. In the reordered sequence, the add is separated from the lwzx by moving it down three
instructions.
Analysis of original code sequence:
Assuming the lwzx hits in the data cache, its data will return in 2 clocks. Although both the add and
the lwzx can be dispatched to the completion queue in the same clock, the add cannot begin
execution until the data from the lwzx returns. Therefore it cannot retire with the lwzx and is stalled
by one clock. The lis dispatches to the SRU, executes, and is ready to retire with the add in cycle 3.
In cycle 4, the stwu and ori can also retire together. Then in cycle 5, the cmpw retires alone. Total
time: 5 clocks.
Analysis of reordered code sequence:
The lwzx (cache hit) takes 2 clocks. Since the lis is not dependent on the lwzx, it can retire with the
lwzx in clock 2. The stwu and ori can also retire together on the next clock (clock 3). Finally, in
clock 4, the add and cmpw retire together. Total time: 4 clocks.
Thus by separating the generation of a result from the subsequent use of that result, we were able to
prevent a stall. It is normally a good practice to provide this separation; however in some cases the
benefit gained in one place is lost in another place.
Page 52
RISC Microprocessor Division
Summary
• Scheduling code around superscalar microprocessor
resource constraints reduces code stall conditions
• Code stalls can occur in instruction issue/completion
control logic
– Availability of instruction and completion buffers
– Availability of rename registers
– Number of register file write ports
• Code stalls can occur within execution units
– Aliasing between loads and stores
– Misaligned accesses
• Code stalls can occur due to instruction mixes
– Dependencies between instructions
Page 53
RISC Microprocessor Division
By being able to process multiple instructions at the same time, superscalar microprocessors like the
603e enable systems to attain extremely high levels of performance. However, there are many aspects
of a superscalar architecture that can cause code stalls in the instruction flow. By being aware of
constraints that cause code stall conditions, one can generate code that can will execute with minimal
latency in a superscalar processor. This paper has discussed the aspects of PowerPC 603e that can
cause stalls. Although this paper is specific to the 603e, the lessons learned can be applied to most
contemporary superscalar microprocessors.
Page 54
RISC Microprocessor Division