Presentation

CAPSULE: Parallel Execution of
Component-Based Programs
Pierre Palatin, Yves Lhuillier, Olivier Temam
INRIA, France
Software & Hardware Trends Blend Well
[Figure: Component Program + Multi-Core]
Software components:
Separate/Encapsulated software parts
Explicit communication links
Component Programming Is Not Enough
[Figure: Component Program + Multi-Core]
BUT component programming ≠ finding parallelism
Augment components with the ability to divide
BUT locks still managed by the user
Granularity & mapping too complex to be exposed to the user
→ architecture or run-time system support
User: where division may occur
Architecture/Run-Time System: when division does occur
Low-Level C Implementation

/* Normal loop ("do loop" component) */
idx = min;
while (idx < max) {
    c[idx] = a[idx] + b[idx];
    idx++;
    /* Probing: returns a context only if resources are available,
       otherwise NULL (no resources, keep running sequentially) */
    ctxt = capsys_probe(&loop);
    if (ctxt) {
        /* Division!  Split the remaining iteration bounds */
        newmin = (idx + max) / 2;
        newmax = max;
        max = newmin - 1;
        … // Prepare "arg"
        capsys_divide(ctxt, arg);
    }
}

Probing & division must be very efficient
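For orientation, here is a minimal sketch of the child side of a division, assuming "loop" names the component's entry function and "arg" carries the split bounds; only capsys_probe and capsys_divide come from the slide, everything else is an illustrative assumption.

/* Hypothetical argument block handed to the new context via capsys_divide();
   the struct and its fields are illustrative assumptions. */
struct loop_arg {
    int min, max;       /* iteration bounds handed to the child half */
    int *a, *b, *c;     /* shared arrays */
};

/* Hypothetical entry point of the divided "loop" component: the child runs
   the same probe-and-divide pattern over its half, so division can recurse
   whenever resources free up again. */
void loop(void *p) {
    struct loop_arg *arg = (struct loop_arg *)p;
    int idx = arg->min;
    while (idx < arg->max) {
        arg->c[idx] = arg->a[idx] + arg->b[idx];
        idx++;
        /* ...same capsys_probe()/capsys_divide() pattern as above... */
    }
}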
With Hardware Support
Target architecture: SMT (8 threads)
1 component ~ 1 thread
Dynamic thread creation (nthr)
Division steering
[Figure: SMT pipeline extended for CAPSULE; a SOMT fetch unit (PC + state, branch prediction, instruction cache) feeds the decode/rename, register access, issue-queue, and execute stages (integer/FP/load-store units, integer and FP registers, data cache); logic for thread creation and eviction, mapping tables, and hardware contexts sit alongside, with load/store latency information and stalling & freeing signals fed back to the SOMT fetch unit]
Locks table:

Stack Address | Locking thread | Waiting thread
0xff05        | 0              | null
0xcd34        | 3              | 4
0x0000        | 4              | null
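Purely as an illustration of the structure above, one locks-table entry could be modeled as follows; the struct and field names are assumptions, not the hardware's actual format.

/* Illustrative sketch of one locks-table entry, mirroring the columns above;
   field names and widths are assumptions. */
struct lock_entry {
    unsigned int stack_addr;   /* address identifying the lock, e.g. 0xcd34 */
    int locking_thread;        /* hardware context currently holding the lock */
    int waiting_thread;        /* context blocked on it, or -1 if none (null) */
};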
P. Palatin, Y. Lhuillier, O. Temam: CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs, MICRO 2006
Software-Only Version
[Figure: software stack; the CAPSULE parts (C++ layer over the C "core" library) run on Linux on x86 and ARM11 MPCore, with hardware support replaced by software/libraries]
Low-level C layer for speed and portability; back-end for other parallel frameworks (OpenMP, STAPL, ...)
Higher-level library in development
Extends the threading layer (pthread for now)
Shared-memory CMPs (distributed-memory CMPs in the works)
Probing ~ 10 cycles, division ~ 1400 cycles on an Intel Core Duo
https://alchemy.futurs.inria.fr/capsule
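As a rough illustration of why probing can stay so much cheaper than division, here is a sketch of a software-only probe over a pool of worker contexts; capsule_probe_sw, the free-context counter, and the atomic scheme are assumptions for illustration, not the actual library internals.

#include <stdatomic.h>

/* Illustrative only: number of hardware contexts (cores/threads)
   not currently running a CAPSULE task. */
static atomic_int free_contexts;

/* A cheap probe: succeed (and reserve a context) only if a worker is idle;
   otherwise return 0 immediately so the caller keeps executing sequentially.
   The probe is a handful of instructions, while an actual division
   (thread hand-off) involves far more work. */
static inline int capsule_probe_sw(void) {
    int avail = atomic_load_explicit(&free_contexts, memory_order_relaxed);
    while (avail > 0) {
        if (atomic_compare_exchange_weak_explicit(
                &free_contexts, &avail, avail - 1,
                memory_order_acquire, memory_order_relaxed))
            return 1;   /* context reserved: caller may divide */
    }
    return 0;           /* no resources: keep running sequentially */
}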
Dynamic Exploitation of Resources
Load-balancing properties vs. static mapping
Performance stability across data sets → real-time systems
[Figure: example runs; histograms of number of runs vs. execution time (cycles) for CAPSULE, Static, and Superscalar on QuickSort and Dijkstra]
Thread Steering & Swapping
Greedy = default division policy
Safeguard: slow down division if the thread termination rate gets too high
(overhead issue of too fine-grain parallelization; a sketch of this policy follows below)
Thread swapping:
Swap out threads with long-latency loads onto a stack of stalled threads
Classic, BUT combined with task division it can emulate some static locality optimizations
CAPSULE breaks execution into chunks (~blocks of iterations)
[Figure: speedup over superscalar (roughly 1.0 to 2.1) for Static, Greedy, and CAPSULE policies on LZW, Tiled, and Perceptron]
[Figure: Matrix-Vector Multiply example run; completed computation on A matrix elements over time, shaded from not started through 20%/40%/60%/80% to completed]
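As an aside, a minimal sketch of how the greedy-with-safeguard decision could look in code; the counter, threshold, and function name are assumptions for illustration, not the CAPSULE run-time's actual policy code.

/* Illustrative sketch only: greedy division with a termination-rate safeguard.
   recent_terminations would be maintained by the run time (incremented at
   thread exit, decayed periodically); the threshold is hypothetical. */
#define MAX_RECENT_TERMINATIONS 8

static unsigned long recent_terminations;

/* Greedy default: whenever a probe finds a free context, grant the division,
   unless threads have been terminating so fast recently that the work chunks
   are clearly too small to amortize the ~1400-cycle division cost. */
static int should_grant_division(void) {
    if (recent_terminations > MAX_RECENT_TERMINATIONS)
        return 0;   /* safeguard: slow down division */
    return 1;       /* greedy: divide whenever resources exist */
}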
Larger Applications (SPEC-CINT2000)
Partly reverse-engineered and reprogrammed
Moderate code modifications; HOWEVER, components facilitate/favor program reuse
Modifications for componentization:

Benchmark | Lines | Modified or added functions | Modified or added lines | % total execution time
mcf       | 2412  | 2  | 174 | 45%
vpr       | 17729 | 10 | 624 | 93%
bzip2     | 4649  | 3  | 317 | 20%
crafty    | 45000 | 8  | 201 | 100%
[Figure: speedup over original SPEC benchmarks for mcf, vpr, bzip2, and crafty, overall and for the componentized section; XX% labels indicate the modified part of execution time (45%, 93%, 20%, 100%)]
Percentage and rate of successful divisions:

Benchmark | # divisions requested | # divisions granted | % divisions granted | # insts per division granted
mcf       | 99598 | 40532 | 40% | 3.7K
vpr       | 67560 | 2702  | 4%  | 4.5M
bzip2     | 38656 | 2319  | 6%  | 30M

Few probes result in actual divisions → low overhead of probes
Component decomposition is easier when starting from the program specification,
and better suited to new implementations than to existing programs
Conclusions & Future Work
Main contributions:
Relieve the user of parallelization granularity & mapping issues
Compatible with / designed for "complex" applications
Leverage an existing trend in software engineering
Tight program/hardware (or run-time) integration
Future work:
Application to distributed-memory CMPs w/ & w/o hardware support
High-level library for a simpler syntax
Disseminate through a few popular applications (x264, Gzip, ...)
Real-Time & QoS
Thank You!