CAPSULE: Parallel Execution of Component-Based Programs
Pierre Palatin, Yves Lhuillier, Olivier Temam
INRIA, France

Software & Hardware Trends Blend Well
Multi-Core + Component Program
Software components:
● Separate, encapsulated software parts
● Explicit communication links

Component Programming Is Not Enough
Multi-Core + Component Program, BUT component programming ≠ finding parallelism
● Augment components with the ability to divide
● BUT locks are still managed by the user
Granularity & mapping are too complex to be exposed to the user → architecture or run-time system support:
● User: where division may occur
● Architecture/Run-Time System: when division does occur

Low-Level C Implementation ("do loop" example)
Probing & division must be very efficient.

    idx = min;
    while (idx < max) {
        c[idx] = a[idx] + b[idx];      /* normal loop body */
        idx++;
        ctxt = capsys_probe(&loop);    /* returns NULL when no resources are available */
        if (ctxt) {                    /* division granted */
            newmin = (idx + max) / 2;  /* split the remaining bounds */
            newmax = max;
            max = newmin;              /* this instance keeps [idx, newmin) */
            … /* prepare "arg" for the new instance, which covers [newmin, newmax) */
            capsys_divide(ctxt, arg);  /* division: spawn the other half */
        }
    }

With Hardware Support
Target architecture: SMT (8 threads)
● 1 component ~ 1 thread
● Dynamic thread creation (nthr)
● Division steering
[Figure: SMT pipeline extended for CAPSULE. The fetch unit holds the PC + state of each hardware context plus the logic for thread creation and eviction; load/store latency information and stalling & freeing signals are fed back to it. Standard fetch, decode/rename, register-access, issue-queue and execute stages follow (instruction & data caches, integer/FP registers and units, mapping tables).]

Locks table (one entry per lock):
Stack Address   Locking thread   Waiting thread
0xff05          0                null
0xcd34          3                4
0x0000          4                null

P. Palatin, Y. Lhuillier, O. Temam: CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs, MICRO 2006.

Software-Only Version
[Figure: implementation stack: a C++ layer on top of the C "core" library, which runs either on plain software/libraries or with hardware support (Linux, x86, ARM11 MPCore); the CAPSULE parts are highlighted.]
● Low-level C layer for speed and portability; back-end for other parallel frameworks (OpenMP, STAPL, ...)
● Higher-level library in development
● Extends an existing threading layer (pthreads for now)
● Shared-memory CMPs (distributed-memory CMPs in the works)
● Probing ~ 10 cycles, division ~ 1400 cycles on an Intel Core Duo
https://alchemy.futurs.inria.fr/capsule
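To make the probe-and-divide pattern above concrete outside the CAPSULE run-time, here is a minimal, self-contained sketch that emulates it on top of pthreads for the vector-add "do loop". Everything not shown on the poster is an assumption made for illustration: the names capsys_probe_emul and capsys_divide_emul, the add_arg structure and the fixed MAX_THREADS budget stand in for the real capsys_probe/capsys_divide interface and are not the actual CAPSULE library.

    /* Illustrative sketch only: a pthread-based emulation of the probe/divide
     * pattern described on this poster.  The context handling, the fixed thread
     * budget and the task signature are assumptions, not the real Capsule API. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_THREADS 8   /* assumed resource budget (cf. the 8-thread SMT target) */

    static int free_slots = MAX_THREADS - 1;      /* slots left for divided tasks */
    static pthread_mutex_t slots_lock = PTHREAD_MUTEX_INITIALIZER;

    typedef struct { int min, max; const int *a, *b; int *c; } add_arg;

    /* capsys_probe emulation: grant a division only if a slot is free. */
    static int capsys_probe_emul(void)
    {
        int granted = 0;
        pthread_mutex_lock(&slots_lock);
        if (free_slots > 0) { free_slots--; granted = 1; }
        pthread_mutex_unlock(&slots_lock);
        return granted;
    }

    static void *vector_add(void *p);

    /* capsys_divide emulation: run the same component on the split-off bounds. */
    static pthread_t capsys_divide_emul(add_arg *arg)
    {
        pthread_t t;
        pthread_create(&t, NULL, vector_add, arg);
        return t;
    }

    /* The "do loop" component: adds a[idx]+b[idx] and offers to divide each iteration. */
    static void *vector_add(void *p)
    {
        add_arg *w = p;
        pthread_t children[MAX_THREADS];
        int nchildren = 0;
        int idx = w->min, max = w->max;

        while (idx < max) {
            w->c[idx] = w->a[idx] + w->b[idx];     /* normal loop body */
            idx++;
            if (capsys_probe_emul()) {             /* resources available? */
                add_arg *arg = malloc(sizeof *arg);/* prepare "arg" (not freed in this demo) */
                arg->min = (idx + max) / 2;        /* split the remaining bounds */
                arg->max = max;
                arg->a = w->a; arg->b = w->b; arg->c = w->c;
                max = arg->min;                    /* this task keeps the lower half */
                children[nchildren++] = capsys_divide_emul(arg);
            }
        }
        for (int i = 0; i < nchildren; i++) pthread_join(children[i], NULL);
        return NULL;
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        static int a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        add_arg root = { 0, N, a, b, c };
        vector_add(&root);
        printf("c[N-1] = %d\n", c[N - 1]);         /* expect 3*(N-1) */
        return 0;
    }

Even in this emulation the key point of the design is visible: the programmer only marks where a division may occur; whether it actually occurs is decided at run time by the resource check inside the probe.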
Dynamic Exploitation of Resources
● Load-balancing properties vs. static mapping
● Performance stability across data sets (e.g. for real-time systems)
[Figure: example runs: number of runs vs. execution time (cycles) for QuickSort and Dijkstra, comparing CAPSULE, static mapping and the superscalar baseline.]

Thread Steering & Swapping
Division policy:
● Greedy is the default division policy
● Safeguard: slow down division if the thread termination rate is too high (overhead of too fine-grain parallelization); a sketch of such a policy follows at the end of this poster
Thread swapping:
● Swap out threads with long-latency loads onto a stack of stalled threads
● Classic, BUT combined with task division: CAPSULE breaks execution into chunks (~ blocks of iterations) and can emulate some static locality optimizations
[Figure: speedup over the superscalar baseline for LZW and Tiled Perceptron under the static, greedy and CAPSULE policies.]
[Figure: Matrix-Vector Multiply: state of the computation on each matrix element (not started / not completed / completed) at 20%, 40%, 60% and 80% of the execution.]

Larger Applications (SPEC-CINT2000)
● Partly reverse-engineered and reprogrammed
● Moderate code modifications required, but components facilitate/favor program reuse

Modifications for componentization:
Benchmark   Lines   Modified or added functions   Modified or added lines   % of total execution time
mcf          2412   2                             174                        45%
vpr         17729   10                            624                        93%
bzip2        4649   3                             317                        20%
crafty      45000   8                             201                       100%

[Figure: speedup over the original SPEC benchmarks for mcf, vpr, bzip2 and crafty (overall and componentized-section speedups); the modified part of each benchmark is annotated (XX% : modified part).]

Percentage and rate of successful divisions:
Benchmark   # divisions requested   # divisions granted   % divisions granted   # insts per division granted
mcf         99598                   40532                 40%                   3.7K
vpr         67560                   2702                  4%                    4.5M
bzip2       38656                   2319                  6%                    30M

● Few probes result in actual divisions → low overhead of probes
● Component decomposition is easier when starting from the program specification → better suited to new implementations than to existing programs

Conclusions & Future Work
Main contributions:
● Relieve the user of parallelization granularity & mapping issues
● Compatible with / designed for "complex" applications
● Leverage an existing trend in software engineering
● Tight program/hardware (or run-time) integration
Future work:
● Application to distributed-memory CMPs, with & without hardware support
● Higher-level library for a simpler syntax
● Dissemination through a few popular applications (x264, Gzip, ...)
● Real-Time & QoS
Thank You!
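As referenced under "Thread Steering & Swapping", the greedy division policy and its termination-rate safeguard can be pictured as a simple run-time check. The capsule_policy structure, the window counters and the 4x threshold below are hypothetical values chosen only to illustrate the idea; they are not CAPSULE's actual policy or parameters.

    /* Illustrative sketch of a greedy division policy with a termination-rate
     * safeguard.  The structure, threshold and counters are assumptions made
     * for illustration; they are not Capsule's actual policy parameters. */
    #include <stdio.h>

    typedef struct {
        int free_contexts;        /* hardware/software contexts still available  */
        int recent_terminations;  /* tasks that ended during the current window  */
        int recent_divisions;     /* divisions granted during the current window */
    } capsule_policy;

    /* Greedy: grant whenever a context is free, unless tasks are terminating
     * much faster than they are created (too fine-grain parallelization). */
    static int should_grant_division(const capsule_policy *p)
    {
        if (p->free_contexts <= 0)
            return 0;                                  /* no resources: deny   */
        if (p->recent_terminations > 4 * (p->recent_divisions + 1))
            return 0;                                  /* safeguard: slow down */
        return 1;                                      /* greedy default: grant */
    }

    int main(void)
    {
        capsule_policy calm  = { 3, 2, 2 };            /* plenty of work per task */
        capsule_policy churn = { 3, 40, 2 };           /* tasks ending too fast   */
        printf("calm:  %s\n", should_grant_division(&calm)  ? "grant" : "deny");
        printf("churn: %s\n", should_grant_division(&churn) ? "grant" : "deny");
        return 0;
    }

A real run-time system would maintain such counters per execution window; the sketch only shows the decision itself: grant greedily while resources are free, and stop granting when tasks terminate much faster than they are created, a sign that the parallelism has become too fine-grained to amortize the division cost (~1400 cycles in the software-only version).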