STAPL: Standard Template Adaptive Parallel Library
Lawrence Rauchwerger, Antal Buss, Harshvardhan, Ioannis Papadopoulos, Olga Pearce, Timmie Smith, Gabriel Tanase, Nathan Thomas, Xiabing Xu, Mauro Bianco, Nancy M. Amato
http://parasol.tamu.edu/stapl
Dept. of Computer Science and Engineering, Texas A&M University

Motivation
● Multicore systems are ubiquitous
● Problem complexity and size are increasing
  – Dynamic programs are even harder
● Programmability needs to improve
● Portable performance is lacking
  – Parallel programs are not portable
  – Scalability and efficiency are (usually) poor

STAPL: Standard Template Adaptive Parallel Library
A library of parallel components that adopts the generic programming philosophy of the C++ Standard Template Library (STL).
– Application development components: pAlgorithms, pContainers, Views, pRange
  ● Provide a shared object view to eliminate explicit communication in application code
– Portability and optimization: Runtime System (RTS) with the Adaptive Remote Method Invocation (ARMI) communication library, and the Framework for Algorithm Selection and Tuning (FAST)

Three STAPL Developer Levels
● Application developer
  – Writes application code
  – Uses pContainers and pAlgorithms
● Library developer
  – Writes new pContainers and pAlgorithms
  – Uses pRange and the RTS
● Run-time system developer
  – Ports the system to new architectures
  – Writes task scheduling modules
  – Uses native threading and communication libraries
[Diagram: user application code sits atop pAlgorithms, Views, pContainers, and pRange, which sit atop the run-time system (ARMI communication library, scheduler, executor, performance monitor) and the native layers (Pthreads, OpenMP, MPI, native, ...).]

Applications Using STAPL
● Particle transport - PDT
● Bioinformatics - protein folding
● Geophysics - seismic ray tracing
● Aerospace - MHD
  – Seq.
“Ctran” code (7K LOC) – STL version (1.2K LOC) – STAPL version (1.3K LOC)

pContainers: Parallel Containers
● Container - a data structure with an interface to maintain and access a collection of generic elements
  – STL (vector, list, map, set, hash), MTL[1] (matrix), BGL[2] (graph), etc.
● pContainer - distributed storage and concurrent methods
  – Shared object view
  – Compatible with its sequential counterpart (e.g., STL)
  – Thread safe
  – Support for user customization (e.g., data distributions)
  – Currently implemented: pArray, pVector, pList, pGraph, pMatrix, pAssociative
[1] Matrix Template Library  [2] Boost Graph Library

pContainer Framework
Concepts and methodology for developing parallel containers:
– pContainers - a collection of base containers plus information for parallelism management
– Improved user productivity
  ● Base classes providing fundamental functionality (via inheritance and specialization)
  ● Composition of existing pContainers
– Scalable performance
  ● Distributed, non-replicated data storage
  ● Parallel (semi-random) access to data
  ● Low overhead relative to the base container counterpart

pContainer Framework Concepts
● Base container: data storage
  – sequential containers (e.g., STL, MTL, BGL)
  – parallel containers (e.g., Intel TBB)
● Data distribution information - shared object view
  – Global Identifier, Domain, Partition, Location, Partition Mapper
[Figure: a six-element p_array declared as p_array pa(6); the distribution information maps elements a-f (GIDs 0-5) onto base containers on Location 0 and Location 1, while the user-level view remains a single flat array.]

pContainer Interfaces
– Constructors
  ● Default constructors
  ● May specify a desired data distribution
– Concurrent methods
  ● Sync, async, split-phase
– Views

  stapl_main() {
    partition_block_cyclic partition(10);  // argument is the block size
    p_matrix<int> data(100, 100, partition);
    p_generate(data.view(), rand());
    res = p_accumulate(data.view(), 0);
  }

pGraph Methods
● Performance of adding vertices and edges asynchronously
● Weak scaling on a CRAY
using up to 24,000 cores and on a Power 5 cluster using up to 128 cores
● Torus with 1500x1500 vertices per processor
[Figures: weak-scaling results on the CRAY XT4 and the Power 5 cluster]

pGraph Algorithms
● Performance of find_sources and find_sinks in a directed graph
● Weak scaling on a CRAY using up to 24,000 cores and on a Power 5 cluster using up to 128 cores
● Torus with 1500x1500 vertices per processor
[Figures: weak-scaling results on the CRAY XT4 and the Power 5 cluster]

Views
● A View defines an abstract data type that provides methods for access and traversal of the elements of a pContainer, independent of how the elements are stored in the pContainer.
● Example: printing the elements of a 3x3 matrix holding 1..9 in row-major order:

  print(View v)
    for i = 1 to v.size() do
      print(v[i])

  Rows view output:    1,2,3,4,5,6,7,8,9
  Columns view output: 1,4,7,2,5,8,3,6,9

pAlgorithms
● Build and execute task graphs to perform computation
  – Task graphs in STAPL are called pRanges
● Easy to develop
  – Work functions look like sequential code
  – Work functions can call STAPL pAlgorithms
  – pRange factories simplify task graph construction
● STAPL pAlgorithms accelerate application development
  – Basic building blocks for applications
  – Parallel equivalents of STL algorithms
  – Parallel algorithms for pContainers
    ● Graph algorithms for pGraphs
    ● Numeric algorithms/operations for pMatrices

Parallel Find
● Find the first element equal to a given value

  View::iterator p_find(View view, T value)
    return map_reduce(view,
                      find_work_function(value),  // map operation
                      std::less());               // reduce operation

  View::iterator find_work_function(View view)
    if (do_nesting())
      return p_find(view, value)
    else
      return std::find(view.begin(), view.end(), value)
    endif
  end

Parallel Sample Sort
● pAlgorithm written using a sequence of task graphs.
  p_sort(View view, Op comparator)
    // base case: small inputs are merged directly
    if (view.size() <= get_num_locations())
      return reduce(view, merge_sort_work_function(comparator));

    sample_view = map(view, select_samples_work_function());

    // sort the samples
    p_sort(sample_view, comparator);

    // partition the data using the samples
    partitioned_view = map(view,
                           full_overlap_view(sample_view),
                           bucket_partition_work_function(comparator));

    // sort each partition
    map(partitioned_view, sort_work_function(comparator));

Scalability of pAlgorithms
[Figure: execution times for weak scaling of pAlgorithms on data stored in different pContainers on a CRAY XT4.]

STAPL Runtime System
[Diagram: a smart application passes application-specific parameters to the STAPL RTS (ARMI, executor, memory manager). Custom scheduling goes through a user-level dispatcher; otherwise the operating system kernel scheduler is used (no custom scheduling, e.g., NPTL). Multithreaded operation (communication, RMI, and task threads) is in an experimental stage.]

The STAPL Runtime System (RTS)
● Abstracts platform resources
  – threads, mutexes, atomics
● Provides a consistent API and behavior across platforms
● Configured at compile time for a specific platform
  – hardware counters, different interconnect characteristics
● Adapts at runtime to the runtime environment
  – available memory, communication intensity, etc.
● Provides an interface for calling functions on distributed objects
  – ARMI: Adaptive Remote Method Invocation
● There is one instance of the RTS running in every process
  – so the RTS itself is distributed as well

ARMI: Adaptive Remote Method Invocation
● The communication service of the RTS
● Provides two degrees of freedom: synchronous vs. asynchronous, and one-to-one vs. one-to-many
  – Asynchronous: async_rmi, multi_async_rmi
  – Synchronous: sync_rmi, opaque_rmi, multi_sync_rmi, multi_opaque_rmi, reduce_rmi
  – Collective: broadcast_rmi, allreduce_rmi
  – Synchronization: rmi_fence, rmi_flush, rmi_barrier
● Allows transfer of data, work, or both across the system; used to hide latency
● Used to call a function on a distributed object anywhere on the system
● Supports mixed-mode operation (MPI + threads)

  // Example of ARMI use
  async_rmi(destination, p_object, function, arg0, ...);
  r = sync_rmi(destination, p_object, function, arg0, ...);

The STAPL RTS: Major Components
● Resource reuse: thread pool, memory manager
● Hardware monitoring: hardware counters, system resource usage; interfaces with TAU and OS-provided system calls
● ARMI: different communication methods; RMI request aggregation and combining; detection of whether shared memory can be used or MPI is required; execution of RMIs; consistency models
● Adaptivity: feedback to pContainers / pRange; resource acquisition before it is needed; load monitoring and thread allocation
● Scheduler and multithreaded support
● Executor: execution of the tasks provided by the pRange

FAST Architecture
● Framework for parallel algorithm selection
● Developer specifies the important parameters and the metrics to use
● Developer queries the produced model to select an implementation
● Selection process is transparent to the code calling the algorithm
[Diagram: STAPL user code and the parallel algorithm choices feed runtime tests over data characteristics; the resulting model drives the selected algorithm in the adaptive executable.]

Parallel Sorting: Experimental Results
● Attributes for the selection model:
  – processor count
  – data type
  – input size
  – max value (impacts radix sort)
  – presortedness
● SGI Altix validation set (V1) – 100%
accuracy (N = 120M)
[Chart: SGI Altix selection model with MaxElement = 120,000 — normalized execution times of Radix, Sample, Column, and Adaptive sorts on 4-32 processors for nearly-sorted, reversed, and random integer and double inputs, showing a small adaptive performance penalty.]

Parallel Sorting - Altix Relative Performance (V2)
[Chart: relative speedup of Sample, Column, Radix, Adaptive, and Best on 2-32 processors for random inputs.]
● The model obtains 99.7% of the possible performance.
● The next best algorithm (sample sort) provides only 90.4%.

PDT: Developing Applications with STAPL
● An important application class for DOE
  – e.g., Sweep3D and UMT2K
● Large, ongoing DOE project at TAMU to develop the application in STAPL
● A STAPL precursor is used by PDT in the DOE PSAAP center
[Figure: one sweep vs. eight simultaneous sweeps]

pRanges in PDT: Writing New pAlgorithms
● pRanges are sweeps in the particle transport application
● Reflective materials on the problem boundary create dependencies
● A composition operator will allow easy composition:

  zip(prA, prB, Zipper(4,32,4));

[Figure: two pRanges, prA and prB, whose sweep task graphs over a 32-cell grid are zipped together along the reflecting boundary.]

Sweep Performance
● Weak scaling keeps the number of unknowns per processor constant.
● Communication increases with processor count.
● The KBA model shows the performance of a perfectly scheduled sweep.
● Divergence after 2,048 processors is due to non-optimal task scheduling.

Conclusion
● STAPL allows productive parallel application development
● pContainers and pAlgorithms
  – Application building blocks
  – Simplify development
  – Extensibility enables easy development of new components
● Composition of pContainers and pAlgorithms enables reuse
● RTS and FAST provide portability and adaptability