
STAPL
Standard Template Adaptive Parallel Library
Lawrence Rauchwerger
Antal Buss, Harshvardhan, Ioannis Papadopoulos,
Olga Pearce, Timmie Smith, Gabriel Tanase, Nathan Thomas,
Xiabing Xu, Mauro Bianco, Nancy M. Amato
http://parasol.tamu.edu/stapl
Dept of Computer Science and Engineering, Texas A&M
Motivation
● Multicore systems: ubiquitous
● Problem complexity and size are increasing
  – Dynamic programs are even harder
● Programmability needs to improve
● Portable performance is lacking
  – Parallel programs are not portable
  – Scalability & efficiency are (usually) poor
STAPL: Standard Template Adaptive Parallel Library
A library of parallel components that adopts the generic programming philosophy of the C++ Standard Template Library (STL)
– Application Development Components
  ● pAlgorithms, pContainers, Views, pRange
  ● Provide a Shared Object View to eliminate explicit communication in applications
– Portability and Optimization
  ● Runtime System (RTS) and Adaptive Remote Method Invocation (ARMI) communication library
  ● Framework for Algorithm Selection and Tuning (FAST)
Three STAPL Developer Levels
● Application Developer
  – Writes application code
  – Uses pContainers and pAlgorithms
● Library Developer
  – Writes new pContainers and pAlgorithms
  – Uses Views, pRange, and the RTS
● Run-time System Developer
  – Ports the system to new architectures
  – Writes task scheduling modules
  – Uses native threading and communication libraries

[Layer diagram: User Application Code sits on Views, pAlgorithms, pContainers, and pRange; these sit on the Run-time System (ARMI Communication Library, Scheduler, Executor, Performance Monitor), which sits on Pthreads, OpenMP, MPI, Native, …]
Applications Using STAPL
● Particle Transport - PDT
● Bioinformatics - Protein Folding
● Geophysics - Seismic Ray Tracing
● Aerospace - MHD
  – Seq. “Ctran” code (7K LOC)
  – STL (1.2K LOC)
  – STAPL (1.3K LOC)
pContainers: Parallel Containers
● Container - a data structure with an interface to maintain and access a collection of generic elements
  – STL (vector, list, map, set, hash), MTL [1] (matrix), BGL [2] (graph), etc.
● pContainer - distributed storage and concurrent methods
  – Shared Object View
  – Compatible with sequential counterpart (e.g., STL)
  – Thread safe
  – Support for user customization (e.g., data distributions)
  – Currently implemented: pArray, pVector, pList, pGraph, pMatrix, pAssociative

[1] Matrix Template Library  [2] Boost Graph Library
pContainer Framework
Concepts and methodology for developing parallel containers
– pContainers - a collection of base containers and information for parallelism management
– Improved user productivity
  ● Base classes providing fundamental functionality
    ◆ Inheritance
    ◆ Specialization
  ● Composition of existing pContainers
– Scalable performance
  ● Distributed, non-replicated data storage
  ● Parallel (semi-random) access to data
  ● Low overhead relative to the base container counterpart
pContainer Framework Concepts
● Base Container: data storage
  – sequential containers (e.g., STL, MTL, BGL)
  – parallel containers (e.g., Intel TBB)
● Data Distribution Information: provides the shared object view
  – Global Identifier, Domain, Partition, Location, Partition Mapper

[Figure: p_array pa(6) holds elements a–f with GIDs 0–5 at the user level; two example distributions map them through distribution info (Info_0, Info_1) onto base containers, e.g. {a, b} and {e, f} on Location 0 with {c, d} on Location 1, or {a, b, c} on Location 0 with {d, e, f} on Location 1]
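To make the distribution concepts above concrete, here is a minimal sketch (our own illustration, not the STAPL API) of a blocked partition that maps a global identifier (GID) to a (location, local index) pair — the kind of information a pContainer's distribution keeps so users see one shared object while storage stays distributed:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative blocked partition: GID -> (location, local index).
// All names here are ours, chosen for the example.
struct blocked_partition {
    std::size_t n_elems;      // global number of elements
    std::size_t n_locations;  // number of locations (processes)

    // block size, rounded up so every GID is covered
    std::size_t block() const {
        return (n_elems + n_locations - 1) / n_locations;
    }

    // location owning a given GID
    std::size_t location_of(std::size_t gid) const {
        return gid / block();
    }

    // index of the GID inside its location's base container
    std::size_t local_index(std::size_t gid) const {
        return gid % block();
    }
};
```

With p_array pa(6) on 2 locations, this partition puts GIDs 0–2 on Location 0 and GIDs 3–5 on Location 1, matching the second distribution in the figure.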
pContainer Interfaces
– Constructors
  ● Default constructors
  ● May specify a desired data distribution
– Concurrent Methods
  ● Sync, async, split phase
– Views

stapl_main() {
  partition_block_cyclic partition(10); // argument is block size
  p_matrix<int> data(100, 100, partition);
  p_generate(data.view(), rand());
  res = p_accumulate(data.view(), 0);
}
pGraph Methods
● Performance for add vertex and add edge asynchronously
● Weak scaling on a CRAY using up to 24,000 cores and on a Power 5 cluster using up to 128 cores
● Torus with 1500x1500 vertices per processor

[Charts: weak-scaling results on CRAY XT4 and Power 5]
pGraph Algorithms
● Performance for find_sources and find_sinks in a directed graph
● Weak scaling on a CRAY using up to 24,000 cores and on a Power 5 cluster using up to 128 cores
● Torus with 1500x1500 vertices per processor

[Charts: weak-scaling results on CRAY XT4 and Power 5]
Views
● A View defines an abstract data type that provides methods for access and traversal of the elements of a pContainer, independent of how the elements are stored in the pContainer.
● Example: print the elements of a 3x3 matrix holding 1 through 9

print(View v)
  for i = 1 to v.size() do
    print(v[i])

– Rows view output: 1,2,3,4,5,6,7,8,9
– Columns view output: 1,4,7,2,5,8,3,6,9
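The two outputs above come from the same storage traversed in different orders. A minimal sketch of the idea (our own code, not STAPL's View classes) for a row-major 3x3 matrix:

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Same data, two traversal orders: only the view changes,
// the stored elements do not. Names are ours, for illustration.
using matrix3 = std::array<int, 9>;  // row-major storage

std::vector<int> rows_view(const matrix3& m) {
    std::vector<int> out;
    for (std::size_t i = 0; i < 3; ++i)       // row by row
        for (std::size_t j = 0; j < 3; ++j)
            out.push_back(m[i * 3 + j]);
    return out;
}

std::vector<int> columns_view(const matrix3& m) {
    std::vector<int> out;
    for (std::size_t j = 0; j < 3; ++j)       // column by column
        for (std::size_t i = 0; i < 3; ++i)
            out.push_back(m[i * 3 + j]);
    return out;
}
```

For the matrix 1..9, rows_view yields 1,2,3,4,5,6,7,8,9 and columns_view yields 1,4,7,2,5,8,3,6,9, as on the slide.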
pAlgorithms
● Build and execute task graphs to perform computation
  – Task graphs in STAPL are called pRanges
● Easy to develop
  – Work functions look like sequential code
  – Work functions can call STAPL pAlgorithms
  – pRange factories simplify task graph construction
● STAPL pAlgorithms accelerate application development
  – Basic building blocks for applications
  – Parallel equivalents of STL algorithms
  – Parallel algorithms for pContainers
    ● Graph algorithms for pGraphs
    ● Numeric algorithms/operations for pMatrices
Parallel Find
● Find the first element equal to the given value

View::iterator
p_find(View view, T value)
  return map_reduce(
    view,
    find_work_function(value), // map operation
    std::less()                // reduce operation
  );

View::iterator
find_work_function(View view)
  if (do_nesting())
    return p_find(view, value)
  else
    return std::find(view.begin(), view.end(), value)
  endif
end
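The map-reduce structure above can be sketched sequentially (our own illustration, not STAPL code): each "location" runs std::find on its chunk (the map step), and the reduction keeps the match with the smallest global position, i.e. the first one:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Chunked find: map = std::find per chunk, reduce = min position.
// Returns data.size() when the value is absent.
std::size_t chunked_find(const std::vector<int>& data, int value,
                         std::size_t n_chunks) {
    std::size_t best = data.size();  // "not found" sentinel
    std::size_t chunk = (data.size() + n_chunks - 1) / n_chunks;
    for (std::size_t c = 0; c < n_chunks; ++c) {
        std::size_t lo = c * chunk;
        std::size_t hi = std::min(lo + chunk, data.size());
        if (lo >= hi) continue;
        // map: sequential find on this chunk
        auto it = std::find(data.begin() + lo, data.begin() + hi, value);
        // reduce: keep the earliest position found so far
        if (it != data.begin() + hi)
            best = std::min(best,
                            static_cast<std::size_t>(it - data.begin()));
    }
    return best;
}
```

In STAPL the chunks would be processed concurrently on different locations; the reduction is what makes the result deterministic regardless of which chunk finishes first.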
Parallel Sample Sort
● pAlgorithm written using a sequence of task graphs.

p_sort(View view, Op comparator)
  // handle recursive call
  if (view.size() <= get_num_locations())
    reduce(view, merge_sort_work_function(comparator));

  sample_view = map(view, select_samples_work_function());
  // sort the samples
  p_sort(sample_view, comparator);
  // partition the data using the samples
  partitioned_view = map(view, full_overlap_view(sample_view),
                         bucket_partition_work_function(comparator));
  // sort each partition
  map(partitioned_view, sort_work_function(comparator));
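A sequential sketch of the same phases (select samples, sort them, bucket the data by the sample splitters, sort each bucket) may make the structure clearer. This is our own illustration under simplified assumptions, not the STAPL implementation:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential sample sort sketch; in the pAlgorithm each bucket
// would be sorted by a separate task in the task graph.
std::vector<int> sample_sort(std::vector<int> data,
                             std::size_t n_buckets) {
    if (data.size() <= n_buckets) {  // small input: sort directly
        std::sort(data.begin(), data.end());
        return data;
    }

    // select n_buckets - 1 evenly spaced samples and sort them
    std::vector<int> samples;
    for (std::size_t i = 1; i < n_buckets; ++i)
        samples.push_back(data[i * data.size() / n_buckets]);
    std::sort(samples.begin(), samples.end());

    // partition the data into buckets using the samples as splitters
    std::vector<std::vector<int>> buckets(n_buckets);
    for (int x : data) {
        std::size_t b = std::upper_bound(samples.begin(), samples.end(), x)
                        - samples.begin();
        buckets[b].push_back(x);
    }

    // sort each bucket and concatenate (buckets are already ordered
    // relative to each other by the splitters)
    std::vector<int> out;
    for (auto& b : buckets) {
        std::sort(b.begin(), b.end());
        out.insert(out.end(), b.begin(), b.end());
    }
    return out;
}
```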
Scalability of pAlgorithms

[Chart: execution times for weak scaling of pAlgorithms on data stored in different pContainers on CRAY XT4]
STAPL Runtime System

[Diagram: a Smart Application passes application-specific parameters to the STAPL RTS. The advanced-stage RTS comprises ARMI, the Executor, and the Memory Manager; an experimental multithreaded stage adds dedicated communication, RMI, and task threads. Custom scheduling goes through a user-level dispatcher; kernel scheduling is left to the operating system's kernel scheduler (no custom scheduling, e.g., NPTL)]
The STAPL Runtime System (RTS)...
● Abstracts platform resources
  – threads, mutexes, atomics
● Provides consistent API and behavior across platforms
● Configured at compile time for a specific platform
  – Hardware counters, different interconnect characteristics
● Adapts at runtime to the runtime environment
  – Available memory, communication intensity, etc.
● Provides an interface for calling functions on distributed objects
  – ARMI – Adaptive Remote Method Invocation
● There is one instance of the RTS running in every process
  – So it is distributed as well
ARMI: Adaptive Remote Method Invocation
● Communication service of the RTS
● Provides two degrees of freedom: synchronous vs. asynchronous, and one-to-one vs. one-to-many
  – Asynchronous: async_rmi, multi_async_rmi
  – Synchronous: sync_rmi, opaque_rmi, multi_sync_rmi, multi_opaque_rmi, reduce_rmi
  – Synchronization: rmi_fence, rmi_fence_os, rmi_flush, rmi_barrier
  – Collective: broadcast_rmi, allreduce_rmi
● Allows transfer of data, work, or both across the system
● Used to hide latency
● Used to call a function on a distributed object anywhere on the system
● Supports mixed-mode operation (MPI+threads)

// Example of ARMI use
async_rmi(destination, p_object, function, arg0, ...);
r = sync_rmi(destination, p_object, function, arg0, ...);
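A toy model of the asynchronous side of this interface (our own sketch, not the ARMI API): async requests to a destination are queued rather than delivered immediately, and a fence drains every queue. Real ARMI goes further by packing the aggregated requests into fewer, larger messages to hide latency:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy async-RMI model: per-destination outboxes plus a fence.
// All names and semantics here are illustrative.
struct rmi_model {
    std::vector<std::queue<std::function<void()>>> outbox;

    explicit rmi_model(std::size_t n_locations) : outbox(n_locations) {}

    // async_rmi: enqueue the request and return immediately
    void async_rmi(std::size_t dest, std::function<void()> request) {
        outbox[dest].push(std::move(request));
    }

    // rmi_fence: "deliver" (execute) every pending request
    void rmi_fence() {
        for (auto& q : outbox)
            while (!q.empty()) {
                q.front()();
                q.pop();
            }
    }
};
```

Nothing observable happens at the call site of async_rmi; the caller only gains the guarantee that the request has completed once the fence returns.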
The STAPL RTS: Major Components
● Resource Reuse: thread pool, memory manager
● Hardware Monitoring: hardware counters, system resource usage; interfaces with TAU and OS-provided system calls
● ARMI: different communication methods; RMI request aggregation and combining; execution of RMIs; consistency models
● Scheduler
● Adaptivity: feedback to pContainers/pRange; resource acquisition before needed; load monitoring and thread allocation; detection of shared memory vs. having to use MPI
● Multithreaded Support
● Executor: execution of tasks provided by the pRange
FAST Architecture
● Framework for parallel algorithm selection
● Developer specifies important parameters and metrics to use
● Developer queries the produced model to select an implementation
● Selection process is transparent to code calling the algorithm

[Diagram: STAPL user code and the parallel algorithm choices feed the adaptive executable; data characteristics and runtime tests drive the model, which yields the selected algorithm]
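The shape of such a selection model can be sketched as a decision function over input attributes. The rules below are entirely hypothetical placeholders (FAST learns its model from measurements; these thresholds are ours, for illustration only):

```cpp
#include <cstddef>
#include <string>

// Hypothetical decision rules mapping input attributes to a sort
// implementation name. Not the model produced by FAST.
std::string select_sort(std::size_t n, long max_value,
                        bool nearly_sorted) {
    if (nearly_sorted)
        return "column";  // assumed cheap on almost-sorted input
    if (max_value < static_cast<long>(n))
        return "radix";   // assumed to win on small key ranges
    return "sample";      // assumed general-purpose fallback
}
```

The point is the interface, not the rules: callers ask for "a sort" and the model picks the implementation, so the selection stays transparent to the calling code.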
Parallel Sorting: Experimental Results
● Attributes for the selection model:
  – Processor count
  – Data type
  – Input size
  – Max value (impacts radix sort)
  – Presortedness
● SGI Altix validation set (V1): 100% accuracy (N=120M, MaxElement=120,000)

[Chart: SGI Altix selection model; normalized execution time of radix, sample, column, and adaptive sorts on 4, 8, 16, and 32 processors for nearly sorted, reversed, and random integer and double inputs, showing the adaptive performance penalty]
Parallel Sorting - Altix Relative Performance (V2)

[Chart: relative speedup of sample, column, radix, random, adaptive, and the best algorithm on 2-32 processors]

● The model obtains 99.7% of the possible performance.
● The next best algorithm (sample) provides only 90.4%.
PDT: Developing Applications with STAPL
● Important application for DOE
  – E.g., Sweep3D and UMT2K
● Large, on-going DOE project at TAMU to develop the application in STAPL
● STAPL precursor used by PDT in the DOE PSAAP center

[Figures: one sweep; eight simultaneous sweeps]
pRanges in PDT: Writing new pAlgorithms
● pRanges are sweeps in the particle transport application
● Reflective materials on the problem boundary create dependencies
● A composition operator will allow easy composition:

zip(prA, prB, Zipper(4,32,4));

[Diagram: two task graphs, prA and prB, each a grid of numbered sweep tasks; the zip operator stitches their dependencies together along the reflective boundary]
Sweep Performance
● Weak scaling keeps the number of unknowns per processor constant.
● Communication increases with processor count.
● The KBA model shows the performance of a perfectly scheduled sweep.
● Divergence after 2048 processors is due to non-optimal task scheduling.
Conclusion
● STAPL allows productive parallel application development
● pContainers and pAlgorithms
  – Application building blocks
  – Simplify development
  – Extensibility enables easy development of new components
● Composition of pContainers and pAlgorithms enables reuse
● RTS and FAST provide portability and adaptability