Dynamic Testing of Flow Graph Based Parallel Applications

B. Schaeli, R.D. Hersch
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland
Message race detection
• In message-passing parallel applications, the
delivery order of messages is not deterministic
• A message race occurs when reordering messages
modifies the computation output
– Existence of the race depends on the actual
implementation of the application
• Assumption: processes only exchange information
via messages
– Any execution can be expressed as a message ordering
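
As a minimal illustration (a hypothetical example, not one of the applications discussed here), the sketch below shows a worker that applies incoming messages in arrival order; because the update is not commutative, delivering the two messages in the opposite order changes the final result, i.e. the program contains a message race.

#include <iostream>
#include <vector>

// Hypothetical example of a message race: the worker folds incoming
// messages into its state in whatever order they are delivered.
struct Message { char op; double value; };       // op is '+' or '*'

double applyInOrder(const std::vector<Message>& delivery) {
    double state = 1.0;
    for (const Message& m : delivery)            // delivery order = message ordering
        state = (m.op == '+') ? state + m.value : state * m.value;
    return state;
}

int main() {
    Message a{'+', 2.0}, b{'*', 3.0};            // two messages in transit
    std::cout << applyInOrder({a, b}) << "\n";   // (1 + 2) * 3 = 9
    std::cout << applyInOrder({b, a}) << "\n";   // (1 * 3) + 2 = 5: the output depends on the ordering
}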
Message race detection
• Problem
– Number of possible orderings explodes
• Goals
– Identify and prevent execution of equivalent
orderings
– Avoid executing full application for every
remaining ordering
Plan
• Present Dynamic Parallel Schedules
(DPS) parallelization framework
• Construction and reduction of message-passing state graph
– Practical results
– Limitations
– Heuristic for building partial graph
• Application to MPI
Dynamic Parallel Schedules
[Flow graph figure: a SplitData, Process, MergeResults operation chain connected by the InputData, SubTask, SubResult and Result data objects]
• Four operation types
– Split: 1 input, n outputs
– Leaf: 1 input, 1 output
– Merge: n inputs, 1 output
– Stream: m inputs, n outputs
• Operations contain arbitrary user code
– The actual number of operations executed is only
known at runtime
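
The sketch below is a conceptual C++ rendering of the four arities; the class and member names are illustrative and do not reproduce the actual DPS API. Each operation consumes data objects and produces the data objects that flow along the graph edges.

#include <iostream>
#include <vector>

// Illustrative types only (not the DPS API): one struct per operation arity.
struct DataObject { int payload = 0; };              // arbitrary user content in practice

struct Split {                                       // 1 input  -> n outputs
    std::vector<DataObject> execute(const DataObject& in) {
        return {in, in};                             // e.g. partition the work into two sub-tasks
    }
};
struct Leaf {                                        // 1 input  -> 1 output
    DataObject execute(const DataObject& in) { return in; }
};
struct Merge {                                       // n inputs -> 1 output
    DataObject execute(const std::vector<DataObject>& in) {
        DataObject out;
        for (const DataObject& d : in) out.payload += d.payload;
        return out;
    }
};
struct Stream {                                      // m inputs -> n outputs
    std::vector<DataObject> execute(const std::vector<DataObject>& in) { return in; }
};

int main() {
    Split split; Leaf leaf; Merge merge;
    DataObject input;
    input.payload = 3;
    std::vector<DataObject> parts = split.execute(input);
    for (DataObject& p : parts) p = leaf.execute(p); // the leaf runs once per sub-task
    DataObject result = merge.execute(parts);
    std::cout << result.payload << "\n";             // prints 6
}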
Execution context
• Operations execute within processes,
grouped into process collections
[Figure: flow graph operations mapped onto the master[0], slaves[0], slaves[1] and slaves[2] processes]
• Execution is fully asynchronous and data-driven
Checkpointing
• A serializable C++ object can be attached to
each process
– Represents checkpointable process state
– Processes of a collection use the same object type
class A {
public:
    int a;
    double b;
    std::vector<int> v;
};

class A : public ISerializable {
    CLASSDEF(A)
    MEMBERS
        ITEM(int, a)
        ITEM(double, b)
        ITEM(std::vector<int>, v)
    CLASSEND;
};
• Objects can be compared
– Determine members modified by an operation
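
A sketch of how a member-wise comparison can report which members an operation modified; memberDiff and the explicit field checks are hypothetical helpers written for illustration, not the Autoserial comparison API.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper (not the Autoserial API): compare two snapshots of a
// process state member by member and report the members that changed.
struct State { int a; double b; std::vector<int> v; };

std::vector<std::string> memberDiff(const State& before, const State& after) {
    std::vector<std::string> modified;
    if (before.a != after.a) modified.push_back("a");
    if (before.b != after.b) modified.push_back("b");
    if (before.v != after.v) modified.push_back("v");
    return modified;
}

int main() {
    State before{1, 2.0, {3}};
    State after = before;
    after.b = 4.0;                                   // the operation only wrote b
    for (const std::string& m : memberDiff(before, after))
        std::cout << "modified: " << m << "\n";      // prints "modified: b"
}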
Message-passing state graph
• Application state:
state of all processes and set
of messages in transit, when
no operation is running
• Transition:
triggered by message delivery
[Figure: state graph obtained by delivering the in-transit messages in every possible order; nodes are numbered application states, edges are message deliveries]
Message-passing state graph
[Figure: the state graph again; a different delivery order leads to state 6']
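
A compact sketch of the construction just described, under simplifying assumptions (a process state is a plain value, delivering a message is a deterministic function): the graph is explored depth-first and identical states reached through different orderings are merged, so their successors are expanded only once. This is an illustration of the idea, not the DPS implementation.

#include <iostream>
#include <set>
#include <stack>
#include <tuple>
#include <vector>

// An application state: the state of all processes plus the messages in transit.
struct State {
    std::vector<int> processes;                      // toy per-process state
    std::multiset<int> inTransit;                    // messages awaiting delivery
    bool operator<(const State& o) const {
        return std::tie(processes, inTransit) < std::tie(o.processes, o.inTransit);
    }
};

// Transition: delivering message m updates a process (toy rule: add to process 0).
State deliver(State s, int m) {
    s.inTransit.erase(s.inTransit.find(m));
    s.processes[0] += m;
    return s;
}

// Explore every message ordering; already visited (merged) states are not re-expanded.
int exploreAllOrderings(const State& initial) {
    std::set<State> visited;
    std::stack<State> work;
    work.push(initial);
    int deliveries = 0;
    while (!work.empty()) {
        State s = work.top(); work.pop();
        if (!visited.insert(s).second) continue;     // identical state already in the graph
        std::set<int> distinct(s.inTransit.begin(), s.inTransit.end());
        for (int m : distinct) {                     // one transition per deliverable message
            ++deliveries;
            work.push(deliver(s, m));
        }
    }
    return deliveries;
}

int main() {
    State s0{{0, 0}, {1, 2, 3}};                     // two processes, three messages in transit
    std::cout << "messages delivered: " << exploreAllOrderings(s0) << "\n";
}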
Using flow graph information
[Figure: flow graph operations mapped onto the master and slaves processes; the flow graph indicates which computations and messages are still to come]
Floyd-Steinberg halftoning algorithm
• Number of messages delivered to
execute all orderings
                        2 proc.    4 proc.    6 proc.    8 proc.
All orderings           6.8·10^8   overflow   overflow   overflow
Full state graph        338        3.9·10^4   overflow   overflow
Optimized state graph   47         765        2.7·10^4   1.0·10^6
Pipelined LU factorization
• Number of messages delivered to
execute all orderings
                        3 proc.    4 proc.
All orderings           > 10^17    overflow
Full state graph        4841       6.2·10^9
Optimized state graph   4780       6.2·10^9
Trading time for space
• Limitations
– Reexecution is serial
– Graph and checkpoints are stored in RAM
        Run: 4 proc.       Testing: 4 proc.    Testing: 6 proc.
        [s]      [MB]      [s]      [MB]       [s]      [MB]
FS      0.014    0.066     0.716    1.9        48       overflow
LU      0.014    0.115     33804    302        19       overflow
Partial tests
• One can test part of an application
– E.g. between two global synchronizations
– Test may start from arbitrary checkpoint
Heuristic: DFS
[Figure: partial state graph built by the depth-first search heuristic]
Application to MPI
• Blocking point-to-point and collective calls
only, assuming no buffering
• Each MPI call represents one event
– Inspired by Siegel and Avrunin
Application to MPI
[Figure: event graph of a four-process example. After MPI_Init (I0..I3), P0 posts two wildcard receives (r*,0), P1 posts a wildcard receive (r*,1) and then sends to P0 (s1,0), P2 sends to P0 (s2,0), and P3 sends to P1 (s3,1), followed by MPI_Finalize (F0..F3). One delivery order ends with mismatched buffer sizes.]
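
The sketch below reconstructs an MPI program consistent with the events in the figure; it is an illustration, not the authors' exact example, and the differing buffer sizes are an assumption added to show how one delivery order ends in a truncation ("mismatched buffer size") error while the other completes normally. Run with four processes.

#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                          // events I0..I3
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int small[1] = {0};
    int large[4] = {0, 0, 0, 0};

    if (rank == 0) {                                 // two wildcard receives: r*,0 then r*,0
        MPI_Recv(small, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(large, 4, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {                          // r*,1 then s1,0
        MPI_Recv(small, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(large, 4, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 2) {                          // s2,0
        MPI_Send(small, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 3) {                          // s3,1
        MPI_Send(small, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();                                  // events F0..F3
    return 0;
}

If P0's first receive matches the one-element message from P2, the second receive gets the four-element message from P1 and the run succeeds; if it matches P1's message first, the one-element buffer is too small and the receive fails, which is the mismatched-buffer-size outcome highlighted in the figure.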
MPI implementation
• Methods for generating relevant orderings
– Include non-blocking and buffered calls
– ISP: Vakkalanka et al. (PPoPP’08)
• Checkpointing of MPI applications
– Berkeley Lab Checkpoint/Restart library
• We must be able to compare checkpoints to merge
identical states
– Focus on user-defined part of checkpoint
– Autoserial library: serialization and comparison of C++
objects
Conclusion
• Message-passing state graph construction
– Execute once parts common to multiple orderings
– Trades reexecution time for memory consumption
• Number of states reduced using partial order reduction
technique
– Flow graph provides information about future computations
– Use info about how operations read and write local variables
• Heuristic to maximize "desynchronization" within application
• Integration into DPS framework
– Testing by recompiling application
– Replay functionality runs within traditional debugger
• Results are input- and parameter-specific
– However, the actual data or the number of processes is often irrelevant in practice
Thank you
http://dps.epfl.ch
http://home.gna.org/autoserial