Dynamic Testing of Flow Graph Based Parallel Applications
B. Schaeli, R.D. Hersch
Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland

Message race detection
• In message-passing parallel applications, the delivery order of messages is not deterministic
• A message race occurs when reordering messages modifies the computation output
  – The existence of the race depends on the actual implementation of the application
• Assumption: processes only exchange information via messages
  – Any execution can be expressed as a message ordering

Message race detection
• Problem
  – The number of possible orderings explodes
• Goals
  – Identify and prevent the execution of equivalent orderings
  – Avoid executing the full application for every remaining ordering

Plan
• Present the Dynamic Parallel Schedules (DPS) parallelization framework
• Construction and reduction of the message-passing state graph
  – Practical results
  – Limitations
  – Heuristic for building a partial graph
• Application to MPI

Dynamic Parallel Schedules
[Flow graph with operations SplitData, Process and MergeResults, connected by the data objects InputData, SubTask, SubResult and Result]
• Four operation types
  – Split: 1 input, n outputs
  – Leaf: 1 input, 1 output
  – Merge: n inputs, 1 output
  – Stream: m inputs, n outputs
• Operations contain arbitrary user code
  – The actual number of operations executed is only known at runtime

Execution context
• Operations execute within processes, grouped into process collections
[Figure: flow graph operations mapped onto the processes master[0] and slaves[0..2]]
• Execution is fully asynchronous and data-driven

Checkpointing
• A serializable C++ object can be attached to each process
  – Represents the checkpointable process state
  – Processes of a collection use the same object type

  class A {
  public:
      int a;
      double b;
      std::vector<int> v;
  };

  class A : public ISerializable {
      CLASSDEF(A)
      MEMBERS
          ITEM(int, a)
          ITEM(double, b)
          ITEM(std::vector<int>, v)
      CLASSEND;
  };

• Objects can be compared
  – Determines the members modified by an operation

Message-passing state graph
• Application state: the state of all processes and the set of messages in transit, when no operation is running
• Transition: triggered by a message delivery (see the code sketch below)
[Figure: example state graph; nodes are application states, edges correspond to message deliveries]

Using flow graph information
[Figure: flow graph and process mapping used to identify message orderings that lead to equivalent states]
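The following is a minimal, self-contained C++ sketch of the state graph construction idea outlined above, given for illustration only: the State, deliver and explore names and the toy payload model are invented here, and this is not the DPS implementation. It explores all message delivery orderings depth-first and merges states reached through different orderings, so that the computation following a shared state is expanded only once.

  // Sketch only: toy model of state-graph exploration with state merging.
  #include <iostream>
  #include <set>
  #include <tuple>
  #include <utility>
  #include <vector>

  // Toy application state: per-process values plus the messages in transit.
  // In the framework, the per-process part corresponds to the checkpoint objects.
  struct State {
      std::vector<int> procs;                        // state of each process
      std::multiset<std::pair<int, int>> inTransit;  // (destination, payload)
      bool operator<(const State& o) const {
          return std::tie(procs, inTransit) < std::tie(o.procs, o.inTransit);
      }
  };

  // Delivering one in-transit message yields the successor state. Here delivery
  // merely accumulates the payload; a real run would execute the receiving
  // operation, possibly emit new messages, and take a new checkpoint.
  State deliver(State s, const std::pair<int, int>& msg) {
      s.inTransit.erase(s.inTransit.find(msg));
      s.procs[msg.first] += msg.second;
      return s;
  }

  // Depth-first construction of the state graph over all delivery orderings.
  // A state already reached through another ordering is not expanded again.
  void explore(const State& s, std::set<State>& visited, long& deliveries) {
      if (!visited.insert(s).second) return;         // equivalent state merged
      for (const auto& msg : s.inTransit) {
          ++deliveries;
          explore(deliver(s, msg), visited, deliveries);
      }
  }

  int main() {
      State initial{{0, 0}, {{0, 1}, {0, 2}, {1, 3}}};
      std::set<State> visited;
      long deliveries = 0;
      explore(initial, visited, deliveries);
      std::cout << visited.size() << " distinct states, "
                << deliveries << " message deliveries\n";
  }

In the actual framework, a stored checkpoint is restored instead of re-executing the application from its start; this is the time versus memory trade-off quantified on the following result slides.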
Floyd-Steinberg halftoning algorithm
• Number of messages delivered to execute all orderings

                           2 proc.    4 proc.    6 proc.    8 proc.
  All orderings            6.8·10^8   overflow   overflow   overflow
  Full state graph         338        3.9·10^4   overflow   overflow
  Optimized state graph    47         765        2.7·10^4   1.0·10^6

Pipelined LU factorization
• Number of messages delivered to execute all orderings

                           3 proc.    4 proc.
  All orderings            > 10^17    overflow
  Full state graph         4841       6.2·10^9
  Optimized state graph    4780       6.2·10^9

Trading time for space
• Limitations
  – Reexecution is serial
  – Graph and checkpoints are stored in RAM

        Run: 4 proc.       Testing: 4 proc.    Testing: 6 proc.
        [s]      [MB]      [s]       [MB]      [s]      [MB]
  FS    0.014    0.066     0.716     1.9       48       overflow
  LU    0.014    0.115     33804     302       19       overflow

Partial tests
• One can test part of the application
  – E.g. between two global synchronizations
  – The test may start from an arbitrary checkpoint

Heuristic: DFS
[Figure: depth-first exploration of the state graph]

Application to MPI
• Blocking point-to-point and collective calls only; no buffering is assumed
• Each MPI call represents one event
  – Inspired by Siegel and Avrunin

Application to MPI
[Figure: event sequences of four MPI processes P0..P3, each starting with an Init event Ii and ending with a Finalize event Fi; P0 performs two wildcard receives (r*,0), P1 performs a wildcard receive (r*,1) and a send to P0 (s1,0), P2 sends to P0 (s2,0), P3 sends to P1 (s3,1); the resulting state graph matches each send with the wildcard receives it may satisfy]

Mismatched buffer sizes

MPI implementation
• Methods for generating relevant orderings
  – Include non-blocking and buffered calls
  – ISP: Vakkalanka et al. (PPoPP'08)
• Checkpointing of MPI applications
  – Berkeley Lab Checkpoint/Restart library
• We must be able to compare checkpoints to merge identical states
  – Focus on the user-defined part of the checkpoint
  – Autoserial library: serialization and comparison of C++ objects

Conclusion
• Message-passing state graph construction
  – Parts common to multiple orderings are executed only once
  – Trades reexecution time for memory consumption
• Number of states reduced using a partial order reduction technique
  – The flow graph provides information about future computations
  – Uses information about how operations read and write local variables
• Heuristic to maximize "desynchronization" within the application
• Integration into the DPS framework
  – Testing by recompiling the application
  – Replay functionality runs within a traditional debugger
• Results are input- and parameter-specific
  – Actual data or number of processes is often irrelevant

Thank you
http://dps.epfl.ch
http://home.gna.org/autoserial