Slides

The 2014 X10 Workshop (X10'14), June 12, 2014
Porting MPI based HPC Applications to X10
Hiroki Murata, Michihiro Horie, Koichi Shirahata†,
Jun Doi, Hideki Tai, Mikio Takeuchi, Kiyokuni Kawachiya
IBM Research – Tokyo, †Tokyo Institute of Technology
This material is based upon work supported by the Department of Energy
under award DE-FOA-0000619.
© 2014 IBM Corporation
Motivation: feasibility study of X10 as a language for HPC
 There are many HPC applications that are written in message-passing style.
 Re-implementing those applications with native X10 communication primitives may
outperform the original code, but such a rewrite takes time.
 Porting those applications to X10 while preserving the original code structure is
possible, but the performance may degrade.
 Are there any X10 idioms and modification patterns that avoid this performance
degradation?
Contributions
 Implementation of two proxy applications in X10, tuned to achieve performance
nearly comparable to the original code
 X10 idioms and modification patterns, including APGAS code that can replace
MPI-based communication patterns
 Performance evaluation of the ported code
CoMD (Co-design for Molecular Dynamics)
 A proxy application for Molecular Dynamics (MD) simulation
 Simulation of materials with inter-atomic potential
– Supported potentials
• Lennard-Jones potential
• Embedded-Atom Method potential
– Forces between all atom pairs within the cutoff distance are
evaluated.
 Cartesian spatial decomposition of atoms across nodes
 Atoms on a node are assigned to cubic link cells whose edge length is slightly
larger than the cutoff distance
 Implementation
– C+MPI
MCCK (Monte Carlo Communication Kernel)
 A proxy application implementing a domain-decomposed particle tracking algorithm
 Simulation of randomly moving particles
– Initially, a number of particles exist on each node
– A particle can move from its node into one of the neighboring nodes
– As time advances, particle fission occurs
 Implementation
– C+MPI
Porting strategy
 Keep the original code structure as much as possible
 Map an MPI process to an X10 place
 Map a C array to an X10 Rail
 Map a C struct to an X10 class
 Transform MPI point-to-point communication into Rail.asyncCopy method calls
(see the sketch below)
 Then apply further modifications to achieve comparable performance
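
As a rough illustration of these mappings (the Atom class below is hypothetical and
not taken from CoMD or MCCK), a C struct with an array member maps onto an X10
class holding a Rail:

// Hypothetical C struct:
//   typedef struct { double r[3]; int id; } Atom;
// ported to an X10 class (C struct -> X10 class, C array -> X10 Rail):
public class Atom {
  val r = new Rail[Double](3);   // double r[3]  -> Rail[Double] of size 3
  var id:Int = 0n;               // int id       -> Int
}
// MPI point-to-point sends of such data become one-sided copies, e.g.
//   Rail.asyncCopy[Double](localBuf, 0, remoteRef, 0, len);
// where remoteRef is a GlobalRail obtained from the receiving place.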
Basic modifications
 Memory allocation
 Object access
 Others
Basic modifications (Memory allocation)
1. Replace local variables with field variables
2. Move allocation statements out of loops
3. Allocate objects on the stack (with @StackAllocate, experimental; a sketch follows
the code below)
1.
// C code
void sort(Link* boxes, ...) {
  int nA = boxes->nA[iBox];
  Msg tmp[nA];                 // local array, allocated on every call
  ...
}

// X10 code
public class Calc {
  val tmp = new Rail[Msg](Links.MAXA);   // field, allocated once
  public def sort(boxes:Links.Link, ...):void {
    val nA = boxes.nA(iBox);
    ...
  }
}

2.
// C code
int force(SF* s) {
  ...
  for (int iBox=0; ...) {
    ...
    for (int iOff= ...) {
      ...
      for (int jOff= ...) {
        ...
        double pT;               // allocated in the innermost loop
        ip(&pT);
        ...
}}}}

// X10 code
public def force(s:Types.SF):Int {
  val pT = new Cell[Double](0.0);   // allocated once, outside the loops
  ...
  for (var iBox:Int=0n; ...) {
    ...
    for (var iOff:Int=...) {
      ...
      for (var jOff:Int = ...) {
        ...
        ip(pT);
        ...
}}}}
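
The slide shows code for modifications 1 and 2 only. Below is a minimal sketch of
modification 3; the Vec3 class and the compute method are hypothetical, and the
double-annotation idiom (@StackAllocate on both the val and the new expression)
follows the pattern used inside the X10 runtime sources, so treat it as an assumption
rather than the authors' code.

// X10 code (sketch)
import x10.compiler.StackAllocate;

public class StackAllocDemo {
  // Hypothetical 3-component vector, used only for illustration
  static class Vec3 { var x:Double; var y:Double; var z:Double; }

  public static def compute():Double {
    var sum:Double = 0.0;
    for (var i:Long = 0; i < 100; i++) {
      // 3. Allocate the per-iteration temporary on the stack instead of the
      //    heap (experimental; honored by the native backend)
      @StackAllocate val tmp = @StackAllocate new Vec3();
      tmp.x = i as Double; tmp.y = tmp.x * 2.0; tmp.z = tmp.x * 3.0;
      sum += tmp.x + tmp.y + tmp.z;
    }
    return sum;
  }
}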
Basic modifications (Object access)
1. Use val as much as possible
2. Unify common subexpressions
3. Hoist loop invariants
4. Flatten a multi-dimensional array index and its loop
5. Replace short arrays with scalar variables
// C code
int force(SF* s) {
  ...
  for (int iBox=0; ...) {
    ...
    for (int iOff= ...) {
      ...
      s->atom->p[iOff][0] -= dP * dr / r;
      s->atom->p[iOff][1] += dP * dr / r;
      ...
      for (int jOff= ...) {
        double d[3];
        ...
        double r2 = 0.0;
        for (int m=0; m<3; m++) {
          d[m] = s->atom->r[jOff][m];
          r2 += d[m] * d[m];
        }
        ...
}}}}

// X10 code
public def force(s:Types.SF):Int {
  val pT = new Cell[Double](0.0);
  ...
  for (var iBox:Int=0n; ...) {
    ...
    for (var iOff:Int=...) {
      ...
      val tmp2 = dP.value * dr / r;     // 1., 2. common subexpression held in a val
      s.atom.p(iOff)(0) -= tmp2;
      s.atom.p(iOff)(1) += tmp2;
      ...
      val r = s.atom.r;                 // 3. loop invariant hoisted out of the jOff loop
      for (var jOff:Int = ...) {
        val jOff3 = jOff*3;             // 4. multi-dimensional index flattened
        ...
        var r2:Double = 0.0;
        val d0 = r(jOff3);              // 5. short array d[3] replaced with d0, d1, d2
        r2 += d0 * d0;
        val d1 = r(jOff3+1);
        r2 += d1 * d1;
        val d2 = r(jOff3+2);
        r2 += d2 * d2;
        ...
}}}}
Basic modifications (Others)
1. Declare classes as final
2. Forcibly inline methods called in loops (with @Inline; a sketch for 1 and 2 follows
the code below)
3. Prepare a dedicated method

// C code
typedef struct {
  double x; double y; double z;
  int absorbed;
} Particle;

int compare(const void *pp1, const void *pp2) {
  Particle *p1 = (Particle *) pp1;
  Particle *p2 = (Particle *) pp2;
  if (p1->absorbed && !p2->absorbed) return 1;
  else if (!p1->absorbed && p2->absorbed) return -1;
  else return 0;
}

static int pack() {
  Particle *p = mc->particles; int np = mc->nparticles;
  int count;
  qsort(p, np, sizeof(Particle), compare);
  for (count = 0; count < np; ++count)
    if (p[count].absorbed) break;
  return count;
}

// X10 code
public class Particle {
  var x:Double; var y:Double; var z:Double;
  var absorbed:Boolean;
  static def compare(p1:Particle, p2:Particle):Int {
    if (p1.absorbed && !p2.absorbed) return 1n;
    else if (!p1.absorbed && p2.absorbed) return -1n;
    else return 0n;
  }
}

public class MC_Cycle {
  private def pack():Int {
    val p = mc.particles(); val np = mc.nparticles as Long;
    var count:Int;
    //RailUtils.qsort[Particle](p,0,np-1,(x:Particle,y:Particle)=>Particle.compare(x,y));
    myqsort(p, 0, np-1);            // 3. dedicated sort method instead of the generic qsort
    for (count = 0n; count < np; ++count)
      if (p(count).absorbed) break;
    return count;
  }
  static def myqsort(a:Rail[Particle], lo:Long, hi:Long):void {
    if (hi <= lo) return;
    var l:Long = lo - 1; var h:Long = hi; var temp:Particle;
    while (true) {
      while (Particle.compare(a(++l), a(hi)) < 0) ;
      while (Particle.compare(a(hi), a(--h)) < 0 && h > lo) ;
      if (l >= h) break;
      temp = a(l); a(l) = a(h); a(h) = temp;
    }
    temp = a(l); a(l) = a(hi); a(hi) = temp;
    myqsort(a, lo, l-1); myqsort(a, l+1, hi);   // recurse around the pivot, now at index l
  }
}
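
Modifications 1 and 2 have no code on this slide. The following is a small sketch under
assumed names (LinkCells, count, and total are hypothetical); it only illustrates where
final and @Inline would be placed, not how the actual ports use them.

// X10 code (sketch)
import x10.compiler.Inline;

// 1. Declare the class final so that calls on it can be devirtualized
public final class LinkCells {
  val nAtoms:Rail[Int];
  def this(nBoxes:Long) { nAtoms = new Rail[Int](nBoxes); }

  // 2. Force inlining of a small method that is called inside hot loops
  @Inline public def count(iBox:Long):Int = nAtoms(iBox);

  public def total():Long {
    var sum:Long = 0;
    for (var iBox:Long = 0; iBox < nAtoms.size; iBox++) {
      sum += count(iBox);   // inlined by the compiler because of @Inline
    }
    return sum;
  }
}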
Modifications related to MPI
 Point-to-Point communication
– CoMD uses the send-receive operation (MPI_Sendrecv).
– MCCK uses the sequence of MPI_Barrier, MPI_Isend, MPI_Irecv, and MPI_Waitall.
This sequence can be implemented by executing a barrier synchronization and then
moving to the other place and accessing the data, as sketched after this list.
 Collective operations
– Almost all of the collective operations are provided by the X10 “Team” API.
– However, MPI_MINLOC and MPI_MAXLOC are not.
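
A hedged sketch of the MCCK-style exchange mentioned above (class and method names
are hypothetical; the actual port may differ): after a collective barrier, each place
moves to its neighbor and pulls the data with Rail.asyncCopy.

// X10 code (sketch)
import x10.util.Team;

public class Exchange {
  // One send buffer per place, published through a PlaceLocalHandle
  val sendBuf:PlaceLocalHandle[Rail[Double]];
  def this(n:Long) {
    sendBuf = PlaceLocalHandle.make[Rail[Double]](Place.places(),
        () => new Rail[Double](n));
  }

  // Emulates MPI_Barrier + MPI_Isend/Irecv/Waitall with one neighbor
  public def exchangeWith(neighbor:Place, recvBuf:Rail[Double]):void {
    Team.WORLD.barrier();               // every place has filled its sendBuf
    val recvRef = GlobalRail(recvBuf);  // remote reference to my recvBuf
    val buf = sendBuf;                  // capture the handle, not the Rail
    finish at (neighbor) async {
      // Runs at the neighbor: copy its send buffer into my recvBuf
      Rail.asyncCopy[Double](buf(), 0, recvRef, 0, buf().size);
    }
  }
}

Every place must call exchangeWith in the same step, since Team.WORLD.barrier() is a
collective operation; the finish guarantees that the asyncCopy has completed before
the method returns.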
Modifications related to MPI (P2P send-receive)
// C code
void sendReceive(void* sendBuf, int sendLen, int dest,
                 void* recvBuf, int recvLen, int source,
                 int* recvdLen) {
  MPI_Status status;
  MPI_Sendrecv(sendBuf, sendLen, MPI_BYTE, dest, 0,
               recvBuf, recvLen, MPI_BYTE, source, 0,
               MPI_COMM_WORLD, &status);
  MPI_Get_count(&status, MPI_BYTE, recvdLen);
}
// X10 code
public class Communicator {
  val startLatch:PlaceLocalHandle[Rail[MyLatch]];
  val finishLatch:PlaceLocalHandle[Rail[MyLatch]];
  def this() {
    this.startLatch =
      PlaceLocalHandle.make[Rail[MyLatch]](Place.places(),
        ()=>new Rail[MyLatch](Place.MAX_PLACES));
    this.finishLatch =
      PlaceLocalHandle.make[Rail[MyLatch]](Place.places(),
        ()=>new Rail[MyLatch](Place.MAX_PLACES));
  }
  public def sendReceive[T](
      sendBuf:PlaceLocalHandle[Rail[T]],
      sendLen:PlaceLocalHandle[Rail[Int]], dest:Int,
      recvBuf:PlaceLocalHandle[Rail[T]], recvLen:Int,
      source:Int, recvdLen:PlaceLocalHandle[Rail[Int]]) {
    val me = here.id() as Int;
    if (me == dest && me == source) {
      // Self exchange: a local copy is sufficient
      val len = sendLen()(0);
      Rail.copy[T](sendBuf(), 0, recvBuf(), 0, len as Long);
      recvdLen()(0) = len;
    } else {
      finish {
        // 1. Notify the "dest" that "me" is ready to send,
        //    so that the "dest" can trigger asyncCopy(me -> dest)
        at (Place.place(dest)) async {
          startLatch()(me).release();
        }
        // 2.1. Wait for the "source" to become ready to send
        val recvBufRef = GlobalRail(recvBuf());
        val recvdLenRef = GlobalRail(recvdLen());
        startLatch()(source).await();
        // Now both recvBuf at "me" and sendBuf at the "source"
        // are ready for asyncCopy
        // 2.2. Perform asyncCopy(source -> me)
        at (Place.place(source)) async {
          val len2 = sendLen()(0) as Long;
          finish {
            Rail.asyncCopy[T](sendBuf(), 0, recvBufRef, 0, len2);
            Rail.asyncCopy[Int](sendLen(), 0, recvdLenRef, 0, 1);
          }
          // Notify the "source" of the completion
          // of asyncCopy(source -> me)
          finishLatch()(me).release();
        }
        // 3. Wait for a notification of the completion
        //    of asyncCopy(me -> dest)
        finishLatch()(dest).await();
      }
    }
  }
}
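
For reference, a hedged usage sketch of sendReceive (the ring neighbors and buffer
setup are hypothetical, not taken from the CoMD port); as with MPI_Sendrecv, every
place must make a matching call.

// X10 code (sketch)
class SendReceiveDemo {
  static def exchangeRing(comm:Communicator,
      sendBuf:PlaceLocalHandle[Rail[Double]],
      sendLen:PlaceLocalHandle[Rail[Int]],
      recvBuf:PlaceLocalHandle[Rail[Double]],
      recvdLen:PlaceLocalHandle[Rail[Int]]):void {
    finish for (p in Place.places()) at (p) async {
      val me = here.id() as Int;
      val n = Place.MAX_PLACES as Int;
      val dest = (me + 1) % n;          // send to the right neighbor
      val source = (me + n - 1) % n;    // receive from the left neighbor
      comm.sendReceive[Double](sendBuf, sendLen, dest,
          recvBuf, recvBuf().size as Int, source, recvdLen);
    }
  }
}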
Modifications related to MPI (Collective MPI_MINLOC)
// C code
typedef struct {
  double value;
  int rank;
} RankReduceData;

void minRankDouble(
    RankReduceData* sBuf, RankReduceData* rBuf, int count) {
  MPI_Allreduce(sBuf, rBuf, count,
    MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
}

// X10 code
public class Communicator {
  static class RankReduceData {
    var value:Double = 0.0;
    var rank:Int = 0n;
  }
  public def minRankDouble(sendBuf:Rail[RankReduceData],
      recvBuf:Rail[RankReduceData], count:Int):void {
    val sendBuf2 = new Rail[Double](count);
    val recvBuf2 = new Rail[Double](count);
    for (var i:Int = 0n; i < count; i++) {
      sendBuf2(i) = sendBuf(i).value;
    }
    team.allreduce[Double](sendBuf2, 0, recvBuf2, 0,
      count as Long, Team.MIN);
    for (var i:Int = 0n; i < count; i++) {
      recvBuf(i).rank = team.indexOfMin(sendBuf(i).value,
        sendBuf(i).rank);
      recvBuf(i).value = recvBuf2(i);
    }
  }
}
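
MPI_MAXLOC could be emulated in the same shape as the method above, as an additional
method of the Communicator class. This sketch assumes that Team.indexOfMax is
available as the sibling of the indexOfMin call used above; treat that call as an
assumption rather than confirmed API.

// X10 code (sketch): companion to minRankDouble above
public def maxRankDouble(sendBuf:Rail[RankReduceData],
    recvBuf:Rail[RankReduceData], count:Int):void {
  val sendBuf2 = new Rail[Double](count);
  val recvBuf2 = new Rail[Double](count);
  for (var i:Int = 0n; i < count; i++) {
    sendBuf2(i) = sendBuf(i).value;
  }
  team.allreduce[Double](sendBuf2, 0, recvBuf2, 0,
    count as Long, Team.MAX);                            // MAX instead of MIN
  for (var i:Int = 0n; i < count; i++) {
    recvBuf(i).rank = team.indexOfMax(sendBuf(i).value,  // assumed sibling of indexOfMin
      sendBuf(i).rank);
    recvBuf(i).value = recvBuf2(i);
  }
}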
Evaluation environment (common to the following measurements):
– IBM Power775 (13x POWER7 3.84 GHz (8x4 HWT, 128GB mem)), RHEL 6.2
– X10 2.4.1 native backend (compile option: -x10rt pami -O -NO_CHECKS)
– C/C++ compiler: XL C V12.1 (compile option: -O3 -qinline)
– Communication runtime: PAMI (X10) and MPI (Original)
Evaluation (CoMD w/ Lennard-Jones potential)
 Scales well up to 392 nodes. Up to 11% slower than the original.
 The performance loss is due to communication: the data and the received length are
sent with two async copies, rather than with a single send-receive as in the original.

[Figure: Weak scaling performance of CoMD with Lennard-Jones (LJ) potential
(problem size: 256000 atoms/node). X-axis: Number of Nodes (Places).
Series: Original, X10, Original calculation part, X10 calculation part.]
Evaluation (CoMD w/ Embedded-Atom potential)
 Scales well up to 392 nodes. Up to 25% slower than the original.
 The performance loss is due to both calculation and communication factors.
– For the calculation, a table (an array) is looked up, which is slower than in the
original.
– For the communication, the data and the received length are sent with two async
copies, rather than with a single send-receive as in the original. Furthermore, it
communicates more data than the LJ potential version.

[Figure: Weak scaling performance of CoMD with Embedded-Atom Method (EAM) potential
(problem size: 256000 atoms/node). X-axis: Number of Nodes (Places).
Series: Original, X10, Original calculation part, X10 calculation part.]
Evaluation (MCCK)
 Scales well up to 360 places (limited by memory size). Up to 30% faster than the
original on one place. Up to 17% slower, and on average 1.8% slower, than the original
on multiple places.
 The performance gain on one place is due to the fast calculation and the slow
communication.
– The calculation (i.e. sorting) is fast, since it is implemented as a dedicated method
and the comparator function is inlined. Furthermore, it sorts an array of references
rather than an array of structs.
– The communication is slow, since it emulates point-to-point communication.
– The communication time varies irregularly, but its rate of variation is similar to
that of the original.

[Figure: Weak scaling performance of MCCK
(problem size: 20000000 particles/node, 0.2 leakage). X-axis: Number of Nodes (Places).
Series: Original, X10, Original calculation part, X10 calculation part.]
Conclusion
 Explained X10 idioms and modification patterns for porting two proxy applications,
CoMD and MCCK, to X10
 Their performance was measured on an IBM Power 775 cluster
– The X10 ports of both applications scaled comparably to the originals.
– The X10 ports of CoMD with the Lennard-Jones and Embedded-Atom Method
potentials were up to 11% and 25% slower than the original, respectively.
– The X10 port of MCCK performed close to the original.