Workshop on the State of Art in Software Research, at UC Irvine Programming Language X10 on Multiple JVMs 2013/05/15 Kiyokuni (KIYO) Kawachiya and Mikio Takeuchi IBM Research - Tokyo © 2013 IBM Corporation IBM Research - Tokyo This Talk Focuses on “Managed X10” Managed X10 = X10 implementation on Java (i.e. managed runtime) – X10 program is translated into Java source code and compiled into Java bytecode, then executed on multiple Java VMs X10 Compiler Frontend X10 Source Parsing / Type Check X10 AST AST Optimizations AST Lowering AST: Abstract Syntax Tree XRX: X10 Runtime in X10 XRJ: X10 Runtime in Java XRC: X10 Runtime in C++ X10RT: X10 Comm. Runtime X10 AST C++ Backend XRC C++ Code Generation Java Code Generation C++ Source Java Source C++ Compiler XRX Java Compiler Native Code 2 2013/05/15 Kiyokuni Kawachiya XRJ Bytecode Native X10 Managed X10 JNI Native Env Java Backend X10RT Java VMs © 2013 IBM Corporation IBM Research - Tokyo “Hello World” in X10 X10 Program Parallel distributed execution by creating an activity at each place class MyHello { public static def main(Array[String]) { finish for (pl in Place.places()) { async at (pl) Console.OUT.println("Hello World from place " + here.id); }}} Compile and Execute $ x10c MyHello.x10 # compile $ ls MyHello.x10 MyHello$$Main.class MyHello.java MyHello$$Closure$0.class MyHello.class $ X10_NPLACES=4 x10 MyHello # execute Hello World from place 3 Hello World from place 0 Hello World from place 1 Hello World from place 2 3 2013/05/15 Kiyokuni Kawachiya X10 program is translated into Java (or C+) source code and further compiled Executed asynchronously on each place For C++ back-end: $ x10c++ -o MyHello MyHello.x10 $ X10_NPLACES=4 runx10 MyHello © 2013 IBM Corporation IBM Research - Tokyo Distributed Execution Model of X10 APGAS: Asynchronous Partitioned Global Address Space Address space Activity Place 0 Place 1 at async asyn c Creation at Place MAX_PLACES-1 ... c yn as at Place shift DistArray Object DistArray Local ref Remote ref Immutable data, class, struct, function A global address space is divided into multiple places (≒computer) – Place can contain activities and data (objects, structs, functions) – In Managed X10, each place is implemented by a Java VM An object belongs to a place where it is created – Object cannot be accessed from other places, but – Object can be remotely referenced from other places, using GlobalRef 4 2013/05/15 Kiyokuni Kawachiya © 2013 IBM Corporation IBM Research - Tokyo X10 Characteristics Java-like syntax, but uses var, val, def, ... Rich array class X10 24 /* Array example */ X10 1 class Sample[T] implements (String)=>String { 2 var data:T; 25 val pt = [2,4] as Point; // Point{rank==2} Operators can 3 def this(d:T) { data = d; } // constructor 26 val R1 = (1..2)*(3..5); // Region{rank==2} 4 public operator this(str:String) = str + data; also be defined 27 val arr = new Array[Int](R1); // Array[Int]{rank==2} 5 28 arr(2,4) = 8; 6 static struct MyPair[T,U](fst:T,snd:U) { 29 Console.OUT.println(arr(pt)); // -> 8 7 public def toString() = "(" + fst + "," + snd +")"; 30 /* Parallel processing */ 8 } 31 var m:Int = 0; val i = 1; // mutable/immutable data 9 32 finish async { m = i * 2; } 10 public static def main(args:Array[String](1)) { 33 Console.OUT.println(m); // -> 2 11 /* Class example */ 34 /* Distributed processing */ Parallel/distributed 12 val o = new Sample[Double](1.2); 35 at (here.next()) o.data = 3.4; // copy of o is modified processing 13 Console.OUT.println(o.data); // -> 1.2 36 Console.OUT.println(o.data); // -> 1.2 14 Console.OUT.println(o("Data is ")); // -> Data is 1.2 37 /* GlobalRef example */ 15 var a:Any = o; 38 val g = GlobalRef(o); 16 Console.OUT.println(o.typeName());// -> Sample[x10.lang.Double] 39 at (here.next()) { at (g.home) g().data = 5.6; } 17 /* Struct example */ 40 Console.OUT.println(o.data); // -> 5.6 18 val p = MyPair[Int,Double](1,2.3); 41 } Global data access 19 Console.OUT.println(p); // -> (1,2.3) 42 } 20 val x = 4; Strong type system – type 21 /* Function example */ 22 val q = MyPair[(Int)=>Int, Int]((i:Int)=>i*x, 5); parameters are not erased 23 Console.OUT.println(q.fst(q.snd)); // -> 20 New 1st class data types, structs and functions 5 2013/05/15 Kiyokuni Kawachiya Primitive types (Int, Double, ...) are defined as structs © 2013 IBM Corporation IBM Research - Tokyo Compiling X10 to Java 1 class B { 2 def w() = this; 3} 4 class C(p:Int) extends B { 5 val q:Int; var r:Int; 6 def u() { } 7 private def v() { } 8 def w() = super.w(); 9 def this(p:Int) { 10 property(p); 11 q = 1; 12 } 13 static val s = Place.MAX_PLACES; 14 static def t() { } 15 } (in X10 2.3.0) X10 Class/field names are basically same Constructor is separated into Java constructor and field initializer – to support inlining by front-end For private instance method and nonvirtual call, additional bridge methods are generated x10c 1 import x10.core.concurrent.AtomicInteger; Java 2 import x10.lang.ExceptionInInitializer; 3 import x10.lang.Place; 4 import x10.runtime.impl.java.InitDispatcher; 5 public class C extends B { 6 public int p; // property 7 public int q, r; // instance fields 8 // constructor just for allocation 9 public C(System[] $dummy) { super($dummy); } 10 // instance field initializer 11 final public C C$$init$S(int p) { 12 this.p = p; 13 this.r = 0; 14 this.q = 1; 15 return this; 16 } 17 // 1-phase constructor for Java interop programmers 18 public C(int p) { 19 this((System[]) null); C$$init$S(p); 20 } 21 public void u() {} // instance methods 22 private void v() {} 23 // bridge for private instance method 24 public static void v$P(C C) { C.v(); } 25 public B w() { return super.w(); } 26 // bridge for non-virtual call 27 public B B$w$S() { return super.w(); } 28 public static void t() {} // static method (some lines are omitted) – to support inlining by front-end 6 2013/05/15 Kiyokuni Kawachiya © 2013 IBM Corporation IBM Research - Tokyo Distributed Execution using Multiple JVMs Multi-JVM is supported in X10 2.1.2 – Each X10 “Place” uses its own Java VM – Uses common X10RT (in C++) through JNI “x10” command Place P JVM Initiate Places via SSH x10.runtime.impl.java.Runtime Run activity at place P Invoke function Serialize function into byte array Deserialize function from byte array Send byte array Receive byte array from place P JNI (Java->C++) JNI (Java->C++) x10rt_send_msg x10rt_probe Communication Physical Layer (e.g. socket, MPI) Communication Physical Layer (e.g. socket, MPI) Kiyokuni Kawachiya Msg Buffer Byte array message Network 2013/05/15 Place Q x10.runtime.impl.java.Runtime Byte array message 7 JVM Uses backend unique wire format in X10 2.2.0 © 2013 IBM Corporation IBM Research - Tokyo Remote Reference using GlobalRef X10 1 class GRefExample { 2 static class ResultBox { var value:Int = 0; } 3 public static def main(Array[String]) { 4 val place1 = here.next(); 5 val o = new ResultBox(); 6 val g = GlobalRef[ResultBox](o); 7 finish { 8 at (place1) async { // create an activity at place1 9 val r = do_long_calculation(g); 10 at (g) g().value = r; // set the result through GlobalRef 11 } 12 do_some_calculation_locally(); 13 } // end of finish 14 Console.OUT.println(o.value); 15 } } Figure. Distributed Processing with GlobalRef. Place 0 Place 1 o g (at Line 6) at g() g (at Line 9) at g (at Line 10) This object must not be collected while its GlobalRefs (g) exist in other places GlobalRef is a special struct to hold a global reference to an object – Created by “GlobalRef[T](obj)” and cannot be modified – The object can be accessed by “at (g) g()...” 8 2013/05/15 Kiyokuni Kawachiya © 2013 IBM Corporation IBM Research - Tokyo Garbage Collection in Managed X10 X10 data is represented by Java objects and collected by each JVM’s GC – However, remote reference is not a reference in the JVM level Place 0 Place 1 Even if A disappears, C must not be collected D (GlobalRef) C A X10 environment on multiple JVMs This remote reference is not visible to underlying JVMs B 123 JVM0 JVM1 GC Computer 0 GC Computer 1 Network Figure. Remote References across JVMs. In old X10, remotely-referenced (globalized) objects were registered into a management table and never collected Î We needed better implementation 9 2013/05/15 Kiyokuni Kawachiya © 2013 IBM Corporation IBM Research - Tokyo Behavior of the Distributed GC XRJ: X10 Runtime in Java GOT: Globalized Object Tracker RRT: Remote Reference Tracker Creating and using a GlobalRef X10 1 class GRefExample { 2 static class ResultBox { var value:Int = 0; } 3 public static def main(Array[String]) { 4 val place1 = here.next(); 5 val o = new ResultBox(); 6 val g = GlobalRef[ResultBox](o); // g 7 finish { 8 at (place1) async { // g’ 9 val r = do_long_calculation(g); 10 at (g) g().value = r; // g’’ 11 } 12 do_some_calculation_locally(); 13 } // end of finish 14 Console.OUT.println(o.value); 15 } } Place 0 g’’ Place 1 obj home=P0 id=123 g g’ (at Place 1) o obj obj=null home=P0 home=P0 id=0 id=123 id=123 GOT for o id2got 123 got2id 123 XRJ RRT for g’ weakRef weakRef strongRef home=P0 rCnt=0 rCnt=10 rCnt=8 id=123 id=123 wCnt=8 wCnt=10 Computer 0 rrtTable XRJ Computer 1 Network Figure. Data Structures for the Distributed GC. 10 2013/05/15 Kiyokuni Kawachiya © 2013 IBM Corporation IBM Research - Tokyo Behavior of the Distributed GC Collecting the globalized object, and related data X10 1 class GRefExample { 2 static class ResultBox { var value:Int = 0; } 3 public static def main(Array[String]) { 4 val place1 = here.next(); 5 val o = new ResultBox(); 6 val g = GlobalRef[ResultBox](o); // g 7 finish { 8 at (place1) async { // g’ 9 val r = do_long_calculation(g); 10 at (g) g().value = r; // g’’ 11 } 12 do_some_calculation_locally(); 13 } // end of finish 14 Console.OUT.println(o.value); 15 } } Place 0 g’’ XRJ: X10 Runtime in Java GOT: Globalized Object Tracker RRT: Remote Reference Tracker Place 1 obj home=P0 id=123 g g’ (at Place 1) o obj obj=null home=P0 home=P0 id=123 id=123 GOT for o id2got 123 got2id 123 XRJ RRT for g’ weakRef weakRef strongRef home=P0 rCnt=0 rCnt=8 id=123 id=123 wCnt=8 Computer 0 rrtTable XRJ Computer 1 Network Figure. Data Structures for the Distributed GC. Extra inter-place comm. is performed only when a remote ref. is removed 11 2013/05/15 Kiyokuni Kawachiya © 2013 IBM Corporation IBM Research - Tokyo Evaluation: Distributed Fibonacci X10 1 class DistFib { 2 static val atomicI = new x10.util.concurrent.AtomicInteger(0); 3 static def getAvailPlace() { // returns next available place 4 val i = at (Place.FIRST_PLACE) atomicI.incrementAndGet(); 5 return Place(i % Place.MAX_PLACES); 6 } 7 8 private val root = GlobalRef[DistFib](this); // ref to root obj 9 transient var v:Int; // in-out param, exists only in the root 10 def this(n:Int) { v = n; } 11 def compute() { // global method, can be invoked at any place 12 val n = at (root) root().v; // get the input from the root object 13 if (n < 2) return; // Fib(0)==0, Fib(1)==1 14 val f1 = new DistFib(n-1), f2 = new DistFib(n-2); 15 finish { 16 at (getAvailPlace()) async f1.compute(); // compute remotely 17 f2.compute(); // compute here 18 } 19 val r = f1.v + f2.v; // result = Fib(n-1) + Fib(n-2) 20 at (root) root().v = r; // set the result to the root object 21 } 22 23 public static def main(args:Array[String](1)) { 24 var n:Int = 10; if (args.size > 0) n = Int.parseInt(args(0)); 25 val f = new DistFib(n); f.compute(); // this line is measured 26 Console.OUT.println(f.v); 27 } Figure. Distributed Fibonacci. 28 } 12 2013/05/15 Kiyokuni Kawachiya Calculates n-th Fibonacci number recursively using multiple places – Field root is a GlobalRef that points to the “root object” – Field v of the root object is accessed by “at (root) root().v ...” Place 0 Place 1 DistFib(3) Place 2 DistFib(2), copied root DistFib(1), copied root v=3 root v = undef DistFib(2) at DistFib(1) root root v=2 v=1 DistFib(1) v = undef at DistFib(0) root root v=1 v=0 © 2013 IBM Corporation Environment: 6-core Xeon X5670 blades (HT disabled) 32GB memory, 40Gbps Infiniband Red Hat Enterprise Server Linux 5.5 x86_64 IBM J9 Java VM for Linux x86_64, Version 6.0 SR9 Generational GC, -Xms4m –Xmx512m 16 places by starting 4 JVMs on each of 4 blades IBM Research - Tokyo Result: Distributed Fibonacci Execution times to calculate n-th Fibonacci numbers using 16 places Elapsed time in msec (smaller is better) * Logarithmic scale 100,000 Base Dist. GC 10,000 1,000 100 10 n = 5 (5) 10 (55) 15 (610) 20 (6765) 25 (75025) Index (and result) Figure. Execution Times of Distributed Fibonacci. For very small n’s, base X10 is faster by about 10% – Because of the small overhead in the Dist. GC version For n=25, distributed GC version is 13% faster – Because heap is not consumed by uncollectable objects – Heap size of Place 0 reaches 468MB in base, but 148MB in Dist. GC 13 2013/05/15 Kiyokuni Kawachiya © 2013 IBM Corporation IBM Research - Tokyo Summary of the Talk Explained implementation details of Managed X10 (X10 on Java VMs) Challenges in compiling X10 to Java – For performance and functionality Distributed GC over Multiple Java VMs – No modification to the underlying JVMs – Introduced two data structures for monitoring globalized objects and remote references 14 2013/05/15 Kiyokuni Kawachiya © 2013 IBM Corporation