19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois A Proposal of Operation History Management System for Source-to-Source Optimization of HPC Programs Yasushi Negishi, Hiroki Murata and Takao Moriyama Deep Computing, Tokyo Research Laboratory, IBM Research 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Outline of this Presentation 1.Proposal of an algorithm for managing operation history of source-to-source optimization. 2.Prototype system with new user interface for managing operation history explicitly. 2 © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Outline of this Presentation 1.Proposal of an algorithm for managing operation history of source-to-source optimization. 2.Prototype system with new user interface for managing operation history explicitly. 3 © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Background Improvement of single processor performance is stopping, and architectures of supercomputers is becoming more complex. – Architecture-specific optimizations are needed to utilize various kinds of network and processor architectures to achieve reasonable performance. Application areas for numerical simulations continue to expand. –We need solve performance issues more effectively and more easily. Source-to-source optimization tools are becoming important. –Automatic conversion (a.k.a. refactoring) for optimization –Support typical architecture-specific and application-specific performance optimization patterns. –Reduce programmer’s time and human errors by supporting routine but troublesome optimization. 4 © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Typical Source-to-Source Optimization Steps Strength reduction Optimization steps are combinations of automatic conversion and manual editing – Replace costly operation with an equivalent but less expensive operation • E.g. x = r ** (-1) x = 1 / r – Steps 1. Modify the code to use less expensive operation by manual editing Loop unrolling & SIMDization – Use SIMD instructions If compiler does not generate optimal SIMD instructions in a loop • E.g. x(i) = a(i) + b(i) * c(i) x(i) = FPMADD(a(i), b(i), c(i)) • x(i+1) = a(i+1) + b(i+1) * c(i+1) – Steps 1. Unroll the loop by automatic conversion with specifying the range and unroll factor. 2. Modify the unrolled loop body with in-line assemble code for SIMD by manual editing Loop tiling (a.k.a. loop blocking, strip mine and interchange) – Change loop structure to increase memory access locality and cache hit ratio. • E.g. for (i=0; i<N; i+= Bi) for (i=0; i<N; i++) for (j=0; j<N; j+= Bj） for (j=0; j<N; j++) for (ii=i; ii<min(i+Bi,N); ii++) c[i] = c[i]+ a[i,j]*b[j]; for (jj=j; jj<min(j+Bj,N); jj++) c[ii] =c[ii]+ a[ii,jj]*b[jj]; – Steps 1. Modify the loop by automatic conversion with specifying the range and blocking factors. 5 © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois “Reapplication Conflict” Because of trial-and-error nature of optimization work, it is sometimes required to undo an operation in the past or to insert or change operation in the past even if a single user manages the code. We call this conflict caused by a single user as “Reapplication Conflict”. System for supporting Source-to-Source optimization should handle this conflict correctly. 6 © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Issues of Existing Version Management Systems Handling “Reapplication Conflict” Because of trial-and-error nature of optimization work, it is sometimes required to undo an operation in the past or to insert or change operation in the past even if a single user manages the code. –We call this conflict caused by a single user as “Reapplication Conflict”. System should handle this conflict correctly. Existing version management systems use algorithm of “patch” command or similar one to handle conflicts. But the patch algorithm has a issue. –As for modification by manual editing, the patch algorithm works fine. • The algorithm applies difference by an operation on different base code, with adjusting target range to be applied. –As for modification by automatic conversion, the patch algorithm may generate unexpected results. Scenario in which existing system does not work expectedly is shown. 7 © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Example Scenario of “Reapplication Conflict” (original) program sample implicit none integer i, n parameter(n=10000000) real*8 a, b, pi, x(n), sin, s, t1, t2, t3, rtc a=0 b=0 pi = 3.14159265d0 s = rtc() do i = 1, n x(i) = i * sin(i / (pi * 4.0d0)) enddo t1 = rtc() - s s = rtc() do i = 1, n a = a + x(i) ** (-1) enddo t2 = rtc() - s s = rtc() do i = 2, n b = b + ((x(i) + a) / (pi * 4.0d0) + 1.0d0) enddo t3 = rtc() - s write(*,*) 'a=', a, 'b=', b write(*,*) 'time=', t1, t2, t3 end 8 Original Original code is checked out. © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Example Scenario of “Reapplication Conflict” (Step 1) program sample implicit none integer i, n parameter(n=10000000) real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc a=0 b=0 pi = 3.14159265d0 s = rtc() fourpi = pi * 4.0d0 do i = 1, n x(i) = i * sin(i / fourpi) enddo t1 = rtc() - s s = rtc() do i = 1, n a = a + x(i) ** (-1) enddo t2 = rtc() - s s = rtc() do i = 2, n b = b + ((x(i) + a) / fourpi + 1.0d0) enddo t3 = rtc() - s write(*,*) 'a=', a, 'b=', b write(*,*) 'time=', t1, t2, t3 end 9 Original Operation A Original: Step 1: Step 1: Do loop invariant code motion by manual editing, and check it in © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Example Scenario of “Reapplication Conflict” (Step 2) program sample implicit none integer i, n parameter(n=10000000) real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc a=0 b=0 pi = 3.14159265d0 s = rtc() fourpi = pi * 4.0d0 do i = 1, n x(i) = i * sin(i / fourpi) enddo t1 = rtc() - s s = rtc() do i = 1, n a = a + 1.0d0 / x(i) enddo t2 = rtc() - s s = rtc() do i = 2, n b = b + ((x(i) + a) / fourpi + 1.0d0) enddo t3 = rtc() - s write(*,*) 'a=', a, 'b=', b write(*,*) 'time=', t1, t2, t3 end 10 Original A B Original: Step 1: Step 2: Step 2: Do strength reduction by manual editing, and check it in. © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Example Scenario of “Reapplication Conflict” (Step 3) program sample implicit none integer i, n parameter(n=10000000) real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc a=0 b=0 pi = 3.14159265d0 s = rtc() fourpi = pi * 4.0d0 do i = 1, n x(i) = i * sin(i / fourpi) enddo t1 = rtc() - s s = rtc() do i = 1, n a = a + 1.0d0 / x(i) enddo t2 = rtc() - s s = rtc() do i = 2, n, 4 b = b + ((x(i) + a) / fourpi + 1.0d0) b = b + ((x(i+1) + a) / fourpi + 1.0d0) b = b + ((x(i+2) + a) / fourpi + 1.0d0) b = b + ((x(i+3) + a) / fourpi + 1.0d0) enddo t3 = rtc() - s write(*,*) 'a=', a, 'b=', b write(*,*) 'time=', t1, t2, t3 end 11 Original A B C Original: Step 1: Step 2: Step 3: Step 3: Do loop unrolling by automatic conversion, and check it in. © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Example Scenario of “Reapplication Conflict” (Step 4) program sample implicit none integer i, n parameter(n=10000000) real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc a=0 b=0 pi = 3.14159265d0 s = rtc() fourpi = pi * 4.0d0 do i = 1, n x(i) = i * sin(i / fourpi) enddo t1 = rtc() - s s = rtc() do i = 1, n a = a + 1.0d0 / x(i) enddo t2 = rtc() - s s = rtc() do i = 2, n, 4 b = b + ((x(i) + a) / fourpi + 1.0d0) b = b + ((x(i+1) + a) / fourpi + 1.0d0) b = b + ((x(i+2) + a) / fourpi + 1.0d0) b = b + ((x(i+3) + a) / fourpi + 1.0d0) enddo t3 = rtc() - s write(*,*) 'a=', a, 'b=', b write(*,*) 'time=', t1, t2, t3 end 12 Original A B C Original: Step 1: Step 2: Step 3: N.G. O.K. O.K. Step 4: Compile and execute the code, and analyze effects of optimizations Find the following results Optimization A: not effective Optimization B: effective Optimization C: effective © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Example Scenario of “Reapplication Conflict” (Step 5) program sample implicit none integer i, n parameter(n=10000000) real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc 13 Original A B C Original: a=0 Step 1: b=0 pi = 3.14159265d0 s = rtc() fourpi = pi * 4.0d0 Step 2: do i = 1, n x(i) = i * sin(i / fourpi) Target of optimization A enddo t1 = rtc() - s Step 3: s = rtc() do i = 1, n a = a + 1.0d0 / x(i) enddo Step 5: t2 = rtc() - s s = rtc() do i = 2, n, 4 b = b + ((x(i) + a) / fourpi + 1.0d0) Step 5: Undo the optimization A by b = b + ((x(i+1) + a) / fourpi + 1.0d0) “patch” command b = b + ((x(i+2) + a) / fourpi + 1.0d0) b = b + ((x(i+3) + a) / fourpi + 1.0d0) enddo t3 = rtc() - s Not target of optimization A, but influenced write(*,*) 'a=', a, 'b=', b write(*,*) 'time=', t1, t2, t3 end © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Example Scenario of “Reapplication Conflict” (Final Results) program sample implicit none integer i, n parameter(n=10000000) real*8 a, b, pi, x(n), sin, s, t1, t2, t3, rtc a=0 b=0 pi = 3.14159265d0 s = rtc() do i = 1, n x(i) = i * sin(i / (pi * 4.0d0)) enddo t1 = rtc() - s s = rtc() do i = 1, n a = a + 1 / x(i) enddo t2 = rtc() - s s = rtc() do i = 2, n, 4 b = b + ((x(i) + a) / (pi * 4.0d0) + 1.0d0) b = b + ((x(i+1) + a) / fourpi + 1.0d0) b = b + ((x(i+2) + a) / fourpi + 1.0d0) b = b + ((x(i+3) + a) / fourpi + 1.0d0) enddo t3 = rtc() - s write(*,*) 'a=', a, 'b=', b write(*,*) 'time=', t1, t2, t3 end 14 Problem: The wrong line is unrolled !! Because “patch” does not actually apply the automatic conversion operation again, but does just apply difference of the results by automatic conversion operation. System for managing automatic conversion operations needed. (1) Adjust the target range (2) Apply the automatic operation actually again. © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Proposed Algorithm for saving/applying automatic operations Manual editing handled by the patch algorithm Saving an operation Manual Editing Original code Applying an saved operation Modified code Patch Optimization results Optimized results on modified code algorithm Context difference file Context difference file Operation log Operation log Automatic conversion handled by our proposed algorithm Specify Range Original Code 15 Specify Conversion ID and arguments Pseudo change file Optimization results Modified Code Patch algorithm Pseudo change file Optimization results Context difference file Apply automatic Context difference file conversion Conversion ID Conversion ID Arguments Arguments Operation log Operation log © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Scenario of Proposed Algorism to Save Automatic Operations program sample program implicit nonesample implicit integer i, n none integer i, n parameter(n=10000000) Identifier of automatic conversion Operation parameter(n=10000000) real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc “loop unrolling” a=0 b = a0 = 0 =0 pi =b3.14159265d0 parameter = 3.14159265d0 s = pi rtc() s = rtc() 4 fourpi = pi * 4.0d0 do ifourpi = 1, n= pi * 4.0d0 n / fourpi) x(i)do= ii =* 1, sin(i x(i) = i * sin(i / fourpi) enddo difference context file enddo t1 = rtc() - s = rtc() - sJul 11 11:36:34 2009 *** opeB.F Sat s = t1 rtc() s = rtc() --- opeC2.F do i = 1, n Sun Jul 12 13:36:10 2009 n / x(i) *************** ado = ai =+1, 1.0d0 a = a + 1.0d0 / x(i) *** 19,27 enddo**** --- 19,29 ---- - s t2 =enddo rtc() t2 = rtc() - s enddo s = rtc() s = t2i == rtc() do 2,rtc() n- s $BEGIN s b= =rtc() b + ((x(i) + a) / fourpi + 1.0d0) do i = 2, n + $BEGIN enddo =-bns+ ((x(i) + a) / fourpi + 1.0d0) =b 2, t3do = irtc() enddo b = b 'a=', + ((x(i) + a) b/ fourpi + 1.0d0) write(*,*) a, 'b=', $END enddo 'time=', t1, t2, t3 write(*,*) + $END endt3 = rtc() - s t3 write(*,*) = rtc() - s'a=', a, 'b=', b write(*,*) 'time=', t1,bt2, t3 write(*,*) 'a=', a, 'b=', end 'time=', t1, t2, t3 write(*,*) 16 Algorithm for saving operation history log Step 1: Generate pseudo change file pseudo change file by inserting special lines to specify range for the automatic operation. Step 2: Create context difference file between the file before editing and the pseudo change file By saving this context difference file, range-adjust algorithm of “patch” command can be used for identifying the target range of automatic conversion. Step 3: Save identifier of automatic conversion operation (e.g. “loop unrolling”), its parameter (e.g. “4”), and the context difference file as its operation log. © 2009 IBM Corporation Identifier of automatic conversion Operation log 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois “loopAutomatic unrolling” Scenario of Proposed Algorism to Apply Operation (Step 1) program sample program implicit nonesample implicit integer i, n none integer i, n parameter(n=10000000) parameter(n=10000000) real*8 a, b, pi, x(n), sin, s, t1, t2, t3, rtc real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc a=0 b = a0 = 0 =0 pi =b3.14159265d0 = 3.14159265d0 s = pi rtc() do is==1,rtc() n = pi */ 4.0d0 x(i)fourpi = i * sin(i (pi * 4.0d0)) do i = 1, n enddo x(i) =- is* sin(i / fourpi) t1 = rtc() s = enddo rtc() t1 do i = =1,rtc() n -s s = rtc() a = a + x(i) ** (-1) do i = 1, n enddo a =-as+ 1.0d0 / x(i) t2 = rtc() s = enddo rtc() t2 do i = =2,rtc() n -s s = rtc() b = b + ((x(i) + a) / (pi * 4.0d0) + 1.0d0) $BEGIN enddo i = -2,s n t3 =do rtc() b = 'a=', b + ((x(i) + a) write(*,*) a, 'b=', b / fourpi + 1.0d0) enddo'time=', t1, t2, t3 write(*,*) $END end t3 = rtc() - s write(*,*) 'a=', a, 'b=', b 2: Ignore the starting and ending Trialwrite(*,*) 3: 4: 1: Apply outer history one thelines line same 'time=',the t1, t2, t3mostattwo end line numbersthe modification position before/after Not Match Match 17 parameter 4 context difference file *** opeB.F Sat Jul 11 11:36:34 2009 --- opeC2.F Sun Jul 12 13:36:10 2009 *************** *** 19,27 **** --- 19,29 ---enddo t2 = rtc() - s s = rtc() + $BEGIN do i = 2, n b = b + ((x(i) + a) / fourpi + 1.0d0) enddo + $END t3 = rtc() - s write(*,*) 'a=', a, 'b=', b write(*,*) 'time=', t1, t2, t3 Algorithm for applying operation history on modified target code Step1: Apply the context diff file to the target program by using algorithm used © 2009 IBM Corporation pseudo change file by the “patch” command. Identifier of automatic conversion Operation log 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois “loopAutomatic unrolling” Scenario of Proposed Algorism to Apply Operation (Step 2) program sample implicit none integer i, n parameter(n=10000000) real*8 a, b, pi, fourpi, x(n), sin, s, t1, t2, t3, rtc a=0 b=0 pi = 3.14159265d0 s = rtc() fourpi = pi * 4.0d0 do i = 1, n x(i) = i * sin(i / fourpi) enddo t1 = rtc() - s s = rtc() do i = 1, n a = a + 1.0d0 / x(i) enddo Redo “loop unrolling” “4” times t2 = rtc() - s s = rtc() $BEGIN do i = 2, n b = b + ((x(i) + a) / fourpi + 1.0d0) enddo $END t3 = rtc() - s write(*,*) 'a=', a, 'b=', b write(*,*) 'time=', t1, t2, t3 end 18 parameter 4 context difference file on “the loop” pseudo change file *** opeB.F Sat Jul 11 11:36:34 2009 --- opeC2.F Sun Jul 12 13:36:10 2009 *************** *** 19,27 **** --- 19,29 ---enddo t2 = rtc() - s s = rtc() + $BEGIN do i = 2, n b = b + ((x(i) + a) / fourpi + 1.0d0) enddo + $END t3 = rtc() - s write(*,*) 'a=', a, 'b=', b write(*,*) 'time=', t1, t2, t3 Algorithm for applying operation history on modified target code Step2: Redo automatic conversion with its parameter saved in the operation log. © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Proposed Algorism to Apply Automatic Operation (Final Results) program sample implicit none integer i, n parameter(n=10000000) real*8 a, b, pi, x(n), sin, s, t1, t2, t3, rtc a=0 b=0 pi = 3.14159265d0 s = rtc() do i = 1, n x(i) = i * sin(i / (pi * 4.0d0)) enddo t1 = rtc() - s s = rtc() do i = 1, n a = a + 1 / x(i) enddo t2 = rtc() - s s = rtc() do i = 2, n, 4 b = b + ((x(i) + a) / (pi * 4.0d0) + 1.0d0) b = b + ((x(i+1) + a) / (pi * 4.0d0) + 1.0d0) b = b + ((x(i+2) + a) / (pi * 4.0d0) + 1.0d0) b = b + ((x(i+3) + a) / (pi * 4.0d0) + 1.0d0) enddo t3 = rtc() - s write(*,*) 'a=', a, 'b=', b write(*,*) 'time=', t1, t2, t3 end 19 Problem solved. The correct line is unrolled !! The proposed system can reapply automatic conversion operations correctly. © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Outline of this Presentation 1.Proposal of an algorithm for managing operation history of source-to-source optimization. 2.Prototype system with new user interface for managing operation history explicitly. 20 © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois User defined Transformation rules User defined Transformation rules Pre-defined Transformation rules Prototype Implementation of the Proposed System HPC refactoring module Photran module (Fortran) CDT module (C) Open Open Source Source Eclipse Implemented as an Eclipse plug-in module – Worked with open source CDT/Photran modules – Use CDT/Photran’s C/Fortran parser 21 © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Proposal of user interface for operation history management system 1. Operation History is displayed as a sequence, and user can select and modify any point of source code. 2. The succeeding operations are automatically reapplied as needed to Operation produce a new version according to history view the user’s instructions. Source code view 3. Operations are categorized into the following three categories and according status and Source code to theInformation console output view and tree view of the necessity reapplication, are displayed by using three colors. Green: Applied Yellow: Not tried to applied Operation Red: Tried to applied, but fail. view 22 history © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Conclusion 1.Explained proposal of an algorithm for managing operation history of source-to-source optimization. 2.Explained Prototype system with new user interface for managing operation history explicitly. 23 © 2009 IBM Corporation 19-20 July, 2009 | PADTAD 2009 @ Chicago, Illinois Questions ? 24 © 2009 IBM Corporation

- Similar pages
- TI TPS2590
- TI TPS2590RSA