IBM High Performance Computing Toolkit
MPI Tracing/Profiling User Manual

Advanced Computing Technology Center
IBM Thomas J. Watson Research Center
Yorktown Heights, NY 10598

April 4, 2008

Contents

1 Overview
2 System and Software Requirements
3 Compiling and Linking
  3.1 AIX on Power
  3.2 Linux on Power
  3.3 Blue Gene/L
  3.4 Blue Gene/P
4 Environment Variables
5 Output
  5.1 Plain Text File
  5.2 Viz File
  5.3 Trace File
6 Configuration
  6.1 Configuration Functions
  6.2 Data Structures
  6.3 Utility Functions
  6.4 Example
7 Final Note
  7.1 Overhead
  7.2 Multi-Threading
8 Contacts

1 Overview

This is the documentation for the IBM High Performance Computing Toolkit MPI Profiling/Tracing library. The library collects profiling and tracing data for MPI programs. The library file names and their usage are shown in Table 1.

  Name            Usage
  libmpitrace.a   library file for both C and Fortran applications
  mpt.h           C header file

  Table 1: Library file names and usage

Note:
1. The C header file is needed only when you configure the library. Please see Section 6 for details.

2 System and Software Requirements

The currently supported architectures/operating systems and the required software are:

• AIX on Power (32-bit and 64-bit)
  – IBM Parallel Environment (PE) for AIX program product and its Parallel Operating Environment (POE).

• Linux on Power (32-bit and 64-bit)
  – IBM Parallel Environment (PE) for Linux program product and its Parallel Operating Environment (POE).

• Blue Gene/L
  – System Software version 3.

• Blue Gene/P

3 Compiling and Linking

The trace library uses the debugging information stored in the binary to map the performance information back to the source code. To use the library, the application must be compiled with the "-g" option. Consider turning optimization off, or using a lower optimization level (-O2, -O1, ...), when linking with the MPI profiler/tracer: high optimization levels can affect the correctness of the debugging information and can also change the call-stack behavior.

To link the application with the library, add three options to your command line: -L/path/to/libraries, where /path/to/libraries is the directory where the libraries are located; -lmpitrace, which must appear before the MPI library (-lmpich) in the link order; and -llicense, which links the license library. On some platforms, if the shared library liblicense.so is used, you may need to set the environment variable LD_LIBRARY_PATH to $IHPCT_BASE/lib (or $IHPCT_BASE/lib64) to make sure the application finds the correct library at runtime.
3.1 AIX on Power

• C example

    CC = /usr/lpp/ppe.poe/bin/mpcc_r
    TRACE_LIB = -L</path/to/libmpitrace.a> -lmpitrace -llicense

    mpitrace.ppe: mpi_test.c
            $(CC) -g -o $@ $< $(TRACE_LIB) -lm

• Fortran example

    FC = /usr/lpp/ppe.poe/bin/mpxlf_r
    TRACE_LIB = -L</path/to/libmpitrace.a> -lmpitrace -llicense

    swim.ppe: swim.f
            $(FC) -g -o $@ $< $(TRACE_LIB)

3.2 Linux on Power

• C example

    CC = /opt/ibmhpc/ppe.poe/bin/mpcc
    TRACE_LIB = -L</path/to/libmpitrace.a> -lmpitrace -llicense

    mpitrace: mpi_test.c
            $(CC) -g -o $@ $< $(TRACE_LIB) -lm

• Fortran example

    FC = /opt/ibmhpc/ppe.poe/bin/mpfort
    TRACE_LIB = -L</path/to/libmpitrace.a> -lmpitrace -llicense

    statusesf_trace: statusesf.f
            $(FC) -g -o $@ $< $(TRACE_LIB)

3.3 Blue Gene/L

• C example

    BGL_INSTALL = /bgl/BlueLight/ppcfloor
    LIBS_RTS = -lrts.rts -ldevices.rts
    LIBS_MPI = -L$(BGL_INSTALL)/bglsys/lib -lmpich.rts -lmsglayer.rts $(LIBS_RTS)
    XLC_TRACE_LIB = -L</path/to/libmpitrace.a> -lmpitrace -llicense
    XLC_RTS = blrts_xlc
    XLC_CFLAGS = -I$(BGL_INSTALL)/bglsys/include -g -O -qarch=440 -qtune=440 -qhot

    mpitrace_xlc.rts: mpi_test.c
            $(XLC_RTS) -o $@ $< $(XLC_CFLAGS) $(XLC_TRACE_LIB) $(LIBS_MPI) -lm

• Fortran example

    BGL_INSTALL = /bgl/BlueLight/ppcfloor
    LIBS_RTS = -lrts.rts -ldevices.rts
    LIBS_MPI = -L$(BGL_INSTALL)/bglsys/lib -lmpich.rts -lmsglayer.rts $(LIBS_RTS)
    TRACE_LIB = -L</path/to/libmpitrace.a> -lmpitrace -llicense
    BG_XLF = blrts_xlf
    FC_FLAGS = -I$(BGL_INSTALL)/bglsys/include -g -O

    statusesf_trace.rts: statusesf.f
            $(BG_XLF) -o $@ $< $(FC_FLAGS) $(TRACE_LIB) $(LIBS_MPI)

3.4 Blue Gene/P

• C example

    BGPHOME = /bgsys/drivers/ppcfloor
    CC = $(BGPHOME)/comm/bin/mpicc
    CFLAGS = -I$(BGPHOME)/comm/include -g -O
    TRACE_LIB = -L</path/to/libmpitrace.a> -lmpitrace -llicense
    LIB1 = -L$(BGPHOME)/comm/lib -lmpich.cnk -ldcmfcoll.cnk -ldcmf.cnk
    LIB2 = -L$(BGPHOME)/runtime/SPI -lSPI.cna -lpthread -lrt
    LIB3 = -lgfortranbegin -lgfortran   # please read the NOTE

    mpitrace: mpi_test.c
            $(CC) -o $@ $< $(CFLAGS) $(TRACE_LIB) $(LIB1) $(LIB2) $(LIB3) -lm

  NOTE: the C example uses mpicc, which is currently based on the GNU compiler. Because part of the tracing/profiling library is written in Fortran, the two GNU Fortran libraries in LIB3 must also be linked.

• Fortran example

    BGPHOME = /bgsys/drivers/ppcfloor
    FC = $(BGPHOME)/comm/bin/mpif77
    FFLAGS = -I$(BGPHOME)/comm/include -g -O
    TRACE_LIB = -L</path/to/libmpitrace.a> -lmpitrace -llicense
    LIB1 = -L$(BGPHOME)/comm/lib -lmpich.cnk -ldcmfcoll.cnk -ldcmf.cnk
    LIB2 = -L$(BGPHOME)/runtime/SPI -lSPI.cna -lpthread -lrt

    statusesf: statusesf.f
            $(FC) -o $@ $< $(FFLAGS) $(TRACE_LIB) $(LIB1) $(LIB2)
4 Environment Variables

• TRACE_ALL_EVENTS

  The wrappers can be used in two modes, controlled by this environment variable. The default value is yes: the library collects both a timing summary and a time-history of MPI calls suitable for graphical display. When this variable is set to yes, the library saves a record of all MPI events after MPI_Init(), until the application completes or until the trace buffer is full. (By default, events are recorded for MPI ranks 0-255, or for all MPI ranks if there are 256 or fewer processes in MPI_COMM_WORLD. You can change this by setting TRACE_ALL_TASKS or by using the configuration interface described in Section 6.)

  The other mode is to control the time-history measurement within the application by calling routines that start and stop tracing:

  – Fortran syntax

        call mt_trace_start()
        do work + mpi ...
        call mt_trace_stop()

  – C syntax

        void MT_trace_start(void);
        void MT_trace_stop(void);

        MT_trace_start();
        do work + mpi ...
        MT_trace_stop();

  – C++ syntax

        extern "C" void MT_trace_start(void);
        extern "C" void MT_trace_stop(void);

        MT_trace_start();
        do work + mpi ...
        MT_trace_stop();

  To use this control method, the environment variable needs to be disabled (otherwise all events are traced):

        export TRACE_ALL_EVENTS=no    (bash)
        setenv TRACE_ALL_EVENTS no    (csh)

  (A complete start/stop example is sketched at the end of this section.)

• TRACE_ALL_TASKS

  When saving MPI event records, it is easy to generate trace files that are simply too large to visualize. To cut down on the data volume, the default behavior when TRACE_ALL_EVENTS=yes is to save event records only from MPI tasks 0-255, or from all MPI tasks if there are 256 or fewer processes in MPI_COMM_WORLD. That should be enough to provide a good visual record of the communication pattern. If you want to save data from all tasks, set this environment variable to yes:

        export TRACE_ALL_TASKS=yes    (bash)
        setenv TRACE_ALL_TASKS yes    (csh)

• MAX_TRACE_RANK

  To provide finer control, you can set MAX_TRACE_RANK=#. For example, if you set MAX_TRACE_RANK=2048, you will get trace data from 2048 tasks, ranks 0-2047, provided you actually have at least 2048 tasks in your job. By using the time-stamped trace feature selectively, both in time (trace start/stop) and by MPI rank, you can get good insight into the MPI performance of very large, complex parallel applications.

• OUTPUT_ALL_RANKS

  For scalability reasons, by default only four ranks generate plain text files and events in the trace: rank 0 and the ranks with the minimum, median, and maximum MPI communication time. If rank 0 is itself one of the ranks with the minimum, median, or maximum MPI communication time, only three ranks generate plain text files and events in the trace. If plain text files and events should be output from all ranks, set this environment variable to yes:

        export OUTPUT_ALL_RANKS=yes    (bash)
        setenv OUTPUT_ALL_RANKS yes    (csh)

• TRACEBACK_LEVEL

  In some cases there may be deeply nested layers on top of MPI, and you may need to profile higher up the call chain (functions further up the call stack). You can do this by setting this environment variable (the default value is 0). For example, setting TRACEBACK_LEVEL=1 tells the library to save addresses starting not with the location of the MPI call (level = 0), but with its parent in the call chain (level = 1).

• SWAP_BYTES

  The event trace file is binary, and so it is sensitive to byte order. For example, Blue Gene/L is big endian, while your visualization workstation is probably little endian (e.g., x86). The trace files are therefore written in little endian format by default. If you use a big endian system for graphical display (for example, Apple OS X or AIX p-series workstations), set this environment variable

        export SWAP_BYTES=no    (bash)
        setenv SWAP_BYTES no    (csh)

  when you run your job. This results in a trace file in big endian format.

• TRACE_SEND_PATTERN (Blue Gene/L and Blue Gene/P only)

  In either profiling or tracing mode there is an option to collect information about the number of hops for point-to-point communication on the torus. This feature can be enabled by setting an environment variable:

        export TRACE_SEND_PATTERN=yes    (bash)
        setenv TRACE_SEND_PATTERN yes    (csh)

  When this variable is set, the wrappers keep track of how many bytes are sent to each task, and a binary file "send_bytes.matrix" is written during MPI_Finalize() which lists how many bytes were sent from each task to all other tasks. The format of the binary file is

        D_00, D_01, ..., D_0n, D_10, ..., D_ij, ..., D_nn

  where each entry D_ij is a double (in C) and represents the number of bytes of MPI data sent from rank i to rank j. This matrix can be used as input to external utilities that can generate efficient mappings of MPI tasks onto torus coordinates. (A small reader for this file is sketched at the end of this section.) The wrappers also provide the average number of hops for all flavors of MPI_Send. The wrappers do not track the message-traffic patterns in collective calls, such as MPI_Alltoall; only point-to-point send operations are tracked. The AverageHops for all communications on a given processor is measured as

        AverageHops = Σ_i (Hops_i × Bytes_i) / Σ_i Bytes_i

  where Hops_i is the distance between the processors involved in the i-th MPI communication and Bytes_i is the size of the data transferred in that communication. The idea behind this performance metric is to measure how far each byte has to travel, on average, for the communication. If the communicating processor pair is close together in the torus coordinates, the AverageHops value will tend to be small.
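To make the start/stop control method concrete, here is a minimal C sketch (not part of the toolkit) that brackets a single ring exchange with MT_trace_start()/MT_trace_stop(). It assumes the program is linked with -lmpitrace -llicense as described in Section 3 and run with TRACE_ALL_EVENTS=no; the buffer size and communication pattern are illustrative only.

    #include <mpi.h>
    #include <string.h>

    /* Entry points provided by libmpitrace (C syntax shown above). */
    void MT_trace_start(void);
    void MT_trace_stop(void);

    int main(int argc, char *argv[])
    {
        int rank, nprocs, peer;
        double sendbuf[1024], recvbuf[1024];
        MPI_Request req[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        memset(sendbuf, 0, sizeof(sendbuf));

        /* ... setup work that should not appear in the time-history ... */

        MT_trace_start();               /* begin recording trace events */

        peer = (rank + 1) % nprocs;     /* simple ring exchange */
        MPI_Irecv(recvbuf, 1024, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                  MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, 1024, MPI_DOUBLE, peer, 0,
                  MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);

        MT_trace_stop();                /* stop recording trace events */

        MPI_Finalize();                 /* wrappers write the output files */
        return 0;
    }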
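Since the send_bytes.matrix file produced with TRACE_SEND_PATTERN=yes is simply an ntasks × ntasks array of doubles, it can be post-processed with a few lines of C. The sketch below (not part of the toolkit) prints the total number of bytes sent by each rank; it assumes the file is read on a machine with the same byte order as the one that wrote it and that the number of tasks is supplied on the command line.

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Reads an ntasks x ntasks matrix of doubles from a send_bytes.matrix
     * file (row i, column j = bytes sent from rank i to rank j) and prints
     * the total number of bytes sent by each rank.
     *
     * Usage: ./read_matrix send_bytes.matrix <ntasks>
     */
    int main(int argc, char *argv[])
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <matrix-file> <ntasks>\n", argv[0]);
            return 1;
        }
        int n = atoi(argv[2]);
        FILE *fp = fopen(argv[1], "rb");
        if (n <= 0 || fp == NULL) {
            fprintf(stderr, "cannot open %s or bad ntasks\n", argv[1]);
            return 1;
        }

        double *row = malloc((size_t)n * sizeof(double));
        for (int i = 0; i < n; i++) {
            if (fread(row, sizeof(double), (size_t)n, fp) != (size_t)n) {
                fprintf(stderr, "short read at row %d\n", i);
                break;
            }
            double total = 0.0;
            for (int j = 0; j < n; j++)
                total += row[j];
            printf("rank %d sent %.0f bytes in total\n", i, total);
        }
        free(row);
        fclose(fp);
        return 0;
    }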
5 Output

After building the binary executable and setting the environment, run the application as you normally would. For finer control over the performance data that is collected and written out, please refer to Sections 4 and 6.

5.1 Plain Text File

The wrapper for MPI_Finalize() writes the timing summaries to files called mpi_profile.taskid. The mpi_profile.0 file is special: it also contains a timing summary from each task. Currently, for scalability reasons, by default only four ranks generate a plain text file: rank 0 and the ranks with the minimum, median, and maximum MPI communication time. To change this default setting, please refer to the OUTPUT_ALL_RANKS environment variable. An example mpi_profile.0 file is shown below:

    elapsed time from clock-cycles using freq = 700.0 MHz
    -----------------------------------------------------------------
    MPI Routine                  #calls     avg. bytes      time(sec)
    -----------------------------------------------------------------
    MPI_Comm_size                     1            0.0          0.000
    MPI_Comm_rank                     1            0.0          0.000
    MPI_Isend                        21        99864.3          0.000
    MPI_Irecv                        21        99864.3          0.000
    MPI_Waitall                      21            0.0          0.014
    MPI_Barrier                      47            0.0          0.000
    -----------------------------------------------------------------
    total communication time = 0.015 seconds.
    total elapsed time       = 4.039 seconds.
    -----------------------------------------------------------------
    Message size distributions:

    MPI_Isend                 #calls     avg. bytes      time(sec)
                                   3            2.3          0.000
                                   1            8.0          0.000
                                   1           16.0          0.000
                                   1           32.0          0.000
                                   1           64.0          0.000
                                   1          128.0          0.000
                                   1          256.0          0.000
                                   1          512.0          0.000
                                   1         1024.0          0.000
                                   1         2048.0          0.000
                                   1         4096.0          0.000
                                   1         8192.0          0.000
                                   1        16384.0          0.000
                                   1        32768.0          0.000
                                   1        65536.0          0.000
                                   1       131072.0          0.000
                                   1       262144.0          0.000
                                   1       524288.0          0.000
                                   1      1048576.0          0.000

    MPI_Irecv                 #calls     avg. bytes      time(sec)
                                   3            2.3          0.000
                                   1            8.0          0.000
                                   1           16.0          0.000
                                   1           32.0          0.000
                                   1           64.0          0.000
                                   1          128.0          0.000
                                   1          256.0          0.000
                                   1          512.0          0.000
                                   1         1024.0          0.000
                                   1         2048.0          0.000
                                   1         4096.0          0.000
                                   1         8192.0          0.000
                                   1        16384.0          0.000
                                   1        32768.0          0.000
                                   1        65536.0          0.000
                                   1       131072.0          0.000
                                   1       262144.0          0.000
                                   1       524288.0          0.000
                                   1      1048576.0          0.000

    -----------------------------------------------------------------
    Communication summary for all tasks:

      minimum communication time = 0.015 sec for task 0
      median  communication time = 4.039 sec for task 20
      maximum communication time = 4.039 sec for task 30

    taskid  xcoord  ycoord  zcoord  procid  total_comm(sec)  avg_hops
         0       0       0       0       0            0.015      1.00
         1       1       0       0       0            4.039      1.00
         2       2       0       0       0            4.039      1.00
         3       3       0       0       0            4.039      4.00
         4       0       1       0       0            4.039      1.00
         5       1       1       0       0            4.039      1.00
         6       2       1       0       0            4.039      1.00
         7       3       1       0       0            4.039      4.00
         8       0       2       0       0            4.039      1.00
         9       1       2       0       0            4.039      1.00
        10       2       2       0       0            4.039      1.00
        11       3       2       0       0            4.039      4.00
        12       0       3       0       0            4.039      1.00
        13       1       3       0       0            4.039      1.00
        14       2       3       0       0            4.039      1.00
        15       3       3       0       0            4.039      7.00
        16       0       0       1       0            4.039      1.00
        17       1       0       1       0            4.039      1.00
        18       2       0       1       0            4.039      1.00
        19       3       0       1       0            4.039      4.00
        20       0       1       1       0            4.039      1.00
        21       1       1       1       0            4.039      1.00
        22       2       1       1       0            4.039      1.00
        23       3       1       1       0            4.039      4.00
        24       0       2       1       0            4.039      1.00
        25       1       2       1       0            4.039      1.00
        26       2       2       1       0            4.039      1.00
        27       3       2       1       0            4.039      4.00
        28       0       3       1       0            4.039      1.00
        29       1       3       1       0            4.039      1.00
        30       2       3       1       0            4.039      1.00
        31       3       3       1       0            4.039      7.00

    MPI tasks sorted by communication time:

    taskid  xcoord  ycoord  zcoord  procid  total_comm(sec)  avg_hops
         0       0       0       0       0            0.015      1.00
         9       1       2       0       0            4.039      1.00
        26       2       2       1       0            4.039      1.00
        10       2       2       0       0            4.039      1.00
         2       2       0       0       0            4.039      1.00
         1       1       0       0       0            4.039      1.00
        17       1       0       1       0            4.039      1.00
         5       1       1       0       0            4.039      1.00
        23       3       1       1       0            4.039      4.00
         4       0       1       0       0            4.039      1.00
        29       1       3       1       0            4.039      1.00
        21       1       1       1       0            4.039      1.00
        15       3       3       0       0            4.039      7.00
        19       3       0       1       0            4.039      4.00
        31       3       3       1       0            4.039      7.00
        20       0       1       1       0            4.039      1.00
         6       2       1       0       0            4.039      1.00
         7       3       1       0       0            4.039      4.00
         8       0       2       0       0            4.039      1.00
         3       3       0       0       0            4.039      4.00
        16       0       0       1       0            4.039      1.00
        11       3       2       0       0            4.039      4.00
        13       1       3       0       0            4.039      1.00
        14       2       3       0       0            4.039      1.00
        24       0       2       1       0            4.039      1.00
        27       3       2       1       0            4.039      4.00
        22       2       1       1       0            4.039      1.00
        25       1       2       1       0            4.039      1.00
        28       0       3       1       0            4.039      1.00
        12       0       3       0       0            4.039      1.00
        18       2       0       1       0            4.039      1.00
        30       2       3       1       0            4.039      1.00

5.2 Viz File

In addition to the mpi_profile.taskid files, the library may also generate mpi_profile_taskid.viz files in XML format, which can be viewed with Peekperf as shown in Figure 1.

    Figure 1: Peekperf

5.3 Trace File

The library also generates a file called single_trace. The Peekview utility can be used (inside Peekperf or independently) to display this trace file, as shown in Figure 2.

    Figure 2: Peekview
6 Configuration

This section describes a more general way to make the tracing tool configurable, which allows users to focus on the performance points of interest. By providing a flexible mechanism to control which events are recorded, the library remains useful even for very large-scale parallel applications.

6.1 Configuration Functions

There are three functions that can be rewritten to configure the library. At runtime, the return values of these three functions decide what performance information is stored, which processes (MPI ranks) output performance information, and what performance information is written to files.

• int MT_trace_event(int);
  Whenever an MPI function that is profiled/traced is called, this function is invoked. The integer passed in is the ID number of the MPI function. The return value is 1 if the performance information should be stored in the trace buffer, and 0 otherwise.

• int MT_output_trace(int);
  This function is called once inside MPI_Finalize(). The integer passed in is the MPI rank. The return value is 1 if this rank should output its performance information, and 0 otherwise.

• int MT_output_text(void);
  This function is also called once inside MPI_Finalize(). The user can rewrite this function to customize the performance data output (e.g., user-defined performance metrics or data layout). A sketch of a customized version is given below.
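As an illustration of a customized MT_output_text(), the hedged sketch below writes a one-line summary for a few MPI routines using the utility functions described in Section 6.3. It assumes that mpt.h declares the MT_ prototypes and the MPI ID constants of Table 2, and that a non-zero return value signals success; the output file name is arbitrary.

    #include <stdio.h>
    #include "mpt.h"   /* assumed to declare the MT_ calls and the
                          MPI ID constants listed in Table 2 */

    int MT_output_text(void)
    {
        /* A few IDs from Table 2; any of the listed IDs could be used. */
        int ids[] = { SEND_ID, ISEND_ID, IRECV_ID, WAITALL_ID, BARRIER_ID };
        int n = (int)(sizeof(ids) / sizeof(ids[0]));
        struct MT_envstruct env;
        char fname[64];
        FILE *fp;
        int i;

        MT_get_environment(&env);
        snprintf(fname, sizeof(fname), "my_profile.%d", env.mpirank);
        fp = fopen(fname, "w");
        if (fp == NULL)
            return 0;

        fprintf(fp, "rank %d of %d, elapsed time %.3f s\n",
                env.mpirank, env.ntasks, MT_get_elapsed_time());
        for (i = 0; i < n; i++) {
            fprintf(fp, "%-16s calls=%lld bytes=%.1f time=%.3f s\n",
                    MT_get_mpi_name(ids[i]),
                    MT_get_mpi_counts(ids[i]),
                    MT_get_mpi_bytes(ids[i]),
                    MT_get_mpi_time(ids[i]));
        }
        fclose(fp);
        return 1;   /* assumed: non-zero indicates the output was written */
    }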
6.2 Data Structures

Each data structure described in this section is usually used with an associated utility function from Section 6.3 to provide information to the user when implementing the configuration functions described in Section 6.1.

• MT_summarystruct
  This data structure holds statistical results, including the MPI ranks involved and statistical values (e.g., minimum, maximum, median, average and sum). It is used together with the MT_get_allresults() utility function.

    struct MT_summarystruct {
        int min_rank;
        int max_rank;
        int med_rank;
        void *min_result;
        void *max_result;
        void *med_result;
        void *avg_result;
        void *sum_result;
        void *all_result;
        void *sorted_all_result;
        int *sorted_rank;
    };

• MT_envstruct
  This data structure is used with the MT_get_environment() utility function. It holds information about the process itself, including its MPI rank (mpirank), the total number of MPI tasks (ntasks), and the total number of MPI function types that are profiled/traced (nmpi). For Blue Gene/L, it also provides the process environment, including the x, y, z coordinates in the torus, the dimensions of the torus (xSize, ySize, zSize), the processor ID (procid) and the CPU clock frequency (clockHz).

    struct MT_envstruct {
        int mpirank;
        int xCoord;
        int yCoord;
        int zCoord;
        int xSize;
        int ySize;
        int zSize;
        int procid;
        int ntasks;
        double clockHz;
        int nmpi;
    };

• MT_tracebufferstruct
  This data structure is used together with the MT_get_tracebufferinfo() utility function. It holds the number of events recorded so far (number_events) and information about the memory space for tracing (total/used/available, in MBytes).

    struct MT_tracebufferstruct {
        int number_events;
        double total_buffer;   /* in terms of MBytes */
        double used_buffer;
        double free_buffer;
    };

• MT_callerstruct
  This data structure holds the caller's information for an MPI function. It is used with the MT_get_callerinfo() utility function. The information includes the source file path, source file name, function name and line number in the source file.

    struct MT_callerstruct {
        char *filepath;
        char *filename;
        char *funcname;
        int lineno;
    };

• MT_memorystruct (Blue Gene/L only)
  Since the memory space per compute node on Blue Gene/L is limited, this data structure is used with the MT_get_memoryinfo() utility function to provide memory usage information.

    struct MT_memorystruct {
        unsigned int max_stack_address;
        unsigned int min_stack_address;
        unsigned int max_heap_address;
    };

6.3 Utility Functions

• long long MT_get_mpi_counts(int);
  The integer passed in is an MPI ID, and the number of calls to that MPI function is returned. The MPI ID can be one of the IDs listed in Table 2.

• double MT_get_mpi_bytes(int);
  Similar to MT_get_mpi_counts(), this function returns the accumulated size of data transferred by the MPI function.

• double MT_get_mpi_time(int);
  Similar to MT_get_mpi_counts(), this function returns the accumulated time spent in the MPI function.

• double MT_get_avg_hops(void);
  The distance between two processors p, q with physical coordinates (xp, yp, zp) and (xq, yq, zq) is calculated as

        Hops(p, q) = |xp − xq| + |yp − yq| + |zp − zq|.

  We measure the AverageHops for all communications on a given processor as

        AverageHops = Σ_i (Hops_i × Bytes_i) / Σ_i Bytes_i

  where Hops_i is the distance between the processors for the i-th MPI communication and Bytes_i is the size of the data transferred in that communication. The idea behind this performance metric is to measure how far each byte has to travel, on average, for the communication. If the communicating processor pair is close together in the coordinates, the AverageHops value will tend to be small.

    COMM_SIZE_ID           COMM_RANK_ID           SEND_ID
    SSEND_ID               RSEND_ID               BSEND_ID
    ISEND_ID               ISSEND_ID              IRSEND_ID
    IBSEND_ID              SEND_INIT_ID           SSEND_INIT_ID
    RSEND_INIT_ID          BSEND_INIT_ID          RECV_INIT_ID
    RECV_ID                IRECV_ID               SENDRECV_ID
    SENDRECV_REPLACE_ID    BUFFER_ATTACH_ID       BUFFER_DETACH_ID
    PROBE_ID               IPROBE_ID              TEST_ID
    TESTANY_ID             TESTALL_ID             TESTSOME_ID
    WAIT_ID                WAITANY_ID             WAITALL_ID
    WAITSOME_ID            START_ID               STARTALL_ID
    BCAST_ID               BARRIER_ID             GATHER_ID
    GATHERV_ID             SCATTER_ID             SCATTERV_ID
    SCAN_ID                ALLGATHER_ID           ALLGATHERV_ID
    REDUCE_ID              ALLREDUCE_ID           REDUCE_SCATTER_ID
    ALLTOALL_ID            ALLTOALLV_ID

    Table 2: MPI IDs

• double MT_get_time(void);
  This function returns the time elapsed since MPI_Init() was called.

• double MT_get_elapsed_time(void);
  This function returns the time between the calls to MPI_Init() and MPI_Finalize().

• char *MT_get_mpi_name(int);
  This function takes an MPI ID and returns its name as a string.

• int MT_get_tracebufferinfo(struct MT_tracebufferstruct *);
  This function returns the amount of buffer space currently used and still free in the tracing/profiling tool.

• unsigned long MT_get_calleraddress(int level);
  This function returns the memory address of the caller at the given level in the call chain.

• int MT_get_callerinfo(unsigned long caller_memory_address, struct MT_callerstruct *);
  This function takes the caller's memory address (from MT_get_calleraddress()) and returns detailed caller information, including the path, the source file name, the function name and the line number of the caller in the source file.

• void MT_get_environment(struct MT_envstruct *);
  This function returns the environment information of the calling process, including its MPI rank, physical coordinates, the dimensions of the partition, the total number of tasks and the CPU clock frequency.

• int MT_get_allresults(int data_type, int mpi_id, struct MT_summarystruct *);
  This function returns statistical results (e.g., minimum, maximum, median, average) on primitive performance data (e.g., call counts, size of data transferred, time, etc.) for a specific MPI function or for all MPI functions. The data type can be one of the data types listed in Table 3, and mpi_id can be one of the MPI IDs listed in Table 2, or ALLMPI_ID for all MPI functions.

    COUNTS   BYTES   COMMUNICATIONTIME   STACK   HEAP   MAXSTACKFUNC   ELAPSEDTIME   AVGHOPS

    Table 3: Data Types

• int MT_get_memoryinfo(struct MT_memorystruct *); (Blue Gene/L only)
  This function returns information about the memory usage on the compute node.
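To show how the data structures and utility functions above fit together, the following hedged sketch implements an MT_trace_event() that keeps recording events only while roughly 10% of the trace buffer remains free and reports, once, the source location of the MPI call that was active when recording stopped. The 10% threshold is arbitrary, mpt.h is assumed to declare the MT_ interfaces, and MT_get_callerinfo() is assumed to return 0 on success.

    #include <stdio.h>
    #include "mpt.h"   /* assumed to declare the MT_ interfaces */

    int MT_trace_event(int id)
    {
        static int reported = 0;
        struct MT_tracebufferstruct buf;
        struct MT_callerstruct caller;
        unsigned long addr;

        /* How much of the in-memory trace buffer is still available? */
        MT_get_tracebufferinfo(&buf);
        if (buf.free_buffer > 0.1 * buf.total_buffer)
            return 1;                      /* keep recording this event */

        if (!reported) {
            /* Report once where the MPI call came from when the buffer
               filled up; level 0 is the MPI call site itself. */
            addr = MT_get_calleraddress(0);
            if (MT_get_callerinfo(addr, &caller) == 0)   /* assumed: 0 = success */
                printf("trace buffer nearly full at %s (%s:%d), MPI routine %s\n",
                       caller.funcname, caller.filename, caller.lineno,
                       MT_get_mpi_name(id));
            reported = 1;
        }
        return 0;                          /* stop storing trace events */
    }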
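Similarly, MT_get_allresults() can be used to pick the output ranks from the communication-time statistics. The sketch below is an assumption-based illustration, not toolkit code: it lets only rank 0 and the rank with the maximum communication time write output, and it assumes that COMMUNICATIONTIME and ALLMPI_ID are available from mpt.h and that MT_get_allresults() may be called inside MT_output_trace().

    #include "mpt.h"   /* assumed to declare the MT_ interfaces */

    int MT_output_trace(int rank)
    {
        struct MT_summarystruct summary;

        /* Statistics of the communication time over all ranks and all
           profiled MPI functions (see Tables 2 and 3). */
        MT_get_allresults(COMMUNICATIONTIME, ALLMPI_ID, &summary);

        /* Write output only from rank 0 and from the slowest rank. */
        if (rank == 0 || rank == summary.max_rank)
            return 1;
        return 0;
    }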
6.4 Example

In Figure 3 we rewrite the MT_trace_event() and MT_output_trace() routines in about 50 lines of code (and use the default version of MT_output_text()) on Blue Gene/L. The function automatically detects the communication pattern and shuts off the recording of trace events after the first instance of the pattern. In addition, only MPI ranks less than 32 output performance data at the end of the program execution. As shown in the figure, utility functions such as MT_get_time() and MT_get_environment() help the user easily obtain the information needed to configure the library.
In this example, MT_get_time() returns the execution time spent so far, and MT_get_environment() returns the process personality, including its physical coordinates and MPI rank.

    int MT_trace_event(int id)
    {
        ...
        now = MT_get_time();
        MT_get_environment(&env);
        ...
        /* get MPI function call distribution */
        current_event_count = MT_get_mpi_counts();

        /* compare MPI function call distribution */
        comparison_result = compare_dist(prev_event_count, current_event_count);
        prev_event_count = current_event_count;

        if (comparison_result == 1)
            return 0;   /* stop tracing  */
        else
            return 1;   /* start tracing */
    }

    int MT_output_trace(int rank)
    {
        if (rank < 32)
            return 1;   /* output performance data */
        else
            return 0;   /* no output */
    }

    Figure 3: Sample Code for MPI Tracing Configuration

7 Final Note

7.1 Overhead

The library implements wrappers that use the MPI profiling interface and have the form:

    int MPI_Send(...)
    {
        start_timing();
        PMPI_Send(...);
        stop_timing();
        log_the_event();
    }

When event tracing is enabled, the wrappers save a time-stamped record of every MPI call for graphical display. This adds some overhead, about 1-2 microseconds per call. The event-tracing method uses a small buffer in memory (up to 3 × 10^4 events per task), and so it is best suited for short-running applications or for time-stepping codes run for just a few steps. To trace/profile larger-scale applications, configuration may be required to improve scalability; please refer to Section 6 for details.

7.2 Multi-Threading

The current version is not thread-safe, so it should be used in single-threaded applications, or when only one thread makes MPI calls. The wrappers could be made thread-safe by adding mutex locks around updates of static data, which would add some additional overhead.

8 Contacts

• I-Hsin Chung ([email protected])
  Comments, corrections or technical issues.

• David Klepacki ([email protected])
  IBM High Performance Computing Toolkit licensing and distribution.