Linux perf_event Features and Overhead

2013 FastPath Workshop
Vince Weaver
http://www.eece.maine.edu/~vweaver
[email protected]
21 April 2013


Performance Counters and Workload Optimized Systems

• With processor speeds constant, we cannot depend on Moore's Law to deliver increased performance
• Code analysis and optimization can provide speedups in existing code on existing hardware
• Systems with a single workload are the best target for cross-stack hardware/kernel/application optimization
• Hardware performance counters are the perfect tool for this type of optimization


Some Uses of Performance Counters

• Traditional analysis and optimization
• Finding architectural reasons for slowdown
• Validating simulators
• Auto-tuning
• Operating system optimization
• Estimating power/energy in software


Linux and Performance Counters

• Linux has become the operating system of choice in many domains
• Runs most of the Top500 list (over 90%) on down to embedded devices (Android phones)
• Until recently it had no easy access to hardware performance counters, limiting code analysis and optimization


Linux Performance Counter History

• oprofile – system-wide sampling profiler, in the kernel since 2002
• perfctr – widely used general interface available since 1999, required patching the kernel
• perfmon2 – another general interface; included in the kernel for Itanium, later made generic, with a big push for kernel inclusion


Linux perf_event

• Developed in response to perfmon2 by Molnar and Gleixner in 2009
• Merged in 2.6.31 as "PCL"
• Unusual design pushes most functionality into the kernel
• Not well documented nor well characterized


perf_event Interface

• sys_perf_event_open() system call
• Complex perf_event_attr structure (over 40 fields)
• Counters started/stopped with ioctl() calls
• Values read either with read() or as samples in an mmap() circular buffer


perf_event Kernel Features

• Generalized Events – commonly used events on various architectures are given common names
• Event Scheduling – the kernel handles mapping events to appropriate counters
• Multiplexing – if there are more events than counters, time-based multiplexing extrapolates full counts
• Per-process counts – values are saved on context switch
• Software Events – kernel events exposed through the same API
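
To make the interface and the multiplexing extrapolation concrete, here is a minimal sketch (not the benchmark code from this talk) of counting one generalized hardware event on the current process. The event choice (PERF_COUNT_HW_INSTRUCTIONS) and the use of PERF_FORMAT_TOTAL_TIME_ENABLED/RUNNING for scaling are illustrative assumptions; glibc has no wrapper for the syscall, so it is invoked directly.

    /* Minimal perf_event self-monitoring sketch; assumes Linux >= 2.6.32
       with <linux/perf_event.h> installed. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* No glibc wrapper exists for sys_perf_event_open; call it directly. */
    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        unsigned long long buf[3];   /* value, time_enabled, time_running */
        int fd;

        memset(&attr, 0, sizeof(attr));  /* most of the 40+ fields stay zero */
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;  /* a generalized event */
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;               /* created stopped; enabled below */
        /* ask for enabled/running times so multiplexed counts can be scaled */
        attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
                           PERF_FORMAT_TOTAL_TIME_RUNNING;

        fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
        if (fd < 0) { perror("perf_event_open"); exit(1); }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... code under measurement goes here ... */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        if (read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) exit(1);

        /* if the event was multiplexed, extrapolate to a full count */
        if (buf[2] && buf[2] < buf[1])
            printf("instructions (scaled): %llu\n",
                   (unsigned long long)((double)buf[0] * buf[1] / buf[2]));
        else
            printf("instructions: %llu\n", buf[0]);

        close(fd);
        return 0;
    }
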
Advanced Hardware Features

• Offcore Response – filtered measurement of memory accesses that leave the core
• Uncore and Northbridge Events – special support needed for shared resources (L2, L3, memory)
• Sampled Interfaces
  + AMD Instruction Based Sampling (IBS) – can provide address, latency, etc., as well as minimal skid
  + Intel Precise Event-Based Sampling (PEBS) – gathers extra data on the triggering event (registers, latency), low skid


Virtualized Counters

• Recent versions of KVM can trap on accesses to the performance MSRs and pass in guest-specific performance counts, allowing use of performance counters in a virtualized environment
• Counter values have to be saved/restored when the guest is scheduled


More on Generalized Events

• Unlike events provided by user-space libraries (PAPI), it is hard to know what the actual underlying event is (this is changing)
• Kernel events are sometimes wrong, and it is a lot more hassle to update the kernel than to update a library


Generalized Events – Wrong Events

Until 2.6.35 the generalized "branches" event was accidentally mapped to "taken branches".

[Figure: 176.gcc.static.166 – total vs. taken branches (0–28M) per 100M-instruction interval; 164.gzip.static.log – branch miss % (0–15%) computed as misses/total vs. misses/taken per 100M-instruction interval]


Generalized Events – Similar Events, Different Meaning

L1 DCache Loads on Nehalem:
• perf_event defines L1D.OP_READ.RESULT_ACCESS (perf: L1-dcache-loads) as MEM_INST_RETIRED:LOADS
• PAPI defines PAPI_L1_DCR as L1D_CACHE_LD:MESI

[Figure: 181.mcf.static.default – PAPI vs. perf_event load counts (0–176M) diverge across 100M-instruction intervals]


Context-Switch Test Methodology

• To provide per-process events, the kernel has to save counts on context switch; this has overhead
• We use the lmbench lat_ctx benchmark, run with and without perf measuring it
• Up to 20% overhead when perf monitors the threads; the benchmark documentation claims 10–15% accuracy at best


Core2 Context-Switch Overhead

[Figure: Core2 context-switch time (0–20 µs) for perf_event, perfctr, and perfmon2, each active and inactive, across kernels 2.6.30 through 3.4]


Common Performance Counter Usage Models

• Aggregate
• Sampled
• Self-monitoring

Linux perf_event can do all three.


Aggregate Counts

    $ perf stat -e instructions,cycles,branches,branch-misses,cache-misses \
          ./matrix_multiply_atlas
    Matrix multiply sum: s=3650244631906855424.000000

     Performance counter stats for './matrix_multiply_atlas':

        194,492,378,876 instructions     #  2.51 insns per cycle
         77,585,141,514 cycles           #  0.000 GHz
            584,202,927 branches
              3,963,325 branch-misses    #  0.68% of all branches
             89,863,007 cache-misses

           49.973787489 seconds time elapsed

perf sets up the events, forks the process (counts start on exec()), handles overflow, waits for exit, and prints totals.


Sampled Profiling

    $ perf record ./matrix_multiply_atlas
    Matrix multiply sum: s=3650244631906855424.000000
    [ perf record: Woken up 14 times to write data ]
    [ perf record: Captured and wrote 3.757 MB perf.data (~164126 samples) ]
    $ perf report
    Events: 98K cycles
     97.36%  matrix_multiply  libblas.so.3.0         [.] ATL_dJIK48x48x48TN48x48x0_
      0.62%  matrix_multiply  matrix_multiply_atlas  [.] naive_matrix_multiply
      0.27%  matrix_multiply  libblas.so.3.0         [.] 0x1f1728
      0.18%  matrix_multiply  libblas.so.3.0         [.] ATL_dupMBmm0_8_0_b1
      0.16%  matrix_multiply  libblas.so.3.0         [.] ATL_dupKBmm8_2_1_b1
      0.14%  matrix_multiply  libblas.so.3.0         [.] ATL_dupNBmm0_1_0_b1
      0.13%  matrix_multiply  libblas.so.3.0         [.] ATL_dcol2blk_a1
      0.09%  matrix_multiply  [kernel.kallsyms]      [k] page_fault

Periodically sample, grab state, and record for later analysis.


Self-Monitoring

    int retval, event_set = PAPI_NULL;   /* declarations implied on the slide */
    long long count;

    retval = PAPI_library_init(PAPI_VER_CURRENT);
    if (retval != PAPI_VER_CURRENT)
        fprintf(stderr, "Wrong PAPI version\n");

    retval = PAPI_create_eventset(&event_set);
    if (retval != PAPI_OK)
        fprintf(stderr, "Error creating eventset\n");

    retval = PAPI_add_named_event(event_set, "PAPI_TOT_INS");
    if (retval != PAPI_OK)
        fprintf(stderr, "Error adding event\n");

    retval = PAPI_start(event_set);

    naive_matrix_multiply(0);

    retval = PAPI_stop(event_set, &count);

    printf("Total instructions: %lld\n", count);


Self-Monitoring Overhead

• The typical pattern is Start/Stop/Read
• We want the minimum possible overhead
• Read performance is typically most important, especially when doing multiple reads


Methodology

• DVFS disabled
• Use the rdtsc() 64-bit timestamp counter; it typically has about 150 cycles of overhead itself
• Measure start/stop/read with no code in between
• All three (start/stop/read) measured at the same time
• Environment variables should not matter
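
The measurement code on the next slides calls rdtsc() without defining it. A common x86-64 implementation looks like the following sketch; this is an assumption for illustration, and the benchmark's actual helper may differ (for example, by serializing around the read).

    #include <stdint.h>

    /* Read the 64-bit timestamp counter; rdtsc returns the value
       split across EDX:EAX. */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }
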
perf_event Measurement Code

    start_before = rdtsc();
    ioctl(fd[0], PERF_EVENT_IOC_ENABLE, 0);
    start_after = rdtsc();

    ioctl(fd[0], PERF_EVENT_IOC_DISABLE, 0);
    stop_after = rdtsc();

    read(fd[0], buffer, BUFFER_SIZE * sizeof(long long));
    read_after = rdtsc();


perfctr Measurement Code

    start_before = rdtsc();
    perfctr_ioctl_w(fd, VPERFCTR_CONTROL, &control, &vperfctr_control_sdesc);
    start_after = rdtsc();

    cstatus = kstate->cpu_state.cstatus;
    nrctrs = perfctr_cstatus_nrctrs(cstatus);
    retry:
        tsc0 = kstate->cpu_state.tsc_start;
        rdtscl(now);
        sum.tsc = kstate->cpu_state.tsc_sum + (now - tsc0);
        for (i = nrctrs; --i >= 0;) {
            rdpmcl(kstate->cpu_state.pmc[i].map, now);
            sum.pmc[i] = kstate->cpu_state.pmc[i].sum +
                         (now - kstate->cpu_state.pmc[i].start);
        }
        if (tsc0 != kstate->cpu_state.tsc_start)
            goto retry;
    read_after = rdtsc();

    _vperfctr_control(fd, &control_stop);
    stop_after = rdtsc();


perfmon2 Measurement Code

    start_before = rdtsc();
    pfm_start(ctx_fd, NULL);
    start_after = rdtsc();

    pfm_stop(ctx_fd);
    stop_after = rdtsc();

    pfm_read_pmds(ctx_fd, pd, inp.pfp_event_count);
    read_after = rdtsc();


Overall Overhead / 1 Event, AMD Athlon64

Boxplots show 25th percentile/median/75th percentile, with stddev whiskers and outliers.

[Figure: amd0fh – overall overhead of start/stop/read with 1 event, 0–50,000 cycles, across 2.6.30-perfmon2, 2.6.32-perfctr, and perf_event kernels 2.6.32–3.5.0 plus 3.4.0-rdpmc and 3.5.0-rdpmc]


Overall Overhead / 1 Event, Intel Atom

[Figure: atom – overall overhead of start/stop/read with 1 event, 0–80,000 cycles, same kernel configurations]


Overall Overhead / 1 Event, Intel Core2

[Figure: core2 – overall overhead of start/stop/read with 1 event, 0–30,000 cycles, same kernel configurations]


Start Overhead / 1 Event, Intel Core2

[Figure: core2 – overhead of start with 1 event, 0–20,000 cycles, same kernel configurations]


Stop Overhead / 1 Event, Intel Core2

[Figure: core2 – overhead of stop with 1 event, 0–10,000 cycles, same kernel configurations]


Read Overhead / 1 Event, Intel Core2

perfctr uses rdpmc.

[Figure: core2 – overhead of read with 1 event, 0–20,000 cycles, same kernel configurations]


Overall Overhead / Multiple Events, Core2

[Figure: core2 – overall start/stop/read overhead, 0–40,000 cycles, vs. 1–4 simultaneous events, for 2.6.30-perfmon2, 2.6.32-perfctr, 2.6.32, 3.5.0, and 3.5.0-rdpmc]


Self-Monitoring Overhead Summary

• perfmon2 has low overhead due to a very thin layer over the hardware, with most of the work done in userspace
• perfctr has very fast rdpmc reads
• Some of perf_event's overhead arises because key tasks are in-kernel and cannot be done before starting the events
• Is 20,000 cycles too much to get an event count? Unclear, but perfctr is much faster, showing there is room for improvement
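
The "3.4.0-rdpmc" and "3.5.0-rdpmc" configurations above use perf_event's own userspace read path: the kernel exports counter state through the event's mmap()ed control page so a read needs no syscall. The sketch below follows the seqlock pattern documented in the kernel's perf_event_mmap_page comments; the field names match contemporary kernel headers, but treat it as illustrative rather than the exact benchmark code.

    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/perf_event.h>

    /* rdpmc reads hardware performance counter 'counter' directly,
       with no kernel involvement. */
    static inline uint64_t rdpmc(uint32_t counter)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Userspace read of a perf_event counter through its control page.
       'pc' comes from mapping the event fd:
         pc = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0); */
    static uint64_t perf_rdpmc_read(volatile struct perf_event_mmap_page *pc)
    {
        uint64_t count;
        uint32_t seq, idx;

        do {
            seq = pc->lock;              /* seqlock: retry if kernel updates */
            __sync_synchronize();
            idx = pc->index;             /* 0 => rdpmc unavailable; use read() */
            count = pc->offset;          /* kernel-maintained base count */
            if (idx)
                count += rdpmc(idx - 1); /* add the live hardware counter */
            __sync_synchronize();
        } while (pc->lock != seq);

        return count;
    }
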
New Non-perf_event Developments

• LIKWID – bypasses the Linux kernel and accesses the MSRs directly. Low overhead, but system-wide only, and it conflicts with perf_event
• LiMiT – a new patch interface similar to perfctr


Future Work

• AMD Lightweight Profiling (LWP) – (Bulldozer) events can be set up and read purely from userspace
• Intel Xeon Phi spflt userspace setup instruction
• Investigate the causes of overhead in greater depth, as well as the rdpmc performance issues
• What can we learn from the low overhead of perfctr and perfmon2?


Questions?

[email protected]

All code and data are available:
git clone git://github.com/deater/perfevent_overhead.git