
Linux perf_event Features and Overhead
2013 FastPath Workshop
Vince Weaver
http://www.eece.maine.edu/~vweaver
[email protected]
21 April 2013
Performance Counters and Workload Optimized Systems
• With processor speeds constant, we cannot depend on Moore's Law to deliver increased performance
• Code analysis and optimization can provide speedups in existing code on existing hardware
• Systems with a single workload are the best target for cross-stack hardware/kernel/application optimization
• Hardware performance counters are the perfect tool for this type of optimization
Some Uses of Performance Counters
• Traditional analysis and optimization
• Finding architectural reasons for slowdown
• Validating simulators
• Auto-tuning
• Operating system optimization
• Estimating power/energy in software
Linux and Performance Counters
• Linux has become the operating system of choice in many domains
• It runs most of the Top500 list (over 90%), on down to embedded devices (Android phones)
• Until recently it had no easy access to hardware performance counters, limiting code analysis and optimization
Linux Performance Counter History
• oprofile – system-wide sampling profiler, in the kernel since 2002
• perfctr – widely used general interface available since 1999; required patching the kernel
• perfmon2 – another general interface; included in the kernel for Itanium, later made generic, with a big push for kernel inclusion
Linux perf_event
• Developed in response to perfmon2 by Molnar and Gleixner in 2009
• Merged in 2.6.31 as "PCL"
• Unusual design pushes most functionality into the kernel
• Not well documented nor well characterized
perf_event Interface
• perf_event_open() system call
• complex perf_event_attr structure (over 40 fields)
• counters started/stopped with ioctl() calls
• values read either with read() or as samples in an mmap() circular buffer
perf_event Kernel Features
• Generalized Events – commonly used events on various architectures are given common names
• Event Scheduling – the kernel handles mapping events to appropriate counters
• Multiplexing – if there are more events than counters, time-based multiplexing extrapolates full counts
• Per-process counts – values are saved on context switch
• Software Events – kernel events exposed through the same API
Advanced Hardware Features
• Offcore Response – filtered measurement of memory accesses that leave the core
• Uncore and Northbridge Events – special support needed for shared resources (L2, L3, memory)
• Sampled Interfaces
+ AMD Instruction-Based Sampling (IBS) – can provide address, latency, etc., as well as minimal skid
+ Intel Precise Event-Based Sampling (PEBS) – gathers extra data on the triggering event (registers, latency), with low skid
Virtualized Counters
• Recent versions of KVM can trap on accesses to the performance MSRs and pass in guest-specific performance counts, allowing use of performance counters in a virtualized environment
• Counter values have to be saved/restored when the guest is scheduled
More on Generalized Events
• Unlike events provided by user-space libraries (such as PAPI), it is hard to know what the actual underlying event is (this is changing)
• Kernel event definitions are sometimes wrong, and it is a lot more hassle to update the kernel than to update a library
Generalized Events – Wrong Events
Until 2.6.35 the generalized "branches" event accidentally mapped to "taken branches"
[Figure: 176.gcc.static.166 – total vs. taken branch counts per 100M-instruction interval; 164.gzip.static.log – branch miss %, misses/total vs. misses/taken]
Generalized Events – Similar Events, Different Meaning
For L1 DCache Loads on Nehalem:
• perf_event defines L1D.OP_READ.RESULT_ACCESS (perf: L1-dcache-loads) as MEM_INT_RETIRED:LOADS
• PAPI defines PAPI_L1_DCR as L1D_CACHE_LD:MESI
[Figure: 181.mcf.static.default – L1 DCache loads per 100M-instruction interval, PAPI vs. perf_event]
Context-Switch Test Methodology
• To provide per-process events, counts have to be saved on context switch. This has overhead
• We use the lmbench lat_ctx benchmark, run with and without perf measuring it
• Up to 20% overhead when perf monitors the threads. The benchmark documentation claims 10-15% accuracy at best
Core2 Context-Switch Overhead
[Figure: core2 context-switch time (µs) for perf_event, perfctr, and perfmon2, each inactive vs. active, across kernel versions 2.6.30 through 3.4]
Common Performance Counter Usage Models
• Aggregate
• Sampled
• Self-monitoring
Linux perf_event can do all three.
Aggregate Counts

$ perf stat -e instructions,cycles,branches,branch-misses,cache-misses \
    ./matrix_multiply_atlas
Matrix multiply sum: s=3650244631906855424.000000

 Performance counter stats for './matrix_multiply_atlas':

   194,492,378,876 instructions    # 2.51 insns per cycle
    77,585,141,514 cycles          # 0.000 GHz
       584,202,927 branches
         3,963,325 branch-misses   # 0.68% of all branches
        89,863,007 cache-misses

      49.973787489 seconds time elapsed

perf_event sets up the events, forks the process (starting counts on exec()), handles overflow, waits for exit, and prints totals.
Sampled Profiling

$ perf record ./matrix_multiply_atlas
Matrix multiply sum: s=3650244631906855424.000000
[ perf record: Woken up 14 times to write data ]
[ perf record: Captured and wrote 3.757 MB perf.data (~164126 samples) ]
$ perf report
Events: 98K cycles
97.36%  matrix_multiply  libblas.so.3.0         [.] ATL_dJIK48x48x48TN48x48x0_
 0.62%  matrix_multiply  matrix_multiply_atlas  [.] naive_matrix_multiply
 0.27%  matrix_multiply  libblas.so.3.0         [.] 0x1f1728
 0.18%  matrix_multiply  libblas.so.3.0         [.] ATL_dupMBmm0_8_0_b1
 0.16%  matrix_multiply  libblas.so.3.0         [.] ATL_dupKBmm8_2_1_b1
 0.14%  matrix_multiply  libblas.so.3.0         [.] ATL_dupNBmm0_1_0_b1
 0.13%  matrix_multiply  libblas.so.3.0         [.] ATL_dcol2blk_a1
 0.09%  matrix_multiply  [kernel.kallsyms]      [k] page_fault

Periodically sample, grab state, and record it for later analysis.
Self-Monitoring

retval = PAPI_library_init(PAPI_VER_CURRENT);
if (retval != PAPI_VER_CURRENT) fprintf(stderr, "Wrong PAPI version\n");

retval = PAPI_create_eventset(&event_set);
if (retval != PAPI_OK) fprintf(stderr, "Error creating eventset\n");

retval = PAPI_add_named_event(event_set, "PAPI_TOT_INS");
if (retval != PAPI_OK) fprintf(stderr, "Error adding event\n");

retval = PAPI_start(event_set);

naive_matrix_multiply(0);

retval = PAPI_stop(event_set, &count);

printf("Total instructions: %lld\n", count);
Self-Monitoring Overhead
• The typical pattern is Start/Stop/Read
• Want the minimum possible overhead
• Read performance is typically the most important, especially when doing multiple reads
Methodology
• DVFS disabled
• Use the rdtsc() 64-bit timestamp counter, which typically has ~150 cycles of overhead
• Measure start/stop/read with no code in between
• All three (start/stop/read) measured at the same time
• Environment variables should not matter
perf_event Measurement Code

start_before = rdtsc();
ioctl(fd[0], PERF_EVENT_IOC_ENABLE, 0);
start_after = rdtsc();
ioctl(fd[0], PERF_EVENT_IOC_DISABLE, 0);
stop_after = rdtsc();
read(fd[0], buffer, BUFFER_SIZE * sizeof(long long));
read_after = rdtsc();
perfctr Measurement Code

start_before = rdtsc();
perfctr_ioctl_w(fd, VPERFCTR_CONTROL,
                &control, &vperfctr_control_sdesc);
start_after = rdtsc();
cstatus = kstate->cpu_state.cstatus;
nrctrs = perfctr_cstatus_nrctrs(cstatus);
retry:
tsc0 = kstate->cpu_state.tsc_start;
rdtscl(now);
sum.tsc = kstate->cpu_state.tsc_sum + (now - tsc0);
for (i = nrctrs; --i >= 0;) {
    rdpmcl(kstate->cpu_state.pmc[i].map, now);
    sum.pmc[i] = kstate->cpu_state.pmc[i].sum +
                 (now - kstate->cpu_state.pmc[i].start);
}
if (tsc0 != kstate->cpu_state.tsc_start) goto retry;
read_after = rdtsc();
_vperfctr_control(fd, &control_stop);
stop_after = rdtsc();
perfmon2 Measurement Code

start_before = rdtsc();
pfm_start(ctx_fd, NULL);
start_after = rdtsc();
pfm_stop(ctx_fd);
stop_after = rdtsc();
pfm_read_pmds(ctx_fd, pd, inp.pfp_event_count);
read_after = rdtsc();
Overall Overhead / 1 Event, AMD Athlon64
Boxplot: 25th/median/75th, stddev whiskers, outliers
[Figure: amd0fh overall overhead of start/stop/read with 1 event, in cycles, across 2.6.30-perfmon2, 2.6.32-perfctr, kernels 2.6.32 through 3.5.0, and 3.4.0/3.5.0 with rdpmc]
Overall Overhead / 1 Event, Intel Atom
[Figure: atom overall overhead of start/stop/read with 1 event, in cycles, across 2.6.30-perfmon2, 2.6.32-perfctr, kernels 2.6.32 through 3.5.0, and 3.4.0/3.5.0 with rdpmc]
Overall Overhead / 1 Event, Intel Core2
[Figure: core2 overall overhead of start/stop/read with 1 event, in cycles, across the same kernel and interface versions]
Start Overhead / 1 Event, Intel Core2
[Figure: core2 overhead of start with 1 event, in cycles]
Stop Overhead / 1 Event, Intel Core2
[Figure: core2 overhead of stop with 1 event, in cycles]
Read Overhead / 1 Event, Intel Core2
perfctr uses rdpmc
[Figure: core2 overhead of read with 1 event, in cycles]
Overall Overhead / Multiple Events, Core2
[Figure: core2 overall start/stop/read overhead (cycles) vs. number of simultaneous events (1-4), for 2.6.30-perfmon2, 2.6.32-perfctr, 2.6.32, 3.5.0, and 3.5.0-rdpmc]
Self-Monitoring Overhead Summary
• perfmon2 has low overhead due to a very thin layer over the hardware, with most of the work done in userspace
• perfctr has very fast rdpmc reads
• Some of perf_event's overhead comes from key tasks being in-kernel, which cannot be done before starting events
• Is 20,000 cycles too much to get an event count? Unclear, but perfctr is much faster, showing there is room for improvement
New Non-perf_event Developments
• LIKWID – bypasses the Linux kernel and accesses the MSRs directly. Low overhead, but system-wide only, and conflicts with perf_event
• LiMiT – a new patch interface similar to perfctr
Future Work
• AMD Lightweight Profiling (LWP) – (Bulldozer) events can be set up and read purely from userspace
• The Intel Xeon Phi spflt userspace setup instruction
• Investigate causes of overhead in greater depth, as well as rdpmc performance issues
• What can we learn from the low overhead of perfctr and perfmon2?
Questions?
[email protected]
All code and data is available:
git clone git://github.com/deater/perfevent_overhead.git