Efficient On-the-Fly Data Race Detection in Multithreaded C++ Programs
Eli Pozniansky & Assaf Schuster
IBM Software Testing Seminar 02

What is a Data Race?
- Two concurrent accesses to a shared location, at least one of which is a write.
- Usually indicative of a bug.

    Thread 1: X++;   Z = 2;
    Thread 2: T = Y; T = X;     // T = X races with X++

How to Prevent Data Races?
- Explicit synchronization: locks, critical sections, barriers, mutexes, semaphores, monitors, events, etc.

    Thread 1: Lock(m); X++;   Unlock(m);
    Thread 2: Lock(m); T = X; Unlock(m);

Is This Enough?
- Yes... and no! It is programmer dependent:
  - For correctness, the programmer will overdo it -> we need tools to detect data races.
  - Synchronization is expensive: for efficiency, the programmer will spend a lifetime removing redundant synchronization -> we need tools to remove excessive synchronization.

Detecting Data Races?
- NP-hard [Netzer & Miller 1990], where the input size is the number of instructions performed:
  - even for only 3 threads;
  - even with no loops or recursion.
- Further obstacles:
  - (#threads)^(thread length) execution orders/schedulings, times the number of inputs;
  - the detection code's own side effects;
  - weak memory, instruction reordering, atomicity.

Where is Waldo?
(A similar problem was found in java.util.Vector.)

    #define N 100
    Type* g_stack = new Type[N];
    int g_counter = 0;
    Lock g_lock;

    void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
    void pop( Type& obj )  { lock(g_lock); ... unlock(g_lock); }

    void popAll( ) {
        lock(g_lock);
        delete[] g_stack;
        g_stack = new Type[N];
        g_counter = 0;
        unlock(g_lock);
    }

    int find( Type& obj, int number ) {
        lock(g_lock);
        int i;
        for (i = 0; i < number; i++)
            if (obj == g_stack[i])
                break;             // Found!
        if (i == number)
            i = -1;                // Not found: return -1 to caller
        unlock(g_lock);
        return i;
    }

    int find( Type& obj ) {
        return find( obj, g_counter );   // Waldo: g_counter is read without g_lock held
    }

Apparent Data Races
- Defined only by the behavior of the explicit synchronization, not by program semantics.
- Easier to locate, but less accurate.
- Example. Initially: grades = oldDatabase; updated = false;
    Thread T.A.:
        grades = newDatabase;
        updated = true;
    Thread Lecturer:
        while (updated == false);
        X := grades.gradeOf(lecturersSon);
- Apparent data races exist iff a "real" data race exists.
- Detection is still NP-hard.

Detection Approaches
- Restricted programming model (e.g., fork-join).
- Static: Emrath & Padua 88; Balasundaram & Kennedy 89; Mellor-Crummey 93; Flanagan & Freund 01.
- Postmortem (usually fork-join): Netzer & Miller 90, 91; Adve & Hill 91.
- On-the-fly: Nudler & Rudolph 88; Dinning & Schonberg 90, 91; Savage et al. 97; Itzkovits et al. 99; Perkovic & Keleher 00.
- Issues: programming model, synchronization method, memory model, accuracy, overhead, granularity, coverage.

MultiRace Approach
- On-the-fly detection of "apparent data races".
- Two detection algorithms:
  - Lockset [Savage, Burrows, Nelson, Sobalvarro, Anderson 97]
  - Djit+ [Itzkovitz, Schuster, Zeev-Ben-Mordechai 99]
- Correct even for weak memory systems.
- Detection granularity of variables and objects: especially suited for OO programming languages.
- Source-code (C++) instrumentation + memory mappings: transparent and low overhead.

Djit+: Apparent Data Races
- Based on Lamport's happens-before partial order: a and b are concurrent if neither a ->hb b nor b ->hb a.
  - Two concurrent accesses form an apparent data race.
  - Otherwise, they are "synchronized".
- Djit+ basic idea: check each access performed against all "previously performed" accesses.

    Thread 1        Thread 2
    a
    Unlock(L)
         a ->hb b
                    Lock(L)
                    b

Djit+: Local Time Frames (LTF)
- The execution of each thread is split into a sequence of time frames.
- A new time frame starts on each unlock.

    Thread          LTF
    x = 1            1
    lock( m1 )
    z = 2            1
    lock( m2 )
    y = 3            1
    unlock( m2 )
    z = 4            2
    unlock( m1 )
    x = 5            3
- For every access there is a timestamp: the vector of LTFs known to the thread at the moment the access takes place.

Djit+: Vector Time Frames (VTF)
- Each thread t keeps a vector st_t[.]:
  - st_t[t] is the current LTF of thread t;
  - st_t[u] stores the latest LTF of u known to t.
- If u is an acquirer of t's unlock:
    for k = 0 to maxthreads-1
        st_u[k] = max( st_u[k], st_t[k] )

Djit+: Vector Time Frames Example

    Thread 1 (1 1 1): write x; unlock( m1 )              -> Thread 1 becomes (2 1 1)
    Thread 2 (1 1 1): read z; lock( m1 ) -> (2 1 1);
                      read y; unlock( m2 ) -> (2 2 1); write x
    Thread 3 (1 1 1): lock( m2 ) -> (2 2 1); write x

Djit+: Checking Concurrency

    P(a, b)  =  ( a.type = write  OR  b.type = write )
           AND  ( a.time_frame >= st_{b.thread_id}[a.thread_id] )

- a was logged and tested earlier; b is currently performed.
- P returns TRUE iff a and b form a data race.

Djit+: Which Accesses to Check?
- Let a be in thread t1, and let b and c be in thread t2 in the same time frame, with b preceding c in program order.
- If a and b are synchronized, then a and c are synchronized as well -> c needs no logging.
- => It is sufficient to record only the first read access and the first write access to a variable in each time frame.

    Thread 1                       Thread 2
    lock( m )
    write X      <- logged
    read X       <- logged
    write X      <- no logging
    unlock( m )
                                   lock( m )
                                   read X      <- logged
                                   write X     <- logged
                                   unlock( m )
    read X       <- new time frame: logged; races with Thread 2's write X

Djit+: Which Time Frames to Check?
- a currently occurs in t1; b and c occurred previously in t2, with b before c.
- If a is synchronized with c, then it is certainly synchronized with b.
- => It is sufficient to check the current access against only the "most recent" accesses in each of the other threads.

    Thread 1           Thread 2
                       b
                       unlock( ... )
                       c
                       unlock( m )
    lock( m )
    a
Djit+: Access History
- On each first read and first write to v in a time frame, the thread updates the access history of v.
- If the access to v is a read, the thread checks all recent writes to v by other threads.
- If the access is a write, the thread checks all recent reads as well as all recent writes to v by other threads.

    Access history of v:
      r-tf1  r-tf2  ...  r-tfn    (time frames of recent reads from v, one per thread)
      w-tf1  w-tf2  ...  w-tfn    (time frames of recent writes to v, one per thread)

Djit+: Pros and Cons
+ No false alarms.
+ No missed races (in a given scheduling).
- Very sensitive to differences in scheduling; requires an enormous number of runs, and even then cannot prove the tested program race free.

Lockset: The Basic Algorithm
- C(v): the set of locks that protected all accesses to v so far.
- locks_held(t): the set of locks currently acquired by thread t.
- Algorithm:
    For each v, initialize C(v) to the set of all possible locks.
    On each access to v by thread t:
        lh_v <- locks_held(t)
        if the access is a read: lh_v <- lh_v ∪ {readers_lock}
        C(v) <- C(v) ∩ lh_v
        if C(v) = ∅, issue a warning

Lockset: Example
(r_l = readers_lock; it prevents multiple reads from generating false alarms.)

    Thread 1        Thread 2        lh_v           C(v)
                                    {}             {m1, m2, r_l}
    lock( m1 )
    read v                          {m1, r_l}      {m1, r_l}
    unlock( m1 )                    {}
                    lock( m2 )
                    write v         {m2}           {}    <- warning: the locking
                    unlock( m2 )    {}                      discipline for v is violated!

Lockset
- Locking discipline: every shared location is consistently protected by some lock.
- Lockset detects violations of this locking discipline, as in the example above, where v is protected by m1 in one thread and by m2 in the other.

Lockset vs.
Djit+

    Thread 1           Thread 2
    y++         [1]
    lock( m )
    v++
    unlock( m )
                       lock( m )
                       v++
                       unlock( m )
                       y++         [2]

- [1] ->hb [2] in this run, so Djit+ sees no race; yet there is a data race on y under a different scheduling. Lockset catches it, since the locking discipline for y is violated.

Lockset: Which Accesses to Check?
- Let Locks_u(v) be the set of real locks held during access u to v.
- If a and b access v in the same thread and the same time frame, with a preceding b, then Locks_a(v) ⊆ Locks_b(v): no unlock occurs inside a time frame, so locks are only added.

    Thread              Locks_u(v)
    lock( m1 )
    write x             {m1}
    write x             {m1}
    lock( m2 )
    write x             {m1, m2}
    unlock( m2 )
    unlock( m1 )

- It follows that [C(v) ∩ Locks_a(v)] ⊆ [C(v) ∩ Locks_b(v)]; hence if C(v) ∩ Locks_a(v) ≠ ∅ then C(v) ∩ Locks_b(v) ≠ ∅.
- => Only first accesses need be checked in every time frame.

Lockset: Pros and Cons
+ Less sensitive to scheduling.
- Lots of false alarms.
- Still dependent on scheduling: cannot prove the tested program race free.

Combining Djit+ and Lockset
- Lockset can detect suspected races in more execution orders.
- Djit+ can filter out the spurious races reported by Lockset.
- The number of checks performed by Djit+ can be reduced using the additional information obtained from Lockset.
- The implementation overhead comes mainly from access logging, which can be shared by both algorithms.
    [Venn diagram relating, for a program P: all shared locations in P; violations detected by Lockset in P; all apparently raced locations in P; all feasibly raced locations in P; raced locations detected by Djit+ in P.]

Implementing Access Logging: Recording First Accesses
- Each shared variable (here X and Y) in physical shared memory is mapped into virtual memory through three views: No-Access, Read-Only, and Read-Write; each thread accesses it through one of the views.
- Page faults are caught by MultiRace; the handler is a detector.

Swizzling Between Views
- A first read through the No-Access view raises a read fault: the access is logged and the thread switches to the Read-Only view; a first write raises a write fault and switches to the Read-Write view.
- On unlock, a new time frame starts and the thread reverts to the No-Access view, so the first accesses of the new frame fault again.

Detection Granularity
- A minipage (= detection unit) can contain:
  - objects of primitive types (char, int, double, etc.);
  - objects of complex types (classes and structures);
  - entire arrays of complex or primitive types.
- An array can be placed on a single minipage or split across several minipages; the array still occupies contiguous addresses.

Playing with Detection Granularity to Reduce Overhead
- Large minipages => fewer faults => reduced overhead.
- A minipage should be refined into smaller minipages when suspicious alarms occur; replay technology can help (if available).
- When the suspicion is resolved, regroup; detection may be disabled on the accesses involved.

    [Chart: slowdowns of FFT (logarithmic scale) for 1 to 64 threads, as a function of the number of complex numbers per minipage (1, 2, 4, ..., 256, entire array).]

Instrumentation (1)
- new and malloc are overloaded and take two additional arguments: the number of requested elements and the number of elements per minipage.

    Type* ptr = new(50, 1) Type[50];

- Every class Type inherits from the SmartProxy<Type> template class:
    class Type : public SmartProxy<Type> { ... };

- The functions of the SmartProxy<Type> class return a pointer or a reference to the instance through its correct view.

Instrumentation (2)
- All occurrences of potentially shared primitive types are wrapped into fully functional classes:

    int   =>   class _int_ : public SmartProxy<int> { int val; ... };

- Global and static objects are copied to our shared space during initialization. Since there is no source code for them, the objects are 'touched':

    memcpy( dest->write(n), src->read(n), n*sizeof(Type) );

- All accesses to class data members are instrumented in the member functions:

    data_mmbr = 0;   =>   smartPointer()->data_mmbr = 0;

Example of Instrumentation

    void func( Type* ptr, Type& ref, int num )
    {
        for ( int i = 0; i < num; i++ ) {
            ptr->smartPointer()->data += ref.smartReference().data;
            ptr++;
        }
        Type* ptr2 = new(20, 2) Type[20];   // minipage size (2) is specified by the
                                            // user through source-code annotation
        memset( ptr2->write(20), 0, 20*sizeof(Type) );
        ptr = &ref;                         // no change!
        ptr2[0].smartReference() = *ptr->smartPointer();
        ptr->member_func( );
    }

Loop Optimizations
- Original code:

    for ( i = 0; i < num; i++ )
        arr[i].data = i;

- Instrumented code (very expensive):

    for ( i = 0; i < num; i++ )
        arr[i].smartReference().data = i;

- Optimized code (entire array on a single minipage):

    if ( num > 0 )
        arr[0].smartReference().data = 0;   // touch first element
    for ( i = 1; i < num; i++ )             // i runs from 1, not from 0
        arr[i].data = i;                    // access the rest of the array without faults
- Additional optimization (no synchronization inside the loop):

    arr.write( num );                       // touch for writing all minipages of the array
    for ( i = 0; i < num; i++ )
        arr[i].data = i;

  Efficient when the number of elements in the array is high and the number of minipages is low.

Reporting Races in MultiRace: Benchmark Specifications (2 threads)

    Benchmark | Input Set                          | Shared Memory | # Minipages | # Write/Read Faults | # Time Frames | Time in sec (no DR)
    FFT       | 2^8 * 2^8                          | 3MB           | 4           | 9/10                | 20            | 0.054
    IS        | 2^23 numbers, 2^15 values          | 128KB         | 3           | 60/90               | 98            | 10.68
    LU        | 1024*1024 matrix, 32*32 blocks     | 8MB           | 5           | 127/186             | 138           | 2.72
    SOR       | 1024*2048 matrices, 50 iterations  | 8MB           | 2           | 202/200             | 206           | 3.24
    TSP       | 19 cities, recursion level 12      | 1MB           | 9           | 2792/3826           | 678           | 13.28
    WATER     | 512 molecules, 15 steps            | 500KB         | 3           | 15438/15720         | 15636         | 9.55

Benchmark Overheads
    [Chart: overheads (DR / no DR) of the six benchmarks for 1 to 64 threads, on a 4-way IBM Netfinity server, 550MHz, Windows NT.]

Overhead Breakdown
    [Charts: TSP and WATER overheads broken down into No DR, +Instrumentation, +Write Faults, +Read Faults, +Djit+, +Lockset, for 1 to 64 threads; the numbers above the bars are # write/read faults.]
- Most of the overhead comes from page faults.
- The overhead due to the detection algorithms themselves is small.

Summary
MultiRace:
- is transparent;
- supports two-way and global synchronization primitives: locks and barriers;
- detects races that actually occurred (Djit+);
- usually does not miss races that could occur with a different scheduling (Lockset);
- is correct for weak memory models;
- imposes much lower overhead than existing tools;
- exhibits variable detection granularity.

Conclusions
- MultiRace makes it easier for programmers to trust their programs:
  - no need to add synchronization "just in case";
  - in case of doubt, MultiRace should be activated each time the program executes.

Future Work
- Implement an instrumenting pre-compiler.
- Higher transparency; higher scalability.
- Automatic dynamic granularity.
- Integrate with a scheduling generator and with record/replay.
- Integrate with the compiler/debugger; may get rid of faults and views.
- Optimizations through static analysis.

The End

Minipages and Dynamic Granularity of Detection
- A minipage is a shared location that can be accessed using the approach of views.
- We detect races on minipages, not on a fixed number of bytes.
- Each minipage is associated with the access history of Djit+ and a Lockset state.
- The size of a minipage can vary.

Implementing Access Logging
- In order to record only the first accesses (reads and writes) to shared locations in each time frame, we use the concept of views.
- A view is a region in virtual memory; each view has its own protection: No-Access / Read-Only / Read-Write.
- Each shared object in physical memory can be accessed through each of the three views.
- This distinguishes reads from writes, enables the dynamic detection unit, and avoids the false-sharing problem.

Disabling Detection
- Obviously, Lockset can report false alarms.
- Djit+, too, detects apparent races that are not necessarily feasible races: intentional races, unrefined granularity, private synchronization.
- Detection can be disabled through the use of source-code annotations.

Overheads
- The overheads are steady for 1-4 threads: we are scalable in the number of CPUs.
- The overheads increase for a high number of threads.
- The number of page faults (both read and write) increases linearly with the number of threads.
- In fact, any on-the-fly tool for data race detection will be unscalable in the number of threads when the number of CPUs is fixed.

Instrumentation Limitations
- Currently, there is no transparent solution for instrumenting global and static pointers:
  - to monitor all accesses to these pointers, they must be wrapped into classes, so the compiler's automatic pointer conversions are lost;
  - this will not be a problem in Java.
- All data members of the same instance of a class always reside on the same minipage.
  - In the future, classes will be split dynamically.

Breakdowns of Overheads
    [Charts: FFT, IS, LU and SOR overheads broken down into No DR, +Instrumentation, +Write Faults, +Read Faults, +Djit+, +Lockset, for 1 to 64 threads.]

References
- T. Brecht and H. Sandhu. The Region Trap Library: Handling Traps on Application-Defined Regions of Memory. In USENIX Annual Technical Conference, Monterey, CA, June 1999.
- A. Itzkovitz, A. Schuster, and O. Zeev-Ben-Mordechai. Towards Integration of Data Race Detection in DSM Systems. In The Journal of Parallel and Distributed Computing (JPDC), 59(2): pp. 180-203, Nov. 1999.
- L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System.
  In Communications of the ACM, 21(7): pp. 558-565, Jul. 1978.
- F. Mattern. Virtual Time and Global States of Distributed Systems. In Parallel & Distributed Algorithms, pp. 215-226, 1989.

References (cont.)
- R. H. B. Netzer and B. P. Miller. What Are Race Conditions? Some Issues and Formalizations. In ACM Letters on Programming Languages and Systems, 1(1): pp. 74-88, Mar. 1992.
- R. H. B. Netzer and B. P. Miller. On the Complexity of Event Ordering for Shared-Memory Parallel Program Executions. In 1990 International Conference on Parallel Processing, 2: pp. 93-97, Aug. 1990.
- R. H. B. Netzer and B. P. Miller. Detecting Data Races in Parallel Program Executions. In Advances in Languages and Compilers for Parallel Processing, MIT Press, 1991, pp. 109-129.

References (cont.)
- S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. E. Anderson. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. In ACM Transactions on Computer Systems, 15(4): pp. 391-411, 1997.
- E. Pozniansky. Efficient On-the-Fly Data Race Detection in Multithreaded C++ Programs. Research Thesis.
- O. Zeev-Ben-Mordehai. Efficient Integration of On-The-Fly Data Race Detection in Distributed Shared Memory and Symmetric Multiprocessor Environments. Research Thesis, May 2001.
Overheads
    [Charts: DR and no-DR running times and their ratio (DR / no DR) for FFT, IS, LU, SOR, TSP and WATER, for 1 to 64 threads.]
- The testing platform:
  - 4-way IBM Netfinity, 550 MHz;
  - 2GB RAM;
  - Microsoft Windows NT.