How many cores are too many cores?
Dr. Avi Mendelson, Intel - Mobile Processors Architecture group
[email protected]
Dr. Avi Mendelson - Talk in IBM seminar/HiPEAC – 2007

Disclaimer
• No Intel proprietary information is disclosed.
• Every future estimate or projection is only speculation.
• Responsibility for all analysis, opinions and conclusions falls on the author only. It does not mean you cannot trust them… ☺

Agenda
• What is the motivation for the new generation of parallel processors on a die?
• Comparing many-cores vs. multi-cores vs. asymmetric-core approaches
• My conclusions and future directions

Motivation of the new trend

“The good old days” of computer architecture
• Performance used to come from process technology and from new single-threaded architectural improvements.
• Every 18-24 months a new process is announced.
  • The target of any new process is to shrink the transistor dimensions by a factor of 0.7 (ideal shrink).
  • As a result, the same die area holds more transistors, each of which runs at a higher frequency.
• Every 5-7 years we had a new architecture generation.
  • It used to target better single-threaded performance.

Simple process scaling rules
If optimal scaling could be achieved (0.7 shrink), we can:
• Double the number of transistors on the same area
• Improve the frequency (performance) by ~50%
• Consume the same power for the same area
• Preserve the power density

Ideal scenarios...
                      Ideal “shrink”      Ideal new μarch
                      (same μarch)        (same die size)
#Xistors              1X                  2X
Size                  0.5X                1X
Frequency             1.5X                1.5X
Power                 0.5X                1X
IPC (instr./cycle)    1X                  2X
Performance           1.5X                3X
Power density         1X                  1X
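As a rough illustration of the ideal scaling rules for a 0.7 shrink, the Python sketch below recomputes the transistor count, frequency and power figures. It assumes classical constant-field scaling, in which capacitance and voltage both scale with the shrink factor, and it uses the active power relation P = αCV^2·f; that scaling assumption and all function and variable names are illustrative and are not taken from the talk.

```python
def ideal_shrink(s=0.7):
    """Relative figures after an ideal shrink by linear factor s.

    Assumption (not from the talk): capacitance and voltage scale with s,
    and gate delay scales with s, so frequency scales with 1/s.
    """
    area_per_transistor = s * s                  # ~0.5X size
    transistors_same_area = 1.0 / area_per_transistor  # ~2X transistors
    frequency = 1.0 / s                          # ~1.4X, quoted as ~1.5X
    capacitance = s
    voltage = s
    # Active power per transistor: P ~ C * V^2 * f (activity held constant)
    power_per_transistor = capacitance * voltage ** 2 * frequency
    power_same_area = power_per_transistor * transistors_same_area
    return {
        "transistors_same_area": transistors_same_area,
        "frequency": frequency,
        "power_per_transistor": power_per_transistor,
        "power_same_area": power_same_area,      # also the power density
    }


if __name__ == "__main__":
    for name, value in ideal_shrink().items():
        print(f"{name}: {value:.2f}X")
```

Under these assumptions the power for the same area stays at 1X, which is the "preserve the power density" rule; the slides' round numbers (2X, 1.5X, 0.5X) are approximations of the computed values.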

Process technologies – reality
But in reality:
• The new process is not ideal anymore
• New designs push frequency to 2X per process (Moore’s law)
• New designs use more transistors (2X-3X to get 1.5X-1.7X performance)
• The die size increases
So, with every new process and architecture generation:
• Power goes up about 2X
• Power density goes up 30%~80%
This is bad, and it will get worse in future process generations:
• Voltage (Vdd) will scale down less
• No ideal shrinking anymore
• Leakage is going through the roof, and it depends heavily on temperature

Power density
[Figure: power density (Watts/cm²) vs. process generation (1.5μ down to 0.07μ) for i386, i486, Pentium®, Pentium® Pro, Pentium® II, Pentium® III and Pentium® 4, with hot plate, nuclear reactor and rocket nozzle marked as reference levels.]
* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp. Micro32 conference keynote, 1999.

Outcome
• We can’t build microprocessors with ever-increasing power density and die sizes
• The constraint is power and power density – not manufacturability
• The design of any future microprocessor should take power into consideration. We need to distinguish between different aspects of power:
  • Max power (TJ)
  • Power density - hot spots
  • Energy → static + dynamic
• In order to achieve better single-threaded performance at the same power envelope, we need new micro-architecture innovation

Better single-threaded performance is too difficult to achieve at the same power envelope, so let’s go parallel
• Intel® Core™ Duo (Yonah) was the first CMP processor Intel developed for the mobile market.
• The following processor, Core™ 2 Duo (Merom), is used for all the different segments.
(More information can be found in the Intel Technology Journal, May 2006.)

Theoretical calculation of single-core vs. CMP power for the same performance
• In theory, power increases with the cube of the frequency (2.5 is a more realistic exponent).
• Assume that frequency approximates performance.
• Doubling performance by increasing the frequency grows the power super-linearly (by 2^2.5 to 2^3, i.e. roughly 5.7X to 8X).
• Doubling performance by adding another core grows the power linearly.
• Conclusion: as long as enough parallelism exists, it is always more efficient to double the number of cores rather than the frequency in order to achieve the same performance.

CPU architecture - multicores
[Figure: performance vs. power for uniprocessors and CMP, with the power wall and the MP overhead marked.]
Uniprocessors have lower power efficiency due to higher speculation and complexity.
Source: Tomer Morad, Ph.D. student, Technion

Can we partition the system to have more cores forever?
There are at least three camps in the computer architecture community:
• Multi-cores – the system will continue to contain a small number of “big cores” – Intel, AMD, IBM, etc.
• Many-cores – the system will contain a large number of “small cores” – Sun
• Asymmetric cores – a combination of a small number of big cores together with a large number of small cores – IBM Cell architecture

How many cores are too many cores? – A simple analytical model
Basic assumptions:
• The performance of a core is proportional to the square root of its area (Pollack’s rule)
• The thermal behavior of each core is proportional to its power, and the overall thermal capacitance is T = ∑(1/Ti)
• Power – work done per time unit (Watts)
• Active power: P = α·C·V^2·f (α: activity, C: capacitance, V: voltage, f: frequency)
• Static power is out of the scope of this model.
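The single-core vs. CMP power comparison can be checked with a few lines of Python. This is a minimal sketch under the talk's own simplifications (frequency approximates performance, power grows as frequency to the power 2.5 or 3, and the workload is perfectly parallel); the function names are hypothetical.

```python
def power_by_frequency(perf_gain, exponent=2.5):
    """Relative power when performance is raised by frequency alone.

    Assumes power ~ frequency**exponent and performance ~ frequency.
    """
    return perf_gain ** exponent


def power_by_cores(perf_gain):
    """Relative power when performance is raised by adding identical cores.

    Assumes perfect parallelism, so power grows linearly with core count.
    """
    return float(perf_gain)


if __name__ == "__main__":
    gain = 2  # double the performance
    for exponent in (2.5, 3.0):
        by_freq = power_by_frequency(gain, exponent)
        by_cores = power_by_cores(gain)
        print(f"exponent {exponent}: frequency route {by_freq:.2f}X power, "
              f"core route {by_cores:.2f}X power")
```

The gap between the two routes widens quickly with the performance target, which is the argument for going parallel whenever enough parallelism exists.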

Small cores – big cores
Suppose the same area is partitioned into m smaller processors:
• Each core has an area of A/m; together the m cores deliver a performance of m*SQRT(A/m) at a total power of K1*(C/m * F^2.5)*m = K2
• For the same power we can get much better performance.
Suppose we would like to achieve the same performance:
• Each core can run at F/m
• Pnew = m*K*(C/m)*(F/m)^2.5 = K*C*(F/m)^2.5
• Pold/Pnew = m^2.5
So if m → ∞, then P → 0 and Perf. → ∞
But:
• F ≠ performance
• Interconnect does not scale
• Static power matters
And if we try to build it:
• A small die allows only simple architectures
• Memory BW and latency do not scale

Example: caches and I/O traffic
• In order to reduce I/O, we need to increase the local memory (caches) accordingly. Thus the proportion of memory within the overall logic increases over time.
• Let’s assume a hierarchy model where at each node half of the area is devoted to memory and half to logic
• At most 2 cores can share a memory hierarchy
[Figure: area devoted to logic in the hierarchy model. Recoverable labels: 1, 2, 4, 16, 32 and 1/2, 1/4, 1/4, 1/8, 1/8.]

Implications
• When increasing the number of cores, the “active” area of each core is less than 1/m
• This further reduces the performance of each core
• Another example of the slogan: "The difference between theory and practice is always greater in practice than it is in theory"
[Figure captions: Opteron: 2M L2, 2x64K L1. K8L: 512Kx4 L2, 4x2x64K L1, shared 2M L3 (extendable).]

Symmetric multiprocessor model
• Parallelism coefficient: 0 ≤ λ ≤ 1
• Area of each core: a
• Number of cores: n
[Figure: four cores, CPU1 to CPU4.]
λ denotes the number of dynamic instructions within parallel phases, divided by the total number of dynamic instructions.
• Fully serial programs have λ=0
• Fully parallel programs have λ=1
Source: Tomer Y. Morad, Uri C. Weiser, Avinoam Kolodny, Mateo Valero, and Eduard Ayguadé.
“Performance, Power Efficiency, and Scalability of Asymmetric Cluster Chip Multiprocessors.” In Computer Architecture Letters, Volume 4, July 2005.

Symmetric CMP performance (λ=0.75)
[Figure: Symmetric CMP Performance vs. Power. Relative performance against relative die size/power, with curves for the symmetric upper bound and for symmetric CMPs with a=8, a=4, a=2 and a=1.]

Asymmetric approach – single application
• Large core area: βa (β>1)
• Small cores of area: a
• Serial phases execute on the large core
• Parallel phases execute on all cores
• This model can be extended to other models such as GPGPU and parallel execution between BIG and small cores
[Figure: the serial phase runs on the large core (βa); the parallel phase runs on all cores.]

CPU architecture - multicores
[Figure: performance vs. power for uniprocessors, CMP and ACCMP, with the power wall and the MP overhead marked.]
Uniprocessors have lower power efficiency due to higher speculation and complexity.
Source: Tomer Morad, Ph.D. student, Technion

Conclusions for symmetric cores
• The sequential portion of the code governs the overall performance
• Adding cores (in a naïve way) may reduce the performance
  • For the same area, the sequential part will be executed slower
  • λ decreases when the number of cores increases (the sequential part grows as a result of the cost of synchronization primitives)
• We need “almost perfect parallelism” in order to gain from many-core architectures.
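To make the symmetric and asymmetric models concrete, here is a hedged Python sketch that combines Pollack's rule (core performance proportional to the square root of its area) with the parallelism coefficient λ in an Amdahl-style formula. The exact expressions are an assumption for illustration and are not necessarily the formulas used in the cited paper; the function names and the example area budget of 16 units are hypothetical.

```python
from math import sqrt


def symmetric_perf(lam, a, n):
    """Relative performance of n identical cores, each of area a.

    Serial phases run on one core, parallel phases on all n cores.
    """
    core = sqrt(a)  # Pollack's rule
    return 1.0 / ((1.0 - lam) / core + lam / (n * core))


def asymmetric_perf(lam, a, beta, n_small):
    """Relative performance of one large core (area beta*a) plus n_small cores of area a.

    Serial phases run on the large core, parallel phases on all cores.
    """
    big = sqrt(beta * a)
    small = sqrt(a)
    return 1.0 / ((1.0 - lam) / big + lam / (big + n_small * small))


if __name__ == "__main__":
    lam = 0.75  # same parallelism coefficient as the symmetric CMP plot
    # Same total area budget of 16 units in every configuration.
    print("1 core of area 16:      ", round(symmetric_perf(lam, 16, 1), 3))
    print("4 cores of area 4:      ", round(symmetric_perf(lam, 4, 4), 3))
    print("16 cores of area 1:     ", round(symmetric_perf(lam, 1, 16), 3))
    print("1 big (area 4) + 12 small:", round(asymmetric_perf(lam, 1, 4, 12), 3))
```

With λ=0.75 and a fixed area budget, sixteen tiny cores come out slower than a single large core in this sketch, which matches the point that naively adding cores may reduce performance, while the asymmetric split does better than either symmetric extreme.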

How to ease the problem
• Reduce the serial part
  • “Smart prefetch”; e.g., scout, helper threads
  • Push technology; e.g., Cell – push the data to the right locations in parallel (DMA) and then execute
  • Fast synchronization – can be done in HW or SW
  • Change the programming model; e.g., transactional memory
• Increase parallelism within the application
  • Easier said than done
  • We have been there too many times!!!

So, how many cores are too many cores?
• Many-cores will never flourish without a breakthrough in compilers/applications/new programming models – but this can take a looooong time
• Many-cores can be very efficient for special-purpose computing and for “special MIPS”, since we can “tailor” the applications to the HW (and the HW to the software).
• Multicore has its limitations, since it cannot show future longevity; but if we can keep improving single-core performance in parallel with a moderate increase in the number of cores, we may bridge the gap until many-cores become a reality
• Asymmetric computers are mostly appealing as a combination of general-purpose and special-purpose computers. They can also come in the form of FPGA and GPGPU (e.g., CUDA, CTM)

Asymmetric computer - challenges
• I assume that asymmetric computers will be used mainly for “special purpose”: graphics, multimedia, fixed functions
• To achieve that, we need many technologies that are still at a research phase:
  • Interconnect – most likely NoC (Network on Chip)
  • Memory hierarchy: NUCA, DNUCA, UMA, NUMA?
  • Programming model: OpenMP, MPI, CUDA, CTM, other?
  • Compilers/debuggers/development environment
  • OS: do we need one scheduler, or two (as today), or a unified one? Do we need a virtualization layer for that?
• I believe that asymmetric computer architectures will generate many new research opportunities for at least the next decade.

Questions?

Multi-core – Intel, AMD and IBM
• Intel and AMD have dual- and quad-core architectures
• For dual core, Intel uses a shared cache architecture, while AMD introduced a split cache architecture for the first generation and a shared L3 for the new generation
• For servers, analysts claim that Intel is building a 16-way Itanium-based processor for the 2009 time frame.
• Power4 has 2 cores and Power5 has 2 cores+SMT. Analysts claim that IBM is considering moving in the near future to 2 cores+SMT for each of them.
• Xbox has 3 Power4 cores.
• All three companies promise to increase the number of cores at a pace that fits the market’s needs.

Many-cores – Sun – Sparc-T1: Niagara
[Figure: three panels, (a), (b) and (c).]
• Looking at an in-order machine, each thread has computation time followed by a LONG memory access time (a)
• If you put 4 of them on a die, you can overlap I/O, memory and computation (b)
• You can use this approach to extend your system (c)
• The Alewife project did it in the 80’s

Cell architecture - IBM
[Figure: Cell block diagram showing the small cores, the ring-based bus unit and the BIG core.]

Processor power evolution
[Figure: max power (Watts) vs. process generation (1.5μ down to 0.13μ) for i386, i486, Pentium®, Pentium® w/MMX tech., Pentium® Pro, Pentium® II, Pentium® III and Pentium® 4.]
• Traditionally, a new processor generation always increases power
• Using the same micro-architecture with a new process decreases its power