Extended Ethernet Congestion Management (E2CM): Per Path ECM - A Hybrid Proposal M. Gusat, C. Minkenberg and R. Luijten IBM Research GmbH, Zurich March 14th 2007 Outline • Status at 802.1 • Critique Analysis of BCN • IBM hybrid proposal: E2CM ¾ unite BCN/ECM + recent CM proposals • E2CM simulations ¾ A) Orientative results ¾ B) Reference results • Conclusions IBM Research GmbH, Zurich 2 Status at 802.1au 1. 802.1 WG critique of BCN performance 1. Stability: oscillations and ‘slow’ convergence 2. Fairness: BCN’s rate allocation is ‘unfair’ 2. Recent rate-based (RB) CM proposals in 802.1 1. BECN evolves into FECN (R. Jain) 2. FECN-like destination-based probing / DBP (M. Seaman) 3. First reconciliation proposal: QCN (B. Prabhakar) ¾ optimized BCN ¾ same or better performance at lower overhead ¾ 802.1au agreement on CM method appears difficult... 9 adhoc simulation progress slowdown IBM Research GmbH, Zurich 3 Critique of BCN vs. ECN: Of Fairness, Stability and Speed IBM Research GmbH, Zurich 4 Current CM Proposals: R. Jain’s FECN + B. Prabhakar’s QCN + M. Seaman’s DBP A) R. Jain presents in Monterey a new rate-based CM (RB-CM) proposal, i.e. FECN • Performance claims (top 3): 1. 2. 3. “Perfect Fairness” “Fast Convergence” “No PAUSE required or issued” [Obs.: here “perfect” stands for max-min] [Obs.: 10ms here] [Obs.: in certain cases] Note: FECN results are to be validated by other simulation teams. B) M. Seaman’s DST-based probing (DBP) proposal “...algorithm comprising three sets of algorithms for Sources, Bridges, and Destinations. 1. the Destination originates and transmits regular Rate Report (RR) frames to each active Source. 2. Each RR traces the reverse path from the Destination to the Source and carries an advertised rate for use by the Source in transmitting to that Destination. The RR originally carries a rate set by the destination to be its receiving link speed. “ C) B. Prabhakar’s insight leading to the QCN proposal “BCN generates extra signaling traffic ¾ ¾ Hence sampling probability is kept at 1%; this can go up to 10% and improve responsiveness by a lot But, if [i] forward signaling is possible, or [ii] another means of signaling more frequently can be found, then we can send less information per signal” IBM Research GmbH, Zurich 5 IBM Proposal: BCN + FECN + DBP + QCN => E2CM The Hybrid Dilemma: How to Combine the Best Features ... Only? Tree Saturation => Complex CM ... yet leads to our proposal port 1 port 4 . . . . . . . . . . . . . . . port 4 . . . HCARx . . . . . . . . . . HCATx port 1 shortcut routes within same SEs . . . . . HCATx (PPT animation) . . port 128 port 128 32 4x4 HCARx 32 4x4 32 4x4 32 4x4 16 8x8 same SEs • Documented in papers from sim and h/w experiments [refs available] • Effect: nonlinear (saturation chain) and time-varying system (2 non-invariants) • Transport lag => transcendental characteristic equation ¾ ¾ Link-level flow control induces blocking chains in depth and breadth Few hot flows hog resources (high-order HOL-B) => blocking many/all cold/victim flows ¾ Impact on piecewise linearization, particularly at linearization points ¾ Root-locus and Routh-Hurwitz apply only to rational transfer functions... IBM Research GmbH, Zurich 7 Critique – 1 Analysis : Stability of BCN • A key observation: Baseline BCN has robust performance in the linear region! ¾ However, its dynamic range (DR) is limited by the queue capacity ¾ Furthermore, possibly fed by n simultaneous arrivals... from ECM Spec Saturated Integrator behavior... ⇒ Feedback: Fb(t) = -(q(t) – Qeq) + w*(dq/dt) / (μj* ps) => 0 ≤ q(t) ≤ qmax OVF UDF ¾ Fast transition between lower/upper saturation (n+1 stochastic procs) – – requires frequent use of saturation signals: BCN_Max, BCN(0,0) non-linear saturation patches reduce the efficiency of the baseline control alg. IBM Research GmbH, Zurich 8 Extending the Linear Region of a Saturated Integrator • How to scale BCN’s stability properties w/ network size? 1. Increase the dynamic range by chaining the j queues along the path i... 2. Control the chain of queues instead of the individual queue ¾ Adopt per path probing ... • OVF UDF Concatenate multiple queues along a path into a Path Queue -> Σqij ¾ State equations From local queue stability to per path stability: LQueue) dq/dt = HSD*λ(t) – μj , where max(HSD) = N, and max (μj ) = Cj PQueue) Q’ij = Σidqij/dt = Σiλij (t) – μj , 1< i ≤ HSD Obs.: Slope steepness decreases: from n+1 to 2 stochastic procs IBM Research GmbH, Zurich OVF UDF 9 Critique - 2 Analysis : Fairness of BCN • BCN’s ‘unfairness’: from probabilistic sampling of aggregate occupancy ¾ Queue contains frames from any flow => lack of per-flow state sensing 9 Per-flow state in the bridge is prohibited by PAR • A) One approach is to calculate / iterate the fair share (FS) as in ABR methods such as FECN, UT, OSU, H2 ¾ TBD: scalability and h/w complexity • B) Other methods were proposed by Stanford Univ. [Allerton paper] • C) IBM proposal: A per-flow probing sensor in the edge node... ¾ Probing: triggered by BCN, or autonomous (congestion avoidance) ¾ Why per flow? -> fast max-min convergence even w/o FS in the bridge IBM Research GmbH, Zurich 10 E2CM Principles: Dual Heritage • E2CM includes saturation tree mgnt: avoidance, isolation and recovery. ¾ Historically this wasn’t the case in TCP and ATM ABR I) BCN and QCN heritage: E2CM draws upon the baseline BCN ¾ PD control + feedback equation (extended, instead of piecewise linearized) ¾ AIMD-based SRF + parameters 9 Potential QCN optimizations: – – Bridges do not send increase signals Sparse quantization (6/5 bit) II) FECN and DBP heritage: SRC probes for RTT + DST calculates rate ¾ from FECN and DBP we adopt per path probing and DST rate reporting Effect: E2CM extends BCN’s dynamic range to a wider linear region proportional to the number of buffers traversed per path ¾ increased stability and phase margin (improved convergence) ¾ scalability w/ network size • By adopting per-flow accounting in the end node, E2CM converges to fair allocation rates w/o using rate-based calculations in the bridge IBM Research GmbH, Zurich 11 E2CM Operation (PPT animation) Probe Switch 1 src 1. BCN 1. Probe arrives arrives at source at source 1. Probe arrives at dst 2. Update timestamp, insert 1. Qeq exceeded flow service rate 2. Send BCN to 3.source Return probe to source Switch 2 BCN 2. Install 2. Pathrate occupancy limiter computed 3. Inject 3. AIMD probe control w/ timestamp applied using same rate limiter Switch 3 dst • Probing is triggered by BCN frames; only rate-limited flows are probed • Per flow, BCN and probes employ the same rate limiter ¾ Insert one probe every X KB of data sent per flow, e.g. X = 75 KB ¾ Probes traverse network inband: Objective is to observe real current queuing delay ¾ ¾ ¾ ¾ Control per-flow (probe) as well as per-queue (BCN) occupancy CPID of probes = destination MAC Rate limiter is never associated with probe CPID Parameters re. probes may be set differently (in particular Qeq,flow, Qmax,flow, Gd,flow, Gi,flow) IBM Research GmbH, Zurich 12 Proposed Compromise Scheme details • Extension: BCN reception triggers RTT and Tput probing in the end nodes ¾ A) While activated SRC periodically (t or n-pkts) insert probe frame for every RLT flow 9 9 Probe contains timestamp Probes traverse network in-band with regular data frames ¾ B) Upon reception of forward probe, the DST will ¾ C) Upon reception of reverse probe back at the originating SRC ¾ 1. 2. 3. Update timestamp to reflect forward latency L Calculate and report the flow service rate R since last flow probe Return probe to sending SRC 1. 2. Adjust latency L for flight time L0 Apply Little’s Formula: 3. Apply the extended BCN source response function 9 9 One rate limiter per flow Associated with last negative feedback Q = (L - L0)*R 1. Yields the mean number of bytes of probed flow stored on entire forward path 1. E.g., set Qequilibrium and apply AIMD rate adjustment Net: E2CM extends buffer occupancy per path and flow (however, CM triggering may be done on rate –instead of size– thresholds) 9 9 per flow: Separates hot from cold flows per path: extends the region of linear BCN operation IBM Research GmbH, Zurich 13 Reaction Point if (bcn.type() == BCN_BCN) { // Compute BCN reaction as usual ... } else if (bcn.type() == BCN_PROBE) { // Store minimum latency as time of flight if (flightTime > bcn.getLatency() || flightTime == 0.0) flightTime = bcn.getLatency(); // Compute amount of data queued on forward path, adjusting for flight time flowQ = bcn.getThroughput()*(bcn.getLatency() - flightTime); flowdQ = max( min( flowQ - flowLastQ, 2*flowQeq ), -2*flowQeq ); flowQoff = max( min( flowQeq - flowQ, flowQeq ), -flowQeq ); if (flowQ > flowQmax) // Qmax threshold exceeded? feedback = -(1+2*W)*flowQeq; // Apply maximum negative feedback else feedback = (flowQoff - W*flowdQ); // Compute feedback flowLastQ = flowQ; // Store last queue estimate // Apply AIMD rate adjustment if (feedback > 0) // Additive increase rate = rate + flowGi*feedback*rateUnit; else if (feedback < 0) // Multiplicative decrease rate = rate * (1.0 + flowGd*feedback); } // If needed, instantiate new rate limiter or update rate // Associate rate limiter with CP if feedback < 0 and not probe ... IBM Research GmbH, Zurich 14 E2CM Frame Format • General fields ¾ Congestion point MAC 9 Inserted by congestion point 9 Probe congestion point MAC = flow destination MAC ¾ Flow identifier 9 Hash based on src MAC, dst MAC, priority ¾ BCN type 9 BCN, BCN_MAX, BCN_ZERO, BCN_PROBE_FWD, BCN_PROBE_REV • BCN-specific fields ¾ Queue offset 9 Inserted by congestion point ¾ Queue delta 9 Inserted by congestion point • Probe-specific fields ¾ Forward latency 9 Already provided in original BCN format (but different usage) 9 Timestamp inserted by source node 9 Updated by destination node (latency = now – timestamp) ¾ Flow throughput field 9 Inserted by destination node 9 Measured between two subsequent probes for same flow * Red: new fields IBM Research GmbH, Zurich 15 E2CM: Selected Initial Simulation Results Baseline IG and OG simulations: Orientative Preview (not for reference) Input-generated Hotspot Node 2 Node 3 Node 4 5% Core Switch 40% Node 5 Node 6 • • • 5% Node 1 80% 80% 5 flows sending to hotspot: aggregate load = 21 Gb/s Max-min fair rates = (0.5; 0.5; 3; 3; 3) Gb/s = (62.5; 62.5; 375; 375; 375) MB/s Hotspot starts at t = 0.1s; at t=0.5s, service rate of node 1 is reduced by half; fair rates = (62.5; 62.5; 167; 167; 167) MB/s IBM Research GmbH, Zurich 17 BCN vs. E2CM : Fair and Steady Rate Allocation BCN E2CM Rate fluctuations Stable max-min fair rates • Graphs show aggregate (red) and per-flow throughput • 1. 2. 3. 4. 5. 6. Params Qeq_BCN = 75 kB, Qeq_E2CM = 15 kB, Gd = 1.333*10^-6, Gi = 6.6667*10^-4 (Gdf = 0.5, Gif = 0.1; E2CM gains are 5 times as high, because Qeq is 5 times as low) Ru = Rmin = 250 kB/s = 2 Mb/s M = 300 kB/port, Thr_hi = 295500, Thr_lo = 147750 sample interval = 75 kB (for BCN as well as E2CM), 15 kB for rate-limited flows BCN_MAX disabled IBM Research GmbH, Zurich 18 Output-Generated Single-Hop Hotspot Node 2 85% Service rate = 20% Core Switch Node N • • • 85% Node 1 85% All nodes: Uniform destination distribution, load = 85% (8.5 Gb/s) Node 1 service rate = 20% One congestion point ¾ ¾ Hotspot degree = N-1 All flows affected • Params 1. 2. 3. 4. 5. 6. Qeq_BCN = 75 kB, Qeq_E2CM = 15 kB, Gd = 1.333*10^-6, Gi = 6.6667*10^-5 (Gdf = 0.5, Gif = 0.01; E2CM gains are 5 times as high, because Qeq is 5 times as low) Ru = Rmin = 250 kB/s = 2 Mb/s M = 300 kB/port, Thr_hi = 295500, Thr_lo = 147750 sample interval = 75 kB (for BCN as well as E2CM), 15 kB for rate-limited flows BCN_MAX enabled (Qsc = 280500) IBM Research GmbH, Zurich 19 BCN vs. E2CM : Output-generated (Tp and Q) Reduced transient Slow recovery Fast recovery Transient No underutilization Underutilization E2CM BCN IBM Research GmbH, Zurich 20 Reference Simulation Results E2CM r1.0 Simulation Setup & Parameters • Traffic – – – • • I.i.d. Bernoulli arrivals Uniform destination distribution (to all nodes except self) Fixed frame size = 1500 B • Scenarios 1. 2. 3. 4. 5. 6. Baseline input-generated (IG) Max-min (mice-elephant) IG Single-hop output-generated (OG) Multi-Hop OG background HS Bursty On-Off Parking lot • Switch – – – – M = 300 KB/port Partitioned memory per input, shared among all outputs No limit on per-output memory usage PAUSE enabled • • • Applied on a per input basis based on local high/low watermarks watermarkhigh = 280 KB watermarklow = 260 KB Adapter – – – – – ECM – – – – – – – – – • Per-node virtual output queuing No limit on number of rate limiters Unlimited ingress buffer size Egress buffer size = 150 KB PAUSE enabled • • watermarkhigh = 140 KB watermarklow = 130 KB W = 2.0 Qeq = 75 KB (= M/4) Gd = 0.5 / ((2*W+1)*Qeq) Gi0 = (Rlink / Runit) * ((2*W+1)*Qeq) Gi = 0.005 * Gi0 Psample = 2% (on average 1 sample every 75 KB) or 10% (15 KB) Runit = Rmin = 1 Mb/s BCN_MAX enabled, threshold = 280 KB No BCN(0,0), no self-increase E2CM (per-flow) – – – – – – – W = 2.0 Qeq = 15 KB Gd = 2.5 / ((2*W+1)*Qeq) Gi = 0.025 * Gi0 Psample = 2% (on average 1 sample every 75 KB) or 10% (15 KB) Runit = Rmin = 1 Mb/s BCN_MAX enabled, threshold = 56 KB IBM Research GmbH, Zurich 22 1. Baseline Input-Generated Hotspot Node 2 50% Node 3 50% Edge Switch 1 50% Node 1 Edge Switch 2 Edge Switch 5 Node 6 Core Switch 50% Edge Switch 3 Edge Switch 6 Node 4 Node 5 50% • • • Edge Switch 4 Node 7 Four culprit flows of 5 Gb/s each from nodes 2, 3, 4, 5 to node 6 (hotspot) One victim flows of 5 Gb/s from node 1 to node 7 Fair allocation provides 2.5 Gb/s to all culprits and 5 Gb/s to the victim IBM Research GmbH, Zurich 23 Results Baseline scenario (Tp) ECM E2CM No underutilization 75 KB Victim not affected Unfair rate allocation ~150 ms transient Perfect rate allocation Fast recovery 15 KB Still unfair rate allocation IBM Research GmbH, Zurich Shorter transient 24 Results Baseline scenario (Q) ECM E2CM Stable queue length Stable queue length 75 KB Stable per-flow occupancy! 15 KB IBM Research GmbH, Zurich 25 Results Baseline scenario (Tp, Gi = 0.25*Gi0) ECM E2CM Fair allocation 75 KB Fast fairness 15 KB IBM Research GmbH, Zurich 26 Convergence times IG single-hop Gi = 0.025 * Gi0 400 ECM, 75 KB ECM, 15 KB 300 E2CM, 15 KB 250 200 150 100 50 14 E2CM, 75 KB Convergence time (ms) relative Convergence time (ms) 350 Gi = 0.25 * Gi0 16 Ideal ECM, 75 KB E2CM, 75 KB ECM, 15 KB E2CM, 15 KB 12 10 8 6 4 2 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.05 Convergence accuracy 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Convergence accuracy • Convergence times determined over 1-ms averages of hot OQ length • Relative accuracy a means that measured values stay within band [v *(1-a), v *(1+a)], where v is the steady-state value, so band width = 2*a IBM Research GmbH, Zurich 27 2. Input-Generated Mice-Elephant Hotspot Node 2 Node 3 Node 4 5% Node 6 • • Core Switch 40% Node 5 • • 5% Node 1 80% 80% 5 flows sending to hotspot: aggregate load = 21 Gb/s Max-min fair rates = (0.5; 0.5; 3; 3; 3) Gb/s = = (62.5; 62.5; 375; 375; 375) MB/s Hotspot max-min fair rates = (62.5; 62.5; 167; 167; 167) MB/s Achieved... IBM Research GmbH, Zurich 28 Results “Mice-Elephant” scenario (Tp and Q) ECM E2CM 75 KB IBM Research GmbH, Zurich Gi = 0.1 * (Rlink / Runit) * ((2*W+1)*Qeq) 29 3. Output-Generated Single-Hop Hotspot Node 2 85% Service rate = 10% Core Switch Node N • • • 85% Node 1 85% All nodes: Uniform destination distribution, load = 85% (8.5 Gb/s) Node 1 service rate = 10% One congestion point – Hotspot degree = N-1 – All flows affected => step response (test stability) IBM Research GmbH, Zurich 30 Results OG multi-hop scenario (Tp) ECM E2CM 75 KB 15 KB IBM Research GmbH, Zurich 31 Results OG (Q): >2x Faster Convergence..! ECM E2CM Lag => zero in RH splane => undershoots Reduced undershoot 75 KB 15 KB IBM Research GmbH, Zurich 32 Convergence times single-hop OG scenario 0.18 0.45 0.35 ECM, 75 KB E2CM, 75 KB ECM, 15 KB E2CM, 15 KB 0.3 0.25 0.2 0.15 0.1 0.05 Absolute 0.16 Absolute convergence tim e (s) Relative convergence time (s) 0.4 0.14 0.12 0.1 0.08 0.06 0.04 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.02 Convergence accuracy (%)/100 Relative 0 ECM, 75 KB E2CM, 75 KB ECM, 15 KB E2CM, 15 KB Configuration • Convergence times determined over 1-ms averages of hot OQ length • Relative accuracy a means that measured values stay within band [v *(1-a), v *(1+a)], where v is the steady-state value, so band width = 2*a • Absolute accuracy means that measured values stay within band [1.5, 280] KB IBM Research GmbH, Zurich 33 4. OG Multi-Hop Background Traffic Hotspot 25% Node 1 Service rate = 5% Node 2 40% Edge Switch 1 25% Edge Switch 3 Node 3 25% Node 4 Node 5 25% 40% Node 8 Core Switch 25% Edge Switch 2 Node 7 Node 9 Edge Switch 4 40% 40% Node 10 Node 6 25% • • • • All nodes: Uniform destination distribution Nodes 1-6 load = 25% (2.5 Gb/s), nodes 7-10 load = 40% (4 Gb/s) – Mean aggregate load = (6*.25+4*.4)/10 = 31% (3.1 Gb/s) Node 7 service rate = 5% Five congestion points – All switches and all flows affected IBM Research GmbH, Zurich 34 Results OG multi-hop BGND (Q) ECM E2CM 75 KB 15 KB IBM Research GmbH, Zurich 35 Convergence times multi-hop OG scenario 1.4 Convergence time (s) 1.2 ECM, 75 KB E2CM, 75 KB ECM, 15 KB E2CM, 15 KB • Convergence times determined over 1-ms averages of hot OQ length • Relative accuracy a means that measured values stay within band [v *(1-a), v *(1+a)], where v is the steady-state value, so band width = 2*a • Absolute accuracy means that measured values stay within band [1.5, 280] KB 1 0.8 relative 0.6 0.4 0.2 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Convergence accuracy (%)/100 0.16 absolute 0.14 Convergence time (s) 0.12 0.1 0.08 0.06 0.04 0.02 0 ECM, 75 KB E2CM, 75 KB ECM, 15 KB Configuration E2CM, 15 KB IBM Research GmbH, Zurich 36 5. Bursty Baseline Input-Generated Hotspot Node 2 100% 100% Node 3 Edge Switch 1 Node 1 Edge Switch 2 Edge Switch 5 Node 6 Core Switch 100% Edge Switch 3 Edge Switch 6 Node 4 Node 5 100% • • • • Edge Switch 4 Node 7 Four hot flows of 10 Gb/s each from nodes 2, 3, 4, 5 to node 6 (hotspot) Every 100 ms, flows from nodes 4 and 5 are switched from off to on and vice versa (duty cycle = 200 ms) Fair allocation provides 2.5 Gb/s per flow when 4 are active, 5 Gb/s when 2 are active Pause disabled, very small adapter buffers (10 frames) IBM Research GmbH, Zurich 37 Bursty Baseline IG scenario (Tp): Convergence < 5ms ECM E2CM Convergence but unfair rate allocation Fast convergence (< 5ms) and fair rate allocation 75 KB Close to ideal 15 KB IBM Research GmbH, Zurich 38 Bursty Baseline IG scenario (Q) ECM E2CM Fast convergence on stable queue level (75 KB) after every transition Idem 75 KB Note different queue levels (30 vs. 60 KB) depending on number of active flows Idem Idem 15 KB IBM Research GmbH, Zurich 39 6. Parking Lot Scenario Node 1 Node 2 100% Node 6 Switch 2 Switch 3 Node 7 Node 9 100% Node 4 100% 100% Node 5 • • • • • Node 8 100% Switch 1 Node 3 100% Four hot flows of 10 Gb/s each from nodes 1, 2, 3, 4 to node 9 (hotspot) Two cold flows of 10 Gb/s from node 5 to 7 and 6 to 8 Max-min fair allocation provides 2.0 Gb/s to all flows Proportionally fair allocation provides 1.67 Gb/s to all hot flows and 3.33 Gb/s to all cold flows Pause disabled, very small adapter buffers (10 frames) IBM Research GmbH, Zurich 40 Parking Lot Scenario (75KB) – Qeq, all = 15 KB E2CM ECM Tp Proportional fairness Q IBM Research GmbH, Zurich 41 E2CM - Parking Lot Scenario (75KB) – Per-flow Qeq Qeq,hot = 15 KB, Qeq,cold = 7.5 KB Qeq,all = 15 KB Tp Proportional fairness: 4x1.67 Gb/s, 2x3.33 Gb/s Max-min fairness: 6x2 Gb/s Q IBM Research GmbH, Zurich 42 Conclusion • E2CM extends baseline BCN w/ per-path probing adopted from DBP and FECN • E2CM addresses the main critiques of BCN ¾ improves performance in 9 stability and speed, based on an saturated integrator w/ extended dynamic range (distributed queue Qij = Σqij) 9 linear range of Fb is now scalable w/ no. of buffers in the network 9 fairness (per flow accuracy possible in end-nodes, when needed) – • scalability cost to 100+ Gbps Ethernet and 1M-node datacenter is TBD User-defined fairness: ¾ 1. max-min (canonical, beneficial for ‘mice’) ¾ 2. proportional (tempers ‘remote’ flows w/ long routing distance) ¾ 3. max-Tput (maximize utilization at cost of unfairness, but no starvation) • Synergy w/ FECN/DBP (probing) and w/ baseline ECM (param tuning) • We propose the hybrid E2CM as baseline CM approach That’s all, thanks! IBM Research GmbH, Zurich 43