Presentation

Building Scalable and Highly Available Systems
A Layered Approach
Eliezer Dekel
Haifa – High Availability is our prefix
IBM Labs in Haifa
© 2004 IBM Corporation
Complex heterogeneous infrastructures are a reality!
[Figure: infrastructure diagram. Labeled components: Directory and Security Services; Existing Applications and Data; DNS Server; Web Server; Internet Firewall (two); Cache; Load Balancer; Business Data; Data Server; Web Application Server; Storage Area Network; BPs and External Services. Callouts: dozens of systems and applications; hundreds of components; thousands of tuning parameters.]
One of the Data Centers (500 servers)
[Figure: Microsoft.com network diagram, Canyon Park Data Center. Drawn by Matt Groshong; last updated April 12, 2000; IP addresses removed by Jim Gray to protect security. The diagram shows Cisco 7000 routers (HSRP), Catalyst 5000 and Catalyst 2926 switches, IIS server farms for sites such as WWW.MICROSOFT.COM, DOWNLOAD.MICROSOFT.COM, SUPPORT.MICROSOFT.COM, SEARCH.MICROSOFT.COM, MSDN.MICROSOFT.COM, REGISTER.MICROSOFT.COM and WINDOWSMEDIA.MICROSOFT.COM, FTP and SMTP servers, build servers, stagers, PPTP/terminal servers, monitoring servers, and live, backup, consolidator and miscellaneous SQL servers. A "Microsoft.com Server Count" table lists FTP, Build Servers, IIS, Application, Network/Monitoring, SQL, Search, NetShow, NNTP, SMTP, Stagers and Exchange, for a total of 459 servers.]
Web applications - Requirements Summary
Availability
Scalability
Security
Performance
Integrity
Manageability
Malleability/Longevity
Integration
Cost
Availability
Defined as a measurement of uptime as perceived by a user
There are 86,400 seconds in a day (~100,000) and 31,536,000 seconds in a year (~30 million)
99% uptime means 1% downtime, which is:
864 seconds/day or 14.4 minutes/day
315,360 seconds/year or 5,256 minutes/year or 88 hours/year
Percentage Uptime        Downtime
99.99%                   53 minutes/year (or 0.14 minutes/day)
99.999%                  5 minutes/year
99.9999%                 30 seconds/year
99.99999% (7 nines)      3 seconds/year
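The downtime figures on this slide follow from simple arithmetic on the uptime percentage. A minimal sketch of that calculation, assuming a 365-day year; the function name `downtime_seconds` is illustrative and not part of the original material:

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # 31,536,000


def downtime_seconds(uptime_percent: float, period_seconds: int) -> float:
    """Downtime allowed within a period for a given uptime percentage."""
    return period_seconds * (100.0 - uptime_percent) / 100.0


# Print the yearly downtime budget for each availability level on the slide.
for pct in (99.0, 99.99, 99.999, 99.9999, 99.99999):
    per_year = downtime_seconds(pct, SECONDS_PER_YEAR)
    print(f"{pct}% uptime: {per_year:.1f} seconds/year ({per_year / 60:.2f} minutes/year)")
```

The slide rounds these results (for example, 52.56 minutes/year is shown as 53).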
The Internet Changed Expectations
1990
Very few businesses were on the Web
Phones delivered 99.999%
ATMs delivered 99.99%
Failures were front-page news
Few hackers
Outages lasted an “hour”
High Availability for the Rich
2004
Most businesses have Web presence
Cellphones deliver 90%
Web sites deliver 98%
Failures are business-page news
Many hackers
Outages last a “day” or more
High Availability Is for All (HAIFA)
Is this progress?
In the News
Downtime Costs (per Hour)
Brokerage operations          $6,450,000
Credit card authorization     $2,600,000
eBay (1 outage 22 hours)      $225,000
Amazon.com                    $180,000
Package shipping services     $150,000
Home shopping channel         $113,000
Catalog sales center          $90,000
Airline reservation center    $89,000
Cellular service activation   $41,000
On-line network fees          $25,000
ATM service fees              $14,000
Source: Gartner Group
September 11, 2001
Only 15% of the companies in the World Trade Center had a working business
continuity plan
One law firm did not have a backup outside of the building – it went out of
business
One of the trading firms was able to transition immediately and successfully to a
backup site across the river with absolutely no interruption to its customers
An investment bank had only a tape backup. It took them four days to recover
Scalability
The capability of a system to adapt readily to a greater or lesser intensity of use,
volume, or demand while still meeting its business objectives (acceptable levels of
performance, availability, manageability etc.)
[Figure: Resource Utilization (vertical axis) versus Load (horizontal axis), with four curves:
Typical: utilization increases faster than the load
Good situation: utilization increases linearly with load
Ideal: degrades gracefully as load increases; seldom happens
Bad situation (poor design): appears OK until load increases]
Motivation
Defined: Data is stored without overlap across multiple sites and each
site processes its data the same way
This is the architecture of the Web (order of magnitude: circa 10^12
hits/day, or roughly 10^7 hits/sec)
Back-of-the-envelope thought exercise:
Assume a server can handle an average of
10^1 to 10^4 hits/sec
Then there must be 10^3 to 10^6 web sites to meet the load…
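The thought exercise can be checked numerically. This is a rough sketch under the slide's own assumptions (10^12 hits/day, evenly spread over the day); `servers_needed` is a hypothetical helper, and averaging over the day ignores peak load:

```python
import math

HITS_PER_DAY = 10**12
SECONDS_PER_DAY = 86_400


def servers_needed(hits_per_day: int, hits_per_sec_per_server: int) -> int:
    """Servers required to absorb the average load (rounded up)."""
    hits_per_sec = hits_per_day / SECONDS_PER_DAY  # about 1.16 * 10^7
    return math.ceil(hits_per_sec / hits_per_sec_per_server)


for capacity in (10**1, 10**2, 10**3, 10**4):
    print(f"{capacity} hits/sec per server: {servers_needed(HITS_PER_DAY, capacity)} servers")
```

The results span roughly 10^3 to 10^6, matching the order-of-magnitude estimate.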
Examples (data partitioning – segmented workload):
1999 data on one site, 1998 on another…
a’s on one site, b’s on another…
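The two partitioning examples (by year, by first letter) can be sketched as routing functions that give every data element exactly one home. The site names and both function names below are illustrative assumptions, not part of the original material:

```python
def site_for_year(year: int, sites_by_year: dict[int, str]) -> str:
    """Each year's data has exactly one home site."""
    return sites_by_year[year]


def site_for_name(name: str, sites: list[str]) -> str:
    """Range-partition names by first letter across the given sites."""
    first = name[:1].lower()
    if not ("a" <= first <= "z"):
        raise ValueError(f"cannot partition name {name!r}")
    # Map letter position 0..25 proportionally onto a site index.
    return sites[(ord(first) - ord("a")) * len(sites) // 26]


print(site_for_year(1999, {1999: "site-a", 1998: "site-b"}))
print(site_for_name("alice", ["site-0", "site-1"]))
```

Because the mapping is a pure function of the key, any front end can compute the home site without consulting the other sites.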
The Scaling Paradox
Vertical and Horizontal Scaling
Stateful and stateless applications are two extremes
The Scaling Paradox for stateful applications
Having multiple copies (cached or replicated) leads to
inconsistencies:
Modifying one copy makes that copy different from the rest.
Keeping copies always consistent
Requires global synchronization on each modification.
Global synchronization precludes large-scale solutions.
If we can tolerate inconsistencies, we may reduce the need for global
synchronization.
Tolerating inconsistencies is application dependent.
Allow applications to work at the level of inconsistency they tolerate
Achieving Scalability
Faster Machines (Vertical Growth)
Replicated Machines (Horizontal Growth)
Specialized Machines
Segmented Workloads
Request Batching
User Data Aggregation
Connection Management and Caching
It is important to note that a detailed understanding of the application is
key to a successful implementation
Techniques Applied to Web Tiers
Technology
Computation Stack
Hardware
Tandem
Special Communication (e.g., VIA, Infiniband)
Operating System
Distributed Operating Systems (e.g., Solaris MC)
Single System Image
Group Communication Services (GCS) at the OS level
Middleware
Group Communication Services
Application
Shared Something
DB
Disks, File System
Shared Nothing
Techniques for HA
Replication
Fault Detection
Load Balancing
Recovery
Failover
Failback
Malleability – Continuous Availability
Consensus (voting, two phase commit, Paxos)
Diversity
Terminology
MTTF (mean time to failure)
MTTR (mean time to repair)
MTBF (mean time between failures) = MTTF + MTTR
Availability = MTTF / (MTTF + MTTR)
Reconfiguration Transparency
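The terminology above translates directly into a small calculation. A minimal sketch; the function names and the sample figures (999 hours to failure, 1 hour to repair) are illustrative assumptions:

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures = time to failure + time to repair."""
    return mttf_hours + mttr_hours


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


print(availability(999.0, 1.0))   # fails about every 1000 hours, 1 hour to repair
print(availability(999.0, 0.1))   # same failure rate, ten times faster recovery
```

The second call illustrates why recovery techniques such as failover matter: shortening MTTR raises availability even when failures are just as frequent.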
Free Recovery?
Replication vs. Data Partitioning
Replication
Same or overlapping data stored at multiple locations
Partitioning
Data is non-overlapping
Typically, only one “home” for any data element
Replication vs. Caching
Difference between caching and replication
Caching: there is a fundamental difference between a cached copy
and the real “backing” data. Loss of the cache is not a failure except
from the perspective of performance
Replication: all replicas are of the same type, albeit not necessarily
identical. Loss of a replica is a failure and could result in higher
likelihood of lost data
Semantics of Replication
Consistency/fuzzy replication
Same issue as in caching (above)
What does consistency mean?
Ticket Sales (OK to not show all the seats)
Latest Score in basketball game (Can lag by up to n seconds)
Weather forecast (Variable lag, depending on severity of change)
Prices for certain goods (Perhaps they need to be exact, as
differentials would cause customer dissatisfaction)
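One way to express "can lag by up to n seconds" is a per-read staleness bound chosen by the application. The sketch below is an assumption-laden illustration, not the presenter's design: `Replica`, `read`, and `fetch_primary` are hypothetical names, and the clock is injectable so the behavior can be tested.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Replica:
    value: object
    updated_at: float  # when the value was last refreshed from the primary


def read(replica: Replica,
         max_lag_seconds: float,
         fetch_primary: Callable[[], object],
         now: Callable[[], float] = time.monotonic) -> object:
    """Serve from the replica if it is fresh enough, else refresh it first."""
    current = now()
    if current - replica.updated_at <= max_lag_seconds:
        return replica.value          # tolerated inconsistency: no synchronization
    replica.value = fetch_primary()   # too stale for this application: go to the source
    replica.updated_at = current
    return replica.value
```

A basketball score might be read with a bound of a few seconds, while a price that must be exact would be read with `max_lag_seconds=0`, which forces a trip to the primary whenever any time has passed.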
Failure Detection
Explicit – clear indication that a failure has occurred
Timely
Semantics clean, … as far as they go
Voting
Implicit – timeout
Requester does not receive response after waiting a while
Unclean: Does not necessarily mean remote system failed
Timeouts are often used at many places/levels
Communication
Naming, …
And, ultimately, End-to-end
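Implicit detection is often built from heartbeats plus a timeout. The following is a minimal sketch under that assumption; the class `TimeoutDetector` and its methods are hypothetical. Note that it reports a peer as *suspected*, not failed, since a timeout does not necessarily mean the remote system failed.

```python
import time


class TimeoutDetector:
    """Suspects a peer when no heartbeat has arrived within the timeout."""

    def __init__(self, timeout_seconds: float, now=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._now = now            # injectable clock, for testing
        self._last_heard = {}      # peer name -> time of last heartbeat

    def heartbeat(self, peer: str) -> None:
        """Record that a message from the peer has just been received."""
        self._last_heard[peer] = self._now()

    def suspected(self, peer: str) -> bool:
        """True if the peer is silent for too long (or was never heard from)."""
        last = self._last_heard.get(peer)
        if last is None:
            return True
        return self._now() - last > self.timeout_seconds
```

A suspected peer may only be slow or unreachable; a later heartbeat clears the suspicion.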
Timeout In More Depth
Problems with timeouts
Semantics
Specification of timeout length
Particularly difficult when requests take variable amounts of time
and the requester cannot dynamically set the timeout interval
Long intervals lead to poor customer satisfaction – imagine an
ATM that made you wait 10 minutes before failing and giving you
your card back
Therefore, timeouts are used at multiple system levels
Lower levels have more predictable performance so can trigger
timely failures better
Higher levels are required for ultimate correctness
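The idea of timeouts at multiple levels can be sketched as a short per-attempt timeout (lower level) nested inside an overall end-to-end deadline (higher level). This is an illustrative sketch only: `call_with_deadline`, `DeadlineExceeded`, and the `attempt` callback are hypothetical, and retrying assumes the request is safe to repeat.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the end-to-end deadline passes without a result."""


def call_with_deadline(attempt, per_attempt_timeout, overall_deadline,
                       now=time.monotonic):
    """Retry attempt(timeout) until it succeeds or the overall deadline passes.

    attempt(timeout) must return a result or raise TimeoutError.
    """
    give_up_at = now() + overall_deadline
    while True:
        remaining = give_up_at - now()
        if remaining <= 0:
            raise DeadlineExceeded("no response within the end-to-end deadline")
        try:
            # Lower level: fail fast, never waiting past the overall deadline.
            return attempt(min(per_attempt_timeout, remaining))
        except TimeoutError:
            continue  # higher level decides whether there is time to try again
```

The short inner timeout gives timely failure detection, while the outer deadline bounds how long the customer waits in total.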
Consensus - Voting
Discussed with respect to the Weighted Voting algorithm
Used to determine the most up-to-date copies
What if it is used to detect incorrect data?
N-way computation
Structure
N-inputs: vote on them and determine most typical input
N-computations on most typical input
Vote on result
N-outputs which go into next stage of computation
Or go to some device which itself votes
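The N-way structure above can be sketched as one stage: vote on the inputs, run N computations on the agreed input, then vote on the results. The sketch assumes "most typical" means a strict majority, which is one possible reading; `vote` and `n_way_stage` are hypothetical names, and values must be hashable.

```python
from collections import Counter


def vote(values):
    """Return the value held by a strict majority, or raise if there is none."""
    if not values:
        raise ValueError("nothing to vote on")
    winner, count = Counter(values).most_common(1)[0]
    if count * 2 <= len(values):
        raise ValueError("no majority")
    return winner


def n_way_stage(inputs, computations):
    """One stage: agree on the input, compute N ways, agree on the result."""
    agreed_input = vote(inputs)
    outputs = [compute(agreed_input) for compute in computations]
    return vote(outputs), outputs  # all N outputs feed the next stage's vote


def square(x):
    return x * x


def faulty_square(x):
    return x * x + 1  # simulates one incorrect computation


result, outputs = n_way_stage([2, 2, 7], [square, square, faulty_square])
print(result, outputs)
```

With three replicas, one bad input and one bad computation are both outvoted, so the stage still yields the correct result.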
The Goal: Malleability
How do you change the system without taking it down?
The application
The operating system
Perhaps, even a change to the hardware
This has proven very hard
An Approach
Ensure a service is replicated
Stop a copy
Augment its interfaces
Restart it
And repeatedly do the same to the other copies
Eventually, all replicas will have the new capabilities
Note: it is very hard to reduce the scope of interfaces. Augmentation is
much easier.
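The steps above amount to a rolling upgrade. A minimal in-memory sketch, assuming replicas can be stopped and restarted one at a time; the `Replica` class and `rolling_upgrade` function are hypothetical, and a version number stands in for the augmented interfaces:

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    version: int = 1
    running: bool = True


def rolling_upgrade(replicas: list[Replica], new_version: int) -> None:
    """Upgrade replicas one at a time so the service never goes fully down."""
    if len(replicas) < 2:
        raise ValueError("need at least two replicas to stay available")
    for replica in replicas:
        replica.running = False                  # stop a copy
        if not any(r.running for r in replicas):
            raise RuntimeError("no replica left to serve requests")
        replica.version = new_version            # augment its interfaces
        replica.running = True                   # restart it
```

During the upgrade, old and new versions serve side by side, which is why augmenting interfaces (so that old callers keep working) is easier than reducing them.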
Issues
Levels of availability in relation to business processes
Mapping business requirements to availability
Common APIs for HA
Achieving malleability requires many steps for a human being to get right
So, need automation
May not handle a simultaneous failure during upgrading:
So, more replicas may be needed
[Figure: cost-of-availability curve]
Cost of availability: The shape of this curve is right, though the calibration is
unknown and undoubtedly flattens as experience grows
Issues (continued)
Window of Vulnerability
If transactions are used, there is a potential availability problem during
the “Window of Vulnerability”
The only solution is that transaction coordinators must be rather
reliable and be guaranteed to recover quickly after a crash
Multi-tier Availability
Guaranteeing exactly once execution
Handling partitions (split brain)
More…