IBM T.J. Watson Research Center
Defining and Monitoring Service Level Agreements for dynamic e-Business
Alexander Keller
Heiko Ludwig
LISA 02 | 11/07/2002 | Philadelphia, PA, USA
© 2002 IBM Corporation
Why should SysAdmins care about SLAs?
How much does it cost you to guarantee a Response Time less than 1 sec.?
How much do you bill a Customer for a Throughput of 1000 TAs/sec?
How much Revenue is lost per Hour of Downtime of Server X?
Express System Resources in Financial Terms
What are realistic Thresholds for Response Time/Throughput/Bandwidth?
Can you accommodate additional Workload and accept another Customer?
How much Workload do SLA Measurements put on Server X?
How does this impact your SLAs with other Customers?
SysAdmins will become involved in SLA Negotiation (today: Lawyers)
If your Systems become overloaded, which Customer will be starved out?
Classify Customers according to Revenue
SLA Violation may not be due to technical Failure, but Result of Business Decision
What’s more expensive?
A Disk-Crash on a Server or an overloaded Ethernet Segment?
Depends on how much the Customer pays whose Data is hosted there!
Fix Outages according to Customer Classification (today: Severity of Outage)
Real-world SLAs – and their Requirements
Today: Confined to Availability
“Availability% := (n - #hours_Svc_down)* 100 / n”
“… Users being able to establish a TCP Connection to the Server…”
“…Customer’s ability to access the Software Application on the Server…”
“… if the Server is responding to HTTP Requests issued by monitoring SW…”
BUT: There is no agreed-upon Definition of “Availability”!
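As a worked illustration of the quoted Availability formula, here is a minimal Python sketch. The function name and the sample numbers (a 720-hour month with 36 hours of downtime) are assumptions made for this example; the formula itself is the one quoted above. What counts as "down" still depends on which of the quoted definitions the SLA adopts.

```python
def availability_pct(total_hours, hours_down):
    """Availability% := (n - #hours_Svc_down) * 100 / n, with n = total_hours."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= hours_down <= total_hours:
        raise ValueError("hours_down must lie between 0 and total_hours")
    return (total_hours - hours_down) * 100 / total_hours


# Hypothetical reporting period: 720 hours, of which 36 hours were downtime.
print(availability_pct(720, 36))
```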
What is needed?
Define new SLAs “on demand” (e.g., Grid, Virtual Enterprises, Web Services)
Accommodate ANY QoS Parameter Definition and Service Level
Go beyond “Availability”: Response Time, Throughput, Bandwidth…
Connect to existing Application and Resource Instrumentation
Support Customer/Provider Relationships of arbitrary Depth
Delegate SLA Monitoring Tasks to Third Parties
Address Confidentiality Requirements of the Parties (“Need to know”)
Automated Setup of Monitoring Environment based on SLA Definition
Terminology: SLA Parameters, Metrics, Functions
[Figure: chain linking Resource Metrics, Composite Metrics, SLA Parameters and Business Metrics. Link labels: Measurement Directive, Function, Mapping. Region labels: Provider-defined, Customer-defined.]
The analyzed SLAs share a common Structure:
Involved Parties, SLA Parameters
The Metrics used as Input to compute SLA Parameters
Functions that define how Metrics are aggregated
How Metrics are retrieved from Managed Resources (Measurement Directive)
Web Service Level Agreement (WSLA) Framework
[Figure: Service Customer (Client Application) and Service Provider (Service), each with SLA Management. The SLA refers to the Service Description and contains: Primary Parties, Supporting Parties, Parameters/Metrics (Response time, Throughput, Availability), Measurement Directives, Functions, Obligations.]
SLA annotates an existing Service Specification:
References Service Description (e.g., Web Services: WSDL)
Other Service Descriptions possible, e.g., for Business Processes, Messaging, IT Resources
XML Schema based Language for SLAs
Runtime Architecture comprising several SLA Monitoring Services
The WSLA Services: Atomic Building Blocks
Establishment & Deployment Services
Supports negotiation and authoring of SLAs
Deploys the relevant (!) Parts of the SLA to the different Parties
E.g., multiple Measurement Services may not “see” each other
Measurement Service
Probes and measures Resource Metrics according to SLA Specification and aggregates them into SLA Parameters
Condition Evaluation Service
Compares SLA Parameters obtained from Measurement Service against specified Service Levels
Notifies the involved Parties that a Violation has occurred during a valid Time Period
Management Service & Business Entity (not yet supported)
Carries out corrective Actions, provided they do not violate Business Policies
Access to proprietary Tuning Controls and Configuration Parameters of managed Resources is often not available
Corrective Actions must be checked against Business Policies embodied by Business Entity
SLA Lifecycle in the WSLA Architecture
[Figure: SLA Lifecycle between Service Customer and Service Provider]
Lifecycle Steps: 1. Negotiate / Sign, 2. Deploy, 3. Report, 4. Act, 5. Terminate
SLA Compliance Monitor Services: Establishment, Deployment, Measurement, Condition Evaluation, Management, Business Entity
Service Provider Side: Web Service, Servlet Engine, Admin Console, AppServer Monitoring/Management Interfaces
SLA references WSDL
Delegating SLA Monitoring Tasks to Third Parties
[Figure: SLA Monitoring distributed over four Parties]
Parties: ACMEProvider, XInc, ZAuditing, YMeasurement
Components: Management, Condition Evaluation, Measurement, Client Application, Availability Probe, Offered Service, Service Operation
Data Flows: Violation Notifications; Aggregate Response Time, Throughput; Response Time, Operation Counter
Measurement Service Providers guarantee Accuracy and Objectivity (e.g., Keynote Systems)
Typical Structure and Elements of an SLA
Parties (Involved Parties):
  Signatory Parties: IDs and Interfaces of Signatory Parties
  Supporting Parties: IDs and Interfaces of Supporting Parties
Service Description (Service Characteristics & Parameters):
  Service Operations: Operations offered by Service
  Bindings: Transport encoding for Messages
  SLA Parameters: Agreed-upon SLA Parameters (Output)
  Metrics: Metrics used as Input
  Measurement Directives: How/where to access Input Metrics
  Functions: Measurement Algorithm
  Schedule: Measurement Duration, Sampling Rate
Obligations (Guarantees & Constraints):
  Validity Period: When is SLA Parameter guaranteed?
  Predicate: How to detect Violation (Formula)
  Actions: Corrective Actions to be carried out
SLA Structure Example: Service Throughput
Parties (Involved Parties):
  Signatory Parties: “customer.com”, “provider.com”
  Supporting Parties: “msp.com, keynote.com, …”
Service Description (Service Characteristics & Parameters):
  Service Operations: “StockQuoteService:GetQuote()”
  Bindings: “SOAPGetQuote”
  SLA Parameters: “average throughput of service”
  Metrics: “#Requests(svc)”
  Measurement Directives: “www.msp.com/getMetric?Requests(svc)”
  Functions: “AVG(#Requests(svc))”
  Schedule: “over 24 hours, every 60 minutes”
Obligations (Guarantees & Constraints):
  Validity Period: “weekdays, 9am-5pm”
  Predicate: “> 1000 TA/second”
  Actions: “open TT”, “pay penalty/premium”
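The Obligations part of this example could be checked roughly as follows. This is a sketch under stated assumptions, not WSLA code: the function names, the sample data and the reading of "weekdays, 9am-5pm" as Monday to Friday, 09:00 inclusive to 17:00 exclusive, are all made up for illustration. Only samples taken inside the Validity Period are tested against the Predicate "> 1000 TA/second".

```python
from datetime import datetime


def in_validity_period(ts):
    """Assumed reading of "weekdays, 9am-5pm": Mon-Fri, 09:00 <= time < 17:00."""
    return ts.weekday() < 5 and 9 <= ts.hour < 17


def find_violations(samples, threshold=1000.0):
    """Return the (timestamp, throughput) samples that break the Predicate.

    A sample counts as a violation only if it falls inside the Validity
    Period and its throughput is NOT greater than the threshold.
    """
    return [(ts, value) for ts, value in samples
            if in_validity_period(ts) and value <= threshold]


# Hypothetical hourly throughput averages in TA/second.
samples = [
    (datetime(2002, 11, 7, 10, 0), 950.0),   # Thursday, business hours
    (datetime(2002, 11, 7, 11, 0), 1200.0),  # Thursday, guarantee met
    (datetime(2002, 11, 9, 10, 0), 500.0),   # Saturday, outside Validity Period
    (datetime(2002, 11, 7, 18, 0), 400.0),   # Thursday, after 5pm
]
for ts, value in find_violations(samples):
    print(ts.isoformat(), value)
```

Each reported violation would then trigger the agreed Actions, such as opening a trouble ticket.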
Example: Defining SLOs with Constraints
Why define Constraints for Service Level Objectives?
If your hosted Site becomes too popular and creates excessive Load, your Throughput SLO may be impossible to fulfill
Service Provider needs to protect himself against this Situation
SLOs are defined for regular Workloads
BUT: What is a “regular Workload”? Needs to be defined within the SLA!
Example:
“If the System Load is over 80% for more than 30% of the Time, the Obligation of a Service Provider to guarantee 1000 TAs/sec is waived”
2 SLA Parameters:
AvgThroughput: Average Throughput for TAs; must be > 1000 TA/sec
OverloadPct: The Percentage of Time during which System Utilization is >= 80%
Both Parameters are measured every 5 Minutes on an hourly Basis
In an SLO, OverloadPct is used as Precondition for AvgThroughput
Example (cont.)
[Figure: Metric Dependency Chains for the two SLA Parameters of ServiceObject WSDL:getQuote]
ServiceObject WSDL:getQuote
  has SLAParameter OverloadPct (Assign to SLA Parameter)
    defined by Metric PercentOverUtilized
      Function PercentageGreaterThanThreshold (How many TimeSeries Elements > Threshold?)
      defined by Metric UtilizationTimeSeries
        Function TimeSeriesConstructor (New Value put in Time Series)
        defined by Metric ProbedUtilization
          Measurement Directive Probe: acme.com/SystemUtil (How obtained?)
  has SLAParameter AvgThroughput
    defined by Metric AvgThroughput
      Function Average
      defined by Metric ThroughputTimeSeries
        Function TimeSeriesConstructor
        defined by Metric Throughput
          Function Divide
          defined by Metric Transactions
            Measurement Directive Read: TXcount
          defined by Metric TimeSpent
            Measurement Directive Read: Timecount
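The AvgThroughput chain (Divide, then TimeSeriesConstructor, then Average) can be sketched in a few lines of Python. This is an illustration only: the function names mirror the Function labels in the figure, while the readings, and the assumption that TXcount and Timecount are per-interval values with Timecount in seconds, are invented for the example.

```python
def divide(transactions, time_spent):
    """Metric Throughput: Transactions divided by TimeSpent."""
    if time_spent <= 0:
        raise ValueError("time_spent must be positive")
    return transactions / time_spent


def construct_time_series(readings):
    """Metric ThroughputTimeSeries: one Throughput value per reading."""
    return [divide(tx_count, time_count) for tx_count, time_count in readings]


def average(series):
    """Metric AvgThroughput: arithmetic mean of the time series."""
    if not series:
        raise ValueError("time series is empty")
    return sum(series) / len(series)


# Hypothetical (TXcount, Timecount) readings, one per 5-minute interval.
readings = [(300000, 300), (360000, 300), (240000, 300)]
series = construct_time_series(readings)
print(series, average(series))
```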
Defining SLA Parameters and Metrics:
<!-- Assignment of Metric to SLA Parameter -->
<SLAParameter name="OverloadPct" type="float" unit="Percentage">
  <Metric>OverloadPct</Metric>
  <!-- Who communicates with whom? And how? -->
  <Communication>
    <Source>YMeasurement</Source>
    <Pull>ZAuditing</Pull>
    <Push>ZAuditing</Push>
  </Communication>
</SLAParameter>

<!-- Define the Metric: how many Values (in %) of a "Utilization" Time Series are over a Threshold of 80%? -->
<Metric name="OverloadPct" type="float" unit="Percentage">
  <Source>YMeasurement</Source>
  <Function xsi:type="PctGTThreshold" resultType="float">
    <Schedule>BusinessDay</Schedule>
    <Metric>UtilizationTimeSeries</Metric>
    <Value>
      <LongScalar>0.8</LongScalar>
    </Value>
  </Function>
</Metric>

<!-- Create the Time Series: probe every 5 Minutes, keep the last 12 Values -->
<Metric name="UtilizationTimeSeries" type="TS" unit="">
  <Source>YMeasurement</Source>
  <Function xsi:type="TSConstructor" resultType="float">
    <Schedule>Every5Minutes</Schedule>
    <Metric>ProbedUtilization</Metric>
    <Window>12</Window>
  </Function>
</Metric>
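What a Measurement Service might do with these two Metric definitions can be sketched as follows. This is a hypothetical illustration, not the WSTK implementation: the class and function names are invented, the strict "greater than" comparison follows the Function name PctGTThreshold, and the result is returned as a fraction between 0 and 1 because the SLO later compares OverloadPct against 0.3.

```python
from collections import deque


class TimeSeries:
    """Sliding window in the spirit of TSConstructor: keeps the last N values."""

    def __init__(self, window=12):
        # A deque with maxlen silently drops the oldest value when full.
        self.values = deque(maxlen=window)

    def add(self, value):
        self.values.append(value)


def pct_greater_than_threshold(series, threshold=0.8):
    """Fraction of values in the window that are strictly above threshold."""
    if not series.values:
        return 0.0
    over = sum(1 for value in series.values if value > threshold)
    return over / len(series.values)


# Hypothetical utilization probes, one every 5 minutes (14 probes in total).
utilization = TimeSeries(window=12)
for probe in [0.95, 0.95] + [0.5] * 8 + [0.9] * 4:
    utilization.add(probe)

# Only the last 12 probes remain; 4 of them exceed the 0.8 threshold.
print(len(utilization.values), pct_greater_than_threshold(utilization))
```

With a 5-minute Schedule and a Window of 12, the series always covers the most recent hour of probes.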
SLOs in the WSLA Language:
<!-- ACMEProvider guarantees the SLO -->
<ServiceLevelObjective name="SLO_for_AvgThroughput">
  <Obliged>ACMEProvider</Obliged>
  <!-- Validity Period of the SLO; Time Format: RFC 3060 -->
  <Validity>
    <Start>2001-11-30T14:00:00.000-05:00</Start>
    <End>2001-12-31T14:00:00.000-05:00</End>
  </Validity>
  <Expression>
    <Implies>
      <!-- Precondition: OverloadPct < 30% -->
      <Expression>
        <Predicate xsi:type="Less">
          <SLAParameter>OverloadPct</SLAParameter>
          <Value>0.3</Value>
        </Predicate>
      </Expression>
      <!-- Guarantee: AvgThroughput > 1000 -->
      <Expression>
        <Predicate xsi:type="Greater">
          <SLAParameter>AvgThroughput</SLAParameter>
          <Value>1000</Value>
        </Predicate>
      </Expression>
    </Implies>
  </Expression>
  <!-- Send NewValue Event to registered Parties whenever Guarantee is broken -->
  <EvaluationEvent>NewValue</EvaluationEvent>
  ...
</ServiceLevelObjective>
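To show how a Condition Evaluation Service could interpret the Implies expression, here is a minimal Python sketch using the standard xml.etree.ElementTree module. It is not the WSTK implementation. The embedded XML is a trimmed copy of the SLO, and the xmlns:xsi declaration on the root element is an assumption added so that the fragment parses on its own. Only the Less and Greater Predicates used on this slide are handled.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"
XSI_TYPE = f"{{{XSI}}}type"  # ElementTree's name for the xsi:type attribute

SLO_XML = f"""
<ServiceLevelObjective name="SLO_for_AvgThroughput" xmlns:xsi="{XSI}">
  <Obliged>ACMEProvider</Obliged>
  <Expression>
    <Implies>
      <Expression>
        <Predicate xsi:type="Less">
          <SLAParameter>OverloadPct</SLAParameter>
          <Value>0.3</Value>
        </Predicate>
      </Expression>
      <Expression>
        <Predicate xsi:type="Greater">
          <SLAParameter>AvgThroughput</SLAParameter>
          <Value>1000</Value>
        </Predicate>
      </Expression>
    </Implies>
  </Expression>
</ServiceLevelObjective>
"""

COMPARATORS = {
    "Less": lambda value, limit: value < limit,
    "Greater": lambda value, limit: value > limit,
}


def evaluate(expression, params):
    """Evaluate one <Expression> element against current SLA Parameter values."""
    child = expression[0]  # each <Expression> wraps exactly one operator here
    if child.tag == "Implies":
        precondition, guarantee = child.findall("Expression")
        # "A implies B" is logically "(not A) or B".
        return (not evaluate(precondition, params)) or evaluate(guarantee, params)
    if child.tag == "Predicate":
        compare = COMPARATORS[child.get(XSI_TYPE)]
        name = child.findtext("SLAParameter")
        limit = float(child.findtext("Value"))
        return compare(params[name], limit)
    raise ValueError(f"unsupported element: {child.tag}")


def slo_holds(params):
    root = ET.fromstring(SLO_XML)
    return evaluate(root.find("Expression"), params)


print(slo_holds({"OverloadPct": 0.1, "AvgThroughput": 900}))  # guarantee broken
print(slo_holds({"OverloadPct": 0.5, "AvgThroughput": 900}))  # obligation waived
```

The second call shows the protective effect of the Precondition: once the System is overloaded more than 30% of the Time, low Throughput no longer counts as a Violation.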
Conclusions and Outlook
WSLA supports:
Flexible Specification of inter- and intra-organizational SLA Parameters
Highly customizable Service and IT Resource-Level SLOs
Nested Customer/Provider Relationships
Definition of third (“supporting”) Parties in SLA Management
Formal, XML-Schema based Description Language
Applicable to various Kinds of Services (Web Services, Storage, eUtilities etc.)
SLA Compliance Monitor Implementation Part of IBM Web Services Toolkit
Current Work:
Comprehensive SLA Framework, comprising:
Business Metrics and Pricing,
Business Processes, Workflow and Service Composition,
SLA Editing and Reuse of common SLA Artifacts,
Integration with existing Management Frameworks (WBEM / CIM)
Let us know what you think!
Download WSTK 3.2 with
SLA Compliance Monitor from:
http://www.alphaworks.ibm.com/tech/webservicestoolkit