SLA (Service Level Agreement)

What is an SLA (Service Level Agreement)?

A Service Level Agreement (SLA) is a formal contract between a service provider and a customer that defines the expected level of service. It outlines key details such as performance standards, availability, responsibilities, and response or resolution times. SLAs usually also specify the penalties a provider will face if the agreed service levels are not met, along with the terms for cancellation.

SLAs are common in IT services, cloud computing, and customer support, ensuring accountability and transparency. They help align expectations, provide measurable benchmarks, and serve as a reference point for addressing issues or disputes when service levels are not met.


What are the Different Types of SLA?

Customer-level SLA

A customer-level SLA covers the agreement between a service provider and a specific customer, who can be internal or external. Customer-level SLAs may cover multiple services, products and systems.

Service-level SLA

A service-level SLA is the contract that covers an identical service offered to multiple customers. These are particularly common for public cloud services.

Multi-level SLA

A multi-level SLA is a contract split into different levels to incorporate more than two parties, or different levels of service, into the same agreement. These types of agreement are common when a provider offers a tiered product or service in different SKUs or pricing plans; many SaaS products define multi-level SLAs.

Operational SLA

An Operational SLA is usually an internal agreement within an organization that defines the performance standards and responsibilities between different teams or departments. Unlike customer-facing SLAs, which outline commitments to external clients, an operational SLA focuses on internal processes that support the delivery of services. For example, an IT operations team might commit to resolving server issues within four hours or ensuring system backups complete daily. Operational SLAs help improve accountability, streamline workflows, and ensure internal efficiency. Operational SLAs are usually based around their own Service Level Objectives (SLOs) and Service Level Indicators (SLIs).


SLA vs. SLO vs. SLI

An SLA (Service Level Agreement) is a formal contract between a service provider and a customer. It usually defines the minimum level of service promised (e.g., 99.9% uptime per month).

An SLO (Service Level Objective) is an internal goal or target that supports the SLA. It is usually more detailed and often stricter than the SLA. For example, the SLA may promise 99.9% uptime while the SLO targets 99.95% to provide a safety margin.

SLIs (Service Level Indicators) are the actual measurements/metrics used to determine if you’re meeting the SLO or SLA. For example: the percentage of successful HTTP requests, average response time, or system availability.

An Example (SLA vs SLO vs SLI)

  • SLI: System uptime measured = 99.96% this month
  • SLO: Internal target = 99.95% uptime
  • SLA: Customer agreement = 99.9% uptime guarantee

Here, the system met both the SLO and SLA.
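
The comparison in this example can be expressed as a small check. This is only a sketch: the function name `service_level_status` and its status strings are hypothetical, and it assumes the SLO is at least as strict as the SLA.

```python
def service_level_status(sli: float, slo: float, sla: float) -> str:
    """Compare a measured SLI with the internal SLO and the contractual SLA.

    All values are uptime percentages. Assumes slo >= sla, i.e. the internal
    target is at least as strict as the customer commitment.
    """
    if sli < sla:
        return "SLA breached"
    if sli < slo:
        return "SLO missed, SLA still met"
    return "SLO and SLA met"


# Figures from the example: measured 99.96%, SLO 99.95%, SLA 99.9%
print(service_level_status(sli=99.96, slo=99.95, sla=99.9))  # SLO and SLA met
```

A measured value between the two thresholds, such as 99.92%, would miss the SLO while still meeting the SLA, which is the early warning the safety margin is meant to provide.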


Components of a Service Level Objective (SLO)

A Service Level Objective (SLO) is an internal reliability target that defines the expected level of service for a system or application. To make it effective, an SLO typically includes several key components:

  1. Service/Function Definition – What part of the system or service the SLO applies to (e.g., API, login service, database).
  2. Metric (SLI) – The measurable indicator being tracked, such as uptime percentage, error rate, or request latency.
  3. Target/Threshold – The goal for the metric over a given period, such as “99.95% availability” or “95% of requests under 300ms.”
  4. Time Window – The duration over which the target is measured, like per day, per month, or rolling 30 days.
  5. Exclusions/Allowances – Defined exceptions (e.g., scheduled maintenance, force majeure events) that don’t count against the objective.
  6. Monitoring Method – How performance is measured and reported (e.g., Prometheus, eG Enterprise, cloud provider logs).

Together these elements can clearly define a quantifiable SLO such as:

  • “The login API should have 99.95% availability measured over a rolling 30-day period, based on successful HTTP 200 responses, excluding planned maintenance. Availability is measured via the provider’s logs as published on their website.”
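
As an illustration, the six components listed above could be captured in a simple record. The class and field names below are hypothetical and chosen only for this sketch; they do not correspond to any particular monitoring tool's schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ServiceLevelObjective:
    """Hypothetical container for the six SLO components."""
    service: str                      # 1. service/function definition
    sli: str                          # 2. metric being tracked
    target_percent: float             # 3. target/threshold
    window_days: int                  # 4. time window (rolling)
    exclusions: Tuple[str, ...] = ()  # 5. exclusions/allowances
    monitoring: str = ""              # 6. monitoring method

    def is_met(self, measured_percent: float) -> bool:
        """Return True if the measured SLI reaches the target."""
        return measured_percent >= self.target_percent


login_slo = ServiceLevelObjective(
    service="login API",
    sli="availability based on successful HTTP 200 responses",
    target_percent=99.95,
    window_days=30,
    exclusions=("planned maintenance",),
    monitoring="provider's published logs",
)

print(login_slo.is_met(99.96))  # True
```

Writing the objective down in a structured form like this makes it harder to leave out a component, such as the time window or the exclusions, that later becomes a point of dispute.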

An SLO inherently defines a downtime budget within which any issues must be repaired. For example, an SLO of 99.99% uptime over 30 days means that the total downtime your service experiences in that window must stay below 4.32 minutes (0.01% of 43,200 minutes). Tools are available to convert an uptime percentage into the corresponding downtime allowance, for example: uptime.is or slatools.com.
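
The budget arithmetic can be reproduced in a few lines. This is a minimal sketch; the function name `downtime_budget` is hypothetical, and it assumes downtime is the only thing that counts against the objective.

```python
from datetime import timedelta


def downtime_budget(target_percent: float, window: timedelta) -> timedelta:
    """Return the maximum downtime an uptime target allows over a window."""
    allowed_fraction = 1 - target_percent / 100
    return timedelta(seconds=window.total_seconds() * allowed_fraction)


budget = downtime_budget(99.99, timedelta(days=30))
print(round(budget.total_seconds() / 60, 2))  # 4.32 (minutes)
```

The same function applied to a 99.9% target over 30 days gives 43.2 minutes, which shows how quickly the budget shrinks with each additional nine.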


What are the Benefits of SLAs?

Service Level Agreements (SLAs) provide clear value for both service providers and customers by defining expectations and measurable outcomes. The main benefits include:

  • Clarity and Transparency – SLAs outline service scope, responsibilities, and performance metrics, reducing misunderstandings.
  • Accountability – Both parties know what is expected, ensuring providers deliver agreed service levels.
  • Measurable Performance – SLAs define metrics such as uptime, response time, and resolution time, making performance easier to track.
  • Customer Satisfaction – Clear commitments build trust and improve relationships.
  • Continuous Improvement – Performance monitoring against SLAs helps identify gaps and drive service enhancements.
  • Legal and Financial Due Diligence – Ensures the customer has clearly defined redress options if a service is substandard or not provided. Ensures the provider’s liabilities are reasonably limited.

What are the Common Elements of an SLA?

A number of common elements are usually included within Service Level Agreements to define the scope, expectations, and accountability between a service provider and a customer. The most common elements include:

  1. Service Scope – Clear description of the services covered by the SLA.
  2. Performance Metrics – Measurable standards such as uptime, response time, resolution time, and throughput.
  3. Roles and Responsibilities – Defines what the provider and customer are each responsible for.
  4. Availability and Reliability – Commitments on system uptime, availability windows, and maintenance schedules.
  5. Incident Management – How issues are reported, categorized, and escalated.
  6. Support and Response Times – Target times for responding to and resolving incidents or requests.
  7. Monitoring and Reporting – How performance will be measured, tracked, and shared with stakeholders.
  8. Penalties and Remedies – Consequences if service levels are not met (e.g., service credits, refunds).
  9. Terms for Cancellation – The circumstances under which each party can cancel an SLA.
  10. Security and Compliance – Standards related to data protection, privacy, and regulatory compliance.
  11. Review and Revision Process – Guidelines for updating the SLA as services or business needs evolve.

Calculating Composite SLAs

Organizations often procure multiple services from a vendor and use them in series or in parallel, with each service having its own SLA uptime and commitments. This is particularly common when purchasing multiple cloud services, for example from AWS. In these cases the individual SLAs can be combined to calculate a composite SLA for the overall application; third-party articles cover the topic in more detail.

Composite SLAs – AWS Example

Imagine you have an application that uses:

  1. AWS Lambda for processing.
  2. Amazon DynamoDB for database operations.
  3. Amazon API Gateway to expose your API.

If all three of these services are required and must be up for the application to function correctly, you would multiply their individual SLAs to get your composite SLA:

  • Lambda SLA: 99.95%
  • DynamoDB SLA: 99.99%
  • API Gateway SLA: 99.95%

Composite SLA = 0.9995 * 0.9999 * 0.9995 = 0.998900349975, or 99.89%

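For comparison, 99.95% uptime corresponds to about 21min 55s of downtime per month, whereas the composite 99.89% corresponds to about 48min 13s per month (both figures assume an average-length month of roughly 30.44 days).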
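
The calculation above can be sketched in code as follows. The function names are hypothetical, the month length of 43,830 minutes (365.25 days / 12) is an assumption chosen to match the downtime figures quoted here, and both formulas assume that the services fail independently. The parallel case, where any one redundant component is enough to keep the application running, is not worked through in this article and is included only as the standard counterpart to the serial formula.

```python
from math import prod

# Assumption: average month of 365.25 / 12 days = 43,830 minutes
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12


def serial_availability(availabilities):
    """All services are required: multiply the individual availabilities."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Any one service suffices: 1 minus the chance that all fail together."""
    return 1 - prod(1 - a for a in availabilities)


def monthly_downtime_minutes(availability):
    """Downtime per average month implied by an availability fraction."""
    return (1 - availability) * MINUTES_PER_MONTH


# Lambda, DynamoDB and API Gateway, all required
composite = serial_availability([0.9995, 0.9999, 0.9995])
print(f"{composite:.2%}")  # 99.89%
print(monthly_downtime_minutes(composite))
```

Passing the rounded figure 0.9989 to `monthly_downtime_minutes` gives roughly 48.2 minutes, in line with the figure quoted above. Note that a serial composite is always lower than the weakest individual SLA, so every additional required dependency reduces the availability you can commit to.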