Service level objectives are one of the most useful tools in SRE, but they often get introduced as abstract theory instead of an operating practice. This guide turns SLIs, SLOs, and error budgets into a repeatable framework you can use to estimate reliability targets, choose practical measurements, and revisit them as your product, traffic, and business expectations change. If you need a clear way to move from “we want to be reliable” to “here is the target, the budget, and the decision rule,” this article is meant to be a durable reference.
Overview
An SLO program starts with a simple idea: not every failure matters equally, and not every service needs the same reliability target. The point of service level objectives is to define reliability in a way that is measurable, useful to users, and actionable for engineering teams.
Three terms matter most:
SLI stands for service level indicator. It is the measurement you use to represent a user-relevant aspect of service health, such as successful request rate, latency under a threshold, job completion success, or freshness of data.
SLO stands for service level objective. It is the target for that measurement over a defined time window, such as “99.9% of API requests succeed over 30 days” or “95% of page loads complete within 500 ms over 7 days.”
Error budget is the amount of unreliability you are willing to tolerate while still meeting the SLO. If the objective is 99.9% success, the budget is the remaining 0.1% of requests that may fail within the window.
The practical value of this framework is not the number itself. It is what the number allows you to do. A good SLO helps teams make tradeoffs between feature delivery and reliability work, tune alerting, plan incident response, and discuss risk with product owners in concrete terms.
Many teams struggle because they start from infrastructure metrics alone. CPU saturation, memory pressure, and pod restarts are useful for diagnosis, but they are not always good SLIs. Users do not directly experience a high CPU graph; they experience a timeout, a failed checkout, or stale data. A durable SLO program begins with the user journey and works backward to telemetry.
That is also why SLOs fit naturally into a broader observability stack. Instrumentation decisions influence whether you can calculate good indicators at all. If your telemetry foundation is still evolving, it helps to align SLO planning with implementation work such as metrics and trace collection; the OpenTelemetry setup guide is a useful companion for deciding what to instrument first.
How to estimate
The easiest way to define SLOs is to treat them like an estimation exercise with repeatable inputs rather than a one-time declaration. You are estimating what users need, what the system can realistically deliver, and how much failure the business can tolerate.
A practical sequence looks like this:
1. Pick a user-facing journey.
Choose a path where reliability clearly matters: logging in, creating an order, receiving an API response, processing a background job, syncing data, or rendering a dashboard. Avoid starting with internal components unless they map directly to user outcomes.
2. Choose a meaningful indicator.
For each journey, ask what “good” looks like from the user perspective. Typical categories include:
- Availability: whether a request succeeds at all
- Latency: whether the response arrives fast enough
- Quality: whether the response is complete and correct
- Freshness: whether data is recent enough to be useful
- Durability or completion: whether a job actually finishes
3. Define the measurement formula.
Good SLIs are ratios when possible because ratios are easy to interpret and compare. Examples:
- Successful requests / total eligible requests
- Requests under 300 ms / total eligible requests
- Completed jobs / started jobs
- Datasets refreshed within 15 minutes / total scheduled refreshes
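As a minimal sketch of the ratio form above, the helper below divides good events by eligible events. The function name and the sample counts are illustrative assumptions, not values from a real system.

```python
def sli_ratio(good_events: int, eligible_events: int) -> float:
    """Return the fraction of eligible events that were good (0.0 to 1.0)."""
    if eligible_events <= 0:
        raise ValueError("eligible_events must be positive")
    if not 0 <= good_events <= eligible_events:
        raise ValueError("good_events must be between 0 and eligible_events")
    return good_events / eligible_events


# Hypothetical counts for one 30-day window.
availability = sli_ratio(good_events=9_993_000, eligible_events=10_000_000)
print(f"{availability:.4%}")  # prints 99.9300%
```

The same function works for any of the ratios listed above, as long as "good" and "eligible" are counted from the same population.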
4. Set the time window.
Common windows are 7 days, 28 days, or 30 days. Short windows react faster but can be noisy. Longer windows smooth variance but delay feedback. Many teams use a rolling 28- or 30-day window for policy decisions and shorter windows for burn-rate alerting, which tracks how fast the error budget is being consumed relative to the pace the SLO allows.
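Burn rate is commonly defined as the observed bad-event ratio in a short window divided by the ratio the SLO allows. The sketch below assumes that definition; the function name and the one-hour sample counts are hypothetical.

```python
def burn_rate(bad_events: int, eligible_events: int, slo_target: float) -> float:
    """Observed bad-event ratio divided by the ratio the SLO allows.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO window if this pace continued; above 1.0 means it runs out early.
    """
    if eligible_events <= 0:
        raise ValueError("eligible_events must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    allowed_ratio = 1.0 - slo_target
    return (bad_events / eligible_events) / allowed_ratio


# Hypothetical last hour: 60 bad requests out of 20,000 against a 99.9% SLO.
print(f"{burn_rate(60, 20_000, 0.999):.1f}")  # prints 3.0
```

A short window makes this number react quickly, which is why it suits alerting, while the long rolling window remains the basis for policy decisions.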
5. Estimate the target.
This is where teams often jump straight to a “high” number. A better approach is to estimate from context:
- How critical is the journey?
- What is the user cost of failure?
- How mature is the current system?
- How often do you deploy and change it?
- Can the team observe and mitigate failures quickly?
For a non-critical internal reporting feature, a lower objective may be acceptable. For authentication, payment processing, or a public API with contractual expectations, the target may need to be stricter. The key is to set a number that drives decisions, not one that looks impressive in a slide deck.
6. Convert the target into an error budget.
This is the calculator step that makes the SLO operational.
Error budget percentage = 100% - SLO target
Allowed bad events in window = total eligible events × error budget percentage
If you serve 10,000,000 eligible requests in 30 days and the SLO is 99.9%, then:
- Error budget percentage = 0.1%
- Allowed bad requests = 10,000,000 × 0.001 = 10,000
That budget becomes a decision mechanism. If you consume it too quickly, you may slow releases, pause risky migrations, or prioritize stability work. If budget burn is low, you may have room to ship changes more aggressively.
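The two formulas in this step can be sketched as a small function. This is an illustration under stated assumptions: the function name is hypothetical, the target is passed as a fraction (0.999 for 99.9%), and Decimal is used only to avoid floating-point noise in the subtraction.

```python
from decimal import Decimal


def error_budget(slo_target: float, eligible_events: int) -> tuple[Decimal, int]:
    """Return (budget fraction, allowed bad events) for one window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if eligible_events < 0:
        raise ValueError("eligible_events must not be negative")
    budget_fraction = Decimal(1) - Decimal(str(slo_target))
    # Round down so the computed allowance never exceeds the budget.
    allowed_bad = int(budget_fraction * eligible_events)
    return budget_fraction, allowed_bad


fraction, allowed = error_budget(slo_target=0.999, eligible_events=10_000_000)
print(allowed)  # prints 10000
```

Because real traffic is not known in advance, the eligible count is usually an estimate from recent windows, so the allowed count should be treated as a planning figure.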
7. Define policy, not just measurement.
An SLO without a response policy is only a dashboard. Decide in advance what happens when budget burn crosses thresholds. For example:
- At 25% budget burned: review recent incidents and top contributors
- At 50% budget burned: require reliability review for risky releases
- At 75% budget burned: defer non-essential changes
- At 100% budget burned: focus the next sprint or change window on recovery
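The thresholds above can be encoded as a lookup so the policy is applied the same way every time. This is a sketch: the POLICY table mirrors the example thresholds, while the function name and the default below 25% are assumptions.

```python
# Thresholds from the example policy, highest first.
POLICY = [
    (1.00, "focus the next sprint or change window on recovery"),
    (0.75, "defer non-essential changes"),
    (0.50, "require reliability review for risky releases"),
    (0.25, "review recent incidents and top contributors"),
]


def policy_action(bad_events: int, allowed_bad_events: int) -> str:
    """Return the action for the highest threshold the budget burn has crossed."""
    if allowed_bad_events <= 0:
        raise ValueError("allowed_bad_events must be positive")
    consumed = bad_events / allowed_bad_events
    for threshold, action in POLICY:
        if consumed >= threshold:
            return action
    # Assumed default: the example policy does not define a step below 25%.
    return "normal release flow"


print(policy_action(bad_events=6_000, allowed_bad_events=10_000))
# prints: require reliability review for risky releases
```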
8. Validate with real incidents.
Look back at recent outages, slowdowns, or customer complaints. Would the proposed SLI have captured those events? Would the SLO have signaled a meaningful problem? If not, refine the indicator before you formalize the target.
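One simple way to run that look-back is to compare known incident days against days on which the SLI fell below the target. The sketch below uses a per-day ratio as a simplification of a rolling window; the function names, dates, and counts are all hypothetical.

```python
from datetime import date


def days_below_target(
    daily_counts: dict[date, tuple[int, int]], slo_target: float
) -> set[date]:
    """Days whose good/eligible ratio fell below the target."""
    return {
        day
        for day, (good, eligible) in daily_counts.items()
        if eligible > 0 and good / eligible < slo_target
    }


def missed_incidents(
    incident_days: set[date],
    daily_counts: dict[date, tuple[int, int]],
    slo_target: float,
) -> set[date]:
    """Incident days the SLI did not flag; these point to blind spots."""
    return incident_days - days_below_target(daily_counts, slo_target)


# Hypothetical data: (good events, eligible events) per day.
daily = {
    date(2024, 3, 1): (99_990, 100_000),
    date(2024, 3, 2): (95_000, 100_000),  # outage, visible in the SLI
    date(2024, 3, 3): (99_950, 100_000),  # customer complaints, SLI looks fine
}
incidents = {date(2024, 3, 2), date(2024, 3, 3)}
print(missed_incidents(incidents, daily, slo_target=0.999))
# prints {datetime.date(2024, 3, 3)}
```

An incident that shows up in the missed set suggests the good-versus-bad definition needs refinement before the target is formalized.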
This estimation method keeps the conversation grounded. Instead of arguing over whether a service should be “three nines” or “four nines,” teams compare user impact, traffic patterns, current behavior, and the cost of consuming the error budget.
Inputs and assumptions
Every SLO depends on assumptions, and weak assumptions create noisy or misleading targets. Before rolling out objectives, make the inputs explicit.
Eligible events
Define which requests, jobs, or transactions count. Excluding known bots, health checks, or internal probes may make sense. Excluding hard cases simply to improve the number does not. The rule should be stable and explainable.
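An eligibility rule is easiest to keep stable when it lives in one small, reviewable function. The sketch below is an illustration only: the Request fields, the health-check paths, and the probe user-agent markers are hypothetical and would need to match your own traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    path: str
    user_agent: str
    status: int


# Hypothetical exclusion lists; keep them short and explainable.
HEALTH_CHECK_PATHS = {"/healthz", "/readyz"}
PROBE_AGENT_MARKERS = ("internal-probe", "synthetic-monitor")


def is_eligible(req: Request) -> bool:
    """Return True if the request should count toward the SLI."""
    if req.path in HEALTH_CHECK_PATHS:
        return False
    agent = req.user_agent.lower()
    return not any(marker in agent for marker in PROBE_AGENT_MARKERS)


requests = [
    Request("/orders", "Mozilla/5.0", 200),
    Request("/healthz", "kube-probe/1.29", 200),
    Request("/orders", "internal-probe/2.0", 500),
    Request("/login", "Mozilla/5.0", 503),
]
eligible = [r for r in requests if is_eligible(r)]
print(len(eligible))  # prints 2
```

Note that the failing /login request stays eligible: the rule filters by who is asking, not by whether the outcome was good.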
Good versus bad criteria
You need a precise boundary. For availability, that may be HTTP success responses for a particular route family. For latency, it may be completion under a threshold for valid requests. Be careful with mixed semantics; if your 200 responses can still contain user-visible failures, availability alone may not represent real experience.
Threshold selection
Latency SLOs are especially sensitive to threshold choice. A threshold that is too generous hides user pain. One that is too strict creates chronic burn and alert fatigue. Start with a threshold that reflects actual usability, then refine it after reviewing a few windows of data.
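To see how sensitive the ratio is to the threshold, it helps to compute the same SLI at several candidate thresholds over real samples. The sketch below uses a hypothetical function name and made-up latency samples in milliseconds.

```python
def fraction_under(latencies_ms: list[float], threshold_ms: float) -> float:
    """Return the share of samples that completed under the threshold."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    under = sum(1 for value in latencies_ms if value < threshold_ms)
    return under / len(latencies_ms)


# Hypothetical latency samples in milliseconds.
samples = [120, 180, 240, 290, 310, 450, 520, 800, 950, 1400]
for threshold in (300, 500, 1000):
    print(threshold, f"{fraction_under(samples, threshold):.0%}")
# prints:
# 300 40%
# 500 60%
# 1000 90%
```

With these samples, a 95% target would burn constantly at any of the three thresholds, which is the kind of mismatch worth finding before the SLO is published.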
Measurement source
Decide whether the SLI comes from load balancer metrics, application telemetry, synthetic checks, client-side instrumentation, or a combination. Server-side metrics are often easier to start with, but client-side signals can better represent actual user conditions. The best choice depends on the journey being measured.
Window length
Rolling windows are usually more useful than calendar months because they reflect the current state continuously. Still, teams should pick one approach and stick to it long enough to establish trend history.
Traffic volume
High-volume services can support more precise ratios. Low-volume services may show wild swings where one incident consumes a large share of the budget. In those cases, consider longer windows, event aggregation, or carefully chosen composite indicators.
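A quick way to gauge this effect is to compute how much of the budget a single bad event consumes. The sketch below assumes a hypothetical function name and takes the target as a fraction; Decimal is used only to keep the arithmetic exact.

```python
from decimal import Decimal


def budget_share_per_bad_event(slo_target: float, eligible_events: int) -> float:
    """Return the fraction of the error budget consumed by one bad event."""
    allowed = (Decimal(1) - Decimal(str(slo_target))) * eligible_events
    if allowed <= 0:
        raise ValueError("the window allows no bad events")
    return float(1 / allowed)


# Low volume: 2,000 sessions at 99% means one failure uses 5% of the budget.
print(f"{budget_share_per_bad_event(0.99, 2_000):.2%}")  # prints 5.00%
# High volume: 20,000,000 requests at 99.9% means one failure barely registers.
print(f"{budget_share_per_bad_event(0.999, 20_000_000):.4%}")  # prints 0.0050%
```

When a single event consumes several percent of the budget, a longer window or an aggregated indicator is usually worth considering.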
Dependency boundaries
Modern systems depend on many services. You should decide whether the SLO measures the full user journey, a single service, or both. A user-journey SLO is usually better for business discussion. A component SLO is often better for team ownership and debugging. Mature programs commonly use both layers.
Organizational response
The error budget is only real if it affects behavior. If engineers can blow through the budget without any change in deployment pace, review process, or prioritization, then the SLO exists only on paper.
A few common mistakes are worth avoiding:
- Using too many SLOs for a single service before the team has operational discipline
- Choosing indicators no one can explain to product or support teams
- Setting targets based on industry folklore rather than user impact
- Using infrastructure health as a substitute for service health
- Ignoring observability gaps that make the metric incomplete or misleading
If your telemetry and alerting are still uneven, it is often smart to improve your monitoring foundation first. The Prometheus alerting rules checklist can help tighten signal quality, and the Kubernetes troubleshooting checklist is useful when service-level failures trace back to cluster issues.
Worked examples
These examples show how to estimate an objective, compute an error budget, and attach a decision rule. The exact numbers are illustrative; the method is what should carry over to your environment.
Example 1: Public API availability SLO
Service: External REST API used by customer applications
User expectation: Requests should succeed consistently
SLI: Successful API responses / total eligible API responses
Window: 30 days
Target SLO: 99.9%
Suppose the API serves 20,000,000 eligible requests in a 30-day window.
- Error budget = 0.1%
- Allowed failures = 20,000,000 × 0.001 = 20,000 bad requests
Operational use:
- If fewer than 5,000 bad requests occur, normal release flow continues
- If 10,000 to 15,000 bad requests (50% to 75% of the budget) occur early in the window, risky schema and networking changes require extra review
- If the budget is fully consumed, the team pauses non-essential releases until reliability work is completed
This is a strong candidate for a classic availability SLO because users directly experience request success or failure.
Example 2: Background job completion SLO
Service: Nightly document processing pipeline
User expectation: Submitted documents should finish processing by the next business day
SLI: Completed jobs within the daily deadline / total scheduled jobs
Window: 28 days
Target SLO: 99.5%
Suppose the pipeline runs 40,000 jobs in 28 days.
- Error budget = 0.5%
- Allowed misses = 40,000 × 0.005 = 200 jobs
Operational use:
- If a single failed deployment causes 80 missed jobs (40% of the 200-job budget), a post-incident review is required
- If repeated misses consume more than half the budget, platform and application teams jointly review queue depth, worker sizing, and retry behavior
This example shows that SLIs do not have to be tied to HTTP requests. Completion and timeliness can be the right measure for asynchronous systems.
Example 3: Web latency SLO for a critical page
Service: Customer dashboard page
User expectation: The page should render fast enough to feel responsive
SLI: Page views rendered under 1 second / total eligible page views
Window: 7 days
Target SLO: 95%
Suppose there are 500,000 eligible page views in a week.
- Error budget = 5%
- Allowed slow views = 500,000 × 0.05 = 25,000 page views over 1 second
Operational use:
- If a new analytics widget increases slow views by 15,000 in two days (60% of the weekly budget), the feature may be rolled back or degraded behind a flag
- If weekly latency burn remains high, the team may optimize queries, add caching, or revise the page composition model
This is a reminder that not every useful SLO needs to target “three nines.” For some journeys, a lower ratio with a user-centered latency threshold is more meaningful than a high availability target.
Example 4: Internal platform service with low traffic
Service: Internal deployment dashboard used by engineers
User expectation: It should usually work, but short disruptions are acceptable
SLI: Successful sessions / total sessions
Window: 30 days
Target SLO: 99%
Suppose there are only 2,000 sessions each month.
- Error budget = 1%
- Allowed failures = 2,000 × 0.01 = 20 failed sessions
With low traffic, small incidents can move the percentage quickly: each failed session consumes 5% of the 20-session budget. The team may decide to keep the SLO but avoid overreacting to every dip, or use a longer window to reduce noise.
Across all four examples, the pattern is consistent:
- Choose the journey
- Define the ratio
- Set the window
- Compute the error budget from traffic volume and target
- Attach operating decisions to budget consumption
That makes SLOs a reliability calculator, not just a reporting exercise.
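The four budgets above can be reproduced with a few lines, which is a convenient starting point for plugging in your own numbers. The targets and volumes come from the examples; the function name and table layout are illustrative assumptions.

```python
from decimal import Decimal

# (name, SLO target as a fraction, eligible events in the window)
EXAMPLES = [
    ("Public API availability", "0.999", 20_000_000),
    ("Background job completion", "0.995", 40_000),
    ("Dashboard page latency", "0.95", 500_000),
    ("Internal deployment dashboard", "0.99", 2_000),
]


def allowed_bad_events(slo_target: str, eligible_events: int) -> int:
    """Allowed bad events = eligible events x (1 - SLO target), rounded down."""
    return int((Decimal(1) - Decimal(slo_target)) * eligible_events)


for name, target, volume in EXAMPLES:
    budget = allowed_bad_events(target, volume)
    print(f"{name:<32}{target:>7}{volume:>12,}{budget:>8,}")
```

The last column comes out as 20,000, 200, 25,000, and 20, matching the budgets computed in the examples.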
When to recalculate
SLOs should be stable enough to create discipline, but not frozen forever. Reliability targets become stale when the system, users, or business context changes. The best time to revisit them is when the inputs behind the estimate no longer match reality.
Recalculate or review your SLIs, SLOs, and error budgets when any of the following happens:
- Traffic changes materially. A large increase in volume means the same percentage budget now permits more failed events in absolute terms, which changes how an incident of a given size should be judged. A low-volume service becoming high-volume may support tighter measurement and more granular policy.
- User expectations change. A feature that becomes customer-critical may need a stricter objective than it had when it was internal or experimental.
- Architecture changes. Migrations to Kubernetes, service decomposition, caching layers, queue redesigns, or CDN changes can alter what should be measured and where data should come from.
- Incident patterns repeat. If major customer-facing incidents are not reflected in the SLI, the indicator is incomplete. If the SLO burns constantly from noise that users barely notice, the target or measurement may be poorly chosen.
- Instrumentation improves. Better telemetry often reveals that an older indicator was a proxy rather than a direct measure. That is a good reason to revise the SLO.
- Ownership changes. When platform teams, application teams, or product teams shift responsibilities, the boundaries of service-level accountability may need updating.
- Release velocity changes. More frequent deployments, larger batch changes, or new rollout strategies can justify a fresh look at budget policy.
A practical review cadence is quarterly for most teams, with immediate review after significant incidents or product shifts. The goal is not to chase the number every month. The goal is to keep the SLO honest enough that it still drives useful decisions.
To make that review lightweight, keep a simple checklist:
- Is the SLI still user-relevant?
- Is the telemetry source still trustworthy?
- Did recent incidents show blind spots?
- Does the target still match service criticality?
- Does the current error budget policy change team behavior in practice?
- Has traffic volume changed the usefulness of the ratio?
If any answer points to a gap, update the SLO deliberately and document why. Avoid silent drift.
For teams building a broader reliability practice, SLO work pairs well with adjacent operational reviews. Alerting quality, deployment risk, and infrastructure consistency all affect whether objectives are achievable and meaningful. Related guides on oracles.cloud can help you tighten the surrounding system, including the CI/CD pipeline bottleneck finder for delivery risk, the Terraform best practices checklist for infrastructure discipline, and the Kubernetes cost optimization checklist when scaling reliability work also affects platform spend.
Action plan: pick one important user journey this week, define one ratio-based SLI, choose a realistic window, compute the error budget from your actual volume, and write down the policy for what happens when half the budget is gone. That single exercise will do more for operational clarity than creating a long list of dashboards without decisions attached.