In Site Reliability Engineering (SRE), SLI stands for Service Level Indicator.
Understanding SLIs in SRE
An SLI is a core concept in Site Reliability Engineering (SRE) that helps teams quantify the performance and reliability of a service. It is a service level indicator: a carefully defined quantitative measure of some aspect of the level of service that is provided.
Think of an SLI as a direct measurement of how well your service is performing against a specific metric that matters to your users or business. It's the raw data point you collect to understand service health.
Why are SLIs Important?
SLIs are fundamental to SRE because they:
- Provide Objective Data: They move discussions about service quality from subjective opinions ("it feels slow") to concrete, measurable data points ("latency is 500ms").
- Enable Goal Setting: They form the basis for setting Service Level Objectives (SLOs), which are the targets for your SLIs.
- Drive Prioritization: By monitoring SLIs, SRE teams can identify areas that need attention and prioritize work based on actual service performance.
- Facilitate Communication: They provide a common language for discussing service health with stakeholders across engineering, product, and business teams.
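The relationship between an SLI and its SLO can be sketched in a few lines of Python. This is only an illustration: the `SLO` class, its fields, and the 300 ms target are hypothetical and not taken from any particular tool. The 500 ms measurement echoes the latency example in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target for one SLI (hypothetical structure, for illustration only)."""
    name: str
    target: float
    higher_is_better: bool

    def is_met(self, sli_value: float) -> bool:
        # For ratios such as success rate, higher is better;
        # for latency, lower is better.
        if self.higher_is_better:
            return sli_value >= self.target
        return sli_value <= self.target


# Assumed target: p99 latency should stay at or below 300 ms.
latency_slo = SLO("p99 latency (ms)", target=300.0, higher_is_better=False)

# The measured SLI is 500 ms, so the objective is missed.
print(latency_slo.is_met(500.0))  # prints False
```

The point of the sketch is the separation of roles: the SLI is the measured number, and the SLO is the threshold that number is compared against.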
Common Examples of SLIs
SLIs vary with the nature of the service, but a few recur across most systems. Request latency in particular is treated as a key SLI by most services.
Here are a few common SLIs:
- Request Latency: How long it takes for a service to return a response to a request. It can be reported as an average, a median (p50), or a tail percentile (p95, p99); tail percentiles expose the slow requests that an average hides.
- Error Rate: The percentage of requests that result in an error (e.g., HTTP 5xx errors).
- Availability: The percentage of time a service is accessible and functioning correctly.
- Throughput: The number of requests a service handles per unit of time (e.g., requests per second).
- Durability: For storage services, the likelihood that data will be preserved over a long period.
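Several of the indicators above can be derived from a plain list of request records. The following is a minimal sketch, not a production implementation. The `Request` record, the `compute_slis` function, and the sample data are all hypothetical. It also assumes that any HTTP 5xx status counts as an error, and it uses the nearest-rank method for percentiles, which is only one of several common definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape)."""
    latency_ms: float
    status: int  # HTTP status code


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at
    least p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def compute_slis(requests, window_seconds):
    """Compute a few common SLIs over one measurement window."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in window")
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    latencies = [r.latency_ms for r in requests]
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p99_ms": percentile(latencies, 99),
        "error_rate": errors / total,
        "throughput_rps": total / window_seconds,
    }


# Made-up sample: four requests observed over a 2-second window.
sample = [
    Request(100.0, 200),
    Request(200.0, 200),
    Request(300.0, 500),
    Request(400.0, 200),
]
slis = compute_slis(sample, window_seconds=2)
print(slis)
```

Availability and durability are left out on purpose: the time-based availability described above needs uptime or probe data rather than a request log, and durability needs storage-level data.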
Defining Good SLIs
Defining effective SLIs requires careful consideration. They should be:
- Quantifiable: Expressible as a number, such as a percentage, a duration, or a rate.
- Understandable: Clear about what is being measured and why it matters.
- Measurable: Backed by data that can be collected reliably.
- Relevant: Directly tied to user experience or business value.
SLIs are the raw data points that power the SRE approach to reliability. By carefully defining and monitoring these indicators, teams gain the visibility needed to manage service performance effectively and make data-driven decisions.