The importance of service level management for the customer experience

by Helen J. Wolf

Organizations face rising costs for goods and services, driven by a potent combination of Covid-19 and mass layoffs. This has reduced the supply of technical talent and put pressure on employees working in lean teams.

Staff shortages have particularly affected site reliability engineers (SREs), who are under extreme pressure to ensure digital assets perform at optimum levels 24/7. SREs are tasked with delivering the best possible customer experience with limited resources, while business leaders push for responsive, error-free services as they compete for market share.

Unfortunately, manually tracking performance and incident data is difficult, time-consuming, and frustrating for both IT and the business. By automating that tracking through a programmatic approach, teams can make manual intervention a thing of the past.

Under the hood of SLM

SREs are essential to understanding exactly how customers experience a product or service and to monitoring the system’s performance and reliability through the customer’s eyes. Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are central to any SRE practice.

SRE teams often establish strict SLOs for the customer-facing components of their applications, in support of the Service Level Agreement (SLA) the company has agreed with customers. From there, the team can apply error budgets to understand how much failure it can tolerate while solving problems and still stay compliant with the SLOs and SLAs.
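The error-budget arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than any particular tool's implementation; the function name `error_budget`, its parameters, and the 99.9% target are assumptions chosen for the example.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize the error budget implied by an SLO target (hypothetical helper).

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests that may fail without breaching the SLO.
    allowed_failures = (1.0 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
        "slo_breached": failed_requests > allowed_failures,
    }


if __name__ == "__main__":
    # Assumed numbers: 1,000,000 requests in the period, 250 of them failed.
    print(error_budget(0.999, 1_000_000, 250))
```

With these assumed numbers the budget is roughly 1,000 failed requests, of which about a quarter has been spent.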

Service levels enable teams to express expectations through observability, creating an objective, data-driven view of service delivery across the organization. At a glance, business leaders can use service levels to monitor compliance across multiple teams and business units, which reflects team and company performance related to the customer experience.

Programmatically tracked SLIs and SLOs are fundamental to SRE practice because they relieve technicians of the burden of manually tracking performance and incident data.

Defining relevant indicators and objectives

SLIs should be relevant to a service provided, simple, and easy to understand. When an SLI underperforms an SLO target during the measurement period, it indicates a business impact, such as excessive unavailability or a sub-optimal user experience.

SLIs often focus on user experience measurements. Typical indicators are latency/response time, error rate/quality, availability, and uptime. Indicators less relevant to service delivery include CPU/disk/memory usage, cache hit rate, and garbage collection time. Unless resource saturation is present, these indicators do not directly correlate with user experience.

The key to a useful SLI is to choose an indicator that is clearly and unambiguously related to service delivery, easy to measure, and, most importantly, usable.

Programmatic SLIs have three main characteristics: they are current and reflect the status of a system in real-time; they are automated (they are measured and reported consistently by instrumentation, not by users); and finally, they are useful because they are selected based on what the user of a system considers important.
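As a rough sketch of what "measured by instrumentation" can look like, the snippet below derives two common SLIs from a batch of request records. The `Request` record, the function names, and the rule that any 5xx status counts as a failure are assumptions for illustration; real telemetry will have its own schema and failure criteria.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape)."""
    latency_ms: float
    status_code: int


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests that did not end in a server error (5xx)."""
    if not requests:
        raise ValueError("no requests observed in this period")
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> float:
    """Fraction of requests answered within threshold_ms."""
    if not requests:
        raise ValueError("no requests observed in this period")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


if __name__ == "__main__":
    sample = [
        Request(120.0, 200),
        Request(80.0, 200),
        Request(950.0, 200),
        Request(60.0, 503),
    ]
    print("availability:", availability_sli(sample))
    print("latency <= 300 ms:", latency_sli(sample, 300.0))
```

Both indicators are ratios of good events to total events, which keeps them easy to explain and lets them be compared directly against a percentage target.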

Programmatic SLIs allow engineering teams to automate tasks such as tracking performance across service boundaries, following end-to-end user journeys, and measuring reliability across groups within defined tolerances. They also reduce manual work, because DevOps teams get a clear signal when something happens that affects users and, therefore, the business.

An important part of creating programmatic SLIs is identifying the capabilities of each system or service:

A system is a collection of services and resources that exposes one or more capabilities to external customers (end users or other internal teams). A capability is a particular aspect of functionality that a service exposes to its users, expressed in plain language. A service is a runtime process (or horizontally scaled layer of processes) that is a subset of the system.

SLOs express the objective that the SLIs must meet over a certain period.

SLOs should be easy to understand, even for non-technical stakeholders. For example, for each SLI, create a baseline SLO as a percentage (e.g., 99%) that states the share of the population for which the SLI must be met over a one-week rolling window.
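A minimal sketch of such a baseline SLO check follows, assuming each event is a `(timestamp, good)` pair where `good` records whether the SLI condition was met. The function names and the choice to treat an empty window as compliant are assumptions for the example, not a prescribed method.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

Event = Tuple[datetime, bool]  # (when the request happened, did it meet the SLI?)


def slo_compliance(
    events: Iterable[Event],
    now: datetime,
    window: timedelta = timedelta(days=7),
) -> Optional[float]:
    """Fraction of good events inside the rolling window ending at `now`."""
    start = now - window
    in_window = [good for ts, good in events if start < ts <= now]
    if not in_window:
        return None  # nothing measured, so there is nothing to judge
    return sum(in_window) / len(in_window)


def slo_met(
    events: Iterable[Event],
    now: datetime,
    target: float = 0.99,
    window: timedelta = timedelta(days=7),
) -> bool:
    """True when compliance reaches the target (assumption: no data counts as met)."""
    compliance = slo_compliance(events, now, window)
    return compliance is None or compliance >= target


if __name__ == "__main__":
    now = datetime(2024, 1, 15, 12, 0)
    events = (
        [(now - timedelta(days=1), True)] * 995
        + [(now - timedelta(days=2), False)] * 5
        + [(now - timedelta(days=10), False)] * 500  # older than one week, ignored
    )
    print(slo_compliance(events, now), slo_met(events, now))
```

Because the window rolls, failures older than a week stop counting against the objective without any manual reset.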

In non-technical terms, this can be described as meeting 99% of all user requests within the conditions defined by the SLI during the period. Importantly, averages should be avoided when using metrics to characterize distributions. Skewed distributions are common, and an average hides their extremes, so it can mask poor service delivery for a significant number of users.
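A small worked example makes the point about averages concrete. The latency figures below are invented for illustration, and `percentile` is a hypothetical nearest-rank helper rather than a library function.

```python
import math
import statistics
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Invented sample: 95 fast requests and 5 very slow ones (milliseconds).
latencies_ms = [100] * 95 + [3000] * 5

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
share_under_300 = sum(1 for v in latencies_ms if v <= 300) / len(latencies_ms)

if __name__ == "__main__":
    print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
    print(f"share of requests within 300 ms: {share_under_300:.0%}")
```

Here the mean is 245 ms, comfortably under a 300 ms threshold, yet only 95% of requests meet that threshold and the 99th percentile is 3,000 ms. A 99% SLO on this SLI would be missed even though the average looks healthy.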

SLOs reflect the entire population using a service over a given period. If several cohorts have different SLAs associated with service delivery, separate SLOs must be defined so that each cohort is monitored and measured independently.
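One way to picture per-cohort objectives is shown below. It is only a sketch: the cohort names, their targets, and the `evaluate_cohorts` function are assumptions made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

CohortEvent = Tuple[str, bool]  # (cohort name, did the request meet the SLI?)


def evaluate_cohorts(
    events: Iterable[CohortEvent],
    targets: Dict[str, float],
) -> Dict[str, dict]:
    """Compare each cohort's compliance against that cohort's own SLO target."""
    good = defaultdict(int)
    total = defaultdict(int)
    for cohort, ok in events:
        total[cohort] += 1
        good[cohort] += int(ok)

    report = {}
    for cohort, target in targets.items():
        seen = total.get(cohort, 0)
        compliance = good.get(cohort, 0) / seen if seen else None
        report[cohort] = {
            "compliance": compliance,
            "target": target,
            # Assumption: a cohort with no traffic is reported as meeting its SLO.
            "met": compliance is None or compliance >= target,
        }
    return report


if __name__ == "__main__":
    events = (
        [("enterprise", True)] * 999 + [("enterprise", False)]
        + [("standard", True)] * 98 + [("standard", False)] * 2
    )
    targets = {"enterprise": 0.9995, "standard": 0.95}  # hypothetical SLA tiers
    for cohort, result in evaluate_cohorts(events, targets).items():
        print(cohort, result)
```

In this invented data the stricter enterprise objective is missed while the standard one is met, a difference that a single blended SLO across both cohorts would hide.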

SLOs are designed to balance the priorities of DevOps teams and to keep the customer at the center of any activity that carries a risk of non-compliance with SLAs. To achieve this in practice, the day-to-day operations of teams must be guided by the current state of the SLOs. When an SLO moves in the wrong direction, teams must switch to activities and behaviors that bring it back in line. Once the SLO is restored, regular operations can resume.

At cloud-based payment player Zico, a Service Level Management feature that automates these tasks was critical in enabling technicians to visualize and report on the company’s service level indicators and objectives and to calculate error budgets. It breaks the work of defining an SLI and setting its goals into an easy-to-understand, repeatable process for the technical teams.

Setting up SLIs and SLOs results in a simpler and more responsive observability practice, tighter alignment with the business, and faster improvement. To ease the burden on SREs, it is essential to provide the right tools to configure and deliver meaningful SLIs and SLOs automatically.
