Latency, Tail Latency and Response Time in Distributed Systems

June 2024 · 13 minute read

Response time is a metric that shows how fast software responds: the time between the moment a request is sent and the moment the response is received. Questions arise right from the start: sorting 100 000 elements naturally takes more time than sorting 100. How do we measure a system’s response time in this case? It’s natural to agree on some standard fixed request and perform a load test: send a lot of such standard requests and measure the response time of each. The distribution of response times we get at the end of the performance test usually resembles a Log Normal distribution.

Response time’s tail latency

So, even though we’ve sent tens of thousands of identical requests to the exact same software, we can split them into 3 very different groups based on their response time:

  1. A small group of requests that were processed really quickly. Let’s say it’s the top 5% of requests with the lowest response time.
  2. The largest group: requests whose response time is close to the median value. Let’s put all requests with a response time between the 5th and the 95th percentile into this group. According to the distribution diagram above, 95% of requests have a response time of 71 milliseconds or less.
  3. A small group of requests with extremely high processing time, the rightmost part of the diagram. It’s the top 5% of requests with the longest response. All requests with a response time greater than 71 ms fall into this group. That doesn’t sound like much compared to the 29 ms median value, but the pitfall of this group is that the response time will most likely be way higher than the median: 120 ms, 140 ms or even more. This group forms a long “tail” of requests with very long response times. This tail of the distribution is what we call tail latency.
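
To make these numbers concrete, here is a minimal sketch that reproduces the three-way split. It assumes a Log Normal distribution fitted to a 29 ms median and a 71 ms 95th percentile; the parameters are illustrative, not taken from a real load test.

```python
import math
import random
import statistics

# Illustrative Log Normal parameters: median 29 ms, 95th percentile 71 ms
# (the z-score of the 95th percentile is about 1.645).
mu = math.log(29)
sigma = math.log(71 / 29) / 1.645

random.seed(1)
samples = [random.lognormvariate(mu, sigma) for _ in range(50_000)]

cuts = statistics.quantiles(samples, n=100)      # 99 percentile cut points
p5, p50, p95 = cuts[4], cuts[49], cuts[94]
tail = [s for s in samples if s > p95]           # the slowest 5% of requests

print(f"p5 = {p5:.0f} ms, median = {p50:.0f} ms, p95 = {p95:.0f} ms")
print(f"average response time inside the 5% tail: {statistics.mean(tail):.0f} ms")
```

The average inside the tail comes out far above the median, which is exactly why the tail dominates the perceived worst-case behaviour.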

Why does the response time differ so much? Processing identical requests should take the same amount of time. The additional time is added to the response time not because of the nature of the processed request, but because of the software, hardware or environment. This occasional extra time we wait to receive the software’s output is called latency. What is the nature of such delays?

First of all, requests don’t arrive uniformly. For example, we may get 10 simultaneous requests, then a pause, then 3 more requests, while we have only 2 physical CPUs. Processing of the first 2 requests starts immediately, while the other requests have to wait for the next free CPU time slot. In effect, our software has a buffer where requests wait for their processing to start, and the more requests wait in that buffer, the longer their response time. Usually this part is hidden inside the web server implementation.

Unpredictable pick-up time from the queue
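
A tiny sketch of this buffering effect follows. The arrival times, the 2 CPUs and the 20 ms of work per request are arbitrary assumptions, not measurements.

```python
import heapq

SERVICE_MS = 20                      # assumed CPU work per request
workers = [0.0, 0.0]                 # next-free time of each of the 2 CPUs
arrivals = [0] * 10 + [200] * 3      # a burst of 10 requests, a pause, then 3 more

for arrived in arrivals:
    free_at = heapq.heappop(workers)       # the CPU that becomes free first
    started = max(arrived, free_at)        # wait in the buffer if both CPUs are busy
    finished = started + SERVICE_MS
    heapq.heappush(workers, finished)
    print(f"arrived at {arrived:3.0f} ms, waited {started - arrived:3.0f} ms, "
          f"response time {finished - arrived:3.0f} ms")
```

Identical requests and identical work, yet the response time ranges from 20 ms to 100 ms purely because of queueing.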

Applications running on the same server share CPUs. For example, we may have a log collector deployed to the same server as our web server. The CPU scheduler splits CPU time into small discrete slots and distributes them between all processes, so the web server competes for CPU with the application that parses its logs. An application may also implicitly spawn auxiliary threads, like garbage collector threads in Java. In this case the auxiliary threads consume CPU resources that could otherwise be spent on request processing.

Of course, these gaps in the available CPU time do not make the operations we need to perform slower, but they do affect how much wall-clock time the whole request processing takes.

Unstable CPU time allocation

Besides the latency introduced by the operating system CPU scheduler, we may have additional delays because we need to wait to ensure data consistency. Consider a payment system that processes bank card withdrawals. If two withdrawal requests for the same card arrive at the payment processing system simultaneously, they still have to be processed sequentially to prevent a negative card balance. Naturally, the second request’s response time is going to be longer than the first one’s, as the second request waits in a queue for the completion of the first. Admittedly, this delay comes from the nature of the request, but the design decisions we make determine how frequent and how long such delays are. You may read more about the costs of strict consistency in the previous article.
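
A minimal sketch of this serialization: one lock per card, so the second withdrawal for the same card has to wait for the first one. The card id, the 30 ms of simulated processing and the in-memory balances dict are assumptions made for illustration.

```python
import asyncio
from collections import defaultdict

card_locks = defaultdict(asyncio.Lock)   # one lock per card number
balances = {"card-1": 100}

async def withdraw(card, amount):
    async with card_locks[card]:         # a second request for the same card waits here
        balance = balances[card]
        await asyncio.sleep(0.03)        # simulated processing / IO, ~30 ms
        if balance >= amount:
            balances[card] = balance - amount
            return "ok"
        return "declined"

async def main():
    # Two simultaneous withdrawals for the same card are processed sequentially.
    results = await asyncio.gather(withdraw("card-1", 80), withdraw("card-1", 80))
    print(results, balances)             # ['ok', 'declined'] {'card-1': 20}

asyncio.run(main())
```

The second request’s response time includes the full processing time of the first one: a consistency-driven delay rather than a software or hardware one.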

Response time in distributed systems

Now let’s go back to the response time distribution. In practice, we may reduce that diagram to a simple statement: there’s a 0.05 probability that serving our request takes more than 71 ms, possibly significantly more.

Let’s look at how a typical web application deployment looks. There are 3 components: a public cloud load balancer, the web application itself and a database used to store the web application data.

Typical monolithic application

Each component contributes its own part to the overall system response time. For the sake of simplicity, let’s assume that we measure the response time distribution for each component in isolation and that it’s the same for all of them. The shape of the distribution is similar to the one at the beginning of the article: the median is 29 ms, the 95th percentile is 71 ms. As mentioned earlier, latency comes from the nature of our software and hardware. That means we may treat the events “a component serves its request in 71 ms or more” as independent. Let’s use the Binomial distribution to model the probability that a client of the system gets a long response time. The number of “trials” in our case is 3, and the probability of “success” is 0.05.

Num of successes    0        1        2        3         1-3
Probability         0.857    0.135    0.007    0.0001    0.143
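
This table can be reproduced with a few lines of Python, a sketch of the same Binomial model with p = 0.05 and 3 trials:

```python
from math import comb

def tail_hit_probability(trials, p=0.05):
    """Probability mass function and P(at least one slow call) for `trials`
    independent calls, each landing in the slow 5% with probability p."""
    pmf = [comb(trials, k) * p**k * (1 - p)**(trials - k) for k in range(trials + 1)]
    return pmf, 1 - pmf[0]

pmf, at_least_one = tail_hit_probability(3)
print([round(x, 4) for x in pmf])    # [0.8574, 0.1354, 0.0071, 0.0001]
print(round(at_least_one, 3))        # 0.143
```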

So, for a simple monolithic architecture there’s almost a 15% chance that the response time is affected by tail latency. Imagine a chain of sequential calls in the application on the diagram below:

  1. The load balancer adds its median value of 29 ms to the overall system response time.
  2. The application adds its median value of 29 ms to the overall system response time.
  3. And the database adds its 95th percentile value of 71 ms. The total is 129 ms, and most of it comes from the database tail latency! For a short chain of calls, the impact of a single long intermediate response on the final response time can be greater than the impact of all the other intermediate calls combined.

Now let’s try to apply the same method to a microservice architecture. On the diagram below we have a pretty small application that consists of 3 microservices: one is stateless and the other 2 store their state in databases. There are also 3 load balancers, one in front of each microservice.

Typical microservices application

A typical call to microservice “A” requires two additional network interactions with service “B” and service “C”, which are executed in parallel. So now we have eight components, each of which can introduce at least 71 ms of latency with a probability of 0.05. What is the probability that a single user request is affected by tail latency in this case? Let’s use the Binomial distribution again. This time the number of “trials” is 8, and the probability of “success” is still 0.05.

Num of successes    0        1        2        3        1-8
Probability         0.663    0.279    0.051    0.005    0.337
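
(These figures come from the same tail_hit_probability sketch shown earlier, called with trials=8.)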

Now there’s a 33% chance of getting a longer response time somewhere during the request chain execution. The trend is obvious: the more components we have in a call chain, the more likely the overall system response time is affected by tail latencies. How critical is this, and how does it affect the response time? The next diagram illustrates the difference in response time between 2 systems. Both systems consist of 8 components, and all components within a system have the same response time distribution. For the first system the component response time median is 30 milliseconds versus 29 ms in the second system; the component 99th percentile is 69 ms for the first system, while for the second it’s just 60 ms.

Modeling: 8 component systems response time comparison

Whole system response time, ms    50th percentile    80th percentile    90th percentile    95th percentile    99th percentile
First system                      277                328                360                394                475
Second system                     267                299                323                351                419
Difference, ms                    10 (3.6%)          29 (8.9%)          37 (10.3%)         43 (10.9%)         56 (11.8%)
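
A Monte Carlo sketch of this comparison is below. It assumes each request’s response time is simply the sum of 8 independent Log Normal component times fitted to the stated medians and 99th percentiles; it reproduces the trend of the table above rather than its exact numbers.

```python
import math
import random
import statistics

def simulate_system(median_ms, p99_ms, components=8, requests=50_000):
    # Fit a Log Normal to the component median and 99th percentile (z(0.99) ≈ 2.326),
    # then model a request as the sum of `components` independent samples.
    mu = math.log(median_ms)
    sigma = math.log(p99_ms / median_ms) / 2.326
    return [sum(random.lognormvariate(mu, sigma) for _ in range(components))
            for _ in range(requests)]

random.seed(7)
for name, median_ms, p99_ms in [("first", 30, 69), ("second", 29, 60)]:
    cuts = statistics.quantiles(simulate_system(median_ms, p99_ms), n=100)
    # Whole-system 50th, 80th, 90th, 95th and 99th percentiles, in ms
    print(name, [round(cuts[i]) for i in (49, 79, 89, 94, 98)])
```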

A relatively small difference in the 99th percentiles of the components’ response time makes a significant impact on the length of the system’s “tail latency”. To see this clearly, we have to ask what fraction of requests have a response time above 330 ms in each 8-component system. The empirical CDF plot of the two distributions highlights this difference.

Empirical response time CDFs

The difference in the share of requests with a response time greater than 330 ms is about 10 percentage points between the two 8-component systems. What used to be the 90th percentile becomes the 80th one when the tail latency of individual components grows by only ~10 ms.
In other words, the tail becomes twice as long.

So far we have discussed a very simplified model, but it’s enough to demonstrate the origins of tail latencies and how they affect a distributed system’s response time. In real life there are additional factors that affect the shape of the response time distribution (the list is not exhaustive):

  1. Rare heavy requests naturally take more time to process and increase the response time of all the smaller ones being processed at the same time.
  2. Concurrent requests may try to access the same data, so we may get lock contention and additional delays.
  3. The network response time distribution is also Log Normal.
  4. Your application is deployed into a Kubernetes cluster, which runs on top of virtual machines, which are hosted on physical servers shared with somebody else, so you have a very vague picture of what is happening around your app and which resources, on what schedule, are actually available to it.

How to improve response time

Now that we have uncovered the roots of “tail latency”, let’s try to describe possible solutions. The first and most obvious one is to ask ourselves honestly: how low and how stable does the system response time really need to be? Isn’t it enough to guarantee that the 80th/90th/95th percentiles are low and stable, perform all the cheap optimizations to make the 99th less scary (but with no guarantees!), and just forget about the one percent of the slowest requests? This may sound like poor engineering, but to make a business decision we have to consider such factors as:

  1. Infrastructure provisioning costs
  2. Infrastructure maintenance complexity ( …and costs!)
  3. Software development complexity ( …and costs!)
  4. Software development time (time is money, costs again!)
  5. Usage of very specific hardware and software optimization features and the possible vendor lock-in (and associated costs!).

And maybe there is, in fact, a way to get good enough software for a reasonable price.

Now let’s try to figure out what can decrease tail latency. There are two major strategies, each associated with multiple tactics. The first strategy is to fix tail latencies inside individual components. The second is to isolate or reduce the negative impact of a single component’s tail latency on the whole system.

Let’s talk about the first group of tactics. First of all, if we are in the distributed systems world, we should employ asynchronous IO and the associated approaches to request processing. For example, in the Java Spring world that means WebFlux; other ecosystems, such as NodeJS, use the asynchronous approach by default. The older approach, where there is a separate thread for each incoming request, leads to a lot of threads constantly competing with each other. It is not only less efficient than asynchronous IO, but it also leads to a very high variance of the response time and long tails of slow responses.
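
As an illustration of the event-loop style, here is a Python asyncio sketch (rather than the WebFlux or NodeJS examples mentioned above); the handler and its 50 ms of simulated downstream IO are made up.

```python
import asyncio

async def handle(request_id):
    # Non-blocking wait for a downstream call: the event loop serves other
    # requests while this one is parked, and no thread is held hostage.
    await asyncio.sleep(0.05)            # simulated 50 ms of downstream IO
    return f"request {request_id} done"

async def main():
    # 1000 concurrent requests multiplexed on a single thread.
    results = await asyncio.gather(*(handle(i) for i in range(1000)))
    print(len(results), "responses")

asyncio.run(main())
```

A thread-per-request server would need a thousand threads for the same load, and their scheduling jitter is part of what widens the response time distribution.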

Second, we should try to use as few additional threads as possible. To make CPU scheduling even more predictable, we may try to allocate some CPUs exclusively to our application; for example, Kubernetes supports this via the static CPU management policy.

The third thing is minimizing the mutual waiting of requests. There are multiple things to consider here. First of all, we need to design the data model in a way that requires fewer locks to ensure consistency. A good question to ask is: do we need strict consistency across the whole system, or can we replace it with eventual consistency somewhere? Could we employ optimistic locks instead of pessimistic ones? Non-blocking data structures should be used for in-memory caches and metric counters.
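
For example, an optimistic lock can be sketched with a version column and a conditional UPDATE. The accounts table and the use of SQLite here are illustrative assumptions, not a prescription.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('card-1', 100, 0)")

def withdraw(amount, retries=3):
    # Optimistic locking: no row lock is held while we decide what to do; the
    # UPDATE only takes effect if nobody changed the row since we read it.
    for _ in range(retries):
        balance, version = conn.execute(
            "SELECT balance, version FROM accounts WHERE id = 'card-1'").fetchone()
        if balance < amount:
            return "declined"
        cur = conn.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = 'card-1' AND version = ?", (balance - amount, version))
        if cur.rowcount == 1:            # nobody interfered, our write went through
            conn.commit()
            return "ok"
        # We lost the race: another writer bumped the version. Re-read and retry.
    return "retry limit exceeded"

print(withdraw(80), withdraw(80))        # ok declined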

The good news about the first strategy is that by reducing the things that add random delays to our response time, we not only fix tail latencies but also make the whole distribution narrower. Often these tactics improve the median response time as well.

Now let’s talk about how to minimize the impact of tail latencies at the system architecture level. First of all, it’s worth double-checking whether there are any real restrictions that force us to make synchronous calls between services (even with fast asynchronous IO) instead of employing asynchronous message-based communication. Consider the following example: we want to send events to an analytics system each time a user finishes a business transaction inside our web application. There’s no need to call the analytics service synchronously, because it doesn’t affect the result of the business transaction; it’s a side effect that’s fully hidden from the end user. Where there’s no synchronous communication, latency is not a concern anymore.
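
A sketch of that fire-and-forget pattern: the request handler only enqueues the event and responds, while a background task ships it to the analytics system. The in-process queue and the simulated 200 ms analytics call are assumptions; in practice the queue is usually an external message broker.

```python
import asyncio

async def handle_purchase(order_id, analytics_queue):
    # ... the business transaction completes here ...
    analytics_queue.put_nowait({"event": "purchase", "order": order_id})
    return "ok"                          # respond without waiting on analytics

async def ship_analytics(analytics_queue):
    while True:
        event = await analytics_queue.get()
        await asyncio.sleep(0.2)         # simulated slow analytics call, off the request path
        print("shipped", event)

async def main():
    queue = asyncio.Queue()
    asyncio.create_task(ship_analytics(queue))
    print(await handle_purchase(1, queue))   # returns immediately
    await asyncio.sleep(0.5)                 # give the background shipper time to drain

asyncio.run(main())
```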

If synchronous communication is still required, the following tactics may help:

  1. Try to keep the number of remote calls between microservices reasonably small.
  2. Try different load balancing strategies. The more uniform the distribution of active requests across (micro-)service instances, the more predictable the response time.
  3. Introduce separate replicas/instances for heavy requests. That way, heavy requests that naturally take a long time to complete do not affect the response time of the light ones.
  4. Employ request hedging (a sketch follows after this list). The idea is simple, although a naive implementation can cause major performance penalties: send several identical copies of the request to different service instances, and continue processing as soon as the first (fastest) response is received.
  5. Scale out early. Configure auto-scaling to add service instances when CPU usage is around 60%. CPU consumption usually rises when the incoming request queue is constantly full; if the system scales early, it doesn’t wait until the queue of requests waiting for processing becomes huge.
  6. Replace remote calls with local caches. Keep in mind that this introduces a cache-consistency problem you will have to think about.
  7. Ignore slow responses. Consider the following case: there’s an API, and we want to limit the number of calls per user per minute while keeping the API response time as low as possible. We introduce a Rate Limit Service and check each request against the restrictions before continuing its processing. Since we want to keep the API response time low, we may simply allow the request if the Rate Limit Service does not respond within 100 milliseconds.
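
A naive sketch of request hedging in asyncio. The call_instance coroutine and its latencies are made up; real implementations usually send the hedge only after a short delay (for example, once the 95th percentile has elapsed) to avoid doubling the load.

```python
import asyncio
import random

async def call_instance(name):
    # Hypothetical downstream call: most replies are fast, a few hit the tail.
    await asyncio.sleep(random.choice([0.03, 0.03, 0.03, 0.3]))
    return f"response from {name}"

async def hedged_call(instances):
    # Send the same request to several instances and take the first reply.
    tasks = [asyncio.create_task(call_instance(name)) for name in instances]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                    # the slower duplicates are abandoned
    return done.pop().result()

print(asyncio.run(hedged_call(["replica-1", "replica-2"])))
```

The same “don’t wait forever” idea underlies tactic 7: wrap the rate-limit check in a timeout (asyncio.wait_for, in the terms of this sketch) and let the request through if the check doesn’t answer within 100 ms.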

Notes and references

More about distributions of the response time

In my opinion, both the Erlang and the Log Normal distributions can be used to model response time. The Erlang distribution better explains the process under the hood (a sum of multiple independent, exponentially distributed delays), but the Log Normal distribution’s shape is something people usually recognize.


Usually it’s acceptable to have a very high 99th or 99.99th percentile of response time, and usually there are no hard restrictions on the maximum response time. Still, there’s an exciting world of real-time software that requires low and deterministic response times: for example, software that manages an airplane’s landing. :)

I would appreciate any repost of this article on your social media if you find it interesting or useful, for example on X/Twitter.