What CPU context switches and cache pollution are and how they impact performance

September 2024 · 7 minute read

What time-sharing is

The first Intel multicore CPU, the Pentium D, was released in 2005; however, operating systems became multitasking much earlier. Already in the 1960s, multiple users could interact with the IBM 709 mainframe at the same time through separate terminals, even though it had only a single CPU. The trick was to allocate short slices of CPU time to each running process in a loop. This approach created an illusion of parallel process execution.

[Figure: How CPU time-sharing works]

The time-sharing principle has become one of the cornerstones of all modern operating systems. Because of it, we're able to listen to music, write code in the IDE of our choice, and receive Slack messages at the same time. While CPU time-sharing makes our desktop PCs responsive, it may bring undesired performance penalties in server applications. These penalties are especially tangible if the application processes a lot of data or handles a high number of concurrent requests.

What context switch is

From programming fundamentals we know that data is stored in variables, which represent regions of RAM, while the CPU modifies the variables' values according to a sequence of instructions called a program. For any running program, we may consider its state to be the values of all its variables plus the pointer to the currently executed instruction. So, if we want to share the CPU on a time basis between multiple running programs, say A and B, we have to develop another program to manage it. Such a program should:

  1. Run a piece of the instructions from program A
  2. Remember A's last executed instruction
  3. Run a piece of program B
  4. Remember B's last executed instruction
  5. Resume program A from the place where it was stopped…
  6. … and so on in a loop.

This is what an operating system’s process scheduler does.
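
To make the idea concrete, here is a toy sketch of such a loop (my own illustration, not how a real kernel scheduler is implemented): each task remembers the position where it stopped, and the loop hands out fixed-size slices in turn.

```c
#include <stdio.h>

/* Toy round-robin scheduler loop: each "task" keeps the position where it
 * stopped, and the loop gives every task a fixed slice of steps in turn.
 * Real schedulers are far more elaborate (priorities, preemption by timer
 * interrupts, run queues), but the core idea is the same. */
typedef struct {
    const char *name;
    int next_step;      /* the "last executed instruction" + 1 */
    int total_steps;
} task_t;

#define TIME_SLICE 2

int main(void) {
    task_t tasks[] = { { "A", 0, 5 }, { "B", 0, 5 } };
    int unfinished = 2;

    while (unfinished > 0) {
        unfinished = 0;
        for (int t = 0; t < 2; t++) {            /* give each task a slice */
            task_t *task = &tasks[t];
            for (int s = 0; s < TIME_SLICE && task->next_step < task->total_steps; s++) {
                printf("running %s, step %d\n", task->name, task->next_step);
                task->next_step++;               /* remember where we stopped */
            }
            if (task->next_step < task->total_steps)
                unfinished++;
        }
    }
    return 0;
}
```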

The CPU has a small amount of built-in memory called CPU registers. This is the smallest and fastest computer memory. Some registers have special purposes. For example, the instruction pointer register holds the address of the next instruction that should be executed by the CPU.
Other registers can be used to store a program's variables. It may take up to 100 times longer to access a value stored in RAM than a value stored in a register, so the compiler may decide to keep some frequently accessed variables directly in CPU registers and write their values back to RAM once these variables are no longer used in computations. When it's time to allocate a time slice to another thread, the process scheduler saves the values of the CPU registers into a special RAM area called the process control block, and then restores the register state of the next thread as it was at the moment that thread was interrupted.

[Figure: How a CPU context switch works]
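
As a rough illustration, here is a heavily simplified C sketch of what a process control block might store and what a context switch conceptually does. Real kernels do this in architecture-specific assembly and save much more state; the names below are made up for the example.

```c
#include <stdint.h>
#include <string.h>

/* Heavily simplified sketch, not real kernel code: a process control block
 * (PCB) stores the register state of a paused thread, and a context switch
 * copies registers out to one PCB and in from another. Real kernels also
 * save flags, segment registers, FPU/SIMD state, and so on. */
typedef struct {
    uint64_t instruction_pointer;    /* where to resume execution */
    uint64_t stack_pointer;          /* top of the thread's stack */
    uint64_t general_purpose[16];    /* rax, rbx, ... on x86-64   */
} pcb_t;

/* Stand-in for the real CPU registers, which live in hardware. */
static pcb_t cpu_registers;

void context_switch(pcb_t *prev, pcb_t *next) {
    /* 1. Save the interrupted thread's registers into its PCB in RAM. */
    memcpy(prev, &cpu_registers, sizeof(pcb_t));
    /* 2. Load the next thread's registers from its PCB. */
    memcpy(&cpu_registers, next, sizeof(pcb_t));
    /* 3. The CPU then continues at next->instruction_pointer. */
}
```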

What CPU cache misses and cache pollution are

Each operation performed by the CPU, including memory reads and writes, takes some time measured in CPU cycles. The wall clock time a single CPU cycle takes depends on the CPU clock rate: the higher the clock rate, the less time one cycle takes. For a CPU running at 2.5 GHz, one CPU cycle takes 0.4 nanoseconds (1 / (2.5 × 10⁹ Hz) = 0.4 × 10⁻⁹ s). There's a huge difference between the time required to access values stored in RAM and values stored in processor caches or registers.

  Operation                          CPU cycles   Time
  Get a value from a CPU register    1            0.4 nanoseconds
  Get a value from the L1 cache      4            1.6 nanoseconds
  Get a value from the L2 cache      10           4 nanoseconds
  Get a value from the L3 cache      40           16 nanoseconds
  Get a value from RAM               100-300      40-120 nanoseconds

The numbers are approximate and vary depending on the exact CPU model, but the order of magnitude stays the same. The decision to serve a value from a CPU cache or from RAM is made mainly by the CPU hardware. When the data stored at a specific RAM address is requested, the CPU first checks whether there's an L1 cache entry for the queried memory address. If the value for the requested address is cached (a cache hit), the data is returned from the cache; otherwise the CPU queries the slower caches (L2, L3) and, finally, RAM (a cache miss).

[Figure: Why a CPU has multiple cache levels]
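
The effect of cache hits versus misses is easy to observe. The micro-benchmark below is my own sketch, not from the original article: it sums the same large array twice, once sequentially (cache- and prefetch-friendly) and once jumping a whole cache line per access, which forces far more traffic from RAM. Exact numbers depend on the machine, but the strided pass is typically several times slower despite doing the same number of additions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   /* 64M ints = 256 MB, far larger than any L3 */
#define STRIDE 16              /* 16 ints = 64 bytes = one cache line       */

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof(int));
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1;

    /* Pass 1: sequential access, roughly one cache miss per 16 reads. */
    double t0 = now_ns();
    long seq = 0;
    for (long i = 0; i < N; i++) seq += a[i];
    double t1 = now_ns();

    /* Pass 2: same number of adds, but every read touches a new cache
     * line, so almost every read has to be served from RAM. */
    long strided = 0;
    for (long s = 0; s < STRIDE; s++)
        for (long i = s; i < N; i += STRIDE) strided += a[i];
    double t2 = now_ns();

    printf("sequential: %.0f ms, strided: %.0f ms (sums: %ld, %ld)\n",
           (t1 - t0) / 1e6, (t2 - t1) / 1e6, seq, strided);
    free(a);
    return 0;
}
```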

Programs can explicitly instruct the CPU and the compiler to bypass caching in some cases. Reads and writes of volatile variables are not cached (for example, in registers) and are propagated straight to RAM to ensure the correctness of concurrent programs. A compiler may also decide to bypass the cache for a write if the variable's value is not used after the modification. This strategy improves the usage efficiency of the CPU caches, which have a very limited capacity.
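
A small C sketch of the volatile effect (assuming a POSIX system; compile with -pthread): without the volatile qualifier, the compiler could keep the flag in a register inside the loop and never notice the other thread's write. Portable concurrent code should use C11 atomics; volatile is shown here only to illustrate bypassing the compiler's register caching.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Without volatile, the compiler would be allowed to read `ready` once,
 * keep it in a register, and spin forever. Volatile forces a fresh read
 * from memory on every loop iteration. */
static volatile int ready = 0;

static void *producer(void *arg) {
    (void)arg;
    sleep(1);        /* simulate some work */
    ready = 1;       /* the write must become visible in memory */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    while (!ready) {
        /* spin, re-reading `ready` from memory each time */
    }
    printf("flag observed\n");
    pthread_join(t, NULL);
    return 0;
}
```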

Most technical decisions are tradeoffs, and time-sharing is no exception. It may cause a significant decrease in CPU cache efficiency due to cache pollution. Consider the case when multiple threads receive their time slices on the same CPU core one by one. Each thread handles its own set of data, which populates the CPU cache and gradually displaces the data of other threads. The next time a thread receives a time slice on the same CPU core, its data is no longer in the cache, so expensive calls to RAM are made. In other words, the very limited CPU cache was polluted with data irrelevant to the computations currently handled by the CPU.

[Figure: How a CPU context switch causes cache pollution]

Rescheduling a thread to a different CPU core is another case where the cache miss rate increases. Usually each CPU core has its own L1 and L2 caches, while the L3 (sometimes L4) cache is shared by all the cores. If a thread receives a time slice on a CPU core different from the one where its previous time slice was allocated, the fast L1 and L2 caches don't contain the required data. Hence, more expensive reads from slower caches or even RAM are made.

[Figure: Changing the CPU core allocated to a thread causes a higher cache miss rate]
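
On Linux you can check which cores share which cache level by reading sysfs. The snippet below is a minimal sketch that prints the sharing lists for cpu0; on a typical desktop CPU, L1 and L2 list only the sibling hyper-threads, while L3 lists all cores.

```c
#include <stdio.h>

/* Minimal Linux-only sketch: for cpu0, print which CPUs share each of its
 * cache levels, as reported by sysfs. index0 and index1 are usually the L1
 * data and instruction caches, index2 is L2, index3 is L3. */
int main(void) {
    for (int idx = 0; idx < 8; idx++) {
        char path[128], level[16] = "", shared[256] = "";
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/level", idx);
        if (!(f = fopen(path, "r"))) break;   /* no more cache levels */
        fscanf(f, "%15s", level);
        fclose(f);

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list", idx);
        if ((f = fopen(path, "r"))) {
            fscanf(f, "%255s", shared);
            fclose(f);
        }
        printf("L%s cache of cpu0 is shared with CPUs: %s\n", level, shared);
    }
    return 0;
}
```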

Costs of context switches and cache pollution

Let's do some napkin math. The cost of the context switch itself is the cost of the data swap between CPU registers and RAM. The exact number of registers and their sizes vary between CPU models. According to the x64 spec, a CPU core has around 100 registers that can store about 1 KB of data in total. So, in the worst-case scenario, a context switch requires ~100 RAM reads and ~100 RAM writes, 120 nanoseconds each. This gives 24,000 nanoseconds (24 microseconds) in total. In practice, various studies (1, 2, 3) report about 1-5 microseconds on average for a single context switch, roughly an order of magnitude less than our worst-case estimate!
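
If you want to measure this on your own machine, a classic trick is to bounce a byte between two processes over a pair of pipes: each round trip forces roughly two context switches. The sketch below gives an upper-bound estimate, since the result also includes pipe syscall overhead; pinning both processes to one core (for example with taskset -c 0) gives a cleaner number.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 100000

int main(void) {
    int ping[2], pong[2];
    char b = 'x';
    if (pipe(ping) || pipe(pong)) { perror("pipe"); return 1; }

    if (fork() == 0) {                  /* child: echo every byte back */
        for (int i = 0; i < ROUNDS; i++) {
            read(ping[0], &b, 1);
            write(pong[1], &b, 1);
        }
        exit(0);
    }

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ROUNDS; i++) {  /* parent: send, wait for echo */
        write(ping[1], &b, 1);
        read(pong[0], &b, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* Each round trip blocks both processes once, i.e. ~2 context switches. */
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("~%.0f ns per context switch (upper bound)\n", ns / (ROUNDS * 2.0));
    return 0;
}
```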

To estimate the cost of cache misses, we need to know how many variables and how much data an application's thread processes on average. In some rare cases, server application requests and responses are smaller than 1 KB. For example, a Redis increment (INCR) command transfers only a few bytes over the network. Most often, the data handled in the scope of a single request (thread) is far larger than 1 KB, especially in enterprise web applications, databases, or data pipelines. Hence, the cost of cache misses is far higher, at least by an order of magnitude, than the cost of context switches, because much more data has to travel between RAM and the CPU than is moved during a context switch. So, while the cost of a context switch itself is not that high, frequent context switches may ruin system performance because of the associated increase in the CPU cache miss rate.

A problem well stated is a problem half solved, so how do we increase the CPU cache hit rate and decrease the cache miss rate? The first obvious thing is to allocate CPU cores exclusively to an application. This is possible both in raw Linux and in Kubernetes environments. The reasoning is that other applications running on the same host or node won't pollute the CPU caches associated with the workload.
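
For example, on Linux a process can pin itself to specific cores with sched_setaffinity; the core numbers below are arbitrary and should be adjusted to the machine. The same effect is achievable without code changes via taskset -c 2,3 ./app, or in Kubernetes by enabling the static CPU manager policy and giving the pod Guaranteed QoS with whole-CPU requests.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);                /* hypothetical core numbers: 2 and 3 */
    CPU_SET(3, &set);

    /* pid 0 means the calling thread; threads spawned later inherit the mask. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to cores 2-3, currently on core %d\n", sched_getcpu());
    /* ... application work ... */
    return 0;
}
```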

The second obvious thing is to limit the number of spawned threads to the number of CPU cores the application can use, or at least to reduce their number. The fewer threads an application has, the lower the probability that one thread's data will overwrite another's in the CPU cache. However, limiting the number of threads may require other programming techniques and frameworks, such as asynchronous IO or a reactive programming framework like Spring WebFlux (in the Java world, of course). Other platforms like Go or Node.js provide non-blocking IO and an asynchronous approach out of the box as the default.
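
A minimal sketch of the "as many threads as cores" idea (assuming POSIX threads; a real server would pair these workers with a task queue and non-blocking IO so the few threads stay busy):

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg) {
    long id = (long)arg;
    printf("worker %ld started\n", id);
    /* ... pull tasks from a shared queue instead of one thread per request ... */
    return NULL;
}

int main(void) {
    long cores = sysconf(_SC_NPROCESSORS_ONLN);   /* number of usable cores */
    pthread_t threads[cores];

    for (long i = 0; i < cores; i++)              /* one worker per core */
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (long i = 0; i < cores; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```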