And here's how the Meltdown patch impacts storage performance in KVM


By now, you’ve heard of the Spectre and Meltdown exploits that allow nefarious code to steal secrets from protected areas of a computer.  The now-available patch is critical to deploy in shared environments like KVM, but it has a non-trivial impact on storage performance.  This brief provides details and possible mitigations.  Meltdown affects Intel x86-64 CPUs.

The Exploit:  Meltdown code includes instructions that are speculatively executed in the pipeline of modern processors but then discarded because of an incorrect branch prediction.  These instructions include accesses to the contents of protected kernel memory.  Because the instructions are discarded, the page fault (and kernel exception handling) never happens.  However, by the time the incorrect branch is detected, the CPU has already loaded the requested protected memory into the shared cache.  The exploit code then uses a sophisticated "targeted cache side channel attack" to discern the contents of the protected memory now residing in the shared cache.  By scanning all protected memory, Meltdown code gains access to all of the system's data, hence the dramatic choice of the exploit's name.

The Patch:  The exploit works because, when executing user-space code, Linux keeps the kernel's protected memory mapped (via page tables) to improve performance when the app makes system calls or an interrupt is received.  This avoids context-switching overheads (TLB flush, page-table swapping, etc.) when transitioning from user to kernel code.  The now-available Kernel Page Table Isolation (KPTI) patch separates user-space and kernel-space page tables entirely, preventing Meltdown code from loading protected memory into a shared cache.

NVMe Performance: The KPTI patch adds overhead to user-kernel transitions (TLB flush, page-table swapping, etc.).  For apps or containers running on bare-metal Linux, every NVMe access takes several context switches: (i) when the app makes a syscall to the kernel to initiate a read/write, (ii) when the SSD interrupts the system and (iii) when the kernel wakes the app.  For KVM, there are even more: (i) when the guest app initiates a read through QEMU and (ii) when QEMU injects the completion interrupt.  The following measurements used the FIO performance tool (version 2.18) on a dual-processor Intel Xeon Platinum 8180M (2.50 GHz) system with a single U.2 NVMe SSD.

* 4K random reads or writes with Jobs=4 and QueueDepth=32
** During FIO, a warm buffer cache is used to represent the end of the spectrum where the app has high data locality/reuse
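For readers who want to run a comparable test, here is a minimal sketch that assembles an fio command line matching the stated parameters (4K random reads, 4 jobs, queue depth 32). The I/O engine, runtime, job name and target path are assumptions; the original measurements do not state them. Setting direct=False gives buffered I/O, which is the closest analogue to the warm buffer cache case.

```python
def fio_randread_args(filename, direct=True, ioengine="libaio", runtime_s=60):
    """Build an fio argument list for 4K random reads, 4 jobs, queue depth 32.

    ioengine and runtime_s are assumptions, not taken from the article.
    """
    return [
        "fio",
        "--name=randread-4k",            # hypothetical job name
        f"--filename={filename}",        # target device or file (hypothetical)
        "--rw=randread",                 # random reads
        "--bs=4k",                       # 4K transfers
        "--numjobs=4",                   # Jobs=4
        "--iodepth=32",                  # QueueDepth=32
        f"--ioengine={ioengine}",
        f"--direct={1 if direct else 0}",  # 0 = go through the buffer cache
        "--time_based",
        f"--runtime={runtime_s}",
        "--group_reporting",
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so nothing touches a disk.
    print(" ".join(fio_randread_args("/dev/nvme0n1")))
```

Run the same command on bare metal and inside the guest, with and without the patch, to reproduce the four-way comparison on your own hardware.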

As a reference, for bare-metal Linux, the patch doesn't reduce performance, but it does increase CPU utilization slightly.  The impact is larger for KVM given the additional context switches.  Note that this degradation comes on top of the 77% NVMe performance drop (from 847K to 190K IOPS) when accessing NVMe storage from inside a VM (vs. bare metal).
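The drop quoted above is simple arithmetic on the two IOPS figures. A short sketch, using only the 847 and 190 K IOPS numbers from the text (the function name is just for illustration):

```python
def percent_drop(before, after):
    """Return the relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


bare_metal_kiops = 847  # bare-metal NVMe, from the article
vm_kiops = 190          # same SSD accessed from inside a VM, from the article

drop = percent_drop(bare_metal_kiops, vm_kiops)
print(f"VM vs. bare metal: {drop:.1f}% fewer IOPS")
```

The result lands between 77% and 78%, consistent with the figure quoted in the text.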

The KPTI patch causes the largest drop (33%) for tests using a warm buffer cache, since context switches are a much larger percentage of these operations.  The Linux buffer cache caches disk blocks to optimize block I/O (an app can bypass it by specifying "direct", i.e. O_DIRECT, in a file-open call).  In this test, the buffer cache is part of the guest Linux kernel to represent the scenario where an app is consciously optimizing performance.  We focused on read operations since use of write-back caches is highly app-dependent given the loss of data upon power failure.  As examples, Spark sits on HDFS, which uses the default file options (i.e. cached), but due to its large transfers, the impact is likely small.  Users will likely see the largest impact for structured databases with small I/O transfer sizes and high locality (Postgres, MySQL, etc.).
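To make the cached vs. "direct" distinction concrete, here is a minimal sketch of how an app chooses between the two at file-open time. The helper name is hypothetical. O_DIRECT is Linux-specific, so the code falls back to a plain open where the flag is not defined.

```python
import os


def open_flags(bypass_cache):
    """Return read-only open flags, optionally bypassing the buffer cache."""
    flags = os.O_RDONLY
    if bypass_cache:
        # O_DIRECT only exists on some platforms (e.g. Linux); use 0 elsewhere.
        flags |= getattr(os, "O_DIRECT", 0)
    return flags


def open_for_read(path, bypass_cache=False):
    """Open `path` for reading and return the file descriptor.

    Note: reads on an O_DIRECT descriptor generally need aligned buffers,
    offsets and lengths, and some filesystems reject the flag entirely.
    """
    return os.open(path, open_flags(bypass_cache))
```

The default (bypass_cache=False) is the cached path that the warm buffer cache test exercises.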

For users who are sensitive to NVMe performance in a KVM environment, and for whom buffer caches aren't appropriate, the tests show that NVMe solutions with SRIOV (single-root I/O virtualization) support provide close to bare-metal NVMe performance, unaffected by the KPTI patch.  SRIOV is a technology that bypasses the hypervisor and avoids the extra context switches.

Potential Mitigations: Obviously, the impact is app-dependent, and you should do your own characterizations, but in general, the more I/O-bound the app, the higher the impact.  Possible mitigations: (a) Determine if you really need the patch.  Linux provides a kernel boot parameter to disable it (for example, pti=off), but this isn’t advised in a shared environment with security risks, (b) Reduce the number of I/O operations by combining transfers, if your app or framework allows, (c) Consider use of an NVMe solution with support for SRIOV, for example, the Attala Composable Storage Infrastructure.
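Before characterizing anything, it helps to confirm whether the patch is actually active on the host or guest you are measuring. Recent Linux kernels report this in sysfs. The sketch below reads that file and classifies the result. The function names are illustrative, and older kernels without the file simply report "unknown".

```python
from pathlib import Path

# Exposed by Linux kernels that include the vulnerability reporting interface.
MELTDOWN_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")


def classify(status):
    """Map the kernel's status string to a coarse category."""
    text = status.strip()
    if text.startswith("Mitigation"):    # e.g. "Mitigation: PTI"
        return "mitigated"
    if text.startswith("Not affected"):
        return "not affected"
    if text.startswith("Vulnerable"):
        return "vulnerable"
    return "unknown"


def meltdown_status(path=MELTDOWN_SYSFS):
    """Return the Meltdown mitigation category for this machine."""
    try:
        return classify(path.read_text())
    except OSError:
        # File missing (older kernel, non-Linux OS) or unreadable.
        return "unknown"


if __name__ == "__main__":
    print(meltdown_status())
```

Record this status alongside each benchmark run so patched and unpatched results are never mixed up.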



Taufik Ma