Tuesday, April 4, 2017

3 Million Storage IOPS on AWS Cloud Instance

I3 Instance Family
NVMe Technology
Linux Block Layer
I3 Storage Benchmark Results

AWS I3 Instance Family

AWS has always been at the forefront of adopting new and advanced Intel technologies into its fleet of cloud instances. The introduction of the next-generation I3 instance family continues this tradition, offering low-latency, high-performance I/O for both storage and network. I3 instances come with direct-attached NVMe (Non-Volatile Memory Express) SSD storage. With no virtualization overhead and direct access to the NVMe storage, I3 instances can achieve an unprecedented 15 GB/s of read storage throughput and over 3 million read IOPS. Direct Memory Access (DMA) to storage keeps latencies low (< 100 us) even under moderate IO load.
The NVMe SR-IOV extension allows the storage drives to be split across VMs (instances). Instead of using the Xen virtualized split-driver model, which is prone to higher latencies, the cloud instance runs the native nvme storage driver to access a subset of the PCI resources on a physical PCI IO board. Data transfer between the driver and the hardware is handled via a low-latency DMA path that does not require hypervisor intervention. In addition, Intel VT-d support for remapping device DMA accesses and device-generated interrupts helps the cloud provider isolate and partition IO resources and assign them to a specific cloud instance without compromising the integrity of the underlying hardware.
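
The native driver attachment can be checked from inside the instance. The following minimal sketch (assuming a Linux guest with sysfs mounted at /sys and an nvme driver recent enough to expose the model attribute) lists the NVMe controllers and the PCI driver that claimed each one:

    #!/usr/bin/env python3
    # Sketch: list the NVMe controllers visible to the instance and show that
    # each is bound directly to the native "nvme" PCI driver (no Xen blkfront).
    # Assumes sysfs at /sys and a driver that exposes the "model" attribute.
    import glob, os

    def read_attr(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return "n/a"

    for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
        name = os.path.basename(ctrl)
        # The controller's "device" symlink points at the underlying PCI function;
        # that PCI device's "driver" symlink names the kernel driver that claimed it.
        pci_dev = os.path.realpath(os.path.join(ctrl, "device"))
        driver_link = os.path.join(pci_dev, "driver")
        driver = os.path.basename(os.path.realpath(driver_link)) if os.path.exists(driver_link) else "unknown"
        print(f"{name}: model={read_attr(os.path.join(ctrl, 'model'))} "
              f"pci={os.path.basename(pci_dev)} driver={driver}")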
Other noteworthy I3 instance family features are:

  • Advanced Intel Broadwell processors
  • Up to 64 vCPUs, 488 GB of DDR4 memory, and 15 TB of local NVMe storage
  • SR-IOV based networking using the Elastic Network Adapter (ENA), offering 20 Gbps of network throughput and over 2 million packets per second for low-latency networking
  • EBS-optimized instance

NVMe Technology and Features

The NVMe protocol supports multiple hardware queues, an advancement over traditional SAS and SATA protocols. A typical SAS device supports up to 256 commands, and a SATA device up to 32 commands, in a single hardware queue. NVMe supports 64K commands per queue and up to 64K queues per device. NVMe queues are designed so that I/O commands and their completions are processed on the same processor core, taking advantage of warm CPU caches and data locality as well as the parallel processing capabilities of multi-core processors. Each application or thread running on a CPU gets a separate queue bound to that CPU, and with Linux block layer multi-queue support, no I/O locking is required to process IO. NVMe also supports MSI-X and interrupt steering to distribute interrupt processing across multiple CPUs, which improves scalability.
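
To illustrate the interrupt-steering point, the sketch below tallies how many NVMe interrupts each CPU has serviced by parsing /proc/interrupts; it assumes the kernel labels the NVMe queue vectors with an "nvme" prefix (e.g. nvme0q1), which is the usual convention but can vary by kernel version:

    #!/usr/bin/env python3
    # Sketch: summarize how NVMe MSI-X interrupt vectors are distributed across
    # CPUs by parsing /proc/interrupts. Assumes the kernel names the queue
    # vectors with an "nvme" prefix (e.g. nvme0q1); naming varies by kernel.
    from collections import defaultdict

    per_cpu = defaultdict(int)    # CPU index -> nvme interrupts serviced
    vectors = 0

    with open("/proc/interrupts") as f:
        cpus = f.readline().split()                  # header row: CPU0 CPU1 ...
        for line in f:
            if "nvme" not in line:
                continue
            vectors += 1
            counts = line.split()[1:1 + len(cpus)]   # one count column per CPU
            for cpu, count in enumerate(counts):
                if count.isdigit():
                    per_cpu[cpu] += int(count)

    print(f"nvme interrupt vectors: {vectors}")
    for cpu in sorted(per_cpu):
        print(f"  CPU{cpu}: {per_cpu[cpu]} interrupts")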

In addition, NVMe uses a simple command set that takes half the number of CPU instructions to process an I/O request compared to SAS or SATA, providing higher IOPS per CPU cycle and lower I/O latency.
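
A rough way to sanity-check the per-I/O latency claims is to time individual 4k direct reads against a local NVMe namespace. The sketch below is an assumption-laden probe, not the measurement methodology used here: /dev/nvme0n1 is a hypothetical device path, it needs root and Python 3.7+ (for os.preadv), and a single-threaded timing loop will not reproduce the IOPS figures:

    #!/usr/bin/env python3
    # Sketch: time individual 4k O_DIRECT random reads against a local NVMe
    # namespace. Requires root and Python 3.7+ (os.preadv). The device path is
    # a hypothetical example; reads are read-only but verify the path first.
    import mmap, os, random, statistics, time

    DEVICE = "/dev/nvme0n1"   # hypothetical: first local NVMe namespace
    BLOCK = 4096
    SAMPLES = 1000

    fd = os.open(DEVICE, os.O_RDONLY | os.O_DIRECT)
    size = os.lseek(fd, 0, os.SEEK_END)     # device size in bytes
    buf = mmap.mmap(-1, BLOCK)              # page-aligned buffer, required by O_DIRECT

    lat_us = []
    for _ in range(SAMPLES):
        offset = random.randrange(size // BLOCK) * BLOCK
        start = time.perf_counter()
        os.preadv(fd, [buf], offset)        # one 4k read straight from the device
        lat_us.append((time.perf_counter() - start) * 1e6)

    os.close(fd)
    print(f"median 4k read latency: {statistics.median(lat_us):.1f} us")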


Linux Block Layer

When an application issues IO requests, the Linux block layer moves them from per-cpu submission queues into hardware queues, up to the maximum number specified by the driver. NVMe devices support multiple queues, and AWS configures a different number of hardware queues for each I3 instance type, as shown in the table below.
Instance Type   Hardware Queues per Device   Linux Software Queues   Number of nvme Devices   Capacity   Total HW Queues
i3.xl           4                            4                       1                        0.8 TB     4
i3.2xl          8                            8                       1                        1.7 TB     8
i3.4xl          16                           16                      2                        3.5 TB     32
i3.8xl          16                           32                      4                        6.9 TB     64
i3.16xl         31                           64                      8                        13.8 TB    248
With the multi-queue (blk-mq) feature of the block layer, the entire submission and completion path for an I/O can run on the same CPU where the process scheduled the I/O, maximizing cache locality and performance. The nvme driver was converted to this multi-queue framework in Linux kernel 3.19. The blk-mq implementation (kernel 3.19 and above) improves Linux block layer scalability to a ceiling of roughly 15 million IOPS, enough to accommodate today's and tomorrow's high-performance NVMe devices. NVMe uses this optimized block layer path, which cuts software overhead by more than 50%: about 6.0 us (19,500 cycles) per I/O for SCSI/SAS versus 2.8 us (9,100 cycles) for NVMe. Before blk-mq, the block layer had a single request queue per device protected by a spinlock, resulting in higher contention and lower scalability. Blk-mq splits the request queue into two levels:
  • A number of separate per-cpu software queues. Each cpu submits IO operations into its own queue, with no interaction or locking requirements with other cpus.
  • One or more hardware queues managed by the driver (the sysfs sketch after this list shows how the per-cpu queues map onto them).
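
The per-cpu to hardware queue mapping that blk-mq sets up can be inspected from sysfs. Below is a minimal sketch; it assumes a blk-mq enabled kernel that exposes an mq directory with a cpu_list attribute per hardware context, which is the layout on the 3.19+ kernels discussed here (attribute names may differ on other versions):

    #!/usr/bin/env python3
    # Sketch: print the blk-mq hardware-queue to CPU mapping for each local NVMe
    # namespace. Assumes /sys/block/<dev>/mq/<hctx>/cpu_list is present, as on
    # the blk-mq (3.19+) kernels discussed above.
    import glob, os

    for dev in sorted(glob.glob("/sys/block/nvme*")):
        name = os.path.basename(dev)
        hctxs = sorted(glob.glob(os.path.join(dev, "mq", "*")),
                       key=lambda p: int(os.path.basename(p)))
        print(f"{name}: {len(hctxs)} hardware queue(s)")
        for hctx in hctxs:
            try:
                with open(os.path.join(hctx, "cpu_list")) as f:
                    cpus = f.read().strip()
            except OSError:
                cpus = "n/a"
            print(f"  hw queue {os.path.basename(hctx)} <- CPUs {cpus}")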

I3 NVMe Storage Benchmark Results

I3 instance setup:
Instance Type   Hardware Queues per Device   Number of nvme Devices   Capacity   Total HW Queues
i3.xl           4                            1                        0.8 TB     4
i3.2xl          8                            1                        1.7 TB     8
i3.4xl          16                           2                        3.5 TB     32
i3.8xl          16                           4                        6.9 TB     64
i3.16xl         31                           8                        13.8 TB    248
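
The post does not state which tool produced the numbers below; fio is a common choice, and the sketch that follows shows one plausible way to drive a comparable 4k random-read load with it. The device path, job count, queue depth, and runtime are illustrative assumptions, not the exact settings behind the reported results. A sequential throughput run would swap in --rw=read and --bs=128k.

    #!/usr/bin/env python3
    # Sketch: drive a 4k random-read load against a local NVMe device with fio.
    # Device path, iodepth, numjobs and runtime are illustrative assumptions and
    # not necessarily the settings behind the results in the tables below.
    # Reads only, but double-check the device path; raw access needs root.
    import shutil, subprocess, sys

    DEVICE = "/dev/nvme0n1"   # hypothetical: first local NVMe namespace

    if shutil.which("fio") is None:
        sys.exit("fio is not installed")

    cmd = [
        "fio",
        "--name=randread",
        f"--filename={DEVICE}",
        "--rw=randread",        # 4k random reads, matching the IOPS tests
        "--bs=4k",
        "--direct=1",           # bypass the page cache
        "--ioengine=libaio",
        "--iodepth=32",
        "--numjobs=16",         # scale with the instance's vCPU count
        "--time_based",
        "--runtime=60",
        "--group_reporting",
    ]
    subprocess.run(cmd, check=True)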

I3 NVMe Storage IOPS

Instance Type   Access Pattern   Block Size   Read IOPS
i3.xl           random           4k           205K
i3.2xl          random           4k           413K
i3.4xl          random           4k           830K
i3.8xl          random           4k           1.65M
i3.16xl         random           4k           3.3M

I3 NVMe Storage Throughput

Instance Type   Access Pattern   Block Size   Read Throughput (MB/s)
i3.xl           sequential       128k         980
i3.2xl          sequential       128k         1910
i3.4xl          sequential       128k         3814
i3.8xl          sequential       128k         7641
i3.16xl         sequential       128k         15302

Thanks to Intel hardware virtualization extensions, VT-x (CPU virtualization), EPT (memory virtualization via nested page tables), VT-d (I/O virtualization), and SR-IOV, server virtualization has evolved from a slow, software-only solution into an efficient hardware-assisted one. Large chunks of the work are now offloaded to hardware, bypassing the hypervisor layer to achieve optimum performance. With reduced virtualization overhead, the performance gap between hypervisor-controlled and bare-metal systems continues to shrink.
