Tuesday, April 4, 2017

3 Million Storage IOPS on AWS Cloud Instance

I3 Instance Family
NVMe Technology
Linux Block Layer
I3 Storage Benchmark Results

AWS I3 Instance Family

AWS has always been at the forefront of adopting new and advanced Intel technologies into its fleet of cloud instances. The introduction of the next-generation I3 instance family continues this tradition, offering low-latency, high-performance I/O for both storage and network. I3 instances come with direct-attached NVMe (Non-Volatile Memory Express) SSD storage. With no virtualization overhead and direct access to the NVMe storage, I3 instances can achieve an unprecedented 15 GB/s of read storage throughput and over 3 million read IOPS. Direct Memory Access (DMA) to storage keeps latencies low (< 100 us) even under moderate IO load.
The NVMe SR-IOV extension allows the storage drives to be split across VMs (instances). Instead of using the Xen virtualized split-driver model, which is prone to higher latencies, the cloud instance runs the native nvme storage driver to access a subset of the PCI resources on a physical PCI IO board. Data transfer between the driver and the hardware is handled via a low-latency DMA path that does not require hypervisor intervention. In addition, Intel VT-d support for remapping device DMA accesses and device-generated interrupts helps the cloud provider isolate and partition IO resources and assign them to a specific cloud instance without compromising the integrity of the underlying hardware.
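
The native driver attachment can be checked from inside the instance. The following minimal sketch (assuming a Linux guest with sysfs mounted at /sys and an nvme driver recent enough to expose the model attribute) lists the NVMe controllers and the PCI driver that claimed each one:

    #!/usr/bin/env python3
    # Sketch: list the NVMe controllers visible to the instance and show that
    # each is bound directly to the native "nvme" PCI driver (no Xen blkfront).
    # Assumes sysfs at /sys and a driver that exposes the "model" attribute.
    import glob, os

    def read_attr(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return "n/a"

    for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
        name = os.path.basename(ctrl)
        # The controller's "device" symlink points at the underlying PCI function;
        # that PCI device's "driver" symlink names the kernel driver that claimed it.
        pci_dev = os.path.realpath(os.path.join(ctrl, "device"))
        driver_link = os.path.join(pci_dev, "driver")
        driver = os.path.basename(os.path.realpath(driver_link)) if os.path.exists(driver_link) else "unknown"
        print(f"{name}: model={read_attr(os.path.join(ctrl, 'model'))} "
              f"pci={os.path.basename(pci_dev)} driver={driver}")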
Other noteworthy I3 instance family features are:

  • Advanced Intel Broadwell processors
  • Up to 64 vCPUs, 488 GB of DDR4 memory, and 15 TB of local NVMe storage
  • SR-IOV based networking using the Elastic Network Adapter (ENA), offering 20 Gbps of network throughput and over 2 million packets per second for low-latency networking
  • EBS-optimized instance

NVMe Technology and Features

The NVMe protocol supports multiple hardware queues, an advancement over traditional SAS and SATA protocols. A typical SAS device supports up to 256 commands, and a SATA device up to 32 commands, in a single hardware queue. NVMe supports 64K commands per queue and up to 64K queues per device. NVMe queues are designed so that I/O commands and their completions are processed on the same processor core, taking advantage of warm CPU caches and data locality as well as the parallel processing capabilities of multi-core processors. Each application or thread running on a CPU gets a separate queue bound to that CPU, and with Linux block layer multi-queue support, no I/O locking is required to process IO. NVMe also supports MSI-X and interrupt steering to distribute interrupt processing across multiple CPUs, which improves scalability.
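
To illustrate the interrupt-steering point, the sketch below tallies how many NVMe interrupts each CPU has serviced by parsing /proc/interrupts; it assumes the kernel labels the NVMe queue vectors with an "nvme" prefix (e.g. nvme0q1), which is the usual convention but can vary by kernel version:

    #!/usr/bin/env python3
    # Sketch: summarize how NVMe MSI-X interrupt vectors are distributed across
    # CPUs by parsing /proc/interrupts. Assumes the kernel names the queue
    # vectors with an "nvme" prefix (e.g. nvme0q1); naming varies by kernel.
    from collections import defaultdict

    per_cpu = defaultdict(int)    # CPU index -> nvme interrupts serviced
    vectors = 0

    with open("/proc/interrupts") as f:
        cpus = f.readline().split()                  # header row: CPU0 CPU1 ...
        for line in f:
            if "nvme" not in line:
                continue
            vectors += 1
            counts = line.split()[1:1 + len(cpus)]   # one count column per CPU
            for cpu, count in enumerate(counts):
                if count.isdigit():
                    per_cpu[cpu] += int(count)

    print(f"nvme interrupt vectors: {vectors}")
    for cpu in sorted(per_cpu):
        print(f"  CPU{cpu}: {per_cpu[cpu]} interrupts")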

In addition, NVMe uses a simple command set that takes half the number of CPU instructions to process an I/O request compared to SAS or SATA, providing higher IOPS per CPU cycle and lower I/O latency.
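
A rough way to sanity-check the per-I/O latency claims is to time individual 4k direct reads against a local NVMe namespace. The sketch below is an assumption-laden probe, not the measurement methodology used here: /dev/nvme0n1 is a hypothetical device path, it needs root and Python 3.7+ (for os.preadv), and a single-threaded timing loop will not reproduce the IOPS figures:

    #!/usr/bin/env python3
    # Sketch: time individual 4k O_DIRECT random reads against a local NVMe
    # namespace. Requires root and Python 3.7+ (os.preadv). The device path is
    # a hypothetical example; reads are read-only but verify the path first.
    import mmap, os, random, statistics, time

    DEVICE = "/dev/nvme0n1"   # hypothetical: first local NVMe namespace
    BLOCK = 4096
    SAMPLES = 1000

    fd = os.open(DEVICE, os.O_RDONLY | os.O_DIRECT)
    size = os.lseek(fd, 0, os.SEEK_END)     # device size in bytes
    buf = mmap.mmap(-1, BLOCK)              # page-aligned buffer, required by O_DIRECT

    lat_us = []
    for _ in range(SAMPLES):
        offset = random.randrange(size // BLOCK) * BLOCK
        start = time.perf_counter()
        os.preadv(fd, [buf], offset)        # one 4k read straight from the device
        lat_us.append((time.perf_counter() - start) * 1e6)

    os.close(fd)
    print(f"median 4k read latency: {statistics.median(lat_us):.1f} us")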


Linux Block Layer

When an application issues IO requests, the Linux block layer moves them from per-cpu submission queues into hardware queues, up to the maximum number specified by the driver. NVMe devices support multiple queues, and AWS configures a different number of hardware queues for each I3 instance type, as shown in the table below.
Instance Type   Hardware Queues per Device   Linux Software Queues   Number of nvme Devices   Capacity   Total HW Queues
i3.xl           4                            4                       1                        0.8 TB     4
i3.2xl          8                            8                       1                        1.7 TB     8
i3.4xl          16                           16                      2                        3.5 TB     32
i3.8xl          16                           32                      4                        6.9 TB     64
i3.16xl         31                           64                      8                        13.8 TB    248
With the multi-queue (blk-mq) feature of the block layer, the entire submission and completion path for an I/O can run on the same CPU where the process scheduled the I/O, maximizing cache locality and performance. The nvme driver was converted to this multi-queue framework in Linux kernel 3.19. The blk-mq implementation (kernel 3.19 and above) improves Linux block layer scalability to a ceiling of roughly 15 million IOPS, enough to accommodate today's and tomorrow's high-performance NVMe devices. NVMe uses this optimized block layer path, which cuts software overhead by more than 50%: about 6.0 us (19,500 cycles) per I/O for SCSI/SAS versus 2.8 us (9,100 cycles) for NVMe. Before blk-mq, the block layer had a single request queue per device protected by a spinlock, resulting in higher contention and lower scalability. Blk-mq splits the request queue into two levels:
  • A number of separate per-cpu software queues. Each cpu submits IO operations into its own queue, with no interaction or locking requirements with other cpus.
  • One or more hardware queues managed by the driver (the sysfs sketch after this list shows how the per-cpu queues map onto them).
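
The per-cpu to hardware queue mapping that blk-mq sets up can be inspected from sysfs. Below is a minimal sketch; it assumes a blk-mq enabled kernel that exposes an mq directory with a cpu_list attribute per hardware context, which is the layout on the 3.19+ kernels discussed here (attribute names may differ on other versions):

    #!/usr/bin/env python3
    # Sketch: print the blk-mq hardware-queue to CPU mapping for each local NVMe
    # namespace. Assumes /sys/block/<dev>/mq/<hctx>/cpu_list is present, as on
    # the blk-mq (3.19+) kernels discussed above.
    import glob, os

    for dev in sorted(glob.glob("/sys/block/nvme*")):
        name = os.path.basename(dev)
        hctxs = sorted(glob.glob(os.path.join(dev, "mq", "*")),
                       key=lambda p: int(os.path.basename(p)))
        print(f"{name}: {len(hctxs)} hardware queue(s)")
        for hctx in hctxs:
            try:
                with open(os.path.join(hctx, "cpu_list")) as f:
                    cpus = f.read().strip()
            except OSError:
                cpus = "n/a"
            print(f"  hw queue {os.path.basename(hctx)} <- CPUs {cpus}")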

I3 NVMe Storage Benchmark Results

I3 instance setup:
Instance Type   Hardware Queues per Device   Number of nvme Devices   Capacity   Total HW Queues
i3.xl           4                            1                        0.8 TB     4
i3.2xl          8                            1                        1.7 TB     8
i3.4xl          16                           2                        3.5 TB     32
i3.8xl          16                           4                        6.9 TB     64
i3.16xl         31                           8                        13.8 TB    248
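
The post does not state which tool produced the numbers below; fio is a common choice, and the sketch that follows shows one plausible way to drive a comparable 4k random-read load with it. The device path, job count, queue depth, and runtime are illustrative assumptions, not the exact settings behind the reported results. A sequential throughput run would swap in --rw=read and --bs=128k.

    #!/usr/bin/env python3
    # Sketch: drive a 4k random-read load against a local NVMe device with fio.
    # Device path, iodepth, numjobs and runtime are illustrative assumptions and
    # not necessarily the settings behind the results in the tables below.
    # Reads only, but double-check the device path; raw access needs root.
    import shutil, subprocess, sys

    DEVICE = "/dev/nvme0n1"   # hypothetical: first local NVMe namespace

    if shutil.which("fio") is None:
        sys.exit("fio is not installed")

    cmd = [
        "fio",
        "--name=randread",
        f"--filename={DEVICE}",
        "--rw=randread",        # 4k random reads, matching the IOPS tests
        "--bs=4k",
        "--direct=1",           # bypass the page cache
        "--ioengine=libaio",
        "--iodepth=32",
        "--numjobs=16",         # scale with the instance's vCPU count
        "--time_based",
        "--runtime=60",
        "--group_reporting",
    ]
    subprocess.run(cmd, check=True)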

I3 NVMe Storage IOPS

Instance Type   Access Pattern   Block Size   Read IOPS
i3.xl           random           4k           205K
i3.2xl          random           4k           413K
i3.4xl          random           4k           830K
i3.8xl          random           4k           1.65M
i3.16xl         random           4k           3.3M

I3 NVMe Storage Throughput

Instance Type   Access Pattern   Block Size   Read Throughput (MB/s)
i3.xl           sequential       128k         980
i3.2xl          sequential       128k         1910
i3.4xl          sequential       128k         3814
i3.8xl          sequential       128k         7641
i3.16xl         sequential       128k         15302

Thanks to Intel hardware virtualization extensions, VT-x (CPU virtualization), EPT (memory virtualization via nested page tables), VT-d (I/O virtualization), and SR-IOV, server virtualization has evolved from a slow, software-only solution into an efficient hardware-assisted one. Large chunks of the work are now offloaded to hardware, bypassing the hypervisor layer to achieve optimum performance. With reduced virtualization overhead, the performance gap between hypervisor-controlled and bare-metal systems continues to shrink.
