Tuesday, April 4, 2017

3 Million Storage IOPS on AWS Cloud Instance

I3 Instance Family
NVMe Technology
Linux Block Layer
I3 Storage Benchmark Results

AWS I3 Instance Family

AWS has always been at the forefront of adopting new and advanced Intel technologies in its fleet of cloud instances. The introduction of the next-generation I3 instance family continues this tradition by offering low latency and high performance IO for both storage and network. I3 instances come with direct-attached NVMe (Non-Volatile Memory Express) SSD storage. With no virtualization overhead and direct access to NVMe storage, I3 instances can achieve an unprecedented 15 GB/s of read storage throughput and over 3 million read IOPS. Direct Memory Access (DMA) to storage keeps latencies low (< 100 us) even under moderate IO load.
The NVMe SR-IOV extension allows splitting storage drives across VMs (instances). Instead of the Xen virtualized split-driver model, which is prone to higher latencies, the cloud instance runs the native nvme storage driver to access a subset of PCI resources on the physical PCI IO board. Data transfer between driver and hardware is handled via a low latency DMA path that does not require hypervisor intervention. Also, Intel VT-d support for re-mapping device DMA accesses and device-generated interrupts helps the cloud provider isolate and partition IO resources and assign them to a specific cloud instance without compromising the integrity of the underlying hardware.
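One way to see this from inside a running I3 instance is to confirm that the guest uses the native nvme driver and sees the SSDs as PCI NVMe devices rather than Xen para-virtual block devices. A quick check, assuming typical device names and the nvme-cli package (both illustrative, not specific to this setup):

    # NVMe controllers are visible as PCI devices inside the guest
    lspci | grep -i 'non-volatile memory'

    # Instance-store SSDs appear as native NVMe block devices
    # (Xen para-virtual disks would show up as /dev/xvd* instead)
    ls -l /dev/nvme*

    # nvme-cli can list controllers and namespaces
    sudo nvme list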
Other noteworthy I3 instance family features are:

  • Advanced Intel Broadwell processors
  • Support for up to 64 vCPUs, 488 GB of DDR4-based memory, and 15 TB of NVMe local storage
  • SR-IOV based networking using the Elastic Network Adapter (ENA) offers 20 Gbps of network throughput and over 2 million packets per second for low latency networking
  • EBS-optimized instance

NVMe Technology and Features

The NVMe protocol supports multiple hardware queues, an advancement over traditional SAS and SATA protocols. Typical SAS devices support up to 256 commands, and SATA devices up to 32 commands, in a single hardware queue. NVMe supports 64K commands per queue and up to 64K queues per device. NVMe queues are designed so that I/O commands and their completions operate on the same processor core, taking advantage of warm CPU caches, data locality, and the parallel processing capabilities of multi-core processors. Each application or thread running on a CPU gets a separate queue bound to that CPU, and with Linux block layer multi-queue support, no I/O locking is required to process IO. NVMe also supports MSI-X and interrupt steering to distribute interrupt processing across multiple CPUs, which improves scalability.

In addition, NVMe uses a simple command set that takes half the number of CPU instructions to process an I/O request compared to SAS or SATA, providing higher IOPS per CPU cycle and lower I/O latency.
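The queue-related capabilities a given controller actually advertises can be queried from the instance with nvme-cli. A rough sketch, assuming a controller at /dev/nvme0 (the device name is illustrative):

    # "Number of Queues" feature (feature id 0x07) reports how many I/O
    # submission/completion queue pairs the controller allows
    sudo nvme get-feature /dev/nvme0 -f 0x07

    # Identify Controller data: model, firmware, supported queue entry sizes, etc.
    sudo nvme id-ctrl /dev/nvme0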


Linux Block Layer

When an application issues IO requests, the Linux block layer moves the requests from per-CPU submission queues into hardware queues, up to the maximum number specified by the driver. NVMe devices support multiple queues (AWS sets a different number of hardware queues for each I3 instance type):
Instance Type | Hardware Queues per Device | Linux Software Queues | nvme Devices | Capacity | Total HW Queues
i3.xl         | 4                          | 4                     | 1            | 0.8 TB   | 4
i3.2xl        | 8                          | 8                     | 1            | 1.7 TB   | 8
i3.4xl        | 16                         | 16                    | 2            | 3.5 TB   | 32
i3.8xl        | 16                         | 32                    | 4            | 6.9 TB   | 64
i3.16xl       | 31                         | 64                    | 8            | 13.8 TB  | 248
With the multi-queue (blk-mq) feature of the block layer, it is now possible to run the entire submission and completion path for an IO on the same CPU where the process scheduled the I/O, maximizing cache locality and performance. The nvme driver was updated in Linux kernel version 3.19 to use the multi-queue feature. The blk-mq implementation, completed in recent kernel versions (3.19 and above), improves Linux block layer scalability to as much as 15 million IOPS, enough to accommodate today's and future high performance NVMe devices. NVMe uses this optimized block layer path, which reduces software overhead by over 50%: roughly 6.0 us (19,500 cycles) per IO for SCSI/SAS versus 2.8 us (9,100 cycles) for NVMe. Before blk-mq, the block layer had a single request queue per device protected by a spinlock, resulting in higher contention and lower scalability. The blk-mq design splits the request queue into two levels:
  • A set of per-CPU software queues. Each CPU submits IO operations into its own queue, with no interaction with or locking against other CPUs.
  • One or more hardware queues managed by the driver.
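On a blk-mq enabled kernel, both the hardware queue count and the CPU-to-queue mapping are visible in sysfs. A quick look, assuming a device named nvme0n1 (illustrative):

    # Number of hardware dispatch queues the nvme driver created for this device
    ls /sys/block/nvme0n1/mq | wc -l

    # Which CPUs submit to which hardware queue
    for q in /sys/block/nvme0n1/mq/*; do
        echo "$(basename $q) -> CPUs: $(cat $q/cpu_list)"
    done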

I3 NVMe Storage Benchmark Results

I3 instance setup:
Instance Type | Hardware Queues per Device | nvme Devices | Capacity | Total HW Queues
i3.xl         | 4                          | 1            | 0.8 TB   | 4
i3.2xl        | 8                          | 1            | 1.7 TB   | 8
i3.4xl        | 16                         | 2            | 3.5 TB   | 32
i3.8xl        | 16                         | 4            | 6.9 TB   | 64
i3.16xl       | 31                         | 8            | 13.8 TB  | 248
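The exact benchmark job files are not included here; numbers like the ones below are typically collected with fio running direct IO against the raw NVMe devices. A representative sketch, assuming fio is installed and /dev/nvme0n1 is one of the instance-store devices (device name and job parameters are illustrative):

    # Random 4k reads for IOPS (direct IO, multiple jobs to exercise all queues)
    sudo fio --name=randread --filename=/dev/nvme0n1 --rw=randread --bs=4k \
        --direct=1 --ioengine=libaio --iodepth=32 --numjobs=16 \
        --group_reporting --runtime=60 --time_based

    # Sequential 128k reads for throughput
    sudo fio --name=seqread --filename=/dev/nvme0n1 --rw=read --bs=128k \
        --direct=1 --ioengine=libaio --iodepth=32 --numjobs=8 \
        --group_reporting --runtime=60 --time_based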

I3 NVMe Storage IOPS

Instance Type | Access Pattern | Block Size | Read IOPS
i3.xl         | random         | 4k         | 205k
i3.2xl        | random         | 4k         | 413k
i3.4xl        | random         | 4k         | 830k
i3.8xl        | random         | 4k         | 1.65M
i3.16xl       | random         | 4k         | 3.3M

I3 NVMe Storage Throughput

Instance Type | Access Pattern | Block Size | Read Throughput (MB/s)
i3.xl         | sequential     | 128k       | 980
i3.2xl        | sequential     | 128k       | 1910
i3.4xl        | sequential     | 128k       | 3814
i3.8xl        | sequential     | 128k       | 7641
i3.16xl       | sequential     | 128k       | 15302

Thanks to Intel hardware virtualization extensions, VT-x (CPU virtualization), EPT (memory/translation-table virtualization), VT-d (I/O virtualization), and SR-IOV, server virtualization has evolved from a slow, software-only solution to an efficient hardware-assisted one. Large chunks of work are now offloaded to hardware, bypassing the hypervisor layer to achieve optimum performance. With reduced virtualization overhead, the performance gap between hypervisor-controlled and bare-metal systems continues to shrink.

Monday, April 3, 2017

Elastic Storage in AWS Cloud

What is Elastic EBS
Elastic EBS Use Cases
Elastic EBS Testing
Elastic EBS Limitations

What is Elastic EBS

Public clouds are known for elasticity of compute resources. For example, an AWS Auto Scaling Group (ASG) can dynamically scale up a compute farm (VMs or cloud instances) to service higher load, or scale down to save cost during off hours. What was missing was dynamic scaling of block storage: online volume expansion, storage tiering, or temporarily boosting a volume's iops and throughput. The recently announced Elastic EBS feature addresses some of these shortcomings. SAN-like features, such as online modification of volume attributes (size, iops and type) without detaching/attaching the volume or restarting the instance, are now possible in the public cloud. No downtime is required!

EBS (Elastic Block Store) is the AWS network storage offering that allows attaching block storage to a running cloud instance in a given Availability Zone (AZ). EBS volumes come in two flavors: solid state disk (SSD) types, io1 and gp2, for improved random iops and lower latencies, and magnetic disk (HDD) types, st1 and sc1, for higher sequential throughput. The Elastic EBS feature makes it possible to mix and match EBS volume capabilities to achieve the best storage performance for a given workload.

Elastic EBS Use Cases

Elastic EBS fits well for the various use cases listed below:
Performance Boost:
  • Temporarily boost IOPS during nightly batch or quarter-end processing
  • Modify io1 or gp2 volumes to st1 volumes, which offer higher sequential throughput, to improve analytics and DSS query performance
Storage Tiers to Save Cost:
  • Modify volume types to save cost. One can build an online storage tier, without impacting performance, using the various EBS types available: io1, gp2, st1 and sc1
On Demand Storage Capacity:
  • Instead of over-provisioning storage, start with fixed-size EBS volumes for all instances in your cluster. Monitor EBS volume usage and raise capacity when it reaches 80-90%. This may save cost, as capacity increases are performed on an as-needed basis.
Reduced Administration Overhead:
  • EBS storage saves a great deal of administration work compared to ephemeral (direct-attached) storage. EBS is persistent and thus does not require the additional data protection steps, such as copying or refreshing data on instance launch, that ephemeral storage does.
  • The Elastic EBS feature makes EBS even more attractive, as it takes the size consideration out of the capacity planning decision. Both the volume and the file system can be expanded online without requiring an instance reboot.

Elastic EBS Testing

Two new EBS API calls, ModifyVolume and DescribeVolumesModifications, were introduced to support the Elastic Volume feature. The AWS console, awscli or direct API calls can be used to modify a volume. Completion status can then be polled via CloudWatch metrics or via the DescribeVolumesModifications API to trigger follow-up actions. For example, the file system can be expanded online once the volume change completion notification is received.
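With awscli these APIs map to aws ec2 modify-volume and aws ec2 describe-volumes-modifications. A minimal sketch of requesting a change and waiting for it to finish; the volume id is illustrative, and wait_for_modification is a hypothetical helper reused in the test sketches below:

    VOL=vol-0123456789abcdef0      # illustrative volume id

    # Request a change (example: raise provisioned iops on an io1 volume)
    aws ec2 modify-volume --volume-id "$VOL" --volume-type io1 --iops 10000

    # Poll DescribeVolumesModifications until the change leaves the "modifying" state
    wait_for_modification() {
        local vol=$1 state
        while true; do
            state=$(aws ec2 describe-volumes-modifications --volume-ids "$vol" \
                --query 'VolumesModifications[0].ModificationState' --output text)
            echo "modification state: $state"
            if [ "$state" = "optimizing" ] || [ "$state" = "completed" ]; then
                break
            fi
            sleep 30
        done
    }
    wait_for_modification "$VOL"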

Modify Volume --iops Test


Provisioned iops on an io1 volume can be modified (increased or decreased) dynamically. The modification request takes a few minutes to complete, and the volume continues to perform at its original iops level until the modification is finished.

To test the feature, I created and attached a 500 GB EBS io1 volume. The DescribeVolumesModifications API is used to poll volume changes. Iops are increased by 2000 in every iteration, while the fio benchmark tool runs in the background performing reads via the direct IO path, measuring storage iops, throughput and latency. See the bash script used for testing below.
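The original script is not reproduced here; a simplified sketch of the same loop, reusing the hypothetical wait_for_modification helper above (volume id, device name and iops values are illustrative):

    VOL=vol-0123456789abcdef0      # illustrative 500 GB io1 volume
    DEV=/dev/xvdf                  # illustrative device name for the attached EBS volume

    # Background read load via direct IO
    sudo fio --name=readload --filename=$DEV --rw=randread --bs=4k --direct=1 \
        --ioengine=libaio --iodepth=32 --runtime=3600 --time_based &

    # Bump provisioned iops by 2000 each iteration and wait for the change to apply
    for iops in 3000 5000 7000 9000; do
        aws ec2 modify-volume --volume-id "$VOL" --volume-type io1 --iops "$iops"
        wait_for_modification "$VOL"
        echo "volume now provisioned at $iops iops"
        sleep 300                  # observe fio iops/latency at the new level
    done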




Modify Volume --size Test
Created and attached a 500 GB EBS gp2 volume. The DescribeVolumesModifications API is used to poll volume modification status. The gp2 volume is expanded by an additional 200 GB in every test iteration, and xfs_growfs is invoked to grow the xfs file system to use the additional space. An fio read IO load on the file system runs in an endless loop, measuring storage iops, throughput and latency. See the bash script used for testing below:
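A sketch of this test, again reusing the hypothetical wait_for_modification helper (volume id, mount point and sizes are illustrative):

    VOL=vol-0123456789abcdef0      # illustrative 500 GB gp2 volume
    MNT=/data                      # illustrative xfs mount point on that volume

    # Grow the volume by 200 GB per iteration, then grow the xfs file system online
    for size in 700 900 1100 1300; do
        aws ec2 modify-volume --volume-id "$VOL" --size "$size"
        wait_for_modification "$VOL"
        sudo xfs_growfs "$MNT"     # expand xfs to use the new space
        df -h "$MNT"
    done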




Modify Volume --type Test
Created and attached a 3 TB sc1 EBS volume. The DescribeVolumesModifications API is used to poll modification status. The 3 TB volume is dynamically changed from one EBS volume type to another in a loop:
sc1->st1 | st1->gp2 | gp2->io1 | io1->gp2 | gp2->st1 | st1->sc1
An fio read IO load via direct IO runs in an endless loop, measuring storage iops, throughput and latency. See the bash script used for testing below:
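A sketch of the type-change loop, reusing the hypothetical wait_for_modification helper (volume id and device name are illustrative; io1 additionally requires an iops value):

    VOL=vol-0123456789abcdef0      # illustrative 3 TB volume, starting as sc1
    DEV=/dev/xvdf                  # illustrative device name for the attached volume

    # Background sequential read load via direct IO
    sudo fio --name=readload --filename=$DEV --rw=read --bs=128k --direct=1 \
        --ioengine=libaio --iodepth=32 --runtime=7200 --time_based &

    # Walk the volume through sc1->st1->gp2->io1->gp2->st1->sc1
    for type in st1 gp2 io1 gp2 st1 sc1; do
        if [ "$type" = "io1" ]; then
            aws ec2 modify-volume --volume-id "$VOL" --volume-type io1 --iops 10000
        else
            aws ec2 modify-volume --volume-id "$VOL" --volume-type "$type"
        fi
        wait_for_modification "$VOL"
        echo "volume is now type $type"
    done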




Elastic EBS Limitations

  • The Elastic EBS feature is supported only on the current generation of AWS instances.
  • There is a limit of one modification request per volume every 6 hours.
  • A volume modification request takes from a few seconds to tens of minutes to complete.
  • The minimum charge is for 6 hours once the volume modification is completed.
  • The modify volume option "iops" applies to "io1" (provisioned iops) EBS volumes only.
  • The modify volume option "size" can only be used to increase volume capacity. Shrinking a volume is not supported!
  • Both ext4 and xfs support growing the file system online.
    • File systems cannot be grown online if the EBS device is under Linux MD and the RAID volume type is RAID-0; Linux MD supports growing RAID-1 (mirror) and RAID-5 (parity) volumes only.
    • File systems can be grown online if the EBS device is under LVM (Logical Volume Manager) control.
    • File systems can be grown online on a drbd device as long as the drbd device is backed by LVM.
  • The modify options "iops" and "type" (unlike "size") are transparent to the instance and thus can be applied to all types of configurations (raw block, MD, LVM, drbd, etc.).
  • The AWS IAM instance profile needs additional permissions to call the ModifyVolume and DescribeVolumesModifications APIs from the instance to modify a volume online (see the policy sketch after this list).
    • One can use the scripts provided in the script section of this document to modify a volume's iops, size and type.
    • One can use the AWS Lambda function provided by AWS to set a "maintenance" tag on all volumes requiring size changes, and then a script running on the instance to resize all targeted volumes.
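A minimal example of granting those permissions to the role behind the instance profile (the role name, policy name and broad Resource are illustrative; tighten as needed):

    aws iam put-role-policy --role-name my-instance-role \
        --policy-name allow-elastic-ebs \
        --policy-document '{
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": [
                    "ec2:ModifyVolume",
                    "ec2:DescribeVolumesModifications",
                    "ec2:DescribeVolumes"
                ],
                "Resource": "*"
            }]
        }'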


EBS continues to evolve to match SAN-like features in the public cloud. We expect more SAN features to be added to the growing feature list in the future, such as thin provisioning, deduplication, LUN-level snapshots or mirroring, and multi-attach volumes for clustering applications like Oracle RAC or clustered file systems.

NOTE: EBS does support a snapshot feature, but the snapshot is saved to S3. When restored, data is copied to the EBS volume as it is accessed (lazy copy). Due to this lazy copy from S3 to EBS, IO latencies jump from 2-5 ms to 100 ms until all blocks are copied from S3 to EBS. A LUN-level snapshot copies the data directly to the destination LUN.