Monday, May 9, 2016

2 Million Packets Per Second on a Public Cloud Instance

Amazon recommends that customers choose VPC, as it provides data-center-like features and performance with the elasticity of a public cloud. One of the key performance differentiators between EC2 Classic and VPC is the Enhanced Networking feature offered on VPC instances, which helps applications achieve high RPS rates thanks to low-latency networking. As Netflix migrates from EC2 Classic to VPC, we have run benchmarks to identify VPC instance limits. Micro-benchmarks run on Amazon (r3, i2, m4) 8xlarge VPC instances reported a tenfold improvement in packet processing rates (2 Mpps) compared to similar EC2 Classic instances, which are limited to about 200 Kpps. Some application benchmarks have also reported 10x higher RPS (requests/sec) in VPC.
Due to early adoption of the public cloud, the majority of Netflix services are still hosted in Amazon EC2 Classic. AWS services have evolved, and EC2 Classic was not built to support features required by the new breed of cloud services. Netflix services that use RPS rates as a metric to scale their ASG (Auto Scaling Group) routinely over-provision compute farms to overcome the packet processing overhead inherent in the Xen split driver model (Figure 1).
Figure 1. Xen split driver model (source: The Definitive Guide to the Xen Hypervisor)
NOTE: EC2 Classic instances and low-end instances in VPC use the Xen split driver model (software), which uses a shared memory ring between the instance and Dom0 (the Xen trusted domain) to exchange packets. This model has higher packet processing overhead than an SR-IOV (hardware) enabled NIC.
Amazon VPC promises much higher pps rates at sub-millisecond latencies, which means fewer instances (cost savings) are needed to meet upstream service demands.

Technology Overview

The Amazon Enhanced Networking feature, built on top of SR-IOV (a PCI-SIG standard), allows an instance direct access to a subset of PCI resources on a physical NIC. Unlike a Xen virtualized driver, an SR-IOV compliant driver running on a cloud instance can use DMA (Direct Memory Access) to the NIC hardware to achieve higher throughput and lower latency. DMA from the device into virtual machine memory does not compromise the safety of the underlying hardware: Intel IO Virtualization Technology (VT-d) supports DMA and interrupt remapping, which restricts the NIC hardware to the subset of physical memory allocated to a particular virtual machine. No hypervisor interaction is needed except for interrupt processing.
SR-IOV Driver and NIC

An EC2 Classic instance can have a single Xen virtualized NIC, whereas a VPC instance can support multiple NICs (ENIs) per instance to help distribute interrupts and network traffic. Each ENI on an AWS instance is assigned a PCI virtual function (VF). Each virtual function (a lightweight version of a PCI physical function, PF) gets a subset of the physical NIC's PCI resources: PCI configuration registers, iomem regions, queue pairs (Rx/Tx queues), and sets of transmit and receive descriptors with DMA capabilities. The NIC driver (ixgbevf) running inside the instance is para-virtualized (meaning, in this context, that the driver is modified to have only limited PCI capabilities). It can transfer packets directly to hardware, but to change the MAC address, reset the device, or perform other operations with global impact, it relies on the physical function (PF) driver running in the privileged domain (Dom0, managed by the cloud provider). Communication between the VF and PF drivers happens via special hardware registers/buffers called the Mailbox.

Each VF has its own PCI resources on the NIC
The Intel PCIe NIC is a multi-queue device. AWS assigns each ENI (VF) two queue pairs (up to a maximum of 16 queue pairs per instance) to distribute network traffic. Each queue pair is pinned to a separate CPU for interrupt and packet processing. The NIC hashes the tuple (srcIP, dstIP, srcPort, dstPort) to decide which Rx queue (two Rx queues per ENI) to use for an incoming flow. Packets from a single flow use the same Rx queue to avoid packet reordering. Each queue pair has its own sets of Tx and Rx descriptors (max: 4096). Each Rx/Tx descriptor in a queue is used to DMA an individual packet from/to the NIC. When all Rx/Tx descriptors are exhausted or in use, the NIC driver flow-controls the network stack; thus a larger number of Rx descriptors can improve pps rates and throughput. The Intel PCIe NIC has an embedded Layer-2 switch that sorts packets based on the destination MAC address or VLAN tag; when a match is found, the packet is forwarded to the appropriate queue pair. The Layer-2 switch also performs a bridging function between VFs (ENIs) in hardware, without hypervisor intervention. Thus multiple instances hosted on the same physical machine can communicate at much lower network latencies than instances on separate physical machines. AWS Placement Groups use this feature to offer the lowest possible network latency between instances.

Benchmark Results

The Ubuntu AMI used for testing has routing tables set so that traffic arriving on an interface goes back out the same interface, and vice versa. That allows network traffic to be distributed across multiple ENIs. Kernel tuning is baked into the AMI to attain optimum performance for varying types of Netflix workloads. Although each ENI attached to an instance has a dedicated DMA path to the physical NIC, the master driver (running in the trusted domain Dom0 and controlled by the cloud provider) has the ability to set throttling limits on throughput and pps rates per instance. When multiple ENIs are configured and stressed, the instance's maximum pps limit is split across them. Tests run on Amazon 8xlarge instances show:

Number of ENI Configured | Max pps rate per ENI | Bi-directional pps rate per ENI
1 | 2.4 Mpps | 1.2 Mpps
2 | 1.2 Mpps | 600 Kpps
4 | 600 Kpps | 300 Kpps
8 | 300 Kpps | 150 Kpps
Note: Amazon does not publish maximum PPS rates per instance. In our testing we found that ~2.4 Mpps can be achieved on 8xlarge instances. Smaller instances (4xlarge) are throttled at ~1 Mpps.

Micro Benchmark Results
Microbenchmarking tools pktgen and iperf were used to test the NIC hardware's and driver's ability to process small packets. The server NIC is flooded to measure maximum PPS rates. The iperf test was run with a 68-byte MTU to generate small packets. Test results show that Amazon 8xlarge instance types can process packets at 2 Mpps in receive (Rx) or transmit (Tx), and over 1 Mpps for bidirectional traffic.
Micro Benchmark | Network PPS Rates
iperf TCP with 68 bytes MTU | 2 Mpps
pktgen UDP test | 1.6 Mpps

Application Benchmark Results
Webserver Test:
The nginx web server supports the socket option SO_REUSEPORT for better concurrency, as it reduces contention among multiple server processes/threads accepting connections. A benchmark run on a VPC 8xlarge instance reported maximum RPS rates of over 1 million on a single instance, with a 90th percentile latency of 2-5 ms. That is 10x the RPS of an EC2 Classic instance, which is limited to only 85 Kpps. At a rate of 1 million HTTP requests/responses per second on the VPC instance, the underlying network reached its maximum limits and could not push more web traffic. Eight clients were used to generate HTTP traffic concurrently, using the wrk utility against a single nginx web server.
Instance Type | RPS Rates (Clients) | Web Server Latency | Web Server PPS Rates
VPC Instance | 1M+ (8 clients) | 1-4 ms | 1.2 Mpps (receive), 1.2 Mpps (transmit); Total: 2.4 Mpps
Classic Instance | kept low: higher load causes the server to become unresponsive over the network | 1 ms | 85 Kpps (receive), 85 Kpps (transmit); Total: 170 Kpps
Graph: web server test in VPC
Graph: web server test in Classic

Note: Each dot in the graph represents a single iteration of the test.
Memcached Test:
mcblaster, an open source memcached client, is used to generate load on the memcached server. The memcache benchmark reported 300K RPS (gets/sec) at low 1-10 ms latencies on the VPC instance, compared to 85-90K RPS on EC2 Classic. The EC2 Classic network maxes out at 90-100K pps and the instance becomes unresponsive over the network. In comparison, a VPC instance with SR-IOV can be pushed to much higher pps rates without inducing higher latencies.
Instance Type | RPS Rates (Clients) | memcache max Latency
VPC Instance | 300K | 30 ms
Classic Instance | 85-90K | 95 ms
Graph: memcache test in VPC
Graph: memcache latency distribution in VPC
Note: Each dot represents a single iteration of the test. Each test ran for 10 seconds at RPS rates of 100K-300K.
Memcached scalability is limited due to higher contention in the memcached code, as reported by Linux perf.

We were still able to push more load onto the VPC instance even when memcached was exhibiting higher latencies. The NIC driver on the memcached server instance continued to process incoming packets at 1.8 Mpps, but transmitted at a lower rate of 600 Kpps due to the overloaded memcached.

Linux Network Stack

The Linux network stack can scale to high pps rates with proper kernel tuning and the following features enabled:
  • The RPS/RFS network stack features help distribute network stack processing across multiple CPUs, which reduces latencies, especially on NUMA servers. During our tests we enabled RFS only.
  • When the NIC driver supports multi-send (bulk packet transmission), the network stack can queue multiple packets (skbs) to the NIC driver in a single hand-off for delivery.
  • Modern NICs support multiple Rx/Tx hardware queues, where each queue is assigned a dedicated CPU for interrupt processing. Receive traffic is distributed across multiple Rx queues, utilizing the NIC's full potential.
  • NIC drivers can process multiple packets per interrupt, using a combination of software and hardware features (NAPI and hardware interrupt mitigation) to reduce interrupt processing overhead.
  • A benefit of configuring multiple ENIs per instance is that it distributes network interrupt and packet processing across a larger set of CPUs. Multiple ENIs can also be used to segregate a service's network traffic to improve visibility.
Having all these features available helps improve instance scalability by engaging more CPUs for network processing.

PPS tests offer better insight into network stack efficiency than throughput tests. One can optimize throughput by tuning the TCP window size, using larger payloads, and utilizing Ethernet jumbo frames. Small-packet (PPS) tests instead help estimate the cost, or latency, associated with processing each packet. The smallest Ethernet frame that can be sent over the wire is 64 bytes + overhead = 84 bytes. To keep the NIC busy at 2 Mpps, per-packet processing latency should stay below 500 ns (0.5 us), including full stack processing (Java, JVM, libc, network stack, and NIC driver). System calls such as recvmmsg/sendmmsg (note the extra "m") and readv/writev can move multiple packets per system call to reduce system call overhead.
PPS rate | Latency per packet (1 sec / pps)
16 Mpps | 63 ns
880 Kpps | ~1.1 us
150 Kpps | ~6.7 us
NOTE: pps rate = tput / frame size  |  Latency per packet = 1 sec / pps rate. Smallest Ethernet frame: MAC header + minimum payload + CRC = 14 + 46 + 4 = 64 bytes. Additional overhead: inter-frame gap (IFG or IPG) + MAC preamble = 12 + 8 = 20 bytes.

Server virtualization has evolved from a software-only to a hardware-assisted solution. A large chunk of the computation work is now offloaded to hardware, bypassing the hypervisor. IO virtualization solutions like SR-IOV, available on public cloud instances, can help accelerate both storage and network performance of latency-sensitive workloads. Applications with high concurrency, running on a well-tuned kernel, are now able to service millions of requests on a single public cloud instance.


Intel SR-IOV Driver Companion Guide
Intel Virtualization Technology for Directed I/O
Abyss, open source software, is used to automate benchmark execution, metrics collection, and graph generation.
Linux Kernel Tunables applied to AMI 

Microbenchmark setup

nginx webserver benchmark setup

memcached benchmark setup

Amer Ather | Netflix Performance Engineering