Linux Performance in Cloud: May 2016

Amazon recommends customers to choose VPC as it provides data center like features and performance with the elasticity of a public cloud. One of the key performance differentiator between EC2 Classic and VPC is the Enhanced Networking feature offered on VPC instances, that helps applications achieve high RPS rates due to low latency networking. As Netflix makes the migration from EC2 classic to VPC, we have executed benchmarks to identify VPC instance limits. Micro benchmarks ran on Amazon (r3,i2,m4).8xlarge VPC instances reported 10 fold (2 Mpps) improvement in packet processing rates as compared to similar EC2 classic instance, which is limited to 200 kpps. Some application benchmarks have also reported 10x higher RPS (Request/sec) rates in VPC.
Due to early adoption of public cloud, the majority of Netflix services are still hosted on Amazon EC2 Classic cloud. AWS services have evolved and EC2 classic was not built to support features required by new breed of cloud services. Netflix services, that use RPS rates as a metric to scale the ASG (Auto Scaling Group), routinely over provision compute farms to overcome packet processing overhead inherent in Xen split driver model (Figure 1).

Figure1. Definitive Guide to the Xen Hypervisor

NOTE: EC2 classic instances and low end instances in VPC use Xen split driver model (software) that uses shared memory ring between instance and Dom0 (Xen trusted domain) to exchange packets. This model has a higher packet processing overhead than SR-IOV (hardware) enabled NIC.

Amazon VPC promises much higher pps rates at a sub millisecond latencies and that means fewer instances (cost saving) are needed to meet upstream service demands.

Technology Overview

Amazon Enhanced Networking feature, built on top of SR-IOV (PCI-SIG standard) technology, allows instance to have a direct access to subset of PCI resources on a physical NIC. Unlike Xen virtualized driver, SR-IOV compliant driver running on a cloud instance can DMA (Direct Memory Access) to NIC hardware to achieve higher throughput and lower latency. DMA operation from the device to Virtual Machine memory does not compromise the safety of underlying hardware. Intel IO Virtualization Technology (vt-d) supports DMA and interrupt remapping and that restricts the NIC hardware to subset of physical memory allocated for a particular Virtual Machine. No hypervisor interaction is needed except for interrupt processing.

SR-IOV Driver and NIC

EC2 classic instance can have a single Xen Virtualized NIC, whereas, VPC instance can support multiple NIC (ENI) per instance to help distribute interrupts and network traffic. ENI on AWS instance is assigned a pci virtual function (VF). Each virtual function (lightweight version of pci physical function PF) gets subset of physical NIC pci resources such as: pci configuration registers, iomem regions, queue pairs (Rx/Tx queues), set of Transmit and Receive Descriptors with DMA capabilities. NIC driver (ixgbevf) running inside the instance is para-virtualized (para-virtualized in this context means that driver is modified to only have limited pci capabilities). It can transfer packets directly to hardware, but to change MAC address, device reset, or perform instructions that have global impact, it relies on Physical Function (PF) driver running in privileged domain (Dom0, managed by cloud provider). Communication between VF and PF drivers happens via special hardware registers/buffers, called Mailbox.

Each VF has it own pci resources on NIC

Intel PCIe NIC is a multi-queue device. AWS assigns each ENI (or VF) two queue pair ( up to a maximum of 16 queue pairs per instance) to distribute network traffic. Each queue pair is pinned to a separate CPU for interrupt and packet processing. NIC hashes on tuple (srcIP, dstIP, srcPort, dstPort) to decide which Rx queue (two Rx queues per ENI) to use for incoming flow. Packets from a single flow uses the same Rx queue to avoid packet reordering. Each queue pair has its own sets of Tx and Rx descriptors (max: 4096). Each Rx/Tx descriptors in queue is used to DMA individual packet from/to the NIC. When all Rx/Tx descriptors are exhausted or in use, NIC driver flows control the network stack. Thus larger number of Rx descriptors can improve pps rates and throughput. Intel PCIe NIC has an embedded Layer-2 switch that sorts packets based upon the destination MAC address or vlan tag. When match is found, it is forwarded to the appropriate queue pair. Layer-2 switch also performs bridging function between VF (ENI) in hardware without hypervisor intervention. Thus multiple instances hosted on the same physical machine can communicate at a much lower network latencies due to bridging feature of NIC, as compared to across physical machines. AWS Placement Groups use this feature to offer lowest possible network latency between instances.

Benchmark Results

Ubuntu AMI used for testing has routing tables set to route traffic coming in an interface to go back out the same interface and vice versa. That allows network traffic to be distributed across multiple ENI. Kernel tuning is baked into the AMI to attain optimum performance for varying type of Netflix workloads. Although each ENI attached to an instance has a dedicated DMA path to a physical NIC, the master driver (running in trusted domain Dom0 and controlled by cloud provider) has ability to set throttling limits on throughput and pps rates per instance. When multiple ENI are configured and stressed, instance maximum pps limit is split across multiple ENI. Tests ran on Amazon 8xlarge instances show:

Number of ENI Configured	Max pps rate per ENI	Bi-directional pps rates per ENI
1	2.4 Mpps	1.2 Mpps
2	1.2 Mpps	600 Mpps
4	600 Kpps	300 Kpps
8	300 Kpps	150 Kpps

Note: Amazon does not comment on maximum PPS rates per instance. We found in our testing that ~2.4 Mpps rates can be achieved on 8xlarge instances. Smaller instances (4xlarge) are throttled at ~ 1Mpps

Micro Benchmark Results
Microbenchmarking tools, pktgen and iperf, are used to test NIC hardware and driver capability to process small packets. Server NIC is flooded to measure maximum PPS rates. Iperf test was run with 68 Bytes MTU to generate small packets. Test results show Amazon instance types: x8large are can process packets at 2 Mpps PPS rates in receive (Rx) or transmit (Tx) and over 1 Mpps for bidirectional traffic.

Micro Benchmark	Network PPS Rates	Protocol
iperf	2 Mpps	TCP
pktgen	1.6 Mpps	UDP

iperf TCP with 68 bytes MTU

pktgen UDP test

Application Benchmark Results
Webserver Test:
Nginx web server supports socket option SO_REUSEPORT for better concurrency as it reduces contention among multiple server process/threads accepting connections. Benchmark ran on VPC 8xlarge instance reported maximum RPS rates of over 1 Million on a single instance with 90th percentile latency of 2-5 ms. That is 10x more RPS than EC2 classic instance. EC2 class instance is limited to only 85 kpps. At 1 Million http request/response rate on VPC instance, underlying network reached its maximum limits and thus unable to push more web traffic. Eight clients were used to generate http traffic concurrently using wrk utility against a single nginx web server.

Instance Type	RPS Rates (Clients)	Web Server Latency	Web Server PPS Rates
r3.8xl VPC Instance	1.1M	1- 4 ms	1.2 Mpps (receive) 1.2 Mpss (transmit) Total: 2.4 Mpps
R3.8xl Classic Instance	85K	1 ms Higher load causes server to become unresponsive over the network. Thus kept the Network load low	85 kpps (receive) 85 kpps (transmit) Total: 170 Kpps

web server Test in VPC

web server Test in Classic

Note: Each dot in the graph represents a single iteration of the test.
Memcached Test:
mcblaster, open source memcached client, is used to generate load on memcached server. memcache benchmark reported 300K RPS (gets/sec) rates at low 1-10 ms latencies on VPC instance as compared to 85-90K RPS rates on EC2 Classic. EC2 classic network maxes out at 90-100k pps rates and become unresponsive over the network. In comparison, VPC instance with SR-IOV can be pushed to much higher pps rates, without inducing higher latencies.

Instance Type	RPS Rates (Clients)	memcache max Latency	% requests completed in < 10 ms
VPC Instance	300K	30 ms	99%
Classic Instance	100K	95 ms	78%

memcache Test in VPC

memcache Latency distribution in VPC

Note: Each dot represents a single iteration of the test. Each test ran for 10 seconds at RPS rates of 100-300k
Memcached scalability is limited due to higher contention in memcache code as reported by Linux perf.

We were still able to push more load on VPC instance even when memcached was exhibiting higher latencies. NIC driver on memcached server instance continue to process incoming packets at 1.8 Mpps but transmitted at a lower rates of 600 kpps due to overloaded memcached.

Linux Network Stack

Linux network stack can scale to high pps rates with proper kernel tuning and having the following features enabled:

RPS/RFS network stack feature helps distribute network stack processing across multiple cpus and that reduces latencies, especially on numa servers. During our test we enabled RFS only.
When NIC driver supports multi-send or bulk packet transmission feature, network stack can queue multiple packets (skb) to NIC driver when passing it for deliver.
Modern NIC supports multiple Rx/Tx hardware queues where each queue is assigned a dedicated cpu for interrupt processing. Receive traffic is distributed across multiple Rx queues and that utilizes full NIC potential.
NIC drivers can process multiple packets per interrupt using combination of software and hardware features: NAPI and hardware interrupt mitigation feature, to reduce interrupt processing overhead.
Benefit of configuring multiple ENI per instance is that it distributes network interrupt and packet processing across larger set of cpus. Multiple ENI can also be used to segregate network traffic for a service to improve visibility.

Having all these feature available help improve instance scalability by engaging more cpus for network processing.

PPS tests offer better insight into Network stack efficiency than throughput tests. One can optimize throughput by tuning tcp window size, using larger payload size and utilizing ethernet jumbo frames. Small packet handling or PPS tests help estimate the cost or latency associated with processing each packet. Smallest ethernet frame that can be sent over the wire is 64 bytes + overhead = 84 bytes. To keep NIC busy at 2 Mpps, payload (64 bytes) processing latency should stay below 500 ns (0.5 us), that includes full stack processing (java, jvm, libc, network stack and NIC driver). There are system calls (recvmmsg/sendmmsg (extra m), readv/writev) that can send multiple packets per system call to reduce system call overhead.

net tput (Gbits)	Frame size (bytes)	overhead (bytes)	pps rates	Latency per packet (ns)
10	64	20	16 Mpps	63 ns
10	1500	20	880 Kpps	1136
10	9000	20	150 Kpps	6750

NOTE: pps rate = tput / frame size | Latency per packet = 1 sec / pps rate. Ethernet Frame: MAC Header + smallest payload + CRC = 14 + 46 + 4 = 64 bytes. Additional Overhead: Inter-frame-gap (IFG or IPG) + MAC preamble = 12 + 8 = 20

Server virtualization has evolved from software only to hardware assisted solution. Large chunk of computation work is now offloaded to hardware, bypassing the hypervisor. IO virtualization solution like SR-IOV available on public cloud instances can help accelerate both storage and network performance of latency sensitive workloads. Application with high concurrency capabilities running on a well tuned kernel can now able to service millions of requests on a single public cloud instance.

References

Intel SR-IOV Driver Companion Guide
Intel Virtualized Technology For Directed IO
Abyss open source software is used for automating: benchmarks execution, metrics collection and graphs generation.
Linux Kernel Tunables applied to AMI

Kernel version: 3.13.0-76-generic #120-Ubuntu

---------------------------------------------------------------

#!/bin/bash

# Set sysctl tuning

# vm tunables

echo 80 > /proc/sys/vm/dirty_ratio

echo 5 > /proc/sys/vm/dirty_background_ratio

echo 12000 > /proc/sys/vm/dirty_expire_centisecs

echo 0 > /proc/sys/vm/swappiness

# network and tcp/ip tunables

echo 1000 > /proc/sys/net/core/somaxconn

echo 5000 > /proc/sys/net/core/netdev_max_backlog

echo 16777216 > /proc/sys/net/core/rmem_max

echo 16777216 > /proc/sys/net/core/wmem_max

echo "4096 12582912 16777216" > /proc/sys/net/ipv4/tcp_wmem

echo "4096 12582912 16777216" > /proc/sys/net/ipv4/tcp_rmem

echo 8192 > /proc/sys/net/ipv4/tcp_max_syn_backlog

echo 0 > /proc/sys/net/ipv4/tcp_slow_start_after_idle

echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse

echo 1 > /proc/sys/net/ipv4/tcp_abort_on_overflow

echo "10240 65535" > /proc/sys/net/ipv4/ip_local_port_range

# numa tunables: Disable numa balancing feature

echo 0 > /proc/sys/kernel/numa_balancing

# Set network rfs tuning

echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

cat /proc/sys/net/core/rps_sock_flow_entries

ethlist=`ls /sys/class/net/`

for i in $ethlist

if [ "$i" != "lo" ]; then

# SR-IOV enabled NIC has two queues. Setting RFS feature only

echo $i

echo 0 >/sys/class/net/$i/queues/rx-0/rps_cpus

echo 16384 >/sys/class/net/$i/queues/rx-0/rps_flow_cnt

echo 0 > /sys/class/net/$i/queues/tx-0/xps_cpus

cat /sys/class/net/$i/queues/rx-0/rps_cpus

cat /sys/class/net/$i/queues/rx-0/rps_flow_cnt

cat /sys/class/net/$i/queues/tx-0/xps_cpus

echo 0 >/sys/class/net/$i/queues/rx-1/rps_cpus

echo 16384 >/sys/class/net/$i/queues/rx-1/rps_flow_cnt

echo 0 > /sys/class/net/$i/queues/tx-1/xps_cpus

cat /sys/class/net/$i/queues/rx-1/rps_cpus

cat /sys/class/net/$i/queues/rx-1/rps_flow_cnt

cat /sys/class/net/$i/queues/tx-1/xps_cpus

done

# Set Block Layer tuning

disklist=`cd /dev/; ls xv*`

for i in $disklist

if [ "$i" != "xvda" ]; then

echo $i

echo 256 > /sys/block/$i/queue/read_ahead_kb

echo 2 > /sys/block/$i/queue/rq_affinity

echo cfq > /sys/block/$i/queue/scheduler

echo 256 > /sys/block/$i/queue/nr_requests

# Display it

cat /sys/block/$i/queue/read_ahead_kb

cat /sys/block/$i/queue/rq_affinity

cat /sys/block/$i/queue/scheduler

cat /sys/block/$i/queue/nr_requests

done

microbenchmark benchmark setup

pktgen:

There was no UDP server running to process request. Objective was to measure NIC driver Packet processing capabilities.
Single ENI:
Rx/Tx queue pair is assigned to CPU 1,17. UDP traffic was using a single eth1-TxRx-0 queue resulting cpu 1 to be 100% busy in softirq (%soft) processing
cat /proc/interrupts:
Queue pair is assigned to each ENI:
263:          .....      xen-pirq-msi-x     eth1-TxRx-0
26            .....      xen-pirq-msi-x     eth1-TxRx-1

07:34:39 PM CPU    %usr   %nice    %sys %iowait    %irq   %soft %steal %guest %gnice   %idle
07:34:40 PM all    0.16    0.00    0.31    0.00    0.00    3.13    0.00    0.00    0.00   96.40
07:34:40 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00 100.00
07:34:40 PM    1    0.00    0.00    0.00    0.00    0.00 100.00    0.00    0.00    0.00    0.00
07:34:40 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00 100.00
07:34:40 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00 100.00
07:34:40 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00 100.00
...
$ sar -n DEV 1
Shows server eth1 is processing at 700Kpps rate
07:38:39 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s rxmcst/s   %ifutil
07:38:40 PM      eth1 698262.07      1.15 40913.79      0.05      0.00      0.00      0.00     23.77
...

For multiple ENI test, UDP load, across multiple clients, was directed to different ENI (8 ENI) of the server. This distributed the traffic across multiple Rx queues on each ENI, pinned to separate CPUs. Since multiple cpus were engaged in processing UDP traffic, server was able to achieve higher PPS rates of 1.6 Mpps.

iperf:

The primary purpose of iperf is to test network throughput, where large socket buffer is used to cache application data in kernel that allows network stack to advertise large TCP window to peer and send large set of TCP segments of MTU size without acknowledgement. To achieve higher PPS rate with iperf, we reduced the MTU size from 9001 (jumbo frame) to minimum allowed (68 bytes). Having a smaller MTU size resulted in network stack to break TCP segments into smaller chunks to fit into smaller MTU . Both iperf clients and server use the same MTU size.

~$ sudo ip link set dev eth1 mtu 65
RTNETLINK answers: Invalid argument
~$ sudo ip link set dev eth1 mtu 66
RTNETLINK answers: Invalid argument
~$ sudo ip link set dev eth1 mtu 68 << worked

Client:
iperf -c 100.80.27.42 -t 10000
------------------------------------------------------------
Client connecting to 100.80.27.42, TCP port 5001
TCP window size: 12.0 MByte (default)
------------------------------------------------------------
[ 3] local 100.80.60.16 port 34459 connected with 100.80.27.42 port 5001
10:10:54 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s rxmcst/s   %ifutil
10:10:55 PM      eth0 63538.00 2251816.00   4095.26 681714.54      0.00      0.00      0.00    396.05
Server:
$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 12.0 MByte (default)
------------------------------------------------------------
0:07:40 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s rxmcst/s   %ifutil
10:07:41 PM      eth1 2249321.00 64086.00 680946.79   4130.54      0.00      0.00      0.00    395.61
cpu usage on server was low. Only single Rx queue was being used due to single client with one connection generating the load. NOTE: Using multiple iperf client connections or running client across multiple instances did not help with reaching higher PPS rates. Instance had already reached its max PPS limit.

….
$ ss -itp
Server shows small mss=36 bytes (TCP Maximum Segment Size)
ESTAB 0 0 100.80.27.42:5001 100.80.60.16:34459 users:(("iperf",123330,4))
cubic wscale:9,9 rto:204 rtt:4/2 ato:40 mss:36 cwnd:10 send 720.0Kbps rcv_rtt:4 rcv_space:6320576

12:23:52 AM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s rxmcst/s   %ifutil
12:23:53 AM      eth1 152105.00   4703.00 46047.41    303.12      0.00      0.00      0.00     26.75
12:23:53 AM      eth2 150648.00   5135.00 45606.33    338.49      0.00      0.00      0.00     26.50
12:23:53 AM      eth3 151990.00   4802.00 46012.35    316.50      0.00      0.00      0.00     26.73
12:23:53 AM      eth4 151777.00   4929.00 45948.12    325.66      0.00      0.00      0.00     26.69
12:23:53 AM      eth5 153761.00   4942.00 46544.21    364.64      0.00      0.00      0.00     27.04
12:23:53 AM      eth6 738426.00 22566.00 223546.93   1489.83      0.00      0.00      0.00    129.87
12:23:53 AM      eth7 493805.00 15640.00 149491.75   1031.70      0.00      0.00      0.00     86.85

iperf UDP Test:

Client:

$ iperf -u -c 100.80.27.42 -t 100000 -b 1000M
Client connecting to 100.80.27.42, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 208 KByte (default)
------------------------------------------------------------
[ 3] local 100.80.60.16 port 38690 connected with 100.80.27.42 port 5001

Server:

iperf -u -s
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 208 KByte (default)
------------------------------------------------------------

Multiple ENI (UDP): With multiple ENI, iperf udp test able to achieve 2 Mpps rates10:52:11 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s rxmcst/s   %ifutil
10:52:12 PM      eth0      2.00      2.00      0.12      1.16      0.00      0.00      0.00      0.00
10:52:12 PM      eth1 1182784.00      0.00 94346.46      0.00      0.00      0.00      0.00     54.81
10:52:12 PM      eth2 872064.00      1.00 69583.37      0.04      0.00      0.00      0.00     40.43

ngnix webserver benchmark setup

memcached benchmark setup

Server:

version: memcached 1.4.14

$sudo memcached -p 7425 -u nobody -c 32768 -t 8 -o slab_reassign slab_automove -d -l 0.0.0.0

Client:

Warm up the memcached cache with 2 Million objects of size 100 bytes.

$mcblaster -p $mem_port -z 100 -k 2000000 -d 30 -w 10000 -c 1 -r 1 $peer

Once the cache is warmed up perform get operations at various RPS rates:

$mcblaster -p $mem_port -t 2 -z 100 -d 10 -r $RPS -c 1 $peer

For memcached latency tests, RPS rates were limited to 300k.

@RPS = (100000,125000,150000,175000,200000,250000,300000)

For full load without latency measurements, RPS rates used:

@RPS = (25000,50000,75000,100000,125000,150000,175000,200000,250000,300000)

pidstat shows, there are 8 memcached threads running with 500% CPU (5 CPU) usage:
01:32:39 AM   UID      TGID       TID    %usr %system %guest    %CPU   CPU Command
01:32:40 AM 65534     89033         - 215.00 267.00    0.00 482.00     5 memcached
01:32:40 AM 65534         -     89034   18.00   19.00    0.00   37.00     2 |__memcached
01:32:40 AM 65534         -     89035   21.00   23.00    0.00   44.00     3 |__memcached
01:32:40 AM 65534         -     89036   13.00   65.00    0.00   78.00    11 |__memcached
01:32:40 AM 65534         -     89037   33.00   28.00    0.00   61.00     7 |__memcached
01:32:40 AM 65534         -     89038   36.00   37.00    0.00   73.00     6 |__memcached
01:32:40 AM 65534         -     89039   32.00   28.00    0.00   60.00    10 |__memcached
01:32:40 AM 65534         -     89040   33.00   37.00    0.00   70.00     8 |__memcached
01:32:40 AM 65534         -     89041   32.00   28.00    0.00   60.00     4 |__memcached

   ..
memcached Socket Recv-Q is full - sign of application bottleneck or inability to process packets
[The column Recv-Q shows the bytes of data in the queue for user program to consume]
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port
ESTAB      239194 4453         100.80.60.15:7425         100.66.69.120:49930    users:(("memcached",62358,162))
        cubic wscale:9,9 rto:228 rtt:31.5/3 ato:40 mss:1344 cwnd:135 ssthresh:7 send 46.1Mbps unacked:61 rcv_rtt:16 rcv_space:32472
ESTAB      555373 7300         100.80.60.15:7425         100.80.23.167:38379    users:(("memcached",62358,86))
        cubic wscale:9,9 rto:228 rtt:31/3 ato:40 mss:1344 cwnd:167 ssthresh:11 send 57.9Mbps unacked:100 rcv_rtt:17 rcv_space:34682
..
-----
netstat reported large packets collapsed stats..sign of overwhelmed memcached server process
40912797 packets collapsed in receive queue due to low socket buffer

Amer Ather | Netflix Performance Engineering

Linux Performance in Cloud

Monday, May 9, 2016

2 Million Packets Per Second on a Public Cloud Instance

Technology Overview

Benchmark Results

Linux Network Stack

References

Blog Archive