Monday, May 9, 2016

2 Million Packets Per Second on a Public Cloud Instance

Amazon recommends that customers choose VPC because it provides data-center-like features and performance with the elasticity of a public cloud. One of the key performance differentiators between EC2 Classic and VPC is the Enhanced Networking feature offered on VPC instances, which helps applications achieve high RPS (requests per second) rates through low-latency networking. As Netflix migrates from EC2 Classic to VPC, we ran benchmarks to identify VPC instance limits. Micro benchmarks run on Amazon (r3, i2, m4).8xlarge VPC instances reported a 10-fold improvement in packet processing rates (2 Mpps) compared to a similar EC2 Classic instance, which is limited to 200 kpps. Some application benchmarks also reported 10x higher RPS rates in VPC.
Because Netflix adopted the public cloud early, the majority of Netflix services are still hosted on Amazon EC2 Classic. AWS services have evolved, and EC2 Classic was not built to support the features required by the new breed of cloud services. Netflix services that use RPS rates as a metric to scale the ASG (Auto Scaling Group) routinely over-provision compute farms to overcome the packet processing overhead inherent in the Xen split driver model (Figure 1).
Figure 1. Xen split driver model. Source: The Definitive Guide to the Xen Hypervisor
NOTE: EC2 Classic instances and low-end instances in VPC use the Xen split driver model (software), which uses a shared memory ring between the instance and Dom0 (the Xen trusted domain) to exchange packets. This model has a higher packet processing overhead than an SR-IOV (hardware) enabled NIC.
Amazon VPC promises much higher pps rates at sub-millisecond latencies, which means fewer instances (and therefore lower cost) are needed to meet upstream service demands.

Technology Overview

The Amazon Enhanced Networking feature, built on top of SR-IOV (a PCI-SIG standard) technology, gives an instance direct access to a subset of the PCI resources on a physical NIC. Unlike the Xen virtualized driver, an SR-IOV compliant driver running on a cloud instance can DMA (Direct Memory Access) directly to the NIC hardware to achieve higher throughput and lower latency. DMA from the device to Virtual Machine memory does not compromise the safety of the underlying hardware: Intel Virtualization Technology for Directed I/O (VT-d) supports DMA and interrupt remapping, which restricts the NIC hardware to the subset of physical memory allocated to a particular Virtual Machine. No hypervisor interaction is needed except for interrupt processing.
SR-IOV Driver and NIC

An EC2 Classic instance can have only a single Xen virtualized NIC, whereas a VPC instance can support multiple NICs (ENIs) per instance to help distribute interrupts and network traffic. Each ENI on an AWS instance is assigned a PCI virtual function (VF). Each virtual function (a lightweight version of the PCI physical function, PF) gets a subset of the physical NIC's PCI resources, such as PCI configuration registers, iomem regions, queue pairs (Rx/Tx queues), and a set of transmit and receive descriptors with DMA capabilities. The NIC driver (ixgbevf) running inside the instance is para-virtualized, which in this context means that the driver is modified to have only limited PCI capabilities. It can transfer packets directly to hardware, but to change the MAC address, reset the device, or perform other operations that have global impact, it relies on the Physical Function (PF) driver running in the privileged domain (Dom0, managed by the cloud provider). Communication between the VF and PF drivers happens via special hardware registers/buffers called the Mailbox.
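One way to check which driver backs an interface is to read its sysfs entry. The following is a minimal sketch, assuming a Linux instance where `/sys/class/net/<iface>/device/driver` is a symlink to the bound driver; the helper names and the `eth0` interface name are illustrative assumptions, not part of the original benchmark setup.

```python
import os


def nic_driver(iface, sysfs_root="/sys/class/net"):
    """Return the name of the kernel driver bound to `iface`, or None.

    On Linux, <sysfs_root>/<iface>/device/driver is a symlink to the
    driver's sysfs directory; its last path component is the driver name.
    """
    link = os.path.join(sysfs_root, iface, "device", "driver")
    try:
        return os.path.basename(os.readlink(link))
    except OSError:
        # Interface missing, or a device with no backing driver link.
        return None


def uses_vf_driver(iface, sysfs_root="/sys/class/net"):
    """True when the interface is driven by ixgbevf, the VF driver named above."""
    return nic_driver(iface, sysfs_root) == "ixgbevf"


if __name__ == "__main__":
    # "eth0" is an assumed interface name; adjust for the instance at hand.
    print("eth0 driver:", nic_driver("eth0"))
```

On an instance with Enhanced Networking active, `uses_vf_driver` would be expected to return True; any other driver name (or None) suggests the interface is not using the SR-IOV path described here.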



Each VF has its own PCI resources on the NIC
The Intel PCIe NIC is a multi-queue device. AWS assigns each ENI (or VF) two queue pairs (up to a maximum of 16 queue pairs per instance) to distribute network traffic. Each queue pair is pinned to a separate CPU for interrupt and packet processing. The NIC hashes on the tuple (srcIP, dstIP, srcPort, dstPort) to decide which Rx queue (two Rx queues per ENI) to use for an incoming flow. Packets from a single flow use the same Rx queue to avoid packet reordering. Each queue pair has its own set of Tx and Rx descriptors (max: 4096). Each Rx/Tx descriptor in a queue is used to DMA an individual packet from/to the NIC. When all Rx/Tx descriptors are exhausted or in use, the NIC driver flow-controls the network stack. Thus a larger number of Rx descriptors can improve pps rates and throughput.
The Intel PCIe NIC also has an embedded Layer-2 switch that sorts packets based on the destination MAC address or VLAN tag. When a match is found, the packet is forwarded to the appropriate queue pair. The Layer-2 switch also performs the bridging function between VFs (ENIs) in hardware, without hypervisor intervention. Thus multiple instances hosted on the same physical machine can communicate at much lower network latencies than instances on different physical machines. AWS Placement Groups use this feature to offer the lowest possible network latency between instances.
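The flow-to-queue mapping described above can be illustrated with a small sketch. This is not the NIC's actual hash: hardware typically uses its own hash function, and CRC32 is used here only as a deterministic stand-in. The function name and sample addresses are hypothetical. The point it shows is that every packet of a flow lands on the same Rx queue, while different flows spread across queues.

```python
import zlib


def rx_queue_for_flow(src_ip, dst_ip, src_port, dst_port, num_queues=2):
    """Map a flow's 4-tuple to an Rx queue index in [0, num_queues)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    # CRC32 is only a deterministic stand-in for the NIC's hardware hash.
    return zlib.crc32(key) % num_queues


if __name__ == "__main__":
    # Hypothetical flows; two Rx queues per ENI, as described above.
    flows = [
        ("10.0.0.1", "10.0.0.9", 40000, 80),
        ("10.0.0.2", "10.0.0.9", 40001, 80),
        ("10.0.0.3", "10.0.0.9", 40002, 80),
    ]
    for flow in flows:
        print(flow, "-> Rx queue", rx_queue_for_flow(*flow))
```

Because the mapping depends only on the tuple, a single flow cannot use more than one queue (and therefore one CPU), which is why many concurrent flows are needed to exercise all queues.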

Benchmark Results

The Ubuntu AMI used for testing has routing tables set so that traffic arriving on an interface goes back out the same interface, and vice versa. This allows network traffic to be distributed across multiple ENIs. Kernel tuning is baked into the AMI to attain optimum performance for varying types of Netflix workloads. Although each ENI attached to an instance has a dedicated DMA path to the physical NIC, the master driver (running in the trusted domain Dom0 and controlled by the cloud provider) can set throttling limits on throughput and pps rates per instance. When multiple ENIs are configured and stressed, the instance's maximum pps limit is split across them. Tests run on Amazon 8xlarge instances show:

| Number of ENIs configured | Max pps rate per ENI | Bidirectional pps rate per ENI |
|---|---|---|
| 1 | 2.4 Mpps | 1.2 Mpps |
| 2 | 1.2 Mpps | 600 Kpps |
| 4 | 600 Kpps | 300 Kpps |
| 8 | 300 Kpps | 150 Kpps |
Note: Amazon does not comment on maximum pps rates per instance. In our testing, rates of ~2.4 Mpps were achievable on 8xlarge instances. Smaller instances (4xlarge) are throttled at ~1 Mpps.
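The per-ENI figures follow from simple division of the instance limit. Here is a minimal sketch, assuming the measured ~2.4 Mpps instance limit is split evenly across ENIs and, for bidirectional traffic, evenly between receive and transmit; the even split is an assumption that matches the measurements above, not documented AWS behavior, and the function name is hypothetical.

```python
def per_eni_pps(instance_limit_pps, num_eni):
    """Return (one-directional, bidirectional per-direction) pps per ENI.

    Assumes the instance-wide limit is divided evenly across ENIs, and that
    bidirectional traffic splits each ENI's share equally between Rx and Tx.
    """
    if num_eni < 1:
        raise ValueError("num_eni must be at least 1")
    one_direction = instance_limit_pps / num_eni
    bidirectional = one_direction / 2
    return one_direction, bidirectional


if __name__ == "__main__":
    limit = 2_400_000  # ~2.4 Mpps observed on 8xlarge instances
    for enis in (1, 2, 4, 8):
        one_dir, bidir = per_eni_pps(limit, enis)
        print(f"{enis} ENI: {one_dir / 1e3:.0f} Kpps max, {bidir / 1e3:.0f} Kpps bidirectional")
```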

Micro Benchmark Results
The micro-benchmarking tools pktgen and iperf were used to test the ability of the NIC hardware and driver to process small packets. The server NIC was flooded to measure maximum pps rates. The iperf test was run with a 68-byte MTU to generate small packets. Test results show that Amazon 8xlarge instance types can process packets at 2 Mpps in receive (Rx) or transmit (Tx), and at over 1 Mpps for bidirectional traffic.
| Micro benchmark | Network pps rate | Protocol |
|---|---|---|
| iperf | 2 Mpps | TCP |
| pktgen | 1.6 Mpps | UDP |
iperf TCP with 68 bytes MTU
pktgen UDP test 

Application Benchmark Results
Webserver Test:
The nginx web server supports the socket option SO_REUSEPORT for better concurrency, as it reduces contention among multiple server processes/threads accepting connections. A benchmark run on a VPC 8xlarge instance reported maximum RPS rates of over 1 million on a single instance, with 90th percentile latency of 2-5 ms. That is more than 10x the RPS of an EC2 Classic instance, which is limited to only 85 kpps. At 1 million HTTP requests/responses per second on the VPC instance, the underlying network reached its maximum limits and was unable to push more web traffic. Eight clients were used to generate HTTP traffic concurrently, using the wrk utility against a single nginx web server.
| Instance type | RPS rate (clients) | Web server latency | Web server pps rates |
|---|---|---|---|
| r3.8xl VPC instance | 1.1M | 1-4 ms | 1.2 Mpps (receive) + 1.2 Mpps (transmit); total: 2.4 Mpps |
| r3.8xl Classic instance | 85K | 1 ms (network load was kept low because higher load causes the server to become unresponsive over the network) | 85 kpps (receive) + 85 kpps (transmit); total: 170 kpps |
web server Test in VPC
web server Test in Classic

Note: Each dot in the graph represents a single iteration of the test.
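The SO_REUSEPORT behavior mentioned in the web server test can be sketched as follows. This is an illustration of the socket option itself, not nginx's implementation: on Linux, several sockets that each set SO_REUSEPORT can bind and listen on the same address and port, and the kernel distributes incoming connections among them, so each worker accepts from its own socket. The helper name, loopback address, and backlog value are assumptions.

```python
import socket


def reuseport_listener(port, host="127.0.0.1", backlog=128):
    """Create a TCP listening socket with SO_REUSEPORT set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set on every socket that shares the port, before binding.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    return sock


if __name__ == "__main__":
    # Port 0 lets the kernel pick a free port for the first listener.
    first = reuseport_listener(0)
    port = first.getsockname()[1]
    # A second listener on the same port succeeds because both set SO_REUSEPORT.
    second = reuseport_listener(port)
    print("two listeners share port", port)
    first.close()
    second.close()
```

In a real server each listener would typically live in its own worker process or thread, which avoids contention on a single shared accept queue.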
Memcached Test:
mcblaster, an open source memcached client, was used to generate load on the memcached server. The memcached benchmark reported 300K RPS (gets/sec) at low latencies of 1-10 ms on a VPC instance, compared to 85-90K RPS on EC2 Classic. The EC2 Classic network maxes out at 90-100 kpps, at which point the instance becomes unresponsive over the network. In comparison, a VPC instance with SR-IOV can be pushed to much higher pps rates without inducing higher latencies.
| Instance type | RPS rate (clients) | memcached max latency | % of requests completed in < 10 ms |
|---|---|---|---|
| VPC instance | 300K | 30 ms | 99% |
| Classic instance | 100K | 95 ms | 78% |
memcache Test in VPC

memcache Latency distribution in VPC
Note: Each dot represents a single iteration of the test. Each test ran for 10 seconds at RPS rates of 100-300K.
Memcached scalability is limited by high contention in the memcached code, as reported by Linux perf.

We were still able to push more load onto the VPC instance even when memcached was exhibiting higher latencies. The NIC driver on the memcached server instance continued to process incoming packets at 1.8 Mpps but transmitted at a lower rate of 600 kpps because memcached was overloaded.

Linux Network Stack

The Linux network stack can scale to high pps rates with proper kernel tuning and with the following features enabled:
  • The RPS/RFS network stack features help distribute network stack processing across multiple CPUs, which reduces latencies, especially on NUMA servers. During our tests we enabled RFS only.
  • When the NIC driver supports multi-send (bulk packet transmission), the network stack can queue multiple packets (skbs) to the NIC driver when passing them down for delivery.
  • Modern NICs support multiple Rx/Tx hardware queues, where each queue is assigned a dedicated CPU for interrupt processing. Receive traffic is distributed across multiple Rx queues, which utilizes the full potential of the NIC.
  • NIC drivers can process multiple packets per interrupt using a combination of software and hardware features (NAPI and hardware interrupt mitigation) to reduce interrupt processing overhead.
  • Configuring multiple ENIs per instance distributes network interrupt and packet processing across a larger set of CPUs. Multiple ENIs can also be used to segregate network traffic for a service to improve visibility.
Having all of these features available helps improve instance scalability by engaging more CPUs for network processing.
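As a rough illustration of the RFS item above, the sketch below computes the kernel settings that enable RFS for an interface. It only builds the path/value pairs and does not write them, since applying them requires root. The `/proc/sys/net/core/rps_sock_flow_entries` and per-queue `rps_flow_cnt` files are the standard Linux controls for RFS; the interface name, queue count, and the 32768 table size are assumptions for illustration, not the values baked into the AMI used in these tests.

```python
def rfs_settings(iface, num_rx_queues, sock_flow_entries=32768):
    """Return a {path: value} dict of sysctl/sysfs settings that enable RFS.

    The global flow table size is divided evenly across the interface's
    Rx queues to derive each queue's rps_flow_cnt.
    """
    if num_rx_queues < 1:
        raise ValueError("num_rx_queues must be at least 1")
    per_queue = sock_flow_entries // num_rx_queues
    settings = {"/proc/sys/net/core/rps_sock_flow_entries": sock_flow_entries}
    for queue in range(num_rx_queues):
        path = f"/sys/class/net/{iface}/queues/rx-{queue}/rps_flow_cnt"
        settings[path] = per_queue
    return settings


if __name__ == "__main__":
    # Assumed values: interface eth0 with the two Rx queues an ENI receives.
    for path, value in rfs_settings("eth0", 2).items():
        print(f"echo {value} > {path}")
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; the values can then be reviewed before being applied on a test instance.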


PPS tests offer better insight into network stack efficiency than throughput tests. Throughput can be optimized by tuning the TCP window size, using a larger payload size, and utilizing Ethernet jumbo frames. Small-packet handling, or PPS, tests help estimate the cost or latency associated with processing each packet. The smallest Ethernet frame that can be sent over the wire is 64 bytes + 20 bytes of overhead = 84 bytes. To keep the NIC busy at 2 Mpps, the processing latency for a 64-byte frame should stay below 500 ns (0.5 us), and that includes full stack processing (Java, JVM, libc, network stack, and NIC driver). There are system calls, such as recvmmsg/sendmmsg (note the extra m) and readv/writev, that can transfer multiple packets or buffers per call to reduce system call overhead.
| Net throughput (Gbit/s) | Frame size (bytes) | Overhead (bytes) | pps rate | Latency per packet (ns) |
|---|---|---|---|---|
| 10 | 64 | 20 | ~14.9 Mpps | ~67 |
| 10 | 1500 | 20 | ~822 Kpps | ~1216 |
| 10 | 9000 | 20 | ~139 Kpps | ~7216 |

NOTE: pps rate = throughput / ((frame size + overhead) × 8 bits), with 10 Gbit/s taken as 10^10 bit/s  |  Latency per packet = 1 sec / pps rate. Ethernet frame: MAC header + smallest payload + CRC = 14 + 46 + 4 = 64 bytes. Additional overhead: inter-frame gap (IFG or IPG) + MAC preamble = 12 + 8 = 20 bytes.
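The arithmetic behind these figures can be sketched in a few lines. This assumes decimal units (10 Gbit/s = 10^10 bit/s) and the 20 bytes of per-frame wire overhead described in the note; figures quoted with a different rounding or unit convention will differ somewhat. The function names are hypothetical.

```python
def line_rate_pps(link_bps, frame_bytes, overhead_bytes=20):
    """Maximum packets per second a link can carry for a given frame size.

    overhead_bytes covers the inter-frame gap (12) and preamble (8).
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_on_wire


def per_packet_budget_ns(pps):
    """Time available to process each packet, in nanoseconds, at a given pps rate."""
    return 1e9 / pps


if __name__ == "__main__":
    for frame in (64, 1500, 9000):
        pps = line_rate_pps(10e9, frame)
        print(f"{frame:>5} B frames: {pps / 1e6:.3f} Mpps, {per_packet_budget_ns(pps):.1f} ns per packet")
    # Budget at the 2 Mpps rate measured on 8xlarge VPC instances.
    print("2 Mpps budget:", per_packet_budget_ns(2e6), "ns")
```

Under these assumptions, 64-byte frames on a 10 Gbit/s link work out to about 14.88 Mpps, or roughly 67 ns per packet, while the 2 Mpps rate observed in the benchmarks leaves a 500 ns budget per packet.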

Server virtualization has evolved from software-only to hardware-assisted solutions. A large chunk of computation work is now offloaded to hardware, bypassing the hypervisor. IO virtualization solutions like SR-IOV, available on public cloud instances, can help accelerate both storage and network performance for latency-sensitive workloads. Applications with high concurrency capabilities, running on a well-tuned kernel, are now able to service millions of requests on a single public cloud instance.

References

Intel SR-IOV Driver Companion Guide
Intel Virtualization Technology for Directed I/O
Abyss, open source software used to automate benchmark execution, metrics collection, and graph generation.
Linux Kernel Tunables applied to AMI 

micro benchmark setup

nginx web server benchmark setup

memcached benchmark setup


Amer Ather | Netflix Performance Engineering
