Linux Performance in Cloud: My BIO

Hello, my name is Amer Ather and I live in the beautiful city of Saratoga with my lovely wife and three adorable kids. I have been thinking about blogging and tweeting (@amernetflix) for some time. I thought, if not now then never. So here I am, ready to blog.

I blog about Linux kernel performance, profiling, Amazon public cloud services and open source projects.

I graduated from the University of Toledo with a degree in Electrical Engineering in 1991. With no entry-level jobs to be found in the power industry due to the recession, I decided to try my luck with computers and accepted a job as a System Engineer at Pyramid Technology. The company was a leader in building enterprise-class servers that delivered mainframe-class performance and availability on RISC-based Unix SMP systems. It turned out to be the right decision, considering that my interest in Unix grew there. I literally read every Unix kernel, C and TCP/IP book that I could get my hands on. I also attended professional development classes at the University of California, Santa Cruz (Silicon Valley Extension). Later in my career (in 2007), I joined the UC part-time faculty and started teaching courses: Linux Device Driver, Advanced, Linux Performance in the Cloud and DataCenter and Server Virtualization with Xen.

By 1994-1995, Pyramid Technology started facing stiff competition from a newcomer, Sun Microsystems, which entered the enterprise server market with the acquisition of Cray Research's business systems division. With Intel/Windows threatening Unix dominance in the desktop and low-end server market, Sun Microsystems shifted gears and moved into the higher-margin enterprise server business. A bunch of Pyramid engineers (including me) were hired by Sun to help with the company's transition from desktop to enterprise server. At Sun, I worked as an Area System Engineer, part of a fly-and-fix team, responsible for managing field escalations, reviewing pervasive issues, developing tools and writing knowledge articles that were used by Sun customers and by Sun service and support organizations. I also took on a sustaining role for a few years and fixed bugs in ZFS, TCP, and the network and IO stacks. I used DTrace (Dynamic Tracing Framework) extensively for performance analysis, application profiling and fault injection. I contributed to these internal projects:

Crash Analysis Tool (cat): A tool written by a number of senior kernel engineers to improve Solution Center engineers' ability to analyze kernel core dumps efficiently.
Kernkit: A loadable kernel module that sets and unsets breakpoints at any arbitrary location in the kernel. I developed frontend modules that call into a backend breakpoint engine to insert or remove breakpoints and register callbacks. Teams used this tool to diagnose complex issues until DTrace (Solaris Dynamic Tracing framework) became the tool of choice. A number of frontend modules were developed: Kernel Thread Switcher (captures time spent by a kernel thread in various states), Sigtrace (to help root-cause process termination), TCP analyzer, mutex contention, etc.

Sun Microsystems was a true engineering company that gave employees the freedom to innovate and valued their opinions. The company was generous with sharing profits and granting stock options. It had a great run during the Internet boom ("We're the dot in .com!") because of its vision ("The Network is the Computer!"). Those were amazing and memorable years: the company stock split multiple times (yes, I became rich!); Sun introduced Java and rolled out the first multi-core, multithreaded (chip multithreading) Niagara chip. Solaris 10 brought some groundbreaking technologies to market: DTrace, Zones (containers), ZFS, FMA, and the list goes on. I was fortunate to be part of it. Yes, I miss those days.

Sun's downfall started when the dot-com bubble burst. Money dried up, and the departure of key executives (including the CEO and CTO) and engineers created a huge void. A lack of strategy and vision, along with bad acquisitions, brought the company to a breaking point. Sun still had a strong portfolio of technology patents and brand recognition, and IBM and Oracle saw great value in it. After a bidding war and some politics, Oracle came out the winner and acquired Sun at a bargain price.

Things changed for the worse for Sun employees after the Oracle merger. Oracle, unlike Sun, is a business company, where management is valued more than engineering. It’s all about checks and balances! Innovation requires the freedom to think and do what is right, not what one is told. Top engineering talent left Oracle for better opportunities and built successful startups. My tenure with Oracle also came to an end. In 2013, I left Oracle and joined Netflix as a Senior Cloud Architect.

The Netflix opportunity was a much needed shot in the arm that rejuvenated my career goals and direction. The Netflix culture of “Freedom and Responsibility” is not just a document; it is the company’s lifeline. Netflix gives employees competitive benefits and the freedom to innovate and, most importantly, lets them share innovative work with the open source community. Netflix offers a premium streaming service in HD and 4K to over sixty-five million paid subscribers worldwide. As part of the Netflix Performance Engineering team, my colleague (Brendan Gregg) and I are working diligently to provide visibility across the full software stack (Java, JVM, kernel, Xen). Netflix promotes a self-service model, and thus our goal is to bring together full-stack profiling capabilities and performance analysis knowledge as a self-service offering for Netflix teams. I have contributed to open source projects: Vector, Abyss, pcp, grafana, cloudstat.

Companies built in the cloud era (Netflix, Google, Facebook, LinkedIn, Twitter, eBay) follow an Agile methodology for development. The DevOps model allows developers to push code into production multiple times a week (yes, production!). Writing monolithic applications that contain the full functionality and require months of testing per release cycle is a thing of the past. The public cloud has made microservice (service-oriented) architecture indispensable. Focus has shifted away from developing one giant monolithic application that can only be scaled vertically. Nowadays, the industry trend is to break functionality into multiple microservices that can be developed independently and that offer well-known interfaces for interacting with upstream and downstream services. Where monolithic application designs suffer from lengthy and cumbersome build-test-release cycles, microservices allow teams to deliver features quickly and efficiently, realizing the benefits of continuous delivery. Netflix championed this model by successfully building and deploying hundreds of microservices on the Amazon public cloud. These microservices are mainly stateless and have different SLO/SLA requirements. A higher level of availability (99.99%) is achieved by: setting custom retries and timeouts on downstream services; offering self-healing capabilities; running with limited functionality when a dependency fails; predictive auto-scaling (horizontal scaling) in anticipation of higher or lower production load; and massive duplication of resources and services. The Netflix software stack is built from the ground up to be cloud native, so that it can adapt to the variability in performance and availability inherent in the public cloud.
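To make the "custom retries and timeouts" and "running with limited functionality" ideas a little more concrete, here is a minimal Python sketch. It is not Netflix code: the names `call_with_retry`, `DependencyError` and `make_flaky_service` are made up for illustration, and the downstream service is simulated in memory rather than called over the network.

```python
import time


class DependencyError(Exception):
    """Hypothetical error: a downstream call failed or timed out."""


def call_with_retry(call, retries=2, timeout_s=0.5, backoff_s=0.0, fallback=None):
    """Call a downstream dependency with a per-attempt timeout budget.

    `call` is any function that accepts a timeout in seconds and raises
    DependencyError on failure. After `retries` extra attempts, return the
    result of `fallback()` (degraded functionality) if one was supplied,
    otherwise re-raise the last error.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return call(timeout_s)
        except DependencyError as err:
            last_error = err
            if attempt < retries:
                # Exponential backoff between attempts.
                time.sleep(backoff_s * (2 ** attempt))
    if fallback is not None:
        return fallback()
    raise last_error


def make_flaky_service(failures_before_success, response):
    """Simulated downstream service (assumption, for demonstration only)."""
    state = {"calls": 0}

    def service(timeout_s):
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise DependencyError(f"no reply within {timeout_s}s")
        return response

    service.state = state
    return service


if __name__ == "__main__":
    # Recovers after one failed attempt.
    recs = make_flaky_service(1, ["personalized row"])
    print(call_with_retry(recs, retries=2))

    # Dependency stays down: serve a generic, non-personalized answer instead.
    dead = make_flaky_service(10, ["personalized row"])
    print(call_with_retry(dead, retries=2, fallback=lambda: ["popular titles"]))
```

The point of the fallback is that the caller stays up and returns something useful even when its dependency does not, which is how a stateless service can keep its own availability higher than that of the services it depends on.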

A number of Netflix services are based on open source projects (memcached, Cassandra, Kafka, Elasticsearch, Hadoop, etc.), and others are home-grown software projects that were later open sourced under Netflix OSS. Netflix has teams that are responsible for the complete life cycle of their service, which includes development, release, and deployment into the cloud. Service teams use Continuous Delivery platforms (Jenkins) to automate the build process and private GitHub repositories for collaboration. Developers push their changes into a central Git repository, which triggers a new build. Artifacts are packaged (deb packaging) and baked into an AMI (cloud image) that contains an Ubuntu distribution and other platform libraries and packages (OpenJDK, Apache, Tomcat, monitoring agents, etc.). The resulting AMI is then launched into the Amazon public cloud. A newly launched instance goes through canary testing by taking a small subset of the production load. Correlation is performed across hundreds or even thousands of metrics and dimensions to find outliers or regressions. Once confidence is built in the new code, a Red/Black push is performed, which launches the new software version into a new ASG (Amazon Auto Scaling group) or cluster and scales it to the level of the old ASG. The old ASG is then retired by slowly shrinking its size to zero instances. All this happens (the majority of the time) without human intervention or downtime. Netflix is open 24×7. Netflix services are built to be resilient and fault tolerant. There is an army of Monkeys and Gorillas hopping around in the cloud, actively testing conformance by simulating real-world failures. Lessons learnt from these exercises help Netflix teams make their services reliable and scalable.
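The Red/Black push described above can be sketched as a tiny in-memory simulation. This is only an illustration of the order of the steps: the `ASG` class, its `serving` flag, `red_black_push` and the `shrink_step` parameter are all assumptions made up for this sketch, and no AWS API is called.

```python
from dataclasses import dataclass


@dataclass
class ASG:
    """Toy stand-in for an Auto Scaling group (not the AWS API)."""
    name: str
    desired: int = 0       # desired number of instances
    serving: bool = False  # hypothetical flag: is this group taking traffic?


def red_black_push(old, new_name, shrink_step=2):
    """Simulate a Red/Black push from `old` to a new group.

    Returns the new group and the list of sizes the old group went
    through while being retired.
    """
    # 1. Launch the new version in its own group, scaled to match the old one.
    new = ASG(name=new_name, desired=old.desired)

    # 2. Move traffic: the new group starts serving, the old one stops.
    new.serving = True
    old.serving = False

    # 3. Retire the old group by slowly shrinking it to zero instances.
    history = [old.desired]
    while old.desired > 0:
        old.desired = max(0, old.desired - shrink_step)
        history.append(old.desired)

    return new, history


if __name__ == "__main__":
    old = ASG(name="api-v001", desired=5, serving=True)
    new, history = red_black_push(old, "api-v002")
    print(new)
    print("old group sizes while retiring:", history)
```

Because the old group is shrunk gradually rather than deleted at once, it remains available for a while as a place to send traffic back to if the new version misbehaves.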