One of the biggest misconceptions about DevOps is probably that it is a new way of doing Operations. In reality it is as much, if not more, a different way of doing Development. Several articles from well-respected experts on the topic affirm that “There’s No Such Thing as a Devops Team” or that “DevOps is Dead”. The problem these articles describe is that the software industry is trying to build or hire DevOps teams to bring in Continuous Integration, Continuous Delivery and Continuous Deployment, so that the Development team can be more Agile and respond to Marketing requirements in record time. Such a team is doomed to have limited impact without a cultural shift in the Development team as well. There is no one-size-fits-all DevOps solution. Some companies, however, have been hugely successful thanks to a culture that enabled Operational Excellence, which in turn enabled their massive growth. Two of these companies are Netflix and Google.
Freedom and Responsibility
Netflix’s engineering culture is highly respected in the software industry. Netflix was launched in the late 90s as an online DVD store, and then became very famous in the US, with its red envelopes, as a DVD-rental-by-mail business. At the time it had a traditional, relatively small development team building a low-to-medium traffic site, designed as a monolithic, centralized service that scaled vertically and was run by an Operations group in a traditional datacenter. In 2008 netflix.com had an outage that lasted several days. Because customers with a queue still received DVDs in the mail, this was not a massive problem, but Netflix’s plan was to transform itself into a streaming business, where such an outage could destroy it. At the time Netflix stock was trading around $3.
Since the streaming service would be an almost completely new software stack, Netflix decided to completely rethink how it operated. In fact, it no longer has an Operations group for streaming. If you ask Netflix how they think about DevOps, their answer is that they don’t. By delegating the physical operations to AWS and making the development teams 100% responsible for uptime, Netflix did not need an Operations group to run streaming. The cultural transformation rested entirely on the shoulders of the developers. This is an extreme model, but it is a good example of why a DevOps culture is an Engineering problem first. The Netflix approach is also known as NoOps, but it is not for everyone, and few organizations are able to scale NoOps. The key ingredients that made this model work were Netflix’s culture of “Freedom and Responsibility” combined with a true microservice architecture, where the teams are loosely coupled and require little centralized coordination overhead. If a measure of the success of Netflix’s transformation is needed: in 2017 the stock is trading above $140, an increase of almost 50x in 9 years.
Site Reliability Engineering
Google has built a legendary production infrastructure around the Site Reliability Engineering group. This group’s processes and culture have recently been documented at length in the book “Site Reliability Engineering: How Google Runs Production Systems”. With the SRE group, Google did not remove the group that runs the production systems, as Netflix has done. Instead it created a new role, an Engineering role whose end goal is to achieve excellent Site Reliability. A little over half of the SREs come from a pure Software Engineering background and a little less than half have system administration experience, but all of them are strong software developers. The team members are usually assigned to run specific Google products such as AdWords or Google News, but there is a mandate that Operational work should not take up more than 50% of an SRE’s time. This is a maximum, and there is no minimum. If the SREs go over the 50% maximum, which is actively monitored, steps are taken to correct this. Often this means that the development team with the unstable service will pick up some of the operational burden. This allows Google to ensure that the SREs can allocate sufficient engineering effort to improving the system, and it also ensures that the development team gets feedback and an incentive to improve the fault tolerance of the service. In this model the operational cost grows sublinearly as the system scales, whereas in a traditional model the number of operators grows linearly with the number of machines and services. In practice, traditional scaling can be even worse than linear because of increased communication overhead and high attrition caused by burnout. For example Facebook, which borrows a lot of its Engineering Culture from Google, has “one SRE for every 18 Million users”.
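As a rough illustration of how the 50% cap on operational work could be monitored, here is a minimal Python sketch. It is not Google's tooling: the `TimeEntry` record, the `ops_fraction` and `over_cap` helpers, and the idea of logging hours per task are all assumptions made up for this example.

```python
from dataclasses import dataclass

OPS_CAP = 0.5  # the 50% maximum on operational work described above


@dataclass
class TimeEntry:
    """One logged block of work (hypothetical record format)."""
    engineer: str
    hours: float
    operational: bool  # True for tickets, on-call and manual work; False for engineering projects


def ops_fraction(entries):
    """Return the share of logged hours spent on operational work."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0  # nothing logged, so nothing to flag
    ops = sum(e.hours for e in entries if e.operational)
    return ops / total


def over_cap(entries, cap=OPS_CAP):
    """True when operational work exceeds the cap; exactly at the cap is still allowed."""
    return ops_fraction(entries) > cap


week = [
    TimeEntry("alice", 30, operational=True),
    TimeEntry("alice", 10, operational=False),
]
if over_cap(week):
    print("Operational load above 50%: hand some of it back to the development team")
```

The comparison is strict because the 50% figure is a maximum rather than a target, and there is no lower bound to check.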
One impressive accomplishment is that, unlike a traditional Operations group in a traditional company, the SRE group and its engineers are very well respected internally, both by the development teams and by the high-level executives at Google. The SRE team members get the same awards and company-wide recognition as the development teams during the weekly TGIF sessions. The SRE title has been picked up by other organizations, and even Netflix has an SRE team, but few come close to the original.