Written by Stephane Odul (linkedin.com/in/sodul)
The Netflix model and the Google SRE model are, by their own admission, not exactly DevOps. At Netflix the Operations team simply does not exist on the streaming side, and at Google it is implemented as an Engineering organization that also has Operational duties. Early-stage startups will actually be closer to the Netflix NoOps model, usually because the whole company is made up of a handful of Software Developers and the problems of Reliability, Reproducibility, Security, Auditing, Disaster Recovery and Scaling are either not present yet or not painful enough to justify the additional headcount. Startups with actual customers and online traffic will start to prioritize reliability and hire an Operations or DevOps team, because improving Uptime takes people with strong Operational experience.
Uptime is not a one-time investment. It is an ongoing process that needs constant nurturing, and it is as much the responsibility of the Development team as of the Operations team. When planning the development work, Uptime should be your first feature, because when your service is down none of your other features matter. How do you make sure Uptime keeps improving while still enabling rapid delivery? One key item is Collaboration. With Collaboration you want to break down the organizational silos, increase empathy between the roles and build Respect. To balance on-call duty, Google requires a minimum of 8 people for a single-site on-call rotation. That does not mean you need 8 Operations personnel: spread the rotation across the Development team as well. Do not make the rotation too large either. That leads to Operational Underload, where each person is on call so rarely that they may lack the hands-on experience to fix an outage.
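As a rough illustration of a shared rotation, here is a minimal Python sketch. It assumes weekly shifts, and the names build_rotation and on_call_for_week are hypothetical helpers invented for this example, not part of any real scheduling tool. The only figure taken from the text is the minimum of 8 people.

```python
from itertools import zip_longest

MIN_ROTATION_SIZE = 8  # the single-site minimum cited above


def build_rotation(operations, developers, min_size=MIN_ROTATION_SIZE):
    """Interleave both teams so on-call duty alternates between them.

    Hypothetical helper: raises ValueError if the combined rotation is
    smaller than min_size.
    """
    rotation = [
        person
        for pair in zip_longest(operations, developers)
        for person in pair
        if person is not None  # the shorter team runs out first
    ]
    if len(rotation) < min_size:
        raise ValueError(
            f"rotation has {len(rotation)} people, need at least {min_size}"
        )
    return rotation


def on_call_for_week(rotation, week_number):
    """Return who is on call for a given week (weekly shifts are an assumption)."""
    return rotation[week_number % len(rotation)]


if __name__ == "__main__":
    ops = ["op1", "op2", "op3"]
    devs = ["dev1", "dev2", "dev3", "dev4", "dev5"]
    rotation = build_rotation(ops, devs)
    for week in range(len(rotation)):
        print(week, on_call_for_week(rotation, week))
```

Three Operations people alone would fail the size check here. Adding five Developers brings the rotation to eight and means neither team carries the pager by itself.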
Make Root Cause Analysis mandatory after each failure, and make the Development team responsible for writing it: they write the code, so they are the ones who should best understand how it works and how it can be updated to be more fault tolerant. You will want blameless RCAs. The goal of the exercise is to make sure the same outage does not happen twice, not to point fingers. Each RCA should be reviewed in a meeting with the key stakeholders, that is, the Developers and Operations. Strive to develop, in your code base, solutions that tolerate the root cause of the outages. For example, if a service became unstable and had to be restarted following a network glitch, fix the code to recover automatically (Development) instead of focusing on bouncing the service as fast as possible (Operations) or imposing an unattainable mandate of absolute reliability at the hardware level.
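One common way to make code recover by itself from a network glitch is to retry with exponential backoff. The following is a minimal Python sketch under stated assumptions, not a prescription: TransientNetworkError, call_with_recovery and make_flaky_operation are hypothetical names invented for this example, and a real service would catch whatever exception its own network library raises.

```python
import time


class TransientNetworkError(Exception):
    """Hypothetical stand-in for the error a network glitch would raise."""


def call_with_recovery(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation, retrying with exponential backoff on transient errors.

    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    The sleep function is injectable so the behavior can be tested quickly.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_operation(failures):
    """Build a fake operation that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TransientNetworkError("simulated network glitch")
        return "ok"

    return operation


if __name__ == "__main__":
    # Two simulated glitches, then success, with no manual restart needed.
    print(call_with_recovery(make_flaky_operation(2), base_delay=0.01))
```

The retry limit matters as much as the retry itself. A glitch that does not clear after a few attempts is a real outage, and the error should reach monitoring instead of being retried forever.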
Balance the workload of your Operations team. If the team's workload is 100% reactive, you need to make room for creative work and for Continuous Improvement. In the very short term, offload some of the operational duties to the Development team and, yes, cut down on new feature development. This should not be a problem, since your top feature should always be Uptime. Re-evaluate your hiring priorities as well: the Operations team is likely to simply be understaffed, so reallocate a Development headcount to it. Once more, Uptime is your most critical feature.
Internal tools development can be a major factor in your success. Google has a dedicated Engineering Productivity group that develops internal tools such as Blaze, open sourced as Bazel. That group represents about 10% of all of Google Engineering. Netflix has an Engineering Tools group that developed its internal Continuous Delivery tools, such as Spinnaker. While you will want to reuse off-the-shelf solutions as much as possible, having an internal team or group that develops custom, self-service solutions will give your company a competitive advantage.
Overall, I cannot stress enough that DevOps is first and foremost a cultural change for Engineering organizations, one that will help them achieve Operational Excellence. The change is mandatory if you are running a SaaS business with an Agile process and a modern Microservices architecture that must scale. To be successful today and tomorrow, you need to ensure that your Development and Operations teams work closely together and that they have enough resources for Continuous Improvement. Just as there is no single definition of DevOps today, the definitions will keep evolving as fast as software development practices, and even programming languages, evolve.
For Harrison Clarke