Over the last couple of years, the term “cloud native” has entered the collective consciousness of those designing and building applications and the infrastructure that supports them. At its heart, cloud native refers to a software architecture paradigm tailored for the cloud. It calls for applications to 1) employ containers as the atomic unit for packaging and deployment, 2) be autonomic, that is, centrally orchestrated and dynamically scheduled, and 3) be microservices-oriented, that is, built as loosely coupled, modular services, each running as an independent process and most often communicating with one another over HTTP via an API.
Dissecting those characteristics further implies that modern applications need to be platform-independent (i.e., decoupled from physical and/or virtual resources so that they work equally well across cloud and compute substrates), highly elastic, highly available and easily maintainable.
By the sound of that, building cloud native applications would seem to be a no-brainer for every organization, whether or not it considers writing software business-critical. In practice, however, going cloud native, much like adopting DevOps, requires putting into place a broad set of new technologies and practices that meaningfully shift the overhead costs associated with writing, deploying and managing software. So before going cloud native, it’s imperative to understand the motivations for this architectural transformation, both technical and organizational.
A good place to start is with Google, the poster child for this highly distributed, autonomic computing paradigm. Google has been running on containerized infrastructure for nearly a decade and manages resource allocation, scheduling, orchestration and deployment through a proprietary system called Borg. In a research paper released in 2015, “Large-scale cluster management at Google with Borg,” Google elucidates its motivations:
Borg provides three main benefits: it (1) hides the details of resource management and failure handling so its users can focus on application development instead; (2) operates with very high reliability and availability, and supports applications that do the same; and (3) lets us run workloads across tens of thousands of machines effectively.
So Google’s rationale for going cloud native is to achieve 1) agility, as defined by developer productivity and self-service, 2) fault tolerance and 3) horizontal scalability. And while almost no organization has to operate at the massive scale of Google, every company in the world asks itself “how do I go faster?” and “how do I minimize risk?”
Problems arise, however, when going cloud native becomes an end, not a means. While containers, autonomic scheduling and microservices-oriented design are all tools that can facilitate operational agility and reduce the risk associated with shipping software, they are far from a panacea and involve shifting meaningful costs from dev to prod. Martin Fowler and others have termed this phenomenon the “microservices premium”:
The [cloud native] approach is all about handling a complex system, but in order to do so the approach introduces its own set of complexities. When you [adopt cloud native architectures] you have to work on automated deployment, monitoring, dealing with failure, eventual consistency, and other [complexities] that a distributed system introduces.
The prevailing fallacy is to conflate using Docker as a package format with the need to build an application as a complex distributed system from the get-go.
The first rule of thumb is “if it ain’t broke, don’t fix it,” so there’s no need for added complexity if your team is functioning at a high level, releases are on schedule and your app is resilient and scaling to meet user demand. Sustained high levels of developer productivity, continuous deployment and fault-tolerant systems can be, and often are, achieved without ever so much as interacting with a Dockerfile (though Docker can radically simplify the development workflow). In fact, many of the most elegant delivery pipelines in high-performance software organizations are AMI-based and deployed by Slackbots!
However, as your engineering organization balloons to 100+ devs, going cloud native, including standing up the entire distributed runtime, could very well begin to make sense. Just remember that all these decisions are tradeoffs, where complexity is merely shifted, not reduced.