By any measure, we’ve entered the Age of Machine Learning and Artificial Intelligence (‘ML/AI’). The confluence of massive data, cheap storage, elastic compute, and algorithmic advances, particularly in deep learning, has given rise to applications that previously were confined to the pages of science fiction novels.
Machines now surpass humans in complex strategy games, to say nothing of image recognition, speech transcription, and other advances that begin to complicate our assumptions about what is and isn’t uniquely human. Voice-based personal assistants are commonplace; fully-autonomous vehicles seem just around the bend.
Given these recent advances, much of the dialogue around ML/AI has centered disproportionately, albeit understandably, on breakthroughs in algorithms and their applications. Notably absent in the discussion has been any mention of the infrastructure underlying these intelligent systems.
Just as in the earliest days of computing, when one needed to be an expert in assembly language, compilers and operating systems to develop a simple application, so today you need an army of stats and distributed systems PhDs to build and deploy AI at scale. The abstractions and tooling necessary to make ML/AI usable are the missing link. The upshot is that ML/AI remains a limited and expensive discipline reserved for only a few elite engineering organizations.
Ultimately, this relates to a lag in the evolution of infrastructure, which, to date, has been far outpaced by innovation in machine learning techniques. Put simply: the systems and tooling that helped usher in the current era of practical machine learning are ill-suited to power future generations of the intelligent applications they spawned.
Going forward, an entirely new toolchain is necessary to unlock the potential of ML/AI; to make it operational and usable, let alone approachable, for developers and enterprises. It stands to reason, then, that the next great opportunity in infrastructure will be to provide the building blocks for systems of intelligence.
From Infrastructure 1.0 to 2.0 and Beyond
Applications and infrastructure evolve in lock-step:
Advances in hardware or systems software cascade up the stack, enabling new breeds of applications. Those apps mature and come to strain their underlying resources, catalyzing a subsequent cycle of innovation at the infrastructure layer. The rise of better, faster, cheaper building blocks invariably leads to applications that deliver previously unimaginable experiences to end-users. This ebb and flow forms the contours of a technical legacy that stretches from punch cards to Pong to PowerPoint to Pinterest.
The commercial Internet that came of age in the late nineties and early aughts owes its existence to the x86 instruction set (Intel), the standardized operating system (Microsoft), the relational database (Oracle), Ethernet networking (Cisco), and networked data storage (EMC). Amazon, eBay, Yahoo and even the earliest iterations of Google and Facebook were built on this backbone we call Infrastructure 1.0.
Yet as the Web matured, growing from 16 million users in 1995 to over 3 billion by the end of 2015, the scale and performance requirements of applications morphed. It was no longer feasible, much less economical, for Webscale giants to run their businesses on the backs of technologies developed during and for the client-server era.
Instead, these companies looked inward. Coupling superior technical expertise with parallel computing research from academia, the Googles, Facebooks and Amazons of the world defined a new class of infrastructure that was scale-out, programmable, (often) open source, and commodity. This category of technologies — Linux, KVM, Xen, Docker, Kubernetes, Mesos, MySQL, MongoDB, Kafka, Hadoop, Spark, and many others — defined the cloud era. My colleague Sunil Dhaliwal described this shift as Infrastructure 2.0.
Ultimately, the technologies of this generation were purpose-built for scaling the Internet to billions of end-users and storing information captured from those users efficiently. In doing so, the innovations of Infrastructure 2.0 catalyzed a dramatic acceleration in data growth. Combined with virtually endless parallel compute and algorithmic advances, the stage was set for today’s era of practical machine learning.
Infrastructure 3.0: The Building Blocks of Intelligent Systems
Infrastructure 2.0 was ultimately concerned with the question of “how do we connect the world?” Today’s generation of technology rephrases the question to ask “how do we make sense of it?”
This distinction — connectivity vs. cognition — is what makes ML/AI radically different from software of previous generations. The computational challenge of coding cognition is that it inverts the classical programming paradigm. Whereas in traditional applications, logic is hand-coded to perform a specific task, in ML/AI, training algorithms infer logic from troves of data. That logic is then implemented to make decisions and predictions about the world.
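To make the inversion concrete, here is a minimal sketch in Python. It is an illustration only: the temperature data, the function names, and the one-parameter "model" are all assumptions made for this example, not anything a real system would ship. The first function encodes logic by hand; the second infers comparable logic from labeled examples.

```python
from typing import Callable, List, Tuple


def hand_coded_rule(temperature_c: float) -> bool:
    """Traditional programming: a human writes the decision logic directly."""
    return temperature_c >= 30.0


def train_threshold(examples: List[Tuple[float, bool]]) -> Callable[[float], bool]:
    """Learning: choose the threshold that misclassifies the fewest examples.

    This is a one-parameter decision stump, used here as a stand-in for
    any training algorithm (an assumption for illustration).
    """
    candidates = sorted(x for x, _ in examples)
    best_threshold, best_errors = candidates[0], len(examples) + 1
    for t in candidates:
        # Count examples where predicting "x >= t" disagrees with the label.
        errors = sum((x >= t) != label for x, label in examples)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return lambda x: x >= best_threshold


# Hypothetical labeled data: (temperature in Celsius, was the day "hot"?)
labeled = [(18.0, False), (22.5, False), (27.0, False), (31.0, True), (35.5, True)]

# The decision logic is inferred from data rather than written by hand.
learned_rule = train_threshold(labeled)

print(hand_coded_rule(33.0), learned_rule(33.0))
print(hand_coded_rule(20.0), learned_rule(20.0))
```

Even in this toy, the cost profile shifts: the hand-coded rule needs no data at all, while the learned rule is only as good as the examples it was trained on.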
The result is an application that is ‘smart,’ but exceptionally data-intensive and computationally-expensive. These properties make ML/AI a poor fit for the multi-purpose, generic Von Neumann computing paradigm of the last seventy-plus years. Instead, ML/AI represents a foundational new architecture that necessitates a rethink of infrastructure, tooling and development practices.
To date, however, the preponderance of research and innovation in ML/AI has been dedicated to new algorithms, model training techniques and optimizations. The irony is that only a tiny fraction of the code in ML/AI systems is devoted to learning or prediction. Rather, the majority of complexity manifests in data preparation, feature engineering, and operationalizing the distributed systems infrastructure necessary to perform these tasks at scale.
Building and deploying ML/AI successfully requires a complex, carefully-coordinated workflow involving multiple discrete systems. First, data needs to be ingested, cleaned and labeled. Then, the appropriate properties, known as features, upon which prediction is based must be determined. Finally, models need to be trained, validated, served and continuously optimized. From start to finish, this process may take many months, even for the most technically-proficient organizations.
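The stages of that workflow can be sketched as a toy pipeline. Everything here is hypothetical: the records, the single "engagement" feature, and the midpoint-threshold model are assumptions chosen to keep the example self-contained. In practice each function would be a separate system.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[float]]


def ingest() -> List[Record]:
    # Stand-in for reading from logs, queues, or a warehouse (made-up data).
    return [
        {"clicks": 1.0, "minutes": 2.0, "converted": 0.0},
        {"clicks": 8.0, "minutes": 15.0, "converted": 1.0},
        {"clicks": None, "minutes": 4.0, "converted": 0.0},  # dirty row
        {"clicks": 6.0, "minutes": 12.0, "converted": 1.0},
        {"clicks": 2.0, "minutes": 3.0, "converted": 0.0},
        {"clicks": 7.0, "minutes": 9.0, "converted": 1.0},
    ]


def clean(records: List[Record]) -> List[Record]:
    # Drop any record with a missing value.
    return [r for r in records if all(v is not None for v in r.values())]


def featurize(record: Record) -> float:
    # One engineered feature: a crude engagement score.
    return record["clicks"] * record["minutes"]


def train(rows: List[Tuple[float, float]]) -> float:
    # The "model" is a threshold halfway between the two class means.
    positives = mean(x for x, y in rows if y == 1.0)
    negatives = mean(x for x, y in rows if y == 0.0)
    return (positives + negatives) / 2


def predict(model: float, feature: float) -> float:
    return 1.0 if feature > model else 0.0


def validate(model: float, rows: List[Tuple[float, float]]) -> float:
    # Fraction of held-out rows the model classifies correctly.
    return mean(1.0 if predict(model, x) == y else 0.0 for x, y in rows)


cleaned = clean(ingest())
rows = [(featurize(r), r["converted"]) for r in cleaned]
train_rows, holdout_rows = rows[:3], rows[3:]

model = train(train_rows)
accuracy = validate(model, holdout_rows)

# "Serving": score a new, unseen record with the trained model.
served = predict(model, featurize({"clicks": 5.0, "minutes": 10.0}))
print(model, accuracy, served)
```

Note how little of the sketch is learning: `train` is a few lines, while ingestion, cleaning, featurization, validation and serving make up the rest.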
For ML/AI to reach its full potential, it must graduate from the academic discipline it is today into an engineering one. What that practically means is there need to be new abstractions, interfaces, systems and tooling to make developing and deploying intelligent applications facile for developers.
These requisite evolutions are not tiny shifts in abstraction or incremental process improvements. Rather, they are disruptive, foundational changes in both systems design and development workflow.
Correspondingly, at every layer of the stack we’re beginning to see new platforms and tools emerge that are optimized for the ML/AI paradigm. The opportunities are abundant:
- Specialized hardware with many computing cores and high bandwidth memory (HBM) very close to the processor die. These chips are optimized for highly parallel, numerical computation that is necessary to perform the rapid, low-precision, floating-point math intrinsic to neural networks.
- Systems software with hardware-efficient implementation that compiles computation down to the transistor level.
- Distributed computing frameworks, for both training and inference, that can efficiently scale out model operations across multiple nodes.
- Data and metadata management systems to enable reliable, uniform, and reproducible pipelines for creating and managing both training and prediction data.
- Extremely low-latency serving infrastructure that enables machines to rapidly execute intelligent actions based on real-time data and context.
- Model interpretation, QA, debugging and observability tooling to monitor, introspect, tune and optimize models and applications at scale.
- End-to-end platforms that encapsulate the entire ML/AI workflow and abstract away complexity from end-users. Examples include in-house systems like Uber’s Michelangelo and Facebook’s FBLearner and commercial offerings like Determined AI*.
Just as the last decade witnessed the emergence of the cloud-native stack, so too over the next several years we expect a vast infrastructure and tooling ecosystem to coalesce around ML/AI.
Collectively, the innovations of this epoch, Infrastructure 3.0, will be about unlocking the potential of ML/AI and providing the building blocks for intelligent systems. As with previous generations, new projects, platforms and companies will emerge to challenge the current incumbents. The arms dealers for this imminent ML/AI revolution promise to be the infrastructure giants of tomorrow.
This post first appeared in VentureBeat on November 28, 2017.
A giant thank you to Evan Sparks of Determined AI, Peter Bailis of Stanford’s DAWN, Joey Gonzalez and Robert Nishihara of UC Berkeley’s RISELab, and my colleagues at Amplify Partners for their perspective and input.
* Amplify Partners is an investor in Determined AI.