We live in an age of massive computation, where data centers housing thousands of connected nodes crunch the enormous flows of data underlying many of our daily activities. But with this explosion of computing scale come new problems in the software that makes those machines work. Programs designed and tested on smaller systems can produce unexpected issues when deployed across thousands of nodes, with critical consequences for reliability and performance.
In a new project funded by a $5 million LARGE grant from the National Science Foundation's Principles and Practice of Scalable Systems (PPoSS) program, a group of researchers led by UChicago CS will develop a novel ecosystem for building extreme-scale systems…from the studs up. Over five years, the ScaleStuds project will develop a new pipeline of tools, software, and systems that lets software developers write robust new software for massive clusters, even without direct access to such expensive systems, building foundations for correctness checkability and performance predictability at scale.
“The cloud basically democratized computing; now everybody has access to large computing. You no longer have to buy a big rack of machines and put in a lot of investment up front; you can just rent machines in the cloud,” said Haryadi Gunawi, Associate Professor of Computer Science and one of the principal investigators on the grant. “In the same way, we want to democratize large-scale software — the process of large-scale testing and performance prediction for the future.”
From UChicago CS, Gunawi will lead the project with co-PIs Professor Shan Lu and Associate Professor Henry (Hank) Hoffmann. The collaboration also includes researchers from Argonne National Laboratory, The University of Chicago Consortium for Advanced Science and Engineering (UChicago CASE), the University of Michigan, University of California, Davis, and Ohio State University.
Through interviews with cloud architects, the researchers repeatedly heard about a difficult and expensive problem: finding scalability faults in software before it is deployed in production. Companies need assurance that new software tested on smaller systems will work when it goes live on their full-sized clusters, and they also need reliable estimates of how fast it will perform. As critical consumer and business tools move to the cloud as “software-as-a-service” products, unexpected bugs or slowdowns in new updates can disrupt service for millions of customers, with catastrophic consequences for both the provider and its clientele.
“For every minute of downtime, there’s a cost for both end users and cloud service providers,” said Lu, who is currently a visiting researcher at Microsoft. “So for every piece of large-scale software deployed in the field, there are engineers who cover it 24 hours a day, seven days a week. If there’s a problem, it can take a lot of resources to fix it. So all cloud service providers want new tools to find problems before they deploy the software, in order to cut the downtime and reduce the number of people who need to make sure these deployments function as they should.”
The ScaleStuds vision is based around the analogy of a neighborhood of houses, each representing a new extreme-scale software/hardware component. Over the course of the project, the ScaleStuds team will build the inner structure and materials for these houses, floor by floor and stud by stud, with each stud representing a research building block. Developers can then take that framework and use it to build more robust and predictable software.
“We need to make progressive development, because we cannot solve everything at once,” Gunawi said. “What’s important is building the solutions layer by layer, where the studs represent the foundations towards correctness and performance predictability at scale.”
The many studs include building a taxonomy of scalability faults, designing new bug checkers that catch errors before deployment, creating emulators that simulate massive clusters on smaller machines, and training machine learning-driven models that predict the performance of software on complex, extreme-scale hardware. Indeed, the team foresees a role for artificial intelligence in accelerating the pipeline to successful software deployment as the hardware used by future systems changes.
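To make the notion of a scalability fault concrete, here is a minimal, hypothetical sketch (illustrative only, not code from the project): a design whose message cost grows quadratically with cluster size looks harmless on a small test cluster but becomes pathological at production scale.

```python
# Hypothetical sketch of a scalability fault (illustrative only):
# a protocol in which every node sends its full membership view to
# every other node. The cost is quadratic in cluster size, so a
# small test cluster never reveals the problem.

def messages_per_round(num_nodes: int) -> int:
    """Each of n nodes sends to the other n - 1 nodes: n * (n - 1) messages."""
    return num_nodes * (num_nodes - 1)

# On a 10-node test cluster the load looks negligible:
print(messages_per_round(10))      # 90 messages per round
# At 10,000 nodes the same design sends nearly 100 million:
print(messages_per_round(10_000))  # 99,990,000 messages per round
```

Bug checkers and emulators of the kind the article describes aim to surface exactly this sort of asymptotic blowup before the software ever touches a full-sized cluster.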
“One of the things that we need to build is a large data collection about the software properties as well as the hardware properties,” Gunawi said. “Using the house analogy, we want to know, how would the house perform if it’s on green soil or on dry soil, or even on the moon? It’s difficult to predict because there are so many moving pieces and so many data points. But this is where we feel machine learning could help. We’re not saying that machine learning beats humans, but we think machine learning simplifies human tasks.”
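As a rough illustration of the kind of data-driven performance prediction Gunawi describes, here is a hypothetical sketch (the measurements, model, and numbers are invented, not the project's): fit a simple power-law model to runtimes measured on small clusters, then extrapolate to a cluster size that was never tested.

```python
# Hypothetical sketch (invented numbers): predict large-scale runtime
# from small-scale measurements by fitting a power law
# runtime ≈ a * nodes**b via least squares in log-log space.
import math

# Invented measurements: (cluster size in nodes, runtime in seconds).
measurements = [(16, 1.2), (32, 2.6), (64, 5.5), (128, 11.8)]

# Ordinary least squares on log(runtime) = log(a) + b * log(nodes).
xs = [math.log(n) for n, _ in measurements]
ys = [math.log(t) for _, t in measurements]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)

def predict_runtime(nodes: int) -> float:
    """Extrapolate the fitted power law to an untested cluster size."""
    return a * nodes ** b

# The fitted exponent comes out just above 1 (slightly superlinear),
# so the predicted runtime on a 4,096-node cluster is noticeably
# worse than naive linear scaling of the 128-node measurement.
print(f"fitted exponent b = {b:.2f}")
print(f"predicted runtime on 4,096 nodes: {predict_runtime(4096):.0f} s")
```

Real models would of course involve far more features than node count alone, which is exactly why the team plans to collect large datasets of software and hardware properties.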
The broad range of expertise needed for the ScaleStuds project also reflects the growing strengths of UChicago CS and its partner institutions. Co-PIs Gunawi, Hoffmann, and Lu bring knowledge in systems, software reliability, and machine learning to the team, while partners at Argonne and other institutions contribute expertise in high-performance and cloud computing, verification, algorithms, and additional key areas.
“Haryadi and I were hired as part of the effort to grow systems in the department, and I would guess that this is the largest proposal that was led by someone who started their faculty career here and then went through tenure here,” Hoffmann said. “It is certainly the first example in systems, so this feels like an important milestone for our department.”