Image: L Maule, CC BY-SA 4.0, via Wikimedia Commons.

We live in an age of massive computation, with data centers housing thousands of connected nodes crunching enormous flows of data underlying many of our daily activities. But with this explosion of computing scale come new problems in the software that makes those machines work. Programs designed and tested on smaller systems can produce unexpected issues when deployed across thousands of nodes, with critical consequences for reliability and performance.

In a new project funded with a $5 million LARGE grant from the National Science Foundation Principles and Practice of Scalable Systems (PPoSS) program, a group of researchers led by UChicago CS will develop a novel ecosystem for the development of extreme-scale systems, from the studs up. Over five years, the ScaleStuds project will develop a new pipeline of tools, software, and systems that allow developers to write robust software for massive clusters, even without direct access to these expensive systems, building foundations for checking correctness and predicting performance at scale.

“The cloud basically democratized computing; now everybody has access to large computing. You no longer have to buy a big rack of machines and put in a lot of investment up front, you can just run fewer machines in the cloud,” said Haryadi Gunawi, Associate Professor of Computer Science and one of the principal investigators on the grant. “In the same way, we want to democratize large-scale software: the process of large-scale testing and performance prediction for the future.”

From UChicago CS, Gunawi will lead the project with co-PIs Professor Shan Lu and Associate Professor Henry (Hank) Hoffmann. The collaboration also includes researchers from Argonne National Laboratory, The University of Chicago Consortium for Advanced Science and Engineering (UChicago CASE), the University of Michigan, University of California, Davis, and Ohio State University.

Through interviews with cloud architects, the researchers repeatedly heard about a hard and expensive problem: finding scalability faults in software before it is deployed in production. Companies need assurance that new software tested on smaller systems will work when it goes live on their full-sized clusters, and they also need reliable estimates of how fast it will perform. As critical consumer and business tools move to the cloud as “software-as-a-service” products, unexpected bugs or slowdowns in new updates can disrupt service for millions of customers, with catastrophic consequences for both the provider and its clientele.

“For every minute of downtime, there’s a cost for both end users and cloud service providers,” said Lu, who is currently a visiting researcher at Microsoft. “So for every piece of large-scale software deployed in the field, there are engineers who cover it 24 hours a day, seven days a week. If there’s a problem, it can take a lot of resources to fix it. So all cloud service providers want new tools to find problems before they deploy the software, in order to cut the downtime and reduce the number of people who need to make sure these deployments function as they should.”

The ScaleStuds vision is based around the analogy of a neighborhood of houses, each representing a new extreme-scale software/hardware component. Over the course of the project, the ScaleStuds team will build the inner structure and materials for these houses, floor by floor and stud by stud, with each stud representing a research building block. Developers can then take that framework and use it to build more robust and predictable software.

“We need to make progressive development, because we cannot solve everything at once,” Gunawi said. “What’s important is building the solutions layer by layer, where the studs represent the foundations towards correctness and performance predictability at scale.”

The many studs include steps such as building a taxonomy of scalability faults and designing new bug checkers to catch errors before deployment, emulators that simulate massive clusters on smaller machines, and machine learning-driven models that predict the performance of software on complex, extreme-scale hardware. Indeed, the team foresees a role for artificial intelligence in accelerating the pipeline to successful software deployment as the hardware used by future systems changes.

“One of the things that we need to build is a large data collection about the software properties as well as the hardware properties,” Gunawi said. “Using the house analogy, we want to know, how would the house perform if it’s on green soil or on dry soil, or even on the moon? It’s difficult to predict because there are so many moving pieces and so many data points. But this is where we feel machine learning could help. We’re not saying that machine learning beats humans, but we think machine learning simplifies human tasks.”

The broad range of expertise needed for the ScaleStuds project also reflects the growing strengths of UChicago CS and its partner institutions. Co-PIs Gunawi, Hoffmann, and Lu bring knowledge in systems, software reliability, and machine learning to the team, while partners at Argonne and other institutions contribute expertise in high-performance and cloud computing, verification, algorithms, and additional key areas.

“Haryadi and I were hired as part of the effort to grow systems in the department and I would guess that this is the largest proposal that was led by someone who started their faculty career here and then went through tenure here,” Hoffmann said. “It is certainly the first example in systems, so this feels like an important milestone for our department.”
