Image: L Maule, CC BY-SA 4.0, via Wikimedia Commons.

We live in an age of massive computation, with data centers housing thousands of connected nodes crunching enormous flows of data underlying many of our daily activities. But with this explosion of computing scale come new problems in the software that makes those machines work. Programs designed and tested on smaller systems can produce unexpected issues when deployed across thousands of nodes, with critical consequences for reliability and performance.

In a new project funded with a $5 million LARGE grant from the National Science Foundation Principles and Practice of Scalable Systems (PPoSS) program, a group of researchers led by UChicago CS will develop a novel ecosystem for the development of extreme-scale systems…from the studs up. Over five years, the ScaleStuds project will develop a new pipeline of tools, software, and systems that allow software developers to write robust new software for massive clusters, even without direct access to these expensive systems, building foundations for correctness checkability and performance predictability at scale.

“The cloud basically democratized computing, now everybody has access to large computing. You no longer have to buy a big rack of machines and put in a lot of investment up front, you can just run fewer machines in the cloud,” said Haryadi Gunawi, Associate Professor of Computer Science and one of the principal investigators on the grant. “In the same way, we want to democratize large-scale software: the process of large-scale testing and performance prediction for the future.”

From UChicago CS, Gunawi will lead the project with co-PIs Professor Shan Lu and Associate Professor Henry (Hank) Hoffmann. The collaboration also includes researchers from Argonne National Laboratory, the University of Chicago Consortium for Advanced Science and Engineering (UChicago CASE), the University of Michigan, the University of California, Davis, and Ohio State University.

Through interviews with cloud architects, the researchers repeatedly heard about a hard and expensive problem: finding scalability faults in software before it is deployed in production. Companies need assurance that new software tested on smaller systems will work when it goes live on their full-sized clusters, and they also need reliable estimates of how fast it will run. As critical consumer and business tools move to the cloud as “software-as-a-service” products, unexpected bugs or slowdowns in new updates can disrupt service for millions of customers, with catastrophic consequences for both the provider and its clientele.

“For every minute of downtime, there’s a cost for both end users and cloud service providers,” said Lu, who is currently a visiting researcher at Microsoft. “So for every piece of large-scale software deployed in the field, there are engineers who cover it 24 hours a day, seven days a week. If there’s a problem, it can take a lot of resources to fix it. So all cloud service providers want new tools to find problems before they deploy the software, in order to cut the downtime and reduce the number of people who need to make sure these deployments function as they should.”

The ScaleStuds vision is based around the analogy of a neighborhood of houses, each representing a new extreme-scale software/hardware component. Over the course of the project, the ScaleStuds team will build the inner structure and materials for these houses, floor by floor and stud by stud, with each stud representing a research building block. Developers can then take that framework and use it to build more robust and predictable software.

“We need to make progressive development, because we cannot solve everything at once,” Gunawi said. “What’s important is building the solutions layer by layer, where the studs represent the foundations towards correctness and performance predictability at scale.”

The many studs include steps such as building a taxonomy of scalability faults and designing new bug checkers to catch errors before deployment, emulators that simulate massive clusters on smaller machines, and machine learning-driven models that predict the performance of software on complex, extreme-scale hardware. Indeed, the team foresees a role for artificial intelligence in accelerating the pipeline to successful software deployment as the hardware used by future systems changes.

“One of the things that we need to build is a large data collection about the software properties as well as the hardware properties,” Gunawi said. “Using the house analogy, we want to know, how would the house perform if it’s on green soil or on dry soil, or even on the moon? It’s difficult to predict because there are so many moving pieces and so many data points. But this is where we feel machine learning could help. We’re not saying that machine learning beats humans, but we think machine learning simplifies human tasks.”

The broad range of expertise needed for the ScaleStuds project also reflects the growing strengths of UChicago CS and its partner institutions. Co-PIs Gunawi, Hoffmann, and Lu bring knowledge in systems, software reliability, and machine learning to the team, while partners at Argonne and other institutions contribute expertise in high-performance and cloud computing, verification, algorithms, and additional key areas.

“Haryadi and I were hired as part of the effort to grow systems in the department and I would guess that this is the largest proposal that was led by someone who started their faculty career here and then went through tenure here,” Hoffmann said. “It is certainly the first example in systems, so this feels like an important milestone for our department.”
