It may not sound like the most exciting role, but the humble “scheduler” holds the key to the future of large-scale computing. Supercomputers, data centers, and even modern personal computers all benefit today from parallel computing, running multiple tasks at the same time instead of sequentially. Parallelism makes computing tremendously faster, but it also dramatically increases the complexity of deciding the optimal order in which to run as many as millions of interdependent jobs.

For decades, schedulers got by with a “greedy” algorithm — a short-term strategy which simply grabs the next task in line whenever resources become available. But as high-performance computing and data center operators worry more about energy costs and the end of Moore’s Law, this simple solution may no longer suffice.

At the 2018 Supercomputing conference in Dallas, a team of University of Chicago researchers presented an alternative approach to scheduling known as “Divide and Conquer.” The algorithm, developed by graduate students Gokalp Demirci, Ivana Marincic, and David Kim with associate professor of computer science Henry Hoffmann, applies a longer-term perspective and exploits configurable resources to achieve better results while adhering to a strict cap on energy consumption.

The algorithm is the first improvement on an approximation for scheduling with resource and precedence constraints since 1975, Hoffmann said.

“I think that most systems people figured these greedy algorithms work, and it's close to the best thing we can find in most situations, so I’ll just keep using greedy schedulers. That's why nobody's looked at it for 43 years, but now is really the time to solve it,” Hoffmann said. “The combination of power management and exascale means this is the right time.”

A More Efficient Exascale

As high-performance computing reaches for the exascale — systems that can run one billion billion calculations per second — experts emphasize that merely building bigger computers is no longer the solution. If today’s petascale machines were simply scaled up a thousandfold, they would consume 200 megawatts of power, as much as roughly 130,000 residential homes. The Department of Energy has proposed capping the energy consumption of exascale systems at 20-40 megawatts, presenting a difficult engineering challenge for both hardware and software.

Improving schedulers could be low-hanging fruit to help meet these goals. While full optimization is prohibitively complex — calculating the best schedule for a supercomputer would require a second supercomputer, Hoffmann said — there’s still plenty of room to improve beyond the simple, greedy algorithm.

Further help comes from the improved ability to fine-tune how a large computing system allocates power. Given a very large task and a very small task that can run concurrently, more power can be directed to the large task so that the two finish at roughly the same time, freeing up resources for subsequent jobs sooner.
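As a rough illustration of that balancing idea (not the paper’s actual allocation rule), here is a minimal Python sketch. It assumes a toy model in which a task’s running time is simply its work divided by the power it receives; under that assumption, giving each concurrent task a power share proportional to its work makes both tasks finish at the same moment.

```python
def split_power(work_a, work_b, total_power):
    """Toy model (assumption): a task's running time is work / power.
    Give each task power proportional to its work so both finish together."""
    total_work = work_a + work_b
    return (total_power * work_a / total_work,
            total_power * work_b / total_work)

# Example: a big job (900 units of work) and a small one (100 units) sharing 20 MW.
p_big, p_small = split_power(900, 100, 20.0)   # 18 MW and 2 MW
# Both finish at t = 900 / 18 = 100 / 2 = 50, so the full 20 MW frees up at once.
```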

That last idea is exploited by the “Divide and Conquer” algorithm, which looks ahead and puts the emphasis on when tasks will finish, instead of just starting whatever is available whenever resources are free. For a typical computing job, the algorithm looks at the full workload, commonly depicted with nodes and edges as a directed acyclic graph (DAG), and recursively divides the problem into subproblems. Those subproblems can then be organized to run concurrently where possible, and assigned different amounts of resources so that they finish together, leaving more resources available to start the next group of subproblems.
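The sketch below extends that proportional-allocation idea to a whole DAG. It is a simplified, hypothetical illustration rather than the authors’ published algorithm: it divides the workload layer by layer along precedence depth instead of reproducing the paper’s recursive construction, and it keeps the same toy time-equals-work-over-power model as above.

```python
from collections import defaultdict

def depths(dag):
    """Longest-path depth of each node. `dag` maps every node to its successors
    (nodes with no successors map to an empty list)."""
    preds = defaultdict(list)
    for u, succs in dag.items():
        for v in succs:
            preds[v].append(u)
    memo = {}
    def d(v):
        if v not in memo:
            memo[v] = 1 + max((d(u) for u in preds[v]), default=0)
        return memo[v]
    return {v: d(v) for v in dag}

def schedule(dag, work, power_budget):
    """Return (total_time, power_per_node) under the toy time = work / power model."""
    levels = defaultdict(list)
    for v, depth in depths(dag).items():
        levels[depth].append(v)       # group nodes whose predecessors all finish earlier
    total_time, allocation = 0.0, {}
    for depth in sorted(levels):      # process one precedence layer at a time
        group = levels[depth]
        group_work = sum(work[v] for v in group)
        for v in group:               # power proportional to work, so every task
            allocation[v] = power_budget * work[v] / group_work   # in the layer finishes together
        total_time += group_work / power_budget
    return total_time, allocation

# Example: a diamond-shaped DAG where a feeds b and c, which both feed d.
dag  = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
work = {"a": 4, "b": 9, "c": 1, "d": 2}
t, alloc = schedule(dag, work, power_budget=20.0)
# b and c run concurrently; b gets 18 units of power and c gets 2, so both finish together.
```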

When tested against greedy approaches on DAGs of up to 10,000 nodes in a simulated supercomputer, the Divide and Conquer approach improved performance by as much as 75 percent. That’s more than just a footrace, as the faster performance was achieved using the same amount of power, suggesting significant gains in energy efficiency.

“It's neat because every scheduler of practical use was greedy, and this one is not. And we get a big win by not being greedy,” Hoffmann said. “Divide and Conquer will be harder to implement, so that's going to be an issue, but our first results are really promising and indicate this is actually going to be very valuable in practice as well.”

A Bridge Between Theory and Architecture

The advance was made possible by a productive collaboration across research areas within UChicago CS. Demirci and Kim — graduate students studying theoretical computer science with Professors Janos Simon and Laszlo Babai, respectively — started working on the problem in Hoffmann’s Computer Architecture class, partnering with Marincic, a graduate student in Hoffmann’s systems group.

“I always work with theory and machine learning students to figure out how we can relate what they’re doing to computer architecture,” Hoffmann said. “We said, this is what the algorithms are doing, this is what we're doing in systems, and there's a gap, so why don't we see if we can bridge that gap?”

“At Chicago, we have this historically extremely strong theory department and our systems are strong, but newer. I thought this was a great project that brings the old strength and the new strength together.”

The group’s first paper, “Approximation Algorithms for Scheduling with Resource and Precedence Constraints,” was presented at the Symposium on Theoretical Aspects of Computer Science (STACS) in March 2018. The Supercomputing paper, “A Divide and Conquer Algorithm for DAG Scheduling under Power Constraints,” was presented Wednesday, November 14th.
