“Crash Skipping” Strategy for Approximate Computing Receives GLSVLSI Award
For software developers, avoiding errors and crashes is a top priority. When an application crashes, it wastes time and resources, because the program must either roll back to the last checkpoint or start over entirely. But in the growing realm of approximate computing, where the output of applications doesn’t need to be exactly right but just “good enough,” avoiding crashes by giving these errors the silent treatment might help save energy and time.
That’s the idea behind “Crash Skipping,” a new technique created by UChicago CS graduate Yan Verdeja Herms and assistant professor Yanjing Li for efficient error recovery in approximate computing environments. The research received the best paper award at the ACM Great Lakes Symposium on VLSI (GLSVLSI) conference held in May 2019 in Washington, D.C.
The approach holds promise for approximate applications used in machine learning, computer vision, data mining, and other increasingly popular functions that don’t always require precise accuracy. Following the philosophy of “perfect is the enemy of good,” Crash Skipping largely ignores errors that would normally cause the application to crash, trusting that the end result will still be acceptable.
“The notion of allowing detected errors to propagate unchecked seemed strange, because one generally tries to avoid them. But sometimes starting with an aggressive sounding solution can yield pleasantly surprising results,” said Verdeja Herms. “If the goal of approximate computing is to relax reliability to achieve performance and energy efficiency benefits, crashes actually occur much more frequently than unacceptable outputs. We show that novel ways to handle crashes can provide large benefits.”
With Crash Skipping, when the program raises an exception, the faulting instruction is treated as a “nop,” the computer instruction for doing nothing, instead of triggering a crash. The program then resumes at a later point in its execution and runs to completion if possible.
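To make the idea concrete, here is a minimal Python sketch of the nop-style recovery described above. It is a toy illustration, not the authors' implementation: the real technique deals with exceptions raised while a program executes, whereas this sketch models a "program" as a list of Python callables. The names `run_with_crash_skipping` and `make_accumulate` are hypothetical.

```python
def run_with_crash_skipping(instructions, state):
    """Run each instruction in order; an instruction that raises is
    treated like a nop and execution continues with the next one."""
    skipped = 0
    for instr in instructions:
        try:
            instr(state)
        except Exception:
            # Assumption for this toy model: any exception counts as a
            # would-be crash, and we skip only the faulting instruction.
            skipped += 1
    return state, skipped


def make_accumulate(i):
    """Build an instruction that adds data[i] to a running total."""
    def instr(state):
        state["total"] += state["data"][i]
    return instr


# Index 7 is out of range, so the third instruction would normally crash.
program = [make_accumulate(i) for i in (0, 1, 7, 2)]
state, skipped = run_with_crash_skipping(
    program, {"data": [1.0, 2.0, 3.0], "total": 0.0}
)
print(state["total"], skipped)
```

In this toy run the out-of-range access is skipped and the sum of the valid elements is still produced. Whether a result obtained this way is acceptable depends on the application's own quality threshold.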
Verdeja Herms and Li considered Crash Skipping a success when the program didn’t require a restart and still produced a result within the application’s acceptable range. When tested on a suite of approximate applications, the approach was successful 56% of the time, improving performance and energy usage by an average of 33%.
“The most important benefit of Crash Skipping is the large energy savings it can bring, especially given the approach is so elegantly simple and imposes almost zero cost,” Li said. “Also, it can be completely automated and does not require manual programmer input, unlike other related work.”
That said, the approach can be further improved by giving developers control over how “far” to skip past the crash and the maximum number of crashes to skip. Future versions may also first automatically analyze program code to sort out critical and non-critical crashes, then skip only the latter group.
“These knobs control how aggressively we perform crash skipping, so as to maximize the chance of successfully skipping all crashes and reaching the end of the program with acceptable outputs, while avoiding ‘over-skipping’ which can lead to unacceptable outputs,” the authors said. “We came up with a systematic and automatic algorithm to determine the parameters for these two knobs and our algorithm works very well, as shown in our results.”
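The two knobs can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not the authors' algorithm: it treats a program as a list of callables, interprets "how far to skip" as a count of instructions to jump past (including the faulting one), and reports failure once the skip budget is exhausted so that the caller could fall back to a restart. The names `run_with_knobs`, `skip_distance`, and `max_skips` are hypothetical, and the sketch does not attempt the automatic parameter selection mentioned in the quote.

```python
def run_with_knobs(instructions, state, skip_distance=1, max_skips=3):
    """Execute instructions with two tunable crash-skipping parameters.

    skip_distance: how many instructions to jump past on a fault
                   (assumed to include the faulting instruction).
    max_skips:     how many faults may be skipped before giving up.
    Returns (state, skips_used, completed).
    """
    pc = 0      # index of the next instruction to execute
    skips = 0
    while pc < len(instructions):
        try:
            instructions[pc](state)
            pc += 1
        except Exception:
            if skips >= max_skips:
                # Budget exhausted: report failure instead of skipping more.
                return state, skips, False
            skips += 1
            pc += skip_distance
    return state, skips, True


def scale(state):
    state["x"] = state["x"] * 2

def faulty(state):
    state["x"] = state["x"] / 0   # raises ZeroDivisionError

def add_one(state):
    state["x"] += 1


program = [scale, faulty, add_one, scale]
near = run_with_knobs(program, {"x": 1.0}, skip_distance=1, max_skips=2)
far = run_with_knobs(program, {"x": 1.0}, skip_distance=2, max_skips=2)
strict = run_with_knobs(program, {"x": 1.0}, skip_distance=1, max_skips=0)
print(near, far, strict)
```

The larger skip distance in this example also drops `add_one`, so the final value differs. That is a small-scale picture of the "over-skipping" trade-off the authors describe.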
Verdeja Herms contributed to the paper as an undergraduate before receiving a B.S. in computer science in 2018. He is now a software engineer at the principal trading firm DRW, where he works on low-latency trading strategies.
“UChicago CS certainly gave me the in-depth understanding of computer systems that has helped me contribute meaningfully to my work here,” Verdeja Herms said. “My research project with Dr. Li additionally gave me the tools and experience to approach complex problems methodically, as well as the ability to learn from and collaborate with others beyond the classroom.”