The concurrency bug is one of the computer programmer’s most notorious enemies. Occurring when two pieces of code execute simultaneously, producing a corrupted result, these bugs are easy to make but hard to find and fix. Concurrency bugs have been blamed for medical accidents, massive regional blackouts, and online theft. When researchers find a new method for finding concurrency bugs, they consider it a success if it locates and corrects even a few errors in software.
But last month at the 2019 ACM Symposium on Operating Systems Principles (SOSP) meeting in Canada, a paper from UChicago CS, Microsoft Research, and the University of California, Berkeley revealed a new approach that captured thousands of concurrency bugs. The accomplishment was so far beyond past attempts, it earned the conference’s Best Paper award, a prestigious honor at this very selective, high-impact biennial conference.
Led by Guangpu Li, a graduate student advised by UChicago CS Professor Shan Lu, the paper presented an out-of-the-box approach to detecting concurrency bugs that outperformed decades of research on the topic.
“It's a very simple idea for a very challenging problem,” Lu said. “People have been looking at this problem for more than 40 years, and from that very early time, it was the same idea: that you just analyze the program using very expensive algorithms and compute the logical time. What we're saying is a different idea, and it's very simple. We were worried that it was too simple…but then the result is very good.”
The paper focused on a particular type of concurrency bug called thread-safety violations, or TSVs, which often evade established testing methods and show up in production code. Previous approaches for finding concurrency bugs either require computationally intensive analysis of code to find potential conflicts, or require running tests many times over to sniff out bugs.
Instead of further tweaks to these techniques, Li and his co-authors tried a totally new approach, seeking a middle ground with low overhead and high accuracy, detecting many more bugs without raising the false positive rate. Their concurrency testing tool first detects “near-misses,” where two parts of the code execute on the same object close to each other in time. It then injects delays of different durations to only these potentially problematic pairs, ruling out the rest, in an attempt to detect whether a true concurrency bug might occur. Depending on how effective each delay injection is — for example, does it make two code elements execute closer to each other or not — the delay-injection plan is further refined. The two-step process can analyze a new program in one run, saving developers both time and the cost of running multiple tests or expensive program analysis.
“In the first step, we observe two pieces are code that are close by, and in the next step, you put in some delays to try to push them to execute together,” Lu said. “So you don’t need to do any complicated analysis. It's basically just a very lightweight approach that turns out to work very well.”
The tool was also designed to work “in the large,” for deployment in the kind of integrated build and test environment used by software companies such as Microsoft. At this scale, developer teams around the world constantly update thousands of software projects, and so any bug-testing tool must be fast, computationally efficient, and easy to use.
During his summer internship at Microsoft in 2018, Li tested out the new tool in the Microsoft integrated build and test environment. In 43,000 software modules, the tool found over 1,000 previous unknown bugs — nearly half of which were considered high-priority bugs that could have caused service outages. Since this successful trial, Microsoft has continued using the tool as part of their regular testing procedures.
“Currently, any Microsoft developer can use this tool,” Li said. “Usually when a developer sends a testing job to the data center, he or she can choose some configuration, saying I want to use this machine or whether I want to use this tool. And now it's an option, the developer can check and use our tool, then everything will be handled by the testing data center. It will automatically apply our tool to their program, then give them a report of whether it found bugs or not.”
While the Microsoft implementation was built to run on code written in the C# language, the team believes that the approach can be ported over to other programming languages in the future. The code for the TSVD (Thread Safety Violation Detector) tool is available on Github.
The paper, “Efficient Scalable Thread-Safety-Violation Detection,” is available through ACM and Microsoft. In addition to Li and Lu, co-authors include Madanlal Musuvathi and Suman Nath of Microsoft Research and Rohan Padhye of University of California, Berkeley.