A new method to measure the tidiness of data repositories and help researchers clean up their “data swamps” took first place in the ACM Student Research Competition for Undergraduates at the 2018 Supercomputing conference in Dallas, TX. The project, conducted by students Luann Jung and Brendan Whitaker with advisors Kyle Chard and Aaron Elmore of UChicago CS, was part of last summer’s BigDataX Research Experiences for Undergraduates (REU) program, held at UChicago and the Illinois Institute of Technology.
Anyone who has used a computer knows how easy it is for data to become disorganized. Without the strictest drag-and-drop discipline, folders quickly become cluttered with files of different formats or subject matter, making it difficult or impossible to find the right file when needed in the future. If this disorder is a problem for personal computers, it is magnified exponentially in the archives of scientific projects that hold terabytes and terabytes of data.
Over the 10 weeks of the BigDataX program, Jung, a first-year student at MIT, and Whitaker, a fourth-year at Ohio State University, worked with a group of UChicago CS researchers seeking solutions for this issue, colloquially known as the “data swamp.” In a CERES-funded project, Chard, Elmore, Ian Foster, Michael Franklin and Blase Ur are developing automated processes that reorganize data repositories or databases and make the information within more reusable and discoverable. But to make these improvements, researchers need a way to measure what they’re improving: a number that captures the dirtiness or cleanliness of a given repository.
“It's an important challenge because a huge portion of scientific research nowadays is heavily reliant on statistical inference from large quantities of data,” Whitaker said. “The work-hour split of data scientist is often said to be nearly 90% preprocessing and 10% inference, training, and testing. It's this ‘heterogeneous’ quality that necessitates such a time sink, and it's my view that quantifying this quality is the first step in creating systems that do it automatically and do it well.”
Measuring Clutter With Clusters
In their paper, “Measuring Swampiness: Quantifying Chaos in Large Heterogeneous Data Repositories,” Jung and Whitaker constructed a parallel pipeline that uses clustering methods to quickly assess a given repository for its tidiness. The pipeline processes text files and tabular data (such as CSV or TSV files) and clusters them according to their shared features.
Broadly, the cleanliness score is then calculated from how well those clusters map onto file directories in the repository. If a given cluster is heavily represented in a small number of directories and appears only rarely elsewhere, that’s considered well-organized. Conversely, a cluster that appears fairly often across many different directories, with no sharp drop-off between the directories where it is common and those where it is rare, is considered a sign of disorder.
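To make the idea concrete, here is a minimal Python sketch of one way such a score could be computed. It is not the formula from Jung and Whitaker's paper, which this article does not spell out. It assumes each file has already been assigned a cluster label, and it uses a swapped-in technique: the entropy of each cluster's spread across directories, weighted by cluster size. The function name `tidiness_score` and the input format are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def tidiness_score(assignments):
    """Illustrative score in [0, 1], not the published metric.

    assignments: iterable of (directory, cluster_label) pairs, one per file.
    1.0 means every cluster lives in a single directory; 0.0 means every
    cluster is spread evenly across all directories.
    """
    assignments = list(assignments)
    if not assignments:
        raise ValueError("need at least one file")

    directories = {directory for directory, _ in assignments}
    if len(directories) < 2:
        return 1.0  # a single directory leaves nothing to be scattered across

    # Count how many files of each cluster sit in each directory.
    per_cluster = defaultdict(Counter)
    for directory, cluster in assignments:
        per_cluster[cluster][directory] += 1

    max_entropy = log(len(directories))  # entropy of a perfectly even spread
    total = len(assignments)
    score = 0.0
    for counts in per_cluster.values():
        size = sum(counts.values())
        entropy = -sum((n / size) * log(n / size) for n in counts.values())
        concentration = 1.0 - entropy / max_entropy
        score += (size / total) * concentration  # weight by cluster size
    return score


# Hypothetical toy repositories: same files, different organization.
tidy = [("text", "prose")] * 3 + [("tables", "csv")] * 3
messy = [("a", "prose"), ("b", "prose"), ("a", "csv"), ("b", "csv")]
print(tidiness_score(tidy), tidiness_score(messy))
```

In this sketch, a cluster packed into one directory contributes a concentration of 1, while a cluster spread evenly over every directory contributes 0, which mirrors the contrast between order and disorder described above.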
To test their new score, Jung and Whitaker created synthetic datasets where they could manually shuffle how the files were organized, and also used a real data repository from the Carbon Dioxide Information Analysis Center. In both evaluations, run using the Chameleon cloud computing testbed, the new cleanliness score outperformed previously published measures.
Overall, the project combined many of the skills emphasized by the BigDataX program, which is designed to “promote a data-centric view of scientific and technical computing, at the intersection of distributed systems theory and practice.” At Supercomputing, the project beat out dozens of competitors for first place in the undergraduate category, which also qualifies it for the ACM Student Research Competition Grand Finals.
“I definitely gained experience and practice using machine learning in different contexts. Additionally, the process of writing, organizing, and parallelizing our code together was a good exercise in research collaboration,” Jung said. “Winning the competition was very unexpected but a pleasant surprise.”
Their cleanliness metric will also be used in the ongoing “Data Swamp” project, including as a comparison point for current surveys of how human observers assess the disorganization of repositories.
“I think overall it’s a really hard problem, and one that a lot of organizations deal with. Part of why they won is also their novelty in thinking about how they define their ‘swamp’ metric, which is hard to do given the different types of files found, such as text files, images, and CSVs,” Elmore said. “In just ten weeks, Luann and Brendan provided us with a good foundational component that we can use for future work.”
Applications for the 2019 BigDataX REU program are now open, with a deadline of March 1st, 2019. Join us in Chicago!