Over the last decade, the bottleneck for data analytics has shifted from the collection of information to the analysis of increasingly massive and unwieldy datasets. The gap is only growing as the Internet of Things brings online more devices capable of relentless data collection, from smart electric meters to cheap home sensors. Methods designed for analyzing or comparing a format as straightforward as time series data become untenable when applied to thousands or millions of time series, forcing researchers to work with compressed or reduced data.
With a new fellowship from data services company NetApp, UChicago CS postdoctoral researcher John Paparrizos hopes to reduce this compromise through new approaches that enable multi-faceted analysis on compressed, large-scale data. By making it possible to run clustering, classification, prediction, and other analytic tasks on data while it is still compressed, these approaches can help researchers avoid the headaches of working with raw, overabundant data without sacrificing the fine detail of observations.
Specifically, Paparrizos, a researcher in the laboratory of Liew Family Chair of Computer Science Michael J. Franklin, will build a unified approach to support several different analytic tasks on compressed data: indexing, classification, clustering, sampling, and visualization. Previous papers have largely focused on specialized approaches that handle one task at a time and are designed with a particular dataset in mind, making it hard for users to generalize these approaches to different settings and applications.
“For example, when an algorithm requires the use of a particular distance measure to compare time series, you have limitations on what kind of compression method you can use and, therefore, what kind of indexing mechanism you can use to accelerate the computation,” Paparrizos said. “In this project, our goal is to automatically learn to effectively compress time series such that the low-dimensional data are compatible with classic, well-studied, indexing mechanisms and, importantly, preserve the invariance to time-series distortions offered by user-defined comparison methods in the high-dimensional space.”
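To make the idea of compression that stays "compatible" with indexing more concrete, here is a minimal Python sketch of one classic, hand-designed technique, Piecewise Aggregate Approximation (PAA). It is not the learned method Paparrizos describes, and the function names are illustrative assumptions. PAA replaces each time series with the averages of equal-width segments, and a scaled Euclidean distance between the compressed series never exceeds the Euclidean distance between the originals. That lower-bounding property is what lets an index discard distant candidates without missing true matches. It holds for Euclidean distance specifically; other distance measures need their own guarantees.

```python
import math
import random


def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-width segment."""
    n = len(series)
    if n % segments != 0:
        raise ValueError("series length must be divisible by the number of segments")
    width = n // segments
    return [sum(series[i:i + width]) / width for i in range(0, n, width)]


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paa_lower_bound(pa, pb, original_length):
    """Distance between PAA summaries, scaled so that it never exceeds
    the Euclidean distance between the original series."""
    width = original_length / len(pa)
    return math.sqrt(width) * euclidean(pa, pb)


# Two noisy, slightly shifted sine waves stand in for real sensor readings.
random.seed(0)
a = [math.sin(t / 8) + random.gauss(0, 0.1) for t in range(128)]
b = [math.sin(t / 8 + 0.5) + random.gauss(0, 0.1) for t in range(128)]

true_dist = euclidean(a, b)                              # computed on 128 values
bound = paa_lower_bound(paa(a, 16), paa(b, 16), len(a))  # computed on 16 values

print(f"full distance: {true_dist:.3f}, compressed lower bound: {bound:.3f}")
```

In this sketch each series shrinks from 128 values to 16, and a search can skip any candidate whose compressed distance already exceeds the best match found so far. The project described here aims for a similar guarantee with compression that is learned rather than fixed, and with user-defined comparison methods rather than plain Euclidean distance.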
The project will evaluate the effectiveness of that approach on datasets from two real-world applications — high-resolution energy usage information collected by utility companies from smart meters and image data from satellites capturing Earth’s surface over time. Currently, researchers often need to reduce the dimensionality of these datasets in order to conduct comparisons and other analyses, losing accuracy in the process.
“Most of the highly accurate algorithms are very difficult to scale when you have databases with more than 100,000 time series, so for millions of time series, you need to find better ways to compress the data in order to offer a scalable solution,” Paparrizos said. “The challenge is to demonstrate minimal loss in accuracy while performing analytics on large-scale time-series collections.”
After development and testing, Paparrizos will work to integrate the new methods into popular large-scale analytics software, such as Apache Spark. The NetApp fellowship provides funding for one year of work on the project. To read more about Paparrizos’ fellowship, visit the NetApp website.