When people ask a question of a database, they don’t always want or need an exact answer. In fast-moving industries such as finance or in conditions where the data is always growing and changing, an approximate answer delivered quickly is often good enough. Approximate query processing, or AQP, seeks approaches that balance speed, efficiency, and accuracy in situations where a “perfect” calculation is unnecessary.
A new paper from the research group of UChicago CS assistant professor Sanjay Krishnan addresses this goal with a new solution: JanusAQP. In work accepted for next year’s IEEE International Conference on Data Engineering (ICDE), former postdoctoral researcher Stavros Sintos and PhD graduate Xi Liang joined Krishnan to create a new system that quickly and efficiently delivers query results for dynamic databases.
The work capped three years of research that combined Sintos’ expertise in computational geometry with Liang’s pursuit of new algorithmic approaches for database structures and query processing. Their collaboration produced a fresh perspective on a problem that has faced database researchers for nearly 40 years, Krishnan said.
“People have long thought that AQP is a great idea; we want to cut down resources, we want to make things faster,” Krishnan said. “But there are challenges of sparsity and dynamism. Approximations are really good at capturing general properties of your data, but often, you’re interested in outliers, or very rare things that are happening inside your data set. And with real world data, any use case that is large means that it’s usually accumulating data over time. This paper is the closest to a practical AQP system that really addresses these two challenges that you will see out there.”
The project began with a 2020 paper by Liang, now a software engineer at Databricks, with Krishnan, former postdoctoral researcher Zechao Shang, and UChicago CS faculty Aaron Elmore and Michael Franklin. In that paper, accepted at the ACM Special Interest Group on Management of Data (SIGMOD) conference, the team described a framework for dealing with missing data when calculating basic data queries such as SUM, COUNT, and AVG.
Their solution relied upon a partitioned data structure that broke the full dataset down into smaller parts for faster analysis. But the strategy required a high precomputation cost and only worked with static data, Liang said.
Enter Sintos, now an assistant professor at the University of Illinois Chicago, who brought his knowledge in computational geometry to bear on the partitioning approach. A 2021 paper described a new data structure akin to a “tree” of pre-computed data aggregates representing the overall dataset, which can then be quickly sampled for approximate results.
“Data has geometry,” Sintos said. “You can imagine that there are some parts of the space that the data lie on that are more important than others. So instead of just taking uniform samples over everything, we focus on the areas that have more data, or where the value of data is high. We precompute some statistics by building a partition over the data, splitting the data into smaller buckets. Then when we have a query, we can use the samples from these buckets along with the precomputed statistics, and in the end, we combine all of these together to return the final result.”
For the JanusAQP system, the authors further improved these tree partition structures to work with dynamic data, where new information is continuously added. Liang, Sintos, and Krishnan built a “catch-up” process to ingest new data and rebuild the structure while still remaining responsive to queries in real time.
“The system has to somehow place the new data into the data structure that has already been learned,” Liang said. “That catch-up process can be done in the background. Meanwhile, the system can deal with the new data online, while the background can still keep learning from the historic data.”
While Liang and Sintos have moved on to new positions in industry and academia, Krishnan’s group will continue building upon their work. One promising area is blending the AQP approach with machine learning in a mutually beneficial relationship.
“There’s this rich work in computational geometry and data structures that is often ignored, and they’re just a great tool to have,” Krishnan said. “They are space-efficient, they can be updated easily, they’re extremely flexible to capture different types of data and different data types. So how can we use them to make machine learning more resource efficient? I do think that Xi and Stavros’ work has pushed the envelope on approximation very significantly, much more significantly than I would have thought when this project started.”