MS Presentation: Xi Liang
Fast and Reliable Missing Data Contingency Analysis
with Predicate-Constraints
Today, data analysts largely rely on intuition to determine whether
missing or withheld rows of a dataset significantly affect their
analyses. We propose a framework that automatically produces a
contingency analysis, i.e., the range of values an aggregate SQL query
could take, under formal constraints describing the variation and
frequency of missing data tuples. We describe how to process SUM,
COUNT, AVG, MIN, and MAX queries under these conditions, resulting in
hard error bounds with testable constraints. We propose an optimization
algorithm based on an integer program that reconciles a set of such
constraints, even if they are overlapping, conflicting, or
unsatisfiable, into such bounds. We also present a novel formulation
of the Fractional Edge Cover problem to account for cases where
constraints span multiple tables. Our experiments on four datasets
against several statistical imputation and inference baselines show
that statistical techniques can have a deceptively high error rate
that is often unpredictable. In contrast, our framework offers hard
bounds that are guaranteed to hold if the constraints are not
violated. Despite these hard bounds, we show accuracy competitive
with statistical baselines.
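
To make the idea of a contingency range concrete, here is a minimal
sketch of hard bounds for SUM and COUNT in the simplest setting: each
constraint covers a disjoint group of missing rows and states a value
range and a row-count range. The names Constraint, sum_bounds, and
count_bounds are hypothetical, chosen for this illustration only. The
sketch does not implement the integer program or the Fractional Edge
Cover formulation described above, which handle overlapping and
multi-table constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """Hypothetical constraint on one disjoint group of missing rows."""
    lo: float          # smallest value a missing row may take
    hi: float          # largest value a missing row may take
    max_rows: int      # at most this many rows are missing
    min_rows: int = 0  # at least this many rows are missing


def sum_bounds(observed_sum, constraints):
    """Range of SUM over observed plus missing rows (disjoint constraints only)."""
    low = high = observed_sum
    for c in constraints:
        # k missing rows contribute between k*lo and k*hi; both are linear
        # in k, so the extremes occur at min_rows or max_rows.
        low += min(c.min_rows * c.lo, c.max_rows * c.lo)
        high += max(c.min_rows * c.hi, c.max_rows * c.hi)
    return low, high


def count_bounds(observed_count, constraints):
    """Range of COUNT over observed plus missing rows (disjoint constraints only)."""
    low = observed_count + sum(c.min_rows for c in constraints)
    high = observed_count + sum(c.max_rows for c in constraints)
    return low, high


# Illustrative numbers, not taken from the experiments.
constraints = [
    Constraint(lo=0.0, hi=50.0, max_rows=10),
    Constraint(lo=-5.0, hi=20.0, max_rows=4, min_rows=1),
]
print(sum_bounds(1000.0, constraints))   # (980.0, 1580.0)
print(count_bounds(100, constraints))    # (101, 114)
```

As long as the stated constraints hold, the true SUM and COUNT must
fall inside the printed ranges, which is the sense in which the bounds
are hard rather than statistical.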
Xi Liang
Xi's advisor is Prof. Sanjay Krishnan.