Foundations of Data Science: Algorithms, Models, Explainability
Building a theory for data science involves formulating new theoretical frameworks for important applications, as well as developing efficient and reliable solutions for the associated computational challenges. Central themes of my research include new models and algorithms for bioinformatics and trustworthy machine learning. In this talk, I will first describe my work on trustworthy machine learning, presenting a new model for explainable k-means clustering based on small decision trees and a new algorithm for finding a tree-based clustering with provably low cost. This work is the first to identify an unsupervised learning problem where explainable-by-design algorithms do not suffer a large loss in effectiveness. Turning to DNA data storage, I will provide an overview of this exciting, emerging technology, which promises orders-of-magnitude improvements in density and longevity over existing storage media. However, efficiently retrieving data that has been stored in DNA requires solving many interesting theoretical and practical problems. I will survey my contributions in this area, including efficient DNA synthesis methods, a distributed clustering algorithm for edit distance, and new statistical reconstruction algorithms. Next, I will discuss how to reconstruct node-labeled trees when given samples from an appropriately defined deletion channel. This involves new combinatorial and statistical algorithms, and it showcases a difficult model where worst-case reconstruction is possible with a polynomial number of samples. Finally, I will share my plans for future work in the areas of statistical reconstruction, trustworthy machine learning, and applied algorithms more generally.
Host: Sanjay Krishnan
Cyrus Rashtchian is currently a postdoc in the Computer Science & Engineering department at the University of California, San Diego. He received his Ph.D. in Computer Science & Engineering from the University of Washington, Seattle, in 2018, advised by Paul Beame, and his B.S. in Computer Science from the University of Illinois, Urbana-Champaign. He has broad research interests in the foundations of data science, including DNA data storage, robust and explainable machine learning, statistical reconstruction, clustering, and distributed algorithms. In general, he applies diverse geometric and algorithmic tools to problems in data science, with a keen eye for new applications. Prior to UCSD, he completed research internships at Facebook Reality Labs, Microsoft Research, and Cray. He has published in top machine learning and theoretical computer science conferences, including SODA, COLT, ITCS, ICML, NeurIPS, and AISTATS, and in journals such as Nature Biotechnology and the Annals of Applied Probability. Personal website: http://www.cyrusrashtchian.com