Advisor: Victor Veitch

I’m investigating post-hoc, internal interpretability of large language models from a causal inference perspective. Specifically, what does it mean to locate a human-understandable concept inside a model, and how can we validate claims that a concept has been located?

I’m particularly motivated by applications to AI safety, such as monitoring for long-term planning and deception, but I’m also excited about applications to fairness and adversarial robustness.
