Advisor: Victor Veitch
I’m investigating post-hoc, internal interpretability of large language models from a causal inference perspective. Specifically, what does it mean to locate a human-understandable concept, and how can we validate such claims?
I’m particularly motivated by applications to AI safety, such as monitoring models for long-term planning and deception, but I’m also excited about applications to fairness and adversarial robustness.