Stochastics and Statistics Seminar Series
Towards a ‘Chemistry of AI’: Unveiling the Structure of Training Data for more Scalable and Robust Machine Learning
February 21, 2025 @ 11:00 am - 12:00 pm
David Alvarez-Melis (Harvard University)
E18-304
Abstract: Recent advances in AI have underscored that data, rather than model size, is now the primary bottleneck in large-scale machine learning performance. Yet, despite this shift, systematic methods for dataset curation, augmentation, and optimization remain underdeveloped. In this talk, I will argue for the need for a “Chemistry of AI”—a paradigm that, like the emerging “Physics of AI,” embraces a principles-first, rigorous, empiricist approach but shifts the focus from models to data. This perspective treats datasets as structured, dynamic entities that can be transformed through optimization and seeks to characterize their fundamental properties, composition, and interactions. I will then highlight some of our recent work that takes initial steps toward establishing this framework, including principled methods for dataset synthesis and surprising recent findings in dataset distillation.
Bio:
David Alvarez-Melis is an Assistant Professor of Computer Science at the Harvard John A. Paulson School of Engineering and Applied Sciences, where he leads the Data-Centric Machine Learning (DCML) group. He is also a Researcher at Microsoft Research New England and an Associate Faculty member at the Kempner Institute for Natural and Artificial Intelligence. He holds a Ph.D. in Computer Science from MIT and degrees in Mathematics from NYU and ITAM. David’s research seeks to make machine learning more broadly applicable (especially to data-poor applications) and trustworthy (e.g., robust and interpretable) through a data-centric approach that draws on methods from statistics, optimization, and applied mathematics, and that takes inspiration from problems arising in the application of machine learning to the natural sciences.