Towards a ‘Chemistry of AI’: Unveiling the Structure of Training Data for More Scalable and Robust Machine Learning
Abstract: Recent advances in AI have underscored that data, rather than model size, is now the primary bottleneck for performance in large-scale machine learning. Yet despite this shift, systematic methods for dataset curation, augmentation, and optimization remain underdeveloped. In this talk, I will argue for a "Chemistry of AI": a paradigm that, like the emerging "Physics of AI," embraces a principles-first, rigorous, empiricist approach but shifts the focus from models to data. This perspective treats datasets as structured, dynamic…