Foreword by Julian McAuley

Many recent breakthroughs in Machine Learning, including those in Natural Language Processing and Computer Vision, owe as much to having better data as they owe to having better models.

Naturally, modern ML datasets should be large, in order for models to capture their complex underlying semantics. However, having enough data is only a small part of the problem: data must also be processed, appropriately represented, properly sampled, and freed from issues of balance and bias, not to mention the challenge of extracting meaningful predictive information.

A common experience among ML practitioners is that this type of “data munging” occupies more time and effort than modeling. It is also incredibly rewarding, as the collection and curation of new datasets often facilitate the most novel and exciting research, and can represent a significant contribution to the research community.

It is wonderful to see a book that covers the underexplored but important skill of collecting and curating data. I expect this will be useful to practitioners who are beginning to collect their own datasets, or wondering how popular datasets are collected. Such topics are typically missing from academic treatments of machine learning, where the massive task of data collection and preparation is so often glossed over.

I was thrilled to hear Jigyasa and Rishabh were working on this book: both have experience collecting, curating, and modeling large datasets in academic and industrial settings. I expect readers will find the sections on data extraction and data preparation especially useful, as these are the skills I have found most valuable in my own career.

Julian McAuley
Professor, University of California San Diego