Data Engineering for Machine Learning


Data Engineering for Machine Learning is a novel course at the intersection of Systems, Big Data, and Machine Learning. The course focuses on machine learning systems in the real world and on the data-related problems that typically occur in end-to-end machine learning deployments. The lectures cover systems deployed in practice at companies like Google, Twitter, and Amazon, and research from a variety of conferences such as NeurIPS (ML), VLDB / SIGMOD (data management), and OSDI (systems). Students will learn the abstractions and concepts behind ML-related systems, and gain practical experience with use cases such as data and model validation, experiment databases, and data cleaning.

This course will be taught by Sebastian Schelter, an MSDSE Fellow at CDS. Sebastian has practical experience with real-world ML systems, gained as an intern with the recommender systems team at Twitter and as a Senior Applied Scientist at Amazon Core AI Berlin, where he spent three years before joining NYU. He is a committer and PMC member in several systems-related projects of the Apache Software Foundation. Additionally, he mentors the deep learning-related project Apache TVM for the Apache Incubator, and is the creator and chair of the workshop series on ‘Data Management for End-to-End Machine Learning’ at ACM SIGMOD.

Preliminary Syllabus


09/05 - Real-World Machine Learning Systems

end-to-end machine learning systems in the real world; ML engineering; anecdotes; open problems and challenges

09/12 - Systems Foundations

history of large-scale data processing; systems basics: data parallelism, task parallelism, pipeline parallelism; memory hierarchy; end of Moore’s law; distributed filesystems; machine learning in practice; machine learning engineering

Systems for Machine Learning

09/19 - Machine Learning on Distributed Dataflow Systems I

recap: abstractions for parallel data processing (MapReduce / Resilient Distributed Datasets); reformulation of ML algorithms using MapReduce; SparkML

09/26 - Machine Learning on Distributed Dataflow Systems II

domain-specific languages for ML; compilation of ML programs to dataflow systems; limits of scalability; performance issues of distributed dataflow systems

10/03 - Distributed Learning with Parameter Servers

scalable asynchronous learning; Hogwild!-style SGD; limitations of distributed learning; distributed matrix factorization

10/10 - Deep Learning Engines

representation of deep neural networks as computational graphs; auto-differentiation; scheduling of mini-batch SGD; eager execution; optimization for CPU/GPU; deep learning compilers

10/17 - Model Serving Systems

model serving; deploying machine learning models; A/B tests

Data Management for Machine Learning

10/24 - Model and Metadata Management

experiment databases; managing models and metadata in ML experiments; reproducibility in ML; collaboration between data scientists

10/31 - Automated Machine Learning

automating supervised learning; accelerating model selection; relationship between query optimization and model search

11/07 - Data Validation and Data Cleaning

data quality dimensions; data profiling; data cleaning; unit tests for data; constraint suggestion; anomaly detection; missing value imputation

11/14 - Model Validation

operating ML systems; schema validation of features; dataset shift detection

11/21 - Fairness in Automated Decision Making

fairness and bias in ML; measuring fairness; countermeasures

11/28 - Thanksgiving

Case Studies

12/05 - Case Study: Large-Scale Demand Forecasting at Amazon

analysis of a real-world end-to-end ML system from Amazon

12/12 - Invited Guest Lecture with Industry Practitioner

12/19 - Final Exam