Data Engineering for Machine Learning is a novel course at the intersection of Systems, Big Data and Machine Learning. The course focuses on machine learning systems in the real world, as well as on the data-related problems that typically occur in end-to-end machine learning deployments. The lectures cover systems deployed in practice at companies like Google, Twitter and Amazon, as well as research from a variety of conferences such as NeurIPS (ML), VLDB / SIGMOD (data management) and OSDI (systems). Students will learn about the abstractions and concepts behind ML-related systems on the one hand, and gain practical experience with use cases such as data and model validation, experiment databases, and data cleaning on the other.
This course will be taught by Sebastian Schelter, an MSDSE Fellow at CDS. Sebastian has gained extensive practical experience with real-world ML systems, as an intern with the recommender systems team at Twitter, and as a Senior Applied Scientist at Amazon Core AI Berlin, where he spent three years before joining NYU. He is a committer and PMC member in several systems-related projects of the Apache Software Foundation. Additionally, he mentors the deep learning-related project Apache TVM for the Apache Incubator, and is the creator and chair of the workshop series on ‘Data Management for End-to-End Machine Learning’ at ACM SIGMOD.
end-to-end machine learning systems in the real world; ML engineering; anecdotes; open problems and challenges
history of large-scale data processing; systems basics: data parallelism, task parallelism, pipeline parallelism; memory hierarchy; end of Moore’s law; distributed filesystems; machine learning in practice; machine learning engineering
recap: abstractions for parallel data processing (MapReduce / Resilient Distributed Datasets); reformulation of ML algorithms using MapReduce; SparkML
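To illustrate what "reformulation of ML algorithms using MapReduce" can look like, here is a toy sketch (my own construction, not from the course materials): the gradient of least-squares linear regression decomposes into a sum over records, so each record (or partition) can be mapped to a partial gradient, and the partials combined with an associative reduce.

```python
from functools import reduce

def map_partial_gradient(record, w):
    """Map step: per-record gradient contribution 2 * x * (w*x - y)."""
    x, y = record
    return 2.0 * x * (w * x - y)

def reduce_sum(a, b):
    """Reduce step: partial gradients combine by (associative) addition."""
    return a + b

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # y = 2x, so w = 2 is optimal
w = 0.0
for _ in range(50):  # batch gradient descent over the "distributed" dataset
    grad = reduce(reduce_sum, (map_partial_gradient(r, w) for r in data))
    w -= 0.01 * grad / len(data)

print(round(w, 2))  # converges toward 2.0
```

Because the reduce function is associative and commutative, the same computation parallelizes directly over partitions in a dataflow system such as Spark.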
domain-specific languages for ML; compilation of ML programs to dataflow systems; limits of scalability; performance issues of distributed dataflow systems
scalable asynchronous learning; Hogwild!-style SGD; limitations of distributed learning; distributed matrix factorization
representation of deep neural networks as computational graphs; auto-differentiation; scheduling of mini-batch SGD; eager execution; optimization for CPU/GPU; deep learning compilers
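A minimal sketch (my own construction) of the two topics that open this lecture: representing a computation as a graph of nodes and applying reverse-mode auto-differentiation over that graph, the mechanism underlying mini-batch SGD in deep learning frameworks.

```python
class Node:
    def __init__(self, value, parents=(), grad_fns=()):
        self.value = value
        self.parents = parents    # edges of the computational graph
        self.grad_fns = grad_fns  # local derivative w.r.t. each parent
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, (a, b), (lambda g: g, lambda g: g))

def mul(a, b):
    return Node(a.value * b.value, (a, b),
                (lambda g: g * b.value, lambda g: g * a.value))

def backward(out):
    """Propagate gradients from the output back through the graph."""
    out.grad = 1.0
    stack = [out]
    while stack:
        node = stack.pop()
        for parent, grad_fn in zip(node.parents, node.grad_fns):
            parent.grad += grad_fn(node.grad)
            stack.append(parent)

x = Node(3.0)
y = Node(4.0)
z = add(mul(x, x), mul(x, y))  # z = x^2 + x*y
backward(z)
print(x.grad, y.grad)  # dz/dx = 2x + y = 10.0, dz/dy = x = 3.0
```

Real systems additionally schedule and optimize this graph for CPUs/GPUs, which is where deep learning compilers enter the picture.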
model serving; deploying machine learning models; A/B tests
experiment databases; managing models and metadata in ML experiments; reproducibility in ML; collaboration between data scientists
automating supervised learning; accelerating model selection; relationship between query optimization and model search
data quality dimensions; data profiling; data cleaning; unit tests for data; constraint suggestion; anomaly detection; missing value imputation
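The notion of "unit tests for data" can be sketched in plain Python (an illustrative example of the idea, not any particular library's API): declarative constraints such as completeness and uniqueness are evaluated against a dataset, and violations are reported like failing tests.

```python
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None,            "age": 28},
    {"id": 3, "email": "c@example.com", "age": None},
]

def is_complete(rows, column):
    """Constraint: no missing values in the column."""
    return all(r[column] is not None for r in rows)

def is_unique(rows, column):
    """Constraint: all values in the column are distinct."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

checks = {
    "id is complete": is_complete(records, "id"),
    "id is unique": is_unique(records, "id"),
    "email is complete": is_complete(records, "email"),  # fails: row 2
}

failed = [name for name, passed in checks.items() if not passed]
print(failed)  # → ['email is complete']
```

Constraint suggestion, also covered in this lecture, automates the authoring of such checks by profiling the data first.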
operating ML systems; schema validation of features; dataset shift detection
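As a hypothetical sketch of schema validation for features (my own example, not a specific tool's API): a schema expected by the model, e.g. inferred from the training data, is checked against incoming serving-time records, flagging missing features, type mismatches, and unseen categorical values.

```python
expected_schema = {
    "age": {"type": int},
    "country": {"type": str, "domain": {"US", "DE", "NL"}},
}

def validate(record, schema):
    """Return a list of schema violations for one serving-time record."""
    errors = []
    for feature, spec in schema.items():
        if feature not in record:
            errors.append(f"missing feature: {feature}")
            continue
        value = record[feature]
        if not isinstance(value, spec["type"]):
            errors.append(f"{feature}: expected {spec['type'].__name__}")
        elif "domain" in spec and value not in spec["domain"]:
            errors.append(f"{feature}: unseen value {value!r}")
    return errors

print(validate({"age": 31, "country": "US"}, expected_schema))   # → []
print(validate({"age": "31", "country": "FR"}, expected_schema)) # two errors
```

Aggregating such violations over time is one simple signal for detecting dataset shift between training and serving.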
fairness and bias in ML; measuring fairness; countermeasures
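As a toy example of measuring fairness (my own illustration, using one common metric in the lecture's scope): the demographic parity difference is the gap in positive-prediction rates between two groups.

```python
predictions = [1, 0, 1, 1, 0, 1, 0, 0]
groups      = ["a", "a", "a", "a", "b", "b", "b", "b"]

def positive_rate(preds, grps, group):
    """Fraction of positive predictions within one group."""
    selected = [p for p, g in zip(preds, grps) if g == group]
    return sum(selected) / len(selected)

# group a: 3/4 positives, group b: 1/4 → demographic parity gap of 0.5
gap = positive_rate(predictions, groups, "a") \
    - positive_rate(predictions, groups, "b")
print(gap)
```

Countermeasures covered in the lecture aim to shrink such gaps, e.g. by reweighting training data or post-processing decisions.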
analysis of a real-world end-to-end ML system from Amazon