CSE234: Data Systems for ML
Instructor: Prof. Hao Zhang

This course was undoubtedly the most valuable learning experience I've had at UCSD. It is by far the most practical course I've taken: Prof. Hao Zhang knows exactly what to teach, and the material is thoroughly up-to-date with the latest advancements in data systems for ML. The field of LLMs is evolving rapidly, but this course builds a solid foundation in LLMSys by covering fundamental concepts and optimizations that are here to stay:
- Dataflow graphs and Autodiff: Prof. Zhang began with an introduction to dataflow graphs, covering both symbolic and imperative programming frameworks. I learned about static and dynamic computational graphs, along with their respective pros and cons. We explored forward- and reverse-mode autodiff, and I had the opportunity to implement an autodiff library as part of the assignments (a toy sketch of the idea appears after this list).
- CPU, GPU, CUDA: Learned about CPU and GPU architectures, their respective memory hierarchies, and how to perform efficient MatMul tiling on this hardware (sketched after this list). This also included discussion of handling control flow and scheduling in CUDA.
- Graph Optimization: Covered automated and template-based graph optimization techniques, including TASO (Tensor Algebra SuperOptimizer), where we explored fully equivalent and partially equivalent graph transformations (a tiny example appears after this list). As part of the readings, I also learned about TVM (Tensor Virtual Machine).
- Operator Compilation: The goal here is to make operators (like MatMul, Conv2D, etc.) run fast on different hardware devices; ML compilers generate efficient kernel code for these operators.
- Memory Optimization: Learned techniques like activation checkpointing, gradient accumulation, and CPU swapping to reduce memory usage (see the gradient-accumulation sketch after this list).
- Parallelization: Prof. Zhang covered this topic in great depth. The idea is to deploy the computational graph across many devices so that large models can be trained efficiently. I learned several parallelization strategies, communication primitives, and synchronous and asynchronous parameter-server designs (a toy data-parallel sketch appears after this list). This was complemented by readings on ZeRO (Zero Redundancy Optimizer, for data parallelism), GPipe (pipeline parallelism), Megatron-LM (tensor parallelism), Alpa (automated inter- and intra-operator parallelism), etc.
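To make a few of these ideas concrete, here are some toy sketches (my own simplified illustrations, not the course's assignment code). First, reverse-mode autodiff: each operation records its inputs and a local backward rule, and gradients flow through the graph in reverse topological order.

```python
# Minimal reverse-mode autodiff sketch (micrograd-style), supporting + and *.
class Value:
    def __init__(self, data, parents=(), backward=lambda: None):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = backward

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad           # d(a+b)/da = 1
            other.grad += out.grad          # d(a+b)/db = 1
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x, y = Value(2.0), Value(3.0)
z = x * y + x          # z = x*y + x
z.backward()
print(x.grad, y.grad)  # 4.0 (= y + 1), 2.0 (= x)
```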
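Next, the tiling idea from the GPU lectures: compute the output in small tiles so the working set fits in fast memory (shared memory and registers on a GPU, caches on a CPU). This NumPy version only illustrates the loop structure, not kernel-level performance.

```python
# Toy blocked (tiled) matmul: each (i, j) output tile accumulates partial
# products over tiles of the shared K dimension.
import numpy as np

def tiled_matmul(A, B, tile=32):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)
```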
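For graph optimization, a classic fully equivalent rewrite in the TASO spirit: two matmuls that share an input can be fused into one matmul over concatenated weights. The variable names here are mine, not TASO's.

```python
import numpy as np

X = np.random.rand(8, 16)
W1 = np.random.rand(16, 32)
W2 = np.random.rand(16, 32)

# Original graph: two separate matmuls over the same input.
Y1, Y2 = X @ W1, X @ W2

# Rewritten graph: one larger matmul, then split the output columns.
Y = X @ np.concatenate([W1, W2], axis=1)
assert np.allclose(Y[:, :32], Y1) and np.allclose(Y[:, 32:], Y2)
```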
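For memory optimization, a gradient-accumulation sketch in PyTorch (my own toy setup): simulate a large batch by accumulating gradients over several micro-batches before a single optimizer step. Activation checkpointing is the complementary trick, dropping intermediate activations in the forward pass and recomputing them during backward (e.g. via torch.utils.checkpoint).

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
accum_steps = 4

opt.zero_grad()
for step in range(16):
    x, y = torch.randn(8, 10), torch.randn(8, 1)   # one micro-batch
    loss = loss_fn(model(x), y) / accum_steps      # scale so gradients average
    loss.backward()                                # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        opt.step()                                 # one update per 4 micro-batches
        opt.zero_grad()
```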
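And for parallelization, a toy simulation of data parallelism's core step: each worker computes gradients on its own data shard, an all-reduce averages them, and every replica applies the identical update. The sum-and-broadcast below stands in for what NCCL would do with a bandwidth-optimal ring or tree all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))                           # shared initial weights
replicas = [W.copy() for _ in range(4)]               # one replica per worker
shards = [rng.normal(size=(8, 4)) for _ in range(4)]  # per-worker data shards
targets = [rng.normal(size=(8, 3)) for _ in range(4)]

def local_grad(W, X, Y):
    # Gradient of 0.5 * ||XW - Y||^2 w.r.t. W, on this worker's shard only.
    return X.T @ (X @ W - Y)

for step in range(3):
    grads = [local_grad(Wk, X, Y) for Wk, X, Y in zip(replicas, shards, targets)]
    avg = sum(grads) / len(grads)                     # "all-reduce" (average)
    replicas = [Wk - 0.01 * avg for Wk in replicas]
    # Replicas never diverge, since all apply the same averaged gradient.
    assert all(np.allclose(Wk, replicas[0]) for Wk in replicas)
```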
The professor also talked about techniques like quantization (both for inference and training), pruning, and decoding. I also learned deep-compression techniques like k-means quantization and linear quantization in depth (a linear-quantization sketch follows).
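As a small illustration of the compression material (my own toy example): linear (affine) quantization maps float weights to int8 with a scale and zero point, whereas k-means quantization would instead cluster the weights and store centroid indices.

```python
import numpy as np

def quantize_linear(w, num_bits=8):
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1   # e.g. [-128, 127]
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = int(round(qmin - w.min() / scale))
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize_linear(w)
print(np.abs(w - dequantize(q, s, z)).max())   # small rounding error, ~scale/2
```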
- LLMSys: Covered everything about transformers, from multi-head self-attention to MLP layers. Prof. Zhang walked us through calculating parameters, FLOPs, and peak memory for models like LLaMA-2 7B (a back-of-the-envelope sketch appears after this list), with an assignment to do the same for DeepSeek. He also went deep into MoE and other attention methods like multi-head latent attention. Another interesting topic was the Chinchilla scaling laws.
- LLM Serving and Inference: Prof. Zhang covered the key challenges and optimizations in LLM serving. LLMs are slow and expensive to serve, often requiring multiple GPUs, and the decode phase of inference is memory-bandwidth bound rather than compute-bound.
- Key optimizations include KV caching to avoid recomputation (a sizing sketch appears after this list), continuous batching for better GPU utilization, PagedAttention (vLLM, which Jensen Huang mentioned in his keynote) to improve KV-cache memory efficiency, and speculative decoding, which uses a smaller draft model to speed up generation.
- Important performance metrics discussed were TTFT (Time to First Token), TPOT (Time per Output Token), and the difference between throughput and goodput. One of the latest advancements is prefill-decode disaggregation (Prof. Zhang's own work), where prefill and decode run on different GPUs to improve efficiency, an approach now widely used in large-scale deployments like DeepSeek-V3.
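Here is the kind of back-of-the-envelope parameter counting the LLMSys lectures walked through, using a LLaMA-2-7B-style configuration (hidden size 4096, 32 layers, SwiGLU MLP with intermediate size 11008, vocabulary 32000, untied embeddings, no biases). The bookkeeping below is my own simplification.

```python
def llama_params(d=4096, n_layers=32, d_ffn=11008, vocab=32000):
    attn = 4 * d * d                  # Wq, Wk, Wv, Wo
    mlp = 3 * d * d_ffn               # gate, up, down projections (SwiGLU)
    norms = 2 * d                     # two RMSNorms per layer
    per_layer = attn + mlp + norms
    embeddings = 2 * vocab * d        # input embedding + output head (untied)
    return n_layers * per_layer + embeddings + d   # + final RMSNorm

n = llama_params()
print(f"{n/1e9:.2f}B params")              # ~6.74B
# Forward pass is roughly 2 FLOPs per parameter per token (ignoring the
# attention-score term), i.e. about 2 * n FLOPs/token.
print(f"~{2*n/1e9:.0f} GFLOPs per token")
```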
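And a rough KV-cache sizing calculation, which is the heart of why decoding is memory-bound: every new token must read the cached K/V of all previous tokens. The numbers below assume a LLaMA-2-7B-like model in fp16 and are my own estimates.

```python
def kv_cache_bytes(seq_len, batch, n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2):
    # Two tensors (K and V) per layer, each of shape [batch, heads, seq_len, head_dim].
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len * batch

print(kv_cache_bytes(seq_len=1, batch=1) / 1024, "KiB per token")   # ~512 KiB
print(kv_cache_bytes(4096, 16) / 2**30, "GiB")   # a 16-way batch at 4k context: ~32 GiB
```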
There’s so much more to reflect on, and I’m sure I’ve missed many interesting aspects in this short review. I plan to dive deeper and share insights through blogs as time permits.
This course was extremely heavy: of the four courses I took in Winter '25, this one alone took as much time and effort as the other three combined. The assignments were closely tied to the class material and incredibly challenging, but I enjoyed every bit of it. The sense of satisfaction upon completing them was truly rewarding.
Winter 2025