Sharing my Reflections on the courses I took as a Data Science (MS) student at UCSD

DSC206: Algorithms for Data Science

Instructor: Prof. Arya Mazumdar

This course covered key data science algorithms across four core areas: Dimensionality Reduction, Clustering, Streaming Algorithms, and Optimization. It emphasized mathematical foundations through derivations and proofs, providing a deeper understanding of these techniques.

  • Dimensionality Reduction: Learned about SVD, low-rank approximations, and PCA. There are two major ways to find the singular vectors: one is the power method and the other is the greedy approach. The HITS algorithm (used for ranking webpages relevant to a query term) is essentially SVD on the web graph. Also learned about the PageRank algorithm, which is based on a Markov chain model.
  • Hashing: The goal is to find similar items in a high-dimensional space. We use hashing to replace large sets/documents with much smaller representations called signatures. One of the prominent techniques for computing signatures is MinHashing. Another way to find similar items is Locality Sensitive Hashing (LSH), where the key idea is to hash the items several times so that similar items are more likely to land in the same bucket than dissimilar items.
  • Data Streaming: Learned about practical algorithms like the Bloom filter (used for spam filtering), the Flajolet-Martin algorithm (used for estimating the number of distinct items in a stream), and the Count-Min Sketch (used for estimating the frequencies of items in a stream). These methods address constraints around limited memory, limited storage, and real-time processing demands. CMS is used in Pandas and in some well-known attention approximations like HyperAttention.
  • Clustering: Learned about center-based clustering methods: k-center, k-median, and k-means. Learned about k-means++, a better initialization strategy for Lloyd's algorithm: it is a randomized variant of farthest-first traversal (whose deterministic form can be fooled by outliers), choosing each new center with probability proportional to its squared distance from the centers picked so far. The professor also covered spectral clustering in great depth, along with its applications in finding clusters in social networks and communities of interest.
  • Optimization and ML: Revisited supervised learning algorithms like the Perceptron and SVM, and optimization algorithms like GD and SGD, along with an analysis of their convergence.
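The power method from the dimensionality-reduction bullet is short enough to sketch: repeatedly applying A^T A to a random vector and renormalizing converges to the top right singular vector, assuming a gap between the first two singular values. The matrix and iteration count below are illustrative, not from the course.

```python
import numpy as np

def top_singular_vector(A, n_iter=100):
    """Power method: iterate v <- A^T A v (normalized) to approach the
    top right singular vector v1; ||A v1|| is the top singular value."""
    rng = np.random.default_rng(0)
    v = rng.normal(size=A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = A.T @ (A @ v)
        v /= np.linalg.norm(v)
    sigma = np.linalg.norm(A @ v)  # corresponding singular value
    return sigma, v

A = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])  # toy matrix
sigma, v = top_singular_vector(A)
s_true = np.linalg.svd(A, compute_uv=False)[0]  # check against full SVD
```

Each iteration amplifies the component along v1 by the squared-singular-value ratio, which is why a modest number of iterations suffices when the spectral gap is large.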
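A minimal MinHash sketch of the hashing bullet, assuming simple hash functions of the form h(x) = (a*hash(x) + b) mod p (an illustrative construction, not necessarily the course's): the fraction of signature positions where two sets agree estimates their Jaccard similarity.

```python
import random

def minhash_signature(s, hash_funcs):
    """Signature = for each hash function, the minimum hash over the set."""
    return [min(h(x) for x in s) for h in hash_funcs]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(42)
p = 2_147_483_647  # a large prime (illustrative)
hash_funcs = [
    (lambda a, b: (lambda x: (a * hash(x) + b) % p))(
        random.randrange(1, p), random.randrange(p))
    for _ in range(200)
]

A = {"apple", "banana", "cherry", "date"}
B = {"banana", "cherry", "date", "fig"}
true_jaccard = len(A & B) / len(A | B)  # 3/5 = 0.6
est = estimate_jaccard(minhash_signature(A, hash_funcs),
                       minhash_signature(B, hash_funcs))
```

With 200 hash functions, the estimate concentrates tightly around the true Jaccard value; a real system would pick the signature length to hit a target error.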
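A toy Bloom filter along the lines of the streaming bullet. The sizes m and k are arbitrary here; a real deployment would choose them from the expected item count and a target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Bit array plus k hash functions; queries can give false positives
    (item never added but all its bits happen to be set) but never
    false negatives."""
    def __init__(self, m=1024, k=5):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash k ways.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for addr in ["alice@example.com", "bob@example.com"]:
    bf.add(addr)
```

The one-sided error is the whole point: a "no" is always trustworthy, so a spam filter can skip the expensive lookup for most clean addresses.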
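The k-means++ initialization from the clustering bullet, sketched with NumPy on synthetic, well-separated blobs (the data and k are illustrative): each new center is sampled with probability proportional to D(x)^2, the squared distance to the nearest center chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++: first center uniform at random; each later center
    sampled proportionally to squared distance from existing centers."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs (illustrative data)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
centers = kmeans_pp_init(X, 3, rng)
```

Sampling by D^2 (instead of deterministically taking the farthest point) is what softens the outlier sensitivity of plain farthest-first traversal.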
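The Perceptron algorithm from the optimization bullet fits in a few lines; the toy dataset below is linearly separable by construction (with the bias folded in as a constant feature).

```python
import numpy as np

def perceptron(X, y, n_epochs=20):
    """Classic perceptron: on each mistake, move w toward y_i * x_i.
    Converges in finitely many updates if the data are separable."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:  # misclassified (or on the boundary)
                w += yi * xi
                mistakes += 1
        if mistakes == 0:           # separating hyperplane found
            break
    return w

X = np.array([[1.0, 2.0, 1.0], [2.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```

The convergence analysis covered in class bounds the number of mistakes by (R/gamma)^2, where R bounds the data norm and gamma is the margin.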

A few key takeaways to remember:
  • PCA is essentially a special case of SVD (on mean-centered data) where we keep only the top singular vectors.
  • In the HITS algorithm, each webpage has two different kinds of scores: a hub score and an authority score.
  • When using MinHash signatures, the probability that two signatures agree at a position equals the Jaccard similarity of the original (high-dimensional) sets, so the fraction of matching positions estimates it.
  • Why do we need LSH when we already have MinHash? Even though the signature per document is small, the number of candidate pairs is huge, and computing all pairwise signature similarities would take weeks.
  • Data streaming algorithms are based on the idea that approximate answers can be computed far faster, and with far less memory, than exact ones.
  • K-means is an NP-hard optimization problem in itself. In practice Lloyd's algorithm is used as a heuristic: each iteration monotonically decreases the k-means cost, so it always converges, but only to a local minimum of the cost.
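The approximate-beats-exact idea is easy to see with a Count-Min Sketch: a small grid of counters replaces an exact frequency table, at the cost of bounded, one-sided over-estimation. The width and depth below are arbitrary choices for illustration.

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; the estimate is the min over rows, which can
    only over-count (hash collisions add, never subtract)."""
    def __init__(self, w=272, d=5):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _pos(self, row, item):
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.w

    def add(self, item, count=1):
        for r in range(self.d):
            self.table[r][self._pos(r, item)] += count

    def estimate(self, item):
        return min(self.table[r][self._pos(r, item)] for r in range(self.d))

cms = CountMinSketch()
stream = ["a"] * 100 + ["b"] * 10 + ["c"]
for item in stream:
    cms.add(item)
```

Every estimate is at least the true count and at most the true count plus collision noise, which is why CMS is safe for "heavy hitter" queries on streams too large to store.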

Winter 2025