Sharing my reflections on the courses I took as a Data Science (MS) student at UCSD

DSC240: Machine Learning

Instructor: Prof. Yu-Xiang Wang

This course was very elementary and I took it only to satisfy my degree requirements. However, it really helped me prepare for interviews and revise many of the core concepts and algorithms in ML. I would go to the classes just to listen to the Prof and take absolutely no notes. The Prof kept the classes interactive throughout the course.

The Prof started with an intro to supervised learning and standard performance metrics like accuracy, precision, recall, the ROC-AUC curve, etc., and gave toy examples for us to work out. The Prof also took a detour into linear algebra and probability theory concepts.
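
To keep those metrics concrete, here is a minimal sketch of my own (not from the course; it assumes scikit-learn and a synthetic dataset) that fits a logistic regression and reports each metric:

```python
# My own sketch, not course code: compute accuracy, precision, recall, ROC-AUC
# for a logistic regression fit on a synthetic binary-classification dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)               # hard labels for accuracy/precision/recall
y_score = clf.predict_proba(X_te)[:, 1]  # probabilities for ROC-AUC

print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("ROC-AUC  :", roc_auc_score(y_te, y_score))
```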

  • Learned about the Perceptron Algorithm (PA) and analysed its convergence by deriving an upper bound on the number of mistakes the algorithm makes. Both the online and offline (batch) perceptron algorithms make a finite number of mistakes on linearly separable data (a small perceptron sketch follows this list).
  • Loss function & surrogate loss: finding the best linear separator (minimising the 0-1 loss) is NP-hard in general (when the data is not linearly separable). However, we can optimise a computationally feasible approximation of the objective and focus more on modeling, i.e. better features and hypothesis classes. This leads to surrogate loss functions like the squared loss, hinge loss, logistic loss, etc. (for binary classification settings).
  • There can be two different goals for a given regression problem: a prediction task (predicting y from x) or an estimation task (inference about the unobserved function 'f'). Losses for regression problems include the L2 loss (aka squared loss), the L1 loss, the Huber loss, etc. In linear regression, the least-squares (OLS) objective can be minimised with GD, SGD, or by directly solving the normal equations (see the least-squares sketch after this list).
  • Regularization prevents overfitting and induces structure in the solution: ridge regression (L2) induces solutions that are smaller but dense, while lasso regression (L1) induces sparse solutions. Regularization can be applied to classification problems as well: L2 maximises the margin, L1 performs feature selection (a ridge-vs-lasso sketch follows this list).
  • Learned how to formulate the quadratic program for hard-margin and soft-margin SVM. Soft-margin SVM is equivalent to minimizing the hinge loss. It has a hyperparameter 'C' that determines how many points become support vectors: as C increases, fewer points become support vectors and the solution approaches the hard-margin SVM (see the SVM sketch after this list).
  • The Prof also talked in depth about the discriminative and probabilistic modeling paradigms (the latter helps quantify uncertainty). Under probabilistic modeling he gave lots of examples of working out MLEs (a tiny MLE sketch follows this list).
  • The Prof talked in great depth about voting classifiers, including bagging, random forest, and boosting algorithms. The main difference between RF and bagging is that RF also introduces randomness into the feature selection at each split, in addition to randomly sampling a subset of the data (see the bagging-vs-RF sketch after this list).
  • Feature expansion: lifting a feature vector to a higher-dimensional space (since even the best linear classifier may not be good enough). This covers methods like kth-order polynomial expansion and the kernel methods. The data often becomes linearly separable in these high-dimensional spaces (a feature-expansion sketch follows this list).
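
Here is a small perceptron sketch of my own (not course code; the toy data and the margin shift are my own choices) showing the online, mistake-driven update on linearly separable data:

```python
# My own sketch of the online perceptron update; not the course's implementation.
import numpy as np

def perceptron(X, y, epochs=10):
    """X: (n, d) features; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    mistakes = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) <= 0:  # mistake (or on the boundary): update
                w += y_i * x_i
                b += y_i
                mistakes += 1
    return w, b, mistakes

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
X[:, 0] += 0.5 * y  # push the two classes apart so there is a clear margin

w, b, mistakes = perceptron(X, y)
print("total mistakes:", mistakes)  # finite on separable data, as the bound promises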
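
A quick least-squares sketch of my own (the step size and synthetic data are arbitrary choices) showing that gradient descent on the OLS objective lands on the same solution as the normal equations:

```python
# My own sketch: closed-form OLS vs gradient descent on the same objective.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed form: solve the normal equations X^T X w = X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on (1/n) * ||Xw - y||^2
w_gd = np.zeros(d)
lr = 0.1
for _ in range(500):
    grad = (2.0 / n) * X.T @ (X @ w_gd - y)
    w_gd -= lr * grad

print("closed form :", w_closed)
print("grad descent:", w_gd)  # the two should agree to several decimals
```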
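
A ridge-vs-lasso sketch of my own (assuming scikit-learn; the alpha values are just illustrative) showing dense vs sparse solutions when only a few features actually matter:

```python
# My own sketch: lasso zeros out most coefficients, ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -3.0, 1.5]  # only the first 3 features matter
y = X @ w_true + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge nonzero coefs:", np.sum(np.abs(ridge.coef_) > 1e-6))  # typically all 20
print("lasso nonzero coefs:", np.sum(np.abs(lasso.coef_) > 1e-6))  # close to 3
```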
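
An SVM sketch of my own (assuming scikit-learn, where C is the penalty on margin violations) showing how the number of support vectors shrinks as C grows:

```python
# My own sketch: count support vectors of a linear soft-margin SVM for several C.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

for C in [0.01, 1.0, 100.0]:
    svm = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<6} support vectors: {svm.n_support_.sum()}")
```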
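
A tiny MLE sketch of my own (the data are simulated coin flips): for Bernoulli data the MLE of the success probability is the sample mean, which a grid search over the log-likelihood confirms:

```python
# My own sketch: the Bernoulli log-likelihood is maximised at the sample mean.
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=1000)  # coin flips with true p = 0.3

grid = np.linspace(0.01, 0.99, 99)
log_lik = data.sum() * np.log(grid) + (len(data) - data.sum()) * np.log(1 - grid)

print("grid argmax :", grid[np.argmax(log_lik)])
print("sample mean :", data.mean())  # the closed-form MLE
```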
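
A bagging-vs-RF sketch of my own (assuming scikit-learn; the dataset and hyperparameters are arbitrary) that makes the structural difference explicit: bagging resamples the data only, while the random forest also subsamples features at each split:

```python
# My own sketch: bagged trees vs a random forest on the same synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print("bagging CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())
print("forest  CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```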
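
A feature-expansion sketch of my own (assuming scikit-learn) on concentric circles: a plain linear model fails, a degree-2 polynomial lift succeeds, and an RBF-kernel SVM gets the same effect without building the features explicitly:

```python
# My own sketch: circles data is not linearly separable in 2-D, but is after lifting.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = LogisticRegression(max_iter=1000).fit(X, y)
lifted = make_pipeline(PolynomialFeatures(degree=2),
                       LogisticRegression(max_iter=1000)).fit(X, y)
kernel = SVC(kernel="rbf").fit(X, y)

print("linear features :", linear.score(X, y))  # near chance level (~0.5)
print("degree-2 lift   :", lifted.score(X, y))  # near 1.0
print("RBF kernel SVM  :", kernel.score(X, y))  # near 1.0
```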

A few key takeaways to remember:
  • If the data is not linearly separable, PA will not converge. PA depends on the order of the datapoints (path dependence); if the dataset is shuffled, PA will make a different number of mistakes.
  • A surrogate loss being convex does not imply that every ML problem using it is convex (e.g., training a neural network with the logistic loss is still a non-convex problem).
  • SGD learns from incremental feedback, whereas GD averages the feedback from all examples before updating the parameters. The single-example SGD gradient is an unbiased estimator of the gradient of the loss over the entire dataset (see the numeric check after this list).
  • PA can be seen as a variant of SGD. PA uses a shifted version of the hinge loss, aka the perceptron loss.
  • In regression problems, we can interpret the fitted coefficients through their sign (a positive or negative association with the label) and their magnitude (how strong the association is, provided the features are on comparable scales).
  • Laplace smoothing is analogous to L2 regularization: just as L2 regularization prevents the model from assigning large values to coefficients, Laplace smoothing prevents Naive Bayes from assigning zero probabilities (see the smoothing sketch after this list).
  • Boosting interpreted from a feature-learning perspective: every new tree is a new (greedily selected) feature, and the final classifier is a linear combination of these learned features.
  • The kernel trick is computing inner products in a high-dimensional space, k(x, x') = <f(x), f(x')>, without explicitly computing the feature map 'f' into that space.
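
A numeric check of my own for the SGD point above (squared loss, random data): averaging the single-example gradients over all examples recovers the full-batch gradient exactly, which is what "unbiased" means for a uniformly sampled example:

```python
# My own sketch: the expected single-example gradient equals the full-batch gradient.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)
w = rng.normal(size=3)

full_grad = 2 * X.T @ (X @ w - y) / len(X)                        # gradient of the mean squared loss
per_example = np.array([2 * x * (x @ w - y_i) for x, y_i in zip(X, y)])
avg_single = per_example.mean(axis=0)                             # expectation over a uniform draw

print(np.allclose(full_grad, avg_single))                         # True
```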
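
And a tiny sketch of my own for the Laplace-smoothing point (the word counts are made up): without smoothing, a word never seen in a class gets probability 0 and wipes out the whole product; adding pseudo-counts keeps every estimate strictly positive:

```python
# My own sketch: word-probability estimates with and without Laplace smoothing.
import numpy as np

counts = np.array([5, 3, 0, 2])  # word counts for one class; word 3 was never seen
vocab_size = len(counts)

unsmoothed = counts / counts.sum()
smoothed = (counts + 1) / (counts.sum() + vocab_size)

print("unsmoothed:", unsmoothed)  # contains an exact 0
print("smoothed  :", smoothed)    # every probability > 0
```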

Winter 2025