CSE291h: Advanced Data-driven Text Mining
Instructor: Prof. Jingbo Shang

This course covered a wide range of NLP topics, starting with foundational concepts such as frequency-based word embeddings and progressing to learned embeddings (Word2Vec, GloVe) and language models, including neural language models, along with the advantages and limitations of each. It also covered Sentiment Analysis, Information Retrieval, and LLMs in fair detail, giving a solid foundation for these more advanced topics.
One of the highlights was learning about the professor’s own work in Phrase Mining, which enhances feature representation and improves textual understanding. We explored three approaches to phrase mining: supervised, unsupervised, and weakly/distantly supervised learning, the last of which was new to me. In the assignments, I mined phrases with SegPhrase (weakly supervised, using manually annotated labels) and AutoPhrase (distantly supervised, leveraging existing knowledge bases like Wikipedia).
Additionally, the professor organized a graded Kaggle competition on multi-class text classification. It was a challenging and intensive experience in which I experimented with various embedding and modeling techniques to achieve a strong score. Overall, as a Data Science major, I found this course highly beneficial. While LLMs are not its main focus, it provides a robust understanding that is essential for working with these models in the future.
Fall 2024