Date |
Topic |
Instructor |
Scriber |
02/05/2025, Wed |
Lecture 01: Syllabus, Principal Component Analysis, and Multidimensional Scaling
[ Class outline ]
[ PCA-MDS slides ]
[Homework 1]:
- Homework 1 [pdf]. Just for fun, no grading; but I'll read your submissions and give you bonus credits.
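A minimal numpy sketch (an illustration, not part of the course code) of the lecture's central fact: classical MDS applied to Euclidean distances reproduces the PCA scores, up to column signs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X = X - X.mean(axis=0)              # center the data

# PCA scores: top-2 left singular vectors scaled by singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)
pca_scores = U[:, :2] * s[:2]

# Classical MDS: double-center the squared Euclidean distance matrix
n = X.shape[0]
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J               # recovers the Gram matrix X @ X.T
w, V = np.linalg.eigh(B)
idx = np.argsort(w)[::-1][:2]
mds_scores = V[:, idx] * np.sqrt(w[idx])

# classical MDS on Euclidean distances equals PCA, up to column signs
for k in range(2):
    c = np.sign(pca_scores[:, k] @ mds_scores[:, k])
    assert np.allclose(pca_scores[:, k], c * mds_scores[:, k], atol=1e-8)
```

The equivalence holds only for Euclidean input distances; with non-Euclidean dissimilarities the two methods part ways, which is the opening for the manifold-learning lectures later in the course.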
|
Y.Y. |
|
02/07/2025, Fri |
Seminar.
[ Mathematics Colloquium ]
- Title: Theoretical Evaluation of Data Reconstruction Error and Induced Optimal Defenses [ announcement ]
- Speaker: Prof. Qi LEI, New York University
- Time: Friday Feb 7, 2025, 10:30am-noon
- Abstract: Data reconstruction attacks and defenses are crucial for understanding data leakage in machine learning and federated learning. However, previous research has largely focused on empirical observations of gradient inversion attacks, lacking a theoretical framework for quantitatively analyzing reconstruction errors based on model architecture and defense methods.
In this talk, we propose framing the problem as an inverse problem, enabling a theoretical and systematic evaluation of data reconstruction attacks. For various defense methods, we derive the algorithmic upper bounds and matching information-theoretical lower bounds on reconstruction error for two-layer neural networks, accounting for feature and architecture dimensions as well as defense strength. We further propose two defense strategies — Optimal Gradient Noise and Optimal Gradient Pruning — that maximize reconstruction error while maintaining model performance.
- Bio:
Qi Lei is an assistant professor of Mathematics and Data Science at the Courant Institute of Mathematical Sciences and the Center for Data Science at NYU. Previously she was an associate research scholar at the ECE department of Princeton University. She received her Ph.D. from the Oden Institute for Computational Engineering & Sciences at UT Austin. She visited the Institute for Advanced Study (IAS)/Princeton for the Theoretical Machine Learning Program. Before that, she was a research fellow at the Simons Institute for the Foundations of Deep Learning Program. Her research aims to develop mathematical groundings for trustworthy and (sample- and computationally) efficient machine learning algorithms. Qi has received several awards/recognitions, including Rising Stars in Machine Learning, in EECS, and in Statistics and Data Science, the Outstanding Dissertation Award, the Computing Innovative Fellowship, and the Simons-Berkeley Research Fellowship.
[ Relevant Reference ]:
- Zihan Wang, Jason D. Lee, Qi Lei. Reconstructing Training Data from Model Gradient, Provably [ link ]
- Sheng Liu*, Zihan Wang*, Yuxiao Chen, Qi Lei. Data Reconstruction Attacks and Defenses: A Systematic Evaluation. [ link ]
- Yuxiao Chen, Gamze Gürsoy, Qi Lei. Optimal Defenses Against Gradient Reconstruction Attacks. [ link ]
|
Y.Y. |
|
02/08/2025, Sat |
Lecture 02: Horn's Parallel Analysis and Random Matrix Theory for PCA, Sufficient Dimensionality Reduction and Supervised PCA (Chap 2: 3-5)
[ slides ] [ Sufficient Dimensionality Reduction and Supervised PCA ]
[Time and Venue]:
- 3:00PM - 5:50PM, G009A, CYT Bldg (80)
[Reference]:
- Horn's Parallel Analysis in R: [ paran.R ]
- Parallel Analysis in Matlab: [ papca.m ]
- Parallel Analysis in Python by LI, Zhen: [ paPCA_curve.py ] [ paPCA_image.py ]
- Marcenko-Pastur Law of Wishart matrices in Matlab: [ mp.m ]
- S&P500 dataset in class: [ snp500.Rda ] [ snp452-data.mat ] [ snp500.txt ]
- [Johnstone06] High dimensional statistical inference and random matrices, ICM 2006.
- [KN08 for multi-rank signal] S. Kritchman and B. Nadler, Determining the number of components in a factor model from limited noisy data, Chemometrics and Intelligent Laboratory Systems 94(1):19-32, 2008.
- [ NB10: multi-rank subspace ] R. R. Nadakuditi and F. Benaych-Georges, The breakdown point of signal subspace estimation, IEEE Sensor Array and Multichannel Signal Processing Workshop (2010), Jerusalem, Israel, 2010, pp. 177-180, doi: 10.1109/SAM.2010.5606726.
- Florent Benaych-Georges and Raj Rao Nadakuditi (2009) The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices.
- [Parallel Analysis: Horn (1965) original paper]
- [Remarks on Parallel Analysis: Buja-Eyuboglu (1992) with random permutation]
- [Raul Rabadan (2018)]: applications of RMT in single cell data analysis
- Dennis Cook, Fisher Lecture: Dimensionality Reduction in Regression. Statistical Science, 22(1):1-26, 2007.
- Ker-Chau Li, Sliced Inverse Regression for Dimension Reduction. Journal of the American Statistical Association, 86(414):316-327, 1991.
- Wu, Liang, and Mukherjee. Localized Sliced Inverse Regression. NIPS 2009. [Matlab codes]
- Jiang B and Liu JS. (2014) Variable selection for general index models via sliced inverse regression. Annals of Statistics, 42:1751-1786. [ R codes ]
- Wolfgang Hardle and Leopold Simar. Applied Multivariate Statistical Analysis. Chapter 18.3: Sliced Inverse Regression.
[Homework 2]:
- Homework 2 [pdf]. Just for fun, no grading; but I'll read your submissions and give you bonus credits.
|
Y.Y. |
|
02/19/2025, Wed |
Lecture 03: High Dimensional Sample Mean: Inadmissibility of MLE and James-Stein Estimators (Chap 2: 1-2) [ slides ]
[Reference]:
- Comparing Maximum Likelihood Estimator and James-Stein Estimator in R: [ JSE.R ]
- Computer Age Statistical Inference, by Efron and Hastie, contains an empirical Bayes derivation of JSE in Section 7.1 and the baseball player example in 7.2: [ link ]
- James-Stein estimator via multi-task ridge regression: Yaqi Duan, Kaizheng Wang (2022) Adaptive and Robust Multi-task Learning, [ arXiv:2202.05250 ]
[Homework 3]:
- Homework 3 [pdf]. Just for fun, no grading; but I'll read your submissions and give you bonus credits.
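A Python analogue of the JSE.R demo linked above (a toy illustration, not the course code): for Z ~ N(theta, I_p) with p >= 3, the James-Stein estimator has strictly smaller total squared-error risk than the MLE Z itself.

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 50, 2000
theta = rng.normal(scale=0.5, size=p)        # fixed true mean vector

mle_risk = js_risk = 0.0
for _ in range(trials):
    z = theta + rng.normal(size=p)           # one observation Z ~ N(theta, I_p)
    js = (1 - (p - 2) / (z @ z)) * z         # shrink the MLE toward the origin
    mle_risk += ((z - theta) ** 2).sum()
    js_risk += ((js - theta) ** 2).sum()
mle_risk /= trials                            # concentrates near p = 50
js_risk /= trials                             # strictly smaller for p >= 3
```

Because theta here is close to the shrinkage target (the origin), the improvement is dramatic; the JS risk is still below p even when theta is far away, just by a smaller margin.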
|
Y.Y. |
|
02/26/2025, Wed |
Lecture 04: Random Projections, Johnson-Lindenstrauss Lemma, and Applications in Compressed Sensing etc. (Chap 3) [ slides ]
[Reference]:
- Joseph Salmon's lecture on Johnson-Lindenstrauss Theory [ JLlemma.pdf ]
- Random Projections in Scikit-learn: [ link ]
- Dennis Amelunxen, Martin Lotz, Michael B. McCoy, Joel A. Tropp. Living on the edge: Phase transitions in convex programs with random data. [ arXiv:1303.6672 ]
[Homework 4]:
- Homework 4 [pdf]. Just for fun, no grading; but I'll read your submissions and give you bonus credits.
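A quick numpy sketch of the Johnson-Lindenstrauss phenomenon (an illustration with arbitrary toy dimensions, not the course code): a Gaussian random projection to far fewer dimensions preserves pairwise distances up to small multiplicative distortion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5000, 800            # project from 5000 down to 800 dimensions
X = rng.normal(size=(n, d))

# Gaussian random projection; the 1/sqrt(k) scaling preserves
# squared norms in expectation
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

# distortion of pairwise distances on 100 random pairs
ratios = []
for i, j in rng.integers(0, n, size=(100, 2)):
    if i == j:
        continue
    ratios.append(np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j]))
ratios = np.array(ratios)
```

The JL lemma guarantees distortion 1 ± eps for all pairs once k is of order (log n)/eps^2, independent of the ambient dimension d; scikit-learn's random_projection module (linked above) packages the same construction.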
|
Y.Y. |
|
03/05/2025, Wed |
Lecture 05: Robust PCA, Sparse PCA, and Graph Realization (MDS) with Uncertainty -- Semidefinite Programming approach [ slides ]
[Homework 5]:
- Homework 5 [pdf]. Just for fun, no grading; but I'll read your submissions and give you bonus credits.
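A compact sketch of Robust PCA as Principal Component Pursuit, solved here by a basic ADMM with the standard parameter choice lam = 1/sqrt(max(m, n)) (a toy illustration; the SDP formulation in lecture is equivalent but this first-order scheme is what one typically runs).

```python
import numpy as np

def svt(M, tau):
    # singular value thresholding: prox operator of tau * nuclear norm
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0)) @ Vt

def rpca(M, lam=None, mu=None, iters=500):
    # Principal Component Pursuit by a basic ADMM: split M = L + S with
    # L low-rank (nuclear norm) and S sparse (l1 norm)
    m, n = M.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    mu = mu or m * n / (4.0 * np.abs(M).sum())
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(iters):
        L = svt(M - S + Y / mu, 1.0 / mu)
        T = M - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0)  # soft threshold
        Y = Y + mu * (M - L - S)
    return L, S

# rank-2 matrix plus 5% gross sparse corruption
rng = np.random.default_rng(0)
m = n = 60
L0 = rng.normal(size=(m, 2)) @ rng.normal(size=(2, n))
S0 = np.zeros((m, n))
mask = rng.random((m, n)) < 0.05
S0[mask] = rng.normal(scale=10, size=mask.sum())
L, S = rpca(L0 + S0)
rel_err = np.linalg.norm(L - L0) / np.linalg.norm(L0)
```

The two prox steps mirror the two regularizers: SVT shrinks singular values (nuclear norm), soft-thresholding shrinks entries (l1 norm).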
|
Y.Y. |
|
03/12/2025, Wed |
Lecture 06: Introduction to Manifold Learning: ISOMAP and LLE (Modified LLE, Hessian LLE, and LTSA) [ slides ]
[Reference]:
- Tenenbaum, Joshua B., Vin de Silva, John C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, Vol. 290, No. 5500, pp. 2319-2323, 22 Dec 2000. [ DOI: 10.1126/science.290.5500.2319 ]
- Roweis, Sam T. and Lawrence K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, Vol. 290, Issue 5500, pp. 2323-2326, 22 Dec 2000. [ DOI: 10.1126/science.290.5500.2323 ]
- Balasubramanian, Mukund and Eric L. Schwartz. The Isomap Algorithm and Topological Stability. Science, Vol. 295, Issue 5552, p. 7, 4 Jan 2002. [ DOI: 10.1126/science.295.5552.7a ]
- V. de Silva and J. B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. Neural Information Processing Systems 15 (NIPS 2002). [ NIPS 2002 ]
- Donoho, D. & Grimes, C. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. U.S.A., 100:5591, 2003. [ doi: 10.1073/pnas.1031596100 ]
- Zhang, Z. & Wang, J. MLLE: Modified Locally Linear Embedding Using Multiple Weights. [ NIPS 2006 ]
- Zhang, Z. & Zha, H. (2005) Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing, 26(1):313-338. [ doi:10.1137/s1064827502419154 ]
[Matlab]:
- IsomapR1 : isomap codes by Tenenbaum, de Silva (isomapII.m with sparsity, fast mex with dijkstra.cpp and fibheap.h)
- lle.m : lle with k-nearest neighbors
- kcenter.m : k-center algorithm to find 'landmarks' in a metric space
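The ISOMAP pipeline in the Matlab codes above can be sketched in Python in a few lines (an independent toy illustration, not a port of isomapII.m): kNN graph, shortest-path geodesic distances, then classical MDS.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap(X, k=8, dim=2):
    # kNN graph -> geodesic (shortest-path) distances -> classical MDS
    n = X.shape[0]
    D = squareform(pdist(X))
    G = np.full((n, n), np.inf)            # inf marks "no edge" for csgraph
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    for i in range(n):
        G[i, nn[i]] = D[i, nn[i]]
    G = np.minimum(G, G.T)                 # symmetrize the neighborhood graph
    geo = shortest_path(G, method="D", directed=False)   # Dijkstra
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (geo ** 2) @ J          # double centering, as in MDS
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# a planar spiral: a 1-D manifold that linear PCA/MDS cannot unroll
t = np.linspace(0, 3 * np.pi, 300)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
Y = isomap(X, k=6, dim=1)
```

The 1-D ISOMAP coordinate tracks arc length along the spiral, which is monotone in the parameter t; the choice of k matters, since too large a k creates "shortcut" edges across turns of the spiral (the topological instability raised by Balasubramanian-Schwartz).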
|
Y.Y. |
|
03/19/2025, Wed |
Lecture 07: Manifold Learning II: Laplacian Eigenmap, Diffusion Map, and Stochastic Neighbor Embedding [ slides ]
[Reference]:
- Mikhail Belkin, Partha Niyogi. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering, Advances in Neural Information Processing Systems (NIPS) 14, 2001, pp. 585-591, MIT Press. [ nips link ]
- R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. PNAS 102 (21):7426-7431, 2005. [doi: 10.1073/pnas.0500334102]
- Nadler, Boaz; Stephane Lafon; Ronald R. Coifman; Ioannis G. Kevrekidis (2005). Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators. Advances in Neural Information Processing Systems (NIPS) 18, 2005. [ .pdf ]
- Coifman, R.R.; S. Lafon. (2006). Diffusion maps. Applied and Computational Harmonic Analysis. 21: 5-30. [ DOI: 10.1016/j.acha.2006.04.006 ]
- Stochastic Neighbor Embedding [ .pdf ]
- Visualizing Data using t-SNE [ .pdf ]
- A paper that relates SNE to Laplacian Eigenmaps [ .pdf ]
- A helpful website: How to use t-SNE effectively? [ link ]
[Matlab]
- Matlab code to compare manifold learning algorithms [ mani.m ] : PCA, MDS, ISOMAP, LLE, Hessian LLE, LTSA, Laplacian, Diffusion (no SNE!)
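Alongside the Matlab comparison code above, here is a minimal Python sketch (a toy illustration, not course code) of the Laplacian eigenmap: heat-kernel weights, then the bottom nontrivial eigenvectors of the normalized graph Laplacian.

```python
import numpy as np

def laplacian_eigenmap(X, sigma=1.0, dim=2):
    # Gaussian heat-kernel weights, then the bottom nontrivial eigenvectors
    # of the symmetric normalized Laplacian, mapped back to
    # random-walk coordinates
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / (2 * sigma ** 2))
    d = W.sum(axis=1)
    Dinv = np.diag(1 / np.sqrt(d))
    Lsym = np.eye(len(X)) - Dinv @ W @ Dinv
    w, V = np.linalg.eigh(Lsym)
    return Dinv @ V[:, 1:1 + dim]     # skip the trivial eigenvalue-0 vector

# points on a circle: the embedding recovers the angle via a (cos, sin) pair
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])
Y = laplacian_eigenmap(X, sigma=0.3, dim=2)
```

The diffusion map of Coifman-Lafon uses the same eigenvectors but weights coordinate k by the eigenvalue power (1 - w_k)^t, encoding the running time t of the diffusion.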
|
Y.Y. |
|
03/26/2025, Wed |
Lecture 08: Random Walk on Graphs and Spectral Graph Theory: Perron-Frobenius (PageRank), Fiedler (Algebraic Connectivity), Cheeger Inequality (Normalized Cut), Lumpability (Spectral Clustering), and Transition Path Theory (Semi-supervised Learning) [ slides ]
[Reference]:
- Amy N. Langville and Carl D. Meyer's book: Google's PageRank and Beyond
- Jim Demmel's courseweb at UC Berkeley for Fiedler Theory and Graph Bipartition: [ link ]
- T. Buehler, M. Hein. Spectral Clustering based on the graph p-Laplacian. Proceedings of the 26th International Conference on Machine Learning (ICML 2009), 81-88.
- James R. Lee, Shayan Oveis Gharan, Luca Trevisan. Multi-way spectral partitioning and higher-order Cheeger inequalities. Proceedings of the forty-fourth annual ACM symposium on Theory of computing (STOC'12), Pages 1117-1130. arXiv:1111.1055.
- Weinan E, Jianfeng Lu, and Yuan Yao. The Landscape of Complex Networks: Critical Nodes and A Hierarchical Decomposition. Methods and Applications of Analysis, special issue in honor of Professor Stanley Osher on his 70th birthday, 20(4):383-404, 2013.
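The Perron-Frobenius part of the lecture can be sketched as PageRank power iteration (a toy illustration on a hypothetical 3-node graph, not from the references above): damping with teleportation makes the chain irreducible and aperiodic, so the stationary distribution exists, is unique, and is the limit of the iteration.

```python
import numpy as np

def pagerank(A, alpha=0.85, tol=1e-10):
    # power iteration on the damped ("teleporting") random walk; by
    # Perron-Frobenius the stationary distribution exists and is unique
    n = A.shape[0]
    out = A.sum(axis=1, keepdims=True)
    # row-stochastic transition matrix; dangling nodes jump uniformly
    P = np.where(out > 0, A / np.where(out > 0, out, 1.0), 1.0 / n)
    pi = np.full(n, 1.0 / n)
    while True:
        new = alpha * (pi @ P) + (1 - alpha) / n
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new

# tiny directed graph: 0 <-> 1, 1 -> 2, 2 -> 0
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [1, 0, 0]], dtype=float)
pi = pagerank(A)
```

The damping factor alpha also bounds the convergence rate: the error contracts by at least alpha per iteration, regardless of the graph's own spectral gap.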
|
Y.Y. |
|