Schedule
| # | Date | Speaker | Papers to be presented |
|---|---|---|---|
| 1 | June 14, 2024 | Yanlei Liu | [D-9] |
| 2 | June 21, 2024 | Chengmei Niu | [D-8] |
| 3 | June 28, 2024 | Jaiqing Liu | [D-2] |
| 4 | July 05, 2024 | Kexin Chen | [D-10] |
| 5 | July 12, 2024 | Muen Wu | [D-5] |
| 6 | July 19, 2024 | Yue Xu | [D-4] |
List of papers
[A] Tensor Programs
- [ ] [A-1] Yang, Greg. “Wide feedforward or recurrent neural networks of any architecture are Gaussian processes.” Advances in Neural Information Processing Systems 32 (2019).
- [ ] [A-2] Yang, Greg. “Tensor Programs II: Neural tangent kernel for any architecture.” arXiv preprint arXiv:2006.14548 (2020).
- [ ] [A-3] Yang, Greg, and Etai Littwin. “Tensor Programs IIb: Architectural universality of neural tangent kernel training dynamics.” International Conference on Machine Learning. PMLR, 2021.
- [ ] [A-4] Yang, Greg, et al. “Tensor Programs VI: Feature learning in infinite-depth neural networks.” arXiv preprint arXiv:2310.02244 (2023).
- [ ] [A-5] Noci, Lorenzo, et al. “Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning.” arXiv preprint arXiv:2402.17457 (2024).
- [ ] [A-6] Li, Ping, and Phan-Minh Nguyen. “On random deep weight-tied autoencoders: Exact asymptotic analysis, phase transitions, and implications to training.” International Conference on Learning Representations. 2018.
[B] Theory and Practice for Transformers
- [ ] [B-1] Cowsik, Aditya, et al. “Geometric Dynamics of Signal Propagation Predict Trainability of Transformers.” arXiv preprint arXiv:2403.02579 (2024).
- [ ] [B-2] Noci, Lorenzo, et al. “Signal propagation in transformers: Theoretical perspectives and the role of rank collapse.” Advances in Neural Information Processing Systems 35 (2022): 27198-27211.
- [ ] [B-3] Malladi, Sadhika, et al. “A kernel-based view of language model fine-tuning.” International Conference on Machine Learning. PMLR, 2023.
- [ ] [B-4] Hayou, Soufiane, Nikhil Ghosh, and Bin Yu. “LoRA+: Efficient Low Rank Adaptation of Large Models.” arXiv preprint arXiv:2402.12354 (2024). To be presented together with the original LoRA paper (Hu et al., “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022).
[C] Random kernel matrices
- [C-1] Xiuyuan Cheng, Amit Singer. “The Spectrum of Random Inner-product Kernel Matrices”. 2012.
- [C-2] Zhou Fan, Andrea Montanari. “The Spectral Norm of Random Inner-Product Kernel Matrices”. 2017.
- [C-3] Z. Liao, R. Couillet, “Inner-product Kernels are Asymptotically Equivalent to Binary Discrete Kernels”, 2019.
- [C-4] Z. Liao, R. Couillet, and M. Mahoney. “Sparse Quantized Spectral Clustering.” 2021.
- [C-5] Yue M. Lu, Horng-Tzer Yau. “An Equivalence Principle for the Spectrum of Random Inner-Product Kernel Matrices with Polynomial Scalings”. 2023.
- [C-6] Sofiia Dubova et al. “Universality for the global spectrum of random inner-product kernel matrices in the polynomial regime.” 2023.
[D] Transformer-based models and in-context learning
- [ ] [D-1] Ruiqi Zhang, Spencer Frei, and Peter L. Bartlett. “Trained transformers learn linear models in-context”. Journal of Machine Learning Research, 25(49):1–55, 2024.
- [D-2] Shivam Garg, Dimitris Tsipras, Percy Liang, Gregory Valiant. “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”. NeurIPS 2022.
- [ ] [D-3] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, Denny Zhou. “What learning algorithm is in-context learning? Investigations with linear models”. ICLR 2023.
- [D-4] Johannes von Oswald et al. “Transformers Learn In-Context by Gradient Descent”. ICML 2023.
- [D-5] Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, Samet Oymak. “Transformers as Algorithms: Generalization and Stability in In-context Learning”. ICML 2023.
- [ ] [D-6] Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, Song Mei. “Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection”. 2023.
- [ ] [D-7] Jingfeng Wu, Difan Zou, Zixiang Chen, Vladimir Braverman, Quanquan Gu, and Peter Bartlett. “How many pretraining tasks are needed for in-context learning of linear regression?”. ICLR 2024.
- [D-8] Yue M. Lu, Mary I. Letey, Jacob A. Zavatone-Veth, Anindita Maiti, and Cengiz Pehlevan. “Asymptotic theory of in-context learning by linear attention”. 2024.
- [D-9] Aaditya K Singh, Stephanie C.Y. Chan, Ted Moskovitz, Erin Grant, Andrew M Saxe, Felix Hill. “The Transient Nature of Emergent In-Context Learning in Transformers”. NeurIPS 2023.
- [D-10] Allan Raventos, Mansheej Paul, Feng Chen, Surya Ganguli. “Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression”. NeurIPS 2023.
- [ ] [D-11] Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang. “Training Dynamics of Multi-Head Softmax Attention for In-Context Learning: Emergence, Convergence, and Optimality”. 2024.
[F] Others
- [ ] [F-1] Bordelon, Blake, Alexander Atanasov, and Cengiz Pehlevan. “A Dynamical Model of Neural Scaling Laws.” arXiv preprint arXiv:2402.01092 (2024).
- [ ] [F-2] Bahri, Yasaman, et al. “Explaining Neural Scaling Laws.” 2021.
- [ ] [F-3] Kumar, Tanishq, et al. “Grokking as the transition from lazy to rich training dynamics.” arXiv preprint arXiv:2310.06110 (2023).
- [ ] [F-4] Papyan, Vardan, X. Y. Han, and David L. Donoho. “Prevalence of neural collapse during the terminal phase of deep learning training.” Proceedings of the National Academy of Sciences 117.40 (2020): 24652-24663.
- [ ] [F-5] Adityanarayanan Radhakrishnan, et al. “Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features.” 2023.
- [ ] [F-6] Noam Levi, Alon Beck, and Yohai Bar Sinai. “Grokking in Linear Estimators – A Solvable Model that Groks without Understanding.” 2023.
More information
Contact and Thanks
- Zhenyu Liao, EIC, Huazhong University of Science and Technology
The organizers are grateful for support from the NSFC via grant NSFC-62206101.