
FEATURE LEARNING PERSPECTIVE OF DEEP FOUNDATION MODELS: STATISTICAL AND OPTIMIZATION THEORIES
TAIJI SUZUKI
University of Tokyo
Department of Mathematical Informatics

Schedule: 
March 16, 2026 | 2PM – 5PM – Room 2005
March 19, 2026 | 2PM – 5PM – Room 2005
March 26, 2026 | 2PM – 5PM – Room 2005
March 30, 2026 | 2PM – 5PM – Room 2005

Content:
The main focus of this lecture will be the feature learning aspects of deep foundation models, in particular the benefits of feature learning for achieving better predictive accuracy and more efficient optimization. As deep foundation models are developed following scaling laws, a theoretical understanding of the learning principles behind practice is becoming increasingly important. For superior generalization of deep models, it is essential to acquire compressed representations rather than mere memorization, which makes representation/feature learning fundamental. It has been shown theoretically that deep learning gains various advantages in generalization through its feature learning ability, which arises naturally from its deep structure.

The lecture will discuss how this feature learning ability affects the rate of convergence by comparing it with the suboptimal rates of non-feature-learning methods, from both a statistical and an optimization perspective. An interesting example is the estimation of a Gaussian single index model, whose computational complexity can be characterized by quantities called the information exponent and the generative exponent. If time permits, optimization guarantees for mean field Langevin dynamics and its statistical properties will also be discussed. Furthermore, feature learning is significant not only during pre-training but also during test-time inference. This will be demonstrated concisely using in-context learning as an example, and it will be discussed how test-time feature learning, as well as feature learning during pre-training, affects the performance of test-time inference. In summary, the following topics will be covered in the lecture, although some of them may be omitted depending on time constraints (a brief formal sketch of the single index model and of mean field Langevin dynamics follows the list):
• Nonparametric function estimation by deep learning on high-dimensional data and its minimax optimality.
• Stochastic gradient descent for neural network training; Gaussian single index model, k-parity problem, information exponent, CSQ/SQ lower bound.
• Mean field Langevin dynamics and its statistical properties.
• Test-time inference and test-time scaling: Transformer, in-context learning, chain-of-thought.
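The following is a minimal sketch, not part of the official syllabus, of two of the objects named above, stated in the standard form used in the references listed below; the symbols (sigma_*, w_*, k_*, F, lambda) are chosen here for illustration and do not come from the course text.

\[
  y = \sigma_*\bigl(\langle w_*, x\rangle\bigr) + \varepsilon, \qquad x \sim \mathcal{N}(0, I_d), \quad \|w_*\| = 1,
\]
is the Gaussian single index model with unknown direction $w_*$ and link function $\sigma_*$. Its information exponent is the index of the first nonzero Hermite coefficient of the link,
\[
  k_* = \min\bigl\{\, k \ge 1 : \mathbb{E}_{z \sim \mathcal{N}(0,1)}\bigl[\sigma_*(z)\,\mathrm{He}_k(z)\bigr] \neq 0 \,\bigr\},
\]
which governs the sample complexity of gradient-based (CSQ-type) recovery of $w_*$, while the generative exponent plays the analogous role for general SQ learners. Mean field Langevin dynamics is the noisy gradient flow on probability measures that minimizes an entropy-regularized convex objective $F(\mu) + \lambda \int \mu \log \mu$:
\[
  \mathrm{d}X_t = -\nabla \frac{\delta F}{\delta \mu}(\mu_t)(X_t)\,\mathrm{d}t + \sqrt{2\lambda}\,\mathrm{d}B_t, \qquad \mu_t = \mathrm{Law}(X_t).
\]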

Evaluation: 
Take-home project.

References:
E. Giné and R. Nickl. Mathematical foundations of infinite-dimensional statistical models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2015.
S. Hayakawa and T. Suzuki. On the minimax optimality and superiority of deep neural network learning over sparse parameter spaces. Neural Networks, 123:343–361, 2020.
J. Kim, T. Nakamaki, and T. Suzuki. Transformers are minimax optimal nonparametric in-context learners. In Advances in Neural Information Processing Systems, volume 37, pages 106667–106713, 2024.
J. D. Lee, K. Oko, T. Suzuki, and D. Wu. Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit. In Advances in Neural Information Processing Systems, volume 37, pages 58716–58756, 2024.
N. Nishikawa, Y. Song, K. Oko, D. Wu, and T. Suzuki. Nonlinear transformers can perform inference-time feature learning. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 46554–46585. PMLR, 2025.
A. Nitanda, D. Wu, and T. Suzuki. Convex analysis of the mean field Langevin dynamics. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pages 9741–9757. PMLR, 28–30 Mar 2022.
A. Nitanda, A. Lee, D. T. X. Kai, M. Sakaguchi, and T. Suzuki. Propagation of chaos for mean-field Langevin dynamics and its application to model ensemble. In Forty-second International Conference on Machine Learning, 2025.
K. Oko, Y. Song, T. Suzuki, and D. Wu. Pretrained transformer efficiently learns low-dimensional target functions in-context. In Advances in Neural Information Processing Systems, volume 37, pages 77316–77365, 2024.
T. Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. In International Conference on Learning Representations, 2019.
T. Suzuki and A. Nitanda. Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic Besov space. In Advances in Neural Information Processing Systems, volume 34, pages 3609–3621, 2021.
A. B. Tsybakov.