Why do Transformers work so well?

Bayesian Language Understanding with Nonparametric Variational Transformers - Talk, January 29, 2024

In our invited talk at the CALCULUS Symposium at KU Leuven, we present our work on reinterpreting the latent representations of Transformers as nonparametric mixture distributions and on training a variational Bayesian version of the Transformer on natural language. We argue that the empirical success of Transformers is evidence that natural language understanding is nonparametric variational Bayesian inference over mixture distributions.
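As a rough illustration of the mixture-distribution reading (a minimal sketch of the general idea, not the model from the talk; every function and variable name below is hypothetical): a softmax attention head already produces, for each query position, a categorical distribution over the value vectors, which can be read as mixture weights, with the usual attention output as the mixture mean. A variational treatment would then train those weights against a prior with an ELBO-style objective:

```python
import math
import torch
import torch.nn.functional as F

def attention_as_mixture(q, k, v):
    # Softmax attention yields one categorical distribution per query
    # position; read it as mixture weights over the value vectors,
    # with the standard attention output as the mixture mean.
    logits = q @ k.transpose(-2, -1) / math.sqrt(k.shape[-1])
    weights = F.softmax(logits, dim=-1)  # (T, T) mixture weights
    mean = weights @ v                   # (T, d) mixture means
    return weights, mean

def negative_elbo(x, recon, weights):
    # ELBO = E_q[log p(x | z)] - KL(q(z | x) || p(z)).
    # Reconstruction: Gaussian log-likelihood up to constants;
    # prior over mixture assignments: uniform over components.
    recon_term = -F.mse_loss(recon, x, reduction="sum")
    log_prior = math.log(1.0 / weights.shape[-1])
    kl = (weights * (weights.clamp_min(1e-9).log() - log_prior)).sum()
    return -(recon_term - kl)  # minimize the negative ELBO

# Toy usage: self-attention over a random sequence of 8 tokens.
x = torch.randn(8, 16)
weights, recon = attention_as_mixture(x, x, x)
loss = negative_elbo(x, recon, weights)
```

A truly nonparametric version would additionally let the number of mixture components grow with the data, for example via a stick-breaking prior in place of the uniform one; the sketch above fixes the components to the sequence positions and omits that step.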