A review of LLMs from the perspective of memory and compute
- Yoni Dukler (AWS AI)
Abstract
Scaling of machine learning models has significantly improved model capabilities, especially for self-supervised tasks. To scale efficiently, one must harmonize modeling efforts with good utilization of the computational resources at hand. In this talk I will review a selection of important LLM architectural choices from the perspective of efficiency. I will first walk through the high-level mechanics of GPU computation, along with GPUs' capabilities and constraints. I will then dive deeper into a few aspects of the hardware and identify a few basic principles that guide hardware efficiency in the field. From there, I will motivate recent innovations in LLM execution and architecture through the lens of the hardware principles we identified, considering both LLM training and inference efficiency.