A review of LLMs from the perspective of memory and compute
- Yoni Dukler (AWS AI)
Abstract
Scaling of machine learning models has significantly improved model capabilities, especially for self-supervised tasks. To scale efficiently, one must harmonize modeling efforts with good utilization of the computational resources at hand. In this talk I will review a selection of important LLM architectural choices from the perspective of efficiency. I will first walk through the high-level mechanics of GPU computation, along with GPUs' capabilities and constraints. I will then dive deeper into a few aspects of the hardware and identify a few basic principles that guide hardware efficiency in the field. From there, I will motivate recent innovations in LLM execution and architecture through the lens of the hardware principles we identified, considering both LLM training and inference efficiency.