Talk
Towards Data-efficient Training of Large Language Models (LLMs)
- Baharan Mirzasoleiman (UCLA)
Abstract
High-quality data is crucial for training LLMs with superior performance. In this talk, I will present two theoretically rigorous approaches to finding smaller subsets of examples that can improve the performance and efficiency of training LLMs. First, I will present a one-shot data selection method for supervised fine-tuning of LLMs. Then, I'll discuss an iterative data selection strategy for pretraining or fine-tuning LLMs on imbalanced mixtures of language data. I'll conclude with empirical results confirming that these data selection strategies can effectively improve the performance of various LLMs during fine-tuning and pretraining.