#80 - Layer pruning and Mixture of Depths.
Description
Hey guys, continuing the series of episodes about PEFT, in this episode I talk about inference optimization techniques for LLMs. I cover layer pruning, where we prune consecutive layers of the LLM with almost no loss in model performance. I also talk about Mixture of Depths, a technique similar to Mixture of Experts, where a router chooses which tokens are processed by each layer of the LLM.

MoD paper: https://arxiv.org/pdf/2404.02258.pdf
Layer pruning paper: https://arxiv.org/pdf/2403.17887v1.pdf
Instagram of the podcast: https://www.instagram.com/podcast.lifewithai
LinkedIn of the podcast: https://www.linkedin.com/company/life-with-ai
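To make the layer pruning idea concrete, here is a minimal sketch (not the paper's exact implementation): it looks for the block of consecutive layers whose input and output activations are most similar, on the assumption that such a block changes the representation the least and can be dropped cheaply. The similarity criterion and the function names are illustrative.

```python
import numpy as np

def prune_consecutive_layers(layers, activations, n_prune):
    """Drop the block of n_prune consecutive layers whose input/output
    activations are most similar (angular-distance criterion, as a sketch).

    layers: list of layer objects (anything; we only slice the list)
    activations: activations[i] is the input to layer i, shape (tokens, d_model);
                 activations[len(layers)] is the final output
    """
    def angular_distance(a, b):
        # mean per-token angle between the two activation matrices, in [0, 1]
        cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
        return np.arccos(np.clip(cos, -1.0, 1.0)).mean() / np.pi

    # pick the start index whose block changes the activations the least
    best_start = min(
        range(len(layers) - n_prune + 1),
        key=lambda i: angular_distance(activations[i], activations[i + n_prune]),
    )
    return layers[:best_start] + layers[best_start + n_prune:], best_start
```

In practice you would collect the activations by running the model once on a small calibration set, then prune and (optionally) briefly fine-tune to recover any lost quality.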
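And a rough sketch of the Mixture of Depths routing step, under my own simplifying assumptions (a linear router, a fixed per-layer capacity, NumPy instead of a real framework): the router scores every token, only the top-k tokens go through the block, and the rest skip it via the residual stream.

```python
import numpy as np

def mixture_of_depths_layer(x, router_w, layer_fn, capacity=0.5):
    """Route only the top-k tokens through the layer; the rest skip it.

    x: (seq_len, d_model) token activations
    router_w: (d_model,) weights of a hypothetical linear router
    layer_fn: the transformer block applied to the routed tokens
    capacity: fraction of tokens the layer actually processes
    """
    seq_len = x.shape[0]
    k = max(1, int(seq_len * capacity))
    scores = x @ router_w                  # one scalar score per token
    top_idx = np.argsort(scores)[-k:]      # top-k tokens by router score
    out = x.copy()                         # skipped tokens pass straight through
    # gate the block output by the router score so routing can be trained
    gate = 1 / (1 + np.exp(-scores[top_idx]))[:, None]
    out[top_idx] = x[top_idx] + gate * layer_fn(x[top_idx])
    return out
```

With capacity 0.5, half the tokens skip each MoD block entirely, which is where the inference-time compute savings come from.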
More Episodes
Extra episode about Llama 3.
Published 04/19/24
Published 04/18/24
Hey guys, this is the first episode in a series of episodes about PEFT, Parameter Efficient Fine-Tuning. In this episode I talk about LoRA and QLoRA, two widely used methods that let us fine-tune LLMs much faster and on a single GPU without losing performance. Video about...
Published 04/11/24