https://arxiv.org/abs/2112.06905
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

MoE scales the parameter count up while reducing inference latency and power usage: each token activates only a small subset of the experts, so per-token compute stays nearly flat even as total parameters grow.
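As a rough illustration of that trade-off, here is a minimal top-2 gated MoE layer in PyTorch. This is a sketch of the general technique, not GLaM's actual implementation (GLaM uses top-2 routing over 64 experts with expert parallelism and load-balancing losses); the class name `TopKMoE`, the dimensions, and the per-expert loop are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Toy top-k gated mixture-of-experts layer (illustrative sketch).

    Total parameters grow linearly with num_experts, but each token
    only runs through k experts, so per-token FLOPs stay close to
    those of k dense feed-forward blocks.
    """

    def __init__(self, d_model, d_hidden, num_experts, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (num_tokens, d_model)
        scores = self.gate(x)                                # (tokens, experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # route each token to its k best experts
        weights = F.softmax(topk_scores, dim=-1)             # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            for e in idx.unique():                           # batch all tokens sharing an expert
                mask = idx == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out


# 8x the FFN parameters of a dense layer, but only ~2x the per-token FLOPs
moe = TopKMoE(d_model=64, d_hidden=256, num_experts=8, k=2)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

The key design point is in the `forward` loop: the gate picks k experts per token, so adding more experts increases capacity (parameters) without increasing the amount of work done per token.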