https://arxiv.org/abs/2412.04139
Monet: Mixture of Monosemantic Experts for Transformers

Abstract excerpt: "Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity."

At first I thought this simply attached an SAE to a plain MoE, but it actually pushes MoE much further and additionally ..
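For reference, here is a minimal sketch of the "plain MoE" baseline the note contrasts against: a generic top-k routed feed-forward layer. This is only an illustrative toy, not Monet's actual architecture, and all names here (ToyMoE, num_experts, top_k) are made up for the example.

```python
# Minimal generic top-k MoE feed-forward layer (illustrative only; not Monet's architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=256, d_hidden=512, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # token -> expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                               # x: (batch, seq, d_model)
        logits = self.router(x)                         # (batch, seq, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)            # normalize the kept routing weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)               # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    x = torch.randn(2, 16, 256)
    print(ToyMoE()(x).shape)  # torch.Size([2, 16, 256])
```

With only a handful of experts like this, each expert stays polysemantic, which is why a post-hoc SAE would normally be needed for interpretability; the paper's point (as the note hints) is to rework the MoE itself rather than bolt an SAE on afterward.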