
All posts (944)

Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models - Paper Review

https://arxiv.org/abs/2305.14705 Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models. Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. (arxiv.org) This paper "M..

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts - Paper Review

https://arxiv.org/abs/2112.06905 GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these la.. (arxiv.org) MoE increased the parameter count while reducing inference time or power use..

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability - Paper Review

https://arxiv.org/abs/2405.10927 Using Degeneracy in the Loss Landscape for Mechanistic Interpretability. Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations. An obstacle to reverse engineering neural networks is that many of the parameters inside a network are not involv.. (arxiv.org) Addressing the degenerate structure that hinders neural network interpretability..

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding - Paper Review

https://arxiv.org/abs/2006.16668 GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model.. (arxiv.org) For large-scale neural networks, this paper..

Benchmarking Large Language Models in Retrieval-Augmented Generation - Paper Review

https://arxiv.org/abs/2309.01431 Benchmarking Large Language Models in Retrieval-Augmented Generation. Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language.. (arxiv.org) To address LLM hallucination and knowledge updating, this paper ..

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer - Paper Review

https://arxiv.org/abs/1701.06538 Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capa.. (arxiv.org) Here, computing every expert ..

Learning Factored Representations in a Deep Mixture of Experts - Paper Review

https://arxiv.org/abs/1312.4314 Learning Factored Representations in a Deep Mixture of Experts. Mixtures of Experts combine the outputs of several "expert" networks, each of which specializes in a different part of the input space. This is achieved by training a "gating" network that maps each input to a distribution over the experts. Such models sho.. (arxiv.org) Whereas earlier MoE applied the mixture in a single layer, here Dee..

Adaptive Mixtures of Local Experts - Paper Review

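https://www.cs.toronto.edu/~hinton/absps/jjnh91.pdf This is the foundational MoE paper. Training an MoE is no different from training an ordinary network: the gate supplies weights that determine how much each expert network contributes. Here, rather than computing only selected experts, every expert is computed, so the amount of computation grows enormously. 1. Problem being addressed: This paper tackles the problem of learning complex data distributions effectively. In particular, because a single model struggles to capture every part of the data distribution, it proposes a method to address this. The researchers look for a way to activate different expert models depending on the characteristics of the input, so that each prediction fits the data distribution of its local region. 2. ..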
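The dense gating described in this excerpt can be illustrated with a small sketch. It is not the paper's implementation: the linear experts, the softmax gate, and all names and toy values (`dense_moe`, `expert_weights`, `gate_weights`) are assumptions made for illustration, and it shows only the forward pass, not the training objective used in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def dense_moe(x, expert_weights, gate_weights):
    """Blend the outputs of ALL experts using gate probabilities.

    Each expert is a hypothetical scalar linear model w . x.
    The gate is a hypothetical linear layer followed by softmax.
    """
    gate_probs = softmax([dot(g, x) for g in gate_weights])
    # Every expert is evaluated, so cost grows with the number of experts.
    expert_outputs = [dot(w, x) for w in expert_weights]
    output = sum(p * y for p, y in zip(gate_probs, expert_outputs))
    return output, gate_probs


# Toy values (assumed): three experts, two input features.
x = [1.0, 2.0]
expert_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # uniform gate

output, gate_probs = dense_moe(x, expert_weights, gate_weights)
print(output, gate_probs)
```

With an all-zero gate, every expert receives the same weight, so the output is the plain average of the three expert outputs. Because no expert is skipped, compute scales linearly with the number of experts, which is the cost the excerpt points out.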

SelfIE: Self-Interpretation of Large Language Model Embeddings - Paper Review

https://arxiv.org/abs/2403.10949 SelfIE: Self-Interpretation of Large Language Model Embeddings. How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model developments. We propose SelfIE (Self-Interpretation of Embeddings), a framework.. (arxiv.org) This paper, unlike Sparse Autoencoders (SAE), additional..

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity - Paper Review

https://arxiv.org/abs/2101.03961 Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of par.. (arxiv.org) 1. Problem ..
