
2024/10/18 (6 posts)

Improving Dictionary Learning with Gated Sparse Autoencoders - Paper Review

https://arxiv.org/abs/2404.16014
By adding a gate structure to the standard SAE, similar to an LSTM's gates, only the necessary terms ..
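A minimal sketch of the gating idea as I read it (hypothetical module, not the authors' code): a gate path decides which dictionary features fire at all, a separate magnitude path decides how strongly, so only the gated-on terms enter the reconstruction.

```python
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    """Illustrative gated sparse autoencoder (assumed layout, for sketching only)."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_dict) * 0.01)  # shared encoder directions
        self.b_gate = nn.Parameter(torch.zeros(d_dict))   # gate-path bias
        self.r_mag = nn.Parameter(torch.zeros(d_dict))    # magnitude-path rescale (log scale)
        self.b_mag = nn.Parameter(torch.zeros(d_dict))    # magnitude-path bias
        self.W_dec = nn.Parameter(torch.randn(d_dict, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        x_cent = x - self.b_dec
        pre = x_cent @ self.W_enc
        gate = (pre + self.b_gate > 0).float()                  # binary gate: which features are on
        mag = torch.relu(pre * self.r_mag.exp() + self.b_mag)   # how strongly they fire
        f = gate * mag                                          # only gated-on terms survive
        return f @ self.W_dec + self.b_dec, f
```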

Interpreting and Steering LLM Representations with Mutual Information-based Explanations on Sparse Autoencoders - Paper Review

https://openreview.net/forum?id=vc1i3a4O99
This seems to cover things that have all been tried before, so nothing looks particularly special. 1. Train! 2. Find which words/tokens each neuron responds strongly to 3. ..
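Step 2, "find which tokens a feature responds strongly to", can be sketched roughly as below (hypothetical names; this is plain max-activation lookup, not the paper's mutual-information scoring):

```python
import torch

def top_activating_tokens(feature_acts, tokens, k=10):
    """feature_acts: (n_tokens,) activations of one SAE feature over a corpus;
    tokens: the corresponding token strings. Returns the k tokens where the
    feature fires most strongly (a simple max-activation explanation)."""
    vals, idx = torch.topk(feature_acts, k)
    return [(tokens[i], float(v)) for i, v in zip(idx.tolist(), vals.tolist())]

# usage sketch with stand-in data
acts = torch.rand(1000)                       # pretend SAE feature activations
toks = [f"tok{i}" for i in range(1000)]
print(top_activating_tokens(acts, toks, k=5))
```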

Transcoders Find Interpretable LLM Feature Circuits - Paper Review

https://arxiv.org/abs/2406.11944
The standard SAE carries over the autoencoder's property of reproducing its input as its output, so the input ..
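The contrast shows up most clearly in the training target: an SAE is trained to reconstruct its own input, while a transcoder is trained to map the MLP sublayer's input to the MLP's output. A rough sketch under those assumptions (hypothetical losses, not the paper's code):

```python
import torch
import torch.nn as nn

d_model, d_dict = 512, 4096        # assumed sizes for illustration
enc = nn.Linear(d_model, d_dict)
dec = nn.Linear(d_dict, d_model)

def sae_loss(x, l1=1e-3):
    """SAE: reconstruct the activation x itself, with an L1 sparsity penalty."""
    f = torch.relu(enc(x))
    return (dec(f) - x).pow(2).mean() + l1 * f.abs().mean()

def transcoder_loss(mlp_in, mlp_out, l1=1e-3):
    """Transcoder: predict the MLP's output from its input through sparse features."""
    f = torch.relu(enc(mlp_in))
    return (dec(f) - mlp_out).pow(2).mean() + l1 * f.abs().mean()
```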

Sparse Autoencoders Find Highly Interpretable Features in Language Models - Paper Review

https://arxiv.org/abs/2309.08600
In the standard Transformer architecture, the number of neurons ..
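The basic setup, as I understand it (assumed sizes, stand-in data): the dictionary is made several times wider than the activation dimension, so there are many more candidate features than neurons, and an L1 penalty keeps only a few active per input.

```python
import torch
import torch.nn as nn

d_model = 768          # e.g. residual-stream width of a small LM (assumed)
expansion = 8          # dictionary is 8x overcomplete (assumed ratio)
d_dict = d_model * expansion

encoder = nn.Sequential(nn.Linear(d_model, d_dict), nn.ReLU())
decoder = nn.Linear(d_dict, d_model)

x = torch.randn(32, d_model)        # a batch of LM activations (stand-in)
features = encoder(x)               # one sparse code per dictionary entry
recon = decoder(features)
loss = (recon - x).pow(2).mean() + 1e-3 * features.abs().mean()
print(loss.item(), (features > 0).float().mean().item())  # loss and fraction of active features
```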

Transformer Interpretability Beyond Attention Visualization - Paper Review

https://arxiv.org/abs/2012.09838
Looks like this one is about a visual LM. It computes an importance score for each token, so the tokens without much meaning ..
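A crude sketch of per-token importance in that spirit (gradient-weighted attention as a stand-in; the paper's actual method combines LRP-style relevance propagation with attention, which is not reproduced here):

```python
import torch

def token_importance(attn, attn_grad):
    """Rough per-token importance from one attention layer: weight each attention
    map by its gradient, keep positive contributions, average over heads, then
    sum the attention each token receives. attn, attn_grad: (heads, seq, seq)."""
    weighted = torch.clamp(attn * attn_grad, min=0).mean(dim=0)  # (seq, seq)
    return weighted.sum(dim=0)                                   # one score per token

# usage sketch with random tensors
heads, seq = 12, 16
scores = token_importance(torch.rand(heads, seq, seq), torch.randn(heads, seq, seq))
print(scores.shape)  # torch.Size([16])
```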

Interpretability Illusions in the Generalization of Simplified Models - Paper Review

https://arxiv.org/abs/2312.03656
Simplifying the model and visualizing it to see why it produces these outputs ..
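The "simplified model" being questioned here is typically a low-rank SVD view of the hidden states; a minimal sketch of that simplification step, on stand-in data, also showing how much the low-dimensional picture throws away:

```python
import numpy as np

hidden = np.random.randn(1000, 768)              # (examples, hidden_dim) stand-in activations
centered = hidden - hidden.mean(0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 2                                            # keep the top-2 directions for visualization
coords = U[:, :k] * S[:k]                        # low-dimensional view of the hidden states
approx = coords @ Vt[:k]                         # the simplified (rank-k) model of the states
err = np.linalg.norm(centered - approx) / np.linalg.norm(centered)
print(coords.shape, f"relative reconstruction error: {err:.2f}")
```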
