Do Sparse Autoencoders Identify Reasoning Features in Language Models?

이게될까 2026. 5. 24. 18:28

https://arxiv.org/abs/2601.05679

We study how reliably sparse autoencoders (SAEs) support claims about reasoning-related internal features in large language models. We first give a stylized analysis showing that sparsity-regularized decoding can preferentially retain stable low-dimensiona

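The paper tests whether the reasoning features identified by SAEs actually represent reasoning computation. It argues that in many cases they capture low-dimensional linguistic cues such as CoT style, specific tokens, and procedural phrasing, rather than reasoning itself.

To find reasoning features, prior work prepared CoT reasoning text and non-reasoning text, compared SAE feature activations between the two, and interpreted the features that fire more strongly on reasoning text as reasoning features.

The paper points out, however, that reasoning text contains not only actual reasoning but also frequent expressions such as "First", "Let's", "Therefore", and "Wait". This makes it hard to tell whether a feature captures the reasoning process or just a language pattern.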

Because SAEs are trained with a sparsity objective, they can isolate recurring low-dimensional lexical cues such as "Wait" or "Let" as features more easily than high-dimensional, diverse reasoning variation.
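
The intuition can be illustrated with a toy calculation. The sketch below is not the paper's analysis; it only applies the standard ℓ1 soft-thresholding operator to two made-up vectors of equal norm, one concentrated in a single "cue" coordinate and one spread evenly over many coordinates. The dimension (16) and the threshold `lam` (0.3) are arbitrary assumptions chosen for illustration.

```python
import math

def soft_threshold(x, lam):
    """Proximal operator of lam * |x|: shrink x toward zero by lam."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

dim = 16    # assumed toy dimension
lam = 0.3   # assumed shrinkage threshold

# Both vectors have unit norm; only the way the energy is distributed differs.
cue = [1.0] + [0.0] * (dim - 1)          # one stable coordinate
spread = [1.0 / math.sqrt(dim)] * dim    # 0.25 in every coordinate

cue_kept = [soft_threshold(x, lam) for x in cue]
spread_kept = [soft_threshold(x, lam) for x in spread]

print("cue norm after shrinkage:   ", l2_norm(cue_kept))
print("spread norm after shrinkage:", l2_norm(spread_kept))
```

In this toy setting the concentrated vector keeps most of its norm, while every coordinate of the spread vector falls below the threshold and is zeroed out.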

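The paper therefore requires a feature to satisfy the following three conditions before it counts as a genuine reasoning feature.

Condition	Requirement
Reasoning specificity	The feature activates reliably on reasoning text and stays low on non-reasoning text
Non-spurious correlation	The feature must not fire on surface cues alone, such as "therefore", "let us consider", or "I need to"
Semantic invariance	The activation must persist when the same reasoning is paraphrased or restyled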

To verify this, the paper uses the following pipeline.

Step	Purpose	Method
1. Contrastive feature selection	Find candidate reasoning features	Measure the difference in SAE activations between the reasoning and non-reasoning corpora with Cohen's d
2. Causal token injection	Check whether token cues alone turn the feature on	Insert top-activating tokens, bigrams, and trigrams into non-reasoning text
3. LLM-guided falsification	Rule out more complex confounds	An LLM generates false positive and false negative counterexamples
4. Steering sanity check	Check whether amplifying a feature actually changes reasoning performance	Steer the residual stream along the feature's decoder direction
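
Step 1 of the pipeline can be sketched as follows. This is a minimal illustration, assuming the common pooled-standard-deviation form of Cohen's d; the exact variant used in the paper is not stated in this post. The function names, feature ids, and activation values are all hypothetical.

```python
import math
from statistics import mean, variance

def cohens_d(reasoning_acts, other_acts):
    """Cohen's d with a pooled sample standard deviation (assumed variant)."""
    n_r, n_o = len(reasoning_acts), len(other_acts)
    pooled_var = (
        (n_r - 1) * variance(reasoning_acts) + (n_o - 1) * variance(other_acts)
    ) / (n_r + n_o - 2)
    if pooled_var == 0:
        return 0.0  # no spread at all: treat the effect size as undefined/zero
    return (mean(reasoning_acts) - mean(other_acts)) / math.sqrt(pooled_var)

def top_k_by_effect_size(acts_reasoning, acts_other, k):
    """Rank features by Cohen's d and return the k largest as candidates."""
    scores = {f: cohens_d(acts_reasoning[f], acts_other[f]) for f in acts_reasoning}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up per-sample activations for three hypothetical features.
acts_reasoning = {"f0": [2.0, 4.0, 6.0], "f1": [1.0, 1.0, 1.2], "f2": [5.0, 7.0, 9.0]}
acts_other = {"f0": [1.0, 3.0, 5.0], "f1": [1.0, 1.1, 1.0], "f2": [1.0, 3.0, 5.0]}

candidates = top_k_by_effect_size(acts_reasoning, acts_other, k=2)
print(candidates)
```

A large d only says that the two corpora differ on that feature. It says nothing yet about why they differ, which is what steps 2 to 4 are for.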

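Early layers lean toward lexical processing and late layers lean toward output token prediction, so the paper selects middle layers.

Under the existing contrastive method, SAE features appear to separate reasoning text from non-reasoning text, but that alone does not show that a feature captures reasoning: the separation could come from CoT style just as well as from reasoning.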

Many candidate features are activated by token injection alone, which shows that they are not tracking actual reasoning!
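
The token injection test can be sketched like this. Everything here is hypothetical: `toy_lexical_feature` is a stand-in for "run the model, encode with the SAE, read one feature's activation", and the insertion position and example sentences are assumptions made for illustration.

```python
from statistics import mean

def inject_cue(text, cue, position=0):
    """Insert a cue token or n-gram into a text at a word position."""
    words = text.split()
    position = max(0, min(position, len(words)))
    return " ".join(words[:position] + cue.split() + words[position:])

def injection_effect(texts, cue, activation_fn):
    """Mean change in feature activation caused by inserting the cue."""
    before = [activation_fn(t) for t in texts]
    after = [activation_fn(inject_cue(t, cue)) for t in texts]
    return mean(after) - mean(before)

def toy_lexical_feature(text):
    """Hypothetical feature that fires on specific words, not on reasoning."""
    triggers = {"wait", "therefore"}
    return float(sum(w.strip(",.").lower() in triggers for w in text.split()))

non_reasoning_texts = [
    "The museum opens at nine on weekdays.",
    "Rainfall was heavy across the coastal region.",
]

effect = injection_effect(non_reasoning_texts, "Wait,", toy_lexical_feature)
print(effect)
```

If inserting a few cue tokens into text that contains no reasoning raises the activation, the simplest reading is that the feature responds to the tokens.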

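For features that token injection does not explain, an LLM generates a hypothesis about what pattern the feature detects, builds false positives from that hypothesis, and checks whether the activation separates from the presence of reasoning.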
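
The decision logic of this stage might look roughly like the sketch below. The paper's exact criteria are not given in this post, so the rule, the threshold, the function names, and the counterexample sentences are all assumptions; in the real pipeline the counterexamples come from an LLM and the activation from the SAE.

```python
def falsification_verdict(activation_fn, false_positives, false_negatives, threshold):
    """Label a feature 'confound' if any counterexample breaks the reasoning reading.

    false_positives: texts WITHOUT reasoning that are expected to activate the feature.
    false_negatives: texts WITH reasoning, restyled so the feature should go silent.
    """
    fp_hits = [t for t in false_positives if activation_fn(t) >= threshold]
    fn_misses = [t for t in false_negatives if activation_fn(t) < threshold]
    if fp_hits or fn_misses:
        return "confound"
    return "candidate"

def toy_planning_feature(text):
    """Hypothetical feature that fires on first-person planning phrases."""
    return 1.0 if "i need to" in text.lower() else 0.0

# Hand-written stand-ins for LLM-generated counterexamples.
false_positives = ["I need to buy milk and eggs on the way home."]
false_negatives = ["Since 3 divides 12 and 12 divides 36, 3 divides 36."]

verdict = falsification_verdict(
    toy_planning_feature, false_positives, false_negatives, threshold=0.5
)
print(verdict)
```

Here the toy feature fires on a shopping reminder and stays silent on a valid deduction, so it is labeled a confound.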

No feature survived this stage. Most turned out to be CoT-style features, and steering did not improve reasoning performance either.
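
For reference, steering along a feature's decoder direction is usually an additive update to the residual stream. The sketch below assumes the form h' = h + alpha * d / ||d||; whether the paper normalizes the direction, and which alpha values it uses, are not stated in this post, so both are assumptions. Plain Python lists stand in for the hidden state.

```python
import math

def steer(hidden, decoder_direction, alpha):
    """Add alpha times the unit-normalized decoder direction to a hidden state."""
    norm = math.sqrt(sum(d * d for d in decoder_direction))
    if norm == 0:
        raise ValueError("decoder direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, decoder_direction)]

hidden = [1.0, 2.0]             # toy residual-stream vector
decoder_direction = [3.0, 4.0]  # toy SAE decoder column for one feature

steered = steer(hidden, decoder_direction, alpha=5.0)
print(steered)
```

The update changes what the model tends to emit, but a change in surface output does not by itself show that a reasoning mechanism was changed.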

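In the end, the paper shows that SAE-based reasoning interpretability, as currently practiced, can be misleading, and that it needs to be carried out more rigorously.

Hmm, that said, I do wonder whether it is right to look for reasoning features using only instruction-tuned and distilled models rather than reasoning models, and whether it is right to simply forward a dataset through the model instead of using the model's own outputs.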

Item	Summary
Key question	Do the "reasoning features" that SAEs find through contrastive activation actually represent the reasoning computation inside the LLM?
Motivation	Prior work interpreted SAE features that fire more strongly on CoT reasoning text as reasoning features, but CoT text contains not only actual reasoning but also many surface linguistic cues such as "First", "I need to", "Let's", "Therefore", and "Wait". Activation differences alone therefore cannot separate a reasoning feature from a lexical or style feature.
Core claim	The SAE sparsity objective captures recurring low-dimensional lexical cues more easily than complex, diverse, high-dimensional reasoning variation. Features chosen by contrastive selection are therefore likely to be linguistic correlates that co-occur with reasoning rather than reasoning itself.
Theoretical basis	The paper splits reasoning activations into ① a stable low-dimensional cue direction and ② high-dimensional reasoning variation. It shows that ℓ1 sparse decoding or Top-K sparsity suppresses the high-dimensional component spread across many coordinates, while a single stable cue coordinate is comparatively easy to preserve.
Conditions for a "genuine reasoning feature"	① It must activate reliably on reasoning text, ② it must not activate when only reasoning cues are inserted into non-reasoning text, and ③ its activation must persist when the same reasoning is paraphrased or restyled.
Overall method	Verification proceeds as Contrastive feature selection → Causal token injection → LLM-guided falsification → Steering sanity check. Candidate features are first found the conventional way, and their interpretation is then challenged with token cues and counterexamples.
Contrastive feature selection	For each SAE feature, the activation difference between the reasoning and non-reasoning corpora is measured with Cohen's d, ROC-AUC, activation frequency ratio, and similar metrics. The top 100 features in each configuration are selected as candidate reasoning features.
Causal token injection	The tokens, bigrams, and trigrams that most strongly activate each candidate feature are identified and inserted into non-reasoning text. If this manipulation alone raises the feature's activation, the feature is judged sensitive to token cues rather than to reasoning.
LLM-guided falsification	For context-dependent features that token injection does not explain, an LLM forms a feature hypothesis and generates ① false positives that contain no reasoning but trigger activation, and ② false negative paraphrases that preserve the reasoning content but lose the activation.
Models	The analysis centers on Gemma-3-12B-Instruct, Gemma-3-4B-Instruct, and DeepSeek-R1-Distill-Llama-8B, with additional checks on Llama-3.1-8B, Gemma-2-9B, and Gemma-2-2B in the appendix.
Datasets	The reasoning corpora are s1K-1.1 and General Inquiry Thinking Chain-of-Thought; the non-reasoning corpus is the Pile uncopyrighted subset. 1,000 samples are drawn from each corpus, and inputs are chunked to 64 tokens.
Result 1	Judged by contrastive selection alone, the top features separate reasoning text from non-reasoning text well. Mean Cohen's d is roughly 0.675 to 1.043, so statistically they look like reasoning-associated features.
Result 2	Under token injection, however, 45% to 90% of candidate features activate significantly when just a few relevant tokens are inserted into non-reasoning text. Many candidates are therefore more sensitive to lexical cues than to actual reasoning.
Result 3	The 248 context-dependent features not explained by token injection were analyzed with LLM-guided falsification, and none was classified as a genuine reasoning feature. Most were attributed to confounds such as first-person planning, procedural discourse, formal explanatory style, and decomposition vocabulary.
Result 4	In the steering experiments, amplifying the top features did not improve AIME or GPQA performance, and in some cases lowered it. The paper's reading is that feature steering can make certain expressions more frequent, but that this is not evidence of having manipulated a reasoning mechanism.
Conclusion	In the settings analyzed, most reasoning-associated features that SAEs find through contrastive activation appear to capture CoT-style linguistic cues rather than actual reasoning computation. Interpreting an SAE feature as a high-level reasoning mechanism therefore requires more than contrastive correlation; causal intervention and falsification checks are essential.
Significance	The paper cautions SAE-based mechanistic interpretability research against reading "high activation" as "meaningful internal concept". In particular, when interpreting high-level behaviors such as reasoning, refusal, hallucination, and instruction following, one has to distinguish an actual mechanism from a surface correlate.
Limitations	The results test monosemantic reasoning interpretations of individual SAE features. They do not rule out reasoning being distributed across many features or represented in a nonlinear subspace. The quality of LLM-guided falsification may also depend on the capability of the generating LLM.
Research takeaway	Future reasoning interpretability work should go beyond single-feature activation analysis and combine paraphrase-invariant features, cue-matched counterexamples, causal intervention, and distributed subspace/circuit analysis. The paper is best read as tightening the validation standard for SAE interpretation rather than as rejecting SAEs.
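
Of the selection metrics listed in the table, ROC-AUC has a simple rank-based form: the probability that a randomly chosen reasoning sample has a higher activation than a randomly chosen non-reasoning sample, with ties counted as one half. The sketch below uses that standard definition with made-up activations; how the paper computes or thresholds it is not stated in this post.

```python
def roc_auc(reasoning_acts, other_acts):
    """Pairwise (Mann-Whitney) estimate of ROC-AUC; ties count as 0.5."""
    wins = 0.0
    for r in reasoning_acts:
        for o in other_acts:
            if r > o:
                wins += 1.0
            elif r == o:
                wins += 0.5
    return wins / (len(reasoning_acts) * len(other_acts))

# Made-up activations of one hypothetical feature on the two corpora.
reasoning_acts = [0.9, 0.8, 0.4]
other_acts = [0.5, 0.3, 0.1]

auc = roc_auc(reasoning_acts, other_acts)
print(auc)
```

Like Cohen's d, a high AUC only measures how well the feature separates the two corpora, so it is subject to the same stylistic confounds.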
