https://arxiv.org/abs/2601.05679

Do Sparse Autoencoders Identify Reasoning Features in Language Models?

We study how reliably sparse autoencoders (SAEs) support claims about reasoning-related internal features in large language models. We first give a stylized analysis showing that sparsity-regularized decoding can preferentially retain stable low-dimensiona..

Whether the reasoning features found by SAEs actually..