
Getting Started with Sparse Autoencoders

이게될까 2024. 9. 19. 21:34

The "let's use SAEs to modify and manipulate an LLM's internal data" project has officially begun!!

https://transformer-circuits.pub/2024/scaling-monosemanticity/

 

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Authors: Adly Templeton*, Tom Conerly*, Jonathan Marcus, Jack Lindsey, Trenton Bricken, et al.

transformer-circuits.pub

This is the paper where the whole idea started!

The SAE that became famous through the Golden Gate Bridge (Golden Gate Claude)!

https://jbloomaus.github.io/SAELens/

 

SAE Lens

The SAELens training codebase exists to help researchers: train sparse autoencoders, analyse sparse autoencoders and neural network internals, and generate insights which make it easier to create safe and aligned AI systems.

jbloomaus.github.io

I'm planning to learn from this guy's blog.

https://github.com/jbloomAus/SAELens

 

GitHub - jbloomAus/SAELens: Training Sparse Autoencoders on Language Models

Training Sparse Autoencoders on Language Models. Contribute to jbloomAus/SAELens development by creating an account on GitHub.

github.com

Looking at the GitHub repo, even the tutorials are nicely done.
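
Just to get a feel for the API before diving into the tutorials, here's a minimal sketch of loading a pretrained SAE with SAELens and encoding cached activations into features. The release name and SAE id follow the examples in the SAELens docs, so treat them as assumptions and check the current catalogue.

```python
# pip install sae-lens transformer-lens
from sae_lens import SAE, HookedSAETransformer

# Load a pretrained SAE from the SAELens registry.
# Release/id follow the SAELens tutorial (GPT-2 small, residual stream
# before block 8); the available names may differ across versions.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
)

# Run the base model, cache activations, and encode them into SAE features.
model = HookedSAETransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("The Golden Gate Bridge is")
feature_acts = sae.encode(cache[sae.cfg.hook_name])
print(feature_acts.shape)  # (batch, seq_len, d_sae): sparse feature activations
```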

https://huggingface.co/jbloom/Gemma-2b-Residual-Stream-SAEs

 

jbloom/Gemma-2b-Residual-Stream-SAEs · Hugging Face

Gemma 2b Residual Stream SAEs. This is a "quick and dirty" SAE release to unblock researchers. These SAEs have not been extensively studied or characterized.

huggingface.co

The models are published on Hugging Face as well.

These SAEs were trained on the residual stream, which supposedly cuts the data size down a lot.
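
As a quick refresher on what these SAEs actually compute, here's a tiny self-contained sketch in PyTorch (my own illustration, not the exact architecture of this release): encode residual-stream activations into a much wider sparse feature space, reconstruct them, and push the features toward sparsity with an L1 penalty.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Illustrative SAE: d_model residual activations -> d_sae sparse features."""
    def __init__(self, d_model: int = 2048, d_sae: int = 16384):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_sae)
        self.W_dec = nn.Linear(d_sae, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.W_enc(x))  # sparse feature activations
        x_hat = self.W_dec(f)          # reconstruction of the input
        return x_hat, f

sae = SparseAutoencoder()
x = torch.randn(8, 2048)               # stand-in residual-stream activations
x_hat, f = sae(x)
l1_coeff = 1e-3                        # sparsity strength (hyperparameter)
loss = (x - x_hat).pow(2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
print(loss.item(), (f > 0).float().mean().item())  # loss, fraction of active features
```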

 

 

Below are the rest of the materials, saved here for reference.

https://huggingface.co/google/gemma-scope

 

google/gemma-scope · Hugging Face

Gemma Scope: This is a landing page for Gemma Scope, a comprehensive, open suite of sparse autoencoders for Gemma 2 9B and 2B. Sparse Autoencoders are a "microscope" of sorts that can help us break down a model's internal activations into the underlying concepts.

huggingface.co
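
If you want to poke at Gemma Scope weights directly, the release is organised as per-layer parameter files on the Hub. Here's a hedged sketch of downloading one SAE's parameters; the repo id and file path follow the Gemma Scope tutorial and are assumptions, so browse the repo for the layers and widths that actually exist.

```python
import numpy as np
from huggingface_hub import hf_hub_download

# Repo id / file path follow the Gemma Scope tutorial (residual-stream SAEs
# for Gemma 2 2B); treat them as assumptions and check the repo listing.
path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",
    filename="layer_20/width_16k/average_l0_71/params.npz",
)
params = np.load(path)
print({k: params[k].shape for k in params.files})  # encoder/decoder weights, biases
```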

 

https://www.neuronpedia.org/

 

Neuronpedia

Open Interpretability Platform

www.neuronpedia.org

Apparently the released models get published here.

 

https://github.com/EleutherAI/sae

 

GitHub - EleutherAI/sae: Sparse autoencoders

Sparse autoencoders. Contribute to EleutherAI/sae development by creating an account on GitHub.

github.com

There's one more SAE codebase, but it doesn't look as beginner-friendly, so...

 

https://transformer-circuits.pub/2024/april-update/index.html

 

Circuits Updates - April 2024

We report a number of developing ideas on the Anthropic interpretability team, which might be of interest to researchers working actively in this space. Some of these are emerging strands of research where we expect to publish more in the coming months.

transformer-circuits.pub

https://www.anthropic.com/research#interpretability

 

Research

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

www.anthropic.com

Keep an eye on the updates here too.

 

https://www.salesforceairesearch.com/crm-benchmark

 

Generative AI Benchmark for CRM | Salesforce AI Research

Powering the world's smartest CRM by embedding state-of-the-art deep learning technology into the Salesforce Platform.

www.salesforceairesearch.com

Check the benchmark scores and pick a model that's small but performs well.

https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966

 

🪐 SmolLM - a HuggingFaceTB Collection

A series of smol LLMs: 135M, 360M and 1.7B. We release base and Instruct models as well as the training corpus and some WebGPU demos

huggingface.co

It feels like this might end up being the model...
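
If SmolLM does end up being the base model, loading it with transformers is straightforward. A minimal sketch using the smallest checkpoint from the collection above (swap in the 360M or 1.7B model id as needed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Smallest model in the SmolLM collection (135M parameters).
checkpoint = "HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

inputs = tokenizer("Sparse autoencoders are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```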
