2024.11.07 - [Artificial Intelligence/XAI] - Reducing Bias in LLMs with Sparse Autoencoders - Occupations by Gender 3
The SAE is attached to layer 7.
The model has layers up to 11, so let's go through them one by one.
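For reference, the two similarity columns in the tables that follow can be computed as below. This is only a minimal sketch: the three vectors are hypothetical placeholders, and the function names `cosine_similarity` and `similarity_row` are made up for illustration. In the real experiment the vectors would be the model's activations for each prompt at the chosen layer, which this post does not show.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_row(job_vec, woman_vec, man_vec):
    """Return (cos with woman, cos with man) for one job vector."""
    return (cosine_similarity(job_vec, woman_vec),
            cosine_similarity(job_vec, man_vec))

# Hypothetical placeholder vectors (assumption): real ones would be
# layer activations for the prompts "woman", "man", and a job title.
woman = [1.0, 0.0, 0.0]
man = [0.0, 1.0, 0.0]
job = [3.0, 4.0, 0.0]

cos_woman, cos_man = similarity_row(job, woman, man)
print(f"woman: {cos_woman:.2f}, man: {cos_man:.2f}")
```

Cosine similarity ignores vector length, so two activations can score close to 1 whenever they share one large common component, regardless of how the rest differs.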
Job | Gender Dominance | Female Percentage | Cosine Similarity with Woman | Cosine Similarity with Man |
---|---|---|---|---|
skincare specialist | Female | 98.2 | 0.885237 | 0.851516 |
kindergarten teacher | Female | 96.8 | 0.879833 | 0.844357 |
childcare worker | Female | 94.6 | 0.908659 | 0.854653 |
secretary | Female | 92.5 | 0.866899 | 0.826608 |
hairstylist | Female | 92.4 | 0.881781 | 0.852713 |
dental assistant | Female | 92 | 0.885272 | 0.84631 |
nurse | Female | 91.3 | 0.892691 | 0.846891 |
school psychologist | Female | 90.4 | 0.880224 | 0.829196 |
receptionist | Female | 90 | 0.884083 | 0.84414 |
vet | Female | 89.8 | 0.869386 | 0.852478 |
nutritionist | Female | 89.6 | 0.886887 | 0.840343 |
maid | Female | 88.7 | 0.919116 | 0.86704 |
therapist | Female | 87.1 | 0.891367 | 0.849494 |
social worker | Female | 86.8 | 0.873986 | 0.820296 |
sewer | Female | 86.5 | 0.880158 | 0.854183 |
paralegal | Female | 84.8 | 0.851667 | 0.838413 |
library assistant | Female | 84.2 | 0.880276 | 0.830921 |
interior designer | Female | 83.8 | 0.868622 | 0.829819 |
manicurist | Female | 83 | 0.920434 | 0.916763 |
special education teacher | Female | 82.8 | 0.870975 | 0.816348 |
police officer | Male | 15.8 | 0.850793 | 0.803566 |
taxi driver | Male | 12 | 0.879322 | 0.84247 |
computer architect | Male | 11.8 | 0.88091 | 0.84602 |
mechanical engineer | Male | 9.4 | 0.909123 | 0.881162 |
truck driver | Male | 7.9 | 0.891951 | 0.859146 |
electrical engineer | Male | 7 | 0.864773 | 0.823462 |
landscaping worker | Male | 6.2 | 0.888274 | 0.854544 |
pilot | Male | 5.3 | 0.876505 | 0.845214 |
repair worker | Male | 5.1 | 0.896395 | 0.852561 |
firefighter | Male | 5.1 | 0.898523 | 0.863758 |
construction worker | Male | 4.2 | 0.888954 | 0.848474 |
machinist | Male | 3.4 | 0.888064 | 0.871639 |
aircraft mechanic | Male | 3.2 | 0.891025 | 0.853536 |
carpenter | Male | 3.1 | 0.86811 | 0.843074 |
roofer | Male | 2.9 | 0.847142 | 0.831337 |
brickmason | Male | 2.2 | 0.867273 | 0.848192 |
plumber | Male | 2.1 | 0.87239 | 0.842234 |
electrician | Male | 1.7 | 0.894518 | 0.857185 |
vehicle technician | Male | 1.2 | 0.880496 | 0.841039 |
crane operator | Male | 1.1 | 0.89262 | 0.8612 |
What is this...?
Job | Gender Dominance | Female Percentage | Cosine Similarity with Woman | Cosine Similarity with Man |
---|---|---|---|---|
skincare specialist | Female | 98.2 | -0.28427 | -0.18885 |
kindergarten teacher | Female | 96.8 | -0.29349 | -0.15017 |
childcare worker | Female | 94.6 | 0.183231 | -0.09855 |
secretary | Female | 92.5 | -0.05228 | -0.09362 |
hairstylist | Female | 92.4 | -0.15438 | -0.00019 |
dental assistant | Female | 92 | -0.21715 | -0.25152 |
nurse | Female | 91.3 | 0.197028 | 0.026283 |
school psychologist | Female | 90.4 | -0.02939 | -0.2249 |
receptionist | Female | 90 | -0.11885 | -0.12152 |
vet | Female | 89.8 | -0.05999 | 0.101977 |
nutritionist | Female | 89.6 | -0.10836 | -0.26641 |
maid | Female | 88.7 | 0.452602 | 0.148875 |
therapist | Female | 87.1 | -0.03172 | -0.11619 |
social worker | Female | 86.8 | -0.01379 | -0.19423 |
sewer | Female | 86.5 | -0.03255 | 0.029441 |
paralegal | Female | 84.8 | -0.11245 | 0.103436 |
library assistant | Female | 84.2 | -0.06259 | -0.19382 |
interior designer | Female | 83.8 | -0.14656 | -0.16534 |
manicurist | Female | 83 | 0.325327 | 0.609222 |
special education teacher | Female | 82.8 | -0.18254 | -0.27691 |
police officer | Male | 15.8 | 0.061902 | -0.04618 |
taxi driver | Male | 12 | -0.14195 | -0.17661 |
computer architect | Male | 11.8 | -0.07117 | -0.07187 |
mechanical engineer | Male | 9.4 | -0.03773 | 0.142761 |
truck driver | Male | 7.9 | -0.04467 | -0.04188 |
electrical engineer | Male | 7 | -0.31577 | -0.31442 |
landscaping worker | Male | 6.2 | -0.31069 | -0.2016 |
pilot | Male | 5.3 | 0.000462 | -0.00561 |
repair worker | Male | 5.1 | 0.034276 | -0.14766 |
firefighter | Male | 5.1 | 0.030996 | -0.07756 |
construction worker | Male | 4.2 | -0.15419 | -0.11414 |
machinist | Male | 3.4 | -0.06887 | 0.178433 |
aircraft mechanic | Male | 3.2 | -0.17256 | -0.20201 |
carpenter | Male | 3.1 | -0.37341 | -0.0821 |
roofer | Male | 2.9 | -0.14815 | 0.04336 |
brickmason | Male | 2.2 | -0.14312 | 0.037151 |
plumber | Male | 2.1 | -0.11869 | -0.05927 |
electrician | Male | 1.7 | -0.09495 | -0.06853 |
vehicle technician | Male | 1.2 | -0.09524 | -0.08894 |
crane operator | Male | 1.1 | -0.26544 | -0.11965 |
Am I doing something wrong...?
This time, layer 11.
Job | Gender Dominance | Female Percentage | Cosine Similarity with Woman | Cosine Similarity with Man |
---|---|---|---|---|
skincare specialist | Female | 98.2 | 0.952556 | 0.936987 |
kindergarten teacher | Female | 96.8 | 0.951825 | 0.935702 |
childcare worker | Female | 94.6 | 0.957717 | 0.932282 |
secretary | Female | 92.5 | 0.93632 | 0.920797 |
hairstylist | Female | 92.4 | 0.949957 | 0.937659 |
dental assistant | Female | 92 | 0.950961 | 0.934148 |
nurse | Female | 91.3 | 0.952866 | 0.933521 |
school psychologist | Female | 90.4 | 0.945576 | 0.921782 |
receptionist | Female | 90 | 0.95102 | 0.93408 |
vet | Female | 89.8 | 0.942855 | 0.93628 |
nutritionist | Female | 89.6 | 0.949447 | 0.927987 |
maid | Female | 88.7 | 0.962526 | 0.939558 |
therapist | Female | 87.1 | 0.952193 | 0.934614 |
social worker | Female | 86.8 | 0.940401 | 0.916169 |
sewer | Female | 86.5 | 0.949164 | 0.938944 |
paralegal | Female | 84.8 | 0.939479 | 0.933251 |
library assistant | Female | 84.2 | 0.943475 | 0.922672 |
interior designer | Female | 83.8 | 0.94191 | 0.926157 |
manicurist | Female | 83 | 0.968939 | 0.966691 |
special education teacher | Female | 82.8 | 0.942118 | 0.917624 |
police officer | Male | 15.8 | 0.925999 | 0.901801 |
taxi driver | Male | 12 | 0.94509 | 0.928054 |
computer architect | Male | 11.8 | 0.945797 | 0.928877 |
mechanical engineer | Male | 9.4 | 0.961702 | 0.947846 |
truck driver | Male | 7.9 | 0.951451 | 0.935774 |
electrical engineer | Male | 7 | 0.93818 | 0.918885 |
landscaping worker | Male | 6.2 | 0.954152 | 0.937349 |
pilot | Male | 5.3 | 0.942039 | 0.930336 |
repair worker | Male | 5.1 | 0.950627 | 0.929089 |
firefighter | Male | 5.1 | 0.955618 | 0.939639 |
construction worker | Male | 4.2 | 0.948317 | 0.930044 |
machinist | Male | 3.4 | 0.953453 | 0.946909 |
aircraft mechanic | Male | 3.2 | 0.951964 | 0.934727 |
carpenter | Male | 3.1 | 0.945402 | 0.935081 |
roofer | Male | 2.9 | 0.935538 | 0.928428 |
brickmason | Male | 2.2 | 0.940855 | 0.932296 |
plumber | Male | 2.1 | 0.947161 | 0.933655 |
electrician | Male | 1.7 | 0.95383 | 0.937022 |
vehicle technician | Male | 1.2 | 0.939626 | 0.920676 |
crane operator | Male | 1.1 | 0.953499 | 0.938783 |
There is no visible clustering at all.
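One way to put a number on "no clustering" is to compare the woman-minus-man gap between the two job groups. The sketch below uses only four rows copied from the layer-11 table above, and the helper name `mean_gap` is made up for illustration, so treat it as a sanity check rather than a full analysis.

```python
# (job, gender dominance, cos with woman, cos with man)
# Four rows copied from the layer-11 table; the full table has 40 jobs.
rows = [
    ("skincare specialist", "Female", 0.952556, 0.936987),
    ("nurse", "Female", 0.952866, 0.933521),
    ("electrician", "Male", 0.95383, 0.937022),
    ("crane operator", "Male", 0.953499, 0.938783),
]

def mean_gap(rows, dominance):
    """Average (cos with woman - cos with man) over one dominance group."""
    gaps = [w - m for _, d, w, m in rows if d == dominance]
    return sum(gaps) / len(gaps)

female_gap = mean_gap(rows, "Female")
male_gap = mean_gap(rows, "Male")
print(f"female-dominated jobs: {female_gap:.4f}")
print(f"male-dominated jobs:   {male_gap:.4f}")
```

If the layer separated the two groups, the male-dominated gap would be expected to flip sign or at least shrink noticeably. For these four rows both gaps are positive and of similar size.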
Job | Gender Dominance | Female Percentage | Cosine Similarity with Woman | Cosine Similarity with Man |
---|---|---|---|---|
woman | Female | 100 | 1 | 0.977756 |
skincare specialist | Female | 98.2 | 0.952556 | 0.936987 |
kindergarten teacher | Female | 96.8 | 0.951825 | 0.935702 |
childcare worker | Female | 94.6 | 0.957717 | 0.932282 |
secretary | Female | 92.5 | 0.93632 | 0.920797 |
hairstylist | Female | 92.4 | 0.949957 | 0.937659 |
dental assistant | Female | 92 | 0.950961 | 0.934148 |
nurse | Female | 91.3 | 0.952866 | 0.933521 |
school psychologist | Female | 90.4 | 0.945576 | 0.921782 |
receptionist | Female | 90 | 0.95102 | 0.93408 |
vet | Female | 89.8 | 0.942855 | 0.93628 |
nutritionist | Female | 89.6 | 0.949447 | 0.927987 |
maid | Female | 88.7 | 0.962526 | 0.939558 |
therapist | Female | 87.1 | 0.952193 | 0.934614 |
social worker | Female | 86.8 | 0.940401 | 0.916169 |
sewer | Female | 86.5 | 0.949164 | 0.938944 |
paralegal | Female | 84.8 | 0.939479 | 0.933251 |
library assistant | Female | 84.2 | 0.943475 | 0.922672 |
interior designer | Female | 83.8 | 0.94191 | 0.926157 |
manicurist | Female | 83 | 0.968939 | 0.966691 |
special education teacher | Female | 82.8 | 0.942118 | 0.917624 |
police officer | Male | 15.8 | 0.925999 | 0.901801 |
taxi driver | Male | 12 | 0.94509 | 0.928054 |
computer architect | Male | 11.8 | 0.945797 | 0.928877 |
mechanical engineer | Male | 9.4 | 0.961702 | 0.947846 |
truck driver | Male | 7.9 | 0.951451 | 0.935774 |
electrical engineer | Male | 7 | 0.93818 | 0.918885 |
landscaping worker | Male | 6.2 | 0.954152 | 0.937349 |
pilot | Male | 5.3 | 0.942039 | 0.930336 |
repair worker | Male | 5.1 | 0.950627 | 0.929089 |
firefighter | Male | 5.1 | 0.955618 | 0.939639 |
construction worker | Male | 4.2 | 0.948317 | 0.930044 |
machinist | Male | 3.4 | 0.953453 | 0.946909 |
aircraft mechanic | Male | 3.2 | 0.951964 | 0.934727 |
carpenter | Male | 3.1 | 0.945402 | 0.935081 |
roofer | Male | 2.9 | 0.935538 | 0.928428 |
brickmason | Male | 2.2 | 0.940855 | 0.932296 |
plumber | Male | 2.1 | 0.947161 | 0.933655 |
electrician | Male | 1.7 | 0.95383 | 0.937022 |
vehicle technician | Male | 1.2 | 0.939626 | 0.920676 |
crane operator | Male | 1.1 | 0.953499 | 0.938783 |
man | Male | 0.1 | 0.977756 | 1 |
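To begin with, "man" and "woman" are too close to each other (cosine similarity 0.977756)...?
Hmm, let's compute the cosine similarity at the GPT-2 small SAE layer.
prompt 1 | prompt 2 | cos similarity |
---|---|---|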
man | woman | 0.68 |
he | she | 0.56 |
boy | girl | 0.48 |
his | her | 0.62 |
male | female | 0.39 |
Let's continue with "female" and "male", the pair with the lowest similarity.
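The choice can be reproduced directly from the pair table above. This is a small sketch with the values typed in by hand; the variable names are arbitrary.

```python
# Cosine similarities copied from the GPT-2 small SAE-layer pair table.
pairs = {
    ("man", "woman"): 0.68,
    ("he", "she"): 0.56,
    ("boy", "girl"): 0.48,
    ("his", "her"): 0.62,
    ("male", "female"): 0.39,
}

# The pair whose two prompts are least similar is the best-separated
# candidate for a gender axis.
lowest_pair = min(pairs, key=pairs.get)
print(lowest_pair, pairs[lowest_pair])
```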
Job | Gender Dominance | Female Percentage | Cosine Similarity with female | Cosine Similarity with male |
---|---|---|---|---|
skincare specialist | Female | 98.2 | 0.124757 | 0.134896 |
kindergarten teacher | Female | 96.8 | 0.147743 | 0.160076 |
childcare worker | Female | 94.6 | 0.141078 | 0.146944 |
secretary | Female | 92.5 | 0.198375 | 0.201354 |
hairstylist | Female | 92.4 | 0.128075 | 0.137902 |
dental assistant | Female | 92 | 0.195335 | 0.204906 |
nurse | Female | 91.3 | 0.244849 | 0.261608 |
school psychologist | Female | 90.4 | 0.169121 | 0.154034 |
receptionist | Female | 90 | 0.178872 | 0.179647 |
vet | Female | 89.8 | 0.273882 | 0.289234 |
nutritionist | Female | 89.6 | 0.177132 | 0.168246 |
maid | Female | 88.7 | 0.238828 | 0.239321 |
therapist | Female | 87.1 | 0.180114 | 0.187415 |
social worker | Female | 86.8 | 0.167729 | 0.166011 |
sewer | Female | 86.5 | 0.213534 | 0.227236 |
paralegal | Female | 84.8 | 0.133382 | 0.142434 |
library assistant | Female | 84.2 | 0.108575 | 0.109091 |
interior designer | Female | 83.8 | 0.192761 | 0.199092 |
manicurist | Female | 83 | 0.123101 | 0.1409 |
special education teacher | Female | 82.8 | 0.136673 | 0.139805 |
police officer | Male | 15.8 | 0.164144 | 0.15197 |
taxi driver | Male | 12 | 0.167654 | 0.159419 |
computer architect | Male | 11.8 | 0.144956 | 0.147659 |
mechanical engineer | Male | 9.4 | 0.197546 | 0.207421 |
truck driver | Male | 7.9 | 0.208168 | 0.221485 |
electrical engineer | Male | 7 | 0.156188 | 0.155851 |
landscaping worker | Male | 6.2 | 0.124597 | 0.128934 |
pilot | Male | 5.3 | 0.210591 | 0.219529 |
repair worker | Male | 5.1 | 0.128081 | 0.133228 |
firefighter | Male | 5.1 | 0.213929 | 0.219438 |
construction worker | Male | 4.2 | 0.186627 | 0.186755 |
machinist | Male | 3.4 | 0.165033 | 0.17933 |
aircraft mechanic | Male | 3.2 | 0.185949 | 0.195976 |
carpenter | Male | 3.1 | 0.166936 | 0.179504 |
roofer | Male | 2.9 | 0.21479 | 0.22808 |
brickmason | Male | 2.2 | 0.130557 | 0.144164 |
plumber | Male | 2.1 | 0.199274 | 0.211099 |
electrician | Male | 1.7 | 0.128658 | 0.128711 |
vehicle technician | Male | 1.2 | 0.168419 | 0.176294 |
crane operator | Male | 1.1 | 0.164983 | 0.177588 |
Hmm...
This time the similarity with "male" is higher for most jobs.
So now let's compute the cosine similarity at layer 11.
prompt 1 | prompt 2 | cos similarity |
---|---|---|
man | woman | 0.98 |
he | she | 0.97 |
boy | girl | 0.97 |
his | her | 0.97 |
male | female | 0.99 |
These are all just high.
Let's take a look at layer 2.
prompt 1 | prompt 2 | cos similarity |
---|---|---|
man | woman | 0.88 |
he | she | 0.88 |
boy | girl | 0.87 |
his | her | 0.80 |
male | female | 0.92 |
man | he | 0.75 |
woman | she | 0.79 |
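...?
To begin with, "man" and "woman" are more similar to each other than they are to the jobs.
Back to Mistral...
With Mistral, the cosine similarity tends to drop as the layers get deeper, so I used layer 31.
Across the board, the cosine similarity with the jobs is higher for "woman" than for "man", which seems to be why "woman" scores higher for every single job.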
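Because "woman" scores higher everywhere, the raw gap mostly reflects that shared offset. One possible way to look past it, offered here as a suggestion rather than something done in this post, is to subtract the mean gap from every job's gap. The sketch uses four arbitrarily chosen rows from the Mistral layer-31 table in this post, so no conclusion should be drawn from its output.

```python
# (job, gender dominance, cos with woman, cos with man)
# Four arbitrarily chosen rows from the Mistral layer-31 table.
rows = [
    ("nurse", "Female", 0.576348, 0.327447),
    ("maid", "Female", 0.426042, 0.352122),
    ("repair worker", "Male", 0.641669, 0.318395),
    ("carpenter", "Male", 0.490861, 0.403917),
]

# Raw gap: how much closer each job is to "woman" than to "man".
raw_gaps = {job: w - m for job, _, w, m in rows}

# Shared offset (assumption: the mean gap over the listed jobs is a
# reasonable stand-in for the baseline).
baseline = sum(raw_gaps.values()) / len(raw_gaps)

# Centered gap: positive means the job leans toward "woman" more than
# the average listed job does, negative means it leans less.
centered_gaps = {job: gap - baseline for job, gap in raw_gaps.items()}

for job, _, _, _ in rows:
    print(f"{job:15s} raw={raw_gaps[job]:+.4f} centered={centered_gaps[job]:+.4f}")
```

Running the same calculation over all 40 jobs would show whether the centered gap tracks the Female Percentage column at all.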
Job | Gender Dominance | Female Percentage | Cosine Similarity with Woman | Cosine Similarity with Man |
---|---|---|---|---|
woman | Female | 100 | 1 | 0.478665 |
skincare specialist | Female | 98.2 | 0.526142 | 0.337138 |
kindergarten teacher | Female | 96.8 | 0.571344 | 0.319337 |
childcare worker | Female | 94.6 | 0.653964 | 0.321974 |
secretary | Female | 92.5 | 0.490099 | 0.289095 |
hairstylist | Female | 92.4 | 0.475824 | 0.333437 |
dental assistant | Female | 92 | 0.46657 | 0.21639 |
nurse | Female | 91.3 | 0.576348 | 0.327447 |
school psychologist | Female | 90.4 | 0.555125 | 0.294005 |
receptionist | Female | 90 | 0.579533 | 0.30584 |
vet | Female | 89.8 | 0.380612 | 0.295048 |
nutritionist | Female | 89.6 | 0.530972 | 0.265065 |
maid | Female | 88.7 | 0.426042 | 0.352122 |
therapist | Female | 87.1 | 0.616198 | 0.312762 |
social worker | Female | 86.8 | 0.598205 | 0.277394 |
sewer | Female | 86.5 | 0.338412 | 0.328602 |
paralegal | Female | 84.8 | 0.473365 | 0.355198 |
library assistant | Female | 84.2 | 0.306555 | 0.190848 |
interior designer | Female | 83.8 | 0.553535 | 0.26863 |
manicurist | Female | 83 | 0.646296 | 0.64153 |
special education teacher | Female | 82.8 | 0.528594 | 0.267602 |
police officer | Male | 15.8 | 0.600642 | 0.303183 |
taxi driver | Male | 12 | 0.583161 | 0.318098 |
computer architect | Male | 11.8 | 0.515812 | 0.300871 |
mechanical engineer | Male | 9.4 | 0.560198 | 0.285663 |
truck driver | Male | 7.9 | 0.604801 | 0.328755 |
electrical engineer | Male | 7 | 0.556409 | 0.275509 |
landscaping worker | Male | 6.2 | 0.600496 | 0.296639 |
pilot | Male | 5.3 | 0.509496 | 0.349558 |
repair worker | Male | 5.1 | 0.641669 | 0.318395 |
firefighter | Male | 5.1 | 0.512231 | 0.322519 |
construction worker | Male | 4.2 | 0.611927 | 0.302671 |
machinist | Male | 3.4 | 0.507236 | 0.3774 |
aircraft mechanic | Male | 3.2 | 0.546825 | 0.296465 |
carpenter | Male | 3.1 | 0.490861 | 0.403917 |
roofer | Male | 2.9 | 0.465377 | 0.376143 |
brickmason | Male | 2.2 | 0.471654 | 0.331363 |
plumber | Male | 2.1 | 0.432962 | 0.369362 |
electrician | Male | 1.7 | 0.538109 | 0.291336 |
vehicle technician | Male | 1.2 | 0.480898 | 0.273057 |
crane operator | Male | 1.1 | 0.558155 | 0.364165 |
man | Male | 0 | 0.478665 | 1 |
The values are just widely spread out.
Job: doctor
Cosine similarity with man: 0.29202771186828613
Cosine similarity with woman: 0.3873016834259033
Cosine similarity with he: 0.18976394832134247
Cosine similarity with she: 0.24517714977264404
Cosine similarity with male: 0.26682591438293457
Cosine similarity with female: 0.30686309933662415
Job: nurse
Cosine similarity with man: 0.32744666934013367
Cosine similarity with woman: 0.5763475894927979
Cosine similarity with he: 0.32279425859451294
Cosine similarity with she: 0.2484569251537323
Cosine similarity with male: 0.2853856086730957
Cosine similarity with female: 0.3248463273048401
Job: engineer
Cosine similarity with man: 0.3628814220428467
Cosine similarity with woman: 0.535031795501709
Cosine similarity with he: 0.30105212330818176
Cosine similarity with she: 0.24273112416267395
Cosine similarity with male: 0.2686672806739807
Cosine similarity with female: 0.31331366300582886
Job: teacher
Cosine similarity with man: 0.3335043787956238
Cosine similarity with woman: 0.5944691300392151
Cosine similarity with he: 0.3386254608631134
Cosine similarity with she: 0.24903792142868042
Cosine similarity with male: 0.2746357321739197
Cosine similarity with female: 0.33448266983032227
Job: scientist
Cosine similarity with man: 0.3251022696495056
Cosine similarity with woman: 0.6097099781036377
Cosine similarity with he: 0.37819525599479675
Cosine similarity with she: 0.2599332332611084
Cosine similarity with male: 0.2557474374771118
Cosine similarity with female: 0.3047844469547272
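If anything, the model's output is worse than what the paper reported...
It only shows up when I do it this way.
Anyway, now that we have confirmed the bias exists, let's find the features that produce it.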