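2024.11.05 - [AI / Paper Reviews or In Progress] - Bias and Fairness in Large Language Models: A Survey
As covered in that paper, I plan to use an SAE (sparse autoencoder) to detect bias in a language model and then try to reduce it.
Following the approach used there, I'll start by checking male/female bias for occupations.
The model used here is Mistral 7B.
From US employment data, I picked Nurse, an occupation with a strong gender skew, and fed it to the model.
The ranks differ substantially, and the predicted next tokens differ as well.
Pilot shows less of a difference than I expected.
Next, I'll look at which SAE features activate for each occupation and compute similarities between them.
First, let's get the features for man and woman.
The cosine similarity between these two words comes out to 0.30.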
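For reference, the cosine similarity used throughout this post can be sketched as below. This is a minimal illustration, not the exact code that produced the numbers: the toy vectors are made up, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return float(np.dot(a, b) / denom)

# Toy, made-up activation vectors (stand-ins for SAE feature activations)
feat_a = np.array([0.0, 2.0, 0.0, 1.0])
feat_b = np.array([0.0, 1.0, 3.0, 0.0])
print(cosine_similarity(feat_a, feat_b))
```

If the SAE activations are non-negative (as they typically are after a ReLU), the cosine similarity between two such vectors cannot be negative, which is consistent with every raw similarity in this post being positive.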
Now let's compute the same thing for pilot and nurse.
man, pilot : 0.36
woman, pilot : 0.38
Surprisingly, the cosine similarity is actually slightly higher for woman...?
man, nurse : 0.29
woman, nurse : 0.46
For nurse, there is a clear skew toward woman.
Next, let's check doctor, which often comes up as an example.
man, doctor : 0.38
woman, doctor : 0.40
No large bias is noticeable here either.
boy, pilot : 0.42
boy, doctor : 0.48
boy, nurse : 0.39
girl, pilot : 0.41
girl, doctor : 0.45
girl, nurse : 0.44
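For man and woman, pilot and doctor show almost no gap, perhaps because some bias mitigation was applied, whereas for boy and girl the stereotypical direction is still visible: boy is closer to pilot and doctor, and girl is closer to nurse.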
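The comparison can be made explicit by computing the female-minus-male gap for each occupation. The sketch below only re-uses the similarity values listed above; the `gap` helper is a hypothetical name introduced for illustration.

```python
# Cosine similarities copied from the measurements listed above
sims = {
    ("man", "pilot"): 0.36, ("woman", "pilot"): 0.38,
    ("man", "nurse"): 0.29, ("woman", "nurse"): 0.46,
    ("man", "doctor"): 0.38, ("woman", "doctor"): 0.40,
    ("boy", "pilot"): 0.42, ("girl", "pilot"): 0.41,
    ("boy", "doctor"): 0.48, ("girl", "doctor"): 0.45,
    ("boy", "nurse"): 0.39, ("girl", "nurse"): 0.44,
}

def gap(female_word, male_word, job):
    """Female-minus-male similarity gap; positive means closer to the female word."""
    return round(sims[(female_word, job)] - sims[(male_word, job)], 2)

gaps = {
    (f, m, job): gap(f, m, job)
    for f, m in [("woman", "man"), ("girl", "boy")]
    for job in ["pilot", "doctor", "nurse"]
}
for key, value in gaps.items():
    print(key, value)
```

A positive value means the occupation is closer to the female word, a negative value that it is closer to the male word.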
This looks like a good place to continue the investigation.
Job | Gender Dominance | Female Percentage | Cosine Similarity with Woman | Cosine Similarity with Man |
skincare specialist | Female | 98.2 | 0.127343 | 0.024672 |
kindergarten teacher | Female | 96.8 | 0.199354 | 0.016785 |
childcare worker | Female | 94.6 | 0.164808 | 0.017606 |
secretary | Female | 92.5 | 0.315144 | 0.309222 |
hairstylist | Female | 92.4 | 0.169079 | 0.037896 |
dental assistant | Female | 92 | 0.105586 | 0.014967 |
nurse | Female | 91.3 | 0.462435 | 0.29445 |
school psychologist | Female | 90.4 | 0.161357 | 0.028499 |
receptionist | Female | 90 | 0.196856 | 0.095427 |
vet | Female | 89.8 | 0.265806 | 0.459221 |
nutritionist | Female | 89.6 | 0.114461 | 0.030599 |
maid | Female | 88.7 | 0.32113 | 0.366964 |
therapist | Female | 87.1 | 0.3653 | 0.25345 |
social worker | Female | 86.8 | 0.143986 | 0.023812 |
sewer | Female | 86.5 | 0.249419 | 0.203909 |
paralegal | Female | 84.8 | 0.188392 | 0.087965 |
library assistant | Female | 84.2 | 0.029717 | 0.038055 |
interior designer | Female | 83.8 | 0.125019 | 0.015243 |
manicurist | Female | 83 | 0.140123 | 0.02464 |
special education teacher | Female | 82.8 | 0.156377 | 0.016994 |
police officer | Male | 15.8 | 0.403516 | 0.035085 |
taxi driver | Male | 12 | 0.158734 | 0.019094 |
computer architect | Male | 11.8 | 0.099024 | 0.024389 |
mechanical engineer | Male | 9.4 | 0.11853 | 0.020852 |
truck driver | Male | 7.9 | 0.159355 | 0.017617 |
electrical engineer | Male | 7 | 0.13457 | 0.028288 |
landscaping worker | Male | 6.2 | 0.196874 | 0.012888 |
pilot | Male | 5.3 | 0.379953 | 0.359545 |
repair worker | Male | 5.1 | 0.227758 | 0.02639 |
firefighter | Male | 5.1 | 0.265447 | 0.110595 |
construction worker | Male | 4.2 | 0.314471 | 0.02192 |
machinist | Male | 3.4 | 0.168814 | 0.084593 |
aircraft mechanic | Male | 3.2 | 0.172694 | 0.027612 |
carpenter | Male | 3.1 | 0.176528 | 0.167389 |
roofer | Male | 2.9 | 0.19173 | 0.091653 |
brickmason | Male | 2.2 | 0.096772 | 0.157578 |
plumber | Male | 2.1 | 0.229104 | 0.1193 |
electrician | Male | 1.7 | 0.205309 | 0.080894 |
vehicle technician | Male | 1.2 | 0.087967 | 0.021976 |
crane operator | Male | 1.1 | 0.168844 | 0.019803 |
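Running it this way causes a problem whenever a word is not a single token....
For words of two or more tokens, I need to think about whether to sum the features.
cache_job[sae_id + '.hook_sae_acts_post'][0, 1:, :].cpu().numpy().sum(0) / len(cache_job[sae_id + '.hook_sae_acts_post'][0, 1:, :])
I'll go with the mean: sum the features over the tokens and divide by the number of tokens, as in the expression above.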
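The pooling step can be sketched on its own with made-up numbers. The function name `mean_pool_features` and the toy array are assumptions for illustration; dropping row 0 mirrors the `1:` slice in the expression above, which presumably skips the BOS token.

```python
import numpy as np

def mean_pool_features(acts, skip_first=True):
    """Average per-token feature activations into one vector.

    acts: array of shape (n_tokens, n_features), one row per token.
    skip_first: drop row 0 (presumably the BOS token), as the post's slice does.
    """
    acts = np.asarray(acts, dtype=float)
    if skip_first:
        acts = acts[1:]
    if acts.shape[0] == 0:
        raise ValueError("no tokens left to average")
    return acts.sum(axis=0) / acts.shape[0]

# Made-up activations: one BOS row plus two word tokens, four features each
toy = np.array([
    [9.0, 9.0, 9.0, 9.0],  # BOS row (ignored)
    [0.0, 2.0, 0.0, 1.0],  # first token of a two-token job title
    [0.0, 0.0, 4.0, 1.0],  # second token
])
pooled = mean_pool_features(toy)
print(pooled)
```

For a single-token word the mean is just that token's activations, which would explain why rows such as nurse and pilot are identical in the table before and after this change.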
Job | Gender Dominance | Female Percentage | Cosine Similarity with Woman | Cosine Similarity with Man |
skincare specialist | Female | 98.2 | 0.260488 | 0.215686 |
kindergarten teacher | Female | 96.8 | 0.322077 | 0.193814 |
childcare worker | Female | 94.6 | 0.371105 | 0.210631 |
secretary | Female | 92.5 | 0.315144 | 0.309222 |
hairstylist | Female | 92.4 | 0.272934 | 0.231378 |
dental assistant | Female | 92 | 0.2861 | 0.172984 |
nurse | Female | 91.3 | 0.462435 | 0.29445 |
school psychologist | Female | 90.4 | 0.323982 | 0.189848 |
receptionist | Female | 90 | 0.36699 | 0.303783 |
vet | Female | 89.8 | 0.265806 | 0.459221 |
nutritionist | Female | 89.6 | 0.282028 | 0.205952 |
maid | Female | 88.7 | 0.32113 | 0.366964 |
therapist | Female | 87.1 | 0.3653 | 0.25345 |
social worker | Female | 86.8 | 0.336677 | 0.24711 |
sewer | Female | 86.5 | 0.317566 | 0.340652 |
paralegal | Female | 84.8 | 0.277907 | 0.25812 |
library assistant | Female | 84.2 | 0.066172 | 0.136949 |
interior designer | Female | 83.8 | 0.320005 | 0.203839 |
manicurist | Female | 83 | 0.303169 | 0.572691 |
special education teacher | Female | 82.8 | 0.263433 | 0.158768 |
police officer | Male | 15.8 | 0.49553 | 0.223373 |
taxi driver | Male | 12 | 0.336171 | 0.268791 |
computer architect | Male | 11.8 | 0.380066 | 0.294564 |
mechanical engineer | Male | 9.4 | 0.304589 | 0.227857 |
truck driver | Male | 7.9 | 0.375531 | 0.239713 |
electrical engineer | Male | 7 | 0.318427 | 0.231128 |
landscaping worker | Male | 6.2 | 0.317912 | 0.177979 |
pilot | Male | 5.3 | 0.379953 | 0.359545 |
repair worker | Male | 5.1 | 0.367039 | 0.210456 |
firefighter | Male | 5.1 | 0.310906 | 0.204298 |
construction worker | Male | 4.2 | 0.457457 | 0.262899 |
machinist | Male | 3.4 | 0.258913 | 0.245702 |
aircraft mechanic | Male | 3.2 | 0.284078 | 0.142456 |
carpenter | Male | 3.1 | 0.319648 | 0.383934 |
roofer | Male | 2.9 | 0.28013 | 0.308794 |
brickmason | Male | 2.2 | 0.266618 | 0.276691 |
plumber | Male | 2.1 | 0.299213 | 0.287913 |
electrician | Male | 1.7 | 0.380658 | 0.285843 |
vehicle technician | Male | 1.2 | 0.262995 | 0.190528 |
crane operator | Male | 1.1 | 0.318902 | 0.259285 |
Even with this change, the similarity with woman tends to be much higher across the board...?
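One way to check whether the similarities track the real-world skew is to look at the woman-minus-man gap and how it relates to the female percentage. The sketch below uses a hand-picked subset of rows copied from the table above; the choice of rows and the use of a Pearson correlation are assumptions made for illustration, not part of the original experiment.

```python
import numpy as np

# (job, female percentage, cos sim with woman, cos sim with man), copied from the table above
rows = [
    ("kindergarten teacher", 96.8, 0.322077, 0.193814),
    ("nurse", 91.3, 0.462435, 0.29445),
    ("vet", 89.8, 0.265806, 0.459221),
    ("manicurist", 83.0, 0.303169, 0.572691),
    ("police officer", 15.8, 0.49553, 0.223373),
    ("pilot", 5.3, 0.379953, 0.359545),
    ("carpenter", 3.1, 0.319648, 0.383934),
    ("crane operator", 1.1, 0.318902, 0.259285),
]

female_pct = np.array([r[1] for r in rows])
gap = np.array([r[2] - r[3] for r in rows])  # positive: closer to woman

share_positive = float((gap > 0).mean())
pearson_r = float(np.corrcoef(female_pct, gap)[0, 1])
print(f"share of rows closer to woman: {share_positive:.2f}")
print(f"Pearson r between female percentage and gap: {pearson_r:.2f}")
```

If the features encoded the occupational skew, the gap would be expected to grow with the female percentage; a gap that is positive regardless of the percentage points to a general offset toward woman rather than an occupation-specific bias.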
As seen earlier, let's switch to boy and girl and run it again.
Job | Gender Dominance | Female Percentage | Cosine Similarity with Girl | Cosine Similarity with Boy |
skincare specialist | Girl | 98.2 | 0.31298 | 0.393313 |
kindergarten teacher | Girl | 96.8 | 0.328278 | 0.364125 |
childcare worker | Girl | 94.6 | 0.363267 | 0.344568 |
secretary | Girl | 92.5 | 0.333868 | 0.327885 |
hairstylist | Girl | 92.4 | 0.313339 | 0.37788 |
dental assistant | Girl | 92 | 0.282136 | 0.245449 |
nurse | Girl | 91.3 | 0.444893 | 0.390956 |
school psychologist | Girl | 90.4 | 0.309917 | 0.274872 |
receptionist | Girl | 90 | 0.367575 | 0.356266 |
vet | Girl | 89.8 | 0.30818 | 0.350992 |
nutritionist | Girl | 89.6 | 0.280848 | 0.249392 |
maid | Girl | 88.7 | 0.396666 | 0.425297 |
therapist | Girl | 87.1 | 0.364104 | 0.330846 |
social worker | Girl | 86.8 | 0.362299 | 0.365651 |
sewer | Girl | 86.5 | 0.392354 | 0.525686 |
paralegal | Girl | 84.8 | 0.333346 | 0.420043 |
library assistant | Girl | 84.2 | 0.080797 | 0.096904 |
interior designer | Girl | 83.8 | 0.310031 | 0.264097 |
manicurist | Girl | 83 | 0.309796 | 0.311654 |
special education teacher | Girl | 82.8 | 0.275684 | 0.31317 |
police officer | Boy | 15.8 | 0.407741 | 0.328005 |
taxi driver | Boy | 12 | 0.324069 | 0.288184 |
computer architect | Boy | 11.8 | 0.381935 | 0.361483 |
mechanical engineer | Boy | 9.4 | 0.3109 | 0.288659 |
truck driver | Boy | 7.9 | 0.360527 | 0.324681 |
electrical engineer | Boy | 7 | 0.332193 | 0.314071 |
landscaping worker | Boy | 6.2 | 0.29865 | 0.278257 |
pilot | Boy | 5.3 | 0.415127 | 0.422821 |
repair worker | Boy | 5.1 | 0.343244 | 0.30531 |
firefighter | Boy | 5.1 | 0.326038 | 0.345952 |
construction worker | Boy | 4.2 | 0.404421 | 0.349674 |
machinist | Boy | 3.4 | 0.318438 | 0.388574 |
aircraft mechanic | Boy | 3.2 | 0.260927 | 0.216189 |
carpenter | Boy | 3.1 | 0.396866 | 0.482018 |
roofer | Boy | 2.9 | 0.358073 | 0.477668 |
brickmason | Boy | 2.2 | 0.283633 | 0.299659 |
plumber | Boy | 2.1 | 0.379595 | 0.514094 |
electrician | Boy | 1.7 | 0.399175 | 0.399546 |
vehicle technician | Boy | 1.2 | 0.257796 | 0.238517 |
crane operator | Boy | 1.1 | 0.363132 | 0.459569 |
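Here too the values go back and forth so much that it's hard to say there is any specific bias...?
I'll visualize the features with t-SNE to check whether there is a clear direction.
No other clustering is visible either.
Nothing to be found here either....
I added man and woman, but apart from sitting at opposite ends, the points are not organized at all.
Same here....
So far I've been using layer 16; let's switch to layer 24 and run it again.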
Job | Gender Dominance | Female Percentage | Cosine Similarity with girl | Cosine Similarity with boy |
skincare specialist | Female | 98.2 | 0.183896 | 0.175167 |
kindergarten teacher | Female | 96.8 | 0.201313 | 0.187313 |
childcare worker | Female | 94.6 | 0.233618 | 0.204102 |
secretary | Female | 92.5 | 0.199255 | 0.189488 |
hairstylist | Female | 92.4 | 0.194229 | 0.192112 |
dental assistant | Female | 92 | 0.145983 | 0.12946 |
nurse | Female | 91.3 | 0.240469 | 0.226662 |
school psychologist | Female | 90.4 | 0.199914 | 0.179647 |
receptionist | Female | 90 | 0.231864 | 0.207373 |
vet | Female | 89.8 | 0.116139 | 0.124447 |
nutritionist | Female | 89.6 | 0.203156 | 0.174122 |
maid | Female | 88.7 | 0.142082 | 0.139369 |
therapist | Female | 87.1 | 0.257627 | 0.23953 |
social worker | Female | 86.8 | 0.244797 | 0.221558 |
sewer | Female | 86.5 | 0.167628 | 0.172477 |
paralegal | Female | 84.8 | 0.176281 | 0.188429 |
library assistant | Female | 84.2 | 0.055279 | 0.052736 |
interior designer | Female | 83.8 | 0.155428 | 0.139924 |
manicurist | Female | 83 | 0.240344 | 0.222585 |
special education teacher | Female | 82.8 | 0.203276 | 0.183266 |
police officer | Male | 15.8 | 0.21867 | 0.182992 |
taxi driver | Male | 12 | 0.220355 | 0.1933 |
computer architect | Male | 11.8 | 0.214976 | 0.195234 |
mechanical engineer | Male | 9.4 | 0.206557 | 0.188666 |
truck driver | Male | 7.9 | 0.225139 | 0.209364 |
electrical engineer | Male | 7 | 0.207019 | 0.188141 |
landscaping worker | Male | 6.2 | 0.19653 | 0.17895 |
pilot | Male | 5.3 | 0.236793 | 0.236432 |
repair worker | Male | 5.1 | 0.281485 | 0.259632 |
firefighter | Male | 5.1 | 0.211171 | 0.212282 |
construction worker | Male | 4.2 | 0.249591 | 0.209322 |
machinist | Male | 3.4 | 0.183905 | 0.190897 |
aircraft mechanic | Male | 3.2 | 0.18299 | 0.170139 |
carpenter | Male | 3.1 | 0.222211 | 0.232715 |
roofer | Male | 2.9 | 0.201749 | 0.207536 |
brickmason | Male | 2.2 | 0.187944 | 0.189682 |
plumber | Male | 2.1 | 0.209998 | 0.210054 |
electrician | Male | 1.7 | 0.222483 | 0.212226 |
vehicle technician | Male | 1.2 | 0.148544 | 0.132306 |
crane operator | Male | 1.1 | 0.242226 | 0.229937 |
Hmm.... the cosine similarities don't look good here either....?
The table below shows the cosine similarities computed with man and woman instead.
Job | Gender Dominance | Female Percentage | Cosine Similarity with Woman | Cosine Similarity with Man |
skincare specialist | Female | 98.2 | 0.247797 | 0.194322 |
kindergarten teacher | Female | 96.8 | 0.313062 | 0.18366 |
childcare worker | Female | 94.6 | 0.400103 | 0.197539 |
secretary | Female | 92.5 | 0.270496 | 0.186289 |
hairstylist | Female | 92.4 | 0.255863 | 0.21141 |
dental assistant | Female | 92 | 0.237729 | 0.12324 |
nurse | Female | 91.3 | 0.317287 | 0.231716 |
school psychologist | Female | 90.4 | 0.334626 | 0.171646 |
receptionist | Female | 90 | 0.353504 | 0.22003 |
vet | Female | 89.8 | 0.136677 | 0.174374 |
nutritionist | Female | 89.6 | 0.305264 | 0.178819 |
maid | Female | 88.7 | 0.14222 | 0.135275 |
therapist | Female | 87.1 | 0.370287 | 0.17982 |
social worker | Female | 86.8 | 0.388765 | 0.208608 |
sewer | Female | 86.5 | 0.187976 | 0.29267 |
paralegal | Female | 84.8 | 0.217172 | 0.227265 |
library assistant | Female | 84.2 | 0.100643 | 0.035872 |
interior designer | Female | 83.8 | 0.243949 | 0.117371 |
manicurist | Female | 83 | 0.346683 | 0.561003 |
special education teacher | Female | 82.8 | 0.345291 | 0.16804 |
police officer | Male | 15.8 | 0.410808 | 0.180951 |
taxi driver | Male | 12 | 0.388704 | 0.183195 |
computer architect | Male | 11.8 | 0.327138 | 0.208445 |
mechanical engineer | Male | 9.4 | 0.309647 | 0.151924 |
truck driver | Male | 7.9 | 0.360862 | 0.17753 |
electrical engineer | Male | 7 | 0.318052 | 0.146075 |
landscaping worker | Male | 6.2 | 0.370279 | 0.120557 |
pilot | Male | 5.3 | 0.267154 | 0.212619 |
repair worker | Male | 5.1 | 0.462964 | 0.180298 |
firefighter | Male | 5.1 | 0.292514 | 0.17753 |
construction worker | Male | 4.2 | 0.443905 | 0.203085 |
machinist | Male | 3.4 | 0.213048 | 0.157416 |
aircraft mechanic | Male | 3.2 | 0.312354 | 0.12882 |
carpenter | Male | 3.1 | 0.25291 | 0.255213 |
roofer | Male | 2.9 | 0.215583 | 0.272131 |
brickmason | Male | 2.2 | 0.203943 | 0.154185 |
plumber | Male | 2.1 | 0.244214 | 0.305246 |
electrician | Male | 1.7 | 0.310555 | 0.143423 |
vehicle technician | Male | 1.2 | 0.232757 | 0.152375 |
crane operator | Male | 1.1 | 0.343778 | 0.263669 |
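Most occupations have a higher cosine similarity with woman, so this doesn't tell us much....
Still, I should at least try visualizing it...
Hmm... this time let's try layer 8 as well...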
Job | Gender Dominance | Female Percentage | Cosine Similarity with Woman | Cosine Similarity with Man |
skincare specialist | Female | 98.2 | 0.405341 | 0.304255 |
kindergarten teacher | Female | 96.8 | 0.383008 | 0.301889 |
childcare worker | Female | 94.6 | 0.391139 | 0.287552 |
secretary | Female | 92.5 | 0.453903 | 0.357609 |
hairstylist | Female | 92.4 | 0.389928 | 0.324667 |
dental assistant | Female | 92 | 0.258027 | 0.168323 |
nurse | Female | 91.3 | 0.458547 | 0.342229 |
school psychologist | Female | 90.4 | 0.331246 | 0.248281 |
receptionist | Female | 90 | 0.357404 | 0.278085 |
vet | Female | 89.8 | 0.371181 | 0.364601 |
nutritionist | Female | 89.6 | 0.339745 | 0.242513 |
maid | Female | 88.7 | 0.555928 | 0.458847 |
therapist | Female | 87.1 | 0.407176 | 0.298072 |
social worker | Female | 86.8 | 0.508688 | 0.39149 |
sewer | Female | 86.5 | 0.547029 | 0.474971 |
paralegal | Female | 84.8 | 0.417665 | 0.358631 |
library assistant | Female | 84.2 | 0.159827 | 0.149318 |
interior designer | Female | 83.8 | 0.353571 | 0.244492 |
manicurist | Female | 83 | 0.387122 | 0.612024 |
special education teacher | Female | 82.8 | 0.40541 | 0.307201 |
police officer | Male | 15.8 | 0.388537 | 0.285806 |
taxi driver | Male | 12 | 0.30301 | 0.251088 |
computer architect | Male | 11.8 | 0.426636 | 0.326642 |
mechanical engineer | Male | 9.4 | 0.342135 | 0.250907 |
truck driver | Male | 7.9 | 0.367503 | 0.272133 |
electrical engineer | Male | 7 | 0.419489 | 0.306664 |
landscaping worker | Male | 6.2 | 0.326514 | 0.226061 |
pilot | Male | 5.3 | 0.475483 | 0.391149 |
repair worker | Male | 5.1 | 0.370203 | 0.278574 |
firefighter | Male | 5.1 | 0.345793 | 0.244363 |
construction worker | Male | 4.2 | 0.414182 | 0.306116 |
machinist | Male | 3.4 | 0.364647 | 0.306591 |
aircraft mechanic | Male | 3.2 | 0.218859 | 0.149743 |
carpenter | Male | 3.1 | 0.469476 | 0.411972 |
roofer | Male | 2.9 | 0.47436 | 0.418287 |
brickmason | Male | 2.2 | 0.283209 | 0.224034 |
plumber | Male | 2.1 | 0.566133 | 0.475443 |
electrician | Male | 1.7 | 0.468038 | 0.348518 |
vehicle technician | Male | 1.2 | 0.301237 | 0.222791 |
crane operator | Male | 1.1 | 0.504272 | 0.422049 |
I don't understand why woman comes out higher for almost every occupation.....
I asked GPT to fix the code; let's see...
Job | Gender Dominance | Female Percentage | Cosine Similarity with Woman | Cosine Similarity with Man |
skincare specialist | Female | 98.2 | 0.405341 | 0.304255 |
kindergarten teacher | Female | 96.8 | 0.383008 | 0.301889 |
childcare worker | Female | 94.6 | 0.391139 | 0.287552 |
secretary | Female | 92.5 | 0.453903 | 0.357609 |
hairstylist | Female | 92.4 | 0.389928 | 0.324667 |
dental assistant | Female | 92 | 0.258027 | 0.168323 |
nurse | Female | 91.3 | 0.458547 | 0.342229 |
school psychologist | Female | 90.4 | 0.331246 | 0.248281 |
receptionist | Female | 90 | 0.357404 | 0.278085 |
vet | Female | 89.8 | 0.371181 | 0.3646 |
nutritionist | Female | 89.6 | 0.339745 | 0.242513 |
maid | Female | 88.7 | 0.555928 | 0.458847 |
therapist | Female | 87.1 | 0.407175 | 0.298072 |
social worker | Female | 86.8 | 0.508688 | 0.39149 |
sewer | Female | 86.5 | 0.547029 | 0.474971 |
paralegal | Female | 84.8 | 0.417665 | 0.358631 |
library assistant | Female | 84.2 | 0.159827 | 0.149318 |
interior designer | Female | 83.8 | 0.353571 | 0.244492 |
manicurist | Female | 83 | 0.387122 | 0.612024 |
special education teacher | Female | 82.8 | 0.40541 | 0.307201 |
police officer | Male | 15.8 | 0.388537 | 0.285806 |
taxi driver | Male | 12 | 0.30301 | 0.251088 |
computer architect | Male | 11.8 | 0.426636 | 0.326642 |
mechanical engineer | Male | 9.4 | 0.342135 | 0.250907 |
truck driver | Male | 7.9 | 0.367503 | 0.272133 |
electrical engineer | Male | 7 | 0.419489 | 0.306664 |
landscaping worker | Male | 6.2 | 0.326514 | 0.226061 |
pilot | Male | 5.3 | 0.475483 | 0.391149 |
repair worker | Male | 5.1 | 0.370203 | 0.278574 |
firefighter | Male | 5.1 | 0.345793 | 0.244363 |
construction worker | Male | 4.2 | 0.414182 | 0.306116 |
machinist | Male | 3.4 | 0.364647 | 0.306591 |
aircraft mechanic | Male | 3.2 | 0.218859 | 0.149743 |
carpenter | Male | 3.1 | 0.469476 | 0.411972 |
roofer | Male | 2.9 | 0.47436 | 0.418287 |
brickmason | Male | 2.2 | 0.283209 | 0.224034 |
plumber | Male | 2.1 | 0.566133 | 0.475443 |
electrician | Male | 1.7 | 0.468038 | 0.348518 |
vehicle technician | Male | 1.2 | 0.301237 | 0.222791 |
crane operator | Male | 1.1 | 0.504272 | 0.422049 |
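The results are identical, haha.....
I just can't see any clusters...
This time I changed the procedure to apply PCA first and then compute the cosine similarity.
The feature vectors have more than 65,000 dimensions, after all...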
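One thing to keep in mind with this change is that PCA centers the data before projecting it, so cosine similarities computed afterwards can be negative even when the original vectors are all non-negative. The sketch below illustrates this with made-up sparse vectors and a plain numpy SVD standing in for a PCA library; `pca_fit_transform` and `cosine` are hypothetical helper names, and the array sizes are arbitrary.

```python
import numpy as np

def pca_fit_transform(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # PCA centers the data first, so projected coordinates can be negative
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]
    return (X - mean) @ components.T

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Made-up stand-in for SAE activations: 42 sparse, non-negative vectors
X = rng.random((42, 500)) * (rng.random((42, 500)) < 0.05)

# Compare every row against the last one, before and after PCA
raw = [cosine(X[i], X[-1]) for i in range(len(X) - 1)]
Z = pca_fit_transform(X, n_components=20)
reduced = [cosine(Z[i], Z[-1]) for i in range(len(Z) - 1)]

print(min(raw) >= 0.0)     # non-negative vectors cannot have a negative cosine
print(min(reduced) < 0.0)  # centered projections sum to zero, so some cosine is negative
```

Because the centered projections sum to zero, at least one other row must have a negative dot product with any given reference row, so negative values after PCA do not by themselves indicate dissimilarity in the original feature space.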
Job | Gender Dominance | Female Percentage | Cosine Similarity with Woman | Cosine Similarity with Man |
skincare specialist | Female | 98.2 | -0.19681 | -0.26029 |
kindergarten teacher | Female | 96.8 | -0.16023 | -0.15596 |
childcare worker | Female | 94.6 | -0.25054 | -0.30546 |
secretary | Female | 92.5 | 0.16238949024719976 | 0.087834 |
hairstylist | Female | 92.4 | -0.33054 | -0.17093 |
dental assistant | Female | 92 | -0.19767 | -0.20021 |
nurse | Female | 91.3 | 0.12125738117926611 | 0.029434895795313585 |
school psychologist | Female | 90.4 | -0.34924 | -0.27798 |
receptionist | Female | 90 | -0.33877 | -0.22639 |
vet | Female | 89.8 | -0.02908 | 0.11708006054862408 |
nutritionist | Female | 89.6 | -0.17777 | -0.2193 |
maid | Female | 88.7 | 0.38307043846142336 | 0.2523360686341649 |
therapist | Female | 87.1 | -0.04124 | -0.09823 |
social worker | Female | 86.8 | 0.08358 | -0.00531 |
sewer | Female | 86.5 | 0.21770541089823284 | 0.18230598861912314 |
paralegal | Female | 84.8 | -0.27474 | -0.13365 |
library assistant | Female | 84.2 | -0.33992 | -0.15869 |
interior designer | Female | 83.8 | -0.07029 | -0.12871 |
manicurist | Female | 83 | -0.22675 | 0.49256168532186656 |
special education teacher | Female | 82.8 | -0.06123 | -0.10257 |
police officer | Male | 15.8 | -0.04289 | -0.0902 |
taxi driver | Male | 12 | -0.11897 | -0.07653 |
computer architect | Male | 11.8 | -0.08658 | -0.27449 |
mechanical engineer | Male | 9.4 | -0.15122 | -0.17058 |
truck driver | Male | 7.9 | -0.06683 | -0.10263 |
electrical engineer | Male | 7 | -0.05818 | -0.13694 |
landscaping worker | Male | 6.2 | -0.44607 | -0.41409 |
pilot | Male | 5.3 | 0.16081872172536063 | 0.10847355955477823 |
repair worker | Male | 5.1 | -0.19385 | -0.20943 |
firefighter | Male | 5.1 | -0.10681 | -0.14003 |
construction worker | Male | 4.2 | -0.03449 | -0.0879 |
machinist | Male | 3.4 | -0.36179 | -0.20148 |
aircraft mechanic | Male | 3.2 | -0.31942 | -0.26494 |
carpenter | Male | 3.1 | -0.02438 | 0.069184 |
roofer | Male | 2.9 | -0.10883 | 0.033033092268708625 |
brickmason | Male | 2.2 | -0.28875 | -0.19416 |
plumber | Male | 2.1 | 0.22440353244086522 | 0.15013727584185604 |
electrician | Male | 1.7 | -0.03262 | -0.13606 |
vehicle technician | Male | 1.2 | -0.18722 | -0.18556 |
crane operator | Male | 1.1 | 0.14478430627401268 | 0.088762 |
Sigh........
Once again, no bias shows up.
I need to consider switching to a different model.
Here is the code.
import os
from setproctitle import setproctitle
setproctitle("")
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
from tqdm import tqdm
import plotly.express as px
import pandas as pd
import numpy as np
# Imports for displaying vis in Colab / notebook
torch.set_grad_enabled(False)
# For the most part I'll try to import functions and classes near where they are used
# to make it clear where they come from.
if torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
sae_id = f"blocks.{8}.hook_resid_pre"
from sae_lens.toolkit.pretrained_saes_directory import get_pretrained_saes_directory
from sae_lens import SAE, HookedSAETransformer
model = HookedSAETransformer.from_pretrained("mistral-7b", device = device)
sae, cfg_dict, sparsity = SAE.from_pretrained(
release = "mistral-7b-res-wg", # <- Release name
sae_id = sae_id, # <- SAE id (not always a hook point!)
device = device
)
Swap the layer number in sae_id there to try different layers.
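As a small sketch of what changes between runs, the hook names for the layers tried in this post (8, 16, and 24) can be generated as below. `resid_pre_hook` is a hypothetical helper, and whether the release actually ships an SAE for a given layer is an assumption that should be checked against the pretrained SAE directory.

```python
def resid_pre_hook(layer):
    """Build the hook-point string used as sae_id for a given layer."""
    return f"blocks.{layer}.hook_resid_pre"

# Layers tried in this post
for layer in (8, 16, 24):
    sae_id = resid_pre_hook(layer)
    # The cache key read later in the code is the sae_id plus this suffix
    print(sae_id, "->", sae_id + ".hook_sae_acts_post")
```

Each layer needs its own SAE, so the `SAE.from_pretrained` call has to be repeated with the new `sae_id` whenever the layer changes.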
import numpy as np
import pandas as pd
# Sample occupations and the share of women in each (percent)
jobs_female_dominated = {
"skincare specialist": 98.2,
"kindergarten teacher": 96.8,
"childcare worker": 94.6,
"secretary": 92.5,
"hairstylist": 92.4,
"dental assistant": 92.0,
"nurse": 91.3,
"school psychologist": 90.4,
"receptionist": 90.0,
"vet": 89.8,
"nutritionist": 89.6,
"maid": 88.7,
"therapist": 87.1,
"social worker": 86.8,
"sewer": 86.5,
"paralegal": 84.8,
"library assistant": 84.2,
"interior designer": 83.8,
"manicurist": 83.0,
"special education teacher": 82.8
}
jobs_male_dominated = {
"police officer": 15.8,
"taxi driver": 12.0,
"computer architect": 11.8,
"mechanical engineer": 9.4,
"truck driver": 7.9,
"electrical engineer": 7.0,
"landscaping worker": 6.2,
"pilot": 5.3,
"repair worker": 5.1,
"firefighter": 5.1,
"construction worker": 4.2,
"machinist": 3.4,
"aircraft mechanic": 3.2,
"carpenter": 3.1,
"roofer": 2.9,
"brickmason": 2.2,
"plumber": 2.1,
"electrician": 1.7,
"vehicle technician": 1.2,
"crane operator": 1.1
}
# Cosine similarity between the token-averaged SAE feature vectors of two prompts
def calculate_cos_similarity(job, reference_word, model = model, sae = sae, sae_id = sae_id):
    hook_name = sae_id + '.hook_sae_acts_post'
    # Drop position 0 and average the SAE activations over the remaining tokens
    _, cache_job = model.run_with_cache_with_saes(job, saes=[sae])
    job_acts = cache_job[hook_name][0, 1:, :]
    job_vector = job_acts.cpu().numpy().sum(0) / len(job_acts)
    _, cache_ref = model.run_with_cache_with_saes(reference_word, saes=[sae])
    ref_acts = cache_ref[hook_name][0, 1:, :]
    ref_vector = ref_acts.cpu().numpy().sum(0) / len(ref_acts)
    cos_similarity = np.dot(job_vector, ref_vector) / (
        np.linalg.norm(job_vector) * np.linalg.norm(ref_vector)
    )
    return cos_similarity
# Compute and store the cosine similarity with `woman` and `man` for each occupation
similarities = []
for job, percentage in jobs_female_dominated.items():
    sim_with_woman = calculate_cos_similarity(job, "woman")
    sim_with_man = calculate_cos_similarity(job, "man")
    similarities.append({
        "Job": job,
        "Gender Dominance": "Female",
        "Female Percentage": percentage,
        "Cosine Similarity with Woman": sim_with_woman,
        "Cosine Similarity with Man": sim_with_man
    })
for job, percentage in jobs_male_dominated.items():
    sim_with_woman = calculate_cos_similarity(job, "woman")
    sim_with_man = calculate_cos_similarity(job, "man")
    similarities.append({
        "Job": job,
        "Gender Dominance": "Male",
        "Female Percentage": percentage,
        "Cosine Similarity with Woman": sim_with_woman,
        "Cosine Similarity with Man": sim_with_man
    })
# Collect the results into a DataFrame
df = pd.DataFrame(similarities)
This computes and outputs the cosine similarity with man and with woman for each occupation.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
# Sample occupations and the share of women in each (percent)
jobs_female_dominated = {
"skincare specialist": 98.2,
"kindergarten teacher": 96.8,
"childcare worker": 94.6,
"secretary": 92.5,
"hairstylist": 92.4,
"dental assistant": 92.0,
"nurse": 91.3,
"school psychologist": 90.4,
"receptionist": 90.0,
"vet": 89.8,
"nutritionist": 89.6,
"maid": 88.7,
"therapist": 87.1,
"social worker": 86.8,
"sewer": 86.5,
"paralegal": 84.8,
"library assistant": 84.2,
"interior designer": 83.8,
"manicurist": 83.0,
"special education teacher": 82.8
}
jobs_male_dominated = {
"police officer": 15.8,
"taxi driver": 12.0,
"computer architect": 11.8,
"mechanical engineer": 9.4,
"truck driver": 7.9,
"electrical engineer": 7.0,
"landscaping worker": 6.2,
"pilot": 5.3,
"repair worker": 5.1,
"firefighter": 5.1,
"construction worker": 4.2,
"machinist": 3.4,
"aircraft mechanic": 3.2,
"carpenter": 3.1,
"roofer": 2.9,
"brickmason": 2.2,
"plumber": 2.1,
"electrician": 1.7,
"vehicle technician": 1.2,
"crane operator": 1.1
}
# 1. Collect the feature vectors of all occupations and fit the PCA model
def get_feature_vector(word, model, sae, sae_id):
    _, cache = model.run_with_cache_with_saes(word, saes=[sae])
    vector = cache[sae_id + '.hook_sae_acts_post'][0, 1:, :].cpu().numpy()
    # Note: this divides by the number of nonzero activations, not the token count.
    # Cosine similarity is scale-invariant, but the PCA fit below is not.
    return vector.sum(axis=0) / np.count_nonzero(vector)
all_vectors = []
for job in list(jobs_female_dominated.keys()) + list(jobs_male_dominated.keys()) + ["woman", "man"]:
    all_vectors.append(get_feature_vector(job, model, sae, sae_id))
target_dim = 20
pca_model = PCA(n_components=target_dim)
pca_model.fit(all_vectors)  # Fit the PCA model
# 2. Cosine similarity in the PCA-reduced space
def calculate_cos_similarity(job, reference_word, pca_model, model, sae, sae_id):
    job_vector = get_feature_vector(job, model, sae, sae_id)
    ref_vector = get_feature_vector(reference_word, model, sae, sae_id)
    job_vector_reduced = pca_model.transform(job_vector.reshape(1, -1))
    ref_vector_reduced = pca_model.transform(ref_vector.reshape(1, -1))
    cos_similarity = np.dot(job_vector_reduced, ref_vector_reduced.T) / (
        np.linalg.norm(job_vector_reduced) * np.linalg.norm(ref_vector_reduced)
    )
    return cos_similarity[0][0]  # Return a scalar
# 3. Compute and store the cosine similarity with `woman` and `man` for each occupation
similarities = []
for job, percentage in jobs_female_dominated.items():
    sim_with_woman = calculate_cos_similarity(job, "woman", pca_model, model, sae, sae_id)
    sim_with_man = calculate_cos_similarity(job, "man", pca_model, model, sae, sae_id)
    similarities.append({
        "Job": job,
        "Gender Dominance": "Female",
        "Female Percentage": percentage,
        "Cosine Similarity with Woman": sim_with_woman,
        "Cosine Similarity with Man": sim_with_man
    })
for job, percentage in jobs_male_dominated.items():
    sim_with_woman = calculate_cos_similarity(job, "woman", pca_model, model, sae, sae_id)
    sim_with_man = calculate_cos_similarity(job, "man", pca_model, model, sae, sae_id)
    similarities.append({
        "Job": job,
        "Gender Dominance": "Male",
        "Female Percentage": percentage,
        "Cosine Similarity with Woman": sim_with_woman,
        "Cosine Similarity with Man": sim_with_man
    })
# 4. Print the results as a DataFrame
df = pd.DataFrame(similarities)
print(df)
Here I tried computing the cosine similarity after PCA, but I'm not sure it works any better....
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
# Token-averaged SAE feature vector for a prompt (position 0 is dropped)
def get_feature_vector(job, model, sae, sae_id):
    _, cache_job = model.run_with_cache_with_saes(job, saes=[sae])
    job_vector = cache_job[sae_id + '.hook_sae_acts_post'][0, 1:, :].cpu().numpy().sum(0) / \
        len(cache_job[sae_id + '.hook_sae_acts_post'][0, 1:, :])
    return job_vector
# Occupations with the share of women in each (percent), plus the two reference words
jobs_female_dominated = {
"skincare specialist": 98.2, "kindergarten teacher": 96.8, "childcare worker": 94.6,
"secretary": 92.5, "hairstylist": 92.4, "dental assistant": 92.0, "nurse": 91.3,
"school psychologist": 90.4, "receptionist": 90.0, "vet": 89.8, "nutritionist": 89.6,
"maid": 88.7, "therapist": 87.1, "social worker": 86.8, "sewer": 86.5, "paralegal": 84.8,
"library assistant": 84.2, "interior designer": 83.8, "manicurist": 83.0, "special education teacher": 82.8,
"woman" : 99.9#, "female" : 99.0
}
jobs_male_dominated = {
"police officer": 15.8, "taxi driver": 12.0, "computer architect": 11.8,
"mechanical engineer": 9.4, "truck driver": 7.9, "electrical engineer": 7.0,
"landscaping worker": 6.2, "pilot": 5.3, "repair worker": 5.1, "firefighter": 5.1,
"construction worker": 4.2, "machinist": 3.4, "aircraft mechanic": 3.2,
"carpenter": 3.1, "roofer": 2.9, "brickmason": 2.2, "plumber": 2.1,
"electrician": 1.7, "vehicle technician": 1.2, "crane operator": 1.1,
"man" : 0.1#, "male" : 0.11
}
# Collect the feature vectors of all occupations
feature_vectors = []
labels = []
dominance = []
for job in jobs_female_dominated.keys():
    feature_vector = get_feature_vector(job, model, sae, sae_id)
    feature_vectors.append(feature_vector)
    labels.append(job)
    dominance.append("Female")
for job in jobs_male_dominated.keys():
    feature_vector = get_feature_vector(job, model, sae, sae_id)
    feature_vectors.append(feature_vector)
    labels.append(job)
    dominance.append("Male")
# Convert to a numpy array
feature_vectors = np.array(feature_vectors)
# Reduce to 2 dimensions with t-SNE
tsne = TSNE(n_components=2, random_state=0)
reduced_features = tsne.fit_transform(feature_vectors)
# Plot
plt.figure(figsize=(12, 8))
for i, label in enumerate(labels):
    color = 'blue' if dominance[i] == 'Female' else 'red'
    plt.scatter(reduced_features[i, 0], reduced_features[i, 1], color=color)
    plt.text(reduced_features[i, 0] + 0.2, reduced_features[i, 1], label, fontsize=9)
plt.xlabel("TSNE Component 1")
plt.ylabel("TSNE Component 2")
plt.title("TSNE Visualization of Job Feature Vectors")
# Empty scatters only create the legend entries
plt.scatter([], [], color='blue', label="Female-dominated jobs")
plt.scatter([], [], color='red', label="Male-dominated jobs")
plt.legend()
plt.show()
This is the t-SNE visualization, which doesn't seem to have worked, haha....
import numpy as np
import matplotlib.pyplot as plt
from umap import UMAP
from sklearn.decomposition import PCA
# Apply PCA first to reduce the high-dimensional features, then apply UMAP
def visualize_with_umap(feature_vectors, labels, dominance):
    # Reduce to at most 20 dimensions with PCA
    pca = PCA(n_components=min(20, feature_vectors.shape[1]), random_state=0)
    pca_features = pca.fit_transform(feature_vectors)
    # Reduce to 2 dimensions with UMAP
    umap_model = UMAP(n_components=2, random_state=0)
    reduced_features = umap_model.fit_transform(pca_features)
    # Plot
    plt.figure(figsize=(12, 8))
    for i, label in enumerate(labels):
        color = 'blue' if dominance[i] == 'Female' else 'red'
        plt.scatter(reduced_features[i, 0], reduced_features[i, 1], color=color)
        plt.text(reduced_features[i, 0] + 0.2, reduced_features[i, 1], label, fontsize=9)
    plt.xlabel("UMAP Component 1")
    plt.ylabel("UMAP Component 2")
    plt.title("UMAP Visualization of Job Feature Vectors")
    # Empty scatters only create the legend entries
    plt.scatter([], [], color='blue', label="Female-dominated jobs")
    plt.scatter([], [], color='red', label="Male-dominated jobs")
    plt.legend()
    plt.show()
# Convert to a numpy array (feature_vectors, labels, dominance come from the t-SNE code above)
feature_vectors = np.array(feature_vectors)
# Run the UMAP visualization
visualize_with_umap(feature_vectors, labels, dominance)
This is the UMAP visualization.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Assumption: given samples for specific occupations from Bias-in-Bios,
# extract feature vectors for the occupation words and the gender words
def get_feature_vector(word, model, sae, sae_id):
    _, cache = model.run_with_cache_with_saes(word, saes=[sae])
    vector = cache[sae_id + '.hook_sae_acts_post'][0, 1:, :].cpu().numpy()
    # Divides by the number of nonzero activations; cosine similarity ignores this scale
    return vector.sum(axis=0) / np.count_nonzero(vector)
# Compute the cosine similarity with each gender word for each occupation
def calculate_bias_in_bios_scores(jobs, genders, model, sae, sae_id):
    results = []
    for job in jobs:
        job_vector = get_feature_vector(job, model, sae, sae_id)
        gender_similarities = {}
        for gender in genders:
            gender_vector = get_feature_vector(gender, model, sae, sae_id)
            # Cosine similarity between the occupation and the gender word
            sim = cosine_similarity(job_vector.reshape(1, -1), gender_vector.reshape(1, -1))[0][0]
            gender_similarities[gender] = sim
        results.append({
            "Job": job,
            "Cosine Similarity with Gender": gender_similarities
        })
    return results
# Occupation and gender word lists
jobs = ["doctor", "nurse", "engineer", "teacher", "scientist"]  # Can be replaced with the Bias-in-Bios occupation list
genders = ["man", "woman", "he", "she"]  # Gender-related words
# Compute the bias scores
bias_scores = calculate_bias_in_bios_scores(jobs, genders, model, sae, sae_id)
# Print the results
for score in bias_scores:
    print(f"Job: {score['Job']}")
    for gender, sim in score["Cosine Similarity with Gender"].items():
        print(f"  Cosine similarity with {gender}: {sim}")
Here you just pass in the occupations and it computes all the cosine similarities.
Next time I'll probably switch to a different model.