Reducing LLM Bias with a Sparse Autoencoder - Occupations by Gender, Part 4

이게될까 2024. 11. 29. 16:00

Previous post: 2024.11.07 - [AI/XAI] - Reducing LLM Bias with a Sparse Autoencoder - Occupations by Gender, Part 3

 

 

The SAE is attached to layer 7.

The layers go up to 11, so let's go through them and check each one.

Job Gender Dominance Female Percentage Cosine Similarity with Woman Cosine Similarity with Man
skincare specialist Female 98.2 0.885237 0.851516
kindergarten teacher Female 96.8 0.879833 0.844357
childcare worker Female 94.6 0.908659 0.854653
secretary Female 92.5 0.866899 0.826608
hairstylist Female 92.4 0.881781 0.852713
dental assistant Female 92 0.885272 0.84631
nurse Female 91.3 0.892691 0.846891
school psychologist Female 90.4 0.880224 0.829196
receptionist Female 90 0.884083 0.84414
vet Female 89.8 0.869386 0.852478
nutritionist Female 89.6 0.886887 0.840343
maid Female 88.7 0.919116 0.86704
therapist Female 87.1 0.891367 0.849494
social worker Female 86.8 0.873986 0.820296
sewer Female 86.5 0.880158 0.854183
paralegal Female 84.8 0.851667 0.838413
library assistant Female 84.2 0.880276 0.830921
interior designer Female 83.8 0.868622 0.829819
manicurist Female 83 0.920434 0.916763
special education teacher Female 82.8 0.870975 0.816348
police officer Male 15.8 0.850793 0.803566
taxi driver Male 12 0.879322 0.84247
computer architect Male 11.8 0.88091 0.84602
mechanical engineer Male 9.4 0.909123 0.881162
truck driver Male 7.9 0.891951 0.859146
electrical engineer Male 7 0.864773 0.823462
landscaping worker Male 6.2 0.888274 0.854544
pilot Male 5.3 0.876505 0.845214
repair worker Male 5.1 0.896395 0.852561
firefighter Male 5.1 0.898523 0.863758
construction worker Male 4.2 0.888954 0.848474
machinist Male 3.4 0.888064 0.871639
aircraft mechanic Male 3.2 0.891025 0.853536
carpenter Male 3.1 0.86811 0.843074
roofer Male 2.9 0.847142 0.831337
brickmason Male 2.2 0.867273 0.848192
plumber Male 2.1 0.87239 0.842234
electrician Male 1.7 0.894518 0.857185
vehicle technician Male 1.2 0.880496 0.841039
crane operator Male 1.1 0.89262 0.8612
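The two similarity columns in the table can be reproduced with a plain cosine similarity between a job's activation vector and the vector for each gender word. The sketch below is minimal and assumes the vectors have already been extracted from the model (for example, one hidden-state vector per prompt). The helper name `gender_similarities` and the toy 3-dimensional vectors are illustrative assumptions, not the code or values behind the table.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def gender_similarities(job_vecs, woman_vec, man_vec):
    """Return {job: (sim_with_woman, sim_with_man)} for every job vector."""
    return {
        job: (cosine_similarity(vec, woman_vec), cosine_similarity(vec, man_vec))
        for job, vec in job_vecs.items()
    }

# Toy vectors for illustration only (not real model activations).
woman = [1.0, 0.0, 1.0]
man = [0.0, 1.0, 1.0]
jobs = {"nurse": [2.0, 0.0, 2.0], "plumber": [0.0, 3.0, 3.0]}

for job, (sim_w, sim_m) in gender_similarities(jobs, woman, man).items():
    print(f"{job}: woman={sim_w:.3f} man={sim_m:.3f}")
```

Swapping in real activations only changes how `jobs`, `woman`, and `man` are filled; the comparison itself stays the same.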

What is this...?

 

Job Gender Dominance Female Percentage Cosine Similarity with Woman Cosine Similarity with Man
skincare specialist Female 98.2 -0.28427 -0.18885
kindergarten teacher Female 96.8 -0.29349 -0.15017
childcare worker Female 94.6 0.183231 -0.09855
secretary Female 92.5 -0.05228 -0.09362
hairstylist Female 92.4 -0.15438 -0.00019
dental assistant Female 92 -0.21715 -0.25152
nurse Female 91.3 0.197028 0.026283
school psychologist Female 90.4 -0.02939 -0.2249
receptionist Female 90 -0.11885 -0.12152
vet Female 89.8 -0.05999 0.101977
nutritionist Female 89.6 -0.10836 -0.26641
maid Female 88.7 0.452602 0.148875
therapist Female 87.1 -0.03172 -0.11619
social worker Female 86.8 -0.01379 -0.19423
sewer Female 86.5 -0.03255 0.029441
paralegal Female 84.8 -0.11245 0.103436
library assistant Female 84.2 -0.06259 -0.19382
interior designer Female 83.8 -0.14656 -0.16534
manicurist Female 83 0.325327 0.609222
special education teacher Female 82.8 -0.18254 -0.27691
police officer Male 15.8 0.061902 -0.04618
taxi driver Male 12 -0.14195 -0.17661
computer architect Male 11.8 -0.07117 -0.07187
mechanical engineer Male 9.4 -0.03773 0.142761
truck driver Male 7.9 -0.04467 -0.04188
electrical engineer Male 7 -0.31577 -0.31442
landscaping worker Male 6.2 -0.31069 -0.2016
pilot Male 5.3 0.000462 -0.00561
repair worker Male 5.1 0.034276 -0.14766
firefighter Male 5.1 0.030996 -0.07756
construction worker Male 4.2 -0.15419 -0.11414
machinist Male 3.4 -0.06887 0.178433
aircraft mechanic Male 3.2 -0.17256 -0.20201
carpenter Male 3.1 -0.37341 -0.0821
roofer Male 2.9 -0.14815 0.04336
brickmason Male 2.2 -0.14312 0.037151
plumber Male 2.1 -0.11869 -0.05927
electrician Male 1.7 -0.09495 -0.06853
vehicle technician Male 1.2 -0.09524 -0.08894
crane operator Male 1.1 -0.26544 -0.11965

Am I doing something wrong...?

 

This time, layer 11:

Job Gender Dominance Female Percentage Cosine Similarity with Woman Cosine Similarity with Man
skincare specialist Female 98.2 0.952556 0.936987
kindergarten teacher Female 96.8 0.951825 0.935702
childcare worker Female 94.6 0.957717 0.932282
secretary Female 92.5 0.93632 0.920797
hairstylist Female 92.4 0.949957 0.937659
dental assistant Female 92 0.950961 0.934148
nurse Female 91.3 0.952866 0.933521
school psychologist Female 90.4 0.945576 0.921782
receptionist Female 90 0.95102 0.93408
vet Female 89.8 0.942855 0.93628
nutritionist Female 89.6 0.949447 0.927987
maid Female 88.7 0.962526 0.939558
therapist Female 87.1 0.952193 0.934614
social worker Female 86.8 0.940401 0.916169
sewer Female 86.5 0.949164 0.938944
paralegal Female 84.8 0.939479 0.933251
library assistant Female 84.2 0.943475 0.922672
interior designer Female 83.8 0.94191 0.926157
manicurist Female 83 0.968939 0.966691
special education teacher Female 82.8 0.942118 0.917624
police officer Male 15.8 0.925999 0.901801
taxi driver Male 12 0.94509 0.928054
computer architect Male 11.8 0.945797 0.928877
mechanical engineer Male 9.4 0.961702 0.947846
truck driver Male 7.9 0.951451 0.935774
electrical engineer Male 7 0.93818 0.918885
landscaping worker Male 6.2 0.954152 0.937349
pilot Male 5.3 0.942039 0.930336
repair worker Male 5.1 0.950627 0.929089
firefighter Male 5.1 0.955618 0.939639
construction worker Male 4.2 0.948317 0.930044
machinist Male 3.4 0.953453 0.946909
aircraft mechanic Male 3.2 0.951964 0.934727
carpenter Male 3.1 0.945402 0.935081
roofer Male 2.9 0.935538 0.928428
brickmason Male 2.2 0.940855 0.932296
plumber Male 2.1 0.947161 0.933655
electrician Male 1.7 0.95383 0.937022
vehicle technician Male 1.2 0.939626 0.920676
crane operator Male 1.1 0.953499 0.938783

No clustering is visible at all.
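One way to put a number on the lack of clustering is to check whether the gap between the two similarity columns tracks the female percentage at all. The sketch below computes a Pearson correlation over a handful of rows copied from the layer-11 table above. The choice of rows and the use of Pearson correlation are assumptions made for illustration, not part of the original experiment.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# (job, female %, sim with woman, sim with man), copied from the layer-11 table.
rows = [
    ("skincare specialist", 98.2, 0.952556, 0.936987),
    ("nurse", 91.3, 0.952866, 0.933521),
    ("police officer", 15.8, 0.925999, 0.901801),
    ("plumber", 2.1, 0.947161, 0.933655),
    ("crane operator", 1.1, 0.953499, 0.938783),
]

female_pct = [r[1] for r in rows]
gap = [r[2] - r[3] for r in rows]  # positive means closer to "woman"

print(f"correlation(female %, woman-man gap) = {pearson(female_pct, gap):+.3f}")
```

A value near zero over all 40 rows would back up the impression that the layer-11 similarities carry no occupational gender signal; with only a few rows the estimate is noisy.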

Job Gender Dominance Female Percentage Cosine Similarity with Woman Cosine Similarity with Man
woman Female 100 1 0.977756
skincare specialist Female 98.2 0.952556 0.936987
kindergarten teacher Female 96.8 0.951825 0.935702
childcare worker Female 94.6 0.957717 0.932282
secretary Female 92.5 0.93632 0.920797
hairstylist Female 92.4 0.949957 0.937659
dental assistant Female 92 0.950961 0.934148
nurse Female 91.3 0.952866 0.933521
school psychologist Female 90.4 0.945576 0.921782
receptionist Female 90 0.95102 0.93408
vet Female 89.8 0.942855 0.93628
nutritionist Female 89.6 0.949447 0.927987
maid Female 88.7 0.962526 0.939558
therapist Female 87.1 0.952193 0.934614
social worker Female 86.8 0.940401 0.916169
sewer Female 86.5 0.949164 0.938944
paralegal Female 84.8 0.939479 0.933251
library assistant Female 84.2 0.943475 0.922672
interior designer Female 83.8 0.94191 0.926157
manicurist Female 83 0.968939 0.966691
special education teacher Female 82.8 0.942118 0.917624
police officer Male 15.8 0.925999 0.901801
taxi driver Male 12 0.94509 0.928054
computer architect Male 11.8 0.945797 0.928877
mechanical engineer Male 9.4 0.961702 0.947846
truck driver Male 7.9 0.951451 0.935774
electrical engineer Male 7 0.93818 0.918885
landscaping worker Male 6.2 0.954152 0.937349
pilot Male 5.3 0.942039 0.930336
repair worker Male 5.1 0.950627 0.929089
firefighter Male 5.1 0.955618 0.939639
construction worker Male 4.2 0.948317 0.930044
machinist Male 3.4 0.953453 0.946909
aircraft mechanic Male 3.2 0.951964 0.934727
carpenter Male 3.1 0.945402 0.935081
roofer Male 2.9 0.935538 0.928428
brickmason Male 2.2 0.940855 0.932296
plumber Male 2.1 0.947161 0.933655
electrician Male 1.7 0.95383 0.937022
vehicle technician Male 1.2 0.939626 0.920676
crane operator Male 1.1 0.953499 0.938783
man Male 0.1 0.977756 1

To begin with, man and woman are themselves very close to each other (cosine similarity 0.977756)...?
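Uniformly high similarities can come from a component that every vector shares, in which case subtracting the mean vector before comparing makes the differences easier to see. Whether that is what is happening at layer 11 is not established here; the sketch below only illustrates the check. The function `mean_center` and the toy vectors are assumptions for illustration, and in practice the mean would be estimated over many prompts rather than three.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def mean_center(vectors):
    """Subtract the per-dimension mean of all vectors from each vector."""
    names = list(vectors)
    dim = len(vectors[names[0]])
    mean = [sum(vectors[n][d] for n in names) / len(names) for d in range(dim)]
    return {n: [v - m for v, m in zip(vectors[n], mean)] for n in names}

# Toy vectors with a large shared first component (illustration only).
vectors = {
    "woman": [10.0, 1.0, 0.0],
    "man": [10.0, 0.0, 1.0],
    "job": [10.0, -1.0, -1.0],
}

centered = mean_center(vectors)
raw = cosine_similarity(vectors["woman"], vectors["man"])
adjusted = cosine_similarity(centered["woman"], centered["man"])
print(f"raw={raw:.3f} centered={adjusted:.3f}")
```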

 

Hmm, let's compute the cosine similarity at the SAE layer of GPT-2 small.

prompt 1 prompt 2 cos similarity
man woman 0.68
he she 0.56
boy girl 0.48
his her 0.62
male female 0.39

I'll proceed with female and male, the pair with the lowest similarity (0.39).

Job Gender Dominance Female Percentage Cosine Similarity with female Cosine Similarity with male
skincare specialist Female 98.2 0.124757 0.134896
kindergarten teacher Female 96.8 0.147743 0.160076
childcare worker Female 94.6 0.141078 0.146944
secretary Female 92.5 0.198375 0.201354
hairstylist Female 92.4 0.128075 0.137902
dental assistant Female 92 0.195335 0.204906
nurse Female 91.3 0.244849 0.261608
school psychologist Female 90.4 0.169121 0.154034
receptionist Female 90 0.178872 0.179647
vet Female 89.8 0.273882 0.289234
nutritionist Female 89.6 0.177132 0.168246
maid Female 88.7 0.238828 0.239321
therapist Female 87.1 0.180114 0.187415
social worker Female 86.8 0.167729 0.166011
sewer Female 86.5 0.213534 0.227236
paralegal Female 84.8 0.133382 0.142434
library assistant Female 84.2 0.108575 0.109091
interior designer Female 83.8 0.192761 0.199092
manicurist Female 83 0.123101 0.1409
special education teacher Female 82.8 0.136673 0.139805
police officer Male 15.8 0.164144 0.15197
taxi driver Male 12 0.167654 0.159419
computer architect Male 11.8 0.144956 0.147659
mechanical engineer Male 9.4 0.197546 0.207421
truck driver Male 7.9 0.208168 0.221485
electrical engineer Male 7 0.156188 0.155851
landscaping worker Male 6.2 0.124597 0.128934
pilot Male 5.3 0.210591 0.219529
repair worker Male 5.1 0.128081 0.133228
firefighter Male 5.1 0.213929 0.219438
construction worker Male 4.2 0.186627 0.186755
machinist Male 3.4 0.165033 0.17933
aircraft mechanic Male 3.2 0.185949 0.195976
carpenter Male 3.1 0.166936 0.179504
roofer Male 2.9 0.21479 0.22808
brickmason Male 2.2 0.130557 0.144164
plumber Male 2.1 0.199274 0.211099
electrician Male 1.7 0.128658 0.128711
vehicle technician Male 1.2 0.168419 0.176294
crane operator Male 1.1 0.164983 0.177588

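Hmm...

This time the similarity with male is higher for most jobs (34 of the 40 rows), regardless of which gender dominates the occupation.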
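The direction of each row can be tallied mechanically instead of by eye. The sketch below counts how many rows sit closer to male than to female; the rows are a subset copied from the table above, and the helper name `count_leaning` is an assumption for illustration.

```python
def count_leaning(rows):
    """Count rows whose similarity is higher with 'male', 'female', or tied."""
    counts = {"male": 0, "female": 0, "tie": 0}
    for _job, sim_female, sim_male in rows:
        if sim_male > sim_female:
            counts["male"] += 1
        elif sim_female > sim_male:
            counts["female"] += 1
        else:
            counts["tie"] += 1
    return counts

# (job, sim with female, sim with male), a subset copied from the table above.
rows = [
    ("skincare specialist", 0.124757, 0.134896),
    ("nurse", 0.244849, 0.261608),
    ("school psychologist", 0.169121, 0.154034),
    ("maid", 0.238828, 0.239321),
    ("police officer", 0.164144, 0.15197),
    ("truck driver", 0.208168, 0.221485),
    ("plumber", 0.199274, 0.211099),
    ("crane operator", 0.164983, 0.177588),
]

print(count_leaning(rows))
```

Running the same tally separately on the female-dominated and male-dominated halves shows whether the lean depends on the occupation at all.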

 

Next, let's compute the cosine similarity at layer 11.

prompt 1 prompt 2 cos similarity
man woman 0.98
he she 0.97
boy girl 0.97
his her 0.97
male female 0.99

These are all just high...

 

Let's take a look at layer 2.

prompt 1 prompt 2 cos similarity
man woman 0.88
he she 0.88
boy girl 0.87
his her 0.80
male female 0.92
man he 0.75
woman she 0.79

...?

In the first place, the cosine similarity between man and woman is higher than their cosine similarity with the jobs...

 

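Back to Mistral...

With Mistral, cosine similarity tends to drop as the layers get deeper, so I used layer 31.

As a baseline, the cosine similarity with jobs is higher for woman than for man, which is probably why woman comes out higher for every occupation.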

Job Gender Dominance Female Percentage Cosine Similarity with Woman Cosine Similarity with Man
woman Female 100 1 0.478665
skincare specialist Female 98.2 0.526142 0.337138
kindergarten teacher Female 96.8 0.571344 0.319337
childcare worker Female 94.6 0.653964 0.321974
secretary Female 92.5 0.490099 0.289095
hairstylist Female 92.4 0.475824 0.333437
dental assistant Female 92 0.46657 0.21639
nurse Female 91.3 0.576348 0.327447
school psychologist Female 90.4 0.555125 0.294005
receptionist Female 90 0.579533 0.30584
vet Female 89.8 0.380612 0.295048
nutritionist Female 89.6 0.530972 0.265065
maid Female 88.7 0.426042 0.352122
therapist Female 87.1 0.616198 0.312762
social worker Female 86.8 0.598205 0.277394
sewer Female 86.5 0.338412 0.328602
paralegal Female 84.8 0.473365 0.355198
library assistant Female 84.2 0.306555 0.190848
interior designer Female 83.8 0.553535 0.26863
manicurist Female 83 0.646296 0.64153
special education teacher Female 82.8 0.528594 0.267602
police officer Male 15.8 0.600642 0.303183
taxi driver Male 12 0.583161 0.318098
computer architect Male 11.8 0.515812 0.300871
mechanical engineer Male 9.4 0.560198 0.285663
truck driver Male 7.9 0.604801 0.328755
electrical engineer Male 7 0.556409 0.275509
landscaping worker Male 6.2 0.600496 0.296639
pilot Male 5.3 0.509496 0.349558
repair worker Male 5.1 0.641669 0.318395
firefighter Male 5.1 0.512231 0.322519
construction worker Male 4.2 0.611927 0.302671
machinist Male 3.4 0.507236 0.3774
aircraft mechanic Male 3.2 0.546825 0.296465
carpenter Male 3.1 0.490861 0.403917
roofer Male 2.9 0.465377 0.376143
brickmason Male 2.2 0.471654 0.331363
plumber Male 2.1 0.432962 0.369362
electrician Male 1.7 0.538109 0.291336
vehicle technician Male 1.2 0.480898 0.273057
crane operator Male 1.1 0.558155 0.364165
man Male 0 0.478665 1

The values are just widely spread out...
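Because woman scores higher than man for every job in the layer-31 table, the raw gap mostly reflects that baseline. One way to look past it is to subtract the average gap and read only the relative lean of each job. The sketch below does this for a few rows copied from the table; the helper name `centered_gaps` and the choice of rows are assumptions for illustration.

```python
def centered_gaps(rows):
    """Return {job: gap - mean_gap}, where gap = sim_woman - sim_man."""
    if not rows:
        raise ValueError("rows must not be empty")
    gaps = {job: sim_w - sim_m for job, sim_w, sim_m in rows}
    mean_gap = sum(gaps.values()) / len(gaps)
    return {job: gap - mean_gap for job, gap in gaps.items()}

# (job, sim with woman, sim with man), copied from the Mistral layer-31 table.
rows = [
    ("kindergarten teacher", 0.571344, 0.319337),
    ("nurse", 0.576348, 0.327447),
    ("sewer", 0.338412, 0.328602),
    ("police officer", 0.600642, 0.303183),
    ("carpenter", 0.490861, 0.403917),
    ("plumber", 0.432962, 0.369362),
]

# Positive values lean toward "woman" relative to the average job,
# negative values lean toward "man".
for job, value in sorted(centered_gaps(rows).items(), key=lambda kv: kv[1]):
    print(f"{job}: {value:+.3f}")
```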

Job: doctor
  Cosine similarity with man: 0.29202771186828613
  Cosine similarity with woman: 0.3873016834259033
  Cosine similarity with he: 0.18976394832134247
  Cosine similarity with she: 0.24517714977264404
  Cosine similarity with male: 0.26682591438293457
  Cosine similarity with female: 0.30686309933662415
Job: nurse
  Cosine similarity with man: 0.32744666934013367
  Cosine similarity with woman: 0.5763475894927979
  Cosine similarity with he: 0.32279425859451294
  Cosine similarity with she: 0.2484569251537323
  Cosine similarity with male: 0.2853856086730957
  Cosine similarity with female: 0.3248463273048401
Job: engineer
  Cosine similarity with man: 0.3628814220428467
  Cosine similarity with woman: 0.535031795501709
  Cosine similarity with he: 0.30105212330818176
  Cosine similarity with she: 0.24273112416267395
  Cosine similarity with male: 0.2686672806739807
  Cosine similarity with female: 0.31331366300582886
Job: teacher
  Cosine similarity with man: 0.3335043787956238
  Cosine similarity with woman: 0.5944691300392151
  Cosine similarity with he: 0.3386254608631134
  Cosine similarity with she: 0.24903792142868042
  Cosine similarity with male: 0.2746357321739197
  Cosine similarity with female: 0.33448266983032227
Job: scientist
  Cosine similarity with man: 0.3251022696495056
  Cosine similarity with woman: 0.6097099781036377
  Cosine similarity with he: 0.37819525599479675
  Cosine similarity with she: 0.2599332332611084
  Cosine similarity with male: 0.2557474374771118
  Cosine similarity with female: 0.3047844469547272

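If anything, the model's output is less clear-cut than what the paper reports...

It only shows up when done this way.

Now that we have confirmed that a bias exists, let's find the features that produce it.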
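One possible starting point for that search is to compare SAE feature activations on gender-paired prompts and rank the features by how much they differ. The sketch below assumes the SAE activations for each prompt are already available as plain lists. The function name `top_differing_features` and the toy activation values are hypothetical and not taken from the experiments in this series.

```python
def top_differing_features(acts_a, acts_b, k=3):
    """Rank SAE feature indices by |acts_a[i] - acts_b[i]|, largest first."""
    if len(acts_a) != len(acts_b):
        raise ValueError("activation vectors must have the same length")
    diffs = [(i, a - b) for i, (a, b) in enumerate(zip(acts_a, acts_b))]
    diffs.sort(key=lambda item: abs(item[1]), reverse=True)
    return diffs[:k]

# Toy SAE activations (hypothetical values), one entry per feature.
acts_female_prompt = [0.0, 2.5, 0.1, 0.0, 1.0, 0.0]
acts_male_prompt = [0.0, 0.2, 0.1, 1.8, 1.0, 0.0]

for idx, diff in top_differing_features(acts_female_prompt, acts_male_prompt, k=2):
    direction = "female prompt" if diff > 0 else "male prompt"
    print(f"feature {idx}: diff={diff:+.2f} (more active on the {direction})")
```

Features that rank high across many prompt pairs, not just one, are the more plausible candidates for carrying the gender signal.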

 
