Reducing LLM Bias with a Sparse Autoencoder - Occupations by Gender 3

이게될까 2024. 11. 28. 13:53

2024.11.05 - [AI/XAI] - Reducing LLM Bias with a Sparse Autoencoder - Occupations by Gender 2

 

This time, instead of computing cosine similarity on the SAE features, I'll check it on a layer after the SAE.
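For reference, the comparison itself is just a cosine similarity between a job vector and the vectors for "woman" and "man". Below is a minimal sketch of that step. The helper names (`cosine_similarity`, `gender_similarities`) and the toy 3-D vectors are my own placeholders for illustration, not the actual activations or code behind the tables.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / denom)

def gender_similarities(job_vec, woman_vec, man_vec):
    """Return (similarity to 'woman', similarity to 'man') for one job vector."""
    return (cosine_similarity(job_vec, woman_vec),
            cosine_similarity(job_vec, man_vec))

# Toy vectors, NOT real model activations; they only show the call pattern.
job = np.array([1.0, 2.0, 0.0])
woman = np.array([1.0, 2.0, 1.0])
man = np.array([2.0, 0.0, 1.0])

sim_w, sim_m = gender_similarities(job, woman, man)
print(f"woman: {sim_w:.4f}, man: {sim_m:.4f}")
```

In a real run, `job`, `woman`, and `man` would be replaced by the activations taken from the chosen layer.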

 

| Job | Gender Dominance | Female Percentage (%) | Cosine Similarity with Woman | Cosine Similarity with Man |
| --- | --- | --- | --- | --- |
| skincare specialist | Female | 98.2 | 0.651186 | 0.556518 |
| kindergarten teacher | Female | 96.8 | 0.649491 | 0.514379 |
| childcare worker | Female | 94.6 | 0.71621 | 0.519848 |
| secretary | Female | 92.5 | 0.583385 | 0.54581 |
| hairstylist | Female | 92.4 | 0.634154 | 0.553486 |
| dental assistant | Female | 92 | 0.594918 | 0.44044 |
| nurse | Female | 91.3 | 0.712285 | 0.57134 |
| school psychologist | Female | 90.4 | 0.640011 | 0.494764 |
| receptionist | Female | 90 | 0.63295 | 0.544817 |
| vet | Female | 89.8 | 0.521638 | 0.586588 |
| nutritionist | Female | 89.6 | 0.620365 | 0.503985 |
| maid | Female | 88.7 | 0.557642 | 0.581538 |
| therapist | Female | 87.1 | 0.607497 | 0.489134 |
| social worker | Female | 86.8 | 0.705663 | 0.54845 |
| sewer | Female | 86.5 | 0.637374 | 0.63115 |
| paralegal | Female | 84.8 | 0.625037 | 0.606166 |
| library assistant | Female | 84.2 | 0.422588 | 0.460436 |
| interior designer | Female | 83.8 | 0.637165 | 0.494837 |
| manicurist | Female | 83 | 0.644541 | 0.701974 |
| special education teacher | Female | 82.8 | 0.661597 | 0.507501 |
| police officer | Male | 15.8 | 0.70545 | 0.507632 |
| taxi driver | Male | 12 | 0.642695 | 0.500626 |
| computer architect | Male | 11.8 | 0.651701 | 0.564702 |
| mechanical engineer | Male | 9.4 | 0.637769 | 0.532792 |
| truck driver | Male | 7.9 | 0.667702 | 0.494411 |
| electrical engineer | Male | 7 | 0.628306 | 0.492114 |
| landscaping worker | Male | 6.2 | 0.669272 | 0.483533 |
| pilot | Male | 5.3 | 0.629289 | 0.603345 |
| repair worker | Male | 5.1 | 0.678439 | 0.516247 |
| firefighter | Male | 5.1 | 0.653101 | 0.52751 |
| construction worker | Male | 4.2 | 0.718746 | 0.525603 |
| machinist | Male | 3.4 | 0.644379 | 0.622361 |
| aircraft mechanic | Male | 3.2 | 0.620817 | 0.460778 |
| carpenter | Male | 3.1 | 0.654313 | 0.653469 |
| roofer | Male | 2.9 | 0.624167 | 0.607007 |
| brickmason | Male | 2.2 | 0.579816 | 0.553472 |
| plumber | Male | 2.1 | 0.629519 | 0.639175 |
| electrician | Male | 1.7 | 0.642468 | 0.532843 |
| vehicle technician | Male | 1.2 | 0.614766 | 0.479526 |
| crane operator | Male | 1.1 | 0.673586 | 0.602639 |

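Here too, "woman" gets the higher cosine similarity in almost every row (35 of 40), regardless of whether the job is female- or male-dominated.....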
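A quick way to check this kind of claim is to split the rows by which gender word scores higher. The sketch below does that on five rows copied from the table above; the helper name `split_by_higher` is hypothetical, and the remaining rows would need to be added to get the count for the full table.

```python
# (job, female percentage, cosine sim with "woman", cosine sim with "man")
# Five rows copied from the table above, only to show the mechanics.
rows = [
    ("skincare specialist", 98.2, 0.651186, 0.556518),
    ("nurse", 91.3, 0.712285, 0.57134),
    ("vet", 89.8, 0.521638, 0.586588),
    ("police officer", 15.8, 0.70545, 0.507632),
    ("plumber", 2.1, 0.629519, 0.639175),
]

def split_by_higher(rows):
    """Split job names by which gender word has the higher cosine similarity."""
    woman_higher = [job for job, _, w, m in rows if w > m]
    man_higher = [job for job, _, w, m in rows if m > w]
    return woman_higher, man_higher

woman_higher, man_higher = split_by_higher(rows)
print(f"woman higher: {len(woman_higher)}/{len(rows)}", woman_higher)
print(f"man higher:   {len(man_higher)}/{len(rows)}", man_higher)
```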

 

The t-SNE plot doesn't show any clear structure either..

Hmm, UMAP doesn't reveal anything either...

 

Job: doctor
  Cosine similarity with man: 0.6425841450691223
  Cosine similarity with woman: 0.6595352292060852
  Cosine similarity with he: 0.5585692524909973
  Cosine similarity with she: 0.6580565571784973
Job: nurse
  Cosine similarity with man: 0.5713397264480591
  Cosine similarity with woman: 0.7122852802276611
  Cosine similarity with he: 0.5225833654403687
  Cosine similarity with she: 0.5876948833465576
Job: engineer
  Cosine similarity with man: 0.6049615144729614
  Cosine similarity with woman: 0.6559303402900696
  Cosine similarity with he: 0.5704656839370728
  Cosine similarity with she: 0.6277342438697815
Job: teacher
  Cosine similarity with man: 0.5859898924827576
  Cosine similarity with woman: 0.667400598526001
  Cosine similarity with he: 0.5594150424003601
  Cosine similarity with she: 0.6244168281555176
Job: scientist
  Cosine similarity with man: 0.5397614240646362
  Cosine similarity with woman: 0.6615307331085205
  Cosine similarity with he: 0.5063120126724243
  Cosine similarity with she: 0.5763662457466125

I really can't see an answer here...

 

Everything so far used the Mistral model, so let's switch to a different model.

 

From here on, the results are from GPT-2 small.

| Job | Gender Dominance | Female Percentage (%) | Cosine Similarity with Woman | Cosine Similarity with Man |
| --- | --- | --- | --- | --- |
| skincare specialist | Female | 98.2 | 0.195972 | 0.204778 |
| kindergarten teacher | Female | 96.8 | 0.220412 | 0.233106 |
| childcare worker | Female | 94.6 | 0.21226 | 0.196284 |
| secretary | Female | 92.5 | 0.317799 | 0.327826 |
| hairstylist | Female | 92.4 | 0.197492 | 0.212216 |
| dental assistant | Female | 92 | 0.292716 | 0.315205 |
| nurse | Female | 91.3 | 0.367193 | 0.412158 |
| school psychologist | Female | 90.4 | 0.252426 | 0.209854 |
| receptionist | Female | 90 | 0.291924 | 0.294703 |
| vet | Female | 89.8 | 0.41986 | 0.476376 |
| nutritionist | Female | 89.6 | 0.28915 | 0.251744 |
| maid | Female | 88.7 | 0.414578 | 0.406934 |
| therapist | Female | 87.1 | 0.288946 | 0.296861 |
| social worker | Female | 86.8 | 0.264815 | 0.248089 |
| sewer | Female | 86.5 | 0.320507 | 0.360573 |
| paralegal | Female | 84.8 | 0.195828 | 0.215684 |
| library assistant | Female | 84.2 | 0.18374 | 0.167661 |
| interior designer | Female | 83.8 | 0.292864 | 0.308887 |
| manicurist | Female | 83 | 0.313635 | 0.443759 |
| special education teacher | Female | 82.8 | 0.211079 | 0.203801 |
| police officer | Male | 15.8 | 0.247464 | 0.196906 |
| taxi driver | Male | 12 | 0.253791 | 0.227682 |
| computer architect | Male | 11.8 | 0.229394 | 0.219098 |
| mechanical engineer | Male | 9.4 | 0.301834 | 0.319501 |
| truck driver | Male | 7.9 | 0.310374 | 0.335189 |
| electrical engineer | Male | 7 | 0.242503 | 0.236696 |
| landscaping worker | Male | 6.2 | 0.191117 | 0.189157 |
| pilot | Male | 5.3 | 0.317637 | 0.349418 |
| repair worker | Male | 5.1 | 0.214539 | 0.207093 |
| firefighter | Male | 5.1 | 0.329568 | 0.336709 |
| construction worker | Male | 4.2 | 0.301041 | 0.298139 |
| machinist | Male | 3.4 | 0.246829 | 0.273729 |
| aircraft mechanic | Male | 3.2 | 0.278695 | 0.294949 |
| carpenter | Male | 3.1 | 0.251271 | 0.283897 |
| roofer | Male | 2.9 | 0.324939 | 0.361726 |
| brickmason | Male | 2.2 | 0.201017 | 0.226547 |
| plumber | Male | 2.1 | 0.308215 | 0.334688 |
| electrician | Male | 1.7 | 0.205533 | 0.195738 |
| vehicle technician | Male | 1.2 | 0.263219 | 0.261841 |
| crane operator | Male | 1.1 | 0.264234 | 0.276296 |

This feels fairly evenly distributed....

It's not the case that male-dominated jobs are clearly biased toward "man".
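One way to put a number on "evenly distributed" is to correlate each job's female percentage with its (woman - man) similarity gap. If the similarities tracked the real-world gender ratio, that correlation should be clearly positive. The sketch below is only an illustration: `gap_correlation` is a hypothetical helper, and it runs on six rows copied from the GPT-2 small table above rather than all 40.

```python
import numpy as np

def gap_correlation(rows):
    """Pearson r between female percentage and the (woman - man) similarity gap."""
    female_pct = np.array([r[1] for r in rows], dtype=float)
    gap = np.array([r[2] - r[3] for r in rows], dtype=float)
    return float(np.corrcoef(female_pct, gap)[0, 1])

# (job, female percentage, cosine sim with "woman", cosine sim with "man")
# Six rows copied from the GPT-2 small table above; add the rest for a full check.
rows = [
    ("skincare specialist", 98.2, 0.195972, 0.204778),
    ("nurse", 91.3, 0.367193, 0.412158),
    ("school psychologist", 90.4, 0.252426, 0.209854),
    ("police officer", 15.8, 0.247464, 0.196906),
    ("truck driver", 7.9, 0.310374, 0.335189),
    ("plumber", 2.1, 0.308215, 0.334688),
]

r = gap_correlation(rows)
print(f"Pearson r between female % and (woman - man) gap: {r:.3f}")
```

A value near +1 would mean the gap follows the gender ratio; a value near 0 would match the "no clear bias" impression.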

 

t-SNE and UMAP don't tell us anything either....

Job: doctor
  Cosine similarity with man: 0.3023069500923157
  Cosine similarity with woman: 0.3201085031032562
  Cosine similarity with he: 0.235477015376091
  Cosine similarity with she: 0.2189578413963318
Job: nurse
  Cosine similarity with man: 0.4121583104133606
  Cosine similarity with woman: 0.36719265580177307
  Cosine similarity with he: 0.34656408429145813
  Cosine similarity with she: 0.3177404999732971
Job: engineer
  Cosine similarity with man: 0.2845001220703125
  Cosine similarity with woman: 0.28417065739631653
  Cosine similarity with he: 0.21660952270030975
  Cosine similarity with she: 0.199890598654747
Job: teacher
  Cosine similarity with man: 0.405282199382782
  Cosine similarity with woman: 0.4033980965614319
  Cosine similarity with he: 0.3337845802307129
  Cosine similarity with she: 0.3021133244037628
Job: scientist
  Cosine similarity with man: 0.23807066679000854
  Cosine similarity with woman: 0.26343828439712524
  Cosine similarity with he: 0.1863318830728531
  Cosine similarity with she: 0.17157433927059174

Hmm.....

 

Next, let's move to layer 8, the layer right after the SAE layer.
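Switching layers only changes which hidden state each vector is taken from. Here is a rough sketch of that selection step, under assumptions of my own: the per-layer layout mirrors what Hugging Face Transformers returns with `output_hidden_states=True` (embeddings at index 0, then one entry per block), minus the batch dimension, and the token activations are mean-pooled into one vector. The pooling choice and the helper name `layer_representation` are assumptions, and the activations are random stand-ins.

```python
import numpy as np

def layer_representation(hidden_states, layer: int) -> np.ndarray:
    """Mean-pool one layer's token activations into a single vector.

    hidden_states: sequence of arrays shaped (num_tokens, hidden_dim), where
    index 0 is the embedding output and index i is the output of block i
    (the layout assumed in this sketch).
    """
    if not 0 <= layer < len(hidden_states):
        raise IndexError(f"layer {layer} out of range")
    return np.asarray(hidden_states[layer]).mean(axis=0)

# Random stand-ins, NOT real model outputs: 13 entries (embeddings + 12 blocks,
# as in GPT-2 small), 3 tokens, hidden size 768.
rng = np.random.default_rng(0)
fake_hidden_states = [rng.normal(size=(3, 768)) for _ in range(13)]

vec = layer_representation(fake_hidden_states, layer=8)
print(vec.shape)
```

Mean pooling matters for multi-token jobs such as "special education teacher"; taking the last token instead would be an equally plausible choice.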

| Job | Gender Dominance | Female Percentage (%) | Cosine Similarity with Woman | Cosine Similarity with Man |
| --- | --- | --- | --- | --- |
| skincare specialist | Female | 98.2 | 0.830806 | 0.8029 |
| kindergarten teacher | Female | 96.8 | 0.822819 | 0.789158 |
| childcare worker | Female | 94.6 | 0.868605 | 0.81984 |
| secretary | Female | 92.5 | 0.81239 | 0.773908 |
| hairstylist | Female | 92.4 | 0.816338 | 0.786998 |
| dental assistant | Female | 92 | 0.850209 | 0.816433 |
| nurse | Female | 91.3 | 0.866288 | 0.812078 |
| school psychologist | Female | 90.4 | 0.838778 | 0.792577 |
| receptionist | Female | 90 | 0.844404 | 0.80642 |
| vet | Female | 89.8 | 0.827386 | 0.810253 |
| nutritionist | Female | 89.6 | 0.84587 | 0.802108 |
| maid | Female | 88.7 | 0.89618 | 0.827164 |
| therapist | Female | 87.1 | 0.852225 | 0.817951 |
| social worker | Female | 86.8 | 0.826142 | 0.779234 |
| sewer | Female | 86.5 | 0.833624 | 0.806207 |
| paralegal | Female | 84.8 | 0.772227 | 0.765412 |
| library assistant | Female | 84.2 | 0.833513 | 0.783677 |
| interior designer | Female | 83.8 | 0.827806 | 0.79507 |
| manicurist | Female | 83 | 0.865115 | 0.874826 |
| special education teacher | Female | 82.8 | 0.822208 | 0.772009 |
| police officer | Male | 15.8 | 0.822562 | 0.784152 |
| taxi driver | Male | 12 | 0.838859 | 0.810866 |
| computer architect | Male | 11.8 | 0.840573 | 0.813889 |
| mechanical engineer | Male | 9.4 | 0.869139 | 0.851203 |
| truck driver | Male | 7.9 | 0.859283 | 0.838243 |
| electrical engineer | Male | 7 | 0.810291 | 0.77179 |
| landscaping worker | Male | 6.2 | 0.833979 | 0.809405 |
| pilot | Male | 5.3 | 0.842505 | 0.814605 |
| repair worker | Male | 5.1 | 0.859992 | 0.819944 |
| firefighter | Male | 5.1 | 0.860235 | 0.832443 |
| construction worker | Male | 4.2 | 0.844144 | 0.809736 |
| machinist | Male | 3.4 | 0.833735 | 0.833825 |
| aircraft mechanic | Male | 3.2 | 0.847732 | 0.824421 |
| carpenter | Male | 3.1 | 0.805981 | 0.79008 |
| roofer | Male | 2.9 | 0.794447 | 0.777306 |
| brickmason | Male | 2.2 | 0.814994 | 0.809293 |
| plumber | Male | 2.1 | 0.8419 | 0.810002 |
| electrician | Male | 1.7 | 0.847812 | 0.820417 |
| vehicle technician | Male | 1.2 | 0.837883 | 0.809716 |
| crane operator | Male | 1.1 | 0.85486 | 0.827914 |

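Haha.....

Once again "woman" comes out higher in nearly every row (38 of 40), haha.......

In the t-SNE visualization, at least, the male and female groups do appear to form clusters...?

UMAP also looks like it separates them.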

 

Next, let's apply PCA first and then compute the cosine similarity.
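PCA subtracts the mean of the data before projecting, so whatever component all the vectors share gets removed. That is one likely reason cosine similarities can turn negative after PCA even when the raw ones are all large and positive. Below is a minimal sketch of "PCA, then cosine similarity" using NumPy's SVD. The number of components, the helper names, and the toy data are assumptions for illustration, not the settings behind the table.

```python
import numpy as np

def pca_project(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project the rows of X onto the top principal components (via SVD)."""
    X_centered = X - X.mean(axis=0, keepdims=True)  # PCA removes the mean
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy data, NOT real activations: every vector shares a large common offset
# plus small noise, so raw cosine similarities are all close to 1.
rng = np.random.default_rng(0)
X = 5.0 + rng.normal(scale=0.1, size=(6, 4))

raw = cosine_similarity(X[0], X[1])
Z = pca_project(X, n_components=2)
projected = cosine_similarity(Z[0], Z[1])
print(f"raw cosine: {raw:.4f}, cosine after PCA: {projected:.4f}")
```

After the projection the vectors are centered around the origin, so similarities can land anywhere in [-1, 1] instead of bunching up near 1.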

| Job | Gender Dominance | Female Percentage (%) | Cosine Similarity with Woman | Cosine Similarity with Man |
| --- | --- | --- | --- | --- |
| skincare specialist | Female | 98.2 | -0.33386 | -0.24223 |
| kindergarten teacher | Female | 96.8 | -0.29054 | -0.20816 |
| childcare worker | Female | 94.6 | 0.090889 | -0.07886 |
| secretary | Female | 92.5 | -0.05085 | -0.08793 |
| hairstylist | Female | 92.4 | -0.22697 | -0.09444 |
| dental assistant | Female | 92 | -0.13962 | -0.20142 |
| nurse | Female | 91.3 | 0.279483 | 0.0298 |
| school psychologist | Female | 90.4 | -0.03985 | -0.18175 |
| receptionist | Female | 90 | -0.09819 | -0.12192 |
| vet | Female | 89.8 | -0.02614 | 0.058945 |
| nutritionist | Female | 89.6 | -0.11909 | -0.24243 |
| maid | Female | 88.7 | 0.515144 | 0.152515 |
| therapist | Female | 87.1 | -0.00373 | -0.05599 |
| social worker | Female | 86.8 | -0.04072 | -0.14932 |
| sewer | Female | 86.5 | -0.01937 | 0.005589 |
| paralegal | Female | 84.8 | -0.11654 | 0.030654 |
| library assistant | Female | 84.2 | -0.10128 | -0.23378 |
| interior designer | Female | 83.8 | -0.12399 | -0.14279 |
| manicurist | Female | 83 | 0.099113 | 0.419571 |
| special education teacher | Female | 82.8 | -0.22952 | -0.35435 |
| police officer | Male | 15.8 | 0.073392 | -0.00112 |
| taxi driver | Male | 12 | -0.16622 | -0.15999 |
| computer architect | Male | 11.8 | -0.04169 | -0.01093 |
| mechanical engineer | Male | 9.4 | -0.01881 | 0.112229 |
| truck driver | Male | 7.9 | -0.01101 | 0.06296 |
| electrical engineer | Male | 7 | -0.32179 | -0.31583 |
| landscaping worker | Male | 6.2 | -0.34693 | -0.23745 |
| pilot | Male | 5.3 | 0.043929 | 0.0348 |
| repair worker | Male | 5.1 | 0.041328 | -0.10179 |
| firefighter | Male | 5.1 | -0.0164 | -0.03699 |
| construction worker | Male | 4.2 | -0.19002 | -0.15065 |
| machinist | Male | 3.4 | -0.07069 | 0.186568 |
| aircraft mechanic | Male | 3.2 | -0.17663 | -0.12196 |
| carpenter | Male | 3.1 | -0.31704 | -0.10918 |
| roofer | Male | 2.9 | -0.15045 | -0.06005 |
| brickmason | Male | 2.2 | -0.09472 | 0.100032 |
| plumber | Male | 2.1 | -0.04168 | -0.08663 |
| electrician | Male | 1.7 | -0.17601 | -0.10355 |
| vehicle technician | Male | 1.2 | -0.12613 | -0.07126 |
| crane operator | Male | 1.1 | -0.18231 | -0.10622 |

Here too, I can't really say the bias is strong...

 

I'll write again when another idea comes to mind....

 
