
Natural Language Processing - Assignment 1.1

이게될까 2024. 3. 20. 20:18

I used Max (Verstappen) and Hamilton, who have been showing up a lot in Reels lately...

import numpy as np
from numpy import dot
from numpy.linalg import norm
import pandas as pd

# Cosine similarity
def cos_sim(A, B):
  return dot(A, B)/(norm(A)*norm(B))
  
from sklearn.feature_extraction.text import CountVectorizer # frequency based DTM

from sklearn.feature_extraction.text import TfidfVectorizer # tf-idf based DTM

TEXT = ['Verstappen is the son of former Formula One driver Jos Verstappen, and former go-kart racer Sophie Kumpen. He had a successful run in karting and single-seater categories – including FIA European Formula 3 – breaking several records.',
        'At the 2015 Australian Grand Prix, when he was aged 17 years, 166 days, he became the youngest driver to compete in Formula One. After spending the 2015 season with Scuderia Toro Rosso, Verstappen started his 2016 campaign with the Italian team before being promoted to parent team Red Bull Racing after four races as a replacement for Daniil Kvyat. At the age of 18, he won the 2016 Spanish Grand Prix on his debut for Red Bull Racing, becoming the youngest-ever driver and the first Dutch driver to win a Formula One Grand Prix.',
        'After winning ten Grands Prix during the 2021 season, Verstappen became Formula One World Drivers Champion for the first time, being the first Dutch driver and the 34th driver to do so.[6] He won the next two consecutive Formula One championships in 2022 and 2023. As of the 2024 Saudi Arabian Grand Prix, Verstappen has had 56 victories, 34 pole positions and 31 fastest laps. In addition to being the youngest Grand Prix winner, he holds several Formula One records, including the most wins in a season, the highest percentage of wins in a season, and the most consecutive wins. Verstappen is set to remain at Red Bull until at least the end of the 2028 season after signing a contract extension.',
        'His family has a long association with motor sports: his father is a Dutch former Formula One driver, his Belgian mother competed in karting,[12][13] and his first cousin once removed, Anthony Kumpen, competed in endurance racing and is a two-time NASCAR Whelen Euro Series champion currently serving as the team manager for PK Carsport in Euro Series.[12] Although Verstappen has a Belgian mother, was born in Belgium and resided in Bree, Belgium, he decided to compete with a Dutch racing licence. He stated he "feels more Dutch", having spent more time with his father than with his mother during his upbringing, and was always surrounded by Dutch people while growing up in Maaseik, a Belgian town at the Dutch border.[13] Verstappen said in 2015: I actually only lived in Belgium to sleep, but during the day I went to the Netherlands and had my friends there too. I was raised as a Dutch person and thats how I feel.[14] In 2022, however, Verstappen stated that he appreciates both sides and is half-half at the end of the day',
        'He competed in Formula One for more than a season before obtaining a road driving licence on his 18th birthday.[16] Verstappen moved to Monaco the day after, in October 2015, and has lived there since and has said it was not for tax reasons.[17] In November 2020, Verstappen bought a Dassault Falcon 900EX aircraft from Virgin Galactic. The aircraft is registered PH-DTF and operated by Exxaero',
        'Sir Lewis Carl Davidson Hamilton MBE HonFREng (born 7 January 1985) is a British racing driver competing in Formula One, driving for Mercedes, and has also driven for McLaren. Hamilton has won a joint-record seven Formula One World Drivers Championship titles (tied with Michael Schumacher), and holds the records for most number of wins (103), pole positions (104), and podium finishes (197), among other records.',
        'Born and raised in Stevenage, Hertfordshire, he began karting in 1993 at the age of eight and achieved success in local, national and international championships',
        'Hamilton joined the inaugural McLaren-Mercedes Young Driver Programme in 1998, and progressed to win the 2003 British Formula Renault Championship, 2005 Formula 3 Euro Series and the 2006 GP2 Series. This led to a Formula One drive with McLaren-Mercedes from 2007 to 2012, making him the first black driver to race in the series. In his debut season, Hamilton set numerous records as he finished runner-up to Kimi Räikkönen by one point. In 2008, he won his maiden title in dramatic fashion—making a crucial overtake on the last lap of the 2008 Brazilian Grand Prix, the last race of the season—to become the then-youngest ever Formula One World Champion. Following six seasons with McLaren, Hamilton signed with Mercedes in 2013',
        'Changes to the regulations for 2014 mandating the use of turbo-hybrid engines saw the start of a highly successful period for Hamilton, during which he won six further drivers titles. Consecutive titles came in 2014 and 2015 during the intense Hamilton–Rosberg rivalry. Following teammate Rosbergs title win and retirement in 2016, Ferraris Sebastian Vettel became Hamiltons closest rival in two championship battles, in which he twice overturned mid-season point deficits to claim consecutive titles again in 2017 and 2018. His third and fourth consecutive titles followed in 2019 and 2020 to equal Schumachers record of seven drivers titles. After surpassing 100 race wins and pole positions and finishing runner-up to Max Verstappen in 2021, Hamilton has not managed to win races during Formula Ones current ground effect era with Mercedes. Hamilton is set to join Ferrari for the 2025 season',
        'Hamilton has been credited with furthering Formula Ones global following by appealing to a broader audience outside the sport, in part due to his high-profile lifestyle, environmental and social activism, and exploits in music and fashion. He has also become a prominent advocate in support of activism to combat racism and push for increased diversity in motorsport. Hamilton was listed in the 2020 issue of Time as one of the 100 most influential people globally (Time 100), and was knighted (Knight Bachelor) in the 2021 New Year Honours'
]

## Your code here

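# Use CountVectorizer
tf_vectorizer = CountVectorizer(min_df=1, max_df=0.9, ngram_range=(1,1))
tf_features = tf_vectorizer.fit_transform(TEXT)

# Check the feature (word) names
feature_names = tf_vectorizer.get_feature_names_out() # is 'banana' dropped because it appears in every document? (max_df=0.9 drops terms found in more than 90% of documents)
#print(feature_names)

# Check the vectorized documents
features = np.array(tf_features.todense())
#features

# Represent the set of document vectors as a DataFrame
df = pd.DataFrame(data=features, columns=feature_names)
#df

print(' 0~ 4: max 5 ~ 9 : hamilton')
# Compute cosine similarity between consecutive documents
for i in range(9):
    print(f'Similarity between document {i} and document {i+1}:', cos_sim(features[i], features[i+1]))

# Compute cosine similarity between document 0 and every other document
for i in range(9):
    print(f'Similarity between document 0 and document {i+1}:', cos_sim(features[0], features[i+1]))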
    
    
    
tfidf_vectorizer = TfidfVectorizer(min_df=2, max_df=0.9, ngram_range=(1,1))
tfidf_features = tfidf_vectorizer.fit_transform(TEXT)

tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
#print(tfidf_feature_names)
tfidf_features = np.array(tfidf_features.todense()) 

df = pd.DataFrame(data=tfidf_features, columns=tfidf_feature_names)

print(' 0~ 4: max 5 ~ 9 : hamilton')
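# Compute cosine similarity between consecutive documents
for i in range(9):
    print(f'Similarity between document {i} and document {i+1}:', cos_sim(tfidf_features[i], tfidf_features[i+1]))

# Compute cosine similarity between document 0 and every other document
for i in range(9):
    print(f'Similarity between document 0 and document {i+1}:', cos_sim(tfidf_features[0], tfidf_features[i+1]))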
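
The two loops above only compare neighbouring documents and document 0 against the rest. As a rough sketch (not part of the assignment code), the full document-by-document similarity matrix can be computed in one step by normalizing each row and multiplying by the transpose. The function name `pairwise_cos_sim` and the toy count vectors are made up for illustration.

```python
import numpy as np

def pairwise_cos_sim(X):
    """Return the cosine-similarity matrix for the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = X / norms         # every non-zero row now has length 1
    return unit @ unit.T     # dot products of unit vectors are cosines

# Hypothetical toy count vectors: rows 0 and 1 share words, row 2 does not.
toy = np.array([
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
])
sim = pairwise_cos_sim(toy)
print(np.round(sim, 3))
```

Applied to the post's data, `pairwise_cos_sim(tfidf_features)` would give a 10x10 matrix in which the block `[:5, :5]` holds the Max-vs-Max similarities and `[5:, 5:]` the Hamilton-vs-Hamilton ones, which makes it easier to see whether documents about the same driver are closer to each other.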

 

 

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

def my_preprocessing(text, customized_stopwords=None):
    # Note: customized_stopwords is accepted but not used below.

    # 1. Remove unnecessary symbols and marks
    filtered_content = re.sub(r'[^\s\d\w]', '', text)

    # 2. Case conversion: change uppercase to lowercase
    filtered_content = filtered_content.lower()

    # 3. Word tokenization
    word_tokens = nltk.word_tokenize(filtered_content)  # split the text into small units of meaning

    # 4. POS tagging
    tokens_pos = nltk.pos_tag(word_tokens)  # attach part-of-speech tags so that only nouns can be selected

    # 5. Select noun words
    NN_words = []
    for word, pos in tokens_pos:
        if 'NN' in pos:
            NN_words.append(word)

    # 6. Lemmatization
    # Using the WordNetLemmatizer provided by nltk
    wlem = WordNetLemmatizer()
    lemmatized_words = []
    for word in NN_words:
        #new_word = wlem.lemmatize(word)  # reduces a word to its base form (disabled here, so words pass through unchanged)
        lemmatized_words.append(word)

    # 7. Stopwords removal
    # As a first pass, remove stopwords using the stopword list provided by nltk: the, a, an ....
    # These are words with very low tf-idf values.
    stopwords_list = set(stopwords.words('english'))
    final_NN_words = [word for word in lemmatized_words if word not in stopwords_list]

    return final_NN_words
    
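# Note: total_docs is not defined anywhere in this post; judging by the labels printed below,
# it is presumably a separate list of four documents (apple1, kevin, serena, apple2).
# The second argument is ignored by my_preprocessing.
docs_nouns = [my_preprocessing(doc, stopwords) for doc in total_docs]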

documents_filtered = [' '.join(doc) for doc in docs_nouns]

from sklearn.feature_extraction.text import TfidfVectorizer # tf-idf based DTM
tfidf_vectorizer = TfidfVectorizer(min_df=2, max_df=0.9, ngram_range=(1,1))
tfidf_features = tfidf_vectorizer.fit_transform(documents_filtered)

tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
#print(tfidf_feature_names)
tfidf_features = np.array(tfidf_features.todense()) 

df = pd.DataFrame(data=tfidf_features, columns=tfidf_feature_names)

print('0:apple1 \n1:kevin \n2:serena \n3:apple2 \n')
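# Compute cosine similarity between consecutive documents
for i in range(3):
    print(f'Similarity between document {i} and document {i+1}:', cos_sim(tfidf_features[i], tfidf_features[i+1]))

# Compute cosine similarity between document 0 and every other document
for i in range(3):
    print(f'Similarity between document 0 and document {i+1}:', cos_sim(tfidf_features[0], tfidf_features[i+1]))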
    
print(df)
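
To make the numbers in `df` less of a black box, here is a minimal hand-rolled sketch of the weighting, assuming scikit-learn's default settings (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The function name `tfidf` and the toy documents are made up, and the tokenization is simplified to whitespace splitting, so this is not a drop-in replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute L2-normalized tf-idf rows using a smoothed idf."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = {w: sum(w in toks for toks in tokenized) for w in vocab}
    # Smoothed idf: rarer words get a larger weight.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

# Hypothetical toy documents, already reduced to space-joined nouns.
toy_docs = ["driver title race", "driver title season", "apple fruit"]
vocab, rows = tfidf(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the first toy document, "race" occurs in only one document while "driver" occurs in two, so "race" ends up with the larger weight. The same effect is why words shared by most of the documents contribute little to the cosine similarities above.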

 

The big problem here was that Korean text would not render...

kaggle....

import os  # needed for the os.system calls below
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

plt.rc('font', family='NanumBarunGothic') 
def change_matplotlib_font(font_download_url):
    FONT_PATH = 'MY_FONT'
    
    font_download_cmd = f"wget {font_download_url} -O {FONT_PATH}.zip"
    unzip_cmd = f"unzip -o {FONT_PATH}.zip -d {FONT_PATH}"
    os.system(font_download_cmd)
    os.system(unzip_cmd)
    
    font_files = fm.findSystemFonts(fontpaths=FONT_PATH)
    for font_file in font_files:
        fm.fontManager.addfont(font_file)

    font_name = fm.FontProperties(fname=font_files[0]).get_name()
    matplotlib.rc('font', family=font_name)
    print("font family: ", plt.rcParams['font.family'])

font_download_url = "https://fonts.google.com/download?family=Noto%20Sans%20KR"
change_matplotlib_font(font_download_url)

Adding this fixed it, haha...

with open('/kaggle/input/2024-1-nlp-1/2016_filtered_review.txt', encoding='utf-8') as f:
    docs = [line.strip().split('\t\t') for line in f]
    
reviews = [doc[1].strip().split() for doc in docs]
from gensim.models import Word2Vec # gensim already provides a Word2Vec implementation
model_sg_n10 = Word2Vec(reviews, window=3, min_count=3, vector_size=100, sg=1, negative=10) # min_count=3: only learn words that appear at least 3 times
# vector_size = dimensionality of the word vectors
# sg=1 selects skip-gram
# negative = number of negative samples

## your code here
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

# Get the lists of similar words and prepare their vectors as an array
# Query words: chicken, music, whisky, soju, passion
similar_words = ['치킨', '음악', '양주', '소주', '열정']
all_similar_words = []
word_categories = []

for idx, word in enumerate(similar_words):
    similar = [item[0] for item in model_sg_n10.wv.most_similar(word, topn=10)]
    all_similar_words.extend(similar)
    word_categories.extend([idx] * len(similar))  # words from the same query share a category number

# Extract the vectors of the similar words
word_vectors = np.array([model_sg_n10.wv[word] for word in all_similar_words])

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_reduced = tsne.fit_transform(word_vectors)

# Visualization
plt.figure(figsize=(10, 10))
colors = ['red', 'green', 'blue', 'purple', 'orange']  # one color per query word

for idx, color in enumerate(colors):
    indices = [i for i, x in enumerate(word_categories) if x == idx]
    plt.scatter(X_reduced[indices, 0], X_reduced[indices, 1], c=color, label=similar_words[idx])
    # Label each point with its word
    for i in indices:
        plt.text(X_reduced[i, 0], X_reduced[i, 1], all_similar_words[i], fontsize=9)

plt.legend()
plt.show()
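
For intuition about what `most_similar` returns, the sketch below ranks words by cosine similarity to a query vector, which is the measure gensim uses for that ranking. The three-dimensional vectors and the helper name `most_similar_sketch` are invented for illustration; a trained model would use the 100-dimensional vectors from `model_sg_n10.wv`.

```python
import numpy as np

def most_similar_sketch(word, vectors, topn=2):
    """Rank the other words by cosine similarity to `word`."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue  # the query word itself is not a neighbour
        scores.append((other, float(query @ vec / np.linalg.norm(vec))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy embeddings, not taken from the trained model.
toy_vectors = {
    "soju": np.array([0.9, 0.1, 0.0]),
    "beer": np.array([0.8, 0.2, 0.1]),
    "whisky": np.array([0.7, 0.0, 0.3]),
    "music": np.array([0.0, 0.1, 0.9]),
}
neighbours = most_similar_sketch("soju", toy_vectors)
print(neighbours)
```

Words whose vectors point in nearly the same direction as the query come first, which is why each query's neighbours tend to land close together in the t-SNE plot.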

 

 
