자연어 처리 : LLaMa Pretrain하기

인공지능/자연어 처리

자연어 처리 : LLaMa Pretrain하기 - python 실습

이게될까 2024. 7. 21. 21:15

728x90

https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform?pli=1&pli=1

Request Form

Thank you for your interest in Meta AI’s LLaMA (Large Language Model Meta AI) models. To request access to the models, please fill out this form, and we'll review and let you know if your use case is approved. The information you provide below will be us

docs.google.com

여기에서 라마에 대한 접근을 요청할 수 있습니다

'상업적으로 사용하지 않을 것이다'가 제일 중요하네요

65B모델은 램만 130GB를 잡아먹는다고 하니 ㅎㅎ.......

그냥 비슷한 모델을 만든다는 느낌으로 진행하면 되겠습니다.

리눅스 환경이 필요해서 잠시 우분투로 갔다오겠습니다...

용량이 너무 커서 다운 받는 것은 포기...

나중에 랩실에서 하거나 사 내에서 해야겠네요

https://huggingface.co/docs/datasets/index

Datasets

huggingface.co

데이타 셋이 모여있는 사이트입니다.

https://huggingface.co/datasets/abisee/cnn_dailymail

abisee/cnn_dailymail · Datasets at Hugging Face

HONG KONG, China (Reuters) -- Paul Lee got his liver from an executed Chinese prisoner; Karam in Egypt bought a kidney for his sister for $5,300; in Istanbul Hakan is holding out for $30,700 for one of his kidneys. Doctors in Pakistan have been arrested fo

huggingface.co

데이터 구성은 이렇게 되어있습니다.

데이터 셋도 나뉘어져 있네요 ㅎㅎ

from datasets import load_dataset,DatasetDict

raw_dataset = load_dataset("abisee/cnn_dailymail", "3.0.0")

아래와 같이 데이터 셋을 나눌 수 있습니다.

raw_dataset['train'][0]['article'][:200]

"It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria. Obama sent a letter to the heads of the House and Senate on Saturday night, hours afte"

너무 기니까 잘라서 보기!

raw_dataset['train'].to_pandas()

판다스 프레임워크로 변환하여 출력하기!

데이터는 28만개로 상당히 많이 있다.

sampled_dataset = DatasetDict(
    {
        "train": raw_dataset['train'].select(range(50000)).shuffle(),
        "valid": raw_dataset['test'].select(range(5000)).shuffle()
    }
)

데이터가 너무 많기 때문에 조금만 뽑아서 테스트를 진행해본다.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer

Gpt 2 토크나이저를 불러온다.

def get_training_corpus(ds):
    return(
        ds[i:i+1000]['article'] for i in range(0, len(ds), 1000)
    )


training_corpus = get_training_corpus(raw_dataset['train'])

한 번에 토크나이저에 올리면 데이터의 양이 너무 많기 때문에 조금씩 올려준다.

%%time

tokenizer = tokenizer.train_new_from_iterator(training_corpus, vocab_size=50257)

vocab은 원본과 똑같은 사이즈이다.

여기서 시간이 좀 걸린다. 데이터 양이 어마무시하기 때문에...

sample_text = "It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria"

tokenizer.tokenize(sample_text)

['It',
"'s",
'Ġofficial',
':',
'ĠU',
'.',
'S',
'.',
'ĠPresident',
'ĠBarack',
'ĠObama',
'Ġwants',
'Ġlawmakers',
'Ġto',
'Ġweigh',
'Ġin',
'Ġon',
'Ġwhether',
'Ġto',
'Ġuse',
'Ġmilitary',
'Ġforce',
'Ġin',
'ĠSyria']

테스트 문장을 통해 토크나이저 확인하기!

context_length = 128

def tokenize(batch):
    outputs = tokenizer(
        batch['article'],
        max_length=context_length,
        truncation=True,
        return_overflowing_tokens=True,
        return_length=True
    )
    
    input_batch = []
    for length, input_ids in zip(outputs['length'], outputs['input_ids']):
        if length==context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}

128이 되지 않는 데이터들은 잘라준다.

이 코드는 텍스트 데이터를 토크나이저(tokenizer)를 사용하여 토큰화하고, 그 결과로 얻은 입력 아이디(input_ids)를 반환하는 함수입니다. 다음은 각 부분에 대한 설명입니다:

context_length = 128:
- 토크나이저가 생성할 토큰 시퀀스의 최대 길이를 정의합니다. 여기서는 128로 설정되어 있습니다.
def tokenize(batch)::
- batch라는 입력을 받아서 토큰화된 결과를 반환하는 함수 tokenize를 정의합니다.
outputs = tokenizer(...:
- tokenizer 함수를 호출하여 batch['article']에 있는 텍스트 데이터를 토큰화합니다.
- max_length=context_length: 최대 길이를 context_length로 설정합니다. 이는 128 토큰으로 자릅니다.
- truncation=True: 텍스트가 max_length보다 길면 잘라냅니다.
- return_overflowing_tokens=True: 하나의 입력이 max_length를 초과하는 경우 초과된 토큰들을 반환합니다.
- return_length=True: 각 토큰 시퀀스의 길이를 반환합니다.
input_batch = []:
- 결과를 저장할 빈 리스트 input_batch를 생성합니다.
for length, input_ids in zip(outputs['length'], outputs['input_ids'])::
- outputs['length']와 outputs['input_ids']를 병렬로 순회합니다. 여기서 outputs['length']는 각 토큰 시퀀스의 길이를, outputs['input_ids']는 각 토큰 시퀀스를 나타냅니다.
if length==context_length::
- 토큰 시퀀스의 길이가 context_length와 같은 경우에만 다음 코드를 실행합니다.
input_batch.append(input_ids):
- 길이가 context_length인 토큰 시퀀스를 input_batch에 추가합니다.
return {"input_ids": input_batch}:
- input_batch를 input_ids라는 키를 가진 딕셔너리로 반환합니다.

요약하면, 이 코드는 주어진 텍스트 데이터 배치를 최대 길이 128로 토큰화한 후, 길이가 정확히 128인 토큰 시퀀스만 골라서 반환합니다. 이렇게 하면 일정한 길이의 입력 데이터만 사용하여 모델을 훈련시키거나 예측에 사용할 수 있습니다.

tokenized_datasets = sampled_dataset.map(tokenize, batched=True, remove_columns=raw_dataset['train'].column_names)

토크나이저를 원본 데이터셋에 적용해준다.

from transformers import LlamaConfig

configuration = LlamaConfig()

configuration

LlamaConfig {
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 2048,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "tie_word_embeddings": false,
  "transformers_version": "4.28.0",
  "use_cache": true,
  "vocab_size": 32000
}

tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.vocab_size

(50256, 50256, 50257)

configuration = LlamaConfig(**{
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "hidden_act": "silu",
  "hidden_size": 512,
  "initializer_range": 0.02,
  "intermediate_size": 1376,
  "max_position_embeddings": 128,
  "model_type": "llama",
  "num_attention_heads": 4,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "tie_word_embeddings": False,
  "transformers_version": "4.28.0",
  "use_cache": True,
  "vocab_size": 50257
})

Config를 토크나이저에 맞게 바꿔주고 크기도 많이 줄여준다.

from transformers import LlamaForCausalLM

model = LlamaForCausalLM(configuration)
model

우리가 만든 config를 통해 모델을 불러온다.

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Setting ds_accelerator to cuda (auto detect)
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(50257, 512, padding_idx=0)
    (layers): ModuleList(
      (0-3): 4 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=512, out_features=512, bias=False)
          (k_proj): Linear(in_features=512, out_features=512, bias=False)
          (v_proj): Linear(in_features=512, out_features=512, bias=False)
          (o_proj): Linear(in_features=512, out_features=512, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=512, out_features=1376, bias=False)
          (down_proj): Linear(in_features=1376, out_features=512, bias=False)
          (up_proj): Linear(in_features=512, out_features=1376, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=512, out_features=50257, bias=False)
)

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

모델 GPU에 넣어주기

model.to(device)

모델 GPU에 넣어주기

prompt = "It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in "

inputs = tokenizer(prompt, return_tensors='pt')
inputs.to(device)

generate_ids = model.generate(inputs.input_ids, max_length=50)
generate_ids

tensor([[ 1026,   338,  1743,    25,   471,    13,    50,    13,  1992,  8732,
          2486,  3382, 10191,   284, 10164,   287,   319,  1771,   284,   779,
          2422,  2700,   287,   220,  8229,  8229,  8229,  8229,  8229,  8229,
         45936, 45936, 45936, 45936, 45936, 45936, 45936, 45936, 45936, 45936,
         45936,  7907,  7907,  7907,  7907, 45827, 29700,  7907, 45827, 19297]],
       device='cuda:0')

무엇인가 생성되었다.

tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

"It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Return Return Return Return Return Return Humane Humane Humane Humane Humane Humane Humane Humane Humane Humane Humane captured captured captured captured Inspiredinders captured Inspired Roosevelt"

학습되지 않은 모델이다보니 이상하게 출력하는 것을 볼 수 있다.

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

out = data_collator([tokenized_datasets['train'][i] for i in range(3)])

for key in out:
    print(f"{key}: {out[key].shape}")

input_ids: torch.Size([3, 128])
attention_mask: torch.Size([3, 128])
labels: torch.Size([3, 128])

레이블이 생겼다!

out['input_ids'][0][:20], out['attention_mask'][0][:20], out['labels'][0][:20]

(tensor([    7, 18474,     8,  1377,  1649,   257,  3052,   326,  3667,   517,
           621,   257,  2063,    12, 24540,  9651,  9692,  3011, 19957,    11]),
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
tensor([    7, 18474,     8,  1377,  1649,   257,  3052,   326,  3667,   517,
           621,   257,  2063,    12, 24540,  9651,  9692,  3011, 19957,    11]))

인풋과 레이블이 동일한 것을 볼 수 있다.
모델 안에서 시프트를 하기 때문에 문제 없이 작동한다.

from transformers import TrainingArguments

batch_size = 32
logging_steps = 1000
learning_rate=5e-4
num_epochs=1

args = TrainingArguments(
    output_dir='newsllama',
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy='steps',
    eval_steps=logging_steps,
    logging_steps=logging_steps,
    save_steps=logging_steps,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=logging_steps,
    lr_scheduler_type='cosine',
    learning_rate=5e-4,
    fp16=True,
    push_to_hub=False
)

from transformers import Trainer

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['valid']
)

트레이너 정리하기!

trainer.train()

모델을 학습시킨다

prompt = """It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in"""
inputs = tokenizer(prompt, return_tensors="pt")
inputs.to("cuda:0")

# Generate
generate_ids = model.generate(inputs.input_ids, max_length=50)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

'It\'s official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in the United States. "We are not going to be a very good thing," he said. "We are not going to be able'

이번에는 학습된 모델로 예측을 진행해보았다.

prompt = """Shall I compare thee to a summer’s day?
Thou art more lovely and more temperate:
Rough winds do shake the"""
inputs = tokenizer(prompt, return_tensors="pt")
inputs.to("cuda:0")

# Generate
generate_ids = model.generate(inputs.input_ids, max_length=50)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

"Shall I compare thee to a summer’s day?\nThou art more lovely and more temperate:\nRough winds do shake the world's most important-time world. The world's most important-time world is the first"

문학에 대해서도 어떻게든 말을 이어간다.

prompt = """As a scientific endeavor, machine learning grew out of the quest for artificial intelligence (AI). In the early days of AI as an academic discipline, some researchers were interested in"""
inputs = tokenizer(prompt, return_tensors="pt")
inputs.to("cuda:0")

# Generate
generate_ids = model.generate(inputs.input_ids, max_length=50)
tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

'As a scientific endeavor, machine learning grew out of the quest for artificial intelligence (AI). In the early days of AI as an academic discipline, some researchers were interested in the country\'s capital. The U.S. government has been a "a'

타 도메인에서도 어떻게든 말을 이어가려고 진행한다.

model.save_pretrained('daily_llama_0721')
tokenizer.save_pretrained('daily_tokenizer_0721')

모델 저장하기

저작자표시

'인공지능 > 자연어 처리' 카테고리의 다른 글

자연어 처리 Python 실습 - Parameter Efficient Fine tuning (3)	2024.07.22
자연어 처리 python 실습 - LLaMa instruction Tuning (1)	2024.07.21
자연어 처리 LLaMa 모델 분석하기 (0)	2024.07.21
자연어 처리 : 분산학습 - Distributed Training, Python 실습 (1)	2024.07.21
Gen AI LM - GPT (1)	2024.07.20

현재글자연어 처리 : LLaMa Pretrain하기 - python 실습

인공지능, 자율주행에 관심있는 공대생의 일기장...?

Today :
Yesterday :

공대생 도전 일지