Pre-training = the stage of learning from general data before specializing to a specific task; the model picks up generalized linguistic features.
1. Pre-training in the past
In the past, deep networks did not train well end-to-end, so each layer was trained separately and the layers were then stacked together.
2. Pre-trained word embeddings - Word2Vec, FastText, GloVe
Successfully compress word meaning into dense vector representations.
BUT they fail to reflect the morphological properties of words and handle OOV (out-of-vocabulary) words poorly (FastText's subword n-grams partially address this), and training takes longer as the vocabulary grows.
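A minimal sketch of the OOV point above, under the assumption that the gensim package (4.x) is available; the toy corpus and hyperparameters are only illustrative. It shows that Word2Vec has no vector for an unseen word, while FastText can still compose one from character n-grams.

```python
# Minimal sketch of the OOV problem, assuming gensim (>=4.x) is installed.
from gensim.models import Word2Vec, FastText

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "pets"],
]

w2v = Word2Vec(sentences=corpus, vector_size=50, min_count=1, epochs=10)
ft = FastText(sentences=corpus, vector_size=50, min_count=1, epochs=10)

print("cat" in w2v.wv)    # True: the word was seen during training
print("catz" in w2v.wv)   # False: Word2Vec has no vector for an unseen word

# FastText composes a vector from subword n-grams, so an unseen word still gets one.
print(ft.wv["catz"][:5])
```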
3. Pre-trained language model - ELMo
Because a word's vector changes with its context, a word with several senses gets different representations, which partially resolves the Word2Vec-style limitations above.
However, ELMo is still LSTM-based, so structural problems remain:
information loss from compressing everything into a fixed-size vector, the vanishing gradient problem, and no parallel processing because the input is consumed sequentially.
But doesn't GPT also take its input in order...?
Yes, both ELMo (Embeddings from Language Models) and GPT (Generative Pre-trained Transformer) face challenges with parallel computation, but the nature of the problem and how it impacts these models differs.
1. ELMo and Parallelization
ELMo is based on bidirectional LSTMs (Long Short-Term Memory networks), which process text sequentially. This means that each word’s representation depends on the previous word (or sequence of words), making it difficult to parallelize the computation.
- Sequential Dependency: LSTMs inherently process data one step at a time, which means that in ELMo, to compute the representation for a word at position t, the model must first compute the representations for all previous words. This sequential nature limits parallelization because the computations can’t be done all at once; they depend on each other step by step.
- Bidirectionality: ELMo uses bidirectional LSTMs, meaning it processes the text both forwards and backwards. This further complicates parallelization because the forward and backward passes need to be computed in sequence before combining the results.
2. GPT and Parallelization
GPT (Generative Pre-trained Transformer), particularly in its earlier versions like GPT-2 and GPT-3, also faces challenges with parallelization, but the nature of these challenges is different.
- Transformer Architecture: GPT is based on the Transformer architecture, which uses self-attention mechanisms. Unlike LSTMs, Transformers can process entire sequences in parallel within each layer because the attention mechanism allows each token to attend to every other token in the sequence simultaneously. This means that, within a single layer, the computations can be parallelized.
- Autoregressive Nature: Despite the parallelism within layers, GPT is an autoregressive model, meaning it generates text one token at a time. When generating text, the model must compute the output for the first token before moving on to the next one. This sequential generation introduces a bottleneck similar to what we see in LSTMs, but only during the generation phase, not during training or forward passes.
- Parallel Computation During Training: During training, the entire sequence can be processed at once, allowing for significant parallelization. However, when generating text (especially in large models like GPT-3), the model must generate each token sequentially, which limits parallelization.
Key Differences in Parallelization Challenges:
- ELMo:
- Sequential Processing: ELMo's reliance on bidirectional LSTMs makes the entire computation sequential, limiting parallelization during both training and inference.
- Difficulty in Parallelizing: The inherent nature of LSTMs means that every time step depends on the previous one, so it’s hard to speed up the computation with parallel hardware.
- GPT:
- Parallelization in Training: GPT can parallelize computations during training within each layer because of its Transformer architecture.
- Sequential Generation: The limitation in GPT comes during text generation, where tokens are generated one after the other, which is less parallelizable.
Summary:
- ELMo faces significant parallelization issues because of its LSTM-based architecture, making it difficult to parallelize both during training and inference.
- GPT improves on parallelization through its Transformer architecture, especially during training, but still faces sequential limitations during text generation due to its autoregressive nature.
So, while both ELMo and GPT encounter challenges related to parallel computation, the extent and context of these challenges differ. GPT is generally more efficient in terms of parallelization, particularly during training, but like ELMo, it faces limitations when it comes to sequential tasks like text generation.
Parallelization in Training:
- GPT:
- Can Parallelize During Training: GPT, based on the Transformer architecture, can parallelize computations during the training phase. This is because the self-attention mechanism allows all tokens in a sequence to be processed simultaneously within each layer. This parallel processing makes GPT much more efficient to train compared to models based on LSTMs, like ELMo.
- ELMo:
- Cannot Fully Parallelize During Training: ELMo uses bidirectional LSTMs, which are inherently sequential. This sequential nature means that each word's representation depends on the previous word (or sequence of words). Therefore, ELMo cannot fully parallelize computations during training because each step must be calculated in order.
Parallelization in Inference:
- GPT:
- Limited Parallelization During Inference: During inference (particularly when generating text), GPT operates in an autoregressive manner. This means it generates one token at a time, where each subsequent token depends on the previously generated tokens. Because of this sequential dependency, GPT cannot fully parallelize the inference process. However, within each step of generating a token, the computations (like attention mechanisms) can still be parallelized, but the generation of tokens must happen sequentially.
- ELMo:
- Limited Parallelization During Inference: Similar to training, ELMo’s inference process is also limited by the sequential nature of LSTMs. The model processes each word in sequence, relying on the previous word’s computation, which restricts parallelization during inference.
Summary:
- GPT:
- Training: Can use parallel computation extensively due to the Transformer’s self-attention mechanism.
- Inference: Limited parallelization during text generation because it generates tokens one by one in sequence.
- ELMo:
- Training: Cannot use parallel computation effectively due to the sequential nature of LSTMs.
- Inference: Also limited in parallelization due to sequential processing requirements.
So, while GPT is much more efficient in training because of its ability to parallelize, both GPT and ELMo face challenges with parallel computation during inference, but for slightly different reasons. GPT's limitations are tied to the autoregressive generation of tokens, whereas ELMo's limitations are due to the sequential nature of LSTMs.
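To make the training-vs-generation contrast above concrete, here is a minimal PyTorch sketch (a single toy attention layer with made-up dimensions, not any particular GPT implementation): with a causal mask, training scores every position in one parallel forward pass, while generation has to loop token by token.

```python
# Minimal sketch (PyTorch) of why decoder-style models parallelize training but not generation.
import torch

seq_len, vocab, d = 6, 100, 32
tokens = torch.randint(0, vocab, (1, seq_len))      # a toy target sequence
emb = torch.nn.Embedding(vocab, d)
head = torch.nn.Linear(d, vocab)
attn = torch.nn.MultiheadAttention(d, num_heads=4, batch_first=True)

# Training (teacher forcing): one forward pass scores every position at once.
# The causal mask keeps position i from attending to positions > i.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
x = emb(tokens)
h, _ = attn(x, x, x, attn_mask=causal_mask)
logits = head(h)                                    # (1, seq_len, vocab), computed in parallel

# Generation (autoregressive): tokens must be produced one at a time.
generated = tokens[:, :1]
for _ in range(5):
    x = emb(generated)
    h, _ = attn(x, x, x)                            # re-encode the growing prefix
    next_token = head(h[:, -1]).argmax(dim=-1, keepdim=True)
    generated = torch.cat([generated, next_token], dim=1)
```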
4. Pre-trained language model - Transformer
Made it possible to learn generalized knowledge from broad-ranging data.
Also improved robustness and uncertainty estimation.
Transfer learning - a training scheme that transfers a model learned on a data-rich source domain to a data-scarce target domain.
Backbone - the skeleton model; it should be large and generalize well.
Benefits - efficient training, prevents overfitting on a small dataset, makes up for missing domain knowledge.
Fine-tuning - essentially the same idea as transfer learning!
Low-level layers - extract general features.
Last layer - extracts task-specific features.
The various downstream tasks of GPT
The various downstream tasks of BERT
How much of the network to freeze during fine-tuning depends on how related the pre-training source domain is to the target domain, and on how much target data is available.
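As a concrete illustration of freezing, a minimal sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available; the choice of freezing the lower 8 of 12 layers is arbitrary, not a recommendation.

```python
# Minimal sketch (Hugging Face transformers) of partial freezing during fine-tuning.
# How many layers to freeze depends on source/target domain similarity and target data size.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the embeddings and the lower 8 of the 12 encoder layers...
for module in [model.bert.embeddings, *model.bert.encoder.layer[:8]]:
    for p in module.parameters():
        p.requires_grad = False

# ...and train only the top layers plus the classification head.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```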
Adapter - insert adapter layers into the encoder or decoder and train only those layers; swapping adapters in and out lets one backbone solve different tasks.
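A minimal sketch of an adapter block of the kind described above: a small bottleneck with a residual connection, inserted into each Transformer layer. The dimensions are illustrative only.

```python
# Minimal sketch of an adapter block: a bottleneck layer with a residual connection.
# Only these parameters are trained; the surrounding Transformer weights stay frozen.
import torch

class Adapter(torch.nn.Module):
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = torch.nn.Linear(hidden_dim, bottleneck_dim)
        self.up = torch.nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, h):
        # Residual connection: the adapter learns a small correction to the frozen features.
        return h + self.up(torch.nn.functional.relu(self.down(h)))
```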
Prefix - learn a fixed number of embedding tokens per task and prepend them to the input.
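And a minimal sketch of the prefix idea: a fixed number of trainable embedding vectors per task, prepended to the (frozen) token embeddings. The prefix length and hidden size are just example values.

```python
# Minimal sketch of prefix tuning: trainable prefix embeddings prepended to the input.
import torch

class PrefixEmbedding(torch.nn.Module):
    def __init__(self, prefix_len: int = 20, hidden_dim: int = 768):
        super().__init__()
        # Only this parameter is trained for the task; the LM itself stays frozen.
        self.prefix = torch.nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)

    def forward(self, token_embeddings):              # (batch, seq_len, hidden_dim)
        batch = token_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeddings], dim=1)
```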
In-context learning - if the model is large enough and generalizes well, tasks can be performed through the prompt alone, with no additional training.
FL learning - trains only some of the fully connected (FCN) layers of the encoder or decoder; reported to beat prompt tuning.
Unstable training - even training the same model on the same dataset can give varying results from run to run
-> use hyperparameter tuning to find a parameter combination that gives a stable learning curve.
Catastrophic forgetting
When two tasks are trained one after the other, the model forgets the task it learned first.
Small Size of the Fine-tuning Dataset
A problem that occurs when there are few samples; running more epochs brings the unstable variance back under control
-> but only up until overfitting kicks in, of course.
Things to consider for generalization (a combined sketch follows this list):
Dropout - keeps the model from relying on specific nodes, improving performance.
Weight decay - penalizes weights so they do not grow too large.
Learning rate scheduling - adjusts the learning rate as training progresses to help convergence.
Global gradient clipping - caps the gradients so they do not exceed a threshold.
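A minimal PyTorch sketch combining the four techniques above in one training loop. The tiny model, random data, and hyperparameter values are placeholders chosen only to make the loop runnable.

```python
# Minimal sketch (PyTorch): dropout (inside the model), weight decay,
# learning-rate scheduling, and global gradient clipping in one loop.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.1),                     # dropout: avoid relying on specific units
    torch.nn.Linear(32, 2),
)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)  # weight decay
scheduler = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0,
                                              end_factor=0.1, total_iters=100)

for step in range(100):
    x = torch.randn(8, 16)                       # placeholder batch
    y = torch.randint(0, 2, (8,))                # placeholder labels
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Global gradient clipping: rescale all gradients if their combined norm exceeds 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                             # learning-rate scheduling per step
```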
Natural language classification - a subset of natural language understanding (NLU): any task that understands a natural-language sentence or document and assigns it to an appropriate class.
Sentiment Analysis Task - classify the exact sentiment a sentence carries.
Natural Language Inference Task - given a premise and a hypothesis, classify the hypothesis as true, false, or unrelated (entailment / contradiction / neutral).
Intent Classification Task - classify the exact intent behind a given sentence.
Sentence and Document Classification Task - classify a given sentence or document into a predefined category.
Evaluation (a small computation example follows this list)
Precision - among the items predicted positive, the fraction that are actually positive = the key metric when you must not mislabel a truly negative example as positive.
Recall - among the truly positive items, the fraction predicted positive = the key metric when you must not mislabel a positive example as negative.
F1 Score - the harmonic mean of precision and recall.
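A minimal pure-Python sketch of the three metrics above, computed from made-up label lists.

```python
# Minimal sketch: precision, recall, and F1 from predicted and true binary labels.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))   # (0.67, 0.67, 0.67)
```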
Natural language generation
Machine translation task
Machine reading comprehension task - a question-answering task that generates an answer to a question from a given document.
Summarization task
Evaluation methods
PPL (Perplexity) - the N-th root of the inverse of the cumulative probability of the tokens chosen until the sentence is complete, i.e. PPL = (∏ p(w_i | w_<i))^(-1/N).
A measure of how "confused" the model was while generating the reference sentence: the lower it is, the more confidently the sentence was generated, which we can take as a sign that training went in the intended direction.
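A minimal sketch of that formula, written as the exponential of the average negative log-probability (equivalent to the N-th root form above); the per-token probabilities here are made up.

```python
# Minimal sketch: perplexity from the probabilities a model assigned to a sentence's tokens.
import math

token_probs = [0.4, 0.25, 0.1, 0.6]    # made-up p(w_i | w_<i) for each token
n = len(token_probs)

# PPL = (prod p_i)^(-1/N) = exp(-(1/N) * sum(log p_i))
ppl = math.exp(-sum(math.log(p) for p in token_probs) / n)
print(ppl)    # lower means the model generated this sentence with more confidence
```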
BLEU (Bilingual Evaluation Understudy)
Based on surface-form (n-gram) similarity to the reference sentence, so it needs improvement = it cannot capture the meaning of a sentence.
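A minimal sketch of that weakness, assuming the nltk package is available: a correct paraphrase with different wording scores far below a verbatim copy, because BLEU only counts n-gram overlap.

```python
# Minimal sketch (assuming nltk is installed): BLEU rewards surface overlap, not meaning.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]
verbatim = ["the", "cat", "is", "on", "the", "mat"]
paraphrase = ["a", "feline", "sits", "on", "the", "carpet"]

smooth = SmoothingFunction().method1
print(sentence_bleu(reference, verbatim, smoothing_function=smooth))    # 1.0: exact copy
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))  # near 0 despite similar meaning
```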
SSA (Sensibleness and Specificity Average)
Sensibleness - 0 if the generated sentence does not make sense, 1 if it does; only when it scores 1 does the next check proceed.
Specificity - score 1 if the generated sentence contains specific, contextual content.
But how is this actually evaluated?
You're correct in thinking that SSA (Semantic Similarity Analysis) in NLP (Natural Language Processing) tasks involves understanding the meaning of text, which connects it closely with Natural Language Understanding (NLU). Since SSA involves comparing the meanings of different pieces of text, Language Models (LMs) are indeed important in facilitating this task.
Let’s break this down step-by-step.
What is SSA in NLP?
Semantic Similarity Analysis (SSA) is a technique used to measure how similar two pieces of text are in meaning, rather than just comparing them based on exact words or surface-level characteristics. For example, the sentences:
- "The cat is on the mat."
- "A feline is sitting on the carpet."
Although the wording is different, SSA would try to determine that these two sentences are similar in meaning. This is useful in various NLP tasks such as:
- Text classification
- Question answering
- Text summarization
- Paraphrase detection
How SSA Is Done Using Language Models:
Since SSA requires a deep understanding of the text, Language Models (LMs) such as BERT, GPT, and ELMo are often employed for this purpose. Here’s how the process typically works:
1. Text Representation with Language Models
To evaluate the semantic similarity between two texts, you first need to represent the texts in a way that captures their meanings. Modern pre-trained language models (such as BERT, GPT, and even FastText embeddings) can be used to convert these texts into vector representations (embeddings) that capture semantic information.
For example:
- BERT or ELMo generates contextualized embeddings for each word in a sentence, where the meaning of a word is influenced by its context.
- GPT generates embeddings based on the transformer architecture and can also be used for understanding semantic relations between sentences.
These embeddings are then compared to see how similar they are in terms of meaning.
2. Similarity Measurement Methods
Once you have vector representations (embeddings) for the two texts, you can measure their similarity. Common approaches include the following (a small cosine-similarity example follows this list):
- Cosine Similarity: This measures the cosine of the angle between the two vectors representing the texts. A cosine similarity of 1 means the vectors are identical (very similar text), while a value of 0 means they are completely dissimilar.
- Euclidean Distance or Manhattan Distance: These metrics calculate the distance between the two vectors in the embedding space.
- Dot Product: Another common similarity measure that can be used to find the overlap between two vectors.
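A minimal numpy sketch of cosine similarity between two embedding vectors; the vector values are made up purely for illustration.

```python
# Minimal sketch: cosine similarity between two embedding vectors (values are made up).
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

u = np.array([0.2, 0.9, 0.4])
v = np.array([0.25, 0.8, 0.5])
print(cosine_similarity(u, v))   # close to 1.0 when the vectors point in similar directions
```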
3. Training for Specific Tasks
In some cases, SSA is performed within a supervised learning framework, where you fine-tune pre-trained language models (like BERT) on a task-specific dataset. This is common in applications like semantic textual similarity (STS) tasks, where models are trained to predict a similarity score between sentence pairs.
For instance:
- Sentence BERT (SBERT) is a modified version of BERT that is specifically designed to create fixed-size sentence embeddings, making it very effective for SSA tasks.
- Fine-tuning GPT or other LMs on task-specific data (e.g., paraphrase detection or question-answering pairs) can also yield good SSA performance.
4. Using NLU (Natural Language Understanding) for SSA
NLU focuses on extracting the meaning and structure from text, and SSA depends heavily on the ability of LMs to understand the context and semantics of words and sentences. Models that are effective at NLU tasks, like BERT or ELMo, are typically also strong at SSA. These models understand the deeper connections between words and their meanings, enabling more accurate comparisons of text pairs.
Steps for SSA with LM-based Models (a short end-to-end sketch follows this list):
- Preprocess Text: Clean and prepare the text for input into the language model (tokenization, lowercasing, etc.).
- Embed Text: Use a pre-trained language model (like BERT or GPT) to convert the text into vector embeddings.
- Compare Embeddings: Measure the similarity between the embeddings using a similarity metric like cosine similarity or dot product.
- Thresholding (if needed): Set a threshold for what is considered “similar.” If two sentences have a similarity score above a certain value, they are considered semantically similar.
- Evaluate Performance: Evaluate the performance of the model by comparing it to a labeled dataset (where human annotators have marked pairs of sentences as similar or not).
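A minimal end-to-end sketch of those steps, assuming the sentence-transformers package and the "all-MiniLM-L6-v2" checkpoint are available (both are assumptions, not something the notes above prescribe): encode two paraphrases and compare them with cosine similarity.

```python
# Minimal sketch: SBERT-style embeddings compared with cosine similarity,
# assuming the sentence-transformers package is installed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["The cat is on the mat.", "A feline is sitting on the carpet."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity of the two sentence embeddings: high despite little word overlap.
score = util.cos_sim(embeddings[0], embeddings[1])
print(float(score))
```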
Why You Need a Language Model for SSA:
Language Models (LMs) are essential for SSA because they provide contextualized, deep representations of words and sentences, which are crucial for capturing the underlying meaning of text. Simple techniques like bag-of-words fail in SSA tasks because they don’t capture context or word order. LMs, especially those based on transformers (like BERT and GPT), excel at capturing complex linguistic relationships, making them highly effective for SSA.
Conclusion:
In summary, SSA in NLP task evaluation typically requires:
- Language Models (LMs) like BERT, GPT, or ELMo to produce rich, contextualized embeddings of text.
- Similarity metrics like cosine similarity to compare the embeddings of two pieces of text.
- Fine-tuning or task-specific training for tasks like paraphrase detection or question answering.
These components together ensure that SSA can measure how semantically similar two texts are, making LMs a fundamental part of the process.