Deep Learning Applications Exam Notes

Deep Learning Applications Exam Notes - 1 CTC Loss, LoRA

이게될까 2025. 12. 7. 21:22

Finals are coming up, so....

https://docs.pytorch.org/docs/stable/generated/torch.nn.CTCLoss.html

CTCLoss — PyTorch 2.9 documentation

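Let's start with CTC Loss.

CTC Loss = Connectionist Temporal Classification. It is a loss for tasks such as ASR and OCR, where a time-series signal is converted into text, the input and output lengths differ, and no alignment between them is given.

For example, there may be 100 input frames but only 10 output characters, and we do not know which frames belong to which character. CTC solves the problem caused by this missing alignment.

Many architectures, such as Conformer, Wav2Vec2, and HuBERT, are trained with CTC.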

2025.12.03 - [AI/Paper Review or In Progress] - Prompting Large Language Models with Speech Recognition Abilities - Code Implementation

https://github.com/MyoungJinKim/AAA737_TermProject

It is also implemented there, so...

import torch
import torch.nn as nn

# -------------------------------
# [1] Target is padded
# -------------------------------
T = 50   # input sequence length (time steps)
C = 20   # number of classes (including blank)
N = 16   # batch size
S = 30   # length of the longest target in the batch (padded length)
S_min = 10  # minimum target length (for this example)

# Random input log-probabilities (shape: [T, N, C])
input = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()

# Random target sequences (0 = blank, 1..C-1 = real classes)
target = torch.randint(low=1, high=C, size=(N, S), dtype=torch.long)

# Input length of each sample (all equal to T)
input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)

# Actual target length of each sample (excluding padding)
target_lengths = torch.randint(
    low=S_min,
    high=S,
    size=(N,),
    dtype=torch.long,
)

# Compute the CTC loss
ctc_loss = nn.CTCLoss()
loss = ctc_loss(input, target, input_lengths, target_lengths)
loss.backward()


# --------------------------------------
# [2] Target is not padded
# --------------------------------------
T = 50   # input sequence length
C = 20   # number of classes (including blank)
N = 16   # batch size

# Random input log-probabilities (shape: [T, N, C])
input = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()

# Input length of each sample (all equal to T)
input_lengths = torch.full(size=(N,), fill_value=T, dtype=torch.long)

# Target length of each sample (between 1 and T-1, since high is exclusive)
target_lengths = torch.randint(low=1, high=T, size=(N,), dtype=torch.long)

# Concatenate all targets into one 1-D tensor, without padding
target = torch.randint(
    low=1,
    high=C,
    size=(sum(target_lengths),),
    dtype=torch.long,
)

# Compute the CTC loss
ctc_loss = nn.CTCLoss()
loss = ctc_loss(input, target, input_lengths, target_lengths)
loss.backward()


# ---------------------------------------------------
# [3] Target is not padded and there is no batch dimension (N = 1)
# ---------------------------------------------------
T = 50   # input sequence length
C = 20   # number of classes (including blank)

# Random input log-probabilities (shape: [T, C])
input = torch.randn(T, C).log_softmax(1).detach().requires_grad_()

# Input sequence length (scalar)
input_lengths = torch.tensor(T, dtype=torch.long)

# Target sequence length (between 1 and T-1, since high is exclusive)
target_lengths = torch.randint(low=1, high=T, size=(), dtype=torch.long)

# Target for the single sample
target = torch.randint(
    low=1,
    high=C,
    size=(target_lengths,),
    dtype=torch.long,
)

# Compute the CTC loss
ctc_loss = nn.CTCLoss()
loss = ctc_loss(input, target, input_lengths, target_lengths)
loss.backward()

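Whether CTC should be called supervised learning is honestly ambiguous.

There is a gold label, but there is no frame-level ground truth, so we do not know at which frame each label is emitted. It is probably best viewed as Weakly Supervised Learning.

CTC Loss is used when the input frame sequence is long and the output string is short.
Using the blank token and the rule that collapses repeated symbols, it considers every alignment that can produce the target string.
For each alignment it multiplies the per-frame probabilities, then sums these products over all alignments to get the total probability of the target sequence.
The loss is the Negative Log Likelihood of that total probability, computed as a single value.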
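To make that summary concrete, here is a minimal brute-force sketch in plain Python. It enumerates every frame-level path for a tiny toy example (the `probs` values are made up for illustration), keeps the paths that collapse to the target, and takes the negative log of their summed probability. Real implementations such as `nn.CTCLoss` use dynamic programming instead of enumeration, so this only illustrates the definition.

```python
import itertools
import math

BLANK = 0  # index of the blank token, same convention as nn.CTCLoss(blank=0)

def collapse(path):
    """Apply the CTC rule: merge repeated symbols, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return tuple(s for s in merged if s != BLANK)

def ctc_nll_bruteforce(probs, target):
    """probs[t][c] = probability of class c at frame t (each row sums to 1)."""
    T, C = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(C), repeat=T):
        if collapse(path) == tuple(target):
            # probability of one alignment = product of per-frame probabilities
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return -math.log(total)

# Toy example (made-up numbers): T=3 frames, C=3 classes, target = [1, 2]
probs = [[0.2, 0.7, 0.1],
         [0.3, 0.3, 0.4],
         [0.1, 0.2, 0.7]]

print(collapse((1, 1, 0, 2, 2)))   # (1, 2)
print(collapse((1, 0, 1)))         # (1, 1): the blank keeps the repeat apart
print(ctc_nll_bruteforce(probs, [1, 2]))
```

Five paths collapse to `[1, 2]` here, and their probabilities sum to 0.56. Note that `nn.CTCLoss` with the default `reduction='mean'` additionally divides each loss by the target length, so the raw number above corresponds to `reduction='sum'` on a single sample.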

https://arxiv.org/abs/2106.09685

LoRA: Low-Rank Adaptation of Large Language Models

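Full fine-tuning updates 100% of the parameters for every task, so it is expensive. Adapters add inference latency. Prefix and Prompt Tuning reduce the sequence length left for the actual input, and their training is unstable.

To address this, LoRA freezes the pretrained weights and trains only a very small low-rank update.

At initialization B is 0, so the update has no effect. As B grows during training, it starts to influence the output.

In attention, the update is added to the projection matrices that produce Q, K, and V (the paper mostly adapts W_q and W_v).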
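A minimal sketch of the idea, in plain Python so it runs without any framework. It computes h = W0 x + (alpha / r) * B A x with W0 frozen, A randomly initialized, and B initialized to zero, as in the paper. The sizes `d`, `k`, `r`, `alpha` and the helper names are illustrative assumptions, not values from the paper.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of X (n x m) and Y (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(x, W0, A, B, alpha, r):
    """h = W0 x + (alpha / r) * B (A x); W0 is frozen, only A and B are trained."""
    base = matmul(W0, x)
    delta = matmul(B, matmul(A, x))
    scale = alpha / r
    return [[b[0] + scale * d[0]] for b, d in zip(base, delta)]

random.seed(0)
d, k, r, alpha = 8, 8, 2, 4   # toy sizes chosen for illustration
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]   # random init
B = [[0.0] * r for _ in range(d)]                                    # zero init
x = [[random.gauss(0, 1)] for _ in range(k)]                         # column vector

h0 = lora_forward(x, W0, A, B, alpha, r)
print(h0 == matmul(W0, x))    # B = 0, so the adapter changes nothing yet

B[0][0] = 0.5                 # pretend a training step moved B away from zero
h1 = lora_forward(x, W0, A, B, alpha, r)
print(h1 != h0)               # now the low-rank update affects the output

print("full params:", d * k, "LoRA params:", r * (d + k))
```

With realistic sizes (d and k in the thousands, r around 4 to 16) the trainable parameter count r * (d + k) becomes a tiny fraction of d * k. After training, B A can be merged into W0, which is why there is no extra inference latency.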

https://arxiv.org/abs/2202.12837

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

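This paper asks whether in-context learning performance really comes from the correctly labeled few-shot demonstrations.

So they replaced the gold labels with random labels, and ICL performance barely dropped.

They then separated the factors and tested each one: the input-label mapping, the input distribution, the label space, and the input-label format.

The input-label mapping did not matter much.

What hurt performance a lot was changing the input distribution or the label space.

And giving only labels, or only inputs, also hurt performance!

So in the end, isn't the model mostly just producing an answer that follows the format?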
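Here is a small sketch of how the gold-label and random-label conditions could be set up. The `Review:` / `Sentiment:` template, the example sentences, and the function name `build_prompt` are my own illustrative choices, not the paper's exact setup. The point is that inputs, label space, and format stay the same and only the input-label mapping is broken.

```python
import random

def build_prompt(demos, query, labels=None, rng=None):
    """Format k demonstrations plus a query.

    If `labels` is given, every demonstration label is replaced by a label
    drawn uniformly at random from that set (the random-label condition).
    """
    lines = []
    for text, gold in demos:
        label = rng.choice(labels) if labels is not None else gold
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

# Made-up demonstrations for illustration
demos = [("Great movie, loved it.", "positive"),
         ("Total waste of time.", "negative"),
         ("A touching, well-acted story.", "positive")]
label_space = ["positive", "negative"]

gold_prompt = build_prompt(demos, "The plot made no sense.")
random_prompt = build_prompt(demos, "The plot made no sense.",
                             labels=label_space, rng=random.Random(0))
print(gold_prompt)
print("-----")
print(random_prompt)
```

Both prompts would then be sent to the same model and scored on the same test set. The other ablations in the paper (out-of-distribution inputs, a different label space, inputs only, labels only) can be built by changing what goes into `demos` and the template.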

https://ieeexplore.ieee.org/document/9414467

Neural Utterance Confidence Measure for RNN-Transducers and Two Pass Models

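This paper builds a neural model that predicts confidence accurately.

That is, it separately trains MCM, a binary classifier that gathers various features and predicts whether the whole utterance was recognized correctly or not.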

Feature	Meaning
Beam Scores (Scores)	log-prob of each beam
RNN-T Transcription Output (Trans)	acoustic summary
RNN-T Prediction Net Output (Pred)	language information
RNN-T Joint Net Logits (Joint)	final token distribution
2-Pass Encoder Output (Enc)	more refined acoustic representation
2-Pass Decoder Logits (Dec)	attention-based token distribution

If the hypothesis is correct, label = 1; otherwise label = 0.

=> It is trained with supervised learning.

The 2-Pass Decoder features are the key to confidence prediction.
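A rough sketch of the training setup, assuming each feature has already been summarized into a small fixed-size vector per utterance. The paper's model is a neural network; the logistic model below is a simplified stand-in that I chose so the example stays short, and all feature values and function names are made up for illustration.

```python
import math

def make_example(features, hypothesis, reference):
    """Concatenate per-utterance feature vectors and attach the 0/1 label."""
    x = [v for name in sorted(features) for v in features[name]]
    y = 1 if hypothesis.split() == reference.split() else 0  # whole utterance correct?
    return x, y

def confidence(x, w, b):
    """Logistic model: probability that the whole utterance is correct."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(x, y, w, b, lr=0.1):
    """One gradient step on binary cross-entropy."""
    g = confidence(x, w, b) - y
    return [wi - lr * g * xi for wi, xi in zip(w, x)], b - lr * g

# Hypothetical data: (features, hypothesis, reference)
data = [
    ({"scores": [-0.2], "dec": [0.9]}, "turn on the light", "turn on the light"),
    ({"scores": [-2.5], "dec": [0.3]}, "turn on the right", "turn on the light"),
]
examples = [make_example(*row) for row in data]

w, b = [0.0, 0.0], 0.0
for _ in range(200):
    for x, y in examples:
        w, b = sgd_step(x, y, w, b)

for x, y in examples:
    print(y, round(confidence(x, w, b), 3))
```

After training, the correctly recognized utterance gets a higher confidence than the misrecognized one. In the paper's setting the input vector would be built from the features in the table above rather than two toy numbers.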

https://arxiv.org/abs/2302.07521

Confidence Score Based Speaker Adaptation of Conformer Speech Recognition Systems

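Speaker data is too scarce! = In a real service there is very little audio per speaker, so full fine-tuning overfits.

Unsupervised adaptation trains on wrong labels! = If wrong transcripts go in as supervision, performance collapses.

Only utterances with high confidence are selected for speaker adaptation (wrong pseudo labels are removed),
and the uncertainty that comes from having so little data is handled with Bayesian learning.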
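The selection step can be sketched in a few lines. The threshold of 0.9, the dictionary layout, and the example utterances are assumptions for illustration only. The Bayesian part is not shown here.

```python
def select_for_adaptation(utterances, threshold=0.9):
    """Keep only pseudo-labelled utterances whose confidence is high enough."""
    return [u for u in utterances if u["confidence"] >= threshold]

# Hypothetical first-pass ASR outputs for one speaker
pseudo_labelled = [
    {"id": "utt1", "hyp": "play some jazz", "confidence": 0.97},
    {"id": "utt2", "hyp": "call my bother", "confidence": 0.41},
    {"id": "utt3", "hyp": "set a timer for ten minutes", "confidence": 0.93},
]

selected = select_for_adaptation(pseudo_labelled)
print([u["id"] for u in selected])   # ['utt1', 'utt3']
```

Only the selected utterances and their hypotheses would be used as (audio, pseudo label) pairs when adapting the speaker-dependent parameters.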

https://ieeexplore.ieee.org/document/9688210

Improving ASR Error Correction Using N-Best Hypotheses

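Conventional ASR error correction fixes errors using only the single 1-best result.

But the ASR process actually produces multiple candidates. Using only the 1-best throws away important alternative information, so error detection fails and wrong corrections get made. => Feed the N-best information directly into the GEC model to improve correction performance.

So basically, at each token position you embed the token from each hypothesis, concat them, run them through a linear layer so the result has the size of a single token embedding, and feed that into the decoder?

And then a sentence with fewer errors gets generated?

=> Nice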
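My understanding of the concat + linear step, as a plain-Python sketch. The vocabulary, the sizes `N` and `D`, the random weights, and the name `fuse_nbest` are all illustrative assumptions. Shorter hypotheses are simply right-padded here.

```python
import random

random.seed(0)
VOCAB = {"<pad>": 0, "i": 1, "scream": 2, "ice": 3, "cream": 4, "like": 5}
N, D = 3, 4   # N hypotheses, embedding size D (toy values)

# One D-dim embedding per vocabulary entry (random, untrained)
embedding = [[random.gauss(0, 1) for _ in range(D)] for _ in VOCAB]
# Linear layer mapping the concatenated N*D vector back to D dimensions
W = [[random.gauss(0, 0.1) for _ in range(N * D)] for _ in range(D)]
bias = [0.0] * D

def fuse_nbest(hyps):
    """Return one D-dim vector per position from N hypotheses."""
    length = max(len(h) for h in hyps)
    padded = [h + ["<pad>"] * (length - len(h)) for h in hyps]
    fused = []
    for pos in range(length):
        # concatenate the embeddings of the N tokens at this position (size N*D)
        concat = [v for h in padded for v in embedding[VOCAB[h[pos]]]]
        # project back to the size of a single token embedding (size D)
        fused.append([sum(w * c for w, c in zip(row, concat)) + b
                      for row, b in zip(W, bias)])
    return fused

nbest = [["i", "like", "ice", "cream"],
         ["i", "like", "i", "scream"],
         ["i", "like", "ice"]]
fused = fuse_nbest(nbest)
print(len(fused), len(fused[0]))   # 4 positions, each of size D
```

The fused sequence has the same shape as an ordinary embedded sentence, so the rest of the correction model does not need to change.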

https://aclanthology.org/2021.findings-emnlp.367/

FastCorrect 2: Fast Error Correction on Multiple Candidates for Automatic Speech Recognition

Yichong Leng, Xu Tan, Rui Wang, Linchen Zhu, Jin Xu, Wenjie Liu, Linquan Liu, Xiang-Yang Li, Tao Qin, Edward Lin, Tie-Yan Liu. Findings of the Association for Computational Linguistics: EMNLP 2021. 2021.

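This paper also points to using only the 1-best in ASR error correction as the problem.

Here too, multiple candidates are processed in parallel for decoding.

But where other papers just pad the candidates to the same length, this one aligns them token by token using pronunciation similarity and the edit path.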
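To see why edit-path alignment differs from simple padding, here is a simplified sketch that aligns two candidates along one minimum-edit-distance path and inserts `<pad>` where a token has no counterpart. This is only the edit-distance part; the pronunciation-similarity scoring that the paper uses to choose among paths is left out, and the function name and example sentences are my own.

```python
def align(anchor, other, pad="<pad>"):
    """Align two token lists along one minimum-edit-distance path."""
    n, m = len(anchor), len(other)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (anchor[i - 1] != other[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrack from the end, emitting (anchor_token, other_token) pairs
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + (anchor[i - 1] != other[j - 1])):
            pairs.append((anchor[i - 1], other[j - 1]))   # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((anchor[i - 1], pad))            # token missing in `other`
            i -= 1
        else:
            pairs.append((pad, other[j - 1]))             # extra token in `other`
            j -= 1
    return pairs[::-1]

a = "i like ice cream today".split()
b = "i like cream today".split()
for pair in align(a, b):
    print(pair)
```

With plain right-padding, `ice` would be paired with `cream` and `cream` with `today`. With the edit path, `ice` is paired with `<pad>` and the remaining tokens line up with their true counterparts.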

https://arxiv.org/abs/2307.09744

Enhancing conversational quality in language learning chatbots: An evaluation of GPT4 for ASR error correction

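For ASR error correction, isn't how natural the conversation becomes more important than how exactly each word was fixed?!

=> If GPT-4 is used as the ASR error corrector, WER can get worse (go up), but conversation quality improves and the dialogue becomes more natural.

If the corrector is also told to fix grammar and word order, WER goes up as well.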
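A quick way to see how a fluent rewrite can raise WER: WER is measured against what the speaker actually said, so fixing the speaker's own grammar counts as errors. The sentences below are made up for illustration, and `wer` is a minimal word-level edit-distance implementation, not a library function.

```python
def wer(reference, hypothesis):
    """Word error rate = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j - 1] + (r != h),   # substitution or match
                           prev[j] + 1,              # deletion
                           cur[j - 1] + 1))          # insertion
        prev = cur
    return prev[-1] / len(ref)

reference = "me and him goes to school yesterday"   # what the learner actually said
asr_out   = "me and him goes to cool yesterday"     # raw ASR output, one error
corrected = "he and i went to school yesterday"     # fluent rewrite by a corrector

print(round(wer(reference, asr_out), 3))    # 0.143 (1 error / 7 words)
print(round(wer(reference, corrected), 3))  # 0.429 (3 errors / 7 words)
```

The rewritten sentence fixes the ASR error and reads more naturally, yet its WER is three times higher, which matches the trade-off described above.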
