Studying Deep Learning 3


이게될까 2025. 7. 3. 17:28

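Lecture 10 - GCN

📌 Graph Convolutional Network (GCN) Lecture Summary

1. Definition of a Graph

A graph is a set of objects in which some pairs of objects are related to each other.
Components:
- Node: an object.
- Edge: a relation between two nodes.
  - Directed graph vs. undirected graph.

2. Image vs. Graph Structure

Aspect	Image (CNN input)	Graph (GCN input)
Unit	Pixel	Node
Connectivity	Fixed 2D grid	Flexible connections (edges)

A CNN processes local information by sliding a filter over a fixed grid,
whereas a GCN aggregates information along the connections between nodes.

3. CNN vs. GCN: Filter Differences

Aspect	CNN	GCN
Input structure	2D image (matrix)	Adjacency matrix between nodes + node features
Filtering	Convolution with a kernel	Message aggregation from neighboring nodes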

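4. Core Formula of GCN

Inputs:
- X: node feature matrix
- A: adjacency matrix
- D: degree matrix (diagonal; each entry is the number of connections of a node)
- W: learnable weight matrix
Aggregating node information:
- The basic GCN operation, in its standard form, is H = σ( D̃^(-1/2) Ã D̃^(-1/2) X W ).
- Here Ã = A + I is the adjacency matrix with self-loops added, and D̃ is the degree matrix of Ã.
The formula says that each node is updated with information from itself and its neighbors.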
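
As a rough illustration of the normalization step, the sketch below builds Ã = A + I and scales it by the node degrees, assuming the standard symmetric form D̃^(-1/2) Ã D̃^(-1/2). The function name `normalize_adjacency` and the 3-node path graph are hypothetical and only serve the example.

```python
import math

def normalize_adjacency(A):
    """Return D~^(-1/2) (A + I) D~^(-1/2) for a square 0/1 matrix A without self-loops."""
    n = len(A)
    # Add self-loops: A~ = A + I.
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # Degree of each node in A~ (row sums).
    deg = [sum(row) for row in A_tilde]
    # Scale entry (i, j) by 1 / sqrt(d_i * d_j).
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical 3-node path graph 0 - 1 - 2 (not from the lecture).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
A_hat = normalize_adjacency(A)
```

Each entry is divided by the square root of both endpoint degrees, so an edge into a highly connected node contributes less than an edge into a sparsely connected one.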

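5. Example

Given the node feature matrix, adjacency matrix, and degree matrix of a particular graph:
- GCN aggregates the features of each node's neighbors by averaging or by degree-based normalization.
- The result is used as the new representation (embedding) of each node.

6. Reference Paper for GCN

Reference paper: Rex Ying et al., NeurIPS 2018
- Cited in the lecture as a basis for practical and efficient GCN variants, including sampling-based methods.

✅ Key Summary

GCN is a neural network for graph-structured data; unlike CNN, it exploits the connectivity between nodes.
The central idea is "aggregating information from neighboring nodes", which is formalized mathematically with the adjacency matrix and the degree matrix.
Just as CNN captures spatial relations within an image, GCN captures structural relations within a graph and is used for node-level and graph-level tasks (classification, prediction, and so on).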

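📌 GCN Summary Based on the Lecture Script

1. The Essential Difference Between Graphs and Images

Image: pixels have regular grid relations → fixed neighbors.
Graph: flexible connectivity. Relations between people, for example, need a notion of connection that is independent of physical distance.
- Example: in a social network, whether two people are connected is determined by their relationship rather than their physical location.

2. Convolution in CNN vs. GCN

Item	CNN	GCN
Unit	Pixel (regular array)	Node (irregular connections)
How the filter works	Scans neighboring pixel values with a kernel to compute the output	Aggregates the features of neighboring nodes to compute the output
Filter sharing	The same kernel is shared across the whole image	The same weight matrix is shared across all nodes of the graph

In common: local information is aggregated and then passed through an activation function.
Difference: CNN uses fixed neighbors on a 2D grid, while GCN uses neighbors defined by a variable connection structure.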

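3. Intuition for the GCN Computation

Inputs:
- X: node feature matrix (initial information of each node)
- A: adjacency matrix (connections between nodes)
- W: filter weight matrix (learned parameters)
Order of operations (forward propagation):
1. X × W: embed the node features.
2. A × (XW): aggregate information from neighboring nodes.
3. σ(⋅): apply the activation function.
Normalization:
- The adjacency matrix includes self-loops: Ã = A + I.
- Normalization that reflects each node's number of connections: Â = D̃^(-1/2) Ã D̃^(-1/2), where D̃ is the degree matrix of Ã (the standard symmetric form).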

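4. How the Filter Works

Each node produces a new feature value as a weighted average of the features of its neighbors.
One filter produces one output feature; using several filters in parallel yields several output features.
As in CNN, the filters are shared by all nodes, which reduces the number of parameters and improves generalization.

5. Structural Extensions

GCN layer stack:
- Stacking several GCN layers lets a node see multi-hop neighbors.
Skip connection:
- Can be used in the same way as residual connections in CNNs.
Readout layer:
- Merges the embeddings of all nodes to produce a graph-level output (for example, by averaging or concatenation).
Pooling:
- Graph pooling summarizes the graph structure (for example, clustering-based coarsening).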
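
The readout step can be sketched in a few lines. This is a minimal illustration, assuming node embeddings stored as a nested list; the function names and the embedding values are hypothetical.

```python
def mean_readout(H):
    """Average the node embeddings column by column into one graph vector."""
    n = len(H)
    return [sum(row[k] for row in H) / n for k in range(len(H[0]))]

def concat_readout(H):
    """Concatenate all node embeddings into one long vector."""
    return [v for row in H for v in row]

# Hypothetical embeddings for a 3-node graph, 2 features per node.
H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
graph_vec_mean = mean_readout(H)       # length 2, independent of node count
graph_vec_concat = concat_readout(H)   # length 6, depends on node count and order
```

Averaging gives a vector whose size and value do not depend on how the nodes are numbered, while concatenation is simpler but ties the output size to the number of nodes and to their ordering.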

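✅ Key Points

GCN aggregates the information of neighboring nodes according to the connections between nodes → updates the representation of each node.
It extends the idea of the CNN convolution filter to aggregation over a node's graph neighbors.
The forward pass proceeds as input → multiplication by weights → normalized aggregation of neighbor information → activation function.
Like CNN, GCN uses layer stacking, skip connections, pooling, and readout.
The filter weights are shared by all nodes and are optimized for the specific task through training.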

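📌 Supplementary GCN Summary (Additional Perspectives and Details)

1. Graph Data and the Starting Point of GCN

A standard CNN is limited to Euclidean domains (for example, 2D images) and assumes a fixed neighborhood structure.
GCN applies to graphs, which are non-Euclidean → it can represent complex relational data such as images, social networks, and chemical compound structures.

2. Core Components of GCN

Component	Description
Node (feature vector)	A vector holding the attributes of each node
Adjacency matrix (A)	Encodes the connections between nodes; the diagonal entries hold the self-loops
Normalized adjacency matrix	Normalized by each node's number of connections (degree) to stabilize training

Why self-loops are included: so that a node's own information is part of its update.
Effect of normalization: prevents highly connected nodes from having an excessive influence on other nodes.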

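3. The GCN Computation in Formula Form

One GCN layer is expressed as:

H^(l+1) = σ( Â H^(l) W^(l) )

H^(l): node representations at the current layer (initially the input X)
W^(l): learnable weight matrix
Â: normalized adjacency matrix
σ: activation function (for example, ReLU)

🧠 The formula means the following:

(1) Multiplying by W: transforms the current node features (the learned part)
(2) Multiplying by Â: aggregates the information of neighboring nodes
(3) σ: introduces nonlinearity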

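4. Structural Properties of GCN

Layer stacking: covers multi-hop neighbors → captures a wider context
Skip connections: reduce information loss as the network gets deeper
Graph pooling: a downsampling technique for summarizing large graphs
Readout: combines node-level embeddings into a single graph embedding (for example, for classification or regression)

5. Correspondence with CNN

GCN generalizes the computational idea of CNN to graphs:

CNN concept	GCN counterpart
Image pixel	Graph node
Fixed kernel	Shared weight matrix W
Local receptive field	Set of neighboring nodes
Applying a filter	Aggregating neighbor information
Pooling	Graph pooling
FC layer + softmax	Readout + classifier layer

6. Practical Observations

The adjacency matrix and node features are fixed before training and cover every node, labeled or not: this is why GCN works particularly well for semi-supervised learning (for example, node classification where only some nodes have labels).
Each GCN layer aggregates only from direct neighbors, so reaching distant nodes requires stacking layers → but stacking too many layers causes oversmoothing, where the node representations become nearly indistinguishable.
For graph-level tasks (for example, molecular property prediction), the design of the readout strongly affects performance.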
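
Oversmoothing can be observed on a toy graph by applying the aggregation step many times without any weights or activation. The sketch below is only an illustration under simplifying assumptions: it uses row normalization D̃^(-1) Ã (a plain neighbor average, swapped in here for the symmetric form because it makes the effect easiest to see), a hypothetical 5-node graph with self-loops, and one scalar feature per node.

```python
def row_normalize(A_tilde):
    """Divide each row by its sum, so aggregation becomes a neighbor average."""
    return [[v / sum(row) for v in row] for row in A_tilde]

def matmul(P, Q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def spread(H):
    """Gap between the largest and smallest node feature."""
    col = [row[0] for row in H]
    return max(col) - min(col)

# Hypothetical connected 5-node graph, self-loops on the diagonal.
A_tilde = [
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
]
P = row_normalize(A_tilde)
H = [[1.0], [0.0], [0.0], [0.0], [0.0]]   # only node 0 carries a signal

spread_before = spread(H)
for _ in range(100):          # 100 rounds of pure neighbor averaging
    H = matmul(P, H)
spread_after = spread(H)
```

After many rounds every node ends up with almost the same value, so nothing is left to tell the nodes apart. Real GCN layers also apply weights and nonlinearities, but the same averaging tendency is what the oversmoothing remark refers to.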

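✅ Summary: Key Points for Understanding GCN

📐 Graph structure	Non-Euclidean; models flexible relations
🧮 Core operation	Linear transformation + neighbor aggregation with the normalized adjacency matrix
🔄 Layer stacking	Several GCN layers learn information from higher-order neighbors
⚖️ Need for normalization	Prevents bias caused by differences in the number of connections
🔁 Weight sharing	Shared weights reduce the number of parameters and encourage generalization
📊 Applications	Node classification, link prediction, graph classification, and more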

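🧩 GCN Structure Through an Example

🎯 Goal

Node classification (for example, predicting which class each node belongs to)

1️⃣ Input

Example graph

Consider a small social network with 5 nodes.

Node index	Meaning	Feature vector (dimension: 3)
0	Student	[1, 0, 0]
1	Professor	[0, 1, 0]
2	Researcher	[0, 0, 1]
3	Student	[1, 0, 0]
4	Researcher	[0, 0, 1]

Number of nodes: N = 5
Feature dimension: F_in = 3
Input matrix X: (5 x 3)

X = [
  [1, 0, 0],  ← student
  [0, 1, 0],  ← professor
  [0, 0, 1],  ← researcher
  [1, 0, 0],  ← student
  [0, 0, 1],  ← researcher
]

Adjacency matrix (A)

Connections between nodes:

0 is connected to 1 and 3
1 is connected to 0 and 2
2 is connected to 1 and 4
3 is connected to 0
4 is connected to 2

A = [
  [1, 1, 0, 1, 0],
  [1, 1, 1, 0, 0],
  [0, 1, 1, 0, 1],
  [1, 0, 0, 1, 0],
  [0, 0, 1, 0, 1]
]

⚠️ Self-loops are included (each node is connected to itself), so this matrix is already Ã = A + I.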

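2️⃣ GCN Computation (One Layer)

A GCN layer is expressed by the following formula:

H = σ ( Â X W )

X: input node feature matrix (N × F_in)
W: learnable weight matrix (F_in × F_out)
Â: normalized adjacency matrix (N × N)
H: output node embeddings (N × F_out)
σ: activation function (for example, ReLU)

⛓️ Step-by-Step Computation

(1) Build the normalized adjacency matrix Â

Â = D̃^(-1/2) Ã D̃^(-1/2) (the standard symmetric normalization)
D̃: diagonal matrix whose diagonal entries are the node degrees of Ã
This normalization prevents highly connected nodes from having an excessive influence

(2) Initialize the weight matrix W

For example, take W of shape (3 × 2).

That is, the 3-dimensional input is embedded into a 2-dimensional output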

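🔁 Order of Operations

Z = Â ⋅ X ⋅ W

For example, for node 0:

The features of node 0 itself and its neighbors 1 and 3 are aggregated by averaging or degree-based normalization
Multiplying the result by W produces a 2-dimensional embedding

As a result:

Input X: (5 x 3)
W: (3 x 2)
XW: (5 x 2)
Â · XW → H: (5 x 2)

⇒ Each node gets a 2-dimensional embedding vector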
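
The shape walk-through above can be checked with a small sketch of one GCN layer on the 5-node example graph. It assumes the symmetric normalization D̃^(-1/2) Ã D̃^(-1/2) and ReLU as σ; the weight values in `W` are hypothetical, since the notes do not give the example values.

```python
import math

# Example graph from the text: 5 nodes, 3 input features per node.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
# Adjacency matrix with self-loops already on the diagonal.
A_tilde = [
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
]
# Hypothetical weights (3 x 2), chosen only for illustration.
W = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]

def matmul(P, Q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def relu(M):
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in M]

deg = [sum(row) for row in A_tilde]          # node degrees including self-loops
A_hat = [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j])
          for j in range(5)] for i in range(5)]
XW = matmul(X, W)                # (5 x 2): per-node linear transform
H = relu(matmul(A_hat, XW))      # (5 x 2): neighbor aggregation + ReLU
```

Row 0 of `H` mixes the transformed features of nodes 0, 1, and 3 only, which matches the node-0 description in the text.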

3️⃣ Multi-Layer GCN (Two-Layer Example)

Input: X (5 x 3)
Layer 1: H1 = ReLU(Â X W1)  (output: 5 x 4)
Layer 2: H2 = softmax(Â H1 W2) (output: 5 x C) ← C is the number of classes

W_1 : (3 × 4), W_2 : (4 × C)
Result: each node gets a probability distribution over the C classes
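
A two-layer forward pass with these shapes might look like the sketch below. It is an untrained illustration: the weights are random placeholders drawn with a fixed seed, C = 3 is assumed, and the softmax is applied row by row so that each node gets its own class distribution.

```python
import math
import random

X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
A_tilde = [
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
]

def matmul(P, Q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def softmax_rows(M):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in M:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

deg = [sum(row) for row in A_tilde]
A_hat = [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j])
          for j in range(5)] for i in range(5)]

rng = random.Random(0)  # fixed seed; the weights are untrained placeholders
W1 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]   # (3 x 4)
W2 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]   # (4 x C), C = 3

H1 = relu(matmul(A_hat, matmul(X, W1)))            # (5 x 4)
H2 = softmax_rows(matmul(A_hat, matmul(H1, W2)))   # (5 x 3)
pred = [row.index(max(row)) for row in H2]         # predicted class per node
```

With trained weights, `pred` would be the node labels; here it only demonstrates that the shapes line up and that every row of `H2` is a valid probability distribution.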

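📦 Overall Structure

Step	Operation	Shape
Input	X	(N × F_in)
Normalized adjacency matrix	Â	(N × N)
GCN Layer 1	H1 = σ(Â X W1)	(N × F_hidden)
GCN Layer 2	H2 = softmax(Â H1 W2)	(N × C)

🔚 Output

Each node is assigned one of the C classes
Example: classifying student / professor / researcher → C = 3, using softmax

✅ Summary: Keys to Understanding GCN

Point	Description
Input	Node feature matrix (X), adjacency matrix (A)
Layers	Several layers can be stacked as in CNN (wider neighborhood)
Operation	Â X W: neighbor aggregation + linear transformation
Shape flow	Input → hidden → output: F_in → F_hidden → C (3 → 4 → C in the example)
Output	Class prediction or embedding for each node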

00:00:01 In this lecture we will learn about graph convolutional neural networks. So what is a graph? A graph is a set of objects that have relations with each other. In a graph we call the objects vertices or nodes, and we represent the relation between a pair of nodes as an edge. There are two types of graphs: directed graphs and undirected graphs. In a directed graph there are causal relations between the nodes, while an undirected graph just represents relations between the nodes. Then let's think about the

00:00:53 difference between an image and a graph. Actually, an image is a well-arranged graph. In an image you can see that each pixel has relations with its neighboring pixels, like this. The image has a well-oriented structure, since we know that the close pixels are the neighbors of a pixel. In contrast to the image, a general graph has a more flexible structure. You can simply imagine a social network: in a social network, the physical distance between people cannot represent the relations

00:01:51 between the nodes. Instead, the social distance will be the edge of the graph. To represent an image we only need the pixel information. So there are two pieces of location information, the row and column, and we also have the channel information in the case of an RGB color image. In the graph we have the node information, with the number of features and the node index. Moreover, in the graph we also have the adjacency matrix to represent the relations between the nodes. Each pair of nodes has a relation, so we can represent the edges with the

00:02:52 adjacency matrix, whose size is the number of nodes. So now let's compare the CNN and the GCN. You may remember that in the convolutional layer we construct filters like this. With the filter, we can determine the next-layer pixel features according to their neighboring pixel information. Actually, the GCN filter works the same as the CNN filter. In GCN, same as in CNN, the next-level node feature is determined according to its neighboring node features. Now let's see more details of GCN. So here there is a graph with five

00:04:03 nodes, and the relations of each pair of nodes are represented by edges. As I mentioned before, each node has a feature vector like this, and the edges are represented by the matrix. In the matrix each row and column represents the index of a node, like this, and the value of each element represents the connection or disconnection between the nodes. For example, here we can see that there is a value of one, which means the first node and the third node are connected like this, and the diagonal elements represent

00:05:10 the self-connections. So each node has a self-connection like this, and because this graph is an undirected graph, the adjacency matrix has a symmetric structure, so these values are the same across the diagonal. Now we have every element of the graph, so we can represent this graph with the adjacency matrix and the node feature matrix like this. Now let's see the filters of GCN. The filter performs the embedding of the node features. With the element-wise multiplication we can get an embedded value of the node

00:06:14 feature. You may know that we can construct multiple filters which have different values. In that case we can get four different embedded values of the node features. In the CNN model we learned that the filter is shared across the image, right? Same as in CNN, in GCN you share these filter weights for every node. By concatenating the filters, we can generate the weight matrix. So the goal of GCN is to learn the weights according to the task. In the forward propagation of GCN we just multiply these three matrices and

00:07:18 apply the activation function for the hidden layer. With the multiplication of X and W we can get the embedded node features like this. You may know that we can perform the matrix multiplication like this. So now each node feature is updated to the weighted node feature. And with the multiplication of the adjacency matrix and the weighted node features, we can update the node features based on the neighboring node features. For example, the first node feature will be updated like this.

00:08:25 It will be updated based on all the node features except the fifth, because this element is zero. So except for this one, the first node feature will be updated based on the rest of the node features. This is the forward propagation process in a GCN layer, and actually in GCN we use the normalized adjacency matrix like this. D represents the degree matrix, which holds the number of connections of every node. So based on the degree matrix we use the normalized matrix like this. So this is an example of the GCN

00:09:24 structure. In the structure we can see that we can stack multiple GCN layers and we can also use skip connections like in convolutional neural networks. You may remember that we stack fully connected layers at the end of the CNN, right? Same as the fully connected layer, in GCN we perform the readout process. There are a lot of methods for the readout, but just think that we can concatenate all the embedded node features together to generate a single vector. So this is one of the simplest

00:10:17 readout methods to generate a single vector, and the single vector is connected to the output layer. There is also the concept of the pooling layer in GCN; we call it graph pooling. There are also several methods for graph pooling, but just think that graph pooling summarizes a complex graph structure into a simple one. So these are the components of the graph CNN. Just simply think that the CNN is a well-organized GCN, so every element of the GCN works like those of the CNN. So please

00:11:20 let me know if you have any questions about GCN. Thank you.

Summary

This lecture introduces Graph Convolutional Neural Networks (GCNs), explaining the foundational concepts of graphs and their relation to neural network structures like CNNs. A graph consists of nodes (vertices) and edges that represent relationships between nodes. There are two main types of graphs: directed, where edges have a causal direction, and undirected, where edges indicate symmetric relationships. Unlike images, which can be viewed as well-organized grids of pixels with fixed local neighborhoods, graphs offer more flexible structures where nodes can relate in diverse ways, such as in social networks where connections represent social distance instead of physical distance.

In GCNs, nodes are represented by feature vectors, and their relationships by adjacency matrices, which describe the connectivity between nodes. The graph adjacency matrix is often symmetric in undirected graphs and includes self-connections along the diagonal. The GCN applies filters analogous to CNN filters: it updates each node’s feature by aggregating feature information from its neighboring nodes through matrix multiplications involving node features and the adjacency matrix. To account for differing node degrees, GCNs use a normalized adjacency matrix that incorporates a degree matrix, ensuring stable and balanced feature propagation.

GCNs can stack multiple convolution layers and employ skip connections similarly to CNN architectures. Just as CNNs have fully connected layers for decision-making, GCNs perform a “readout” step that aggregates node features into a single vector summarizing the graph. Separately, graph pooling techniques reduce a complex graph to a simpler one for efficient learning. Overall, GCNs extend convolutional operations beyond pixels to general graphs, preserving neighborhood feature relationships in irregularly structured data.

Highlights
📊 A graph comprises nodes (vertices) connected by edges representing relationships.
🔄 Directed graphs indicate causal relations; undirected graphs indicate symmetric relations.
🖼 Images can be viewed as well-organized graphs with fixed local neighborhoods; general graphs are more flexible.
🧮 GCNs use adjacency matrices to encode node connections and feature matrices to represent node attributes.
🔄 GCN filters update node features based on neighbors, analogous to CNN filters for pixels.
📐 Normalized adjacency matrices ensure balanced propagation accounting for node degrees.
🚀 Readout aggregates node features into a single vector for graph-level tasks, playing the role of the fully connected layers at the end of a CNN; graph pooling summarizes a complex graph into a simpler one.
Key Insights

📈 Graph Structure Flexibility Enables New Data Representations: Unlike the rigid grid structure of images, graphs can represent complex, non-Euclidean relational data such as social networks, molecules, and citation networks, enabling GCNs to work on diverse problems beyond traditional vision tasks. This flexibility provides a powerful framework for capturing intricate dependencies.

🔍 Adjacency Matrix as Core Representation: The adjacency matrix is fundamental in GCNs, encoding how nodes interconnect. Including self-loops (diagonal elements) ensures each node’s own features contribute during convolution, stabilizing learning. For undirected graphs, symmetric adjacency facilitates balanced neighborhood gathering.

🧮 Matrix Multiplication Propagation and Feature Aggregation: Node feature updates are performed by multiplying the feature matrix with the learnable weight matrix, then with the adjacency matrix. This operation aggregates neighboring features weighted by the shared convolution filter, analogous to sliding CNN filters on image pixels, allowing relational information to propagate through the graph.

⚖️ Normalized Adjacency Matrix Normalizes Feature Contributions: Incorporating the degree matrix to normalize adjacency ensures nodes with many neighbors don’t overwhelm those with fewer connections by scaling contributions, making the training more stable and preventing skewed feature updates.

🌉 Stacking Multiple GCN Layers Builds Hierarchical Representations: Similar to CNNs, stacking multiple graph convolution layers permits capturing higher-order neighborhood information, enabling the model to learn complex patterns across distant nodes. Skip connections help mitigate issues like vanishing gradients and preserve earlier layer information.

🔄 Pooling and Readout Summarize Graph Information: Graph pooling reduces a complex graph to a simpler one, while readout operations create global graph embeddings from node embeddings. These condensed representations are essential for graph-level prediction tasks, similar to fully connected layers at the end of CNNs.

🛠 GCNs Generalize CNN Concepts to Non-Euclidean Domains: The GCN can be understood as a natural extension of convolutional operations to graph-structured data, sharing concepts like weight sharing, local receptive fields, and feature aggregation but applied in flexible, irregular domains. This opens doors to advancing machine learning in networked and relational data fields.


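Lecture 11 - Generative Models 1

📚 Lecture Summary: Generative Model I

1. Review: Three Types of Machine Learning

Machine learning is broadly divided into three learning settings:

Type	Data	Goal	Examples
Supervised learning	(x, y)	Learn a mapping from input x to output y	Classification, regression, object detection, captioning
Unsupervised learning	x	Learn the structure or distribution of the data	Clustering, dimensionality reduction, density estimation
Reinforcement learning	States, actions, rewards	Learn from reward signals	Game agents, robot control

2. Generative Models in Unsupervised Learning

🔍 Purpose

Generate new samples based on the learned data distribution
Address density estimation, a core problem of unsupervised learning

📌 Definition

Learn a model distribution p_model(x) that is close to the true data distribution p_data(x)
Sample generation: x ∼ p_model(x)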

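3. PixelRNN & PixelCNN

🧠 Idea

Decompose the image into per-pixel conditional probabilities
Instead of modeling the whole image at once, model the probability of each pixel conditioned on the previous pixels

📐 Formula

p(x) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1})

Here x_i is the i-th pixel and n is the number of pixels
The likelihood of the whole image is decomposed with the chain rule

🎯 Characteristics of Each Method

Model	Characteristics	Network
PixelRNN	Models the dependence on previous pixels with an RNN (LSTM)	RNN
PixelCNN	Models the dependence on previous pixels with a CNN	CNN

4. Pros and Cons of PixelRNN / PixelCNN

✅ Pros

The likelihood p(x) can be computed explicitly → a good evaluation metric
Good sample quality

❌ Cons

Generation is sequential → very slow

✨ Reference

van den Oord et al., 2016

📌 Key Concepts

Item	Description
Density estimation	Estimating the probability distribution that the given data follows; new samples can then be drawn from it
Generative model	A probabilistic model that generates samples similar to the data (e.g., PixelCNN)
PixelRNN/CNN	Image generation models based on conditional probabilities between pixels
Explicit likelihood	A model for which p(x) can be computed explicitly from a formula
Sequential generation	Images are generated pixel by pixel in order, which makes generation slow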
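📌 Lecture Summary: PixelRNN & PixelCNN (Generative Model I)

1. 🧠 Supervised vs. Unsupervised Learning

Supervised learning: learns a mapping from input to output from input-output pairs (x, y)
- Examples: image classification, object detection, captioning
Unsupervised learning: understands the structure, distribution, and features of the data from the input x alone, without an output y
- Examples: clustering, dimensionality reduction, density estimation, generative models

2. 🎯 Purpose of a Generative Model

The goal is to generate new data similar to the training data
To do so, the model learns the distribution p(x) of the input data
Once the distribution is known, new data can be generated by sampling from it

3. 🔍 Density Estimation and Approaches to Generative Models

✅ Taxonomy of Approaches

Category	Description
Explicit density	Estimates the data distribution explicitly
└ Tractable	p(x) can be computed directly (e.g., PixelRNN/CNN)
└ Approximate	Estimates an approximate probability distribution (e.g., Variational Autoencoder)
Implicit density	Generates samples without explicitly modeling the distribution (e.g., GAN)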

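4. 🧩 PixelRNN & PixelCNN: Tractable Density Estimation

✔ Core Idea

Decompose the probability distribution of the whole image x into per-pixel conditional probabilities:

p(x) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1})

To do this, an order is assumed over the pixels, and the probability of the next pixel is predicted from the previous pixel values

📐 Example: Pixel Order

The order scans from the top-left to the bottom-right
Example: start at the top-left pixel → move along the row → move to the next row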
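
The chain-rule decomposition over a raster-scan order can be made concrete with a toy model. In the sketch below, `cond_prob_one` is a hypothetical stand-in for the neural network (a simple smoothed frequency of previous 1-pixels), and the images are tiny binary grids; only the structure of the computation reflects PixelRNN/PixelCNN.

```python
import math
from itertools import product

def raster_order(image):
    """Flatten a 2D image row by row (top-left to bottom-right)."""
    return [p for row in image for p in row]

def cond_prob_one(prev):
    """Toy stand-in for the network: P(x_i = 1 | previous pixels)."""
    return (1 + sum(prev)) / (2 + len(prev))

def log_likelihood(image):
    """log p(x) = sum over i of log p(x_i | x_1, ..., x_{i-1})."""
    pixels = raster_order(image)
    total = 0.0
    for i, x in enumerate(pixels):
        p1 = cond_prob_one(pixels[:i])
        total += math.log(p1 if x == 1 else 1.0 - p1)
    return total

ll = log_likelihood([[1, 0], [1, 1]])

# Summing p(x) over all 16 binary 2x2 images checks that the
# chain-rule product defines a properly normalized distribution.
total_prob = sum(
    math.exp(log_likelihood([[a, b], [c, d]]))
    for a, b, c, d in product([0, 1], repeat=4)
)
```

Because every factor is a valid conditional distribution, the product is an exact, normalized p(x), which is what "tractable density" means here. Training a real model would maximize this log-likelihood over the training images.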

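5. 🔁 Model Structure and Operation

Model	Description
PixelRNN	Models the pixel sequence with an RNN (LSTM)
PixelCNN	Uses a CNN with masked filters so that future pixels are not visible

Masked CNN: the filter is masked so that only the preceding pixels are used when predicting the current pixel
This keeps the parallelism of CNN during training while preserving the order-based generation structure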
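
The masking idea can be sketched as a 0/1 pattern that is multiplied into a k × k kernel. The function name `causal_mask` and its `include_center` flag are hypothetical; the flag mirrors the distinction the PixelCNN paper draws between a first-layer mask that hides the current pixel and later-layer masks that keep it.

```python
def causal_mask(k, include_center=False):
    """Build a k x k mask that keeps only positions earlier in raster order.

    Rows above the center are fully visible; in the center row only the
    positions left of the center are visible. The center itself is kept
    only when include_center is True.
    """
    c = k // 2
    mask = [[0] * k for _ in range(k)]
    for r in range(k):
        for col in range(k):
            earlier = r < c or (r == c and col < c)
            is_center = r == c and col == c
            if earlier or (include_center and is_center):
                mask[r][col] = 1
    return mask

mask_first_layer = causal_mask(3)                        # hides the current pixel
mask_later_layers = causal_mask(3, include_center=True)  # keeps the current position
```

Multiplying a kernel element-wise by such a mask zeroes the weights on the current and future pixels, so a single convolution pass over a training image yields every conditional p(x_i | previous pixels) at once.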

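6. 🔨 Training and Generation

Training: the whole image is fed as input and the conditional pixel probabilities are learned so as to maximize the likelihood p(x)
Generation:
1. Generate pixels sequentially, starting from the first pixel
2. Generate the next pixel conditioned on the previous pixels
3. The whole image is built up one pixel at a time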
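
The generation steps above amount to a simple loop. The sketch below samples a small binary image pixel by pixel; `cond_prob_one` is again a hypothetical toy replacement for the trained network, and the seed is fixed only to make the run repeatable.

```python
import random

def cond_prob_one(prev):
    """Toy stand-in for the trained network: P(next pixel = 1 | previous pixels)."""
    return (1 + sum(prev)) / (2 + len(prev))

def sample_image(height, width, rng):
    """Sample pixels one at a time in raster order, then reshape into rows."""
    pixels = []
    for _ in range(height * width):
        p1 = cond_prob_one(pixels)              # condition on everything so far
        pixels.append(1 if rng.random() < p1 else 0)
    return [pixels[r * width:(r + 1) * width] for r in range(height)]

img = sample_image(4, 4, random.Random(0))
```

Each pixel needs one model evaluation that depends on all earlier pixels, so an H × W image takes H × W sequential steps. This is the source of the slow sampling noted in the pros and cons.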

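7. 📉 Pros and Cons

Pros	Cons
✔ Provides an explicit likelihood (easy to evaluate)	❌ Sequential sampling → slow
✔ Good sample quality	❌ Impractical for high-resolution generation

8. 💡 Details Worth Emphasizing

Chain rule: the likelihood of high-dimensional data x can be decomposed into a product of low-dimensional conditional terms
Pixel order: natural images have no inherent order, so an artificial order (scan direction) is introduced
Sampling latency: each pixel is generated sequentially, so generation is hard to parallelize and unsuitable for real-time applications

9. 🤖 Examples

Images generated with PixelRNN/CNN:
- The model can generate plausible images of things that do not actually exist
- The generated images resemble real ones but are new data drawn from the distribution the model has learned

🧠 Key Points

Item	Description
Purpose of a generative model	Generate new samples similar to the training data
Structure of PixelRNN/CNN	Generates images from conditional probabilities between pixels
Probability decomposition	Sequential decomposition based on the chain rule
Training objective	Maximize the overall likelihood p(x)
Drawbacks of sequential generation	Slow, inefficient at high resolution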
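Summary

This lecture introduces unsupervised learning in contrast to the supervised learning covered earlier, and explains its main concepts and applications. Supervised learning trains a model on pairs of input and output data to solve problems such as classification, regression, object detection, semantic segmentation, and image captioning. Unsupervised learning has no output data; its purpose is to understand the hidden structure of the input data. Its main applications are clustering, dimensionality reduction, and feature learning. The autoencoder, a representative unsupervised model, compresses the input into low-dimensional features and reconstructs the input from them, thereby extracting the important characteristics.

The lecture then covers generative models: their core goal of estimating the distribution of the input data and generating new samples from it. Generative models are divided into explicit density and implicit density approaches, and this lecture focuses on estimating an explicit density with sequential (autoregressive) probability models such as PixelCNN and PixelRNN. The probability of each pixel is conditioned on the previous pixel values, and the per-pixel terms are multiplied to obtain the probability of the whole image; CNN and RNN architectures are used to implement this. The method computes the data probability directly, which makes the generation process transparent, but sample generation takes a long time.

The lecture closes by announcing that the next lecture will cover other approaches that approximate the probability density.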

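Main Topics
The fundamental difference between supervised and unsupervised learning
Typical applications of unsupervised learning: clustering, dimensionality reduction, feature learning
Autoencoder structure and its role in feature extraction
Purpose of generative models: estimating the data distribution and generating samples
Taxonomy of generative models: explicit vs. implicit density
PixelCNN and PixelRNN, and probabilistic image modeling
Assuming a pixel order and using it to compute probabilities
Pros and cons of PixelCNN/RNN
Highlights
🔍 Unsupervised learning focuses on exploring the structure of input data without labels.
🧩 Clustering and dimensionality reduction are key applications of unsupervised learning.
🤖 An autoencoder learns important features by compressing and reconstructing the input data.
🎨 A generative model models the distribution of the training data to generate new samples.
🔢 PixelCNN and PixelRNN estimate the data distribution by computing conditional probabilities of the pixels in an image.
⏳ Generation with PixelCNN/RNN gives an explicit probability distribution, but sampling is slow.
📧 The next lecture will cover approximate density estimation and other generative models.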
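Key Insights

🧠 The core purpose of unsupervised learning is to understand the latent structure of the input data

Clustering and dimensionality reduction are tools for exploring the patterns and properties of the data itself, without labels. They can extract useful information without prior knowledge of the data.

🏗️ The autoencoder is a powerful feature extractor in unsupervised learning

By compressing the input to a low dimension in the encoder and reconstructing it in the decoder, it learns features that summarize the essence of the data. This is closely tied to dimensionality reduction, and the features can later be used to improve supervised learning.

🎯 A generative model aims to model the probability distribution of the training data accurately

Only then can it produce realistic and diverse new samples; the quality of a generative model depends heavily on how well it estimates the distribution.

📊 Explicit density models (PixelCNN, PixelRNN) can compute the data distribution directly

They express the probability directly by decomposing it into per-pixel conditional probabilities and multiplying them into the probability of the whole image, which also makes the model's behavior easier to interpret.

⚙️ RNN-based vs. CNN-based per-pixel models

The RNN-based model computes the conditional probability of each pixel by treating the pixels as a sequence, while the CNN-based model estimates it from the already-generated pixels in a local neighborhood.

⌛ The limitation of PixelCNN and PixelRNN is sampling speed

Because generation is sequential, producing a whole image takes a long time, which makes these models unsuitable for real-time or large-scale generation and motivates other, approximate approaches.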

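📘 Generative Model I: Supplementary Summary (Deeper Points)

🔹 1. What Unsupervised Learning Is Really About

Supervised learning focuses on optimizing prediction accuracy against a "correct answer" (label),
whereas unsupervised learning tries to uncover the structure of the data even though there is no correct answer.
The key is that the model discovers patterns or latent structure in the data on its own.

→ This is very useful in domains where collecting labels is impossible or expensive (medicine, security, and so on).

🔹 2. Why Generative Models Exist and Why They Are Studied

The goal goes beyond memorizing or summarizing patterns in existing data:
it is a model that can create new data, that is, one that learns the ability to generate data itself.

🔍 Example: after seeing many cat images, creating "a new image that looks like a cat"
→ This requires compression, reconstruction, and probabilistic imagination all at once

🔹 3. The Basis of Generative Models: Probability Decomposition and Sampling

Data generation can be viewed as sampling in a probability space
For high-dimensional data x such as an image, the joint probability is written as a product of conditional probabilities of its components (pixels):

p(x) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i-1})

→ This is the chain-rule decomposition used by PixelRNN and PixelCNN

🔹 4. PixelRNN vs. PixelCNN: Structural Differences

Item	PixelRNN	PixelCNN
Backbone	Recurrent neural network (LSTM, etc.)	Convolutional neural network (CNN)
Handling of pixel order	Processes pixels one at a time in sequence	Enforced by masking the convolution filters
Parallelism	Low (slow)	Higher during training; sampling is still sequential
Training stability	Harder for longer sequences (an RNN property)	More stable thanks to the CNN backbone
Typical use	Image generation where long-range dependence between pixels matters	Pixel-level image generation, teaching examples

📌 The masked filters of PixelCNN block information from upcoming pixels, which preserves the conditional probability structure
→ This keeps both the computational efficiency of CNN and the ability to generate conditionally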

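🔹 5. Constraints from the Sampling Perspective

Pixel-based generative models generate pixels sequentially starting from the top-left pixel,
so generating a whole image takes a very long time

❗ For this reason, VAE- or GAN-style models are preferred for real-time or large-scale image generation

🔹 6. Connection to Autoencoders

PixelRNN/CNN focus on estimating the data distribution precisely,
whereas autoencoders focus on learning feature representations

→ Both "learn the structure of the data", but

the former emphasizes "exact estimation of the probability distribution"
the latter emphasizes extracting "summarized latent information (latent code)"

📌 The Variational Autoencoder (VAE) is where the two meet,
→ and the next lecture will explain this connection

🔹 7. Takeaways for Study and Practice

PixelRNN/CNN are interpretable and theoretically well-founded, which gives them high research value
In particular, because the model gives an explicit probability for how the data is generated,
- it suits debugging, building reliable generative models, and interpretable-AI research
In practicality (speed, scaling), however, it can lose out to VAE, GAN, and similar models

✅ Summary

Item	Description
Essence of unsupervised learning	Infers the structure of the data from inputs alone
Goal of a generative model	Generate new samples from the learned data distribution
Mathematical basis of PixelRNN/CNN	Conditional probability decomposition (chain rule)
Masking in PixelCNN	Blocks future information and uses only past information
Generation-time issue	Sequential generation → hard to use in real time
Practical takeaway	Useful for explaining training and interpreting models
Next lecture	Generative models based on approximate density estimation (VAE, etc.)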
00:00:00 In the previous lectures we learned about supervised learning; from this lecture on we will learn about unsupervised learning. So please recall what supervised learning is and what unsupervised learning is. The goal of supervised learning is developing a predictive model based on both input and output data. So in supervised learning the deep learning model first takes some input, like an image or sequence data, and then the model predicts an appropriate output according to the input. So you may remember

00:00:51 that we talked about the CNN model for image data and the RNN model for sequence data as input. For each model, if we generate a discrete output such as a class label, we call it a classification problem, and if we generate a continuous value as output, a regression problem. In contrast, the goal of unsupervised learning is grouping and interpreting the input data, so there is no output. We can group the input data with clustering techniques, and we can also interpret the input data with the dimension reduction

00:01:50 problem. So to train our supervised learning model we need pairs of input and output data. Then the trained model will act as a mapping function between the input and output. In the previous lectures we learned about some supervised learning applications. We first learned about classification: when we take an input image X, we can estimate the label of the image. We also learned about object detection. In that application our supervised learning model not only estimates the labels

00:02:46 of the objects but also estimates the location of the objects. We also learned about semantic segmentation. In that application the model performs pixel-wise classification, so the trained model tells us which class each pixel belongs to. We also learned what image captioning is. In that application our model acts as a mapping function between an image and a sentence. So the goal of all these applications is to find a mapping function between input and output. In unsupervised learning we only

00:03:43 take input data without labels, and the main goal of unsupervised learning is understanding some underlying hidden structure of the input data. Clustering is one of the main applications of unsupervised learning. In that application we want to cluster the samples of the input data by some criteria. We also perform dimensionality reduction with unsupervised learning models, so we can summarize our high-dimensional original data into low-dimensional data with dimensionality reduction techniques. We can also use unsupervised

00:04:42 learning models for feature learning. So this is an example of an autoencoder. It takes an image as input data, and in the encoder part the model extracts features by using some dimensionality reduction techniques, and in the decoder part the model reconstructs the input image from the features. Here the loss is the difference between the original input data and the reconstructed data. Then the model will learn to generate summarized features that can be used to reconstruct the original input

00:05:45 data. Then we can say that these features capture the important characteristics of the input data. The encoder part can also be used as a dimension reduction technique. We also estimate the probability density of the input data with unsupervised learning techniques. Actually these are highly related to each other. For example, we can use the data summarized by dimensionality reduction techniques as features of the data. Also, to cluster data samples we need to extract features of the data based on the

00:06:42 dimensionality reduction techniques. Also, to cluster the data or to extract features of the data, we can use some probability-density-based techniques, so density estimation can be used for clustering and feature extraction. So from this lecture on we will learn how we can use unsupervised learning techniques for these kinds of applications. So let's first talk about generative models. The goal of a generative model is generating new sample data that looks like the training data. For this purpose we first need to understand the training data. So

00:07:41 based on our understanding we first estimate the probability distribution of the data. After that, from the distribution we can generate new sample data. So it addresses the density estimation problem, which is a core problem in unsupervised learning. Let's see an example. These are images. Let's imagine that we could collect every car image in the world; then we could understand the original probability of the images, right? However, unfortunately we can only collect a subset of all the data as the training dataset, so just based

00:08:46 on the subset, the training data, we need to estimate the real data probability. If we estimate the real probability with our model, then based on this estimated probability we can generate new sample data that looks like the real training images. So the goal of the generative model is learning an estimated probability similar to the real data probability. There are a lot of approaches to generative models, in two main categories: explicit density and implicit density. The explicit density approaches are required

00:09:50 to understand the probability of the data. If we directly measure the probability, we call it a tractable density approach, and if we approximate the probability with some parameters, we call it an approximate density approach. In the case of the implicit approaches we don't need to measure the distribution; in these approaches the models use the probability distribution indirectly, inside the model. So first we will learn how we can measure the probability density directly, based on the PixelRNN and PixelCNN

00:10:53 methods. Next we will learn how we can approximate the data distribution based on the variational autoencoder. Moreover, we will also learn how we can generate new sample data without understanding the probability distribution. So let's first see the concept of the PixelRNN and PixelCNN models. As mentioned before, these methods belong to the explicit density models, and the goal of the generative model is to understand the probability distribution of the training data X. So to measure the

00:11:52 probability distribution we first use the chain rule to decompose the likelihood of the input image. Based on the chain rule we can decompose this probability distribution of x like this. Let's see an example. In the example we have some sequence data like this. This one is x0, this one is x1, and so on, so this could be xi. In the example we can first measure the probability of x0, and as mentioned before this is sequence data, so based on the probability of x0 we can measure the probability of x1 given x0. So

00:13:05 this could be the conditional probability of xi, like this, right? This is the same as this one. So if we multiply the probabilities of every sample, we get the overall probability of the data. This equation represents the previous example. This part represents every element of the sequence, and by multiplying over every element of the sequence we get the overall probability of the data. So this is the likelihood of image X, and this is the probability of every pixel value

00:14:20 based on the previous pixel values. So the goal of PixelRNN and PixelCNN is to express this probability of every pixel value with CNN and RNN models. For this purpose we first need to define the order of the pixels. Actually there is no order in an image, but we assume an order of pixels for PixelRNN and PixelCNN. So the objective of PixelRNN and PixelCNN is maximizing the likelihood of the training data. Now let's see the details of PixelRNN and PixelCNN. As mentioned before, we first need to assume the order of

00:15:25 the pixels. So in the example, we can assume that the left top corner is the first pixel and the information of the pixel will be passed to the right and down. So in this case it could be the first sequence data and this are the second sequence data and are the third sequence data. So we can pass the information like this. So the pixel RNA just use the RNA model like LSTM to model this property of the data. Of course we can also start the right down point as the starting point. In the case this

00:16:34 could be the first sequence data and this could be the second sequence data and this are becomes third sequence data. So we can also construct this kind of RNA model. So you may remember that we can use the bidirectional RNA model. So we can use this two directional RNA model together. We can also use the convolution corner to estimate the priority of every pixels. So you may remember that we use convolutional corner like this in CNN model. So with the convolutional con we can estimate the pixel

00:17:35 value at the center point according to the values of their neighborhood pixels. In this problem we suppose that there are some orders between pixel like this. So if we use this kind of convolutional corner where the these pixels are masked then we can estimate the pixel value of this point only based on their previous sequence data. So this is the concept of pixel CN. So now we if we have this kind of trained corner then we can determine the pixel value at every point based on their previous point.
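
The masked kernel described above can be sketched in a few lines. This is only an illustration under assumptions: a raster-scan order (left to right, top to bottom), a plain list-of-lists image, and the hypothetical helper names `causal_mask` and `masked_response`; it is not the lecture's implementation.

```python
def causal_mask(size, include_center=False):
    """Return a size x size 0/1 mask for a raster-scan pixel order.

    Rows above the center row, and positions left of the center in the
    center row, are "previous" pixels and stay visible (1).
    """
    c = size // 2
    mask = [[0] * size for _ in range(size)]
    for r in range(size):
        for col in range(size):
            if r < c or (r == c and col < c):
                mask[r][col] = 1
    if include_center:
        mask[c][c] = 1
    return mask


def masked_response(image, kernel, mask, row, col):
    """Weighted sum around (row, col) using only the unmasked kernel entries."""
    c = len(kernel) // 2
    total = 0.0
    for dr in range(-c, c + 1):
        for dc in range(-c, c + 1):
            r, q = row + dr, col + dc
            if 0 <= r < len(image) and 0 <= q < len(image[0]):
                total += image[r][q] * kernel[dr + c][dc + c] * mask[dr + c][dc + c]
    return total


for line in causal_mask(3):
    print(line)
```

Because the current and later pixels are multiplied by zero, the response at a position depends only on pixels that come earlier in the assumed order.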

00:18:44 So based on these kinds of RNN and CNN models we can generate new sample data according to the trained probability distribution. So these are example results of PixelRNN and PixelCNN. Based on this technique we can generate new images that are not real objects but look like real objects. So the main advantage of PixelRNN and PixelCNN is that we can explicitly compute the probability distribution of the data, and because of that we can directly understand how our model generates new sample data. However, because it uses

00:19:43 sequential generation techniques, it takes a lot of time to generate new sample data. So in this lecture we learned about the PixelRNN and PixelCNN models, which can directly estimate the probability density of the input data. In the next lecture we will learn how we can approximate the probability density of the data with an approximation technique. Please contact me by email if you have any questions about the previous lectures. Thank you.
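
As a rough sketch of the chain-rule factorization from this lecture, the example below scores and samples a short binary "image" one pixel at a time. The conditional model `cond_prob_one` is a made-up toy rule standing in for the trained RNN/CNN, and all function names are assumptions for illustration.

```python
import itertools
import math
import random


def cond_prob_one(prev_pixels):
    """Toy stand-in for the RNN/CNN: p(x_i = 1 | x_0..x_{i-1})."""
    if not prev_pixels:
        return 0.5
    return 0.8 if prev_pixels[-1] == 1 else 0.3


def log_likelihood(pixels):
    """log p(x) = sum_i log p(x_i | x_0..x_{i-1}), i.e. the chain rule."""
    total = 0.0
    for i, value in enumerate(pixels):
        p_one = cond_prob_one(pixels[:i])
        total += math.log(p_one if value == 1 else 1.0 - p_one)
    return total


def sample(n_pixels, rng):
    """Generate pixels one at a time, each conditioned on the earlier ones."""
    pixels = []
    for _ in range(n_pixels):
        pixels.append(1 if rng.random() < cond_prob_one(pixels) else 0)
    return pixels


# p([1, 1, 0]) = 0.5 * 0.8 * 0.2, up to floating-point rounding.
print(math.exp(log_likelihood([1, 1, 0])))
# The probabilities of all length-3 sequences should sum to (about) 1.
print(sum(math.exp(log_likelihood(list(s))) for s in itertools.product([0, 1], repeat=3)))
print(sample(8, random.Random(0)))
```

The sampling loop also shows why generation is slow: each pixel has to wait for all earlier pixels.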

Lecture 12 - Generative Models 2

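📘 Lecture Summary: Variational Autoencoders (VAE)

1. 🔄 Differences between a standard AE (Autoencoder) and a VAE

Aspect	Autoencoder (AE)	Variational Autoencoder (VAE)
Structure	Encoder - Decoder	Probabilistic Encoder - Decoder
Purpose	Compress / reconstruct the input	Model the probability distribution of the input and generate samples
Latent output	Fixed feature vector	Latent vector sampled from a probability distribution
Training	Reconstruction loss	Reconstruction + KL divergence (ELBO)

2. 🎲 Probabilistic modeling in the VAE

The VAE assumes the following generative process:

The latent variable z is sampled from a prior distribution p(z) (usually a standard normal).
The observed data x is generated from the conditional distribution p(x|z).
Because p(x) is hard to compute directly, Variational Inference is used to optimize a lower bound.

3. 📐 Optimization objective (Evidence Lower Bound, ELBO)

The VAE maximizes the following lower bound:

log p(x) ≥ E_{z∼q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z))

First term: reconstruction term (Reconstruction loss)
→ How well the decoder reconstructs x from z.
Second term: KL divergence
→ Pushes the encoder's approximate posterior q(z|x) toward the prior p(z).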
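
When q(z|x) is a diagonal Gaussian and p(z) is a standard normal, the KL term has a well-known closed form. The sketch below assumes the encoder outputs a mean vector and a log-variance vector; the function name `kl_to_standard_normal` is chosen here for illustration.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))   # 0.0: q equals the prior
print(kl_to_standard_normal([1.0, -2.0], [0.0, 0.0]))  # 2.5: only the means differ
```

The term is zero only when the encoder output matches the prior exactly, and it grows as the mean moves away from 0 or the variance moves away from 1.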

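4. 📊 VAE components

Encoder:
Input x → mean μ(x) and variance σ²(x) → sample z from q(z|x)
(using the reparameterization trick)
Decoder:
z → reconstruct the original input x through p(x|z)
Loss function:
L(x) = −E_{z∼q(z|x)}[log p(x|z)] + KL(q(z|x) || p(z))   (the negative ELBO, minimized)

5. ✅ Pros and cons of the VAE

Pros

The probabilistic representation makes it possible to generate diverse samples

Meaningful features can be extracted through the latent variable z

It learns representations that can be reused for other tasks (e.g. classification, clustering)

Cons

The training objective is a lower bound, not the true likelihood

Because only a bound on the likelihood is available, evaluation can be less exact than with PixelRNN/CNN

6. 🧠 How the VAE relates to the AE

A standard AE focuses on feature extraction through compression and reconstruction,
whereas the VAE adds a probabilistic interpretation so that representation learning and data generation are possible at the same time.
The encoder of an AE can also be used to initialize the weights of a supervised model.

7. 📌 Wrap-up

The VAE extends the autoencoder into a probabilistic model that explicitly models how the input data is generated.
This supports tasks beyond plain reconstruction (e.g. sample generation, feature disentanglement) and provides a solid foundation as a generative model.
Unlike PixelCNN/RNN, the likelihood cannot be computed directly, but approximate inference makes training practical.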

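📘 Detailed VAE summary from the transcript (focused on points not covered in the slides)

1. 🧠 Purpose and background of the Variational Autoencoder (VAE)

A standard autoencoder compresses the input into a latent space and then reconstructs it, but it is not suited to generating new samples.
The VAE introduces probabilistic modeling so that it can approximate the input distribution and generate new samples.
The VAE is a generative model of the "Approximate Explicit Density Estimation" type; its training goal is to maximize the marginal likelihood p(x) of the input data (in practice, a lower bound on it).

2. 🧮 VAE components: a probabilistic autoencoder

Encoder (Recognition Model):
- Input x → estimates the distribution q(z∣x) of the latent variable z.
- q(z∣x) is assumed to be a normal distribution N(μ(x) , σ^2(x)).
- Sampling z directly is not differentiable with respect to μ and σ, so the reparameterization trick is used: z = μ(x) + σ(x) ⋅ ϵ with ϵ ∼ N(0, I).
Decoder (Generative Model):
- Latent variable z → regenerates the data through p(x∣z).
- Usually x ∼ N(μ(z) , σ^2(z)) is assumed.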
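
A minimal sketch of the reparameterization trick, assuming the encoder returns a mean vector and a log-variance vector (the log-variance parameterization and the name `reparameterize` are assumptions, not something stated in the lecture):

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    z = []
    for m, lv in zip(mu, log_var):
        eps = rng.gauss(0.0, 1.0)            # randomness is isolated in eps
        z.append(m + math.exp(0.5 * lv) * eps)
    return z


rng = random.Random(0)
mu, log_var = [2.0, -1.0], [math.log(0.25), math.log(4.0)]  # sigma = 0.5 and 2.0
samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]
mean0 = sum(s[0] for s in samples) / len(samples)
mean1 = sum(s[1] for s in samples) / len(samples)
print(mean0, mean1)  # should be close to the requested means 2.0 and -1.0
```

Since z is a deterministic function of μ, σ and an external noise draw, gradients can flow through μ and σ in an automatic-differentiation framework.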

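3. 🔄 Derivation: Evidence Lower Bound (ELBO)

Because the VAE cannot compute p(x)=∫p(x∣z) p(z) dz directly, it uses variational inference and maximizes the ELBO:

ELBO(x) = E_{z∼q(z∣x)}[log p(x∣z)] − KL(q(z∣x) || p(z)) ≤ log p(x)

First term: reconstruction term (Reconstruction Loss) → measures how well x is reconstructed.
Second term: KL Divergence → regularizes how close the encoder's estimated distribution q(z∣x) is to the prior p(z).

Note: the ELBO is a lower bound on log p(x); the true likelihood itself cannot be computed.

4. 📐 Step-by-step restatement of the derivation

The transcript walks through the derivation as follows:

Marginal likelihood:
p(x) = ∫p(x∣z) p(z) dz
This cannot be computed directly.
Expansion with Bayes' rule:
log p(x) = E_{z∼q(z∣x)}[ log( p(x∣z) p(z) / p(z∣x) ⋅ q(z∣x) / q(z∣x) ) ]
→ Splitting the logarithm yields the ELBO plus one extra KL term (applying Jensen's inequality to log ∫ q(z∣x) ⋅ p(x∣z) p(z) / q(z∣x) dz gives the same bound).
KL divergence decomposition:
log p(x) = ELBO + KL(q(z∣x) || p(z∣x))
→ The second KL term can never be negative, so the ELBO is a lower bound.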
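
Written out in LaTeX, the steps above can be followed line by line. This is the standard textbook derivation, reconstructed here from the summary rather than copied from the lecture slides:

```latex
\begin{aligned}
\log p(x)
 &= \mathbb{E}_{z \sim q(z \mid x)}\left[\log p(x)\right] \\
 &= \mathbb{E}_{z \sim q(z \mid x)}\left[\log \frac{p(x \mid z)\, p(z)}{p(z \mid x)}\right] \\
 &= \mathbb{E}_{z \sim q(z \mid x)}\left[\log \left(\frac{p(x \mid z)\, p(z)}{p(z \mid x)} \cdot \frac{q(z \mid x)}{q(z \mid x)}\right)\right] \\
 &= \mathbb{E}_{z \sim q(z \mid x)}\left[\log p(x \mid z)\right]
    - \mathbb{E}_{z \sim q(z \mid x)}\left[\log \frac{q(z \mid x)}{p(z)}\right]
    + \mathbb{E}_{z \sim q(z \mid x)}\left[\log \frac{q(z \mid x)}{p(z \mid x)}\right] \\
 &= \underbrace{\mathbb{E}_{z \sim q(z \mid x)}\left[\log p(x \mid z)\right]
    - \mathrm{KL}\left(q(z \mid x)\,\|\,p(z)\right)}_{\text{ELBO}}
    + \underbrace{\mathrm{KL}\left(q(z \mid x)\,\|\,p(z \mid x)\right)}_{\ge 0}
\end{aligned}
```

The first line holds because log p(x) does not depend on z, and the second line is Bayes' rule, p(x) = p(x∣z) p(z) / p(z∣x).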

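5. 🔍 Meaning of each term and its role in training

Reconstruction Loss (training signal for the decoder):
E_{z∼q(z∣x)}[log p(x∣z)]
→ Measures how well the decoder reconstructs x from z.
KL Divergence (training signal for the encoder):
KL(q(z∣x) || p(z))
→ Pushes the encoder's latent distribution toward the prior p(z).

So the VAE loss is the sum of a reconstruction term and a regularization (KL) term.

6. 🧪 Additional concepts and practical insights

The VAE performs generation and encoding in a single model.
Because of its probabilistic structure, it supports sampling-based generation in addition to plain feature extraction.
New samples are generated in this order:
1. Sample z ∼ N(0,I)
2. Generate a new x from x ∼ p(x∣z)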
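
The two generation steps can be sketched as follows. The function `decode_mean` is a hypothetical fixed linear map standing in for a trained decoder network, and the output noise level `x_std` is likewise an assumption for illustration.

```python
import random


def decode_mean(z):
    """Hypothetical stand-in for a trained decoder: z (2-d) -> mean of x (3-d)."""
    weights = [[1.0, 0.0], [0.5, 0.5], [0.0, -1.0]]
    bias = [0.1, 0.0, -0.1]
    return [sum(w * zi for w, zi in zip(row, z)) + b for row, b in zip(weights, bias)]


def generate(rng, latent_dim=2, x_std=0.05):
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]   # step 1: z ~ N(0, I)
    x = [rng.gauss(m, x_std) for m in decode_mean(z)]      # step 2: x ~ p(x|z)
    return z, x


rng = random.Random(0)
for _ in range(3):
    z, x = generate(rng)
    print([round(v, 2) for v in z], "->", [round(v, 2) for v in x])
```

No encoder is involved at generation time, which is why the latent codes seen during training need to resemble draws from the prior.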

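7. 📌 Flow summary

Input x
  ↓
Encoder (q(z|x)) → μ(x), σ(x)
  ↓     (reparameterization trick)
z ~ N(μ, σ²)
  ↓
Decoder (p(x|z))
  ↓
Reconstructed x'

✅ Summary

Item	Description
Purpose	Approximate the probability distribution of the input data and generate new samples
Main structure	Probabilistic Encoder + Decoder
Key formula	ELBO = E_q[log p(x|z)] (reconstruction term) − KL divergence
Training	Maximize the ELBO (a lower bound on the likelihood)
Uses	Image generation, latent-space feature extraction, anomaly detection, etc.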

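Summary

This lecture covers the Variational Autoencoder (VAE), one type of generative model. Generative models are broadly divided into explicit density approaches and implicit density approaches, and the VAE belongs to the approximate branch of the explicit density approaches. Pixel RNN and Pixel CNN from the previous lecture are tractable density models that compute the probability density directly. The VAE instead introduces a latent variable to approximate a complex probability distribution and maximizes a lower bound on the data density rather than computing the density itself.

An autoencoder is used for dimensionality reduction and feature extraction; it is an unsupervised method that encodes the input into a latent space and reconstructs it. The VAE adds a probabilistic interpretation: it assumes a distribution over the latent space in advance so that new data can be generated by sampling. The encoder estimates the distribution of the latent variable for the input data, and the decoder reconstructs the data from the latent variable. The goal of the VAE is to maximize a lower bound on the data likelihood, and variational inference with the KL divergence is used to optimize the model.

The VAE can be used for generation, feature representation learning and other tasks. It works around the fact that the density cannot be computed directly by taking a probabilistic approach and optimizing a lower bound. The next lecture covers Generative Adversarial Networks (GAN), which use an implicit density.

Highlights
🔄 The Variational Autoencoder (VAE) adds a probabilistic interpretation to the autoencoder, making it possible to generate new data
🧩 Among explicit density models, the VAE belongs to the approximate density approaches, which approximate the probability distribution
🔍 The encoder estimates the latent variable distribution from the input data; the decoder generates data from the latent variable
⚙️ The training goal of the VAE is to maximize a lower bound on the data likelihood
🧮 The loss function includes the KL divergence, and the encoder and decoder are optimized jointly
🌐 The VAE can be used for learning low-dimensional feature representations and for generating new samples
📊 The next lecture covers Generative Adversarial Networks (GAN), an implicit density model
Key insights

🔄 Extending the autoencoder into a probabilistic generative model
The VAE goes beyond plain reconstruction, the main limitation of the basic autoencoder, by modeling the latent space as a probability distribution, which makes it possible to generate diverse data. This is an important link between unsupervised learning and generative modeling.

🔍 Handling complex distributions through an approximate density approach
The VAE sidesteps the difficulty of computing the exact data density by using variational inference to maximize a lower bound. This makes it practical to model complex, high-dimensional data distributions.

🧩 A structured latent space that is easy to sample from
Because the latent variable is modeled so that it can be sampled from a predefined distribution (e.g. a Gaussian), diverse data can be generated without any new input.

⚙️ Joint optimization of the networks through the KL divergence
Including the KL divergence in training reduces the distance between the encoder's approximate posterior and the prior, so the latent space is well regularized and training is stable.

🌐 General-purpose feature representations from the VAE
The latent representation extracted by the VAE encoder is a useful feature for image classification or for initializing other supervised models, which makes it well suited to transfer learning.

📚 A foundation for later generative models
This lecture built the theoretical basis of generative modeling through the concepts and formulas of the VAE, an explicit density model; later lectures extend this to implicit density models such as GAN.

⚠️ Limitation: the data density cannot be computed directly
The VAE optimizes a lower bound instead of the data density itself, so model quality depends on how tight the bound is, and results should be interpreted with that in mind.

This lecture systematically explains the core structure, objective, training procedure and mathematical background of the variational autoencoder, broadening the understanding of generative models and leading naturally into the upcoming material on implicit density models.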

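1. 📌 Taxonomy of generative models and where the VAE fits

Generative models are classified as follows:
- Explicit density models: define the data distribution p(x) explicitly.
  - Tractable: fully defined through conditional probabilities, as in PixelCNN and PixelRNN.
  - Approximate: direct computation is intractable, so a lower bound is maximized through approximate inference, as in the VAE.
- Implicit density models: do not define the distribution and only support sampling (e.g. GAN, covered in the next lecture).

2. 🔄 From the autoencoder to the VAE

A basic autoencoder encodes the input x into a latent code z and reconstructs it.
The VAE adds a probabilistic view:
- Encoder: estimates the distribution q(z∣x) of the latent variable z for a given input.
- Decoder: reconstructs the data from z through p(x∣z).
Core objective: maximize the marginal probability of the data, p(x) = ∫p(x∣z) p(z) dz.

3. 🧮 Derivation: Evidence Lower Bound (ELBO)

p(x) cannot be computed directly → maximize the ELBO:

log p(x) = E_{z∼q(z∣x)}[log p(x∣z)] − KL(q(z∣x) || p(z)) + KL(q(z∣x) || p(z∣x))

The last term is always ≥ 0 → the first two terms together are taken as the ELBO (lower bound):

ELBO(x) = E_{z∼q(z∣x)}[log p(x∣z)] − KL(q(z∣x) || p(z)) ≤ log p(x)

Term 1: reconstruction term (trains the decoder),
Term 2: regularizes the encoder toward the prior (trains the encoder)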
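
The identity log p(x) = ELBO + KL(q(z∣x) || p(z∣x)) can be checked numerically in a case where every quantity has a closed form. The sketch below assumes a one-dimensional linear-Gaussian model (z ∼ N(0, 1), x∣z ∼ N(z, s2)) chosen purely for illustration; the function names are not from the lecture.

```python
import math


def log_marginal(x, s2):
    """log p(x) for z ~ N(0, 1), x | z ~ N(z, s2), so p(x) = N(0, 1 + s2)."""
    var = 1.0 + s2
    return -0.5 * math.log(2.0 * math.pi * var) - x * x / (2.0 * var)


def elbo(x, s2, m, v):
    """E_q[log p(x|z)] - KL(q || p(z)) for q(z|x) = N(m, v), in closed form."""
    recon = -0.5 * math.log(2.0 * math.pi * s2) - ((x - m) ** 2 + v) / (2.0 * s2)
    kl_prior = 0.5 * (v + m * m - 1.0 - math.log(v))
    return recon - kl_prior


def kl_to_true_posterior(x, s2, m, v):
    """KL(q(z|x) || p(z|x)), where p(z|x) = N(x / (1 + s2), s2 / (1 + s2))."""
    mp, vp = x / (1.0 + s2), s2 / (1.0 + s2)
    return 0.5 * (math.log(vp / v) + (v + (m - mp) ** 2) / vp - 1.0)


x, s2 = 1.0, 1.0
for m, v in [(0.0, 1.0), (0.3, 0.8), (0.5, 0.5)]:
    gap = kl_to_true_posterior(x, s2, m, v)
    print(f"ELBO={elbo(x, s2, m, v):.4f}  gap={gap:.4f}  log p(x)={log_marginal(x, s2):.4f}")
```

The last setting uses the true posterior as q, so the gap vanishes and the ELBO equals log p(x); any other q gives a strictly smaller ELBO.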

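4. ⚙️ Network structure

Encoder network (Recognition model):
- Input x → μ(x),σ(x)
- z ∼ N(μ(x) , σ^2(x))
- Uses the reparameterization trick: z = μ + σ ⋅ ϵ, ϵ∼N(0,1)
Decoder network (Generative model):
- z → reconstruct x
- x ∼ N(μ(z) , σ^2(z))

Every operation in the VAE is differentiable, which is what makes it trainable by backpropagation.

5. 📊 Training objective: optimizing the ELBO

Total loss function: L(x) = −E_{z∼q(z∣x)}[log p(x∣z)] + KL(q(z∣x) || p(z))
- First term: improves reconstruction accuracy
- Second term: regularizes the latent distribution (pulls it toward the prior)

6. 🔬 Key concepts and how to read the model

Component	Role	Meaning in the formula
KL divergence	Regularization	Measures how far the encoder distribution is from the prior
ELBO	Optimization target	Lower bound on the true likelihood

7. ✨ Applications and further insights

Generation:
- A trained model can generate images by sampling z ∼ N(0,1) and then x ∼ p(x∣z).
Representation learning:
- The encoder learns a latent representation of the input → transferable to downstream tasks (classification, etc.).
Interpretability:
- The μ and σ inside the model are an analytic description of the latent distribution → useful for finding similar data, clustering, etc.
Limitations:
- p(x) cannot be computed directly → performance depends on how tight the ELBO is.
- Generated samples often have lower resolution or less fine detail than those from a GAN.

✅ Overall structure

        Input x
          ↓
   Encoder: q(z|x)
   → μ(x), σ(x)
          ↓
    z = μ + σ * ε
          ↓
   Decoder: p(x|z)
          ↓
     Reconstructed x'

📌 Closing notes

This lecture explained and organized the following points:

The VAE is a probabilistic extension of the autoencoder
It is an explicit-density generative model, but it optimizes an approximate likelihood
The ELBO is derived step by step from the formulas
The encoder and decoder each have a distinct probabilistic role
It serves the dual purpose of representation learning and generation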

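Why does the VAE use the KL divergence to keep the latent distribution q(z∣x) close to the prior p(z)? There are three theoretical and practical reasons:

✅ 1. Sampling Feasibility

After training, the VAE generates new data by sampling the latent variable z from the prior p(z).

If q(z∣x) is very different from p(z), the trained decoder receives unfamiliar z values and its outputs degrade.
In other words, the q(z∣x) used during training and the p(z) used at generation time must be similar for the decoder to produce good new samples.

📌 This is why the KL divergence is used to push q(z∣x) ≈ p(z).

✅ 2. Latent Space Regularization

The KL divergence constrains the latent distribution q(z∣x) learned by the encoder so that it does not become overly complex or differ too much from one data point to the next.

This keeps the latent space smooth and continuous.
As a result, similar inputs x are mapped to similar z,
and linear interpolation in the latent space produces meaningful results.

🔄 In short, it encourages the latent space to be learned as a well-organized semantic space.

✅ 3. Overfitting prevention and regularization effect

The KL divergence acts as a regularizer that keeps the encoder from describing the data too precisely (that is, from overfitting).

If q(z∣x) became a very narrow, distinct distribution for every x,
the model would reconstruct the training data well but could generalize poorly.

The KL term limits how specialized the encoder's distributions can become.

📌 Summary

Reason	Explanation
✅ Sampling feasibility	At generation time, z must be sampled from p(z)
✅ Latent space regularization	Shapes the space to be continuous and meaningful so that similar inputs map to similar z
✅ Overfitting prevention	The KL term controls the encoder distribution and limits model complexity (regularization)

Mathematically, the same reading applies to the objective:

ELBO(x) = E_{z∼q(z∣x)}[log p(x∣z)] − KL(q(z∣x) || p(z))

The first term improves reconstruction quality
The second term regularizes the distribution, improving training stability and generation quality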

00:00:01 Continuing from the previous lecture, let's look at the generative model. In this lecture we will learn about the variational autoencoder. Previously we saw that there are two different approaches for generative models. The first one is explicit density approaches and the next one is implicit density approaches. We also saw that in the explicit density approaches we can directly measure the density or we can approximate the density of the training data. The variational autoencoder is one of the methods in the approximate density

00:00:52 approaches. So now let's compare the tractable and approximate density approaches. In the previous lecture we learned about PixelRNN and PixelCNN, which are tractable density approaches. In those methods we first define the tractable density function p(x) with conditional probabilities and then optimize the likelihood of the training data directly by using RNN and CNN approaches. On the other hand, the variational method, which is one of the approximate density approaches, first defines an intractable function

00:01:42 with a latent variable z, like this equation. In the equation the first term represents the probability of the latent variable z, and the next term represents the probability of x given the latent variable, and we measure p(x) by using the integral of p(z) and p(x|z). Because this function is an intractable function, we cannot directly measure this value. However, at least we can measure the lower bound of the probability. So from now on let's see the details of the optimization process of the variational autoencoder. Before that

00:02:51 we need to check what the autoencoder is. The autoencoder is one of the unsupervised approaches for learning low-dimensional feature representations from unlabeled training data. So this is the basic structure of the autoencoder. The autoencoder first takes training data as input. After that the autoencoder encodes the input data to generate low-dimensional features. Next it tries to reconstruct the input data from the extracted features. In order to make the input data and the reconstructed data as similar as

00:03:51 possible, the loss function will be set as the distance between the input data and the reconstructed data. In this structure we can extract the features that best represent the input data, because the features will keep the key information needed to reconstruct the input data. So this is an example of an autoencoder for image reconstruction. With this application we can extract the basis features of the images. The encoder part of the trained autoencoder can be used to initialize a supervised model. For that purpose the

00:04:54 decoder part will be replaced with the supervised modules so that the predicted label will be derived from the extracted features. After constructing the structure, we can update every weight again jointly. This kind of fine-tuning approach based on the encoder could be much more efficient compared to basic supervised approaches. So now we can raise a question. As seen before, the autoencoder can reconstruct the input data; then can we generate new images from an autoencoder instead of reconstructing the input data? The variational

00:06:02 autoencoder has been proposed to resolve the question. The main idea of the variational autoencoder is that a probabilistic spin on the autoencoder will let us sample from the model to generate new data. Now let's think about the decoder part of the autoencoder. Let's suppose that z is the latent features of x and x is data generated from z. The value z is generated from some prior distribution of z. That means if we know the prior, we can sample z from the prior distribution. Moreover, in the case of x we can

00:07:10 assume the conditional distribution p(x|z). In this case, if we consider every possible sample of z, we can measure the marginal distribution of x. Previously we saw that we cannot directly measure this probability density. The reason is that we cannot consider every possible sample of z. That's why we use approximation techniques based on a parameter set to measure this density function. So the goal of the variational autoencoder is measuring this marginal probability distribution of x, and as mentioned

00:08:18 before, p(z) is the prior distribution, so for the prior distribution we can simply assume a Gaussian prior distribution, and the conditional probability distribution of x given z is measured based on the decoder network. As mentioned before, the intractability of the variational autoencoder comes from the integral part, because we cannot consider every z. Now let's consider Bayes' rule. In Bayes' rule we can also see p(x|z), which can be measured from the decoder network; moreover, we can also see p(z), which is the prior distribution. Now

00:09:34 the remaining part is this p(z|x). To approximate p(z|x), we need to define an additional encoder network. So we need two approximating networks. The first one is the decoder network for measuring p(x|z), and we need the encoder network to measure p(z|x). This allows us to derive at least a lower bound on the data likelihood. So we will see how we can derive the lower bound. Before that, we need to clearly understand the process of the VAE. Let's first see the encoder part,

00:10:44 which can measure p(z|x). So the input of the encoder is x, and the output is z, which follows the distribution. As mentioned before, we can assume the prior distribution of z to be a Gaussian function. So the parameters of p(z|x) are the mean mu of the Gaussian distribution and sigma. There are two hidden vectors in the encoder network: one is used for mu and one is used for sigma. If we have the two parameter vectors, we can generate a Gaussian distribution, and with the distribution we can sample z. This is the decoder part to measure

00:12:14 p(x|z). If we assume that x also follows a Gaussian, then we can also have two parameters, mu and sigma, for x. Now the z sampled from the encoder network becomes the input of the decoder network. With the two parameter vectors we can generate another Gaussian distribution for x, and based on the Gaussian distribution we can also sample x following the distribution. Based on this kind of encoder and decoder network we can generate new sample data following the probability distribution of x. So this is the overall process of the variational auto

00:13:35 encoder. Now let's see the equations for the variational autoencoder. So the goal of the variational autoencoder is to maximize p(x), and as seen before we can represent p(x) with p(z) and p(x|z) using the latent variable z, so this one is the same as this integral. Now we can represent p(x) with Bayes' rule. Then, starting from Bayes' rule, we can simply add the encoder network q like this. This does not change the equation, because the added factor is equal to one. Now let's regroup the terms according to the color. So this one is the green part and this one is the purple

00:15:05 part and this one is the orange part. Because of the logarithm, the multiplication in the equation changes to a summation, and we can see that these two terms become KL divergences. Now let's first consider the second KL divergence term. Actually we cannot calculate this KL divergence, because it contains p(z|x), which is what we want to know; we cannot measure this KL divergence because p(z|x) is intractable. However, we know that a KL divergence cannot have a negative value. So we at least know that the

00:16:26 second KL divergence term should have a non-negative value. So the remaining terms become a lower bound of the likelihood. So even though we cannot directly measure the likelihood, at least we can optimize the lower bound. So now the goal of the variational autoencoder changes to maximizing the lower bound. So let's see the equations together with the diagram of the variational autoencoder. The first term represents reconstructing the input data. So by maximizing the first term we can make the generated output follow the probability of the

00:17:34 original data x. So this is for training the decoder network. Now let's see the second term. You may remember that the KL divergence represents a distance between two probability distributions, so minimizing the KL divergence here (which maximizes this term, because of the minus sign) makes the approximate posterior distribution close to the prior distribution. So the second term represents optimizing the encoder network. So the loss function of the variational autoencoder becomes the lower bound equation. That is the training of the autoencoder. The main idea of the variational autoencoder is applying

00:18:53 a probabilistic spin to the autoencoder to allow generating new samples. The variational autoencoder defines an intractable density and optimizes a lower bound of the function, so it is a principled approach to generative models and allows more flexible approximation. Moreover, the variational autoencoder allows inference of the probability distribution of z given x, which can be used as a feature representation for other tasks. However, as mentioned before, it cannot directly measure the probability of x; instead, it just maximizes the lower bound of the

00:19:59 likelihood. So until now we saw PixelRNN, PixelCNN and the variational autoencoder, which are explicit density approaches. In the next lecture we will see the generative adversarial networks, which adopt implicit approaches. Thank you.

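Lecture 13 - Generative Models 3

📚 Key points from the lecture transcript: Generative Adversarial Networks (GAN)

1. 📌 Review: two approaches to generative modeling

Explicit Density approaches:
- Estimate the probability distribution p(x) explicitly.
- Examples: Pixel RNN, Pixel CNN, Variational Autoencoder (VAE)
Implicit Density approaches:
- Learn to generate samples without defining the probability distribution directly.
- Representative model: GAN

2. 🥊 Intuition behind GAN: an adversarial two-player game

Generator: creates fake data and tries to fool the discriminator.
Discriminator: decides whether its input is real or fake.
The two models are trained in competition:
- the generator tries to produce more convincing fakes to fool the discriminator,
- the discriminator tries to get better and better at spotting fakes.

🎮 Analogy

Counterfeiter vs police
- The counterfeiter (Generator) makes bills that look real,
- the police (Discriminator) try to detect them.
- As the game is repeated, both sides improve.

3. ⚙️ GAN objective function (Minimax Objective)

GAN objective:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]

D(x): returns the probability that x is a real image
G(z): image generated from the latent vector z

Optimization strategy:

Discriminator D: maximizes the objective so that real images are scored 1 and fakes 0.
Generator G: minimizes the objective so that generated fakes are mistaken for real.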
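
To make the two terms concrete, the sketch below estimates V(D, G) from a handful of discriminator outputs. The numbers are made up for illustration and the function name `value_function` is an assumption.

```python
import math


def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


confident_d = value_function(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])  # D is right
fooled_d = value_function(d_real=[0.9, 0.95], d_fake=[0.9, 0.95])     # G fools D
undecided_d = value_function(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])    # D outputs 0.5
print(confident_d, undecided_d, fooled_d)
```

The value is highest when the discriminator is right on both kinds of input and lowest when the generator fools it, which matches the max over D and the min over G.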

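4. 🚧 Why early generator training is hard, and the fix

Problem: early in training D(G(z)) ≈ 0, where log(1 − D(G(z))) is nearly flat → the gradient is small and learning is slow.
Fix: change the generator objective to:

max_G E_{z∼p(z)}[log D(G(z))]

→ The gradient becomes stronger in the direction that makes the discriminator accept fakes as real.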
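
The difference between the two generator objectives shows up in their slopes with respect to the discriminator output D(G(z)). This small sketch only evaluates the analytic derivatives; the function names are illustrative.

```python
def saturating_grad(d_fake):
    """d/dD of log(1 - D): slope of the original generator objective (minimized)."""
    return -1.0 / (1.0 - d_fake)


def non_saturating_grad(d_fake):
    """d/dD of log(D): slope of the alternative generator objective (maximized)."""
    return 1.0 / d_fake


for d_fake in (0.01, 0.5, 0.99):
    print(d_fake, abs(saturating_grad(d_fake)), abs(non_saturating_grad(d_fake)))
```

Near D(G(z)) ≈ 0, where an untrained generator starts, the original objective has a slope of about 1 in magnitude while the alternative has a slope of about 100, so the alternative gives a much stronger signal exactly when the generator is poor.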

5. 🔁 GAN training procedure (Alternating Optimization)

Sample real data x and a noise vector z ∼ N(0,1)
Train the Discriminator D (to separate real from fake well)
Train the Generator G (so that fakes look real)

Repeating these steps makes both models progressively more refined.
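
The alternating procedure can be sketched on a one-dimensional toy problem with hand-written gradients. Everything here is an assumption made for illustration: the generator is G(z) = z + b, the discriminator is D(x) = sigmoid(w * x + c), the real data is N(3, 1), and one sample is used per step. It is not the lecture's algorithm in full, only the same three-step loop.

```python
import math
import random


def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, lr=0.05, real_mean=3.0, seed=0):
    """Alternate one discriminator step and one generator step on 1-D data."""
    rng = random.Random(seed)
    w, c, b = 0.0, 0.0, 0.0
    for _ in range(steps):
        # 1. Sample real data x and noise z ~ N(0, 1).
        x = rng.gauss(real_mean, 1.0)
        z = rng.gauss(0.0, 1.0)
        fake = z + b

        # 2. Discriminator: gradient ascent on log D(x) + log(1 - D(G(z))).
        d_real, d_fake = sigmoid(w * x + c), sigmoid(w * fake + c)
        w += lr * ((1.0 - d_real) * x - d_fake * fake)
        c += lr * ((1.0 - d_real) - d_fake)

        # 3. Generator: gradient ascent on log D(G(z)) (the alternative objective).
        d_fake = sigmoid(w * fake + c)
        b += lr * (1.0 - d_fake) * w
    return w, c, b


w, c, b = train_toy_gan()
print(f"w={w:.3f} c={c:.3f} generator shift b={b:.3f}")
```

The generator shift b is expected to drift toward the real mean, but GAN dynamics can oscillate, so no particular final value is claimed here.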

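6. 🧠 Mathematical interpretation of the latent space z (Vector Arithmetic)

The generator input z is not just noise; it encodes attributes of the image.

📈 Interpolation

Linear interpolation between z1 and z2 → the generated image changes smoothly

➕ Vector arithmetic example

  z(Smiling Woman) - z(Neutral Woman) + z(Neutral Man) ≈ z(Smiling Man)

That is, the "smile" attribute can be extracted on its own and added to a different face.

👓 Glasses attribute example

  z(With Glasses Man) - z(No Glasses Man) ≈ z(Glasses)
  z(Glasses) + z(No Glasses Woman) → z(Woman With Glasses)

One characteristic of GAN is that the latent vector z lives in a space where image attributes can be manipulated arithmetically.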
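
The interpolation and vector arithmetic above amount to simple list operations on latent vectors. In the sketch below the 3-dimensional vectors are invented for illustration (real GAN latents are much larger), and no generator is included, so the output is only the latent code that would be fed to a trained G.

```python
def mean_vector(vectors):
    """Average a list of latent vectors dimension by dimension."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def interpolate(z1, z2, steps):
    """Return `steps` latent vectors moving linearly from z1 to z2."""
    return [
        [(1.0 - t) * a + t * b for a, b in zip(z1, z2)]
        for t in (i / (steps - 1) for i in range(steps))
    ]


# Hypothetical latent vectors that produced each kind of image, averaged per category.
smiling_woman = mean_vector([[1.0, 0.9, 0.0], [1.0, 1.1, 0.2]])
neutral_woman = mean_vector([[1.0, 0.0, 0.0], [1.0, 0.0, 0.2]])
neutral_man = mean_vector([[-1.0, 0.0, 0.1], [-1.0, 0.0, 0.1]])

smile = [a - b for a, b in zip(smiling_woman, neutral_woman)]   # isolate the attribute
smiling_man = [a + b for a, b in zip(neutral_man, smile)]       # add it to another face
print(smiling_man)
print(interpolate(neutral_man, smiling_man, steps=3))
```

Averaging several vectors per category before subtracting follows the lecture's description and makes the extracted attribute direction less noisy than using a single sample.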

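7. 🎨 DCGAN: improving the Generator with a CNN-based structure

The simple MLP structure of the original GAN → replaced with a CNN (Deep Convolutional GAN)
Produces sharper and more realistic images
Both the Generator and the Discriminator use convolution layers

8. 🔍 Summary

Item	Description
Training	Competitive optimization between generator and discriminator (Minimax Game)
Input	Generator: noise vector z ∼ N(0,1)
Output	Images that look real
Training goal	Discriminator: improve real/fake classification accuracy; Generator: produce samples that fool the Discriminator
Key property	Can be trained without defining the distribution explicitly
Applications	Image generation, style transfer, image interpolation, attribute manipulation, etc.
Latent space	Attribute vectors can be combined; the structure is mathematically meaningful
Limitations	Unstable training, mode collapse, vanishing gradients

✅ Closing notes

This lecture explained the structure and training mechanism of GAN, which can generate data without an explicit probability distribution.
The next lecture will likely cover GAN variants and improvements, pointing toward practical extensions of generative modeling.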

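1. 🎯 An analogy that builds structural intuition for GAN

The lecture explains GAN as a "counterfeiter vs police" game.
- Counterfeiter (Generator): makes fake data that looks real in order to fool the police.
- Police (Discriminator): tells fake from real.
The analogy conveys the two-sided competitive training and why GAN is called "adversarial".
The two networks are trained alternately and repeatedly, so performance improves gradually.

2. 🔍 Mathematical structure of the objective and the role of each term

The meaning of each term of the objective min_G max_D V(D, G) is explained separately:

E_{x∼p_data(x)}[log D(x)]: the discriminator's output on real data, which it wants close to 1.
E_{z∼p(z)}[log(1 − D(G(z)))]: the discriminator's output on generated data, which it wants close to 0.

The generator minimizes this value to fool the discriminator, while the discriminator maximizes it to classify correctly.

3. ⚙️ The problem in early generator training and how it is solved

Problem: early in training the generated data is of low quality, so the discriminator recognizes fakes too easily → the gradient is close to 0 and learning is slow.
Solution:
- Instead of minimizing the original log(1 − D(G(z))), switch to an alternative objective that maximizes log D(G(z)).
- This makes the gradient large enough even in the early phase, so learning is fast.
This change is a common technique in practical GAN implementations and contributes to training stability.

4. 🧬 Arithmetic on the latent vector space z

The latent variable z used in GAN is not just a sampling device; it can be interpreted as a meaningful attribute vector.
Vector arithmetic examples:
- z(smiling woman) - z(neutral woman) + z(neutral man) ≈ z(smiling man)
- z(man with glasses) - z(man without glasses) + z(woman without glasses) ≈ z(woman with glasses)
This makes it possible to adjust attributes of the generated image arithmetically, with applications such as style transfer, attribute manipulation and interactive image generation.
This is notable because it is a different kind of interpretable structure from the VAE latent space.

5. 🔄 Sequential optimization flow of training

GAN training is not the training of a single model but iterative alternating training of two networks.
1. Discriminator: maximize real/fake classification performance
2. Generator: train so that fake images are recognized as real
Unlike ordinary neural network training, this is a non-stationary (dynamic) optimization, and keeping the two networks in balance is an important point in practice.

6. 🧪 Experimental results and their interpretation

Among the example results generated by GAN, some images look very realistic while others are of low quality.
DCGAN (Deep Convolutional GAN) was proposed to address this:
- It introduces a CNN structure in both the generator and the discriminator, which improves image quality.
- It performs especially well for high-resolution image generation and face generation.

✨ Key integrated insights (points that differ from the earlier summary)

Topic	Insight
🎮 Training intuition	GAN training is modeled as a game between two competing roles (counterfeiting vs detection)
📉 Redesigned objective	Changing the formula to fix the early vanishing-gradient problem leads to real performance gains
🧠 Latent space interpretation	The latent vector space z supports vector arithmetic and represents image attributes clearly
🔁 Optimization strategy	Generator ↔ Discriminator are optimized alternately until a dynamic balance is reached
🎨 Expressive range	Rich representation-based applications such as style transfer, attribute manipulation and image interpolation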

00:00:00

Following the previous lecture let's take a look at the generated model as mentioned before the goal of the generating model is generating new samples from same description of the given training data. So it will address density estimation which is a core problem in unsupervised learning. So in this problem we want to make our model can estimate p of x similar to p of x driv from the real training data. So just imagine that our generative model well estimate the property disples. We can generate new samples

00:00:58 followed by the estimated probability which will be similar to the original training samples. different branches in generative models. The first branch is measuring the split density p of x. In the tractable density approaches, we can directly measure the p of x with rn or cn model. On the other hand in the approximate density approaches we can approximate the p of x with given parameter. So in the previous lecture we check the varational of encoder. In the method we assume that original priority

00:02:01 distribution will be followed thean distribution. So this is the difference between the tractable density approach and approximate density approaches. You may remember that we directly optimize the likelihood of training data by using RNN or CNN. On the other hand in the varational encoder we have the interctable density function with returned variable C. We saw that this function is interactable density because of the integral function which means we need to consider every possible Z. However, in the

00:03:03 previous lecture we saw that the base rule and some equation transformation will lead the lower bound of the likelihood. and by optimizing the rund with the varational encoder structure we can update and optimize the encoder and decoder networks together so that we can generate new sample data followed by the original property distribution of the training data the main implicit approaches making our ability to generate new samples without exp the density functions so now let's see the generative adversal

00:04:08 networks which is one of the most widely used implicit density approaches adopt game theoretic approaches so it does not require any split density function and it just run to generate new samples from training dision through player game. Then let's see the details of the two player game. Let's suppose that there are two player. The first one is counterf and the next one is police. In this game the counterfilter empt to fold the police with the fake money. On the other hand the police spot real money from the

00:05:03 fake money. The counter filter work as generator networks. So the generator networks try to f the discriminator by generating new data and the police work as discriminator networks. So the discriminator networks try to distinguish between real and fake data through the interaction between the generator and discrutor network. They try to improve their own ability. With the two player game, the generator networks can make fake data that is more and more real. Also the discor networks improve their

00:06:02 ability to distinguish between real and more real looking fake data. So by playing the two player game iteratively both the generator and discriminator methods improved their ability. So let's see the details of the generator networks. So the generator networks take random noise vectors as input. Also you may know that the goal of the generator network is making a new samples follow the training sample priority distribution. Now let's see the details of the discriminator networks. The discriminator

00:06:57 networks will take both the fake data from the generator networks and the real training data together. the discriminator networks will determine whether each of the input data is data or fake data generated by the generator networks. So this is the world structure of the model. So now let's see the objective function of the model. So this is the objective function of the game. So in here the x means that the input data x is generated from the distribution of the actual training data. Then the first term

00:08:01 represent the discriminator networks objective function which means the likelihood of real images. So when we put the real data to the discor networks, we want that the [음악] discor close to one because that means the discror will classify the real data. Now let's see the second term. In here the Z is generated by following the prior of the G. the generator networks will making a new fake image by taking the input z and then the fake images will be fed to the discriminator networks to determine whether

00:09:15 the fake image is real or fake. In this case we want that the disor determines that the fake data is wake that means we want that the disrator output value to zer so when we see the over objective function the discriminator want to maximize and minim value but you can see that there is a minus sign so theor actually want to maximize this term and this term together so in the case of the disctor this should ma to wellinate the real and fake data together. On the other hand, the generator want to f the

00:10:35 discriminator. means the generator want to make the disor generate a value closed to one because that means the discriminator is full into thinking the generated data is so when we see the objective function the first is not related to the generator. generator want to minimize the overall second term in summary the discor want to maximize the objective function to well discriminate the real and fake data while the generator networks want to minimize the objective function to fold This is why the ised

00:11:43 as max game. To serve the max game, we alternatively optimize the discriminator and generator and by iteratively optimizing the discriminator and generator respectively, both the networks will improve. their own ability by using this kind of max optimization we can make the disminator well discriminate the real and fake data and we can make the generator networks generate fake data that looks like real data however you should note that there is a technical problem. Let's see the objective functions of the

00:12:43 generator networks. This function looks like this graph. So the girl of the generator networks is go from left to right to make the discriminator generate a value close to 1. So let's see the gradient descent on the generator objective function. In the first stage we want to make go more fast to the right. However, in the first stage the gradient is relatively flat because of that in the early stage of the generator network training process the speed of the training is very slow. To solve the problem, we will

00:13:56 change this function. So we only use this term. You may check that there is minus sign. So the minimization problem will be changed to the maximization process. So this is the changed objective function for the generator networks. So the graph of the change is shown like this. In the graph the gradient in the only stage of the function is relatively high. So the optimization process in the only stage is well worked instead of the first case. So this is the stock code for the G training algorithm.

00:15:07 So we first prepare the real training data and the noise sample from the prior priority of the data we first train the discriminator networks by maximizing the objective function and then we generative networks by [음악] maximizing the second term of the objective function and we will iteratively optimize the discriminator and generator respectively with this kind of optimizing process we can train the discriminator and generator generator generate new sample data which are looks like data so these are

00:16:24 the results of the generated samples by the game model so you can see that we can generate this kind of digit data or face data by using the networks. These are the result of the G networks from CIFA data. In the result, you can see that some of the generated data looks very real object. However, you can see that it is hard to say that the rest of the data are well generated. To solve the problem, the deep convolutional is proposed. In the approaches, the generator networks are constructed with the

00:17:29 CNN structure, which shows strong performance in image processing. With this approach we can generate new images that look very real. This represents one of the main characteristics of DCGAN. Now let's see what interpolation is. As you know, the generator takes random noise z as input and generates new sample images. Now let's change some values of z step by step. Then the generated image changes step by step like this. Now you can see that the

00:18:39 generated images change step by step with interpolation. By changing z step by step, this image is gradually converted into that image. That means that, with a trained generator network, each point of the latent vector z carries important characteristics of the images. So the main characteristic of this GAN is that we can interpret the vector z mathematically. Let's see another example. Here our generator network can generate smiling woman, neutral woman, and neutral man images. Now after

00:19:47 averaging the vectors z within each category, we apply subtraction and addition. When we perform the first subtraction, the smiling woman vector minus the neutral woman vector, we are left with only the smiling component, and when we add that smiling component to the neutral man vector we can generate a smiling man. So this is the main characteristic of this GAN: from the results we can see that the latent vectors z are mathematically interpretable. Let's see another example. Here we have the vectors for

00:21:00 man with glasses, man without glasses, and woman without glasses images. With the first subtraction we obtain a vector that contains the information of glasses. By adding it to a no-glasses vector we can generate an image with glasses, like this. With this characteristic of the GAN we can change the style of our images, so lots of applications are being proposed these days. So far we have seen several generative models. Some of them belong to the explicit density approaches, like PixelRNN, PixelCNN, or

00:22:15 the variational autoencoder. The difference between these models is whether the probability distribution is tractable or intractable. In this lecture we also saw a different type of generative model that follows the implicit approach. Please let me know if you have any questions about generative models. Thank you.

Summary

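This lecture gives an in-depth explanation of generative models, whose main goal is to learn the distribution of the given training data and generate new samples that resemble it. This is density estimation, a core problem of unsupervised learning: a generative model estimates a probability distribution p(x) that is close to the true data distribution. Explicit density approaches split into two kinds: those that compute a tractable density directly, and those that use an approximate density. The variational autoencoder (VAE) is the representative example of the approximate kind; it optimizes an encoder network and a decoder network to maximize a lower bound on the density.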
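For reference, the lower bound mentioned above is usually written as the evidence lower bound (ELBO). The notation below is an assumption, not taken from the lecture: $q_\phi(z \mid x)$ stands for the encoder, $p_\theta(x \mid z)$ for the decoder, and $p(z)$ for the prior over the latent variable.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```

The first term rewards accurate reconstruction of $x$ from $z$, and the second keeps the encoder's distribution close to the prior. Training maximizes the right-hand side with respect to both $\theta$ and $\phi$.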

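Alongside these, the generative adversarial network (GAN) is the most widely used implicit density approach. It borrows a game-theoretic view: two networks, a generator and a discriminator, learn by competing. The generator produces fake data realistic enough to fool the discriminator, the discriminator tries to tell real from fake, and the competition improves both. The GAN objective function defines the optimization problem for both networks, and it includes a modification that speeds up generator learning in the early stage.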
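The objective described here can be written out as follows. This is the standard minimax formulation from the original GAN paper, since the lecture slides themselves are not reproduced in these notes. The symbols $p_{\text{data}}$ (data distribution) and $p_z$ (noise prior) are the usual notation and are assumed here.

```latex
% Two-player minimax game
\min_G \max_D \; V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]

% Modified generator objective used to speed up early training
\max_G \; \mathbb{E}_{z \sim p_z}\big[\log D(G(z))\big]
```

The discriminator maximizes $V$, pushing $D(x)$ toward 1 on real data and $D(G(z))$ toward 0 on generated data. In the original form the generator minimizes the second term; in the modified form it instead maximizes $\log D(G(z))$.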

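Introducing the deep convolutional GAN (DCGAN) architecture greatly improves image quality, and manipulating the latent vector z enables interpolation, in which image characteristics change continuously. Vector arithmetic in this latent space shows that concrete attributes such as "smiling woman" or "man with glasses" can be manipulated mathematically.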
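The interpolation idea can be sketched in a few lines. This is a minimal illustration, not the lecture's code: it assumes plain linear interpolation between two latent vectors, and the function names `lerp` and `interpolation_path` are hypothetical. A trained generator, not included here, would turn each intermediate vector into an image.

```python
def lerp(z_start, z_end, t):
    """Linear interpolation between two latent vectors for t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)]


def interpolation_path(z_start, z_end, steps):
    """Return `steps` latent vectors moving evenly from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]


# Toy 4-dimensional latent vectors (real models use e.g. 100 dimensions).
z_a = [0.0, 1.0, -1.0, 0.5]
z_b = [1.0, 0.0, 1.0, 0.5]
for z in interpolation_path(z_a, z_b, steps=5):
    print([round(v, 2) for v in z])
```

Feeding each vector on the path to the generator gives the step-by-step image sequence described in the lecture.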

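In conclusion, generative models divide into explicit density models (tractable or approximate) and implicit density models, and each approach has strengths and weaknesses. GAN has become an important technique for generating realistic images from complex data distributions.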

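Highlights
🎯 The goal of a generative model is to generate new samples that mimic the real data distribution.
📊 Explicit density estimation methods divide into tractable density functions and approximate density functions.
🔄 GAN follows a competitive two-player game between a generator and a discriminator.
⚙️ The GAN objective function describes the joint optimization of the generator and the discriminator.
🚀 The generator objective is modified to fix slow generator learning in the early stage.
🖼️ DCGAN is a CNN-based model that greatly improves generated image quality.
🎨 Latent-space vector arithmetic enables continuous changes of image attributes and style changes.

Key Insights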

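🔍 The link between density estimation and unsupervised learning
Density estimation is the problem of finding the underlying distribution of the data and is central to unsupervised learning. Generative models estimate the density function exactly or approximately with the goal of reproducing data, and there is a trade-off between model complexity and computational efficiency.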

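🤖 The game-theoretic approach of GANs
A GAN improves through a game in which two neural networks compete. The structure borrows the idea of a zero-sum game: whenever one side improves, the other must strengthen its strategy, so both get better.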

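📈 The optimization problem in the GAN objective and its fix
Early in training the generator's gradient is very small, so learning is slow. The objective is redesigned so that the gradient is larger in that regime, which improves training efficiency. This is an important technical improvement in practical implementations.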
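A small numerical check makes the difference concrete. The sketch below is illustrative only. It treats the discriminator output d = D(G(z)) as a single number and compares the slope of the original generator loss log(1 - d) with the slope of the modified loss -log(d). The function names are hypothetical.

```python
import math


def saturating_loss(d):
    """Original generator loss (minimized): log(1 - D(G(z)))."""
    return math.log(1.0 - d)


def non_saturating_loss(d):
    """Modified generator loss (minimized): -log(D(G(z)))."""
    return -math.log(d)


def numerical_grad(f, d, eps=1e-6):
    """Central-difference estimate of df/dd."""
    return (f(d + eps) - f(d - eps)) / (2.0 * eps)


# Small d means the discriminator easily rejects fakes (early training).
for d in (0.01, 0.1, 0.5, 0.9):
    g_sat = numerical_grad(saturating_loss, d)
    g_ns = numerical_grad(non_saturating_loss, d)
    print(f"D(G(z))={d:.2f}  original={g_sat:9.3f}  modified={g_ns:9.3f}")
```

Analytically the slopes are -1/(1 - d) and -1/d. Near d = 0 the first stays close to -1 while the second grows large in magnitude, which matches the "flat" and "steep" early-stage curves described in the lecture.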

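🧠 Interpretability and applications of the latent space (z)
Manipulating the GAN's latent space allows fine adjustment of the attributes of generated images, and arithmetic between vectors can change image style, which opens up creative applications. Generative models can therefore be used not only for plain image generation but also for style transfer and attribute manipulation.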
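The "smiling woman minus neutral woman plus neutral man" example can be sketched with toy numbers. Everything below is made up for illustration: the 3-dimensional vectors, the meaning assigned to each dimension, and the helper names. In a real DCGAN the latent vectors would come from samples whose generated images were labeled by hand.

```python
def mean_vector(vectors):
    """Average a list of equal-length latent vectors, dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


# Toy latent codes: [smile, gender (+1 woman, -1 man), unrelated noise].
smiling_woman = [[1.0, 1.0, 0.2], [0.8, 1.2, -0.2], [1.2, 0.8, 0.0]]
neutral_woman = [[0.0, 1.0, 0.1], [0.2, 0.8, -0.1], [-0.2, 1.2, 0.0]]
neutral_man = [[0.0, -1.0, 0.3], [0.1, -1.1, -0.3], [-0.1, -0.9, 0.0]]

# Averaging within each category suppresses sample-specific noise.
smile_direction = sub(mean_vector(smiling_woman), mean_vector(neutral_woman))
smiling_man_z = add(mean_vector(neutral_man), smile_direction)
print([round(v, 2) for v in smiling_man_z])
```

Passing `smiling_man_z` to the trained generator (not included here) would be expected to give a smiling man image, which is the result the lecture shows.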

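🖼️ Image quality improvement with DCGAN
DCGAN combines the original GAN with a standard CNN architecture, and the sample results shown in the lecture indicate that generated images become much more detailed and realistic. It is a representative success in combining the strengths of vision networks with adversarial training.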

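⚡ Iterative, alternating training of the two networks
The generator and the discriminator are optimized in turn, and training is repeated with the aim of reaching an equilibrium. Unlike training a single model, the two networks compete while each one's progress provides the learning signal for the other.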
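The alternating schedule can be sketched on a toy one-dimensional problem. This is a minimal illustration under strong assumptions, not the lecture's pseudocode: the generator is the linear map G(z) = a·z + b, the discriminator is a logistic unit D(x) = sigmoid(w·x + c), the "real" data are drawn from N(4, 1), each update uses a single sample, and gradients are written by hand. All names and hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=500, lr=0.05, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters: G(z) = a * z + b
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        x_real = rng.gauss(4.0, 1.0)  # one "real" sample (toy data)
        z = rng.gauss(0.0, 1.0)       # noise from the prior
        x_fake = a * z + b

        # Discriminator step: gradient ascent on log D(x) + log(1 - D(G(z))).
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w += lr * ((1.0 - d_real) * x_real - d_fake * x_fake)
        c += lr * ((1.0 - d_real) - d_fake)

        # Generator step: gradient ascent on log D(G(z)), the modified objective.
        z = rng.gauss(0.0, 1.0)
        x_fake = a * z + b
        d_fake = sigmoid(w * x_fake + c)
        grad_out = (1.0 - d_fake) * w  # d log D(G(z)) / d G(z)
        a += lr * grad_out * z
        b += lr * grad_out
    return a, b, w, c


a, b, w, c = train_toy_gan()
print(f"generator: a={a:.3f}, b={b:.3f}; discriminator: w={w:.3f}, c={c:.3f}")
```

The sketch only shows the update order: discriminator first, then generator, repeated. Convergence is not guaranteed even in this toy setting, and real implementations use mini-batches and automatic differentiation.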

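🌐 Classification of generative models and how the approaches differ
Generative models divide into explicit density approaches, where the density is computed directly or approximated, and implicit approaches. Each has strengths and weaknesses, so it is important to choose a model suited to the goal. GAN belongs to the latter group and shows great potential in experiments and applications.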

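In short, this lecture covers generative models from theory to practical application, including key optimization techniques and design ideas, and provides material that is essential for modern deep learning research.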

License: Attribution-NonCommercial
