Output a scalar:
Linear regression:
$y = Wx + b = \sum\limits_{i=1}^n w_i x_i + b$,
$L = \sum\limits_{i=1}^n (y_i - \hat{y}_i)^2$.
Polynomial regression:
$y = \sum\limits_{i=1}^n w_i x^i + b$.
Logistic regression (output probability):
$y = \sigma(Wx + b) = \frac{1}{1 + e^{-\sum\limits_{i=1}^n w_i x_i - b}}$,
$L = -\sum\limits_{i=1}^n y_i \log(\hat{y}_i)$.
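A minimal NumPy sketch of these models and losses (variable names and shapes are illustrative assumptions, not a fixed API):

```python
import numpy as np

def linear_regression(W, b, x):
    # y = W x + b
    return W @ x + b

def polynomial_regression(w, b, x, n):
    # y = sum_{i=1}^{n} w_i * x^i + b for a scalar input x
    return sum(w[i - 1] * x ** i for i in range(1, n + 1)) + b

def logistic_regression(W, b, x):
    # y = sigmoid(W x + b), an output probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

def squared_error(y, y_hat):
    # L = sum_i (y_i - y_hat_i)^2
    return np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat):
    # L = -sum_i y_i * log(y_hat_i)
    return -np.sum(y * np.log(y_hat))
```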
If the model cannot even fit the training data,
then the model has large bias (underfitting).
If the model can fit the training data but not the testing data,
then the model has large variance (overfitting).
To prevent underfitting, we can:
Add more features as input.
Use a more complex and flexible model.
A more complex model does not always lead to better performance
on testing data or new data.
| Model | Training Error | Testing Error |
| :---: | :---: | :---: |
| $x$ | 31.9 | 35.0 |
| $x^2$ | 15.4 | 18.4 |
| $x^3$ | 15.3 | 18.1 |
| $x^4$ | 14.9 | 28.2 |
| $x^5$ | 12.8 | 232.1 |
An extreme example:
the following function obtains $0$ training loss but large testing loss:

$$
f(x)=\begin{cases}
y_i, & \exists{x_i}\in{X} \\
\text{random}, & \text{otherwise}
\end{cases}
$$
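A toy sketch of such a memorizing function, implemented as a plain lookup table over the training pairs (names are illustrative):

```python
import random

def make_memorizer(train_x, train_y):
    # Store every training pair verbatim: 0 training loss by construction.
    table = dict(zip(train_x, train_y))

    def f(x):
        # Memorized label if x was seen during training, otherwise a random
        # guess, which leads to a large testing loss.
        return table.get(x, random.random())

    return f
```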
To prevent overfitting, we can:
More training data.
Data augmentation: crop, flip, rotate, cutout, mixup.
Constrained model:
Fewer parameters, parameter sharing.
Fewer features.
Early stopping.
Dropout.
Regularization.
$$
\begin{split}
L(w)&=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2+\lambda\sum\limits_{i=1}^n{w_i^2}\\
w_{t+1}&=w_t-\eta\nabla{L(w)}\\
&=w_t-\eta\left(\frac{\partial{L}}{\partial{w}}+\lambda{w_t}\right)\\
&=(1-\eta\lambda)w_t-\eta\frac{\partial{L}}{\partial{w}}
\quad (\text{Regularization: Weight Decay})
\end{split}
$$
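A sketch of one gradient-descent step in the weight-decay form above, assuming `grad` is the precomputed gradient of the unregularized loss (the values of `eta` and `lam` are illustrative):

```python
import numpy as np

def weight_decay_step(w, grad, eta=0.01, lam=1e-4):
    # w_{t+1} = (1 - eta * lambda) * w_t - eta * dL/dw
    return (1.0 - eta * lam) * w - eta * grad
```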
Binary classification:
$y = \delta(Wx + b)$,
$L = \sum\limits_{i=1}^n \delta(y_i \ne \hat{y}_i)$,
e.g. spam filtering.
Multi-class classification:
$y = \text{softmax}(Wx + b)$,
$L = -\sum\limits_{i=1}^n y_i \log(\hat{y}_i)$,
e.g. document classification.
Non-linear model:
Deep learning: $y = \text{softmax}(\text{ReLU}(Wx + b))$ (see the sketch after this list),
e.g. image recognition, game playing.
Support vector machine (SVM): $y = \text{sign}(Wx + b)$.
Decision tree: $y = \text{vote}(\text{leaves}(x))$.
K-nearest neighbors (KNN): $y = \text{vote}(\text{neighbors}(x))$.
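A NumPy sketch of the softmax classifier and the single-hidden-layer deep-learning form above (sizes and names are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def multiclass_predict(W, b, x):
    # y = softmax(W x + b)
    return softmax(W @ x + b)

def deep_predict(W, b, x):
    # y = softmax(ReLU(W x + b)); real networks stack more such layers
    return softmax(np.maximum(0.0, W @ x + b))
```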
Find a function $F$:
$F: X \times Y \to \mathbb{R}$
$F(x, y)$ evaluates how well $y$ fits $x$ (object compatibility).
Given an object $x$:
$\tilde{y} = \arg\max\limits_{y \in Y} F(x, y)$
Evaluation: what does $F(x, y)$ look like.
Inference: how to solve the $\arg\max$ problem.
Training: how to find $F(x, y)$ with the given training data.
$$
\begin{split}
F(x, y)&=\sum\limits_{i=1}^n{w_i\phi_i(x, y)} \\
&=\begin{bmatrix}w_1\\w_2\\w_3\\\vdots\\w_n\end{bmatrix}\cdot
\begin{bmatrix}\phi_1(x, y)\\\phi_2(x, y)\\\phi_3(x, y)\\\vdots\\\phi_n(x, y)\end{bmatrix}\\
&=W\cdot\Phi(x, y)
\end{split}
$$
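A brute-force sketch of evaluation and inference with this linear $F$ (the feature map `phi` and the finite candidate set are assumptions):

```python
import numpy as np

def score(W, phi, x, y):
    # F(x, y) = W . Phi(x, y)
    return np.dot(W, phi(x, y))

def infer(W, phi, x, candidates):
    # y~ = argmax_{y in Y} F(x, y), here by enumerating a finite candidate set Y
    return max(candidates, key=lambda y: score(W, phi, x, y))
```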
Principal component analysis (PCA) is a commonly used dimensionality reduction method that reduces $m$ vectors of dimension $n$ to dimension $k$.
Its goal is to choose $k$ orthogonal unit basis vectors (norm $1$) such that, after the original data is transformed onto this basis,
the pairwise covariance between fields is $0$ (the fields are completely independent) and the variance of each field is as large as possible (the reduced data is spread out as much as possible).
That is, under the orthogonality constraint, take the $k$ largest variances:
$$
C=\frac{1}{m}XX^T=\frac{1}{m}\begin{bmatrix}
\sum_{i=1}^m(x_i^1)^2&\sum_{i=1}^m{x_i^1x_i^2}&\dots&\sum_{i=1}^m{x_i^1x_i^n}\\
\sum_{i=1}^m{x_i^2x_i^1}&\sum_{i=1}^m(x_i^2)^2&\dots&\sum_{i=1}^m{x_i^2x_i^n}\\
\vdots&\vdots&\ddots&\vdots\\
\sum_{i=1}^m{x_i^nx_i^1}&\sum_{i=1}^m{x_i^nx_i^2}&\dots&\sum_{i=1}^m(x_i^n)^2
\end{bmatrix}
$$
The covariance matrix $C$ is symmetric: its diagonal entries are the variances of the individual fields,
and the element in row $i$, column $j$ equals the element in row $j$, column $i$, representing the covariance between fields $i$ and $j$.
Diagonalizing the covariance matrix gives the matrix-based PCA algorithm:
Arrange the original data by columns into an $n \times m$ matrix $X$.
Zero-center each row of $X$ (each row is an attribute field) by subtracting the row mean,
so that $\bar{x}=0$, which simplifies the variance and covariance computations.
Compute the eigenvalues of the covariance matrix $C=\frac{1}{m}XX^T$ and their corresponding eigenvectors.
Arrange the eigenvectors as rows, from top to bottom by decreasing eigenvalue, and take the first $k$ rows to form the matrix $P$.
$Y=PX$ is the data reduced to $k$ dimensions.
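A NumPy sketch of this matrix-based PCA via eigendecomposition of the covariance matrix (names are illustrative):

```python
import numpy as np

def pca(X, k):
    # X: n x m matrix, one n-dimensional sample per column.
    X = X - X.mean(axis=1, keepdims=True)   # zero-center each row (field)
    C = X @ X.T / X.shape[1]                # covariance matrix C = (1/m) X X^T
    eigvals, eigvecs = np.linalg.eigh(C)    # eigh since C is symmetric
    order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
    P = eigvecs[:, order[:k]].T             # top-k eigenvectors as rows
    return P @ X                            # Y = P X, the k-dimensional data
```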
$x'_i=\frac{x_i-\mu}{\sigma}$
Word embedding is a technique in natural language processing (NLP)
that maps words into a real-valued vector space, so that semantic relationships between words can be represented by distances in that space.
Variational autoencoders (VAEs) are generative models that generate new data by learning the latent distribution of the data:
$$
\begin{split}
Z&=\text{Encoder}(X) \\
X'&=\text{Decoder}(Z) \\
L&=\text{Min Loss}(X',X)
\end{split}
$$
A variational autoencoder learns the probability distribution of the latent variables (features) $Z$: $z\sim N(0, I)$, $x|z\sim N\big(\mu(z), \sigma(z)\big)$.
A deep network learns the parameters of $q(z|x)$, and $q$ is optimized step by step until it closely approximates $p(z|x)$, so it can be used for approximate inference over complex distributions:
Feature disentanglement:
voice conversion.
Discrete representation:
unsupervised classification, unsupervised summarization.
Anomaly detection:
face detection, fraud detection, disease detection, network intrusion detection.
Compression and decompression.
Generator.
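A minimal PyTorch sketch of a VAE with the reparameterization trick (layer sizes and the Bernoulli reconstruction loss are assumptions, not a fixed recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)        # mean of q(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)    # log variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction loss + KL(q(z|x) || N(0, I))
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```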
Generative adversarial networks (GANs) consist of two networks: a generator and a discriminator.
The generator aims to produce data that is as realistic as possible, while the discriminator aims to distinguish real data from generated data as accurately as possible.
The two networks compete with each other: the generator produces data (like the decoder in a VAE), the discriminator judges whether the data is real or fake (a $1/0$ classification neural network),
and the generator adjusts its generation strategy according to the discriminator's feedback, steadily improving the realism of the generated data.
$$
\begin{split}
G^*&=\arg\min_G\max_DV(G,D)\\
D^*&=\arg\max_DV(D,G)
\end{split}
$$
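A sketch of one alternating training step for this minimax game (the MLP generator/discriminator and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_step(real):
    n = real.size(0)
    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    fake = G(torch.randn(n, z_dim)).detach()
    loss_d = bce(D(real), torch.ones(n, 1)) + bce(D(fake), torch.zeros(n, 1))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    # Generator step: fool D, i.e. push D(G(z)) toward 1.
    fake = G(torch.randn(n, z_dim))
    loss_g = bce(D(fake), torch.ones(n, 1))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```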
Pre-trained models + fine-tuning (downstream tasks):
Cross-lingual.
Cross-discipline.
Pre-training with artificial data.
Long context window.
Content filtering: remove harmful content.
Text extraction: remove HTML tags.
Quality filtering: remove low-quality content.
Document deduplication: remove duplicate content.
BERT (Bidirectional Encoder Representations from Transformers) is an Encoder-only pre-trained model
that learns the semantic information of text through large-scale unsupervised learning and is then fine-tuned for downstream tasks:
Masked token prediction: randomly mask some words in the input text and predict the masked words.
Next sentence prediction: predict the ordering relationship between two sentences.
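A toy sketch of how inputs for masked token prediction could be prepared (the 15% masking rate and token handling are simplified assumptions; BERT additionally replaces some selected tokens with random or unchanged tokens):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, p=0.15):
    # Randomly mask about a fraction p of the tokens; the model is trained to
    # predict the original token at each masked position.
    masked, labels = [], []
    for tok in tokens:
        if random.random() < p:
            masked.append(MASK)
            labels.append(tok)     # prediction target
        else:
            masked.append(tok)
            labels.append(None)    # position not predicted
    return masked, labels
```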
GPT (Generative Pre-trained Transformers) is a Decoder-only pre-trained model.
Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning technique.
Its basic idea is to freeze the original matrix $W_0\in\mathbb{R}^{H\times{H}}$
and approximate the parameter update matrix $\Delta{W}=A\cdot{B^T}$
with the low-rank decomposition matrices $A\in\mathbb{R}^{H\times{R}}$ and $B\in\mathbb{R}^{H\times{R}}$,
where $R\ll{H}$ is the reduced rank:
$$
W=W_0+\Delta{W}=W_0+A\cdot{B^T}
$$
During fine-tuning, the original matrix parameters $W_0$ are not updated;
the low-rank decomposition matrices $A$ and $B$ are the trainable parameters used to adapt to downstream tasks.
LoRA fine-tuning significantly reduces the cost of model training while preserving model quality.
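A PyTorch sketch of a LoRA-adapted linear layer under the formulation above ($A$ initialized randomly, $B$ initialized to zero so that $\Delta W = 0$ at the start; sizes are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, H, R):
        super().__init__()
        # Frozen original weight W0 (H x H): receives no gradient updates.
        self.W0 = nn.Parameter(torch.randn(H, H), requires_grad=False)
        # Trainable low-rank factors A, B (H x R), so Delta W = A B^T.
        self.A = nn.Parameter(torch.randn(H, R) * 0.01)
        self.B = nn.Parameter(torch.zeros(H, R))

    def forward(self, x):
        # W = W0 + A B^T; apply the adapted weight to the input.
        W = self.W0 + self.A @ self.B.T
        return x @ W.T
```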
Making the model understand human instructions that do not appear in the training data:
Increasing the complexity and diversity of instructions improves model performance.
A larger parameter scale helps improve the model's instruction-following ability.
Reinforcement learning is a machine learning method in which an agent interacts with the environment,
adjusts its policy according to the environment's feedback, and uses gradient ascent
to maximize the long-term reward (learn from rewards and mistakes).
$$
\begin{split}
\theta^*&=\arg\max\limits_\theta\bar{R}_\theta=\arg\max\limits_\theta\sum\limits_{\tau}R(\tau)P(\tau|\theta)\\
\theta_{t+1}&=\theta_t+\eta\nabla\bar{R}_\theta\\
\nabla\bar{R}_\theta&=\begin{bmatrix}\frac{\partial\bar{R}_\theta}{\partial{w_1}}\\\frac{\partial\bar{R}_\theta}{\partial{w_2}}\\\vdots\\\frac{\partial\bar{R}_\theta}{\partial{b_1}}\\\vdots\end{bmatrix}\\
R_t&=\sum\limits_{n=t}^N\gamma^{n-t}r_n
\end{split}
$$
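A REINFORCE-style sketch of one policy-gradient update using the discounted return $R_t$ above (the log-probabilities are assumed to come from a policy network; gradient ascent is implemented as descent on the negated objective):

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    # R_t = sum_{n=t}^{N} gamma^(n-t) * r_n, computed backwards in one pass.
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))

def reinforce_step(log_probs, rewards, optimizer, gamma=0.99):
    # Gradient ascent on R_bar: maximize sum_t R_t * log pi_theta(a_t | s_t).
    returns = torch.tensor(discounted_returns(rewards, gamma))
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```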