Supervised Learning

Regression

Output a scalar:

  • Linear regression: $y=Wx+b=\sum\limits_{i=1}^n w_ix_i+b$, $L=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2$.
  • Polynomial regression: $y=\sum\limits_{i=1}^n w_ix^i+b$.
  • Logistic regression (output probability): $y=\sigma(Wx+b)=\dfrac{1}{1+e^{-\sum_{i=1}^n w_ix_i-b}}$, $L=-\sum\limits_{i=1}^n y_i\log(\hat{y}_i)$.
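The two losses above can be sketched in a few lines of pure Python (the weights and toy data below are made up for illustration):

```python
import math

def mse(y_true, y_pred):
    # L = sum_i (y_i - y_hat_i)^2, the squared-error loss for linear regression
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z}), squashes a score into a probability
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(y_true, y_prob):
    # L = -sum_i y_i * log(y_hat_i), the log loss for logistic regression
    return -sum(y * math.log(p) for y, p in zip(y_true, y_prob) if y > 0)

# Toy 1-D model (made up): w = 2, b = 1
w, b = 2.0, 1.0
xs = [0.0, 1.0, 2.0]
y_linear = [w * x + b for x in xs]          # [1.0, 3.0, 5.0]
y_prob = [sigmoid(w * x + b) for x in xs]   # probabilities in (0, 1)

print(mse([1.0, 3.0, 6.0], y_linear))       # 1.0 (only the last point misses)
print(cross_entropy([1, 1, 1], y_prob))
```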

If the model can't even fit the training data, it has large bias (underfitting). If the model fits the training data but not the testing data, it has large variance (overfitting).

Underfitting

To prevent underfitting, we can:

  • Add more features as input.
  • Use a more complex, more flexible model.

Overfitting

A more complex model does not always lead to better performance on testing data or new data.

| Model | Training Error | Testing Error |
| ----- | -------------- | ------------- |
| $x$   | 31.9           | 35.0          |
| $x^2$ | 15.4           | 18.4          |
| $x^3$ | 15.3           | 18.1          |
| $x^4$ | 14.9           | 28.2          |
| $x^5$ | 12.8           | 232.1         |

An extreme example: the following function achieves $0$ training loss but large testing loss:

$$f(x)=\begin{cases} y_i, & \text{if } x=x_i\in X \\ \text{random}, & \text{otherwise} \end{cases}$$
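This memorizing "model" is just a lookup table (a minimal sketch; the toy training pairs are made up):

```python
import random

# Training set: the "model" simply memorizes every (x_i, y_i) pair.
train = {0.0: 1.0, 1.0: 3.0, 2.0: 5.0}

def f(x):
    # Zero training loss: exact recall on any x seen during training.
    if x in train:
        return train[x]
    # Arbitrary output on unseen inputs -> large testing loss.
    return random.uniform(-100.0, 100.0)

print(f(1.0))  # 3.0 (memorized)
print(f(1.5))  # some random value
```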

To prevent overfitting, we can:

  • More training data.
  • Data augmentation: crop, flip, rotate, cutout, mixup.
  • Constrained model:
    • Fewer parameters, parameter sharing.
    • Fewer features.
    • Early stopping.
    • Dropout.
    • Regularization.

Regularization

  • L1 (Lasso): $L(w)=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2+\lambda\sum\limits_{i=1}^n|w_i|$, produces sparse solutions and can be used for feature selection.
  • L2 (Ridge): $L(w)=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2+\lambda\sum\limits_{i=1}^n w_i^2$, known as weight decay; keeps the weights from growing too large.
  • Elastic Net: $L(w)=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2+\lambda_1\sum\limits_{i=1}^n|w_i|+\lambda_2\sum\limits_{i=1}^n w_i^2$, combines L1 and L2.

$$\begin{split} w_{t+1}&=w_t-\eta\nabla L(w) \\ &=w_t-\eta\left(\frac{\partial L}{\partial w}+\lambda\,\text{sign}(w_t)\right) \quad (\text{L1: sparse solution}) \\ w_{t+1}&=w_t-\eta\nabla L(w) \\ &=(1-\eta\lambda)w_t-\eta\frac{\partial L}{\partial w} \quad (\text{L2: weight decay}) \end{split}$$
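The two update rules can be sketched directly (a minimal illustration; the learning rate $\eta=1.0$ and $\lambda=0.001$ are chosen only so the arithmetic is exact):

```python
def sign(x):
    return (x > 0) - (x < 0)

def l1_step(w, grad, lr, lam):
    # w_{t+1} = w_t - lr * (dL/dw + lam * sign(w_t))
    return w - lr * (grad + lam * sign(w))

def l2_step(w, grad, lr, lam):
    # w_{t+1} = (1 - lr * lam) * w_t - lr * dL/dw  (weight decay)
    return (1 - lr * lam) * w - lr * grad

# With zero data gradient, L1 subtracts a constant step, so a small weight
# is driven to exactly zero (sparsity); L2 only shrinks it proportionally.
w_small = 0.001
print(l1_step(w_small, grad=0.0, lr=1.0, lam=0.001))  # 0.0
print(l2_step(w_small, grad=0.0, lr=1.0, lam=0.001))  # still positive
```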

Classification

Binary Class

Spam filtering:

$$y=\delta(Wx+b) \qquad L=\sum\limits_{i=1}^n\delta(y_i\ne\hat{y}_i)$$
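Reading $\delta$ as a hard threshold and the loss as a 0-1 misclassification count, a sketch looks like this (the spam scores below are made up):

```python
def step(z):
    # delta: hard threshold on the score Wx + b
    return 1 if z > 0 else 0

def zero_one_loss(y_true, y_pred):
    # L = sum_i [y_i != y_hat_i], the number of misclassified examples
    return sum(1 for y, p in zip(y_true, y_pred) if y != p)

# Toy spam scores (made up): positive score -> predicted spam (1)
scores = [2.3, -1.1, 0.7, -0.2]
labels = [1, 0, 0, 0]
preds = [step(s) for s in scores]  # [1, 0, 1, 0]
print(zero_one_loss(labels, preds))  # 1
```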

Multiple Class

Document classification:

$$y=\text{softmax}(Wx+b) \qquad L=-\sum\limits_{i=1}^n y_i\log(\hat{y}_i)$$
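A minimal softmax plus cross-entropy sketch (the class scores are made up; the max-subtraction trick is a standard numerical-stability detail, not from the notes):

```python
import math

def softmax(zs):
    # Subtract the max score before exponentiating for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y_true, y_prob):
    # L = -sum_k y_k * log(y_hat_k); y_true is a one-hot label vector.
    return -sum(y * math.log(p) for y, p in zip(y_true, y_prob) if y > 0)

# Toy document scores (made up) for 3 classes
probs = softmax([2.0, 1.0, 0.1])
print(probs)                           # probabilities summing to 1
print(cross_entropy([1, 0, 0], probs))
```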

Decision Tree

Choose the split attribute $a_*$ by maximizing the information gain:

$$\begin{aligned} a_* &= \underset{a \in A}{\arg \max}\ \text{Gain}(D, a) \\ &= \underset{a \in A}{\arg \max}\ \text{Ent}(D) - \sum\limits_{v=1}^{V}\frac{|D^v|}{|D|}\text{Ent}(D^v) \end{aligned}$$

where the information entropy is $\text{Ent}(D)=-\sum\limits_{k=1}^{|\mathcal{Y}|}p_k\log_2 p_k$: the larger the entropy, the more mixed the data and the greater the uncertainty.
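Entropy and information gain can be computed in a few lines (the four-sample dataset is made up; a perfect split recovers the full entropy of 1 bit as gain):

```python
import math
from collections import Counter

def entropy(labels):
    # Ent(D) = -sum_k p_k * log2(p_k)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    # Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# Toy dataset (made up): 2 "yes", 2 "no" -> Ent(D) = 1 bit
labels = ["yes", "yes", "no", "no"]
print(entropy(labels))                                           # 1.0
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # 1.0
```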

Support Vector Machine

An SVM looks for the hyperplane farthest from both the positive and the negative samples:

$$y=\text{sign}(Wx+b)$$

The sample points closest to the decision boundary (the hyperplane) are called support vectors:

  • Position: they determine the direction and position of the decision boundary.
  • Efficiency: SVM is insensitive to input perturbations; removing the non-support-vector points leaves the decision boundary unchanged.
  • Maximum margin: the objective is to maximize the distance from the hyperplane to the support vectors.

Bayesian Classifier

A Bayesian classifier minimizes the classification error rate (the conditional risk):

$$\begin{aligned} h^*(\boldsymbol{x}) &= \underset{c \in \mathcal{Y}}{\arg \min}\ R(c \mid \boldsymbol{x}) \\ &= \underset{i \in \{1, 2, \dots, N\}}{\arg \min} \sum\limits_{j=1}^N\lambda_{ij}P(c_j \mid \boldsymbol{x}) \end{aligned}$$

A naive Bayes classifier maximizes the posterior probability under the attribute conditional independence assumption:

$$h_\text{nb}(\boldsymbol{x}) = \underset{c \in \mathcal{Y}}{\arg \max}\ P(c) \prod_{i=1}^{d}P(x_i \mid c)$$

where $P(c) = \frac{|D_c|}{|D|}$ estimates the prior probability by the ratio of sample counts (i.e. by frequency).
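A frequency-counting naive Bayes sketch (the four binary-feature training rows are made up; no smoothing is applied, so unseen feature values zero out a class):

```python
from collections import Counter

# Toy training data (made up): each row is (features, class label)
data = [((1, 0), "spam"), ((1, 1), "spam"), ((0, 1), "ham"), ((0, 0), "ham")]

def naive_bayes(x):
    # h_nb(x) = argmax_c P(c) * prod_i P(x_i | c)
    prior = Counter(c for _, c in data)  # P(c) ~ |D_c| / |D|
    n = len(data)
    best, best_score = None, -1.0
    for c in prior:
        rows = [f for f, lab in data if lab == c]
        score = prior[c] / n
        for i, xi in enumerate(x):
            # P(x_i | c) estimated by frequency within class c
            score *= sum(1 for r in rows if r[i] == xi) / len(rows)
        if score > best_score:
            best, best_score = c, score
    return best

print(naive_bayes((1, 0)))  # spam
```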

K-Nearest Neighbors

kNN ("one takes on the color of one's company") predicts from the $k$ training samples nearest to the query sample:

$$y=\text{vote}(\text{neighbors}(x))$$

Dimensionality reduction increases sample density and mitigates the curse of dimensionality (samples are sparse when the feature dimension is extremely high), which makes kNN effective.
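A vote-among-neighbors sketch (the 2-D training points form two made-up clusters):

```python
from collections import Counter

def knn_predict(train, x, k=3):
    # Vote among the k nearest training samples (squared Euclidean distance).
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(train, key=lambda fy: dist2(fy[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D points (made up): two well-separated clusters
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.5, 0.5)))  # a
print(knn_predict(train, (5.5, 5.5)))  # b
```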

Neural Network

Image recognition, game playing:

y=softmax(ReLU(Wx+b))y=\text{softmax}(\text{ReLU}(Wx+b))

Ensemble Learning

A machine-learning paradigm that combines multiple individual learners (e.g. decision trees, SVMs, neural networks) to complete a learning task:

  • Base models should be "good but different": both accurate and diverse.
  • An individual learner with error rate $\epsilon = 0.5$ contributes nothing to the ensemble; learners with error rates above 0.5 (flip their predictions) or below 0.5 are both useful.

| Method   | Training Style                                   | Focus                              | Data Sampling                         |
| -------- | ------------------------------------------------ | ---------------------------------- | ------------------------------------- |
| Boosting | Sequential (later learners correct earlier ones) | Reduce bias (improve accuracy)     | Reweight samples                      |
| Bagging  | Parallel (independent training)                  | Reduce variance (less overfitting) | Bootstrap sampling (with replacement) |
| Stacking | Layered (combine outputs)                        | Improve overall generalization     | Base models' predictions              |

Boosting

Sequential ensembling: the individual learners depend strongly on one another and must be generated serially:

  • Each round reweights the samples according to the previous round's performance, so the model focuses on previously misclassified samples.
  • Gradually boosts weak learners into a strong learner.
  • Main effect: reduces bias and improves accuracy.
  • Representatives: AdaBoost, GBDT, XGBoost, LightGBM.

Bagging

Parallel ensembling: the individual learners have no strong dependence on one another and can be generated in parallel:

  • Bootstrap: sampling with replacement produces multiple different training sets, each used to train a model.
  • Aggregation: averaging (regression) or voting (classification) yields the final result.
  • Main effect: reduces variance and improves stability.
  • Representative: Random Forest, which additionally injects random attribute selection into decision-tree training.
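The Bootstrap + Aggregation steps can be sketched as follows (the 1-D dataset and the 1-nearest-neighbour base learner are made up for illustration):

```python
import random
from collections import Counter

def bootstrap(data, rng):
    # Draw |data| points with replacement -> one resampled training set.
    return [rng.choice(data) for _ in data]

def bagging_predict(data, x, predict_one, n_models=5, seed=0):
    # Train one "model" per bootstrap sample, then vote on the prediction.
    rng = random.Random(seed)
    votes = Counter(predict_one(bootstrap(data, rng), x) for _ in range(n_models))
    return votes.most_common(1)[0][0]

# Toy base learner (made up): 1-nearest-neighbour on 1-D points
def nn1(train, x):
    return min(train, key=lambda fy: abs(fy[0] - x))[1]

data = [(0.0, "a"), (0.2, "a"), (0.9, "b"), (1.1, "b")]
print(bagging_predict(data, 0.1, nn1))
```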

Stacking

Layered ensembling with a hierarchical combination strategy:

  • First train several different base (level-0) learners, then feed their predictions as new features into a meta (level-1) learner.
  • Can fuse the strengths of different algorithms, e.g. combining decision trees, SVM, and kNN.

Structured Learning

Training

Find a function FF:

$$F: X \times Y \to \mathbb{R}$$

$F(x, y)$ evaluates how well $y$ fits $x$ (how compatible the object pair is).

Inference

Given an object $x$:

$$\tilde{y}=\arg\max\limits_{y\in Y}F(x, y)$$

Three Problems

  • Evaluation: what does $F(x, y)$ look like?
  • Inference: how to solve the $\arg\max$ problem?
  • Training: how to find $F(x, y)$ from the given training data?

Structured Linear Model

$$\begin{split} F(x, y)&=\sum\limits_{i=1}^n w_i\phi_i(x, y) \\ &=\begin{bmatrix}w_1\\w_2\\w_3\\\vdots\\w_n\end{bmatrix}\cdot \begin{bmatrix}\phi_1(x, y)\\\phi_2(x, y)\\\phi_3(x, y)\\\vdots\\\phi_n(x, y)\end{bmatrix} \\ &=W\cdot\Phi(x, y) \end{split}$$
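Scoring and inference over a finite candidate set $Y$ can be sketched as follows (the two hand-crafted features and the candidate strings are made-up illustrations):

```python
def F(w, phi):
    # F(x, y) = W . Phi(x, y): dot product of weights and joint features
    return sum(wi * pi for wi, pi in zip(w, phi))

def inference(w, x, candidates, features):
    # y~ = argmax_{y in Y} F(x, y), solved by brute force over candidates
    return max(candidates, key=lambda y: F(w, features(x, y)))

# Toy joint features (made up): exact-match indicator and label length
def features(x, y):
    return [1.0 if x == y else 0.0, float(len(y))]

w = [2.0, 0.1]
print(inference(w, "cat", ["cat", "dog", "horse"], features))  # cat
```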