
Supervised Learning

Regression

Output a scalar:

  • Linear regression: $y=Wx+b=\sum\limits_{i=1}^n{w_ix_i}+b$, $L=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2$.
  • Polynomial regression: $y=\sum\limits_{i=1}^n{w_ix^i}+b$.
  • Logistic regression (output probability): $y=\sigma(Wx+b)=\frac{1}{1+e^{-\sum\limits_{i=1}^n{w_ix_i}-b}}$, $L=-\sum\limits_{i=1}^n{y_i\log(\hat{y}_i)}$.
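
A minimal NumPy sketch of the linear and logistic cases (the toy data, zero-initialized parameters, and the binary form of the cross-entropy are illustrative assumptions, not from the notes):

```python
import numpy as np

# Toy data (assumption): 100 samples with 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_reg = X @ w_true + 0.1 * rng.normal(size=100)   # regression targets
y_cls = (X @ w_true > 0).astype(float)            # binary classification targets

def linear_regression(X, w, b):
    # y = Wx + b
    return X @ w + b

def squared_error(y, y_hat):
    # L = sum_i (y_i - y_hat_i)^2
    return np.sum((y - y_hat) ** 2)

def logistic_regression(X, w, b):
    # y = sigma(Wx + b)
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Binary form of L = -sum_i y_i log(y_hat_i), with the (1 - y_i) term included.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w, b = np.zeros(3), 0.0                           # untrained parameters (assumption)
print(squared_error(y_reg, linear_regression(X, w, b)))
print(binary_cross_entropy(y_cls, logistic_regression(X, w, b)))
```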

If the model cannot even fit the training data, it has large bias (underfitting). If the model fits the training data but not the testing data, it has large variance (overfitting).

Underfitting

To prevent underfitting, we can:

  • Add more features as input.
  • Use a more complex and flexible model.

Overfitting

A more complex model does not always lead to better performance on testing data or new data.

| Model | Training Error | Testing Error |
| ----- | -------------- | ------------- |
| $x$   | 31.9           | 35.0          |
| $x^2$ | 15.4           | 18.4          |
| $x^3$ | 15.3           | 18.1          |
| $x^4$ | 14.9           | 28.2          |
| $x^5$ | 12.8           | 232.1         |
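
A hedged sketch of the same experiment with `numpy.polyfit`: fit polynomials of degree 1 through 5 and compare training and testing error. The cubic data-generating function, noise level, and sample sizes are assumptions, so the exact numbers will differ from the table, but training error typically keeps dropping with degree while testing error eventually rises:

```python
import numpy as np

rng = np.random.default_rng(1)

def truth(x):
    # Assumed ground-truth function: a cubic.
    return 0.5 * x**3 - x**2 + 2.0 * x + 1.0

x_train = rng.uniform(-3, 3, size=15)
x_test = rng.uniform(-3, 3, size=200)
y_train = truth(x_train) + rng.normal(scale=2.0, size=x_train.shape)
y_test = truth(x_test) + rng.normal(scale=2.0, size=x_test.shape)

for degree in range(1, 6):                          # models x, x^2, ..., x^5
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train {train_err:.1f}, test {test_err:.1f}")
```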

An extreme example: the following function achieves $0$ training loss but large testing loss:

$$
f(x)=\begin{cases}
y_i, & \text{if } \exists x_i\in X \text{ such that } x=x_i \\
\text{random}, & \text{otherwise}
\end{cases}
$$
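
As a sketch, such a function is just a lookup table over the training pairs (hypothetical code, not from the notes):

```python
import random

def memorizer(train_x, train_y):
    """Return f(x): the stored label if x appeared in training, otherwise random."""
    table = dict(zip(train_x, train_y))
    def f(x):
        if x in table:             # some x_i in X equals x: return the memorized y_i
            return table[x]        # -> 0 training loss
        return random.random()     # unseen x: arbitrary output -> large testing loss
    return f

f = memorizer([1, 2, 3], [10.0, 20.0, 30.0])
print(f(2))   # 20.0 (memorized)
print(f(4))   # some random value
```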

To prevent overfitting, we can:

  • More training data.
  • Data augmentation: crop, flip, rotate, cutout, mixup.
  • Constrained model:
    • Fewer parameters; parameter sharing.
    • Fewer features.
    • Early stopping.
    • Dropout.
    • Regularization.
$$
\begin{split}
L(w)&=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2+\lambda\sum\limits_{i=1}^n{w_i^2}\\
w_{t+1}&=w_t-\eta\nabla{L(w)}\\
&=w_t-\eta\left(\frac{\partial{L}}{\partial{w}}+\lambda{w_t}\right)\\
&=(1-\eta\lambda)w_t-\eta\frac{\partial{L}}{\partial{w}} \quad (\text{Regularization: Weight Decay})
\end{split}
$$
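
In a gradient-descent loop this means shrinking the weights by a factor $(1-\eta\lambda)$ at every step before applying the usual gradient, hence the name weight decay. A minimal NumPy sketch (the toy data, learning rate, and regularization strength are assumed values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 2.0]) + 0.1 * rng.normal(size=50)

eta, lam = 1e-3, 0.1          # learning rate and regularization strength (assumptions)
w = np.zeros(5)

for _ in range(2000):
    grad = -2 * X.T @ (y - X @ w)         # gradient of the squared-error term only
    w = (1 - eta * lam) * w - eta * grad  # weight-decay form of the update
print(w)                                  # close to the true weights, slightly shrunk
```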

Classification

  • Binary classification: $y=\delta(Wx+b)$, $L=\sum\limits_{i=1}^n\delta(y_i\ne\hat{y}_i)$, e.g. spam filtering.
  • Multi-class classification: $y=\text{softmax}(Wx+b)$, $L=-\sum\limits_{i=1}^n{y_i\log(\hat{y}_i)}$, e.g. document classification.
  • Non-linear model:
    • Deep learning: $y=\text{softmax}(\text{ReLU}(Wx+b))$, e.g. image recognition, game playing.
    • Support vector machine (SVM): $y=\text{sign}(Wx+b)$.
    • Decision tree: $y=\text{vote}(\text{leaves}(x))$.
    • K-nearest neighbors (KNN): $y=\text{vote}(\text{neighbors}(x))$.
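
A small NumPy sketch of the softmax / cross-entropy pair used for multi-class classification above (the class count and logit values are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    # L = -sum_i y_i log(y_hat_i)
    return -np.sum(y_onehot * np.log(y_hat + eps))

logits = np.array([2.0, 0.5, -1.0])   # Wx + b for a 3-class problem (assumed values)
y = np.array([1.0, 0.0, 0.0])         # one-hot ground truth
y_hat = softmax(logits)
print(y_hat, cross_entropy(y, y_hat))
```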

Structured Learning

Training

Find a function $F$:

$$
F: X\times{Y}\to\mathbb{R}
$$

$F(x, y)$ evaluates how well $y$ fits $x$, i.e. how compatible the object and the label are.

Inference

Given an object $x$:

$$
\tilde{y}=\arg\max\limits_{y\in{Y}}F(x, y)
$$
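
When the label set $Y$ is small enough to enumerate, this $\arg\max$ is a direct search over candidates; a sketch assuming a generic scoring function `F` (both the toy scoring function and the candidate set are hypothetical):

```python
def inference(x, candidates, F):
    """Return the y among the candidates that maximizes F(x, y)."""
    return max(candidates, key=lambda y: F(x, y))

# Toy usage: score how well an integer label "fits" x (illustrative F).
F = lambda x, y: -abs(x - y)
print(inference(7, range(10), F))   # 7
```

For real structured outputs (sequences, trees, bounding boxes), $Y$ is exponentially large, so the search needs problem-specific algorithms such as dynamic programming rather than brute-force enumeration.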


Three Problems

  • Evaluation: what does $F(x, y)$ look like?
  • Inference: how to solve the $\arg\max$ problem?
  • Training: how to find $F(x, y)$ with the given training data?

Structured Linear Model

$$
\begin{split}
F(x, y)&=\sum\limits_{i=1}^n{w_i\phi_i(x, y)} \\
&=\begin{bmatrix}w_1\\w_2\\w_3\\\vdots\\w_n\end{bmatrix}\cdot
\begin{bmatrix}\phi_1(x, y)\\\phi_2(x, y)\\\phi_3(x, y)\\\vdots\\\phi_n(x, y)\end{bmatrix}\\
&=W\cdot\Phi(x, y)
\end{split}
$$
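
With this linear form, evaluation reduces to a dot product between the weight vector and a joint feature vector $\Phi(x, y)$, and inference is again an $\arg\max$ over candidates. A minimal sketch (the feature map `phi`, the weights, and the candidate labels are placeholder assumptions):

```python
import numpy as np

def phi(x, y):
    # Placeholder joint feature map Phi(x, y); a real one would encode
    # task-specific compatibility features between the object x and the label y.
    return np.array([x * y, x + y, float(x == y)])

w = np.array([0.2, -0.1, 1.0])          # learned weight vector W (assumed values)

def F(x, y):
    # F(x, y) = W . Phi(x, y)
    return w @ phi(x, y)

candidates = [0, 1, 2, 3]
y_best = max(candidates, key=lambda y: F(5, y))
print(y_best, F(5, y_best))
```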