Skip to main content

AI Basic Notes

Linear Algebra

Linear Space

向量空间的一组基: 张成 (span) 该空间的一个线性无关 (linearly independent) 向量集.

Linear Transformation

线性变换是指一个向量空间到另一个向量空间的映射, 满足加法和数乘运算的线性性质:

L(αv+βw)=αL(v)+βL(w)\begin{equation} L(\alpha\vec{v}+\beta\vec{w})=\alpha{L(\vec{v})}+\beta{L(\vec{w})} \end{equation}

Matrix representation:

v=xi+yj=[xy]Av=xi^+yj^=x[ac]i+y[bd]j=[abcd][xy]A2A1=A2i^+A2j^=([a2b2c2d2][a1c1])i+([a2b2c2d2][b1d1])j=(a1[a2c2]+c1[b2d2])i+(b1[a2c2]+d1[b2d2])j=[a2a1+b2c1a2b1+b2d1c2a1+d2c1c2b1+d2d1]\begin{split} \vec{v}&=x\vec{i}+y\vec{j} \\ &=\begin{bmatrix}x \\ y\end{bmatrix} \\ A\vec{v}&=x\hat{i}+y\hat{j} \\ &=x\begin{bmatrix}a \\ c\end{bmatrix}\vec{i} +y\begin{bmatrix}b \\ d\end{bmatrix}\vec{j} \\ &=\begin{bmatrix}a & b \\ c & d\end{bmatrix} \begin{bmatrix}x \\ y\end{bmatrix} \\ A_2A_1&=A_2\hat{i}+A_2\hat{j} \\ &=(\begin{bmatrix}a_2 & b_2 \\ c_2 & d_2\end{bmatrix} \begin{bmatrix}a_1 \\ c_1 \end{bmatrix})\vec{i} +(\begin{bmatrix}a_2 & b_2 \\ c_2 & d_2\end{bmatrix} \begin{bmatrix}b_1 \\ d_1\end{bmatrix})\vec{j} \\ &=(a_1\begin{bmatrix}a_2 \\ c_2\end{bmatrix} +c_1\begin{bmatrix}b_2 \\ d_2\end{bmatrix})\vec{i} +(b_1\begin{bmatrix}a_2 \\ c_2\end{bmatrix} +d_1\begin{bmatrix}b_2 \\ d_2\end{bmatrix})\vec{j} \\ &=\begin{bmatrix} a_2a_1+b_2c_1 & a_2b_1+b_2d_1 \\ c_2a_1+d_2c_1 & c_2b_1+d_2d_1 \end{bmatrix} \end{split}
Matrix Multiplication

左乘矩阵相当于对列向量进行线性变换, 右乘矩阵相当于对行向量进行线性变换.

Am×nA_{m\times n} 表示 n 维空间到 m 维空间的线性变换:

  • n 列: 输入空间有 n 个基向量, 即为 n 维空间.
  • m 行: 输出空间每个基向量对应 m 个坐标, 即为 m 维空间.
  • A1×nA_{1\times n} 表示 n 维空间到一维空间的线性变换: 向量点乘 (Dot Product) vw\vec{v} \cdot \vec{w} 可以理解为 w\vec{w} 通过 V1×nV_{1\times n} 变换到一维空间后的投影.
Dot Product and Cross Product
  • Dot Product: vw=vwcosθ\vec{v} \cdot \vec{w}=\|\vec{v}\|\|\vec{w}\|\cos{\theta}.
  • Cross Product: v×w=vwsinθ\|\vec{v} \times \vec{w}\|=\|\vec{v}\|\|\vec{w}\|\sin{\theta}.
  • v(v×w)=0\vec{v}\cdot(\vec{v}\times\vec{w})=0, w(v×w)=0\vec{w}\cdot(\vec{v}\times\vec{w})=0.

Basis changes, translating transformations:

vp=P1APwp\vec{v_p}=P^{-1}AP\vec{w_p}

Determinant

det(A)\det(A) 表示矩阵 A 的行列式, 表示该变换对空间的缩放因子:

Determinant

det(A)=0\det(A)=0 时, 表示该变换将空间压缩到一个低维空间, 称矩阵 AA 为奇异矩阵 (Singular Matrix):

  • 矩阵 AA 列向量线性相关.
  • 矩阵 AA 不满秩 (Not full rank).
  • 矩阵 AA 不可逆.

Determinant for 2d matrix:

abcd=adbc\begin{equation} \begin{vmatrix}a & b \\ c & d\end{vmatrix}=ad-bc \end{equation}

Determinant Diagram

Determinant for 3d matrix:

abcdefghi=aefhibdfgi+cdegh\begin{equation} \begin{vmatrix}a & b & c \\ d & e & f \\ g & h & i\end{vmatrix} =a\begin{vmatrix}e & f \\ h & i\end{vmatrix} -b\begin{vmatrix}d & f \\ g & i\end{vmatrix} +c\begin{vmatrix}d & e \\ g & h\end{vmatrix} \end{equation}

Determinant for matrix multiplication:

det(A1A2)=det(A1)det(A2)\begin{equation} \det(A_1A_2)=\det(A_1)\det(A_2) \end{equation}

Gaussian Elimination

高斯消元法求解线性方程组 (Linear System Of Equations):

首先第一行的第一个元素化为 1, 下面每行减去第一行乘以该行第一个元素的倍数, 从而把第一列除第一行外的全部元素都化为 0, 进而把第二列除前两个元素之外的元素都化为 0, 最后把矩阵化为上三角矩阵. 类似地, 从最后一行开始, 逐行把上三角矩阵化为单位矩阵.

Ax=vA1Ax=A1vx=A1v\begin{split} A\vec{x}&=\vec{v} \\ A^{-1}A\vec{x}&=A^{-1}\vec{v} \\ \vec{x}&=A^{-1}\vec{v} \end{split}

Eigenvalue and Eigenvector

A=[abcd]A=\begin{bmatrix} a & b \\ c & d \end{bmatrix} eigenvalue Av=λvA\vec{v}=\lambda\vec{v} quick calculation:

λ=m±m2p=λ1+λ22±(λ1+λ22)2λ1λ2=a+d2±(a+d2)2(adbc)\begin{split} \lambda&=m\pm\sqrt{m^2-p} \\ &=\frac{\lambda_1+\lambda_2}{2} \pm\sqrt{(\frac{\lambda_1+\lambda_2}{2})^2-\lambda_1\lambda_2} \\ &=\frac{a+d}{2}\pm\sqrt{(\frac{a+d}{2})^2-(ad-bc)} \end{split}

Linear Algebra Reference

Mathematical Analysis

Limit

洛必达法则是求解分数形式的未定型极限 limxa00\lim_{x\to{a}}\frac{0}{0} 的有效方法之一:

limxaf(x)g(x)=limxadf(x)dg(x)=limxadfdx(a)dxdgdx(a)dx=limxadfdx(a)dgdx(a)=limxaf(a)g(a)\begin{equation} \begin{split} \lim_{x\to{a}}\frac{f(x)}{g(x)} &=\lim_{x\to{a}}\frac{df(x)}{dg(x)} \\ &=\lim_{x\to{a}}\frac{\frac{df}{dx}(a)dx}{\frac{dg}{dx}(a)dx} \\ &=\lim_{x\to{a}}\frac{\frac{df}{dx}(a)}{\frac{dg}{dx}(a)} \\ &=\lim_{x\to{a}}\frac{f'(a)}{g'(a)} \end{split} \end{equation}

Derivative

常见导数:

ddxxn=nxn1ddxsinx=cosxddxcosx=sinxddxax=axlnaddxex=exddxlogax=1xlnaddxlnx=1xddx(g(x)+h(x))=g(x)+h(x)ddx(g(x)h(x))=g(x)h(x)+g(x)h(x)ddxf(g(x))=f(g(x))g(x)ddxf1(x)=1f(f1(x))ddxa(x)b(x)f(t)dt=f(b(x))b(x)f(a(x))a(x)\begin{equation} \begin{split} \frac{d}{dx}x^n&=nx^{n-1} \\ \frac{d}{dx}\sin{x}&=\cos{x} \\ \frac{d}{dx}\cos{x}&=-\sin{x} \\ \frac{d}{dx}a^x&=a^x\ln{a} \\ \frac{d}{dx}e^x&=e^x \\ \frac{d}{dx}\log_a{x}&=\frac{1}{x\ln{a}} \\ \frac{d}{dx}\ln{x}&=\frac{1}{x} \\ \frac{d}{dx}(g(x)+h(x))&=g'(x)+h'(x) \\ \frac{d}{dx}(g(x)h(x))&=g'(x)h(x)+g(x)h'(x) \\ \frac{d}{dx}f(g(x))&=f'(g(x))g'(x) \\ \frac{d}{dx}f^{-1}(x)&=\frac{1}{f'(f^{-1}(x))} \\ \frac{d}{dx}\int_{a(x)}^{b(x)}f(t)dt&=f(b(x))b'(x)-f(a(x))a'(x) \end{split} \end{equation}

Series

泰勒级数利用函数在某点的各阶导数, 近似该点附近函数的值:

11x=n=0xnx<1ex=n=0xnn!ln(1+x)=n=1(1)n1nxnx(1,1]sin(x)=n=0(1)n(2n+1)!x2n+1cos(x)=n=0(1)n(2n)!x2nf(x)=n=0f(n)(x0)n!(xx0)n=f(x0)+f(x0)(xx0)+f(x0)2!(xx0)2+\begin{equation} \begin{split} \frac{1}{1-x}&=\sum\limits_{n=0}^{\infty}x^n \quad |x|\lt1 \\ e^x&=\sum\limits_{n=0}^{\infty}\frac{x^n}{n!} \\ \ln(1+x)&=\sum\limits_{n=1}^{\infty}\frac{(-1)^{n-1}}{n}x^n \quad x\in(-1,1] \\ \sin(x)&=\sum\limits_{n=0}^{\infty}\frac{(-1)^n}{(2n+1)!}x^{2n+1} \\ \cos(x)&=\sum\limits_{n=0}^{\infty}\frac{(-1)^n}{(2n)!}x^{2n} \\ f(x)&=\sum\limits_{n=0}^{\infty}\frac{f^{(n)(x_0)}}{n!}(x-x_0)^n \\ &=f(x_0)+f'(x_0)(x-x_0)+\frac{f''(x_0)}{2!}(x-x_0)^2+\dots \end{split} \end{equation}

Euler's Formula

复数平面 (Complex Plane) 上的圆周运动:

eix=cosx+isinx\begin{equation} e^{ix}=\cos{x}+i\sin{x} \end{equation}

Fourier Transform

Time to frequency transform:

f^(ξ)=f(t)e2πiξtdt\begin{equation} \hat{f}(\xi)=\int_{-\infty}^{\infty}f(t)e^{-2\pi i\xi t}dt \end{equation}

Discrete Fourier Transform (DFT):

X[k]=n=0N1xnei2πNkn\begin{equation} X[k]=\sum\limits_{n=0}^{N-1}x_n e^{-\frac{i2\pi}{N}kn} \end{equation}

outcomes

[11111e2πine2πi(2)ne2πi(n1)n1e2πi(2)ne2πi(4)ne2πi(2)(n1)n1e2πi(n1)ne2πi(2)(n1)ne2πi(n1)(n1)n]\begin{bmatrix} 1 & 1 & 1 & \dots & 1 \\ 1 & e^{\frac{2\pi i}{n}} & e^{\frac{2\pi i(2)}{n}} & \dots & e^{\frac{2\pi i(n-1)}{n}} \\ 1 & e^{\frac{2\pi i(2)}{n}} & e^{\frac{2\pi i(4)}{n}} & \dots & e^{\frac{2\pi i(2)(n-1)}{n}} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & e^{\frac{2\pi i(n-1)}{n}} & e^{\frac{2\pi i(2)(n-1)}{n}} & \dots & e^{\frac{2\pi i(n-1)(n-1)}{n}} \end{bmatrix}

Fourier Transform

Differential Equation

微分方程 (Differential Equation) 是描述变量之间关系的方程, 通常包含未知函数及其导数, 用于描述物理现象和自然规律.

First Order Differential Equation

一阶微分方程:

ddt[x(t)y(t)]=[abcd][x(t)y(t)][x(t)y(t)]=e[abcd]t[x(0)y(0)]\begin{equation} \frac{d}{dt}\begin{bmatrix}x(t)\\y(t)\end{bmatrix} =\begin{bmatrix}a&b\\c&d\end{bmatrix}\begin{bmatrix}x(t)\\y(t)\end{bmatrix} \Rightarrow \begin{bmatrix}x(t)\\y(t)\end{bmatrix} =e^{\begin{bmatrix}a&b\\c&d\end{bmatrix}t}\begin{bmatrix}x(0)\\y(0)\end{bmatrix} \end{equation} ifv(t)=eMtv0thenddtv(t)=ddteMtv0=ddtn=0Mnn!tnv0=n=0Mnn!ntn1v0=Mn=0Mn1(n1)!tn1v0=MeMtv0=Mv(t)\begin{split} \text{if} \quad \vec{v}(t)&=e^{Mt}\vec{v}_0 \\ \text{then} \quad \frac{d}{dt}\vec{v}(t) &=\frac{d}{dt}e^{Mt}\vec{v}_0 \\ &=\frac{d}{dt}\sum\limits_{n=0}^{\infty}\frac{M^n}{n!}t^n\vec{v}_0 \\ &=\sum\limits_{n=0}^{\infty}\frac{M^n}{n!}nt^{n-1}\vec{v}_0 \\ &=M\sum\limits_{n=0}^{\infty}\frac{M^{n-1}}{(n-1)!}t^{n-1}\vec{v}_0 \\ &=Me^{Mt}\vec{v}_0 \\ &=M\vec{v}(t) \end{split}

Second Order Differential Equation

x¨(t)=μx˙(t)ωx(t)\begin{equation} \ddot{x}(t)=-\mu\dot{x}(t)-\omega x(t) \end{equation}

Gravitational force equation:

y¨(t)=g,y˙(t)=gt+v0dx1dt=v1,dv1dt=Gm2(x2x1x2x1)(1x2x12)θ¨(t)=μθ˙(t)gLsin(θ(t))\begin{split} \ddot{y}(t)=-g, \quad & \dot{y}(t)=-gt+v_0 \\ \frac{d\vec{x}_1}{dt}=\vec{v}_1, \quad & \frac{d\vec{v}_1}{dt}=Gm_2\Big(\frac{\vec{x}_2-\vec{x}_1}{\|\vec{x}_2-\vec{x}_1\|}\Big)\Big(\frac{1}{\|\vec{x}_2-\vec{x}_1\|^2}\Big) \\ & \ddot{\theta}(t)=-\mu\dot{\theta}(t)-\frac{g}{L}\sin\big({\theta}(t)\big) \end{split}

Partial Differential Equation

热传导方程:

Tt(x,t)=α2Tx2(x,t)\frac{\partial{T}}{\partial{t}}(x,t)=\alpha\frac{\partial^2{T}}{\partial{x^2}}(x,t)

Black-Scholes / Merton equation:

Vt+rSVS+12σ2S22VS2rV=0\frac{\partial{V}}{\partial{t}}+rS\frac{\partial{V}}{\partial{S}}+\frac{1}{2}\sigma^2S^2\frac{\partial^2{V}}{\partial{S^2}}-rV=0

Phase Space

相空间是描述系统状态的空间, 每个点代表系统的一个状态, 点的轨迹描述了系统的演化.

import numpy as np

# Physical constants
g = 9.8
L = 2
mu = 0.1

THETA_0 = np.pi / 3 # 60 degrees
THETA_DOT_0 = 0 # No initial angular velocity

# Definition of ODE
def get_theta_double_dot(theta, theta_dot):
return -mu * theta_dot - (g / L) * np.sin(theta)

# Solution to the differential equation (numerically)
def theta(t):
theta = THETA_0
theta_dot = THETA_DOT_0
delta_t = 0.01 # Time step
for _ in np.arange(0, t, delta_t):
theta_double_dot = get_theta_double_dot(theta, theta_dot)
theta += theta_dot * delta_t
theta_dot += theta_double_dot * delta_t
return theta

Mathematical Statistics

Normal Distribution

若随机变量 XX 服从一个位置参数为 μ\mu, 尺度参数为 σ\sigma 的概率分布, 且其概率密度函数 (Probability Density Function, PDF) 为:

f(x)=1σ2πe12(xμσ)2\begin{equation} f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2} \end{equation}

则这个随机变量称为正态随机变量, 正态随机变量服从的分布称为正态分布, 记作 XN(μ,σ2)X \sim N(\mu,\sigma^2), 读作 XX 服从 N(μ,σ2)N(\mu,\sigma^2) (正态分布). 其中 μ\mu 为均值 (数学期望 Mean), σ\sigma 为标准差 (Standard Deviation).

正态分布 (又称 Gaussian Distribution) 是一种连续概率分布. 当 μ\mu 为 0, σ\sigma 为 1 时, 称为标准正态分布 (Standard Normal Distribution).

Central Limit Theorem

在自然界与生产中, 一些现象受到许多相互独立的随机因素的影响, 如果每个因素所产生的影响都很微小时, 总影响 (Sum) 可以看作服从正态分布.

相互独立的正态分布, 其和也是正态分布. 总体正态分布的均值等于各个分布的均值之和, E(X1++Xn)=E(X1)++E(Xn)=nμE(X_1+\dots+X_n)=E(X_1)+\dots+E(X_n)=n\mu. 假设协方差为 0, 则总体正态分布的方差等于各个分布的方差之和, Var(X1++Xn)=Var(X1)++Var(Xn)=nσ2{Var(X_1+\dots+X_n)}={Var(X_1)+\dots+Var(X_n)}={n}\sigma^2, 可以得到总体正态分布的标准差为 nσ\sqrt{n}\sigma.

设随机变量 X1,X2,,XnX_1,X_2,\dots,X_n 独立同分布(Independent Identically Distribution), 且均值为 E(Xi)=μE(X_i)=\mu, 方差为 D(Xi)=σ2D(X_i)=\sigma^2, 对于任意 xx, 其分布函数为

Fn(x)=P{i=1nXinμnσx}F_n(x)=P\left\{\frac{\sum_{i=1}^n{X_i}-n\mu}{\sqrt{n}\sigma}\leq{x}\right\}

满足

limnFn(x)=limnP{i=1nXinμnσx}=12πxet22dt=(x)\begin{equation} \lim_{n\to\infty}F_n(x) =\lim_{n\to\infty}P\left\{\frac{\sum_{i=1}^n{X_i}-n\mu}{\sqrt{n}\sigma}\leq{x}\right\} =\frac{1}{\sqrt{2\pi}}\int_{-\infty}^x{e^{-\frac{t^2}{2}}dt} =\varnothing(x) \end{equation}

独立同分布的中心极限定理说明, 当 nn 足够大时, 随机变量 Xn=i=1nXiX_n=\sum\limits_{i=1}^n{X_i} 近似服从正态分布 N(nμ,nσ2)N(n\mu,n\sigma^2); 标准化后的随机变量 Yn=i=1nXinμnσY_n=\frac{\sum_{i=1}^n{X_i}-n\mu}{\sqrt{n}\sigma} 近似服从标准正态分布 N(0,1)N(0,1).

Central Limit Theorem

更一般化的中心极限定理, 可参见林德伯格中心极限定理 (Lindeberg CLT) etc.

Gaussian Integral

ex2dx=π\begin{equation} \int_{-\infty}^{\infty}e^{-x^2}dx=\sqrt{\pi} \end{equation}

高维空间求解高斯积分:

Gaussian Integral

对于正态分布, 系数 1π\frac{1}{\sqrt{\pi}} 使得概率密度函数的积分为 1, 即 f(x)dx=1\int_{-\infty}^{\infty}f(x)dx=1, 使其成为有意义的概率分布.

Binomial Distribution

重复 n 次独立的伯努利试验, XB(n,p)X \sim B(n,p), 期望值 E(X)=npE(X)=np, 方差 D(X)=np(1p)D(X)=np(1-p):

P(X=k)=Cnkpk(1p)nk\begin{equation} P(X=k)=C_n^kp^k(1-p)^{n-k} \end{equation}

Bayes Theorem

Bayes theorem:

P(AB)=P(AB)P(B)=P(BA)P(A)P(A \cap B)=P(A|B)P(B)=P(B|A)P(A)\Rightarrow P(AB)=P(BA)P(A)P(B)=P(BA)P(A)P(BA)P(A)+P(B¬A)P(¬A)\begin{equation} P(A|B)=\frac{P(B|A)P(A)}{P(B)}=\frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|\neg{A})P(\neg{A})} \end{equation}

Bayes Theorem

其中, P(BA)P(B¬A)\frac{P(B|A)}{P(B|\neg{A})} 称为贝叶斯系数 (Bayes Factor):

O(AB)=P(AB)P(¬AB)=P(AB)P(B)P(¬AB)P(B)=P(BA)P(A)P(B¬A)P(¬A)=O(A)P(BA)P(B¬A)O(A|B)=\frac{P(A|B)}{P(\neg{A}|B)}=\frac{P(A|B)P(B)}{P(\neg{A}|B)P(B)}=\frac{P(B|A)P(A)}{P(B|\neg{A})P(\neg{A})}=O(A)\frac{P(B|A)}{P(B|\neg{A})}

Information Entropy

信息熵 是对信息量的度量 (E[I]E[I]), 概率小的事件发生所带来的信息量大, 概率大的事件发生所带来的信息量小, 即概率小, 出现机会小, 不确定性大, 信息熵大, 信息量大:

H(X)=E[log2P(xi)]=i=1nP(xi)log2P(xi)\begin{equation} H(X)=E[-\log_2{P(x_i)}]=-\sum\limits_{i=1}^n{P(x_i)\log_2{P(x_i)}} \end{equation}

Supervised Learning

Regression

Output a scalar:

  • Linear regression: y=Wx+b=i=1nwixi+by=Wx+b=\sum\limits_{i=1}^n{w_ix_i}+b, L=i=1n(yiy^i)2L=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2.
  • Polynomial regression: y=i=1nwixi+by=\sum\limits_{i=1}^n{w_ix^i}+b.
  • Logistic regression (output probability): y=σ(Wx+b)=11+ei=1nwixiby=\sigma(Wx+b)=\frac{1}{1+e^{-\sum\limits_{i=1}^n{w_ix_i}-b}}, L=i=1nyilog(y^i)L=-\sum\limits_{i=1}^n{y_i\log(\hat{y}_i)}.

If model can't even fit training data, then model have large bias (underfitting). If model can fit training data but not testing data, then model have large variance (overfitting).

Underfitting

To prevent underfitting, we can:

  • Add more features as input.
  • Use more complex and flexible model.

Overfitting

More complex model does not always lead to better performance on testing data or new data.

ModelTraining ErrorTesting Error
xx31.935.0
x2x^215.418.4
x3x^315.318.1
x4x^414.928.2
x5x^512.8232.1

A extreme example, such function obtains 00 training loss, but large testing loss:

f(x)={yi,xiXrandom,otherwise\begin{align*} f(x)=\begin{cases} y_i, & \exists{x_i}\in{X} \\ \text{random}, & \text{otherwise} \end{cases} \end{align*}

To prevent overfitting, we can:

  • More training data.
  • Data augmentation: crop, flip, rotate, cutout, mixup.
  • Constrained model:
    • Less parameters, sharing parameters.
    • Less features.
    • Early stopping.
    • Dropout.
    • Regularization.
L(w)=i=1n(yiy^i)2+λi=1nwi2wt+1=wtηL(w)=wtη(Lw+λwt)=(1ηλ)wtηLw(Regularization: Weight Decay)\begin{split} L(w)&=\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2+\lambda\sum\limits_{i=1}^n{w_i^2}\\ w_{t+1}&=w_t-\eta\nabla{L(w)}\\ &=w_t-\eta(\frac{\partial{L}}{\partial{w}}+\lambda{w_t})\\ &=(1-\eta\lambda)w_t-\eta\frac{\partial{L}}{\partial{w}} \quad (\text{Regularization: Weight Decay}) \end{split}

Classification

  • Binary classification: y=δ(Wx+b)y=\delta(Wx+b), L=i=1nδ(yiy^i)L=\sum\limits_{i=1}^n\delta(y_i\ne\hat{y}_i), e.g spam filtering.
  • Multi-class classification: y=softmax(Wx+b)y=\text{softmax}(Wx+b), L=i=1nyilog(y^i)L=-\sum\limits_{i=1}^n{y_i\log(\hat{y}_i)}, e.g document classification.
  • Non-linear model:
    • Deep learning: y=softmax(ReLU(Wx+b))y=\text{softmax}(\text{ReLU}(Wx+b)), e.g image recognition, game playing.
    • Support vector machine (SVM): y=sign(Wx+b)y=\text{sign}(Wx+b).
    • Decision tree: y=vote(leaves(x))y=\text{vote}(\text{leaves}(x)).
    • K-nearest neighbors (KNN): y=vote(neighbors(x))y=\text{vote}(\text{neighbors}(x)).

Structured Learning

Training

Find a function FF:

F:X×YRF:X\times{Y}\to{R}

F(x,y)F(x, y) evaluates how well yy fits xx (object compatible).

Inference

Given an object xx:

y~=argmaxyYF(x,y)\tilde{y}=\arg\max\limits_{y\in{Y}}F(x, y)

Structured Learning

Three Problems
  • Evaluation: what does F(X,y)F(X, y) look like.
  • Inference: how to solve argmax\arg\max problem.
  • Training: how to find F(x,y)F(x, y) with given training data.

Structured Linear Model

F(x,y)=i=1nwiϕi(x,y)=[w1w2w3wn][ϕ1(x,y)ϕ2(x,y)ϕ3(x,y)ϕn(x,y)]=WΦ(x,y)\begin{split} F(x, y)&=\sum\limits_{i=1}^n{w_i\phi_i(x, y)} \\ &=\begin{bmatrix}w_1\\w_2\\w_3\\\vdots\\w_n\end{bmatrix}\cdot \begin{bmatrix}\phi_1(x, y)\\\phi_2(x, y)\\\phi_3(x, y)\\\vdots\\\phi_n(x, y)\end{bmatrix}\\ &=W\cdot\Phi(x, y) \end{split}

Unsupervised Learning

Principal Component Analysis

主成分分析 (PCA) 是一种常用的数据降维方法, 将 mmnn 维向量降为 kk 维, 其目标是选择 kk 个单位 (模为 11) 正交基, 使得原始数据变换到这组基上后, 各字段两两间协方差为 00 (各字段完全独立), 各字段的方差尽可能大 (各字段降维后分布尽可能分散), 即在正交的约束下, 取最大的 kk 个方差:

C=1mXXT=1m[i=1m(xi1)2i=1mxi1xi2i=1mxi1xini=1mxi2xi1i=1m(xi2)2i=1mxi2xini=1mxinxi1i=1mxinxi2i=1m(xin)2]\begin{equation} C=\frac{1}{m}XX^T=\frac{1}{m}\begin{bmatrix} \sum_{i=1}^m(x_i^1)^2&\sum_{i=1}^m{x_i^1x_i^2}&\dots&\sum_{i=1}^m{x_i^1x_i^n}\\ \sum_{i=1}^m{x_i^2x_i^1}&\sum_{i=1}^m(x_i^2)^2&\dots&\sum_{i=1}^m{x_i^2x_i^n}\\ \vdots&\vdots&\ddots&\vdots\\ \sum_{i=1}^m{x_i^nx_i^1}&\sum_{i=1}^m{x_i^nx_i^2}&\dots&\sum_{i=1}^m(x_i^n)^2 \end{bmatrix} \end{equation}

协方差矩阵 CC 是一个对称矩阵, 其对角线分别为各字段的方差, 其第 iijj 列和第 jjii 列元素相同, 表示 iijj 两个字段的协方差. 将协方差矩阵对角化, 得到基于矩阵运算的 PCA 算法如下:

  • 将原始数据按列组成 nnmm 列矩阵 XX.
  • XX 的每一行 (代表一个属性字段) 进行零均值化, 即减去这一行的均值, 使得 xˉ=0\bar{x}=0, 方便方差与协方差的计算.
  • 求出协方差矩阵 C=1mXXTC=\frac{1}{m}XX^T 的特征值及对应的特征向量.
  • 将特征向量按对应特征值大小从上到下按行排列成矩阵, 取前 kk 行组成矩阵 PP.
  • Y=PXY=PX 即为降维到 kk 维后的数据.
Normalization
xi=xiμσx'_i=\frac{x_i-\mu}{\sigma}

Word Embedding

词嵌入是自然语言处理 (NLP) 中的一种技术, 将词汇映射到实数向量空间, 使得词汇之间的语义关系可以通过向量空间中的距离来表示.

Variational Auto-Encoders

变分自编码器 (VAEs) 是一种生成模型, 通过学习数据的潜在分布来生成新的数据:

Z=Encoder(X)X=Decoder(Z)L=Min Loss(X,X)\begin{split} Z&=\text{Encoder}(X) \\ X'&=\text{Decoder}(Z) \\ L&=\text{Min Loss}(X',X) \end{split}

变分自动编码器学习的是隐变量 (特征) ZZ 的概率分布, zN(0,I),xzN(μ(z),σ(z))z\sim N(0, I), x|z\sim N\big(\mu(z), \sigma(z)\big), 通过深度网络来学习 q(zx)q(z|x) 的参数, 一步步优化 qq 使其与 p(zx)p(z|x) 十分相似, 便可用它来对复杂的分布进行近似的推理:

  • Feature disentangle: voice conversion.
  • Discrete representation: unsupervised classification, unsupervised summarization.
  • Anomaly detection: face detection, fraud detection, disease detection, network intrusion detection.
  • Compression and decompression.
  • Generator.

Variational Auto-Encoders

Generative Adversarial Networks

生成对抗网络 (GANs) 由两个网络组成: 生成器 (Generator) 和判别器 (Discriminator). 生成器的目标是生成尽可能逼真的数据, 判别器的目标是尽可能准确地区分真实数据和生成数据. 两个网络相互对抗, 生成器生成数据 (decoder in VAE), 判别器判断数据真伪 (1/01/0 classification neural network), 生成器根据判别器的判断结果调整生成数据的策略, 不断提升生成数据的逼真程度.

G=argminGmaxDV(G,D)D=argmaxDV(D,G)\begin{split} G^*&=\arg\min_G\max_DV(G,D)\\ D^*&=\arg\max_DV(D,G) \end{split}

Generative Adversarial Networks

Self-supervised Learning

Pre-trained models + fine-tuning (downstream tasks):

  • Cross lingual.
  • Cross discipline.
  • Pre-training with artificial data.
  • Long context window.

Pre-trained Models

Pre-trained Data

  • Content filtering: 去除有害内容.
  • Text extraction: 去除 HTML 标签.
  • Quality filtering: 去除低质量内容.
  • Document deduplication: 去除重复内容.

BERT

BERT (Bidirectional Encoder Representations from Transformers) 是一种 Encoder-only 预训练模型, 通过大规模无监督学习, 学习文本的语义信息, 用于下游任务的微调:

  • Masked token prediction: 随机遮挡输入文本中的一些词, 预测被遮挡的词.
  • Next sentence prediction: 预测两个句子的顺序关系.

Bidirectional Encoder Representations from Transformers

GPT

GPT (Generative Pre-trained Transformers) 是一种 Decoder-only 预训练模型.

Fine-tuning

BERT Adapters

BERT Adapters

Low-Rank Adaptation

低秩适配 (LoRA) 是一种参数高效微调技术 (Parameter-efficient Fine-tuning), 其基本思想是冻结原始矩阵 W0RH×HW_0\in\mathbb{R}^{H\times{H}}, 通过低秩分解矩阵 ARH×RA\in\mathbb{R}^{H\times{R}}BRH×RB\in\mathbb{R}^{H\times{R}} 来近似参数更新矩阵 ΔW=ABT\Delta{W}=A\cdot{B^T}, 其中 RHR\ll{H} 是减小后的秩:

W=W0+ΔW=W0+ABT\begin{equation} W=W_0+\Delta{W}=W_0+A\cdot{B^T} \end{equation}

在微调期间, 原始的矩阵参数 W0W_0 不会被更新, 低秩分解矩阵 AABB 则是可训练参数用于适配下游任务. LoRA 微调在保证模型效果的同时, 能够显著降低模型训练的成本.

Instruction-tuning

Make model can understand human instructions not appear in training data:

Instruction-tuning

  • 提高指令复杂性和多样性能够促进模型性能的提升.
  • 更大的参数规模有助于提升模型的指令遵循能力.

Reinforcement Learning

强化学习是一种机器学习方法, 通过智能体与环境交互, 智能体根据环境的反馈调整策略, 利用梯度上升算法 (Gradient Ascent), 最大化长期奖励 (learn from rewards and mistakes).

Reinforcement Learning

θ=argmaxθRˉθ=argmaxθτR(τ)P(τθ)θt+1=θt+ηRˉθRˉθ=[Rˉθw1Rˉθw2Rˉθb1]Rt=n=tNγntrn\begin{equation} \begin{split} \theta^*&=\arg\max\limits_\theta\bar{R}_\theta=\arg\max\limits_\theta\sum\limits_{\tau}R(\tau)P(\tau|\theta)\\ \theta_{t+1}&=\theta_t+\eta\nabla\bar{R}_\theta\\ \nabla\bar{R}_\theta&=\begin{bmatrix}\frac{\partial\bar{R}_\theta}{\partial{w_1}}\\\frac{\partial\bar{R}_\theta}{\partial{w_2}}\\\vdots\\\frac{\partial\bar{R}_\theta}{\partial{b_1}}\\\vdots\end{bmatrix}\\ R_t&=\sum\limits_{n=t}^N\gamma^{n-t}r_n \end{split} \end{equation}

Actor-Critic Model

Actor-Critic Model

Inverse Reinforcement Learning

Inverse Reinforcement Learning

Multilayer Perceptron

Multilayer Perceptron

多层感知机是一种前馈神经网络 (Feedforward Neural Network) 就像是一个模拟大脑处理信息的过程, 通过多层处理 (输入层, 隐藏层, 输出层), 从原始数据中提取特征, 并做出预测或分类, 它通过调整内部连接权重来学习和改进其预测能力.

Linear Mapping

线性变换 Hl=WlXl1+BlH^l=W^lX^{l-1}+B^l:

  • wijlw_{ij}^l (weight): 第 ll 层第 ii 个节点与上一层第 jj 个节点连接的权重.
  • bilb_i^l (bias): 第 ll 层第 ii 个节点的偏置.
H=[w00lw01lw0nlw10lw11lw1nlwk0lwk1lwknl][x0l1x1l1xnl1]+[b0lb1lbkl]H=\begin{bmatrix} w_{00}^l & w_{01}^l & \dots & w_{0n}^l \\ w_{10}^l & w_{11}^l & \dots & w_{1n}^l \\ \vdots & \vdots & \ddots & \vdots \\ w_{k0}^l & w_{k1}^l & \dots & w_{kn}^l \end{bmatrix} \begin{bmatrix} x_0^{l-1} \\ x_1^{l-1} \\ \vdots \\ x_n^{l-1} \end{bmatrix} +\begin{bmatrix} b_0^l \\ b_1^l \\ \vdots \\ b_k^l \end{bmatrix}

Activation Function

激活函数 y=σ(H)y=\sigma(H), Xl=σ(Hl)X^l=\sigma(H^l):

  • 引入非线性特性, 使得网络可以学习和拟合复杂函数.
  • ReLU (Rectified Linear Unit, 线性整流单元): σ(H)=max(0,H)\sigma(H)=\max(0,H), 可以解决梯度消失问题 (越靠近输入层的神经元梯度越接近 0), 加速收敛.
  • Sigmoid: σ(H)=11+eH\sigma(H)=\frac{1}{1+e^{-H}}.
  • e.g 归一化函数, 使得输出值在 0 到 1 之间, 可以使得整个网络成为概率模型.

Activation Function

Loss Function

损失函数 L(y,y^)L(y,\hat{y}):

  • 用于衡量真实值(或人工标注值) yy 与模型预测值 y^\hat{y} 之间的差异.
  • 常见的损失函数有均方误差 (Mean Squared Error, MSE) 和交叉熵 (Cross Entropy).
    • 均方误差: L(y,y^)=1ni=1n(yiy^i)2L(y,\hat{y})=\frac{1}{n}\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2.
    • 交叉熵: L(y,y^)=i=1nyilog(y^i)L(y,\hat{y})=-\sum\limits_{i=1}^n{y_i\log(\hat{y}_i)}, 常用于 classification 任务.
    • Selective synaptic plasticity: L(θ)=L(θ)+λibi(θiθi)2L'(\theta)=L(\theta)+\lambda\sum\limits_ib_i(\theta_i'-\theta_i)^2.
Deep

H|\mathcal{H}| is the size of hypothesis space, larger H|\mathcal{H}| means deeper model:

Why Deep Learning?

Deep model need less neurons (parameters) to represent same function, means deep model has smaller H|\mathcal{H}|, flat/shallow model has larger H|\mathcal{H}|.

Learning

Learning is the process of minimizing loss function, finally find out the right weights and biases:

  • Early stopping.
  • Dropout.
  • Regularization.
  • New activation function.
  • Adaptive learning rate.

Gradient Descent

通过梯度下降算法, 优化损失函数, 使其最小化 (沿梯度下降方向, 调整 W 和 B):

θt+1=θtηL\begin{equation} \theta_{t+1}=\theta_t-\eta\nabla{L} \end{equation}

其中, η\eta 为学习率, LL 为损失函数, L\nabla{L} 为损失函数的梯度, θ\theta 为模型参数, tt 为迭代次数.

argminL(θ)=L(a,b)+L(a,b)θ1(θ1a)+L(a,b)θ2(θ2b)++1n!nL(a,b)θ1n(θ1a)n+1n!nL(a,b)θ2n(θ2b)nL(a,b)+L(a,b)θ1(θ1a)+L(a,b)θ2(θ2b)[θ1aθ2b]=η[L(a,b)θ1L(a,b)θ2][θ1θ2]=[ab]η[L(a,b)θ1L(a,b)θ2]\begin{split} \arg\min{L(\theta)}&=L(a,b)+\frac{\partial{L(a,b)}}{\partial{\theta_1}}(\theta_1-a)+\frac{\partial{L(a,b)}}{\partial{\theta_2}}(\theta_2-b)\\ &+\dots+\frac{1}{n!}\frac{\partial^n{L(a,b)}}{\partial{\theta_1^n}}(\theta_1-a)^n+\frac{1}{n!}\frac{\partial^n{L(a,b)}}{\partial{\theta_2^n}}(\theta_2-b)^n\\ &\approx{L(a,b)}+\frac{\partial{L(a,b)}}{\partial{\theta_1}}(\theta_1-a)+\frac{\partial{L(a,b)}}{\partial{\theta_2}}(\theta_2-b) \\ &\Rightarrow \begin{bmatrix}\theta_1-a\\\theta_2-b\end{bmatrix} =-\eta\begin{bmatrix}\frac{\partial{L(a,b)}}{\partial{\theta_1}}\\\frac{\partial{L(a,b)}}{\partial{\theta_2}}\end{bmatrix} \\ &\Rightarrow \begin{bmatrix}\theta_1 \\ \theta_2\end{bmatrix} =\begin{bmatrix}a\\b\end{bmatrix}-\eta\begin{bmatrix}\frac{\partial{L(a,b)}}{\partial{\theta_1}}\\\frac{\partial{L(a,b)}}{\partial{\theta_2}}\end{bmatrix} \end{split}

Gradient Descent

def convex_function(x):
return x**2

def gradient_descent(initial_x, learning_rate, num_iterations):
x_values = [initial_x]
y_values = [convex_function(initial_x)]

x = initial_x

for i in range(num_iterations):
gradient = 2 * x # 函数 f(x) = x^2 的导数为 f'(x) = 2x
x -= learning_rate * gradient

x_values.append(x)
y_values.append(convex_function(x))

return x_values, y_values

Gradient

一个优秀的梯度下降算法, 需要满足以下几个条件:

  • 高效性: 梯度下降算法的迭代次数尽可能少. 当距离最小值较远时, L0\nabla{L}\gg0, 当距离最小值较近时, L0\nabla{L}\to0, 这样的梯度下降算法可以更快地收敛. 反之, 当距离最小值较远时, L0\nabla{L}\to0, 这样的梯度下降算法更慢收敛.
  • 稳定性: 梯度下降算法的迭代过程尽可能稳定. Maxout 激活函数拟合能力非常强, 可以拟合任意的凸函数.
  • 鲁棒性: 梯度下降算法对于初始值的选择或者特定的线索片段不敏感. Dropout 策略 减少神经元之间复杂的共适应关系 (每个神经元有 p%p\% 概率不被激活), 迫使网络去学习更加鲁棒的特征, 缓解过拟合问题, 提高模型的泛化能力.

Learning Rate

必要时需要调整学习率, 使得梯度下降更快收敛:

  • 如果学习率过大, 可能会导致梯度下降不稳定, 甚至发散.
  • 如果学习率过小, 可能会导致梯度下降收敛速度过慢.

常见的学习率调整策略有:

  • Learning rate decay:
    • 阶梯衰减 (Step Decay): ηt=ηt+1\eta_t=\frac{\eta}{\sqrt{t+1}}.
    • 线性衰减 (Linear Decay): ηt=η(1tT)\eta_t=\eta(1-\frac{t}{T}).
    • 指数衰减 (Exponential Decay): ηt=ηekt\eta_t=\eta{e^{-kt}}.
    • 余弦衰减 (Cosine Decay): ηt=η1+cos(πtT)2\eta_t=\eta\frac{1+\cos(\frac{\pi{t}}{T})}{2}.
  • Warm up learning rate: increase and then decrease learning rate.
  • SGD with momentum: wt+1=wt+vt+1=wt+λvtηgtw_{t+1}=w_t+v_{t+1}=w_t+\lambda{v_t}-\eta{g_t}.
  • Adaptive learning rate:
    • AdaGrad: adaptive sub-gradient method, wt+1=wtηt+11t+1i=0tgi2gt=wtηi=0tgi2gtw_{t+1}=w_t-\frac{\frac{\eta}{\sqrt{t+1}}}{\sqrt{\frac{1}{t+1}\sum_{i=0}^t{g_i^2}}}g_t=w_t-\frac{\eta}{\sqrt{\sum_{i=0}^t{g_i^2}}}g_t.
    • RMSprop: root mean square propagation, wt+1=wtησtgt=wtηασt12+(1α)gt2gtw_{t+1}=w_t-\frac{\eta}{\sigma_t}g_t=w_t-\frac{\eta}{\sqrt{\alpha\sigma_{t-1}^2}+(1-\alpha)g_t^2}g_t,
    • Adam: adaptive moment estimation (Momentum + RMSprop).
    • RAdam: start with SGDM, then switch to Adam.

Critical Point

当遇到 L=0\nabla{L}=0 的情况时, 可能是以下几种情况:

  • 局部最小值 (Local Minimum).
  • 局部最大值 (Local Maximum).
  • 鞍点 (Saddle Point).

此时利用泰勒级数展开, 可以得到:

L(θ)L(θ0)+L(θ0)(θθ0)+12(θθ0)T2L(θ0)(θθ0)=L(θ0)+12(θθ0)T2L(θ0)(θθ0)\begin{split} L(\theta)&\approx{L(\theta_0)}+\nabla{L(\theta_0)}(\theta-\theta_0)+\frac{1}{2}(\theta-\theta_0)^T\nabla^2{L(\theta_0)}(\theta-\theta_0)\\ &=L(\theta_0)+\frac{1}{2}(\theta-\theta_0)^T\nabla^2{L(\theta_0)}(\theta-\theta_0) \end{split}

其中, 2L(θ0)\nabla^2{L(\theta_0)} 为 Hessian 矩阵, 为二阶导数矩阵:

  • 2L(θ0)\nabla^2{L(\theta_0)} 为正定矩阵时, θ0\theta_0 为局部最小值.
  • 2L(θ0)\nabla^2{L(\theta_0)} 为负定矩阵时, θ0\theta_0 为局部最大值.
  • 2L(θ0)\nabla^2{L(\theta_0)} 为不定矩阵时, θ0\theta_0 为鞍点.
Saddle Point

在高维空间中, 鞍点的数量远远多于局部最小值. 深度神经网络拥有大量的参数, 使得其损失函数的鞍点数量远远多于局部最小值.

Backpropagation

反向传播算法: 从最小化损失函数出发, 由输出层到输入层, 通过链式法则, 计算每一层的梯度, 从而更新权重和偏置.

Backpropagation

Derivative chain rule (链式法则):

Lwijl=Lxilxilzilzilwijl=Lxilσ(zil)zil(Xl1Wil+bil)wijl=δilσ(zil)xjl1Lbil=Lxilxilzilzilbil=Lxilσ(zil)zil(bil+WilXl1)bil=δilσ(zil)Lxjl1=i=0Nl1Lxilxilzilzilxjl1=i=0Nl1Lxilσ(zil)zil(WilXl1+bil)xjl1=i=0Nl1δilσ(zil)wijl\begin{split} \frac{\partial{L}}{\partial{w_{ij}^l}} &=\frac{\partial{L}}{\partial{x_i^l}}\cdot \frac{\partial{x_i^l}}{\partial{z_i^l}}\cdot \frac{\partial{z_i^l}}{\partial{w_{ij}^l}} \\ &=\frac{\partial{L}}{\partial{x_i^l}}\cdot \frac{\partial{\sigma(z_i^l)}}{\partial{z_i^l}}\cdot \frac{\partial(X^{l-1}W_i^l+b_i^l)}{\partial{w_{ij}^l}} \\ &=\delta_i^l\cdot\sigma'(z_i^l)\cdot{x_j^{l-1}} \\ \frac{\partial{L}}{\partial{b_i^l}} &=\frac{\partial{L}}{\partial{x_i^l}}\cdot \frac{\partial{x_i^l}}{\partial{z_i^l}}\cdot \frac{\partial{z_i^l}}{\partial{b_i^l}} \\ &=\frac{\partial{L}}{\partial{x_i^l}}\cdot \frac{\partial{\sigma(z_i^l)}}{\partial{z_i^l}}\cdot \frac{\partial(b_i^l+W_i^lX^{l-1})}{\partial{b_i^l}} \\ &=\delta_i^l\cdot\sigma'(z_i^l) \\ \frac{\partial{L}}{\partial{x_j^{l-1}}} &=\sum\limits_{i=0}^{N_l-1}\frac{\partial{L}}{\partial{x_i^l}}\cdot \frac{\partial{x_i^l}}{\partial{z_i^l}}\cdot \frac{\partial{z_i^l}}{\partial{x_j^{l-1}}} \\ &=\sum\limits_{i=0}^{N_l-1}\frac{\partial{L}}{\partial{x_i^l}}\cdot \frac{\partial{\sigma(z_i^l)}}{\partial{z_i^l}}\cdot \frac{\partial(W_i^lX^{l-1}+b_i^l)}{\partial{x_j^{l-1}}} \\ &=\sum\limits_{i=0}^{N_l-1}\delta_i^l\cdot\sigma'(z_i^l)\cdot{w_{ij}^l} \end{split}