
d2l Chapter 4: Multilayer Perceptron Basics

news 2026/7/10 13:19:39

Chapter 4: Multilayer Perceptron (MLP)

First, consider a simple nested "two-layer" linear regression. Let

\[\mathbf{X} \in \mathbb{R}^{n \times d} \]

denote a minibatch of $n$ examples, each with $d$ input features, and let $\mathbf{H} \in \mathbb{R}^{n \times h}$ denote the hidden-layer variable.
With $\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}$, $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}$, $\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}$, and $\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}$,
we have:

\[\begin{split}\begin{aligned}\mathbf{H} & = \mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}, \\\mathbf{O} & = \mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}. \end{aligned}\end{split}\]

Clearly, however, the extra layer gains nothing, because:

\[\mathbf{O} = (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})\mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W} + \mathbf{b} \]

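The two layers can therefore be replaced exactly by a single affine layer $\mathbf{X}\mathbf{W} + \mathbf{b}$ with $\mathbf{W} = \mathbf{W}^{(1)}\mathbf{W}^{(2)}$ and $\mathbf{b} = \mathbf{b}^{(1)}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}$, which also avoids the extra computation.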
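The collapse can be checked numerically. The sketch below uses only the Python standard library; the helper names `matmul` and `add_bias` and the small example matrices are illustrative assumptions, not part of the original notes.

```python
# Check that two stacked affine layers equal one affine layer.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_bias(m, bias):
    """Add a 1 x q bias row to every row of m (broadcasting)."""
    return [[v + bias[0][j] for j, v in enumerate(row)] for row in m]

# n = 2 examples, d = 3 features, h = 2 hidden units, q = 1 output
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
b1 = [[0.5, -0.5]]
W2 = [[2.0], [3.0]]
b2 = [[1.0]]

# Two-layer computation without an activation function
H = add_bias(matmul(X, W1), b1)
O_two_layer = add_bias(matmul(H, W2), b2)

# Equivalent single layer: W = W1 W2, b = b1 W2 + b2
W = matmul(W1, W2)
b = add_bias(matmul(b1, W2), b2)
O_one_layer = add_bias(matmul(X, W), b)

print(O_two_layer, O_one_layer)
```

Both computations give the same output matrix, so the hidden layer adds parameters but no expressive power.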

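To unlock the real power of multiple layers, we apply a nonlinear activation function $\sigma$ to each hidden unit after the affine transformation.
We set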

\[\begin{split}\begin{aligned}\mathbf{H} & = \sigma(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}), \\\mathbf{O} & = \mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}.\\ \end{aligned}\end{split} \]

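The outputs of an activation function (for example, $\sigma(\cdot)$) are called activations.
In general, once an activation function is in place, the multilayer perceptron can no longer be collapsed into a linear model.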
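As a small illustration of why the nonlinearity matters, the sketch below hand-picks weights for a one-input MLP with two ReLU hidden units so that it computes $|x|$, which no single affine map $wx + b$ can represent. The function name `tiny_mlp` and the chosen weights are assumptions made for this example.

```python
def relu(x):
    """ReLU activation for a scalar input."""
    return max(x, 0.0)

def tiny_mlp(x):
    # Hidden layer: weights [1, -1], biases [0, 0], followed by ReLU
    h = [relu(1.0 * x), relu(-1.0 * x)]
    # Output layer: weights [1, 1], bias 0
    return 1.0 * h[0] + 1.0 * h[1]

# Any affine function f satisfies f(-1) + f(1) = 2 * f(0).
# The network output violates this, so it is not affine.
left, mid, right = tiny_mlp(-1.0), tiny_mlp(0.0), tiny_mlp(1.0)
print(left, mid, right)
```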

Common activation functions

ReLU function:

$\operatorname{ReLU}(x) = \max(x, 0)$
At $x=0$, where ReLU is not differentiable, the derivative is taken to be 0.
There is also a parameterized variant: $\operatorname{pReLU}(x) = \max(0, x) + \alpha \min(0, x)$
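The definitions above translate directly into scalar code. This is a minimal standard-library sketch; the function names `relu_grad` and `prelu` are chosen here for illustration.

```python
def relu(x):
    """ReLU(x) = max(x, 0)."""
    return max(x, 0.0)

def relu_grad(x):
    """Derivative of ReLU, using the convention that it is 0 at x = 0."""
    return 1.0 if x > 0 else 0.0

def prelu(x, alpha):
    """pReLU(x) = max(0, x) + alpha * min(0, x)."""
    return max(0.0, x) + alpha * min(0.0, x)

for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_grad(x), prelu(x, alpha=0.1))
```

With $\alpha = 0$ the variant reduces to plain ReLU, while a small positive $\alpha$ lets some signal through for negative inputs.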

sigmoid function:

It maps any real input to an output in the interval $(0, 1)$: $$\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$$
Its derivative is: $$\frac{d}{dx} \operatorname{sigmoid}(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \operatorname{sigmoid}(x)\left(1-\operatorname{sigmoid}(x)\right)$$
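The closed-form derivative can be sanity-checked against a central finite difference. The helper `numeric_grad` and the step size `eps` are assumptions of this sketch, and the plain `math.exp(-x)` form is only suitable for moderate inputs.

```python
import math

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Closed-form derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid_grad(x), numeric_grad(sigmoid, x))
```

The derivative peaks at $x = 0$ with value $0.25$ and shrinks toward 0 as $|x|$ grows.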

tanh function:

The hyperbolic tangent squashes its input into the interval $(-1,1)$:

\[\operatorname{tanh}(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}\]

In addition, tanh and sigmoid are related by: $$\operatorname{tanh}(x) + 1 = 2 \operatorname{sigmoid}(2x)$$
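A quick numerical check of this identity, again as a standard-library sketch. The function name `tanh_via_exp` and the sample points are illustrative choices.

```python
import math

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_exp(x):
    """tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x))."""
    e = math.exp(-2.0 * x)
    return (1.0 - e) / (1.0 + e)

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    lhs = tanh_via_exp(x) + 1.0
    rhs = 2.0 * sigmoid(2.0 * x)
    # The two sides agree up to floating-point rounding
    print(x, lhs, rhs, math.isclose(lhs, rhs, rel_tol=1e-12, abs_tol=1e-12))
```

In other words, tanh is a sigmoid rescaled to the range $(-1, 1)$ and stretched horizontally by a factor of 2.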

A simple model (using d2l functions)

import torch
from torch import nn
from d2l import torch as d2l

# One hidden layer with 256 units; inputs are flattened 28x28 images
net = nn.Sequential(nn.Flatten(),
                    nn.Linear(784, 256),
                    nn.ReLU(),
                    nn.Linear(256, 10))

def init_weights(m):
    # Initialize the weights of every linear layer from N(0, 0.01^2)
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights)

batch_size, lr, num_epochs = 256, 0.1, 10
# reduction='none' returns per-example losses, as d2l.train_ch3 expects
loss = nn.CrossEntropyLoss(reduction='none')
trainer = torch.optim.SGD(net.parameters(), lr=lr)

train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
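To get a feel for the size of this model, the learnable parameters of the two `nn.Linear` layers can be counted by hand: each layer contributes one weight per input-output pair plus one bias per output. The helper `count_params` below is a hypothetical name used only for this arithmetic sketch.

```python
# Layer widths of the model above: 784 inputs, 256 hidden units, 10 outputs
layer_sizes = [784, 256, 10]

def count_params(sizes):
    """Sum weights (d_in * d_out) and biases (d_out) over consecutive layers."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(sizes, sizes[1:]))

hidden_params = 784 * 256 + 256   # first linear layer
output_params = 256 * 10 + 10     # second linear layer
total = count_params(layer_sizes)
print(hidden_params, output_params, total)
```

Almost all of the parameters sit in the first layer, because the flattened input is much wider than the hidden layer.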
