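Parameter Count and Computational Cost of Convolutional Layers

This article covers the parameter count and computational cost of standard convolution, grouped convolution, and depthwise separable convolution in deep learning.

Notation first. Let the input feature map of a convolutional layer have shape $B \times C_{in}\times H_{in} \times W_{in}$ and the output feature map have shape $B \times C_{out} \times H_{out} \times W_{out}$, where $B$ is the batch size, $C_{in}$ and $C_{out}$ are the numbers of input and output channels, $H_{in}$ and $H_{out}$ are the input and output heights, and $W_{in}$ and $W_{out}$ are the input and output widths. $C_{out}$ is a hyperparameter of the layer, while $H_{out}$ and $W_{out}$ are determined by the input size together with the other hyperparameters (padding size, kernel size, stride, and dilation rate). Since this article focuses on parameter count and computational cost, that relationship is not covered here.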

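1. Parameter count and computational cost of standard convolution

Let the kernel size of a standard convolutional layer be $K_{h} \times K_{w}$.

1.1. Parameter count of standard convolution

Without bias: $C_{in} \times C_{out} \times K_h \times K_w$

With bias: $C_{in} \times C_{out} \times K_h \times K_w + C_{out} = C_{out} \times (C_{in} \times K_h \times K_w + 1)$

The bias contributes only one parameter for every $C_{in} \times K_h \times K_w$ weights, a very small share, so it is ignored from here on. Define: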

$$Params_{conv} = C_{in} \times C_{out} \times K_h \times K_w$$
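As a quick sanity check of the two formulas above, here is a minimal Python sketch. The function name `conv_params` and the example layer (64 input channels, 128 output channels, $3 \times 3$ kernel) are illustrative assumptions, not taken from any particular network.

```python
def conv_params(c_in, c_out, k_h, k_w, bias=False):
    """Parameter count of a standard convolutional layer."""
    weights = c_in * c_out * k_h * k_w
    # One bias value per output channel, if the layer has a bias.
    return weights + (c_out if bias else 0)


# Hypothetical layer: 64 -> 128 channels, 3x3 kernel.
without_bias = conv_params(64, 128, 3, 3)
with_bias = conv_params(64, 128, 3, 3, bias=True)
bias_share = (with_bias - without_bias) / with_bias
print(without_bias, with_bias, f"{bias_share:.4%}")
```

For this assumed layer the bias adds 128 parameters to 73728 weights, which illustrates why the bias can be neglected.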

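1.2. Computational cost of standard convolution

Each output element is a dot product over $C_{in} \times K_h \times K_w$ terms, which takes that many multiplications and one fewer additions; a bias adds one more addition per output element.

Without bias:

Multiplications: $B \times C_{in} \times K_h \times K_w\times C_{out} \times H_{out} \times W_{out}$
Additions: $B \times(C_{in} \times K_h \times K_w - 1 )\times C_{out} \times H_{out} \times W_{out}$

With bias:

Multiplications: $B \times C_{in} \times K_h \times K_w\times C_{out} \times H_{out} \times W_{out}$
Additions: $B \times C_{in} \times K_h \times K_w \times C_{out} \times H_{out} \times W_{out}$

The addition count is nearly identical to the multiplication count, so the multiplication count is taken as the measure of computational cost. Here $FLOPS$ denotes a number of operations, not operations per second. Define: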

$$FLOPS_{conv} = B \times C_{in} \times K_h \times K_w \times C_{out} \times H_{out} \times W_{out}$$
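The counts in 1.2 can be reproduced with a short Python sketch. The function name `conv_op_counts` and the example sizes are assumptions made for illustration only.

```python
def conv_op_counts(b, c_in, c_out, k_h, k_w, h_out, w_out, bias=False):
    """Return (multiplications, additions) for a standard convolution."""
    outputs = b * c_out * h_out * w_out   # number of output elements
    taps = c_in * k_h * k_w               # terms in each dot product
    mults = outputs * taps
    # Summing `taps` products needs taps - 1 additions; a bias adds one more.
    adds = outputs * (taps if bias else taps - 1)
    return mults, adds


# Hypothetical layer: batch 1, 64 -> 128 channels, 3x3 kernel, 56x56 output.
mults, adds = conv_op_counts(1, 64, 128, 3, 3, 56, 56)
print(mults, adds, adds / mults)
```

The ratio printed last is close to 1, which is why counting only multiplications loses little information.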

1.3. Relationship between the parameter count and computational cost of standard convolution

$$\frac{FLOPS_{conv}}{Params_{conv}} = B \times H_{out} \times W_{out}$$
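Each weight is used once per output position and per sample. Hence $FLOPS_{conv}$ depends not only on $Params_{conv}$ but also on $B$, $H_{out}$, and $W_{out}$, that is, on the batch size and on the height and width of the output feature map.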

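2. Parameter count and computational cost of grouped convolution

Grouped convolution (group convolution) splits the input and output channels into $G$ groups. Each group performs a standard convolution independently (so $G$ must divide both $C_{in}$ and $C_{out}$), and the outputs of all groups are finally concatenated along the channel dimension.

2.1. Parameter count of grouped convolution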

$$\begin{aligned} Params_{gc} &= \frac{C_{in}}{G} \times \frac{C_{out}}{G} \times K_h \times K_w \times G \\ &= C_{in} \times C_{out} \times K_h \times K_w / G \\ \end{aligned}$$

2.2. Computational cost of grouped convolution

$$\begin{aligned} FLOPS_{gc} &= B \times \frac{C_{in}}{G} \times K_h \times K_w \times \frac{C_{out}}{G} \times H_{out} \times W_{out} \times G\\ &= B \times C_{in} \times K_h \times K_w \times C_{out} \times H_{out} \times W_{out} / G \end{aligned}$$

2.3. Relationship between the parameter count and computational cost of grouped convolution

$$\frac{FLOPS_{gc}}{Params_{gc}} = B \times H_{out} \times W_{out}$$

This is the same result as in 1.3.
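The grouped-convolution formulas of section 2 can be checked with the following Python sketch, which counts per group and then multiplies by $G$, mirroring the derivations. The function names and the example sizes are hypothetical.

```python
def _check_groups(c_in, c_out, groups):
    # G must divide both channel counts, as required in section 2.
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both c_in and c_out")


def grouped_conv_params(c_in, c_out, k_h, k_w, groups):
    """Weight count of a grouped convolution (bias ignored)."""
    _check_groups(c_in, c_out, groups)
    per_group = (c_in // groups) * (c_out // groups) * k_h * k_w
    return per_group * groups


def grouped_conv_flops(b, c_in, c_out, k_h, k_w, h_out, w_out, groups):
    """Multiplication count of a grouped convolution (bias ignored)."""
    _check_groups(c_in, c_out, groups)
    per_group = (b * (c_in // groups) * k_h * k_w
                 * (c_out // groups) * h_out * w_out)
    return per_group * groups


# Hypothetical layer: batch 1, 64 -> 128 channels, 3x3 kernel, 56x56 output, G = 4.
params = grouped_conv_params(64, 128, 3, 3, groups=4)
flops = grouped_conv_flops(1, 64, 128, 3, 3, 56, 56, groups=4)
print(params, flops, flops // params)
```

With `groups=1` both functions reduce to the standard-convolution counts, and `flops // params` equals $B \times H_{out} \times W_{out}$ for any valid $G$.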

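3. Parameter count and computational cost of depthwise separable convolution

Depthwise separable convolution (DSC for short) consists of a depthwise convolution followed by a pointwise convolution. The depthwise convolution is simply a grouped convolution whose input channels, output channels, and group count are all equal (here all equal to $C_{in}$); the pointwise convolution is simply a standard convolution with a $1\times 1$ kernel that maps $C_{in}$ channels to $C_{out}$ channels.

3.1. Parameter count of depthwise separable convolution

Parameter count of the depthwise convolution: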

$$\begin{aligned} Params_{dc} &= C_{in} \times C_{in} \times K_h \times K_w / C_{in} \\ &= C_{in} \times K_h \times K_w \end{aligned}$$

Parameter count of the pointwise convolution:

$$\begin{aligned} Params_{pc} &= C_{in} \times C_{out} \times 1 \times 1 \\ &= C_{in} \times C_{out} \end{aligned}$$

Parameter count of the depthwise separable convolution:

$$\begin{aligned} Params_{dsc} &= Params_{dc} + Params_{pc}\\ &= C_{in} \times K_h \times K_w + C_{in} \times C_{out} \\ &= C_{in} \times (K_h \times K_w + C_{out}) \end{aligned}$$
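To make the saving concrete, the sketch below compares the depthwise separable parameter count with that of a standard convolution of the same shape. The function name `dsc_params` and the layer sizes are assumptions chosen for illustration.

```python
def dsc_params(c_in, c_out, k_h, k_w):
    """Return (depthwise, pointwise) weight counts, bias ignored."""
    depthwise = c_in * k_h * k_w   # one K_h x K_w filter per input channel
    pointwise = c_in * c_out       # 1x1 convolution that mixes channels
    return depthwise, pointwise


# Hypothetical layer: 64 -> 128 channels, 3x3 kernel.
dw, pw = dsc_params(64, 128, 3, 3)
standard = 64 * 128 * 3 * 3
print(dw, pw, dw + pw, standard, standard / (dw + pw))
```

For these assumed sizes almost all of the parameters sit in the pointwise convolution.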

3.2. Computational cost of depthwise separable convolution

Computational cost of the depthwise convolution:

$$\begin{aligned} FLOPS_{dc} &= B \times C_{in} \times K_h \times K_w \times C_{in} \times H_{out} \times W_{out} / C_{in} \\ &= B \times C_{in} \times K_h \times K_w \times H_{out} \times W_{out} \end{aligned}$$

Computational cost of the pointwise convolution:

$$\begin{aligned} FLOPS_{pc} &= B \times C_{in} \times 1 \times 1 \times C_{out} \times H_{out} \times W_{out} \\ &= B \times C_{in} \times C_{out} \times H_{out} \times W_{out} \end{aligned}$$

Computational cost of the depthwise separable convolution:

$$\begin{aligned} FLOPS_{dsc} &= FLOPS_{dc} + FLOPS_{pc} \\ &= B \times C_{in} \times K_h \times K_w \times H_{out} \times W_{out} + B \times C_{in} \times C_{out} \times H_{out} \times W_{out} \\ &= B \times C_{in} \times H_{out} \times W_{out} \times (K_h \times K_w + C_{out}) \end{aligned}$$
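The same kind of check works for the computational cost. The following Python sketch evaluates the depthwise and pointwise terms separately; the function name `dsc_flops` and the example sizes are hypothetical.

```python
def dsc_flops(b, c_in, c_out, k_h, k_w, h_out, w_out):
    """Return (depthwise, pointwise) multiplication counts, bias ignored."""
    positions = b * h_out * w_out              # output positions over the batch
    depthwise = positions * c_in * k_h * k_w   # each channel filtered on its own
    pointwise = positions * c_in * c_out       # 1x1 channel mixing
    return depthwise, pointwise


# Hypothetical layer: batch 1, 64 -> 128 channels, 3x3 kernel, 56x56 output.
dw, pw = dsc_flops(1, 64, 128, 3, 3, 56, 56)
total = dw + pw
print(dw, pw, total)
```

The total equals $B \times C_{in} \times H_{out} \times W_{out} \times (K_h \times K_w + C_{out})$, matching the factored form above.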

3.3. Relationships between the computational costs

The ratio of the depthwise cost to the pointwise cost is:

$$\frac{FLOPS_{dc}}{FLOPS_{pc}} = \frac{K_h \times K_w}{C_{out}}$$

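Since $C_{out} > K_h \times K_w$ usually holds, $FLOPS_{pc} > FLOPS_{dc}$; that is, most of the computation typically falls on the pointwise convolution.

The ratio of the depthwise separable cost to the standard convolution cost is: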

$$\frac{FLOPS_{dsc}}{FLOPS_{conv} } = \frac{1}{C_{out}} + \frac{1}{K_h \times K_w}$$

Usually $C_{out} \gg K_h \times K_w$, so the $1 / C_{out}$ term is negligible and:

$$\frac{FLOPS_{dsc}}{FLOPS_{conv} } \approx \frac{1}{K_h \times K_w}$$

This agrees with the following statement in the paper "Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation" (2018):

"Effectively depthwise separable convolution reduces computation compared to traditional layers by almost a factor of $k^2$. MobileNetV2 uses $k = 3$ ($3 \times 3$ depthwise separable convolutions) so the computational cost is 8 to 9 times smaller than that of standard convolutions at only a small reduction in accuracy [26]."
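The exact ratio $1 / C_{out} + 1 / (K_h \times K_w)$ can be tabulated to see how quickly it approaches $1 / (K_h \times K_w)$. This is a minimal Python sketch; the function name `dsc_to_conv_ratio` and the list of channel counts are assumptions chosen for illustration.

```python
def dsc_to_conv_ratio(c_out, k_h, k_w):
    """Exact FLOPS_dsc / FLOPS_conv from section 3.3."""
    return 1 / c_out + 1 / (k_h * k_w)


# 3x3 kernels with a few hypothetical output channel counts.
for c_out in (16, 64, 256, 1024):
    ratio = dsc_to_conv_ratio(c_out, 3, 3)
    print(f"C_out={c_out:5d}  ratio={ratio:.4f}  reduction={1 / ratio:.2f}x")
```

For a $3 \times 3$ kernel the reduction factor is always below 9 and approaches 9 as $C_{out}$ grows, so small channel counts give noticeably less than the asymptotic saving.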

Changelog

20220211, first draft
20220312, published