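# A Comprehensive Introduction to Different Types of Convolutions in Deep Learning: Understanding Convolutions Intuitively Through Visualization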

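If you have heard of the different kinds of convolutions in Deep Learning (e.g. 2D / 3D / 1x1 / Transposed / Dilated (Atrous) / Spatially Separable / Depthwise Separable / Flattened / Grouped / Shuffled Grouped Convolution) and are confused about what they actually mean, this article aims to help you understand how they actually work.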

1. Convolution v.s. Cross-correlation

2. Convolution in Deep Learning (single channel version, multi-channel version)

3. 3D Convolution

4. 1 x 1 Convolution

5. Convolution Arithmetic

6. Transposed Convolution (Deconvolution, checkerboard artifacts)

7. Dilated Convolution (Atrous Convolution)

8. Separable Convolution (Spatially Separable Convolution, Depthwise Convolution)

9. Flattened Convolution

10. Grouped Convolution

11. Shuffled Grouped Convolution

12. Pointwise Grouped Convolution

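### 1. Convolution v.s. Cross-correlation

Convolution is a widely used technique in signal processing, image processing, and other engineering / science fields. In Deep Learning, a kind of model architecture, the Convolutional Neural Network (CNN), is named after this technique. However, convolution in deep learning is essentially the cross-correlation of signal / image processing. There is a subtle difference between these two operations.

Without diving too deep into details, here is the difference. In signal / image processing, the convolution of two functions f and g is defined as the integral of their product after one of them is reversed and shifted:

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

The following visualization demonstrates the idea.

Figure 1. Convolution in signal processing. Image adopted and edited from this link.

Here, function g is the filter. It is reversed and then slides along the horizontal axis. For every position, we calculate the area of the intersection between f and the reversed g. That intersection area is the convolution value at that specific position.

Cross-correlation, on the other hand, is known as the sliding dot product or sliding inner product of two functions. The filter in cross-correlation is not reversed. It slides directly through the function f. The intersection area between f and g is the cross-correlation. The figure below shows the difference between convolution and cross-correlation.

Figure 2. Difference between convolution and cross-correlation in signal processing. Image adopted and edited from Wikipedia.

In Deep Learning, the filters in convolution are not reversed. Strictly speaking, the operation is cross-correlation: we essentially perform element-wise multiplication and addition. Calling it convolution is simply a convention in Deep Learning. This is fine because the weights of the filters are learned during training. If the reversed function g in the example above is the right function, then after training the learned filter would look like the reversed function g. Thus, there is no need to reverse the filter before training as in true convolution.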
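The difference between the two operations can be checked numerically. The following is a minimal NumPy sketch, not taken from the original article: the signal `f`, the filter `g`, and the helper names are illustrative assumptions. It shows that a true convolution is a cross-correlation with the filter reversed.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical signal
g = np.array([1.0, 0.0, -1.0])            # hypothetical asymmetric filter, so reversal matters

def cross_correlate_valid(signal, kernel):
    """Slide the kernel over the signal without reversing it."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])

def convolve_valid(signal, kernel):
    """True convolution: reverse the kernel first, then slide it."""
    return cross_correlate_valid(signal, kernel[::-1])

xcorr = cross_correlate_valid(f, g)   # what deep learning frameworks call "convolution"
conv = convolve_valid(f, g)           # convolution in the signal-processing sense
```

For this asymmetric filter the two results differ only in sign, which is why a network that learns its filter weights can absorb the reversal.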

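### 2. Convolution in Deep Learning

The purpose of convolution is to extract useful features from the input. In image processing, there is a wide range of filters one could choose for convolution. Each type of filter helps to extract different aspects or features from the input image, e.g. horizontal / vertical / diagonal edges. Similarly, in a Convolutional Neural Network, different features are extracted through convolution using filters whose weights are automatically learned during training. All these extracted features are then “combined” to make decisions.

There are a few advantages of doing convolution, such as weight sharing and translation invariance. Convolution also takes the spatial relationship of pixels into consideration. These properties are especially helpful in many computer vision tasks, since those tasks often involve identifying objects where certain components have certain spatial relationships with other components.

#### 2.1 Convolution: the single channel version

In Deep Learning, convolution is element-wise multiplication and addition. For an image with 1 channel, the convolution is demonstrated in the figure below. Here the filter is a 3 x 3 matrix with elements [[0, 1, 2], [2, 2, 0], [0, 1, 2]]. The filter slides through the input. At each position, it performs element-wise multiplication and addition. Each sliding position ends up with one number. The final output is then a 3 x 3 matrix. (Notice that stride = 1 and padding = 0 in this example. These concepts are described in the arithmetic section below.)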
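The sliding multiply-and-add can be written out directly. This is a small NumPy sketch using the 3 x 3 filter from the text; the 5 x 5 input values are a made-up assumption, since the article's figure is not reproduced here.

```python
import numpy as np

def conv2d_single_channel(image, kernel):
    """Valid (padding = 0), stride-1 'convolution' in the deep learning sense (no kernel flip)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # element-wise multiplication with the current window, then summation
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

kernel = np.array([[0, 1, 2],
                   [2, 2, 0],
                   [0, 1, 2]])
image = np.arange(25).reshape(5, 5)   # hypothetical 5 x 5 single-channel input
output = conv2d_single_channel(image, kernel)   # 3 x 3 result
```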

#### 2.2 Convolution: the multi-channel version

Multi-channel data is common in deep learning. One example is the layers in a Convolutional Neural Network. A convolutional-net layer usually consists of multiple channels (typically hundreds of channels). Each channel describes different aspects of the previous layer. How do we make transitions between layers with different depths? How do we transform a layer with depth n into the following layer with depth m?

Before describing the process, we would like to clarify a few terms: layers, channels, feature maps, filters, and kernels. From a hierarchical point of view, the concepts of layers and filters are at the same level, while channels and kernels are one level below. Channels and feature maps are the same thing. A layer could have multiple channels (or feature maps): an input layer has 3 channels if the inputs are RGB images. “Channel” is usually used to describe the structure of a “layer”. Similarly, “kernel” is used to describe the structure of a “filter”.

The difference between filter and kernel is a bit tricky. Sometimes they are used interchangeably, which can create confusion, but the two terms differ subtly. A “kernel” refers to a 2D array of weights. The term “filter” is for 3D structures of multiple kernels stacked together. For a 2D filter, the filter is the same as the kernel. But for a 3D filter, and for most convolutions in deep learning, a filter is a collection of kernels. Each kernel is unique, emphasizing different aspects of the input channel.

With these concepts, the multi-channel convolution goes as follows. Each kernel of a filter is applied to one input channel of the previous layer to generate one intermediate channel. This is a kernel-wise process. We repeat this process for all kernels to generate multiple channels. These channels are then summed together to form one single output channel. The following illustration should make the process clearer.
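The kernel-wise process can be sketched in NumPy as follows. The shapes, the random inputs, and the function names are illustrative assumptions. The sketch also stacks the outputs of several filters, which is how a layer with a different depth is produced.

```python
import numpy as np

def conv2d_valid(channel, kernel):
    """Valid, stride-1 sliding multiply-and-add on one 2D channel."""
    kh, kw = kernel.shape
    out_h, out_w = channel.shape[0] - kh + 1, channel.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(channel[r:r + kh, c:c + kw] * kernel)
    return out

def apply_filter(layer, filt):
    """layer: (H, W, D_in); filt: (h, w, D_in). Returns a single 2D output channel."""
    # one kernel per input channel, then an element-wise sum over the results
    per_kernel = [conv2d_valid(layer[:, :, d], filt[:, :, d]) for d in range(layer.shape[2])]
    return np.sum(per_kernel, axis=0)

def conv_layer(layer, filters):
    """filters: (D_out, h, w, D_in). Each filter contributes one output channel."""
    return np.stack([apply_filter(layer, f) for f in filters], axis=-1)

rng = np.random.default_rng(0)
layer = rng.standard_normal((5, 5, 3))        # hypothetical 5 x 5 x 3 input layer
filters = rng.standard_normal((4, 3, 3, 3))   # 4 hypothetical filters of size 3 x 3 x 3
out = conv_layer(layer, filters)              # shape (3, 3, 4)
```

Summing the per-kernel results gives the same number as multiplying the whole 3D window by the 3D filter at once, which is the "sliding a 3D filter" view of the same operation.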

Here the input layer is a 5 x 5 x 3 matrix, with 3 channels. The filter is a 3 x 3 x 3 matrix. First, each of the kernels in the filter is applied to one of the three channels in the input layer, separately. Three convolutions are performed, which result in 3 channels of size 3 x 3.

Then these three channels are summed together (element-wise addition) to form one single channel (3 x 3 x 1). This channel is the result of convolving the input layer (5 x 5 x 3 matrix) with a filter (3 x 3 x 3 matrix).

Equivalently, we can think of this process as sliding a 3D filter matrix through the input layer. Notice that the input layer and the filter have the same depth (channel number = kernel number). The 3D filter moves only in 2 directions, the height and width of the image (that’s why such an operation is called 2D convolution although a 3D filter is used to process 3D volumetric data). At each sliding position, we perform element-wise multiplication and addition, which results in a single number. In the example shown below, the sliding is performed at 5 positions horizontally and 5 positions vertically. Overall, we get a single output channel.

Now we can see how one can make transitions between layers with different depths. Let’s say the input layer has Din channels, and we want the output layer to have Dout channels. All we need to do is apply Dout filters to the input layer. Each filter has Din kernels. Each filter provides one output channel. After applying Dout filters, we have Dout channels, which can then be stacked together to form the output layer.

### 3. 3D Convolution

In the last illustration of the previous section, we saw that we were actually performing convolution on a 3D volume. But typically, we still call that operation 2D convolution in Deep Learning. It’s a 2D convolution on 3D volumetric data. The filter depth is the same as the input layer depth. The 3D filter moves only in 2 directions (height & width of the image). The output of such an operation is a 2D image (with 1 channel only).

Naturally, there are 3D convolutions. They are the generalization of the 2D convolution. In 3D convolution, the filter depth is smaller than the input layer depth (kernel size smaller than channel size). As a result, the 3D filter can move in all 3 directions (height, width, and channel of the image). At each position, the element-wise multiplication and addition provide one number. Since the filter slides through a 3D space, the output numbers are arranged in a 3D space as well. The output is then 3D data.

Just as 2D convolutions encode spatial relationships of objects in a 2D domain, 3D convolutions can describe the spatial relationships of objects in 3D space. Such 3D relationships are important for some applications, such as 3D segmentation / reconstruction in biomedical imaging, e.g. CT and MRI, where objects such as blood vessels meander around in 3D space.
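A rough NumPy sketch of the idea, assuming stride 1 and no padding, with made-up volume and kernel sizes: because the kernel is shallower than the volume, it slides along the depth axis too and the output stays three-dimensional.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Valid, stride-1 3D 'convolution' (no kernel flip) over a (depth, height, width) volume."""
    kd, kh, kw = kernel.shape
    od = volume.shape[0] - kd + 1
    oh = volume.shape[1] - kh + 1
    ow = volume.shape[2] - kw + 1
    out = np.zeros((od, oh, ow))
    for z in range(od):
        for y in range(oh):
            for x in range(ow):
                out[z, y, x] = np.sum(volume[z:z + kd, y:y + kh, x:x + kw] * kernel)
    return out

volume = np.ones((6, 6, 6))                            # hypothetical 3D input
out_3d = conv3d_valid(volume, np.ones((3, 3, 3)))      # kernel shallower than the volume
out_2d = conv3d_valid(volume, np.ones((6, 3, 3)))      # kernel as deep as the volume
```

When the kernel depth equals the volume depth, the depth axis of the output collapses to 1, which recovers the 2D convolution on a 3D volume described above.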

### 4. 1 x 1 Convolution

Since we talked about depth-wise operation in the previous section of 3D convolution, let’s look at another interesting operation, 1 x 1 convolution.

You may wonder why this is helpful. Do we just multiply a number to every number in the input layer? Yes and No. The operation is trivial for layers with only one channel. There, we multiply every element by a number.

Things become interesting if the input layer has multiple channels. The following picture illustrates how 1 x 1 convolution works for an input layer with dimension H x W x D. After 1 x 1 convolution with filter size 1 x 1 x D, the output channel has dimension H x W x 1. If we apply N such 1 x 1 convolutions and then concatenate the results together, we get an output layer with dimension H x W x N.

Initially, 1 x 1 convolutions were proposed in the Network-in-network paper. They were then heavily used in the Google Inception paper. A few advantages of 1 x 1 convolutions are:

• Dimensionality reduction for efficient computations

• Efficient low dimensional embedding, or feature pooling

• Applying nonlinearity again after convolution

The first two advantages can be observed in the image above. After 1 x 1 convolution, we significantly reduce the dimension depth-wise. Say the original input has 200 channels; the 1 x 1 convolution will embed these channels (features) into a single channel. The third advantage comes in because, after the 1 x 1 convolution, a non-linear activation such as ReLU can be added. The non-linearity allows the network to learn more complex functions.
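Since a 1 x 1 filter only mixes channels at a single pixel, the whole operation reduces to one matrix product per pixel. The sketch below assumes a 200-channel input reduced to 16 channels; those sizes and the function name `conv1x1` are illustrative choices, not values from the papers.

```python
import numpy as np

def conv1x1(layer, weights, relu=True):
    """layer: (H, W, D); weights: (D, N), i.e. one 1 x 1 x D filter per column."""
    out = layer @ weights                 # mixes the D channels independently at every pixel
    return np.maximum(out, 0.0) if relu else out   # optional ReLU non-linearity

rng = np.random.default_rng(0)
layer = rng.standard_normal((8, 8, 200))     # hypothetical H x W x D input
weights = rng.standard_normal((200, 16))     # N = 16 filters of size 1 x 1 x 200
reduced = conv1x1(layer, weights)            # shape (8, 8, 16)
```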

The Google Inception paper describes these advantages as follows:

“One big problem with the above modules, at least in this naïve form, is that even a modest number of 5x5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters.

This leads to the second idea of the proposed architecture: judiciously applying dimension reductions and projections wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low dimensional embeddings might contain a lot of information about a relatively large image patch… That is, 1 x 1 convolutions are used to compute reductions before the expensive 3 x 3 and 5 x 5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation which makes them dual-purpose.”

One interesting perspective regarding 1 x 1 convolution comes from Yann LeCun: “In Convolutional Nets, there is no such thing as “fully-connected layers”. There are only convolution layers with 1x1 convolution kernels and a full connection table.”

### 5. Convolution Arithmetic

We now know how to deal with depth in convolution. Let’s move on to talk about how to handle the convolution in the other two directions (height & width), as well as important convolution arithmetic.

Here are a few terms:

Kernel size: the kernel is discussed in the previous section. The kernel size defines the field of view of the convolution.

Stride: it defines the step size of the kernel when sliding through the image. A stride of 1 means that the kernel slides through the image pixel by pixel. A stride of 2 means that the kernel slides through the image by moving 2 pixels per step (i.e., skipping 1 pixel). We can use a stride (>= 2) for downsampling an image.

Padding: the padding defines how the border of an image is handled. A padded convolution (‘same’ padding in TensorFlow) keeps the spatial output dimensions equal to those of the input image at stride 1, by padding zeros around the input boundaries if necessary. On the other hand, an unpadded convolution (‘valid’ padding in TensorFlow) only performs convolution on the pixels of the input image, without adding zeros around the input boundaries. The output size is then smaller than the input size.
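These three quantities combine into the usual output-size rule, o = floor((i + 2p - k) / s) + 1, for an input of size i, kernel size k, padding p and stride s along one axis. A small sketch, with helper names chosen here for illustration:

```python
def conv_output_size(i, k, p=0, s=1):
    """Output length along one spatial axis: floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1

def same_padding(k):
    """Padding that preserves the size at stride 1, assuming an odd kernel size k."""
    return (k - 1) // 2

print(conv_output_size(5, 3))             # unpadded 3 x 3 kernel on a 5 x 5 input
print(conv_output_size(5, 3, p=1, s=1))   # kernel 3, stride 1, padding 1 keeps the size
print(conv_output_size(5, 3, p=1, s=2))   # stride 2 downsamples
```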

The following illustration describes a 2D convolution using a kernel size of 3, stride of 1 and padding of 1.

For an input image with size of i, kernel size of k, padding of p, and stride of s, the output image from convolution has size o:

$$o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1$$

### 6. Transposed Convolution (Deconvolution)

For many applications and in many network architectures, we often want to do transformations going in the opposite direction of a normal convolution, i.e. we’d like to perform up-sampling. A few examples include generating high-resolution images and mapping a low dimensional feature map to a high dimensional space, such as in an auto-encoder or in semantic segmentation. (In the latter example, semantic segmentation first extracts feature maps in the encoder and then restores the original image size in the decoder so that it can classify every pixel in the original image.)

Traditionally, one could achieve up-sampling by applying interpolation schemes or manually creating rules. Modern architectures such as neural networks, on the other hand, tend to let the network itself learn the proper transformation automatically, without human intervention. To achieve that, we can use the transposed convolution.

The transposed convolution is also known as deconvolution, or fractionally strided convolution, in the literature. However, it’s worth noting that the name “deconvolution” is less appropriate, since transposed convolution is not the real deconvolution as defined in signal / image processing. Technically speaking, deconvolution in signal processing reverses the convolution operation. That is not the case here. Because of that, some authors are strongly against calling transposed convolution deconvolution. People call it deconvolution mainly for simplicity. Later, we will see why calling such an operation transposed convolution is natural and more appropriate.

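It is always possible to implement a transposed convolution with a direct convolution. For the example in the image below, we apply transposed convolution with a 3 x 3 kernel over a 2 x 2 input padded with a 2 x 2 border of zeros using unit strides. The up-sampled output has size 4 x 4 (left).

Interestingly enough, one can map the same 2 x 2 input image to a different image size by applying fancy padding & stride. Below, transposed convolution is applied over the same 2 x 2 input (with 1 zero inserted between inputs) padded with a 2 x 2 border of zeros using unit strides. Now the output has size 5 x 5 (right).

Viewing transposed convolution in the examples above helps us build up some intuition. But to generalize its application, it is beneficial to look at how it is implemented through matrix multiplication in a computer. From there, we can also see why “transposed convolution” is an appropriate name.

In convolution, let us define C as our kernel, Large as the input image, and Small as the output image from convolution. After the convolution (matrix multiplication), we down-sample the large image into a small output image. The implementation of convolution as a matrix multiplication follows C x Large = Small.

The following example shows how such an operation works. It flattens the input to a 16 x 1 matrix and transforms the kernel into a sparse matrix (4 x 16). The matrix multiplication is then applied between the sparse matrix and the flattened input. After that, the resulting matrix (4 x 1) is transformed back to a 2 x 2 output.

Now, if we multiply a 4 x 1 Small image by the transpose of the matrix C, which has shape 16 x 4, we obtain a 16 x 1 vector that reshapes to a 4 x 4 image: CT x Small = Large, as shown below. Note that this restores the shape of the large image, not its original values: C multiplied by its transpose is not the identity matrix in general, so the transposed convolution is not an exact inverse of the convolution.

As you can see here, we perform up-sampling from a small image to a large image. That is what we want to achieve. And now, you can also see where the name “transposed convolution” comes from.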
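The matrix view can be reproduced in a few lines of NumPy. In this sketch the kernel and image values are arbitrary assumptions and `conv_matrix` is a hypothetical helper. It builds the 4 x 16 matrix C, down-samples with C, up-samples with its transpose, and checks that the same up-sampling can be obtained from a direct convolution over a zero-padded input.

```python
import numpy as np

def conv_matrix(kernel, in_size):
    """Sparse matrix C such that C @ image.ravel() is the valid, stride-1 convolution."""
    k = kernel.shape[0]
    out_size = in_size - k + 1
    C = np.zeros((out_size * out_size, in_size * in_size))
    for r in range(out_size):
        for c in range(out_size):
            row = r * out_size + c
            for a in range(k):
                for b in range(k):
                    C[row, (r + a) * in_size + (c + b)] = kernel[a, b]
    return C

kernel = np.arange(1.0, 10.0).reshape(3, 3)    # arbitrary 3 x 3 kernel
large = np.arange(16.0).reshape(4, 4)          # arbitrary 4 x 4 input
C = conv_matrix(kernel, in_size=4)             # shape (4, 16)

small = (C @ large.ravel()).reshape(2, 2)          # convolution: 4 x 4 -> 2 x 2
upsampled = (C.T @ small.ravel()).reshape(4, 4)    # transposed convolution: 2 x 2 -> 4 x 4

# The same up-sampling as a direct convolution: pad the 2 x 2 input with a
# border of 2 zeros and slide the kernel rotated by 180 degrees over it.
padded = np.pad(small, 2)
flipped = kernel[::-1, ::-1]
direct = np.array([[np.sum(padded[i:i + 3, j:j + 3] * flipped) for j in range(4)]
                   for i in range(4)])
```

The up-sampled image has the shape of the original input but different values, which illustrates why the operation is a transposed convolution rather than a true deconvolution.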

The general arithmetic for transposed convolution can be found from Relationship 13 and Relationship 14 in this excellent article (“A guide to convolution arithmetic for deep learning”).

#### 6.1. Checkerboard artifacts

One unpleasant behavior that people observe when using transposed convolution is the so-called checkerboard artifacts. Checkerboard artifacts result from “uneven overlap” of transposed convolution. Such overlap puts more of the metaphorical paint in some places than others.

In the image below, the layer on the top is the input layer, and the layer on the bottom is the output layer after transposed convolution. During transposed convolution, a layer with small size is mapped to a layer with larger size.

In example (a), the stride is 1 and the filter size is 2. As outlined in red, the first pixel on the input maps to the first and second pixels on the output. As outlined in green, the second pixel on the input maps to the second and third pixels on the output. The second pixel on the output receives information from both the first and the second pixels on the input. Overall, the pixels in the middle portion of the output receive the same amount of information from the input. There is a region where the kernels overlap. As the filter size is increased to 3 in example (b), the center portion that receives the most information shrinks. But this may not be a big deal, since the overlap is still even. The pixels in the center portion of the output receive the same amount of information from the input.

Now for the example below, we change stride = 2. In example (a), where filter size = 2, all pixels on the output receive the same amount of information from the input. They all receive information from a single pixel on the input. There is no overlap of transposed convolution here. If we change the filter size to 4 in example (b), the evenly overlapped region shrinks. But still, one can use the center portion of the output as the valid output, where each pixel receives the same amount of information from the input.

However, things become interesting if we change the filter size to 3 and 5 in examples (c) and (d). For these two cases, every pixel on the output receives a different amount of information compared to its adjacent pixels. One cannot find a continuous and evenly overlapped region on the output.

The transposed convolution has uneven overlap when the filter size is not divisible by the stride. This “uneven overlap” puts more of the paint in some places than others, and thus creates the checkerboard effects. In fact, the unevenly overlapped region tends to be more extreme in two dimensions. There, the two patterns are multiplied together, so the unevenness gets squared.
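The overlap pattern is easy to count in one dimension. The helper below is an illustrative sketch (the function name and the choice of 4 input pixels are assumptions): it records how many input pixels contribute to each output pixel of a 1D transposed convolution.

```python
def overlap_counts(n_inputs, kernel_size, stride):
    """Number of input pixels that contribute to each output pixel (1D transposed convolution)."""
    out_len = (n_inputs - 1) * stride + kernel_size
    counts = [0] * out_len
    for i in range(n_inputs):
        for j in range(kernel_size):
            counts[i * stride + j] += 1   # input pixel i paints outputs i*stride .. i*stride + k - 1
    return counts

print(overlap_counts(4, 2, 2))   # [1, 1, 1, 1, 1, 1, 1, 1]        no overlap
print(overlap_counts(4, 4, 2))   # [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]  even overlap in the center
print(overlap_counts(4, 3, 2))   # [1, 1, 2, 1, 2, 1, 2, 1, 1]     alternating, i.e. uneven
```

With stride 2, the filter sizes 2 and 4 give a flat center region, while size 3 alternates between 1 and 2 contributions, which is the one-dimensional version of the checkerboard pattern.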

There are two things one could do to reduce such artifacts while applying transposed convolution. First, make sure you use a filter size that is divisible by your stride, avoiding the overlap issue. Secondly, one can use transposed convolution with stride = 1, which helps to reduce the checkerboard effects. However, artifacts can still leak through, as seen in many recent models.

The paper further proposed a better up-sampling approach: resize the image first (using nearest-neighbor interpolation or bilinear interpolation) and then do a convolutional layer. By doing that, the authors avoid the checkerboard effects. You may want to try it for your applications.

### 7. Dilated Convolution (Atrous Convolution)

Dilated convolution was introduced in the paper (link) and the paper “Multi-scale context aggregation by dilated convolutions” (link).

This is the standard discrete convolution:

$$(F * k)(\mathbf{p}) = \sum_{\mathbf{s} + \mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})$$

The dilated convolution follows:

$$(F *_{l} k)(\mathbf{p}) = \sum_{\mathbf{s} + l\,\mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})$$

When l = 1, the dilated convolution becomes the standard convolution.

Intuitively, dilated convolutions “inflate” the kernel by inserting spaces between the kernel elements. This additional parameter l (dilation rate) indicates how much we want to widen the kernel. Implementations may vary, but there are usually l-1 spaces inserted between kernel elements. The following image shows the kernel size when l = 1, 2, and 4.

In the image, the 3 x 3 red dots indicate that after the convolution, the output image has 3 x 3 pixels. Although all three dilated convolutions provide output with the same dimension, the receptive field observed by the model is dramatically different. The receptive field is 3 x 3 for l = 1. It is 7 x 7 for l = 2, and it increases to 15 x 15 for l = 4. These sizes assume the three layers are applied one after another, as in the figure of the paper; a single 3 x 3 kernel with dilation rate l spans (2l + 1) x (2l + 1) input pixels on its own. Interestingly, the numbers of parameters associated with these operations are essentially identical. We “observe” a large receptive field without adding additional costs. Because of that, dilated convolution is used to cheaply increase the receptive field of output units without increasing the kernel size, which is especially effective when multiple dilated convolutions are stacked one after another.

The authors in the paper “Multi-scale context aggregation by dilated convolutions” build a network out of multiple layers of dilated convolutions, where the dilation rate l increases exponentially at each layer. As a result, the effective receptive field grows exponentially while the number of parameters grows only linearly with layers!
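The growth of the receptive field can be sketched in one dimension. The code below is an illustration under stated assumptions: stride 1, no padding, no kernel flip (the deep learning convention), and made-up signal values. The helper names are not from the paper.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1D dilated 'convolution': kernel taps are spaced `dilation` samples apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # input extent covered by one kernel application
    n_out = len(signal) - span + 1
    return np.array([
        sum(signal[i + j * dilation] * kernel[j] for j in range(k))
        for i in range(n_out)
    ])

def stacked_receptive_field(kernel_size, dilations):
    """Receptive field per axis of stride-1 layers applied one after another."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d     # each layer widens the field by (k - 1) * dilation
    return field

x = np.arange(10.0)
k = np.array([1.0, 2.0, 3.0])
y = dilated_conv1d(x, k, dilation=2)       # same 3 weights, wider span

print(stacked_receptive_field(3, [1]))         # 3
print(stacked_receptive_field(3, [1, 2]))      # 7
print(stacked_receptive_field(3, [1, 2, 4]))   # 15
```

Doubling the dilation rate at each layer roughly doubles the receptive field, while every layer still has only three weights per axis.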

The dilated convolution in the paper is used to systematically aggregate multi-scale contextual information without losing resolution. The paper shows that the proposed module increases the accuracy of state-of-the-art semantic segmentation systems at that time (2016). Please check out the paper for more information.

### 8. Separable Convolutions

Separable Convolutions are used in some neural net architectures, such as the MobileNet (Link). One can perform separable convolution spatially (spatially separable convolution) or depthwise (depthwise separable convolution).

#### 8.1. Spatially Separable Convolutions

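Spatially separable convolution operates on the 2D spatial dimensions of images, i.e. height and width. Conceptually, spatially separable convolution decomposes a convolution into two separate operations. For the example shown below, a Sobel kernel, which is a 3 x 3 kernel, is divided into a 3 x 1 and a 1 x 3 kernel.

In convolution, the 3 x 3 kernel directly convolves with the image. In spatially separable convolution, the 3 x 1 kernel first convolves with the image. Then the 1 x 3 kernel is applied. This requires 6 instead of 9 parameters while doing the same operation.

Moreover, one needs fewer multiplications in spatially separable convolution than in convolution. For a concrete example, convolution on a 5 x 5 image with a 3 x 3 kernel (stride = 1, padding = 0) requires scanning the kernel at 3 positions horizontally and 3 positions vertically. That is 9 positions in total, indicated as the dots in the image below. At each position, 9 element-wise multiplications are applied. Overall, that is 9 x 9 = 81 multiplications.

On the other hand, for spatially separable convolution, we first apply a 3 x 1 filter on the 5 x 5 image. We scan such a kernel at 5 positions horizontally and 3 positions vertically. That is 5 x 3 = 15 positions in total, indicated as dots on the image below. At each position, 3 element-wise multiplications are applied. That is 15 x 3 = 45 multiplications. We now obtain a 3 x 5 matrix. This matrix is then convolved with a 1 x 3 kernel, which scans the matrix at 3 positions horizontally and 3 positions vertically. For each of these 9 positions, 3 element-wise multiplications are applied. This step requires 9 x 3 = 27 multiplications. Thus, overall, the spatially separable convolution takes 45 + 27 = 72 multiplications, which is less than the 81 of the standard convolution.

Figure 23. Spatially separable convolution with 1 channel.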
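That the two-step version computes the same result can be verified numerically. This NumPy sketch uses one common form of the Sobel kernel and an arbitrary 5 x 5 image, both assumptions for illustration.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Valid, stride-1 sliding multiply-and-add (no kernel flip)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

column = np.array([[1.0], [2.0], [1.0]])     # 3 x 1 kernel
row = np.array([[-1.0, 0.0, 1.0]])           # 1 x 3 kernel
sobel = column @ row                         # their outer product is a 3 x 3 Sobel kernel

image = np.arange(25.0).reshape(5, 5) ** 2   # arbitrary 5 x 5 input
direct = correlate2d_valid(image, sobel)                 # one 3 x 3 pass
intermediate = correlate2d_valid(image, column)          # 3 x 1 pass gives a 3 x 5 matrix
separable = correlate2d_valid(intermediate, row)         # 1 x 3 pass gives a 3 x 3 matrix
```

The equality holds only because the Sobel kernel is an outer product of a column and a row, i.e. a rank-1 matrix.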

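Let’s generalize the above examples a little bit. Let’s say we now apply convolutions on an N x N image with an m x m kernel, with stride = 1 and padding = 0. Traditional convolution requires (N-m+1) x (N-m+1) x m x m multiplications. Spatially separable convolution requires N x (N-m+1) x m + (N-m+1) x (N-m+1) x m = (2N-m+1) x (N-m+1) x m multiplications. For the 3 x 3 kernel of the example above, N-m+1 = N-2. The ratio of computation costs between spatially separable convolution and the standard convolution is

$$\frac{(2N-m+1)(N-m+1)\,m}{(N-m+1)^2\,m^2} = \frac{2}{m} + \frac{m-1}{m\,(N-m+1)}$$

For layers where the image size N is much larger than the filter size m (N >> m), this ratio becomes 2 / m. It means that in this asymptotic situation (N >> m), the computational cost of spatially separable convolution is 2/3 of the standard convolution for a 3 x 3 filter. It is 2/5 for a 5 x 5 filter, 2/7 for a 7 x 7 filter, and so on.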
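The multiplication counts can be tabulated with a short script. This sketch assumes stride 1, no padding, and N - m + 1 output positions per axis (which equals N - 2 for a 3 x 3 kernel); the function names are illustrative.

```python
from fractions import Fraction

def standard_mults(n, m):
    """Multiplications for an m x m kernel on an n x n image (stride 1, no padding)."""
    out = n - m + 1
    return out * out * m * m

def separable_mults(n, m):
    """Multiplications for an m x 1 pass followed by a 1 x m pass."""
    out = n - m + 1
    first = n * out * m       # m x 1 kernel: n positions across, out positions down
    second = out * out * m    # 1 x m kernel over the out x n intermediate result
    return first + second

def cost_ratio(n, m):
    return Fraction(separable_mults(n, m), standard_mults(n, m))

print(standard_mults(5, 3), separable_mults(5, 3))   # 81 72
print(cost_ratio(5, 3))                              # 8/9
print(float(cost_ratio(1000, 3)))                    # close to 2/3 for large images
```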

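Although spatially separable convolutions save cost, they are rarely used in Deep Learning. One of the main reasons is that not all kernels can be divided into two smaller kernels. If we replaced all traditional convolutions with spatially separable convolutions, we would restrict ourselves from searching over all possible kernels during training. The training results may be sub-optimal.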

#### 8.2. Depthwise Separable Convolutions

Now, let’s move on to depthwise separable convolutions, which are much more commonly used in deep learning (e.g. in MobileNet and Xception). Depthwise separable convolution consists of two steps: depthwise convolution and 1 x 1 convolution.

Before describing these steps, it is worth revisiting the 2D convolution and 1 x 1 convolution we talked about in the previous sections. Let’s have a quick recap of standard 2D convolution. For a concrete example, let’s say the input layer is of size 7 x 7 x 3 (height x width x channels), and the filter is of size 3 x 3 x 3. After the 2D convolution with one filter, the output layer is of size 5 x 5 x 1 (with only 1 channel).

Typically, multiple filters are applied between two neural net layers. Let’s say we have 128 filters here. After applying these 128 2D convolutions, we have 128 output maps of size 5 x 5 x 1. We then stack these maps into a single layer of size 5 x 5 x 128. By doing that, we transform the input layer (7 x 7 x 3) into the output layer (5 x 5 x 128). The spatial dimensions, i.e. height and width, shrink, while the depth is extended.

Now, with depthwise separable convolutions, let’s see how we can achieve the same transformation.

First, we apply depthwise convolution to the input layer. Instead of using a single filter of size 3 x 3 x 3 as in 2D convolution, we use 3 kernels separately. Each kernel has size 3 x 3 x 1. Each kernel convolves with 1 channel of the input layer (1 channel only, not all channels!). Each such convolution provides a map of size 5 x 5 x 1. We then stack these maps together to create a 5 x 5 x 3 image. After this, we have an output of size 5 x 5 x 3. We have now shrunk the spatial dimensions, but the depth is still the same as before.

As the second step of depthwise separable convolution, to extend the depth, we apply the 1 x 1 convolution with kernel size 1 x 1 x 3. Convolving the 5 x 5 x 3 input image with each 1 x 1 x 3 kernel provides a map of size 5 x 5 x 1.

Thus, after applying 128 1 x 1 convolutions, we have a layer of size 5 x 5 x 128.

With these two steps, depthwise separable convolution also transforms the input layer (7 x 7 x 3) into the output layer (5 x 5 x 128). The overall process of depthwise separable convolution is shown in the figure below.

So, what’s the advantage of doing depthwise separable convolutions? Efficiency! One needs far fewer operations for depthwise separable convolutions than for 2D convolutions.

Let’s recall the computation costs for our example of 2D convolutions. There are 128 3x3x3 kernels that move 5x5 times. That is 128 x 3 x 3 x 3 x 5 x 5 = 86,400 multiplications.

How about the separable convolution? In the first depthwise convolution step, there are three 3x3x1 kernels that each move 5x5 times. That is 3x3x3x1x5x5 = 675 multiplications. In the second step of 1 x 1 convolution, there are 128 1x1x3 kernels that each move 5x5 times. That is 128 x 1 x 1 x 3 x 5 x 5 = 9,600 multiplications. Thus, overall, the depthwise separable convolution takes 675 + 9,600 = 10,275 multiplications. This is only about 12% of the cost of the 2D convolution!

So, for an image with arbitrary size, how much time can we save if we apply depthwise separable convolution? Let’s generalize the above examples a little bit. Now, for an input image of size H x W x D, we want to do 2D convolution (stride = 1, padding = 0) with Nc filters of size h x h x D. This transforms the input layer (H x W x D) into the output layer ((H-h+1) x (W-h+1) x Nc). The overall number of multiplications needed is: Nc x h x h x D x (H-h+1) x (W-h+1).

On the other hand, for the same transformation, the number of multiplications needed for depthwise separable convolution is

D x h x h x 1 x (H-h+1) x (W-h+1) + Nc x 1 x 1 x D x (H-h+1) x (W-h+1) = (h x h + Nc) x D x (H-h+1) x (W-h+1)

The ratio of multiplications between depthwise separable convolution and 2D convolution is now:

$$\frac{(h^2 + N_c)\, D\, (H-h+1)(W-h+1)}{N_c\, h^2\, D\, (H-h+1)(W-h+1)} = \frac{1}{N_c} + \frac{1}{h^2}$$

For most modern architectures, it is common that the output layer has many channels, e.g. several hundred if not several thousand. For such layers (Nc >> h), the above expression reduces to 1 / h^2. It means that in this asymptotic regime, if 3 x 3 filters are used, 2D convolutions require about 9 times as many multiplications as depthwise separable convolutions. For 5 x 5 filters, 2D convolutions require about 25 times as many multiplications.
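The two steps and the multiplication counts can be put together in a short NumPy sketch. The random weights and the function names are illustrative assumptions; the shapes follow the 7 x 7 x 3 to 5 x 5 x 128 example above, with stride 1 and no padding.

```python
import numpy as np

def depthwise_conv(layer, kernels):
    """layer: (H, W, D); kernels: (h, h, D), one h x h kernel per input channel."""
    h = kernels.shape[0]
    out_h, out_w = layer.shape[0] - h + 1, layer.shape[1] - h + 1
    out = np.zeros((out_h, out_w, layer.shape[2]))
    for r in range(out_h):
        for c in range(out_w):
            # each channel is multiplied only with its own kernel; channels are not mixed
            out[r, c, :] = np.sum(layer[r:r + h, c:c + h, :] * kernels, axis=(0, 1))
    return out

def depthwise_separable_conv(layer, kernels, pointwise):
    """pointwise: (D, Nc) weights of the Nc filters of size 1 x 1 x D."""
    return depthwise_conv(layer, kernels) @ pointwise

def standard_mults(H, W, D, h, Nc):
    return Nc * h * h * D * (H - h + 1) * (W - h + 1)

def separable_mults(H, W, D, h, Nc):
    positions = (H - h + 1) * (W - h + 1)
    return D * h * h * positions + Nc * D * positions

rng = np.random.default_rng(0)
out = depthwise_separable_conv(rng.standard_normal((7, 7, 3)),     # input layer
                               rng.standard_normal((3, 3, 3)),     # 3 depthwise kernels
                               rng.standard_normal((3, 128)))      # 128 pointwise filters
print(out.shape)                           # (5, 5, 128)
print(standard_mults(7, 7, 3, 3, 128))     # 86400
print(separable_mults(7, 7, 3, 3, 128))    # 10275
```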

Are there any drawbacks of using depthwise separable convolutions? Sure, there are. Depthwise separable convolutions reduce the number of parameters in the convolution. As such, for a small model, the model capacity may decrease significantly if the 2D convolutions are replaced by depthwise separable convolutions. As a result, the model may become sub-optimal. However, if properly used, depthwise separable convolutions can give you the efficiency without dramatically damaging your model performance.

### 9. Flattened convolutions

The flattened convolution was introduced in the paper “Flattened convolutional neural networks for feedforward acceleration”. Intuitively, the idea is to apply filter separation. Instead of applying one standard convolution filter to map the input layer to an output layer, we separate this standard filter into three 1D filters. This idea is similar to that of the spatially separable convolution described above, where a spatial filter is approximated by two rank-1 filters.

One should notice that if the standard convolution filter is a rank-1 filter, such a filter can always be separated into cross-products of three 1D filters. But this is a strong condition, and the intrinsic rank of the standard filter is higher than one in practice. As pointed out in the paper, “As the difficulty of classification problem increases, the more number of leading components is required to solve the problem… Learned filters in deep networks have distributed eigenvalues and applying the separation directly to the filters results in significant information loss.”

To alleviate this problem, the paper restricts connections in receptive fields so that the model can learn 1D separated filters during training. The paper claims that flattened networks, which consist of consecutive sequences of 1D filters across all directions in 3D space, provide performance comparable to standard convolutional networks at much lower computation cost, due to the significant reduction in learned parameters.

### 10. Grouped Convolution

Grouped convolution was introduced in the AlexNet paper (link) in 2012. The main reason for implementing it was to allow network training over two GPUs with limited memory (3 GB of memory per GPU). The AlexNet below shows two separate convolution paths at most of the layers. It’s doing model parallelization across two GPUs (of course one can do multi-GPU parallelization if more GPUs are available).

Here we describe how grouped convolutions work. First of all, conventional 2D convolutions follow the steps shown below. In this example, the input layer of size (7 x 7 x 3) is transformed into the output layer of size (5 x 5 x 128) by applying 128 filters (each filter is of size 3 x 3 x 3). Or in the general case, the input layer of size (Hin x Win x Din) is transformed into the output layer of size (Hout x Wout x Dout) by applying Dout filters (each of size h x w x Din).

In grouped convolution, the filters are separated into different groups. Each group is responsible for a conventional 2D convolution with a certain depth. The following examples can make this clearer.

Above is the illustration of grouped convolution with 2 filter groups. In each filter group, the depth of each filter is only half of that in the nominal 2D convolution. They are of depth Din / 2. Each filter group contains Dout / 2 filters. The first filter group (red) convolves with the first half of the input layer ([:, :, 0:Din/2]), while the second filter group (blue) convolves with the second half of the input layer ([:, :, Din/2:Din]). As a result, each filter group creates Dout/2 channels. Overall, two groups create 2 x Dout/2 = Dout channels. We then stack these channels in the output layer with Dout channels.
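A grouped convolution can be sketched as several ordinary convolutions over channel slices whose outputs are concatenated. The sizes below (Din = 4, Dout = 6, 2 groups) and the function names are illustrative assumptions, with stride 1 and no padding.

```python
import numpy as np

def conv_layer(layer, filters):
    """layer: (H, W, D_in); filters: (D_out, h, w, D_in). Valid, stride 1."""
    d_out, h, w, _ = filters.shape
    out_h, out_w = layer.shape[0] - h + 1, layer.shape[1] - w + 1
    out = np.zeros((out_h, out_w, d_out))
    for r in range(out_h):
        for c in range(out_w):
            patch = layer[r:r + h, c:c + w, :]
            out[r, c, :] = np.sum(filters * patch, axis=(1, 2, 3))
    return out

def grouped_conv(layer, filter_groups):
    """filter_groups: list of G arrays, each of shape (D_out / G, h, w, D_in / G)."""
    groups = len(filter_groups)
    depth = layer.shape[2] // groups
    # each filter group only sees its own slice of the input channels
    outputs = [conv_layer(layer[:, :, g * depth:(g + 1) * depth], f)
               for g, f in enumerate(filter_groups)]
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
layer = rng.standard_normal((7, 7, 4))                                   # D_in = 4
filter_groups = [rng.standard_normal((3, 3, 3, 2)) for _ in range(2)]    # D_out = 6
out = grouped_conv(layer, filter_groups)                                 # shape (5, 5, 6)
```

Changing the second half of the input channels leaves the first three output channels untouched, which is exactly the isolation between groups described above.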

#### 10.1. Grouped convolution v.s. depthwise convolution

You may already observe some linkage and difference between grouped convolution and the depthwise convolution used in the depthwise separable convolution. If the number of filter groups is the same as the input layer channel, each filter is of depth Din / Din = 1. This is the same filter depth as in depthwise convolution.

On the other hand, each filter group now contains Dout / Din filters. Overall, the output layer is of depth Dout. This is different from that in depthwise convolution, which does not change the layer depth. The layer depth is extended later by 1x1 convolution in the depthwise separable convolution.

There are a few advantages of doing grouped convolution.

The first advantage is efficient training. Since the convolutions are divided into several paths, each path can be handled separately by a different GPU. This procedure allows the model to be trained over multiple GPUs in a parallel fashion. Such model parallelization over multiple GPUs allows more images to be fed into the network per step, compared to training everything on one GPU. Model parallelization is considered to be better than data parallelization. The latter splits the dataset into batches, and then we train on each batch. However, when the batch size becomes too small, we are essentially doing stochastic rather than batch gradient descent. This would result in slower and sometimes poorer convergence.

Grouped convolutions become important for training very deep neural nets, as in the ResNeXt shown below.

The second advantage is that the model is more efficient, i.e. the number of model parameters decreases as the number of filter groups increases. In the previous examples, the filters have h x w x Din x Dout parameters in a nominal 2D convolution. The filters in a grouped convolution with 2 filter groups have (h x w x Din/2 x Dout/2) x 2 parameters. The number of parameters is reduced by half.
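The parameter saving generalizes to any number of groups G: each of the G groups holds h x w x (Din / G) x (Dout / G) weights, so the total shrinks by a factor of G. A small sketch, with layer sizes chosen only for illustration and biases ignored:

```python
def grouped_conv_params(h, w, d_in, d_out, groups=1):
    """Number of filter weights (no biases) in a grouped convolution."""
    if d_in % groups or d_out % groups:
        raise ValueError("groups must divide both d_in and d_out")
    return groups * h * w * (d_in // groups) * (d_out // groups)

print(grouped_conv_params(3, 3, 64, 128))             # 73728, nominal 2D convolution
print(grouped_conv_params(3, 3, 64, 128, groups=2))   # 36864, half as many
print(grouped_conv_params(3, 3, 64, 128, groups=64))  # 1152, one input channel per group
```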

The third advantage is a bit surprising. Grouped convolution may provide a better model than a nominal 2D convolution. This another fantastic blog (link) explains it. Here is a brief summary.

The reason links to the sparse filter relationship. The image below is the correlation across filters of adjacent layers. The relationship is sparse. 图 33. The correlation matrix between filters of adjacent layers in a Network-in-Network model trained on CIFAR10. Pairs of highly correlated filters are brighter, while lower correlated filters are darker. The image is adopted from this article.

How about the correlation map for grouped convolution? The image above is the correlation across filters of adjacent layers, when the model is trained with 1, 2, 4, 8, and 16 filter groups. The article proposed one reasoning (link): “The effect of filter groups is to learn with a block-diagonal structured sparsity on the channel dimension… the filters with high correlation are learned in a more structured way in the networks with filter groups. In effect, filter relationships that don’t have to be learned are on longer parameterized. In reducing the number of parameters in the network in this salient way, it is not as easy to over-fit, and hence a regularization-like effect allows the optimizer to learn more accurate, more efficient deep networks.”

In addition, each filter group learns a unique representation of the data. As noticed by the authors of AlexNet, filter groups appear to structure the learned filters into two distinct groups: black-and-white filters and color filters.

### 11. Shuffled Grouped Convolution

Shuffled grouped convolution was introduced in ShuffleNet from Megvii Inc (Face++). ShuffleNet is a computation-efficient convolution architecture, designed specifically for mobile devices with very limited computing power (e.g. 10–150 MFLOPs).

The ideas behind the shuffled grouped convolution are linked to the ideas behind grouped convolution (used in MobileNet and ResNeXt, for example) and depthwise separable convolution (used in Xception).

Overall, the shuffled grouped convolution involves grouped convolution and channel shuffling.

In the section about grouped convolution, we learned that the filters are separated into different groups. Each group is responsible for a conventional 2D convolution with a certain depth, so the total number of operations is significantly reduced. For example, in the figure below, we have 3 filter groups. The first filter group convolves with the red portion of the input layer. Similarly, the second and third filter groups convolve with the green and blue portions of the input. The kernel depth in each filter group is only 1/3 of the total channel count in the input layer. In this example, after the first grouped convolution GConv1, the input layer is mapped to the intermediate feature map. This feature map is then mapped to the output layer through the second grouped convolution GConv2.

Grouped convolution is computationally efficient. But the problem is that each filter group only handles information passed down from a fixed portion of the previous layers. For example, in the image above, the first filter group (red) only processes information that is passed down from the first 1/3 of the input channels. The last filter group (blue) only processes information that is passed down from the last 1/3 of the input channels. As such, each filter group is limited to learning only a few specific features. This property blocks information flow between channel groups and weakens representations during training. To overcome this problem, we apply the channel shuffle.
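The isolation between groups can be made concrete by tracking, for each channel, which original input channels it depends on. The sketch below is a simplified dependency model rather than a real convolution: it assumes every output channel of a group depends on all input channels of that group and that the channel count stays the same from layer to layer. The helper name `grouped_conv_sources` is hypothetical.

```python
def grouped_conv_sources(sources, groups):
    """Propagate channel dependencies through one grouped convolution.

    `sources` holds one set per channel, listing the original input channels
    that channel depends on. Channels are split into contiguous groups.
    """
    size = len(sources) // groups
    out = []
    for g in range(groups):
        # A group's outputs depend on everything its own inputs depend on.
        merged = set().union(*sources[g * size:(g + 1) * size])
        out.extend(set(merged) for _ in range(size))
    return out


channels, groups = 9, 3
x = [{c} for c in range(channels)]          # each input channel depends on itself
after_gconv1 = grouped_conv_sources(x, groups)
after_gconv2 = grouped_conv_sources(after_gconv1, groups)

print(sorted(after_gconv2[0]))  # [0, 1, 2]: the first group never sees channels 3..8
print(sorted(after_gconv2[8]))  # [6, 7, 8]: the last group never sees channels 0..5
```

Even after two stacked grouped convolutions, each output channel still depends on only one third of the input channels.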

The idea of channel shuffle is that we want to mix up the information from different filter groups. In the image below, we get the feature map after applying the first grouped convolution GConv1 with 3 filter groups. Before feeding this feature map into the second grouped convolution, we first divide the channels in each group into several subgroups. Then we mix up these subgroups. After such shuffling, we continue performing the second grouped convolution GConv2 as usual. But now, since the information in the shuffled layer has already been mixed, we essentially feed each group in GConv2 with different subgroups from the feature map layer (or the input layer). As a result, we allow information to flow between channel groups and strengthen the representations.
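One common way to realize the shuffle is to view the channels as a (groups x channels-per-group) grid, transpose it, and flatten it again. The sketch below does this on a plain list of channel labels so the reordering is easy to read; the function name `channel_shuffle` and the labels are made up for illustration, and a real implementation would apply the same reordering to the channel axis of a tensor.

```python
def channel_shuffle(channels, groups):
    """Reorder channels so that every new group draws from every old group.

    Equivalent to reshaping to (groups, n), transposing, and flattening.
    """
    n, remainder = divmod(len(channels), groups)
    if remainder:
        raise ValueError("channel count must be divisible by groups")
    rows = [channels[g * n:(g + 1) * n] for g in range(groups)]
    # Read the grid column by column instead of row by row.
    return [rows[g][i] for i in range(n) for g in range(groups)]


# Output of GConv1 with 3 groups (red, green, blue), 3 channels each.
feature_map = ["r0", "r1", "r2", "g0", "g1", "g2", "b0", "b1", "b2"]
shuffled = channel_shuffle(feature_map, groups=3)
print(shuffled)
```

After the shuffle, the first group seen by GConv2 contains one channel from each of the red, green, and blue groups, so information from all three groups reaches every filter group.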

### 12. Pointwise grouped convolution

The ShuffleNet paper (link) also introduced the pointwise grouped convolution. Typically for grouped convolution such as in MobileNet (link) or ResNeXt (link), the group operation is performed on the 3x3 spatial convolution, but not on 1 x 1 convolution.

The ShuffleNet paper argues that 1 x 1 convolutions are also computationally costly. It suggests applying group convolution to the 1 x 1 convolutions as well. The pointwise grouped convolution, as the name suggests, performs group operations for 1 x 1 convolutions. The operation is identical to grouped convolution, with only one modification: it is performed on 1x1 filters instead of NxN filters (N>1).
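A rough multiply-add count shows why the 1 x 1 convolutions are worth grouping. The sketch below assumes stride 1, ignores biases, and uses hypothetical sizes (a 28 x 28 feature map with 240 channels and 3 groups); the function name `conv_mult_adds` is not from the paper.

```python
def conv_mult_adds(h_out, w_out, k, d_in, d_out, groups=1):
    """Multiply-adds of a k x k convolution producing an h_out x w_out x d_out map."""
    # Each output value sums over a k x k window of d_in/groups input channels.
    return h_out * w_out * k * k * (d_in // groups) * d_out


h = w = 28
c = 240  # hypothetical channel count

depthwise = conv_mult_adds(h, w, 3, c, c, groups=c)          # 3x3 depthwise
pointwise = conv_mult_adds(h, w, 1, c, c)                    # ordinary 1x1
grouped_pointwise = conv_mult_adds(h, w, 1, c, c, groups=3)  # 1x1 with 3 groups

print(f"3x3 depthwise:        {depthwise}")
print(f"1x1 pointwise:        {pointwise}")
print(f"1x1 pointwise, g = 3: {grouped_pointwise}")
```

Under these assumptions the ordinary 1 x 1 convolution costs far more than the 3 x 3 depthwise convolution next to it, and grouping it with g groups divides its cost by g.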

In the ShuffleNet paper, the authors utilized three types of convolutions we have learned: (1) shuffled grouped convolution; (2) pointwise grouped convolution; and (3) depthwise separable convolution. Such an architecture design significantly reduces the computation cost while maintaining accuracy. For example, the classification errors of ShuffleNet and AlexNet are comparable on actual mobile devices. However, the computation cost has been dramatically reduced from 720 MFLOPs in AlexNet down to 40–140 MFLOPs in ShuffleNet. With a relatively small computation cost and good model performance, ShuffleNet gained popularity in the field of convolutional neural nets for mobile devices.

【References】

https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d

https://wap.sciencenet.cn/blog-3428464-1281828.html

