Convolutional Neural Network Explained
Abstract
This post explains in detail what a convolutional neural network (CNN) is and how such networks are structured and built.
Moreover, it contains a step-by-step guide to implementing a CNN on a public dataset in PyTorch, a machine learning framework for the Python programming language. Furthermore, it explains why a CNN is much more efficient at image processing than a standard neural network, and it examines the different layers that perform feature extraction, which is crucial to the high performance of convolutional neural networks. This paper aims to explain the ideas behind this technology as well as to show how to implement it.
Keywords: edge detection, augmentation, image classification, convolutions, machine learning
Some argue that machine learning will be more revolutionary than the discovery of electricity. This is, of course, a bold statement; nevertheless, the doors that artificial intelligence opens are immense, particularly in the fields of image processing and visual computing. To maximize the performance of image recognition with machine learning, Convolutional Neural Networks, a specific type of neural network, are currently the cutting-edge technology, providing an innovative way to recognize everything from a vertical line to a sophisticated object.
In the present work, I explain the math and the ideas behind CNNs and give a step-by-step practical guide to implementing a basic Convolutional Neural Network from scratch in Python. A basic understanding of machine learning is highly recommended: without background knowledge of simple machine learning theory, not every hyperparameter and step in the practical example will be clear. A brief description of how a neural network operates is included below.
Convolution operations and fundamental edge detection
A computer sees an image as a set of numbers: one value per pixel for a grayscale image, three per pixel for RGB. In machine learning, this set of numbers is represented as a matrix. The matrix below shows a 6×6 grayscale image with a vertical line:
For a machine to recognize this vertical line, filters can be useful. By applying a certain filter, every vertical line in an image can be easily recognized. In Machine Learning, those filters are also represented as matrices.
To apply the above filter on an image, there is an operation called convolution. In math, the star symbol * is used to mark a convolution operation.
To now recognize all the vertical lines in an image, the image is convolved with the appropriate filter:
This operation is not ordinary matrix multiplication. Instead, it can be seen as a sliding element-wise multiplication: the filter F is shifted along the image I, and at each position the overlapping elements are multiplied and summed. This means that the first calculation would look like this:
When inserting the correct values, this would look like the following:
The next step then is to shift the filter one element to the right and to do this calculation with the new values. After shifting, the resulting matrix should look like this:
The output image shows that a vertical line has been detected. Since the input image is very small, the detected vertical line in the output image appears very thick. Larger images would yield more accurate results.
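To make the sliding computation concrete, here is a minimal sketch of the convolution in plain NumPy. The helper `convolve2d` and the exact pixel values are illustrative (a bright left half next to a dark right half forms a vertical edge), not the figures from the original text:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2D convolution: slide the kernel over the image
    and sum the element-wise products at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# 6x6 grayscale image with a vertical edge (bright left, dark right)
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# Classic vertical edge-detection filter
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

print(convolve2d(image, kernel))  # 4x4 output with high values at the edge
```

The non-zero column in the 4×4 output marks exactly where the vertical edge sits in the input.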
To avoid confusion, the diagram below shows the process of applying a filter to an input image and how the output image then is calculated:
Maybe you have noticed that with the method of applying filters shown in the last section, the pixels at the edges and especially the corners are used much less often than the pixels in the center of the image. That is because the filter always shifts one element to the right or one to the bottom, so center pixels appear in many more filter windows than corner or edge pixels. As an example, consider I(4,3) from the previous input image.
This pixel is used in 9 different calculations when the 3×3 filter is applied to the 6×6 image, so it influences 9 of the 16 pixels of the output image, while I(1,1) is used only once, in the very first calculation. As a result, the center pixels influence the output image much more than the corners or the edges.
As you may also have noticed, another problem is that the output image shrinks in an unregulated way after applying the filter. Since we just want the vertical lines to stand out, not the image to get smaller, this causes some inconvenience.
To counter these effects, we introduce padding. Padding means that the input image is surrounded by additional pixels filled with zeros. If the right number of additional pixels is added, the corners and edges are used more often, which weakens the disproportionate influence of the center pixels. Furthermore, the output image will then be the same size as the input image. For calculations, we abbreviate the number of additional pixel layers as p. To get the same output size as the input size, the following formula for the number of layers needed can be applied:

p = (F − 1) / 2

with F being the size of the filter.
Previously, I said that the filter always shifts one element to the right. The number of elements it shifts, also called the stride, can be adjusted. The larger the stride, the smaller the output image, and vice versa. If the stride were, for example, equal to two, the filter would shift two elements to the right or toward the bottom. For calculations, we abbreviate the stride as s. The graph below shows the example from before with the strides marked.
With the padding p and the stride s, the output size of an n×n image convolved with an F×F filter can be calculated as follows:

⌊(n + 2p − F) / s⌋ + 1
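The output-size calculation can be written as a small helper function. The function name and the example values are illustrative:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

# 6x6 image, 3x3 filter, no padding, stride 1 -> 4x4 output
print(conv_output_size(6, 3))            # 4
# "same" padding p = (f - 1) / 2 = 1 keeps the size
print(conv_output_size(6, 3, p=1))       # 6
# stride 2 shrinks the output further
print(conv_output_size(6, 3, p=1, s=2))  # 3
```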
Convolutions on RGB images
So far, everything has been explained with grayscale images. However, almost all images nowadays are RGB, which makes images and filters three-dimensional: in addition to the height and the width, there is a third dimension called the channel size.
With all the parameters shown above, we are now able to calculate the size of the output image with one single formula. For an n×n input convolved with n_F filters of size F×F, using padding p and stride s, the output has the dimensions:

(⌊(n + 2p − F) / s⌋ + 1) × (⌊(n + 2p − F) / s⌋ + 1) × n_F
It always has to be taken into consideration that the filters here are filled with hand-picked values for the sake of illustration; in a neural network, the filter values are learned by the algorithm through backpropagation.
Before I can go deeper on Convolution Neural Networks, I want to briefly explain what a Neural Network is and how it works.
The ideas behind neural networks are inspired by the human brain, in which interconnected neurons exchange information across a network. This concept was adopted for the architecture of a neural network.
As the diagram above shows, there are three layers we distinguish in Neural Networks.
The input layer is where the input is provided; no complex calculations are done here, and there is always exactly one input layer. The hidden layers are where the algorithm tries to improve itself: it compares predicted values with actual values, calculates the error, and tunes the parameters of the hypothesis.
Although the diagram above shows just one hidden layer, a neural network usually has more than that; there is no limit to their number. In the output layer, the final hypothesis is returned. The neurons are represented as circles. The number of neurons depends on the specific case, making it one of the many parameters that have to be chosen just right.
The process of training a neural network consists of forward propagation and backward propagation. In forward propagation, we pass the data through the network and calculate the loss function for the batch, which is the sum of the squared errors of the predictions. The error is the difference between the predicted value and the actual one. Backpropagation consists of calculating the gradient of the cost function with respect to the various parameters and then applying a gradient descent algorithm to update them. It is indeed a complicated process; in essence, however, backpropagation determines how much each part of the network is responsible for the cost, so that the parameters can be updated individually.
This process of forward and backward propagation is repeated multiple times; the exact number of passes over the training set is specified by a parameter called the number of epochs.
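A toy example in plain Python illustrates the loop of forward pass, loss, gradient, and parameter update. The data, the single parameter w, and the learning rate are all made up for illustration:

```python
# Fit y = w * x to data with gradient descent. The forward pass
# computes predictions and the squared-error loss; the "backward"
# step uses the analytic gradient dL/dw = sum(2 * (w*x - y) * x).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # true relationship: y = 2x

w = 0.0                # parameter to learn
lr = 0.05              # learning rate

for epoch in range(100):
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
    w -= lr * grad     # gradient descent update

print(round(w, 4))     # converges to 2.0
```

A real network does the same thing, only with millions of parameters and gradients computed automatically by backpropagation instead of by hand.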
Individual Parts of a Convolutional Neural Network
A Convolutional Neural Network combines all the steps explained above: it is a chain of many processes that leads from the input to the output. Besides the input and output layers, there are three different types of layers to distinguish in a CNN:
1. Convolutional Layer
2. Pooling Layer
3. Fully Connected Layer
The convolutional layer is all about applying filters, as we learned in a previous section. This time, however, many filters are applied to one image simultaneously. Moreover, the filters are no longer specified by us; their values are learned by the algorithm, which improves them during training by tuning the filter parameters.
Earlier in this paper, we discussed the problem of images shrinking after every step, which we solved with padding. Nevertheless, it is not always the best idea to keep the same image size in every step of the network. Especially when dealing with high resolutions and large datasets, this would exhaust the available computing power and make a decent training process take forever.
Moreover, by removing noise from the image, we can reduce the risk of overfitting the training set. Therefore, we apply pooling layers that intentionally shrink an image to speed up the algorithm. There are different ways and concepts to do this; we are going to look at the max-pooling algorithm.
The following hyperparameters need to be defined for max-pooling: the filter size f and the stride s.
Max-pooling applies a filter of size f with stride s that outputs only the highest value of each section.
To understand the intuition behind max-pooling better, let us assume our algorithm tries to detect the eyes of cats. If the previous 4×4 image represents a set of features, for example a set of activations in the neural network, a high number indicates that something specific was detected in that area. With pooling, we scale this set of features down to its most important values. That way, we do not lose important information, yet we reduce the size and speed up the algorithm.
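The pooling step can be sketched in a few lines of NumPy. The helper `max_pool` and the 4×4 example values are illustrative, using the same hyperparameters as in the text (f = 2, s = 2):

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max-pooling: keep the maximum of each f x f window, moving by stride s."""
    h, w = x.shape
    oh, ow = (h - f) // s + 1, (w - f) // s + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [1, 2, 9, 7],
              [3, 0, 4, 8]], dtype=float)
print(max_pool(x))  # [[6. 5.] [3. 9.]]
```

The 4×4 feature map shrinks to 2×2, but the strongest activation of each region survives.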
Fully Connected Layer
Last but not least, we finally get into the wonderful concepts of Neural Networks.
After applying filters and pooling, the final step is to run the output from the filter section through a fully connected layer so it can start predicting. We call it “fully connected” because, as already explained, every neuron of layer l is connected with every neuron of layer l+1.
Architecture of a CNN
In the chapters above we have seen all parts of a Convolutional Neural network individually in order to understand every step a CNN does. Now we are going to go through an example to explain how those single parts work together.
Let's assume we want to build an algorithm that predicts whether an input image shows a cat or not. This problem could indeed be solved by a standard neural network, trained with labeled data so it can distinguish a cat from something else. However, this approach would be highly inefficient, because a properly functioning image classifier needs a lot of data.
With, for example, 100,000 cat images at a relatively low resolution of 64×64 pixels, the algorithm would have to process over 400 million values just to pass the dataset through the neural network once. Given that proper training takes about 500 epochs or more, it does not require much math to see that this demands a lot of computing power.
Of course, this is doable for machines with good GPUs; however, thanks to Yann LeCun, the inventor of CNNs, this whole process can be handled much more elegantly and efficiently.
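The arithmetic behind that figure can be checked quickly, assuming one value per pixel:

```python
# 100,000 images at 64x64 pixels, one input value per pixel
images = 100_000
pixels = 64 * 64

print(images * pixels)  # 409,600,000 -> over 400 million values per pass
```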
With the convolutional and the pooling layers, images are scaled down in a way that already highlights crucial information for the Neural Network, speeding up the process of classification tremendously.
Although I said that every step of a CNN had been explained, there is actually one last step that is important. When filtering and pooling are done, the output of those layers has to be mapped in a way that works as an input for the fully connected layer. This process is called flattening.
A neural network in general only takes one-dimensional vectors as input. Since an image is almost always a multidimensional matrix, it has to be flattened into a vector before it can be passed on.
We can clearly see how the input image shrinks in size and grows in channels. After the feature extraction, which comprises the steps of applying filters and pooling, the image is reduced to its most important features, mapped to a one-dimensional vector, and sent through the neural network.
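Conceptually, flattening is just a reshape. The sketch below uses NumPy and the (16, 5, 5) shape that appears later in the example network; the array contents are placeholders:

```python
import numpy as np

# After the last pooling layer the feature maps have the shape
# (channels, height, width) = (16, 5, 5).
features = np.arange(16 * 5 * 5).reshape(16, 5, 5)

# Flattening maps this 3-D block to a 1-D vector of 16*5*5 = 400
# values, which the fully connected layer expects as input.
flat = features.reshape(-1)
print(flat.shape)  # (400,)
```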
Applying and creating a CNN
All the steps explained above consist of many mathematical calculations. While simple machine learning problems are easily doable in an application like MATLAB, where every step has to be implemented with its calculations, more advanced machine learning problems such as Convolutional Neural Networks would give every programmer a hard time in MATLAB or other math-oriented languages. Fortunately, there are many frameworks whose efficiency in numerical computing is maximized and which provide built-in functions that make creating machine learning applications very easy.
In the next few steps, we are going to set up a simple CNN in PyTorch, one of the most popular deep-learning frameworks for machine learning in Python. I am going to use Google Colab as my Python editor, since it provides a cloud-based GPU.
How to program a simple Convolution Neural Network
The first step in every machine learning problem is to collect data. Since this is usually the most time-consuming task, I will use an open-source dataset that is well suited to demonstrating how a CNN works.
The dataset we are going to use is called CIFAR-10. It has labeled images from 10 different objects:
The first step is to load the images of the dataset into the code. To do that, we use torchvision. Moreover, the images have to be transformed to tensors and normalized to the range [-1, 1]:
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) ①

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform) ②
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform) ③
testloader = torch.utils.data.DataLoader(testset, batch_size=4, shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck') ④
① The images are converted to tensors and normalized; with a mean and standard deviation of 0.5 per channel, the pixel values end up in the range [-1, 1].
② The training set is defined.
③ The test set is defined.
④ The classes the classifier should distinguish between.
Once the data is imported, normalized, and split, we can start building the CNN. It has to be said that in machine learning, programmers tend to reuse network architectures from successful open-source projects; then only the hyperparameters have to be tuned for the specific use case.
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module): ①
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5) ②
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120) ③
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x ④

net = Net()
① We create a class that inherits from nn.Module. This superclass already contains everything needed to create a neural network of any kind. ② In the constructor, we define the layers of our architecture. We start with a convolutional layer, followed by a max-pooling layer, and finally another convolutional layer; together they make up the feature-extraction part. The first argument of the convolutional layer is the input channel size. Since we are dealing with RGB images, the channel size of conv1 is 3.
The second argument is the output channel size, i.e. the number of filters. The third argument fixes the kernel size, i.e. the dimension of the filters that are applied; in our case, 5×5 filters are applied to the images. The two pooling arguments are the kernel size and the stride, giving 2×2 pooling with a stride of two, the same values used when pooling was explained earlier.
③ In step three, the fully connected layers are defined; we use three of them. The first argument specifies the number of input features of the layer, the second the number of output features. The input size of fc1 is 16 * 5 * 5 because, after the two convolution and pooling stages, the input has been reduced to 16 feature maps of size 5×5.
④ The next step is to define a forward function, which specifies how the data flows through the network during a forward pass. The parameter x is one batch of the dataset. The data is sent through the layers in order: it starts with conv1, whose output is passed into the pooling layer. Although only one pooling layer is defined, it is also applied to the output of conv2, since the same module can be reused. Finally, we send x through the fully connected layers and return it.
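The 16 * 5 * 5 input size of fc1 can be verified by tracing the image side length through the layers. A quick sketch, using the fact that CIFAR-10 inputs are 32×32:

```python
def conv_out(n, f):
    """Valid convolution with stride 1: n - f + 1."""
    return n - f + 1

def pool_out(n, f=2, s=2):
    """Max-pooling: floor((n - f) / s) + 1."""
    return (n - f) // s + 1

n = 32                 # CIFAR-10 images are 32x32
n = conv_out(n, 5)     # conv1 (5x5 kernel): 28
n = pool_out(n)        # 2x2 pooling:        14
n = conv_out(n, 5)     # conv2 (5x5 kernel): 10
n = pool_out(n)        # 2x2 pooling:         5

print(n)               # 5 -> 16 channels of 5x5 = 400 inputs to fc1
```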
Last but not least, a loss function and an optimizer have to be defined.
import torch.optim as optim
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
We now can train the algorithm that was built.
for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0
The following diagram shows the architecture we just built:
As we can see, the classifier predicted all of the sample test images correctly.
Convolutional Neural Networks are revolutionary in the fields of machine learning and image processing. It has never been easier to build image classification algorithms that perform well with limited computing power. Images are reduced to their most important features, which are then used as training data for a neural network. With PyTorch, simple implementations that deliver outstanding performance and accurate results can be created in a short amount of time.
Written by Bernhard Böck
1. PyTorch: Convolutional Neural Networks Tutorial in PyTorch, December 2020. https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
2. Adventures in Machine Learning: Convolutional Neural Networks Tutorial in PyTorch, December 2020. https://adventuresinmachinelearning.com/convolutional-neural-networks-tutorial-in-pytorch/
3. Towards Data Science: The Most Intuitive and Easiest Guide for Convolutional Neural Network, November 2020. https://towardsdatascience.com/the-most-intuitive-and-easiest-guide-for-convolutional-neural-network-3607be47480
4. Towards Data Science: Convolutional Neural Networks' Mathematics, November 2020. https://towardsdatascience.com/convolutional-neural-networks-mathematics-1beb3e6447c0
5. Andrew Ng: Deep Learning Specialization, November 2020 – December 2020. https://www.coursera.org/specializations/deep-learning
Edited by Avhan Misra, Jimei Shen & Thomas Braun