Receptive field for CNN in layperson's words

Last Updated: 19th May, 2023

Vivek Chaudhary

Research Fellow (Data Science) at almaBetter

In this article, we are going to cover the fundamentals of receptive fields in Convolutional Neural Networks (CNNs) in layperson’s words.

Before you start reading this article, we suggest you read our previous article, “What is ‘padding’ in Convolutional Neural Network?”.

Without getting into deep mathematics, let’s build the intuition behind receptive fields.

Last night my bot, Sam, was driving my car. By the way, Sam is a system trained to drive a car automatically for people who are too lazy to do it themselves. Suddenly a truck came, and we had a really bad accident. Unfortunately, I died.

When my soul asked the robot what really went wrong, he said he was not able to notice the truck.

Can you guess what really went wrong?

I am just kidding.

We are sure about one thing: we don’t want Sam to do this again. To understand what went wrong, we first need to understand a concept called the receptive field.

From the perspective of a human being, the receptive field indicates how wide an area you can see. For example, if we are standing on a road, we can see the vehicles that are coming. We can also look at the same road from our room. From there, we can notice a lot more: vehicles on the road, nearby shops, people standing around stalls and footpaths, and even the sky.

So, we can say that when we were standing on the road, our receptive field was not that large: we could only notice a particular area, and to notice other things we needed to turn our neck.

In the room, our receptive field was large: we could notice much more detail without turning our neck.

Now, let us look at it from the perspective of neural networks.

[Image: a 3x3 kernel extracting features from the top-left corner of the image]

In the above image, we can see a 3x3 kernel extracting features from the top-left corner of the image. As the kernel moves ahead, it extracts the features of the next part, and it does this for the whole image. At any one time, the kernel is extracting the features of just 3x3 pixels of the image.

So the top-left pixel of the output only has an idea about the top-left 3x3 pixels of the input, not about the whole image.

However, this time we run a 3x3 kernel on the output feature map of the previous layer, and we can see that the new output covers the whole image.

If you notice, as we move ahead in the network, the receptive field keeps increasing.
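
To make this concrete, here is a minimal Python sketch. It assumes a 5x5 input image, a 3x3 kernel filled with ones, stride 1 and no padding. The helper names conv3x3_valid and make_image are made up for illustration and do not come from any library. We light up a single pixel in the far corner and check which outputs notice it.

```python
def conv3x3_valid(image):
    """Slide a 3x3 all-ones kernel over a 2D list (stride 1, no padding)."""
    height, width = len(image), len(image[0])
    output = []
    for row in range(height - 2):
        out_row = []
        for col in range(width - 2):
            # Each output value only sums the 3x3 patch under the kernel.
            patch_sum = sum(
                image[row + dr][col + dc] for dr in range(3) for dc in range(3)
            )
            out_row.append(patch_sum)
        output.append(out_row)
    return output


def make_image(size, bright_pixel=None):
    """Build a size x size image of zeros, optionally with one pixel set to 1."""
    image = [[0] * size for _ in range(size)]
    if bright_pixel is not None:
        r, c = bright_pixel
        image[r][c] = 1
    return image


# Assumed 5x5 image with one bright pixel in the bottom-right corner.
image = make_image(5, bright_pixel=(4, 4))

layer1 = conv3x3_valid(image)   # 3x3 feature map
layer2 = conv3x3_valid(layer1)  # 1x1 feature map

# After one layer, the top-left output has not "seen" the far corner.
print(layer1[0][0])  # 0
# After two layers, the single output value responds to it.
print(layer2[0][0])  # 1
```

The top-left value of the first feature map stays at 0 because the bright pixel is outside its 3x3 window, while the value after the second layer changes, which is what a larger receptive field means in practice.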

So, we can say that in the first layers, the kernels extract small features, like edges and gradients, and only have an idea about those. As we move to the next layers, the network gets a larger perspective of the image and starts noticing patterns and textures. Finally, in the last layers, the network is able to build the complete object.

Can we connect it to real life?

Suppose we are going to watch a movie in a theatre. Which seat would we choose? We would choose a seat in the last row, since from there we can see the whole screen without any trouble or moving our necks.

However, if we get seats in the first row, we will be moving our necks continuously.

[Image]

We can think of these rows as the layers of the neural network. The first row is the first layer, which does not have the whole idea of the screen or, in the case of a neural network, the input image. The last row is the last layer, which has the whole idea of the image.

How to calculate the receptive field?

[Image]

In the above diagram, we can see that the first layer has a 3x3 kernel moving over the input image. Here we are talking about just one layer, so this is the local receptive field. Since the kernel can see only 3x3 pixels at a time, our local receptive field is 3x3.

However, as soon as we move to the next layer, each value in the output feature map has the idea of a 3x3 patch of the image. Another 3x3 kernel moving over this feature map reads 3x3 of those values at a time, so it has the idea of all the pixels they cover.

Can we calculate the receptive field now?

The local receptive field of both layers is 3x3, since each kernel can only see 3x3 values of its own input at a time. So, we can conclude that the receptive field depends directly on the kernel size. The receptive field also depends on the stride, but we will talk about that some other day. For now, we assume the stride is 1.

[Image]

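However, the global receptive field of this network will be 5x5 at the end: each value the second kernel reads already covers 3x3 input pixels, and a 3x3 window over those values spans 5x5 input pixels.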

To calculate the global receptive field of a layer, we need to consider two things: the global receptive field of the previous layer and the kernel size of the current layer.

So, to calculate the global receptive field, we add (kernel size - 1) to the previous global receptive field. The image below will solve the mystery.

As we can see, the local receptive field does not change from layer to layer, since we use the same kernel size each time, while the global receptive field increases by (kernel size - 1). However, we should always remember that strides and jumps also matter for receptive fields. We will cover those in an article on types of convolutions, where we will learn about the jump parameter and dilated convolutions.
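
The rule above can be sketched in a few lines of Python. This is only a sketch under the article's assumptions (stride 1, no dilation, no pooling). The function name global_receptive_field is our own choice for illustration.

```python
def global_receptive_field(kernel_sizes):
    """Global receptive field after each layer, assuming stride 1 and no dilation."""
    receptive_field = 1  # a single input pixel only sees itself
    history = []
    for kernel_size in kernel_sizes:
        # Each layer adds (kernel size - 1) to the previous global receptive field.
        receptive_field += kernel_size - 1
        history.append(receptive_field)
    return history


print(global_receptive_field([3, 3]))        # [3, 5]
print(global_receptive_field([3, 3, 3, 3]))  # [3, 5, 7, 9]
```

With two 3x3 layers we get 3 and then 5, which matches the two-layer network discussed above.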

Now comes the question, why do we even need to learn how to calculate the receptive field?

The use of receptive field

When we build convolutional neural networks, we use convolution layers for feature extraction. Hoping for better feature extraction, we often keep adding more and more convolution layers blindly and end up with a really heavy network. We cannot be sure that this network will give us the expected results, but it will definitely cost us time.

So, here’s where the concept of receptive fields comes to save us.

The idea is to calculate the receptive field of every layer and keep track of it. Our target should be for the receptive field to match the image size, or in some special cases exceed it, by the time we reach the final convolutional layer.

This rule makes sure that we are not building a network with useless extra layers that make it heavy. It also makes sure that the final convolutional layer has seen the complete image and holds information about all of it.
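
As a rough sketch of this bookkeeping, the snippet below counts how many convolution layers it takes for the receptive field to reach the image size. It assumes every layer uses the same kernel size with stride 1, and it does not model pooling or strides, which would change the count. The name layers_to_cover is hypothetical.

```python
def layers_to_cover(image_size, kernel_size=3):
    """Count stride-1 layers until the receptive field reaches image_size.

    Returns (number_of_layers, final_receptive_field).
    """
    if kernel_size < 2:
        raise ValueError("kernel_size must be at least 2 to grow the receptive field")
    receptive_field = 1
    layers = 0
    while receptive_field < image_size:
        receptive_field += kernel_size - 1
        layers += 1
    return layers, receptive_field


print(layers_to_cover(28))  # (14, 29)
print(layers_to_cover(32))  # (16, 33)
```

Once the receptive field has reached the image size, further layers of this kind no longer show the network anything new, which is the signal to stop adding them.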

Additionally, the receptive field decides when we should apply the pooling layers.

As a rule of thumb: if the image is small, like 32x32 or 28x28, we add a pooling layer when the receptive field reaches 5x5, and follow the same pattern for the whole network. For bigger images, like 64x64, we apply pooling after a receptive field of 7x7 or 9x9. And for really big images, with sizes above 250, we can go up to a receptive field of 11x11 or 13x13 before pooling.
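
To turn these target receptive fields into a layer count, we can invert the earlier rule. The sketch below is one interpretation: it assumes 3x3 kernels with stride 1 and counts the receptive field from the start of a block. The helper convs_before_pooling is a made-up name.

```python
def convs_before_pooling(target_receptive_field, kernel_size=3):
    """Number of stride-1 conv layers needed to reach the target receptive field."""
    if kernel_size < 2:
        raise ValueError("kernel_size must be at least 2")
    steps, remainder = divmod(target_receptive_field - 1, kernel_size - 1)
    if remainder:
        raise ValueError("target is not reachable with this kernel size")
    return steps


for target in (5, 7, 9, 11, 13):
    print(target, convs_before_pooling(target))
# 5 2
# 7 3
# 9 4
# 11 5
# 13 6
```

So, under these assumptions, a 5x5 receptive field means two 3x3 convolution layers before pooling, and 13x13 means six.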

Okay! What is the reason, though?

[Image: a zoomed picture of the forehead of the cat]

Consider the cute cat in the above image. Up to a receptive field of 9x9, the network is holding only edges and gradients; after that, it starts holding the pattern of the cat’s forehead.

This image shows how the receptive field helps us design a network.
