What does tf.nn.conv2d do in TensorFlow?

OK, I think this is about the simplest way to explain it all.


Your example is 1 image, size 2×2, with 1 channel. You have 1 filter, with size 1×1 and 1 channel (filter size is height x width x channels x number of filters).

For this simple case the resulting 2×2, 1-channel image (size 1x2x2x1, number of images x height x width x channels) is the result of multiplying the filter value by each pixel of the image.
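As a rough illustration of that 1×1 case, here is a pure-Python sketch (no TensorFlow involved). The pixel values and the filter value are made up for the example.

```python
# Hypothetical 2x2, 1-channel image and a single 1x1 filter value.
image = [[1.0, 2.0],
         [3.0, 4.0]]
filter_value = 0.5

# With a 1x1, 1-channel filter, each output pixel is just filter_value * input pixel.
output = [[filter_value * pixel for pixel in row] for row in image]

print(output)  # [[0.5, 1.0], [1.5, 2.0]]
```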


Now let’s try more channels:

input = tf.Variable(tf.random_normal([1,3,3,5]))
filter = tf.Variable(tf.random_normal([1,1,5,1]))

op = tf.nn.conv2d(input, filter, strides=[1, 1, 1, 1], padding='VALID')

Here the 3×3 image and the 1×1 filter each have 5 channels. The resulting image will be 3×3 with 1 channel (size 1x3x3x1), where the value of each pixel is the dot product across channels of the filter with the corresponding pixel in the input image.
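To make the "dot product across channels" concrete, here is a small pure-Python sketch. It is not how TensorFlow computes it internally, and the helper name pointwise_conv and the values are assumptions chosen so the result is easy to check by hand.

```python
def pointwise_conv(image, weights):
    """Apply a 1x1 filter: image[h][w][c] and weights[c] give out[h][w]."""
    return [[sum(p * w for p, w in zip(pixel, weights)) for pixel in row]
            for row in image]

# Hypothetical 3x3 image with 5 channels: the pixel at (r, c) holds [r, c, 1, 1, 1].
image = [[[r, c, 1, 1, 1] for c in range(3)] for r in range(3)]
# Hypothetical 1x1x5 filter: only the first two channels contribute.
weights = [10, 1, 0, 0, 0]

out = pointwise_conv(image, weights)
for row in out:
    print(row)
# [0, 1, 2]
# [10, 11, 12]
# [20, 21, 22]
```

Each output value is 10*r + c, that is, the 5-element dot product of the pixel with the filter.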


Now with a 3×3 filter:

input = tf.Variable(tf.random_normal([1,3,3,5]))
filter = tf.Variable(tf.random_normal([3,3,5,1]))

op = tf.nn.conv2d(input, filter, strides=[1, 1, 1, 1], padding='VALID')

Here we get a 1×1 image, with 1 channel (size 1x1x1x1). The value is the sum of the nine 5-element dot products, which you could just call a 45-element dot product.
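A quick way to convince yourself that the two descriptions agree is to compute both. This is only a sketch in plain Python with random values standing in for tf.random_normal; the variable names are made up.

```python
import math
import random

random.seed(0)
# Hypothetical 3x3 image and 3x3 filter, each with 5 channels.
image = [[[random.gauss(0, 1) for _ in range(5)] for _ in range(3)] for _ in range(3)]
filt = [[[random.gauss(0, 1) for _ in range(5)] for _ in range(3)] for _ in range(3)]

# Sum of nine 5-element dot products, one per filter position.
nested = sum(
    sum(a * b for a, b in zip(image[i][j], filt[i][j]))
    for i in range(3) for j in range(3)
)

# The same thing as a single 45-element dot product over the flattened values.
flat_image = [v for row in image for pixel in row for v in pixel]
flat_filt = [v for row in filt for pixel in row for v in pixel]
flat = sum(a * b for a, b in zip(flat_image, flat_filt))

print(len(flat_image))  # 45
print(math.isclose(nested, flat, rel_tol=1e-9, abs_tol=1e-9))  # True
```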


Now with a bigger image:

input = tf.Variable(tf.random_normal([1,5,5,5]))
filter = tf.Variable(tf.random_normal([3,3,5,1]))

op = tf.nn.conv2d(input, filter, strides=[1, 1, 1, 1], padding='VALID')

The output is a 3×3 1-channel image (size 1x3x3x1).
Each of these values is a sum of 9, 5-element dot products.

Each output is made by centering the filter on one of the 9 center pixels of the input image, so that none of the filter sticks out. The xs below represent the filter centers for each output pixel.

.....
.xxx.
.xxx.
.xxx.
.....
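Here is a minimal pure-Python sketch of that sliding-window computation for stride 1 and 'VALID' padding. The function name conv2d_valid is hypothetical, and the all-ones image and filter are chosen only so every output is easy to predict.

```python
def conv2d_valid(image, filt):
    """image[h][w][c], filt[kh][kw][c] -> out[h-kh+1][w-kw+1] (stride 1, no padding)."""
    height, width = len(image), len(image[0])
    kh, kw = len(filt), len(filt[0])
    out = []
    for top in range(height - kh + 1):
        row = []
        for left in range(width - kw + 1):
            total = 0
            # Sum the per-pixel channel dot products over the kh x kw window.
            for i in range(kh):
                for j in range(kw):
                    total += sum(a * b for a, b in zip(image[top + i][left + j], filt[i][j]))
            row.append(total)
        out.append(row)
    return out

image = [[[1] * 5 for _ in range(5)] for _ in range(5)]  # 5x5 image, 5 channels, all ones
filt = [[[1] * 5 for _ in range(3)] for _ in range(3)]   # 3x3 filter, 5 channels, all ones

out = conv2d_valid(image, filt)
for row in out:
    print(row)
# [45, 45, 45]
# [45, 45, 45]
# [45, 45, 45]
```

Every window covers 9 pixels with 5 channels each, so with all-ones data each output is 45.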

Now with “SAME” padding:

input = tf.Variable(tf.random_normal([1,5,5,5]))
filter = tf.Variable(tf.random_normal([3,3,5,1]))

op = tf.nn.conv2d(input, filter, strides=[1, 1, 1, 1], padding='SAME')

This gives a 5×5 output image (size 1x5x5x1). This is done by centering the filter at each position on the image.

Any of the 5-element dot products where the filter sticks out past the edge of the image get a value of zero, as if the image were surrounded by a border of zeros.

So the corners are only sums of 4, 5-element dot products.
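The corner and edge behaviour can be seen in a small pure-Python sketch. It assumes stride 1 and an odd filter size, and the function name conv2d_same is made up; positions outside the image are simply skipped, which is equivalent to zero padding.

```python
def conv2d_same(image, filt):
    """image[h][w][c], filt[kh][kw][c] -> out[h][w], stride 1, odd kh and kw assumed."""
    height, width = len(image), len(image[0])
    kh, kw = len(filt), len(filt[0])
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    y, x = r + i - kh // 2, c + j - kw // 2
                    # Filter positions that stick out past the image contribute zero.
                    if 0 <= y < height and 0 <= x < width:
                        total += sum(a * b for a, b in zip(image[y][x], filt[i][j]))
            row.append(total)
        out.append(row)
    return out

image = [[[1] * 5 for _ in range(5)] for _ in range(5)]  # 5x5 image, 5 channels, all ones
filt = [[[1] * 5 for _ in range(3)] for _ in range(3)]   # 3x3 filter, 5 channels, all ones

out = conv2d_same(image, filt)
for row in out:
    print(row)
# [20, 30, 30, 30, 20]
# [30, 45, 45, 45, 30]
# [30, 45, 45, 45, 30]
# [30, 45, 45, 45, 30]
# [20, 30, 30, 30, 20]
```

With all-ones data, a corner sees 4 pixels (4 x 5 = 20), an edge sees 6 (30), and an interior position sees all 9 (45).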


Now with multiple filters:

input = tf.Variable(tf.random_normal([1,5,5,5]))
filter = tf.Variable(tf.random_normal([3,3,5,7]))

op = tf.nn.conv2d(input, filter, strides=[1, 1, 1, 1], padding='SAME')

This still gives a 5×5 output image, but with 7 channels (size 1x5x5x7), where each channel is produced by one of the filters in the set.
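For a single output position, this can be sketched as one 45-element dot product per filter. The sketch below is plain Python with made-up values; note it stores the filters one per row for readability, which is an assumption and not TensorFlow's height x width x channels x filters layout.

```python
# One 3x3x5 window of the input, flattened to 45 values (hypothetical data).
window = list(range(45))

# 7 hypothetical filters, each flattened to 45 weights; filter k is all k's.
filters = [[k] * 45 for k in range(7)]

# One output pixel: 7 channels, one 45-element dot product per filter.
pixel = [sum(w * f for w, f in zip(window, filt)) for filt in filters]

print(len(pixel))  # 7
print(pixel)       # [0, 990, 1980, 2970, 3960, 4950, 5940]
```

Since the window sums to 990, filter k yields 990 * k, and each filter fills its own output channel independently of the others.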


Now with strides 2,2:

input = tf.Variable(tf.random_normal([1,5,5,5]))
filter = tf.Variable(tf.random_normal([3,3,5,7]))

op = tf.nn.conv2d(input, filter, strides=[1, 2, 2, 1], padding='SAME')

Now the result still has 7 channels, but is only 3×3 (size 1x3x3x7).

This is because instead of centering the filters at every point on the image, the filters are centered at every other point on the image, taking steps (strides) of width 2. The xs below represent the filter center for each output pixel, on the input image.

x.x.x
.....
x.x.x
.....
x.x.x
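The picture above can be reproduced with a few lines of plain Python. This sketch uses the 'SAME' padding arithmetic as I understand it from the TensorFlow documentation (output size is ceil(input / stride), with the extra padding split so that the smaller half goes before the image); the helper name same_centers is hypothetical and it assumes an odd filter size.

```python
import math

def same_centers(size, k, stride):
    """Filter-center coordinates along one axis for 'SAME' padding (odd k assumed)."""
    out = math.ceil(size / stride)
    pad_total = max((out - 1) * stride + k - size, 0)
    pad_before = pad_total // 2
    return [o * stride - pad_before + k // 2 for o in range(out)]

centers = same_centers(5, 3, 2)
print(centers)  # [0, 2, 4]

# Mark every (row, column) pair of centers on the 5x5 input.
grid = ["".join("x" if r in centers and c in centers else "." for c in range(5))
        for r in range(5)]
print("\n".join(grid))
```

With stride 1 the same helper returns every position, which matches the stride-1 'SAME' case.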

And of course the first dimension of the input is the number of images, so you can apply it over a batch of 10 images, for example:

input = tf.Variable(tf.random_normal([10,5,5,5]))
filter = tf.Variable(tf.random_normal([3,3,5,7]))

op = tf.nn.conv2d(input, filter, strides=[1, 2, 2, 1], padding='SAME')

This performs the same operation for each image independently, giving a stack of 10 images as the result (size 10x3x3x7).
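To tie the sizes quoted in this answer together, here is a small shape calculator in plain Python. The function name conv2d_output_shape is made up; the formulas are the usual ones for 'VALID' and 'SAME' padding with an NHWC input, so treat it as a sketch for checking shapes, not as part of the TensorFlow API.

```python
import math

def conv2d_output_shape(input_shape, filter_shape, strides, padding):
    """Output shape for an NHWC input and a height x width x in x out filter."""
    n, h, w, c = input_shape
    kh, kw, in_c, out_c = filter_shape
    if c != in_c:
        raise ValueError("input channels must match filter channels")
    _, sh, sw, _ = strides
    if padding == "SAME":
        oh, ow = math.ceil(h / sh), math.ceil(w / sw)
    elif padding == "VALID":
        oh, ow = math.ceil((h - kh + 1) / sh), math.ceil((w - kw + 1) / sw)
    else:
        raise ValueError("padding must be 'SAME' or 'VALID'")
    return (n, oh, ow, out_c)

print(conv2d_output_shape([1, 3, 3, 5], [1, 1, 5, 1], [1, 1, 1, 1], "VALID"))   # (1, 3, 3, 1)
print(conv2d_output_shape([1, 5, 5, 5], [3, 3, 5, 1], [1, 1, 1, 1], "VALID"))   # (1, 3, 3, 1)
print(conv2d_output_shape([1, 5, 5, 5], [3, 3, 5, 7], [1, 1, 1, 1], "SAME"))    # (1, 5, 5, 7)
print(conv2d_output_shape([10, 5, 5, 5], [3, 3, 5, 7], [1, 2, 2, 1], "SAME"))   # (10, 3, 3, 7)
```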
