[Figure: A fully connected neural net.]
[Figure: A convolutional layer with a stride of 2 and an output depth of 2. Usually these layers have a stride of 1, though.]
[Figure: The input grid represents the green grids from the previous image.]
[Figure: These images may look the same to a CNN.]
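To make that strided convolution concrete, here is a minimal NumPy sketch of a single 3x3 kernel sliding over a small input with a stride of 2. The toy input, the kernel values, and the helper name conv2d are all my own illustration, not Keras internals; an output depth of 2 would simply mean stacking two such kernels.

import numpy as np

image = np.arange(36, dtype='float32').reshape(6, 6)   # toy 6x6 input
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype='float32')        # one arbitrary 3x3 filter

def conv2d(inp, k, stride=2):
    kh, kw = k.shape
    out_h = (inp.shape[0] - kh) // stride + 1
    out_w = (inp.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w), dtype='float32')
    for i in range(out_h):
        for j in range(out_w):
            patch = inp[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * k)   # dot product of kernel and patch
    return out

print(conv2d(image, kernel, stride=2))   # 2x2 output map from a 6x6 input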
# Prepare training/test set
from keras.datasets import mnist
(train_im,train_labels),(test_im,test_labels) = mnist.load_data()
train_im = train_im.reshape((60000,28*28))
train_im = train_im.astype('float32')/255
test_im = test_im.reshape((10000,28*28))
test_im = test_im.astype('float32')/255
from keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
# Create neural net
from keras import models, layers
net = models.Sequential()
# Input, fully connected layer
net.add(layers.Dense(512,activation="relu",input_shape=(28*28,)))
# Output, classification layer
net.add(layers.Dense(10,activation="softmax"))
# Add loss/optimizer
net.compile(optimizer='rmsprop',
            loss='categorical_crossentropy',
            metrics=['accuracy'])
# Train
net.fit(train_im,train_labels,epochs=5,batch_size=128)
loss,acc = net.evaluate(test_im, test_labels)
print(acc)
Accuracy on test set: 97.8%
As you can see, each 2D image must be flattened into a 1D vector of 28*28 = 784 values before it can be fed to the fully connected layers, which discards the spatial layout of the pixels.
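A quick way to see what is lost in the flattening (the index arithmetic below is my own illustration, not part of the listing above): two pixels that are vertical neighbours in the 28x28 image end up 28 positions apart in the 784-dimensional vector, so the dense layers have no built-in notion of "nearby".

# Row-major flattening: pixel (row, col) lands at index row*28 + col
def flat_index(row, col):
    return row * 28 + col

print(flat_index(10, 10), flat_index(10, 11))   # horizontal neighbours: 290 and 291
print(flat_index(10, 10), flat_index(11, 10))   # vertical neighbours: 290 and 318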
# Prepare training/test set
from keras.datasets import mnist
(train_im,train_labels),(test_im,test_labels) = mnist.load_data()
train_im = train_im.reshape((60000,28,28,1))
train_im = train_im.astype('float32')/255
test_im = test_im.reshape((10000,28,28,1))
test_im = test_im.astype('float32')/255
from keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
# Create neural net
from keras import models, layers
net = models.Sequential()
# Input, convolutional layer
net.add(layers.Conv2D(32,(3,3),activation="relu",input_shape=(28,28,1)))
net.add(layers.MaxPooling2D((2,2)))
# Hidden convolutional layer #1
net.add(layers.Conv2D(64,(3,3),activation="relu"))
net.add(layers.MaxPooling2D((2,2)))
# Hidden convolutional layer #2
net.add(layers.Conv2D(64,(3,3),activation="relu"))
# Flatten for classification
net.add(layers.Flatten())
# Classification layer
net.add(layers.Dense(64,activation='relu'))
net.add(layers.Dense(10,activation='softmax'))
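# (Aside, assuming Keras' default 'valid' padding:) the feature maps shrink as
#   28x28x1 -> Conv 3x3 -> 26x26x32 -> MaxPool -> 13x13x32
#           -> Conv 3x3 -> 11x11x64 -> MaxPool -> 5x5x64
#           -> Conv 3x3 -> 3x3x64   -> Flatten -> 576 values
# net.summary() will print this layer-by-layer table if you want to check.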
# Add loss/optimizer
net.compile(optimizer='rmsprop',
            loss='categorical_crossentropy',
            metrics=['accuracy'])
# Train
net.fit(train_im,train_labels,epochs=5,batch_size=128)
loss,acc = net.evaluate(test_im, test_labels)
print(acc)
Accuracy on test set: 99%
With CNNs, 2D images are processed as they are, without being flattened into a 1D signal. The jump from 97.8% to 99% may not seem like much of a bump, but this is a classification task in which the objects (MNIST digits) are all roughly the same size and centered. Convolutional layers shine brighter at detection tasks, where their robustness to the arbitrary placement of features matters far more.
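To see that robustness in code, here is a hedged sketch (the 4-pixel shift and the use of np.roll are my own choices, and net here refers to the convolutional model just trained): shift a test digit to the right and ask for predictions on both versions. For small shifts like this the convolutional net's prediction typically stays the same, because its filters respond to the stroke pattern wherever it appears, whereas the flattened vector fed to a dense net would change in almost every component.

import numpy as np

digit = test_im[0]                         # shape (28, 28, 1)
shifted = np.roll(digit, shift=4, axis=1)  # slide the digit 4 pixels to the right
# (MNIST digits sit on a black border, so the wrap-around mostly moves background pixels.)
batch = np.stack([digit, shifted])         # shape (2, 28, 28, 1)
print(net.predict(batch).argmax(axis=1))   # typically the same class twice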