Decode Bengali: final post
Bengali.AI Handwritten Grapheme Classification competition
Style transfer relies on separating the content and style of an image. Given one content image and one style image, we aim to create a new target image that combines the desired content of the first with the style of the second:
I’ll use a pre-trained VGG19 to extract content and style features from a passed-in image. I’ll then quantify content and style losses and use them to iteratively update the target image. The following walkthrough code was copied from the Udacity deep learning repo.
VGG19 is split into two portions: the features portion, which contains all of the convolutional and pooling layers, and the classifier portion, which contains the fully connected layers.
We only need the features portion, which we load in and “freeze” the weights of, below.
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

import torch
import torch.optim as optim
from torchvision import transforms, models

# load only the convolutional "features" portion of VGG19 (we don't need the classifier)
vgg = models.vgg19(pretrained=True).features

# freeze all VGG parameters -- we're only optimizing the target image
for param in vgg.parameters():
    param.requires_grad_(False)

# move the model to GPU, if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
vgg.to(device)
Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace)
(2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU(inplace)
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU(inplace)
(7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU(inplace)
(9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): ReLU(inplace)
(12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(13): ReLU(inplace)
(14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(15): ReLU(inplace)
(16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(17): ReLU(inplace)
(18): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(19): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(20): ReLU(inplace)
(21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(22): ReLU(inplace)
(23): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(24): ReLU(inplace)
(25): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(26): ReLU(inplace)
(27): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(29): ReLU(inplace)
(30): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(31): ReLU(inplace)
(32): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(33): ReLU(inplace)
(34): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(35): ReLU(inplace)
(36): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
Load in content and style images
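The loading helpers themselves aren’t shown in this post; below is a minimal sketch of what load_image and im_convert (used later for plotting) might look like, assuming local files content.jpg and style.jpg (hypothetical paths) and the standard ImageNet normalization statistics.
def load_image(img_path, max_size=400, shape=None):
    """Load an image, resize it, and convert it to a normalized tensor."""
    image = Image.open(img_path).convert('RGB')

    # cap large images so optimization stays fast
    size = min(max(image.size), max_size) if shape is None else shape

    in_transform = transforms.Compose([
        transforms.Resize(size),
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406),    # ImageNet means
                             (0.229, 0.224, 0.225))])  # ImageNet stds

    # discard any alpha channel and add the batch dimension
    image = in_transform(image)[:3, :, :].unsqueeze(0)
    return image

def im_convert(tensor):
    """Un-normalize a tensor and convert it to a NumPy image for display."""
    image = tensor.to("cpu").clone().detach()
    image = image.numpy().squeeze()
    image = image.transpose(1, 2, 0)
    image = image * np.array((0.229, 0.224, 0.225)) + np.array((0.485, 0.456, 0.406))
    return image.clip(0, 1)

# hypothetical file paths -- swap in your own images
content = load_image('content.jpg').to(device)
# resize the style image to match the content image's spatial dimensions
style = load_image('style.jpg', shape=content.shape[-2:]).to(device)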
Next, map the torchvision layer indices to the layer names used in the paper for the content representation and the style representation.
def get_features(image, model, layers=None):
    """ Run an image forward through a model and get the features for
        a set of layers. Default layers are for VGGNet matching Gatys et al (2016)
    """
    ## Need the layers for the content and style representations of an image
    if layers is None:
        layers = {'0': 'conv1_1',
                  '5': 'conv2_1',
                  '10': 'conv3_1',
                  '19': 'conv4_1',
                  '21': 'conv4_2',
                  '28': 'conv5_1'}

    features = {}
    x = image
    # model._modules is a dictionary holding each module in the model
    for name, layer in model._modules.items():
        x = layer(x)
        if name in layers:
            features[layers[name]] = x

    return features
The output of every convolutional layer is a tensor of shape (batch_size, d, h, w): a batch dimension, a depth d (the number of feature maps), and a spatial height and width. If we flatten each of the d feature maps into a row of a matrix F with shape d × (h·w), the Gram matrix is simply G = F·Fᵀ. The Gram matrix of a convolutional layer can therefore be calculated as follows:
def gram_matrix(tensor):
    """ Calculate the Gram Matrix of a given tensor
        Gram Matrix: https://en.wikipedia.org/wiki/Gramian_matrix
    """
    # get the batch_size, depth, height, and width of the tensor (batch_size is 1 here)
    batch_size, d, h, w = tensor.size()
    ## reshape it, so we're multiplying the features for each channel
    matrix = tensor.view(d, h * w)
    ## calculate the gram matrix
    gram = torch.mm(matrix, matrix.t())
    return gram
I’ll extract the features from both images and calculate the gram matrices for each layer in our style representation.
# get content and style features only once before forming the target image
content_features = get_features(content, vgg)
style_features = get_features(style, vgg)
# calculate the gram matrices for each layer of our style representation
style_grams = {layer: gram_matrix(style_features[layer]) for layer in style_features}
# create a third "target" image and prep it for change;
# it's a good idea to start from a copy of the content image and iteratively change its style
target = content.clone().requires_grad_(True).to(device)
Next, weight the style representation at each relevant layer, using values in the 0–1 range. Weighting earlier layers (conv1_1 and conv2_1) more heavily produces larger style artifacts in the resulting target image, while weighting later layers puts more emphasis on smaller features. This is because each layer works at a different spatial resolution, and together they create a multi-scale style representation!
Just like in the paper, I define an alpha (content_weight) and a beta (style_weight). Their ratio affects how stylized the final image is. It’s recommended to leave content_weight = 1 and adjust style_weight to achieve the desired ratio.
style_weights = {'conv1_1': 1.,
'conv2_1': 0.8,
'conv3_1': 0.5,
'conv4_1': 0.3,
'conv5_1': 0.1}
content_weight = 1 # alpha
style_weight = 1e6 # beta
Inside the iteration loop, calculate the content and style losses and update the target image accordingly.
The content loss is the mean squared difference between the target and content features at layer conv4_2. The style loss is calculated in a similar way, except you iterate through a number of layers, specified by name in the style_weights dictionary. Finally, the total loss is formed by adding up the style and content losses, weighted by your specified alpha and beta.
# for displaying the target image, intermittently
show_every = 400
optimizer = optim.Adam([target], lr=0.003)
steps = 2000 # decide how many iterations to update
for ii in range(1, steps+1):

    ## get the features from your target image
    ## then calculate the content loss
    target_features = get_features(target, vgg)
    content_loss = torch.mean((target_features['conv4_2'] - content_features['conv4_2'])**2)

    # initialize the style loss to 0
    style_loss = 0
    # iterate through each style layer and add to the style loss
    for layer in style_weights:
        # get the "target" style representation for the layer
        target_feature = target_features[layer]
        _, d, h, w = target_feature.shape

        ## calculate the target gram matrix
        target_gram = gram_matrix(target_feature)

        ## get the "style" style representation
        style_gram = style_grams[layer]

        ## calculate the style loss for one layer, weighted appropriately
        layer_style_loss = style_weights[layer] * torch.mean((target_gram - style_gram)**2)

        # add to the style loss, normalized by the layer's size
        style_loss += layer_style_loss / (d * h * w)

    ## calculate the *total* loss
    total_loss = style_weight * style_loss + content_weight * content_loss

    # update your target image
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    # display intermediate images and print the loss
    if ii % show_every == 0:
        print('Total loss: ', total_loss.item())
        plt.imshow(im_convert(target))
        plt.show()
Every 400 steps, you can see the total loss decrease. You can decide how many steps to take.
# display the content image and the final target image side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
ax1.imshow(im_convert(content))
ax2.imshow(im_convert(target))
Does this look better? For the whole “training process”, check out my repo.
Reference: Udacity deep learning repo