Decode Bengali: Midway post

5 minute read

The first model, a plain vanilla CNN model achieves a score of 0.9516. Its architecture is shown below. alt text

To boost the performance of our model, we first tried ResNet architecture. ResNet (Residual Neural Network) is an artificial neural network which builds on a convolutional neural network. It uses skip connections to build a shortcut to send the input past several convolutional layers.

Why ResNet In image classification problems, it is common to use CNN to solve these problems. Especially, as the depth of the network and the complexity of the model increase, the network performs better and predicts more accurately. However, as the number of layers increases, another problem might arise. If the network is too deep, the gradients from where the loss function is calculated easily shrink to zero after several applications of the chain rule. There would be nearly no update on the value and therefore, no learning is being performed. ResNet could solve this vanishing gradient problem as it reuses activations from a previous layer until the adjacent layer learns its weights. Also, after solving the problem of vanishing gradients, ResNet will speed up the training process as well.

Construction Now, let us take a look at the Residual Neural Network we trained to classify the Bengali graphemes. We read the data and resized them as we did before. At first, we constructed 3 layers, in each of which there are 32 feature maps and the kernel has a size of 3. We also choose zero padding to maintain the size of the inputs and the ReLu activation. Then, after batch normalization and max pooling, we built another layer, which includes 32 feature maps as before but with the size of filter equal to 5. Then, we did dropout with the dropout rate equal to 0.3. That was how we did regularization. Then, we started to construct layers of 64 feature maps. The first two of such layers had 3 by 3 filters, while the third layer had 5 by 5 filters. We also added zero padding and ReLu activation methods. At the end of the third layer, we did dropout again. After building the basic layers, we extended the network with the residual unit. alt text

In the first unit, we had two convolutional layers. Each layer had 128 feature maps and each filter had a size of 3. But one thing we need to notice is that the first layer had the stride equal to 2. Thus, we had to make a convolutional layer in the skip connection to match the size of the output after the skip step and the size of the output after the two layers. Then, we continued to construct 4 similar units. The only difference was that all the strides were set to one and the skip connection was set to the identity function. Later, we constructed 6 more units, similar to what we just did before. In the first unit, the first layer was constructed and it contained two convolutional layers. Both of the layers had 256 feature maps with 3 by 3 filters. It was just that the first layer’s stride was 2, while the second filter had stride equal to 1. Then, we made a convolutional layer in the skip connection to match the size of the output after the skip step and the size of the output after the two layers. We also made 5 more similar units. They were the same as the first unit except that all the strides were set to one and the skip connection function was equal to identity.

To make the network computationally stronger, we constructed 7 more units, almost repeating the previous process. In the first unit, the first layer was constructed and it contained two convolutional layers. Both of the layers had 512 feature maps with 3 by 3 filters. It was just that the first layer’s stride was 2, while the second filter had stride equal to 1. Then, we made a convolutional layer in the skip connection to match the size of the output after the skip step and the size of the output after the two layers. We continued to produce 6 more similar units. They were the same as the first unit except that all the strides were set to one and the skip connection function was equal to identity. In summary, we constructed 5+6+7=18 residual units and we had 18X2=36 convolutional layers in total. After flattening, the overall LB score of the ResNet is 0.957. Our code can be found here.

EfficientNet B3 We also tried EfficientNetB3 based on the idea from this kernel.

Image processing: simple resize, shrink the image from 137 X 236 to 109 X 188 （preserves ratio instead of crop center), we have not used data augmentation.

def resize_image(img, w, h, new_w, new_h):
    img = 255 - img
    img = (img * (255.0 / img.max())).astype(np.uint8)
    img = img.reshape(h, w)
    image_resized = cv2.resize(img, (new_w, new_h), interpolation = cv2.INTER_AREA)
    return image_resized    

model.compile(optimizer = Adam(lr = 0.00016),
                loss = {'root': 'categorical_crossentropy',
                        'vowel': 'categorical_crossentropy',
                        'consonant': 'categorical_crossentropy'},
                loss_weights = {'root': 0.50,        
                                'vowel': 0.25,
                                'consonant': 0.25},
                metrics = {'root': ['accuracy', tf.keras.metrics.Recall()],
                            'vowel': ['accuracy', tf.keras.metrics.Recall()],
                            'consonant': ['accuracy', tf.keras.metrics.Recall()] })

This was trained on a GCP virtual instance. We published our source code and training notebook to our repo.

We trained on a batch size of 112, and image size of 109 X 188 X 3. For the train test split, stratified K fold was used. We allocated 1/6 of training images to to validation set and 5/6 to training set. Actually, the split generator would generate different train and test distribution for every epoch. Because based on the suggestion from public score leaders, using this method will generalize to test data better. If after 5 consecutive epochs, validation loss for the root is still higher than previous lowest level, we would reduce the learning rate by 25% until it reaches the minimum of 1e-5.

From the training log printouts, we can see that after training for 27 epochs, the learning rate was was reduced to the minimum level of 0.00001 and moved slowly. Another interesting phenomenon is that the loss metrics fluctuate greatly during the training process, but after the 20th epoch, the accuracy scores increase steadily. alt text

The model weight from the 36th epoch was submitted and got a score of 0.93. alt text We will need to tune our training parameter further and restart the training process again.

Share on

Twitter Facebook LinkedIn

Hannah Han

Decode Bengali: Midway post

Share on

You may also enjoy

Implement style transfer with VGG19

Decode Bengali: final post

Visuals

Decode Bengali: initial post