Implementing L2-constrained Softmax Loss Function on a Convolutional Neural Network using TensorFlow

When we build neural network based models to classify more than two classes, we normally choose the softmax function. It assigns a probability to each class, and we take the class with the highest probability as the model's verdict. That is how we infer with the model, also referred to as a forward pass. To train the model, we perform backpropagation through the softmax loss. The output is normally represented as a one-hot vector, also known as the 1-of-K class representation.
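As a tiny made-up example of that forward pass, here is the softmax-plus-arg-max step on three raw class scores (the numbers are arbitrary):

import numpy as np

logits = np.array([2.0, 1.0, 0.1])               # made-up raw scores for 3 classes
probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax: probabilities that sum to 1
prediction = np.argmax(probs)                    # pick the class with the highest probability
print(probs, prediction)                         # roughly [0.66 0.24 0.10] and class 0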

I am not writing this post to explain softmax regression or CNNs. The main objective of this blog post is to implement the L2-constrained softmax loss function using TensorFlow on the good old MNIST dataset. The full description and other information related to this function can be found in this paper.

Let’s clear some topics before diving into implementation.

Softmax Loss Function

The softmax loss function is defined as follows:
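Written out, with M the number of training examples in a batch, C the number of classes, f(x_i) the feature vector coming out of the layer before the last one, and W_j, b_j the last layer's weights and bias for class j, it is the usual cross-entropy over softmax outputs:

L_S = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{e^{W_{y_i}^{T} f(x_i) + b_{y_i}}}{\sum_{j=1}^{C} e^{W_{j}^{T} f(x_i) + b_j}}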


L2-Constrained Softmax Loss Function

The definition stays almost exactly the same; the objective is still to minimize this loss function.

But here, f(x) gets changed.

Instead of directly computing the dot product between the last layer's weights and f(x) [the output of the (N-1)-th layer, given that the N-th layer is the last one], f(x) is first divided by its L2 norm and then scaled up by a parameter named alpha. After these two operations, which can be referred to as L2-normalization and scaling, the dot product is computed and the regular softmax function is applied.
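In other words, f(x) is replaced by its L2-normalized, scaled version (same f(x) and alpha as above):

\hat{f}(x) = \alpha \, \frac{f(x)}{\|f(x)\|_2}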

So the minimization of the loss function is subject to the following constraint:
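Concretely, using the same notation as before:

\|f(x_i)\|_2 = \alpha, \quad \forall \, i = 1, \dots, M

Geometrically, this forces every feature vector to lie on a hypersphere of radius alpha.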

Implementation Details

So the architecture looks something like this (this is the model I am going to implement):

C means a convolution layer, P means a pooling layer, FC is, as you guessed, a fully connected layer, and the L2-Norm and Scale layers are the ones I am going to implement.

Implementing using TensorFlow

To implement the model, I took the CNN-based MNIST classification code from this repository.

Before applying dropout, I first normalized the output of the (N-1)-th layer, which is fc1, and then scaled it up by the parameter alpha; the softmax was computed after that. Here is the line of code that does all of this:

fc1 = alpha * tf.divide(fc1, tf.norm(fc1, ord='euclidean'))

Setting alpha to 0 is how I switch back to the regular softmax loss (the normalization step is simply not applied); any non-zero alpha gives the L2-normalized version.
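To show where that line sits in the graph, here is a rough TF 1.x style sketch of the whole model. The layer sizes, the tf.layers API, the example alpha value, and the explicit alpha != 0 switch are illustrative choices of mine, not necessarily what the linked repository uses:

import tensorflow as tf  # TF 1.x style graph code

def conv_net(x, n_classes, alpha, keep_prob):
    # Reshape the flat 784-pixel MNIST input into a 28x28 grayscale image.
    x = tf.reshape(x, [-1, 28, 28, 1])

    # C -> P -> C -> P
    conv1 = tf.layers.conv2d(x, 32, 5, activation=tf.nn.relu)
    pool1 = tf.layers.max_pooling2d(conv1, 2, 2)
    conv2 = tf.layers.conv2d(pool1, 64, 3, activation=tf.nn.relu)
    pool2 = tf.layers.max_pooling2d(conv2, 2, 2)

    # FC: this is f(x), the output of the (N-1)-th layer.
    fc1 = tf.layers.dense(tf.layers.flatten(pool2), 1024, activation=tf.nn.relu)

    # L2-Norm + Scale. Note: the paper constrains each example's feature vector
    # separately, which would correspond to
    # tf.norm(fc1, ord='euclidean', axis=1, keepdims=True);
    # the one-liner shown above normalizes by the norm of the whole batch matrix.
    if alpha != 0:
        fc1 = alpha * tf.divide(fc1, tf.norm(fc1, ord='euclidean'))

    # Dropout comes after the normalization, as described above.
    fc1 = tf.nn.dropout(fc1, keep_prob)

    # Last FC layer: logits that go into the softmax loss.
    return tf.layers.dense(fc1, n_classes)

# Softmax loss and training op on top of the logits.
X = tf.placeholder(tf.float32, [None, 784])
Y = tf.placeholder(tf.float32, [None, 10])  # one-hot labels
logits = conv_net(X, 10, alpha=16.0, keep_prob=0.75)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=Y, logits=logits))
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)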

Here is the full code

Performance Evaluation

Did it actually work? Yes, it did: accuracy indeed increased by almost 1%. I should have run more iterations, but I didn't, partly because I didn't have a powerful computer and partly out of laziness. If you don't believe me or the paper, why don't you experiment with it yourself?

Here is the performance of the model for different alpha values.

The orange line stands for the regular softmax loss function and the blue one is the L2-normalized one.

As I said, to learn more about it I recommend going through the paper; there is a lot of information I haven't mentioned here. Thanks for reading the post!