
NamHyuk Ahn • 9 years ago

Thanks a lot. I actually used tf.contrib.layers.batch_norm(), but it gave zero accuracy at test time.
So instead, I use this batch_norm wrapper :)

Pavel • 9 years ago

Is there a reason you don't put a direct link to your Jupyter notebook of the material here? It was very handy to discover that it exists, but I initially didn't see any hint that it does.
https://github.com/spitis/r...

Dongjin Yoon • 7 years ago

Thanks for your post. But I can't quite understand why we use the population mean/var instead of the batch mean/var at test time, beyond the single-example case. Can you explain in more detail?
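A toy sketch (plain Python, not the post's TensorFlow code; the function name is illustrative) of why batch statistics fail at test time for a single example: the batch mean is the example itself and the batch variance is zero, so the normalized activation carries no information, while population statistics preserve it.

```python
def batch_normalize(batch, epsilon=1e-3):
    # Normalize a batch using its OWN mean and variance (training-style BN).
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [(x - mean) / (var + epsilon) ** 0.5 for x in batch]

# With a single test example, the batch mean IS the example and the
# variance is zero, so the normalized activation is always 0 --
# the input's actual value is erased.
print(batch_normalize([5.0]))   # [0.0]
print(batch_normalize([-3.0]))  # [0.0]

# Using the population mean/var estimated during training instead
# preserves where this example lies in the overall distribution.
pop_mean, pop_var = 1.0, 4.0
print((5.0 - pop_mean) / (pop_var + 1e-3) ** 0.5)  # ~2.0
```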

Michal Pustka • 7 years ago

Improvement for tensors of arbitrary shape: batch_mean, batch_var = tf.nn.moments(inputs, list(numpy.arange(0, len(inputs.get_shape()) - 1)))
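For readers puzzling over that axes argument, here is a hedged plain-Python sketch of what tf.nn.moments computes when given every axis except the last: one mean and one variance per feature channel, pooled over all other dimensions (the 2-D case is shown for brevity, and moments_per_channel is a hypothetical helper name, not a TensorFlow function).

```python
def moments_per_channel(inputs):
    # inputs: list of examples, each a list of per-channel values.
    # Computes mean and variance per channel, pooling over the batch axis --
    # analogous to tf.nn.moments(inputs, axes=[0]) for a rank-2 tensor.
    n = len(inputs)
    channels = len(inputs[0])
    means = [sum(row[c] for row in inputs) / n for c in range(channels)]
    variances = [sum((row[c] - means[c]) ** 2 for row in inputs) / n
                 for c in range(channels)]
    return means, variances

batch = [[1.0, 10.0], [3.0, 30.0]]  # 2 examples, 2 channels
means, variances = moments_per_channel(batch)
print(means)      # [2.0, 20.0]
print(variances)  # [1.0, 100.0]
```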

machine learning • 7 years ago

Great tutorial!
Is there any reason why you did not apply batch normalization in the last layer?
Also, since the same names are used for scale and beta in both layer1 and layer2, these trainable variables would now seem to be shared between the two layers. Please correct me if I am mistaken, but shouldn't we have separate scale and beta trainable TensorFlow variables per layer (just like the weights and biases)?

Alex Church • 7 years ago

Quick question, is there a reason that when calculating the moving averages pop_mean is initialised as zeros and pop_var is initialised as ones?
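One hedged way to see that choice of initial values (a toy plain-Python sketch, not the post's code): with pop_mean = 0 and pop_var = 1, the inference-time normalization starts out as approximately the identity, so an as-yet-unupdated estimate is at least harmless.

```python
def normalize(x, pop_mean=0.0, pop_var=1.0, epsilon=1e-3):
    # Inference-time normalization using the population statistics.
    return (x - pop_mean) / (pop_var + epsilon) ** 0.5

# With the initial pop_mean=0 and pop_var=1, inputs pass through
# nearly unchanged (identity up to the small epsilon term):
print(normalize(2.0))  # ~2.0
```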

alexandra-stefania moloiu • 6 years ago

Standardization/Normalization - mean 0 and variance 1

SAI SUMA KAMIREDDY • 8 years ago

Thanks for this post. While training the model, many suggested using batch normalization after the activation function, whereas here it is used before the activation function. Can someone please explain which is better for training a model?

charlietrouble • 8 years ago

misspelling: neuron instead of nuron

Derk Mus • 9 years ago

Thanks, this post explains batch norm very well. I tried your implementation with a tf.placeholder for is_training, but that doesn't seem to work. So with this example you are building two different models for training and inference.

And a promising piece of research that could possibly solve the shortcomings of batch norm: batch renormalization, https://arxiv.org/pdf/1702..... Looking forward to implementations of this as well :)

RoyS • 9 years ago

I tried changing the sigmoid activation to relu and something strange happens. The values of "acc without BN" are extremely low, and "acc with BN" is OK but drops to zero at some point. Any idea why this behaviour occurs?

Alex • 9 years ago

Thanks for this post! A question regarding the batch_norm_wrapper function though:

You're resetting the pop_mean to all zeros and the pop_var to all ones on every function call. In the subsequent calculation of train_mean and train_var in the (if is_training) condition, one part of the moving average is the current batch_mean / batch_var, but the other part is always the pop_mean of all zeros / pop_var of all ones. That can't be right?

The wrapper doesn't actually calculate a population mean that is accumulated over multiple mini batches, for instance. Instead the pop_mean at any given point is just equal to (batch_mean * (1 - decay)), i.e. it is just whatever the last batch_mean value was, multiplied by (1 - decay). Any previous batch means do not go into its calculation at all, since pop_mean is reset to all zeros every time the function is called.

Am I getting something terribly wrong here?

r2rt • 9 years ago

Hi Alex, the function itself is only called once, at graph creation, not each time the graph is executed.
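A hedged plain-Python analogy of that define-once/run-many distinction (the helper name and decay value here are illustrative, not from the post): the wrapper body runs once to create persistent state, and each "session run" then reuses and updates that same state.

```python
def batch_norm_wrapper_analogy(decay=0.999):
    # Runs ONCE, like graph construction: creates the persistent state
    # that plays the role of the pop_mean tf.Variable.
    state = {"pop_mean": 0.0}

    def train_step(batch_mean):
        # Runs MANY times, like sess.run: updates the same state in place,
        # mirroring pop_mean * decay + batch_mean * (1 - decay).
        state["pop_mean"] = state["pop_mean"] * decay + batch_mean * (1 - decay)
        return state["pop_mean"]

    return train_step

train_step = batch_norm_wrapper_analogy()  # "graph creation": happens once
for _ in range(3):
    pop_mean = train_step(2.0)             # "executions" accumulate
print(pop_mean)  # creeps from 0.0 toward 2.0; it is NOT reset each call
```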

Alex • 9 years ago

Oh, now it all makes sense. Thanks a lot and sorry for the question! One should understand how Tensorflow works first and ask questions after ;)

Nikhil Chhabra • 8 years ago

But when he calls this statement: (x, y_), _, accuracy, y, saver = build_graph(is_training=False)

the function is called again for the test phase, and pop_mean is initialised to zero.
So I don't understand how the pop_mean for the test phase gets the value of the pop_mean that was saved during training. I guess I have less understanding of how Tensorflow works.

Vikas AG • 7 years ago

pop_mean and pop_var get updated by tf.assign, which is what maintains the moving average and moving variance.

Chenglin Xu • 8 years ago

In the test stage, although you call the build_graph function again, the trained model is restored, so pop_mean and pop_var hold the values from training.

Nikhil Chhabra • 8 years ago

OK, but as the variables 'pop_mean' and 'pop_var' are declared with 'trainable=False', they might not be saved when we do 'saver.save(sess, './temp-bn-save')'.
I found this discussion where they talk about how to save non-trainable variables.
https://stackoverflow.com/q...

Nikhil Chhabra • 8 years ago

I think I was wrong. Even non-trainable variables are saved by default.
Thank You Chenglin.

naveen • 9 years ago

shouldn't the cross entropy be defined like the following?
cross_entropy = -tf.reduce_mean( tf.reduce_sum(y_* tf.log(y), 1) )

r2rt • 9 years ago

Sure, you can do that. I'm not sure why I used total loss instead of average loss for this post (it was a while ago)!
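For anyone comparing the two, a quick hedged sketch (toy numbers, plain Python) of the difference: total and mean cross-entropy differ only by a factor of the batch size, so taking the mean simply decouples the effective learning rate from the batch size.

```python
import math

# Toy one-hot labels and softmax outputs for a batch of 2 examples.
y_true = [[1, 0], [0, 1]]
y_pred = [[0.8, 0.2], [0.4, 0.6]]

# Per-example cross-entropy: -sum(y_ * log(y)) over the classes,
# i.e. the inner tf.reduce_sum(..., 1) in the commenter's formula.
per_example = [-sum(t * math.log(p) for t, p in zip(ts, ps))
               for ts, ps in zip(y_true, y_pred)]

total_loss = sum(per_example)               # tf.reduce_sum version
mean_loss = total_loss / len(per_example)   # tf.reduce_mean version
print(total_loss, mean_loss)  # mean is just total / batch_size
```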

Inderpreet Singh • 9 years ago

Under the "Fixing the model for test time" paragraph, I think it should be "training" instead of "testing": "the model above only worked because the entire training set was predicted at once, so the 'batch mean' and 'batch variance' of the training set provided good estimates for the population mean and population variance."

Paweł Krotewicz • 9 years ago

Nice tutorial. However, I have a problem when I want to use the momentum optimizer. The graph is not restored correctly by tf.train.Saver() and I get a message like: tensorflow.python.framework.errors.NotFoundError: Tensor name "Variable_9/Momentum" not found in checkpoint files temp-bn-save.

Could you help with this please?

I suppose the code only works for GradientDescentOptimizer.

Also I am not able to get 80% test accuracy after restoring the model even though I get 80% on test data while training. Please help.

r2rt • 9 years ago

Ah, thanks for this! I updated the post with the fix that will let you use other optimizers, which was to add "trainable = False" to the declaration of the pop_mean and pop_var variables in the batch_norm_wrapper.

What happened is that more sophisticated optimizers, like momentum, add additional variables to manage each "trainable variable". Whenever you declare a variable with tf.Variable or tf.get_variable, it is added by default to the graph's "trainable_variables" collection (see tf.trainable_variables()). So here, the pop_mean and pop_var were mistakenly added to our trainable variables collection and the momentum optimizer added momentum variables for them. But since the momentum variables were never used, they were not saved to the checkpoint file, which resulted in an error on restore.

Paweł Krotewicz • 9 years ago

Ok, now it works also with tf.train.MomentumOptimizer.

However I still have this divergence between accuracy at train and test time.

During training I check the accuracy on the whole test set, and when I reach 80% accuracy, I save the model. Then I restore the model and evaluate the accuracy again. This time I get 74.66%.

Why do I miss more than 5% of my model accuracy?
Any idea?

Maybe I have too small batch size? (8) compared to the size of my test set (10000)?

seanv • 9 years ago

Hi R2RT, did you try putting the batch normalisation on the inputs u (in section 3.2 of the paper) rather than, as they prefer, between the linear transformation and the nonlinearity? This seems more consistent with the standard practice of normalising the inputs (e.g. section 2, first paragraph). I assume it doesn't work as well (or the authors would have recommended it), but I can't see why.

Rahul R • 9 years ago

Hi R2RT,

Thanks a lot for the post. I found it very helpful. :)

I am just confused about BN at test time. As you mentioned in the last paragraph of your post, we need to compute the mean and the variance of the entire training set. I can see that there is a running mean and variance in Torch and MatConvNet to do this, but I did not see this in the TensorFlow code. So

1. I guess we have to define two variables running_mean and running_var for each of the Conv layer while building the computational graph right ?

2. Once the computational graph is built and the weights are trained, I am not sure how to use this running_mean and running_var for each layer. Because, as you have written, for each test image it will still compute the mean and variance inside the graph, which will end up in terrible results. So can you please help me understand how we can manipulate the graph to use the running_mean and running_var at test time?

Thanks once again. Any help is much appreciated.

r2rt • 9 years ago

Hi Rahul,

Apologies for the probably way too late reply: I've updated the post to show how we can store the running_mean and running_var as tf.Variables during training, and then use them at test time. I also added a reference to Tensorflow's batch_norm implementation here: https://github.com/tensorfl..., which will do this for you but is not yet covered in Tensorflow's API docs. Hope this is helpful!

Matt • 9 years ago

Thanks for this post, helped me implement batch norm in tensorflow. Some comments: I ran your code, and in its current form the non-BN network doesn't learn anything (might have to do with multiple sigmoid layers, etc). Also, it seems it cannot work in general since in load_mnist one_hot_encode is ignored if fake_data=False, which it is in your case. However, your network assumes one-hot encoding. So it seems the actual code that generated the figs might have been slightly different. In any case, it helped, so thanks again.

r2rt • 9 years ago

Glad it helped, and thanks for the comment! I ran it again on my system and wasn't able to reproduce the bug (the one_hot encoding works and doesn't require fake_data=True, else neither network would train), so I think it was likely a bad initialization or something else more subtle.

Matt • 9 years ago

Sorry, I got misled by this comment in the DataSet constructor: 'one_hot arg is used only if fake_data is true'. I missed the extract_labels function that takes care of this.

Nadeem Shaikh • 9 years ago

Thanks for the post, it cleared up my concepts about batch normalization. Can you tell me in which neural networks batch normalization should not be applied, or is it applicable everywhere?

r2rt • 9 years ago

The BN2015 paper applies batch normalization to both fully-connected and convolutional layers. The problem batch normalization addresses is inherent whenever multiple layers that depend on each other are trained at the same time. I haven't worked with batch normalization in a recurrent context yet, but I would refer you to this paper published less than 2 weeks ago: http://arxiv.org/abs/1603.0....