
Gledson Melotti • 6 years ago

Hello Alessandro Angioi, thank you very much for your post. How are you? Why do you replace tf.keras.layers.Dense(1) with tf.keras.layers.Dense(2) when you use the distribution layer? Why is 1 replaced by 2? Could you post an example using classification? For example, distribution layers to classify the MNIST dataset.

Alessandro Angioi • 6 years ago

Hey Gledson Melotti, thank you for reading! Good question, maybe I should have been clearer on this: in the first case, I was basically just performing regression, so the output layer of the network should return only one parameter, the expected value of Y given X.

When I used distributions, since I wanted to be able to handle the heteroscedasticity of Y, the network needed not only to learn the expected value of Y given X, but also its variance. In other words, since the distribution (normal) I chose to model Y needs two parameters (mean and standard deviation), the network needs to learn both of them. The simplest useful example I could think of is just putting two output neurons in the network, where before there was only one. In principle, you can devise architectures much more complicated than this, but the code for training it should stay pretty much the same.
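To make the two-output idea concrete, here is a NumPy-only sketch (the activation values are made up; the softplus transform for the scale is the choice used later in the article):

```python
import numpy as np

def softplus(z):
    # smooth positive transform: log(1 + e^z)
    return np.log1p(np.exp(z))

# pretend these are the raw activations of a Dense(2) output layer
# for 4 training examples: shape (n, 2)
t = np.array([[0.3, -1.2],
              [1.5,  0.4],
              [-0.7, 2.1],
              [0.0,  0.0]])

mean = t[:, 0]            # first neuron -> expected value of Y given X
std = softplus(t[:, 1])   # second neuron -> standard deviation (must be > 0)

print(mean.shape, std.shape)  # (4,) (4,)
print(np.all(std > 0))        # True
```

Each row of the network's output is then interpreted as the two parameters of one normal distribution.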

I do not have any examples on the MNIST dataset right now, but I think I will write about it soon!

Gledson Melotti • 6 years ago

Hello Alessandro Angioi, thank you so much for your explanations. Congratulations again for your posting.

Tom Chris • 6 years ago

Hi Alessandro, thank you for your post! That is pretty helpful. I just have two questions.

The first is: how do I output the learned standard deviation after training? After training, I used "y_val = model.predict(x_val)" to get some predictions, but I can only get the predicted mean, without the std, for the last two models.

The second question is: what is p_y in negloglik = lambda y, p_y: -p_y.log_prob(y)? Don't we need to define it beforehand?

Thanks!!!

Alessandro Angioi • 6 years ago

Hey Tom! Sorry for the late reply, and thank you for reading!

When the last layer of your model is a distribution, you can use model(x).mean() and model(x).stddev() to access mean and standard deviation (here, x is the set of values for which you want to make predictions; for a plot you can just use an np.linspace). You can see my whole code for this article here: https://colab.research.goog...
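For plotting, once you have the per-point mean and standard deviation as NumPy arrays, building a confidence band is straightforward. A minimal sketch (the mean/std values below are made up, standing in for what model(x).mean() and model(x).stddev() would return on a 1D grid):

```python
import numpy as np

# hypothetical predictions, shaped like what model(x).mean().numpy().squeeze()
# and model(x).stddev().numpy().squeeze() would return for a 1D input grid
x = np.linspace(-8, 15, 200)
mean = 0.1 * x                # stand-in values, for illustration only
std = 0.05 * np.abs(x) + 0.5  # stand-in values, for illustration only

# a ~95% band under the normality assumption: mean +/- 2 std
lower = mean - 2 * std
upper = mean + 2 * std

print(np.all(upper > lower))  # True, since std is strictly positive here
```

With matplotlib, plt.fill_between(x, lower, upper, alpha=0.3) then draws the band around the mean curve.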

Regarding your second question, p_y is the probability distribution of y ("conditioned" by x), and it really is just the "output layer" of this model. Since it is calculated at training time, we cannot define a fixed one beforehand (negloglik is our loss function here). p_y is just an argument of the lambda function... in other words, negloglik is the same as


def negloglik(y, p_y):
    return -p_y.log_prob(y)
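To see concretely what this loss computes when the output distribution is normal, here is a NumPy-only sketch with the Gaussian log density written out by hand (the numbers are hypothetical):

```python
import numpy as np

def gaussian_negloglik(y, mu, sigma):
    # -log N(y; mu, sigma), written out explicitly
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)

y = 1.0
good = gaussian_negloglik(y, mu=1.0, sigma=1.0)  # mean matches the observation
bad = gaussian_negloglik(y, mu=3.0, sigma=1.0)   # mean is off by 2

print(good < bad)  # True: the loss penalises a mean far from the data
```

Minimizing this over the network weights is exactly maximum-likelihood training, which is why no hand-written loss beyond -p_y.log_prob(y) is needed.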

Markus Wenzel • 6 years ago

Hey Alessandro, it's very valuable that you provide a small example. Good for beginners!
However, the code is not running for me. Neither does the joint distribution come out as expected, nor am I able to get the model to predict anything. Could you show more of the code, e.g. how you produced the plots? Surely this is my fault...

Alessandro Angioi • 6 years ago

Hey Markus! Thanks a lot for your feedback! Could you tell me what kind of errors the interpreter is showing you? In any case, here is the full code I wrote for this article in a Colab notebook (you can copy it to your own Colab); does this help?

https://colab.research.goog...

Another thing: I am using TensorFlow 2.0; if you use the latest stable version (1.14, I think) there should be some subtle changes to make; I can look into it if you are interested!

Markus Wenzel • 6 years ago

Thanks for the rapid response! After a quick look, I guess the differences are due to my misuse of TF sessions. I'm not using eager execution -- but otherwise my code equals yours.
I can imagine that future readers of your blog will appreciate a hint to the Colab notebook in the main body of the text.

Alessandro Angioi • 6 years ago

Yes, I will add that link to the main text as soon as I can. Thank you for your help!

Utkarsh Sarawgi • 5 years ago

Hello Alessandro Angioi, thank you for this post, it really helps a lot! Do you happen to have an example (or any lead through another blog or reference) for classification as well? I'm having a hard time figuring this out for classification. Thank you :)

mouad ablad • 5 years ago

Hello @Alessandro Angioi, thank you for posting this helpful article. I just have a few questions:
What if we have more than one feature? How can we quantify uncertainty in this case?
How can we visualize the final plot in this case?

Alessandro Angioi • 5 years ago

Hey! I assume that with "more than one feature" you mean that X is not a scalar anymore, but a vector, right? Well, the code/model will not change that much. The Dense layer at the top of the model can eat any (fixed) number of features.

In the same fashion, the standard deviation of Y can give you an idea of the uncertainty of Y regardless of the dimensionality of X.
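A toy NumPy sketch of the multi-feature case (the linear map below is a hypothetical stand-in for the network's final Dense(2) layer; the point is only that the two-output head works unchanged when X has d features):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 5                # 100 examples, 5 features each
X = rng.normal(size=(n, d))  # X is now a matrix, not a column of scalars

# a stand-in for the network's final Dense(2) layer: any map R^d -> R^2
W = rng.normal(size=(d, 2))
t = X @ W                    # shape (n, 2): one (mean, raw-scale) pair per example

mean = t[:, 0]
std = np.log1p(np.exp(t[:, 1]))  # softplus keeps the scale positive

print(t.shape)          # (100, 2)
print(np.all(std > 0))  # True
```

Whatever the dimensionality of X, the model still emits one mean and one standard deviation per example.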

Producing an analogue of the final plot might be tricky, though. Off the top of my head, I can only say that if you have two scalar features and one scalar output, you can trivially generalize the 2D plot I produced to a 3D one; for anything beyond that you need to get creative, and the best solution probably depends on the problem at hand!

mouad ablad • 5 years ago

Thank you so much for your help.

Iñaki Jn • 5 years ago

Hi, thank you very much for this lucid explanation!

I understand that the model is modeling the posterior P(Y|X). We know that

P(Y|X) = (P(X|Y) * P(Y)) / P(X)

When generating the data with JointDistributionSequential([...]) I guess that:

1. The first element in the list, i.e. tfd.Uniform(low=-8, high=15), is P(X)
2. The second element in the list, lambda x: tfd.Normal(loc=x, scale=abs(x) + 3), is P(Y|X)
3. Am I wrong, and this second argument is P(Y)?

Do you know if TensorFlow distributions have some kind of prior estimation?

Thx so much!

Alessandro Angioi • 5 years ago

Hey! Yes, what you are saying is exactly right, as can also be seen in the docs of tfp.distributions.JointDistributionSequential, in the "Mathematical Details" section.

Regarding your second question, I am not sure what you mean by "prior estimation"; can you elaborate on that? AFAIK, you do have to put your priors in.
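The factorization above (sample from P(X) first, then from P(Y|X)) can be sampled ancestrally. Here is a NumPy-only sketch mirroring tfd.Uniform(low=-8, high=15) and lambda x: tfd.Normal(loc=x, scale=abs(x) + 3) from the article, without needing TensorFlow Probability:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# prior P(X): Uniform(-8, 15), as in tfd.Uniform(low=-8, high=15)
x = rng.uniform(-8, 15, size=n)

# likelihood P(Y|X): Normal(x, |x| + 3), as in the lambda passed to
# JointDistributionSequential -- sampled "ancestrally": x first, then y given x
y = rng.normal(loc=x, scale=np.abs(x) + 3)

print(x.min() >= -8 and x.max() <= 15)  # True: x stays in the prior's support
print(y.shape)                          # (1000,)
```

The scale |x| + 3 is what makes the data heteroscedastic: the spread of y grows with |x|.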

Iñaki Jn • 5 years ago

Now I realize that priors and likelihoods are "selected", not estimated... thanks for your reply

Amilkar Herrera • 6 years ago

Thank you very much for the tutorial. It's a great explanation; the best simple explanation of these capabilities I have been able to find so far.

Could you please provide some comments on why you passed tf.math.softplus(0.005*t[...,1:]) + 0.001 as an argument to the DistributionLambda?

From what I read, here you are passing the parameters from the previous layer, but why did you multiply by 0.005 and add 0.001?

Thank you very much in advance for your help!

Alessandro Angioi • 6 years ago

Hey Amilkar, thank you so much for the kind words!

I put those parameters there because, at the time I wrote the article, I noticed they helped with training. The second one (+0.001) enforces that the scale parameter is always strictly larger than 0. The 0.005 multiplying t[...,1:] is actually not needed if you play around a bit with the learning rate and the initialization of the network weights.
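The positivity guarantee is easy to check numerically. A NumPy sketch (the raw activations below are hypothetical):

```python
import numpy as np

def softplus(z):
    # softplus(z) = log(1 + e^z), always strictly positive
    return np.log1p(np.exp(z))

# hypothetical raw second-column activations, including large negative ones
t = np.array([-500.0, -5.0, 0.0, 5.0, 500.0])

scale = softplus(0.005 * t) + 0.001

# softplus(z) -> 0 as z -> -inf, so the +0.001 floor keeps scale > 0;
# the 0.005 factor just shrinks the activations before the transform
print(np.all(scale > 0.001))  # True
```

Without the floor, a very negative activation could push the scale close enough to zero to make the log-likelihood blow up during training.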

Sorry for the delay in my reply; if you need further help, I can elaborate on this a bit more.

Zilong Zhao • 5 years ago

Awesome tutorial, thanks Alessandro Angioi!! A quick follow-up question: when predicting the distribution, you used the following as the output layer; can you please help me understand what t[...,:1] is doing here? How did you know it's the mean of the normal distribution?

tfp.layers.DistributionLambda(lambda t:
    tfd.Normal(loc=t[..., :1],
               scale=tf.math.softplus(0.005 * t[..., 1:]) + 0.001))

Alessandro Angioi • 5 years ago

Zilong Zhao, thanks for the kind words and sorry for the late reply!

So, basically the DistributionLambda is just a layer that allows you to plug a TensorFlow Probability distribution into a Keras model. This layer receives a tensor from the previous layer (which becomes the input to a lambda function which must return a distribution). Since the previous layer is Dense(2), in my example the tensor passed forward has the shape (n, 2), where n is the number of training examples passed as an input to the model.

Now, since I care only about the last dimension of that tensor, I sliced it using the ellipsis notation (see https://stackoverflow.com/q... ); you could have achieved the same result with t[:,0] and t[:,1], but when I wrote this article this would not work for some reason (when I wrote this article, TensorFlow 2.0 was in beta, and TensorFlow Probability was not as mature as it is now). I checked again, and now it works also with t[:,0] and t[:,1].
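The shapes involved can be checked in NumPy, where ellipsis indexing works the same way as on TensorFlow tensors (the values below are made up):

```python
import numpy as np

# a stand-in for the (n, 2) tensor coming out of Dense(2), here n = 3
t = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

loc_raw = t[..., :1]    # first column, shape (3, 1) -- the last axis is kept
scale_raw = t[..., 1:]  # second column, shape (3, 1)

print(loc_raw.shape, scale_raw.shape)  # (3, 1) (3, 1)
print(t[:, 0].shape)                   # (3,) -- t[:, 0] drops the last axis
```

Note the subtle difference: slicing with :1 keeps the trailing axis, while indexing with 0 drops it, which can matter for broadcasting against the distribution's batch shape.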

Final comment: for tfd.Normal, loc is the mean and scale is the std. So, the outputs of the Keras model are not numbers, but probability distributions, with mean and standard deviation determined by the neural network preceding them!

I hope this answers your question, but feel free to ask further clarifications in case some points are still not so clear!

totucuong • 6 years ago

Hi Alessandro,

Thanks for the article. Have you tried to use tfd.JointDistributionSequential instead of tfd.Normal as the output of the Keras model?

I have tried it but I have a problem while compiling the model using:


negloglik = lambda y, p_y: -p_y.log_prob(y)
model.compile(optimizer=tf.optimizers.Adam(learning_rate = 0.1), loss=negloglik)

Thanks,

Alessandro Angioi • 6 years ago

Hi, sorry for the late reply; I suppose you can use a tfd.JointDistributionSequential as the output, but you need to wrap it in a DistributionLambda. If you have a specific problem in mind, feel free to share it and I can try to have a go at it.

Fabien Niel • 6 years ago

Hey Alessandro! That's really interesting, great job. I really like your blog!

Alessandro Angioi • 6 years ago

Thanks Fabien! Not a lot of content for now, but it's growing. I am really glad you like what's there!

Fabien Niel • 6 years ago

Yeah, I recently got interested in machine learning as well, and your physicist's approach to it makes it really interesting!

Oscar de Felice • 6 years ago

great job!

Alessandro Angioi • 6 years ago

Thanks!