Awesome! Great to hear that it was so helpful--I enjoy writing these tutorials, and it's very rewarding to hear when they make a difference for people!
Hi Chris, I hope that you'll still read and respond to comments on this post sometimes. Great post; I also appreciate the clarifications made in the comments. I have a question for you. When you're explaining the 3/4 power for the weights, you refer to the range (0, 1). But the x-axis is the word frequency, which won't fall in that range. Shouldn't we focus on the rest of the plot and note that raising the frequency to the power of 3/4 changes the distribution so that words whose frequencies fall in a (wide) range are treated the same way? Almost like bucketing!
Hi Mahnaz, thank you for your comment! That portion of the post was in error, and I've updated it.
In order to draw insights from the negative sampling equation, it helps to apply it to some real word frequency data. You can generate a plot where the x-axis is the word's ranking in the vocabulary, and the y-axis is the probability of selecting it. The gist of it is that it decreases the probability of choosing common words and increases the probability of choosing rarer words.
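As a rough sketch of that effect (the word frequencies below are made up purely for illustration):

```python
# Made-up word frequencies (fractions of the corpus), for illustration.
freqs = {"the": 0.10, "quick": 0.002, "aardvark": 0.00005}

# Negative-sampling distribution: P(w_i) = f(w_i)^0.75 / sum_j f(w_j)^0.75
denom = sum(f ** 0.75 for f in freqs.values())
probs = {w: (f ** 0.75) / denom for w, f in freqs.items()}

# Relative to raw frequency, common words lose selection probability
# and rare words gain it.
for w, f in freqs.items():
    print(f"{w:10s} raw={f / sum(freqs.values()):.5f} sampled={probs[w]:.5f}")
```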
I go into depth on this in Chapter 3 of the eBook, and in the accompanying example code, both of which are available to purchase here.
Thanks again!
Chris
In my opinion, you can limit the range of f(w_i) to (0, 1) by dividing both the numerator and the denominator by N^(3/4), where N is the sum of the frequencies of all the words in the training vocabulary. Instead of taking f(w_i) as the raw word frequency, you can see it as the fraction of the total words in the corpus that are that word.
Hi Chris,
Awesome post. Very insightful. However, I do have a question. I noticed that, in a bid to reduce the number of weights we'll have to update, frequently co-occurring words were paired and viewed as one word. Intuitively, that looks to me like we've just added an extra word ("New York"), while other occurrences of the word "New" that have not appeared with the word "York" would be treated as stand-alone. Am I entirely wrong to assume that the ultimate size of our one-hot encoded vector would grow in this regard? Thanks.
"Common word pair" is an improvement to the original model in a way to increase the performance of the model. As it was mentioned in the post the context for "New" as a single word is different than the context for "New-York". Unlike other improvements presented in second paper, this one brings a cost along. Still it's worth it.
Hi Laurence, thanks for your comment! Mahnaz is correct--adding word pairs and phrases to the vocabulary makes the compute burden higher, not less.
I see how the post was misleading around this, though. I've moved the Word Pairs and "Phrases" section to the end, so that it's not misconstrued as a performance enhancement (like the other two topics).
Thanks again!
Chris
this is probably one of the best articles i've read so far in the matter of word2vec and how it was optimized. Thank you very much Chris!
Awesome, thanks for the encouragement!!
Hi. Great tutorial and thank you. I was wondering what is the resource paper for the following:
"They have a large array with 100M elements (which they refer to as the unigram table). They fill this table with the index of each word in the vocabulary multiple times, and the number of times a word’s index appears in the table is given by P(wi) * table_size."
I am interested to know why 100 million was chosen as the table size.
Hi Octavian, thanks for the comment!
The size of the unigram table determines the precision of its behavior. Whatever the probability is for a word, you need to round it to an integer number of table rows, which means some loss of precision.
The size of the vocabulary also matters--their Google News model had 3 million words, so I think 100M is scaled for that vocabulary size.
These numbers come from the author's own C implementation, which I've commented. Search for the `InitUnigramTable` function.
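As a sketch of what `InitUnigramTable` does, with made-up word counts and the table size shrunk from 100M down to 1,000 just for illustration:

```python
import random

# Made-up word counts; table size shrunk from 100M to 1,000.
counts = {"the": 500, "quick": 40, "fox": 10}
table_size = 1000
power = 0.75

# Each word's index fills a number of slots proportional to P(w_i).
denom = sum(c ** power for c in counts.values())
table = []
for word, c in counts.items():
    slots = round((c ** power) / denom * table_size)
    table.extend([word] * slots)

# A negative sample is then just a uniform draw from the table; common
# words occupy more slots, but fewer than their raw counts would give.
negative = table[random.randrange(len(table))]
```

A bigger table means each word's probability is rounded to a finer fraction of slots, which is the precision point mentioned above.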
Thanks,
Chris
Thanks a lot for your awesome blog!
But I got a question about the negative sampling process while reading.
In the paper, it samples some negative words for which the outputs are expected to be zeros, but what if a sampled word happens to be in the context of the input word? For example, if the sentence is "The quick brown fox jumps over the lazy dog", the input word is "fox", and the positive word is "jumps", one sampled word could be "brown". Will this situation result in some errors?
It's a probabilistic model. So, if we have a large vocabulary, the probability of this occurring will be small.
So awesome! Thanks Chris! Everything became so clear! So much fun learning it all!
Haha, thanks, Jane! Great to hear that it was helpful.
this made a big difference for me in terms of understanding the optimisation process. Thanks!
Good to hear, thanks Sebastian!
Exactly what I needed to know about negative sampling. Thank you so much, may God bless you!
So the subsampling P(w_i) is not really a probability, as it's not bounded between 0 and 1. Case in point: try it for 1e-6 and you get something around 1,000, which threw me for quite a loop when I was coding this.
Yeah, good point. You can see that in the plot of P(w_i).
Thanks for the great tutorial.
About this comment "Recall that the output layer of our model has a weight matrix that’s 300 x 10,000. So we will just be updating the weights for our positive word (“quick”), plus the weights for 5 other words that we want to output 0. That’s a total of 6 output neurons, and 1,800 weight values total. That’s only 0.06% of the 3M weights in the output layer!"
Should this actually be 3,600 weights total for each training example, given that we have an embedding matrix and a matrix of output weights, and BOTH involve updating 1,800 weights (300 x 6 neurons)? (Both of which should be whatever dimension you are using for your embeddings multiplied by the vocab size.)
Hi Ben, thanks for the comment.
In my comment I'm talking specifically about the output layer. If you include the hidden layer, then yes, there are more weights updated. The number of weights updated in the hidden layer is only 300, though, not 1800, because there is only a single input word.
So the total for the whole network is 2,100. 300 weights in the hidden layer for the input word, plus 6 x 300 weights in the output layer for the positive word and five negative samples.
And yes, you would replace "300" with whatever dimension you are using for your word embeddings. The vocabulary size does *not* factor into this, though--you're just working with one input word and 6 output words, so the size of your vocabulary doesn't impact this.
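The counting can be written out directly (a trivial sketch; swap in your own embedding dimension):

```python
# Weights updated per training sample with negative sampling.
embed_dim = 300      # use your own embedding dimension here
num_negative = 5

hidden_updates = embed_dim                       # one row for the input word
output_updates = (1 + num_negative) * embed_dim  # positive word + 5 negatives
total_updates = hidden_updates + output_updates

print(total_updates)  # 2100
```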
Hope that helps! Thanks!
This is super helpful, I appreciate this. My intuition (however naive it may be) was that the embeddings in the hidden layer for the negative sample words should also be updated as they are relevant to the loss function. Why is this not the case? I suppose I may have to drill down into the equation for backprop to find out. I suppose it has to do with the fact that when the one-hot vector is propagated forward in the network, it amounts to selecting only the embedding that corresponds to the target word.
That's exactly right--the derivative of the model with respect to the weights of any other word besides our input word is going to be zero.
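A tiny numeric sketch of why (toy shapes, vocab = 4 and dim = 3, chosen just for illustration):

```python
# A one-hot input only "touches" one row of the input weight matrix W1.
W1 = [[0.1, 0.2, 0.3],
      [0.4, 0.5, 0.6],
      [0.7, 0.8, 0.9],
      [1.0, 1.1, 1.2]]

x = [0.0, 0.0, 1.0, 0.0]   # one-hot input word, index 2

# Hidden layer: h = x @ W1, which just selects W1's row 2.
h = [sum(x[i] * W1[i][j] for i in range(4)) for j in range(3)]

# The gradient of the loss w.r.t. W1 is outer(x, dL/dh): every row
# except row 2 is multiplied by a zero in x, so only W1[2] is updated.
dL_dh = [1.0, 1.0, 1.0]    # stand-in upstream gradient
grad_W1 = [[x[i] * dL_dh[j] for j in range(3)] for i in range(4)]
```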
Hit me up on LinkedIn!
Hey Chris,
When you say that only 300 weights in the hidden layer are updated, are you assuming that the training is done with a minibatch of size 1?
I think if the minibatch is greater than 1, then the number of weights that will be updated in the hidden layer is 300 x number of unique input words in that minibatch.
Please correct me if I am wrong.
And thank you so much for writing this post. It makes reading the academic papers so much easier!
Hi Leleand, that's correct--I'm just saying that there are only 300 weights updated per input word.
Thank you for your detailed explanation! This is the best personal knowledge-sharing website I have ever seen! :D
Thanks for the tutorial!
I've thought for a long time that negative sampling was some tricky way to calculate the loss function. But it turns out that it's a mechanism which limits the movement of the gradient through a large number of weights (neurons). This means multiplying a small matrix. That is, the mechanism works inside the network, not somewhere outside. Am I right?
But why is this gradient not important? Why isn't the same mechanism applied, say, in computer vision?
Does The Inner Workings of word2vec - Pro Edition have step-by-step Python code?
Great explanation !
"In the hidden layer, only the weights for the input word are updated (this is true whether you're using Negative Sampling or not)". When we backpropagate, the weights of other words, i.e. the rows of the weight matrix corresponding to other words, are also updated, aren't they?
In regards to the Subsampling, I was still trying to understand the intuition behind the formula for keeping/ removing each word so i decided to play with some plots (link on the bottom of post):
First, it makes sense that we don't want to use common (aka high-frequency) words such as "that", "the", etc., so the formula for the probability of keeping a word should be inversely proportional to the word's frequency, which leads us to the naive formula P(wi) = 1/f(wi), where f(wi) is word i's frequency. This is a decreasing function as f(wi) goes from 0 to 1. (the wide red line in the plot)
Then we realise that the frequency domain is [0,1], which means the naive formula P(wi) is always greater than 1, no matter the frequency. But we want words whose P(wi) is less than 1, and ideally very small when those words are very frequent. Therefore we can add an extra term, the sample parameter t, which, if you notice in the figure below, will make the function P(wi) drop sooner. In fact, when f(wi) reaches t, P(wi) = 1, and then it drops further. Our formula for the probability of keeping word i is now: P(wi) = t/f(wi). (blue line)
The authors then decide to add the square root which has a smoothing effect (purple line). This makes frequent words have similar probabilities because the function doesn't drop as fast.
Voilà, the final formula is then: P(wi) = sqrt(t/f(wi)) (purple line). If you consider the probability of rejecting a word, it is simply 1 - P(wi).
Surprisingly, the formula in this article isn't the one presented in the original paper. If you work through the math, the formula from this article is actually sqrt(t/f(wi)) + t/f(wi), which is our final formula plus an extra t/f(wi) term. From the graph (red line from the middle) you can see that it mostly adds more smoothness and right-shifts the point where P(wi) = 1.
Plot of the steps to reach the subsampling formula:
https://www.desmos.com/calc...
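The two versions of the formula can also be compared numerically (a sketch, with t = 0.001 as the sample value):

```python
import math

t = 0.001  # the "sample" parameter

def p_keep_paper(f):
    # Keep-probability as presented in the original paper: sqrt(t / f).
    return math.sqrt(t / f)

def p_keep_article(f):
    # The variant discussed in the article: sqrt(t/f) + t/f.
    return math.sqrt(t / f) + t / f

# Both agree that rarer words (small f) are kept more often; the extra
# t/f term shifts the point where P(wi) crosses 1 to the right.
for f in (0.0001, 0.001, 0.01, 0.1):
    print(f"f={f}: paper={p_keep_paper(f):.4f} article={p_keep_article(f):.4f}")
```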
Hi,
great explanation.
I need a Python implementation of skip-gram negative sampling.
I haven't checked this implementation yet, but this might help: https://www.kaggle.com/ashu...
Thank You so much!!!!Continue the good work!!
Great article. You have a typo in the second paragraph of `Selecting Negative Samples`; it says `occus`.
This article is pure gold. Thanks Chris
Hi, thanks for sharing. You mentioned "In the hidden layer, only the weights for the input word are updated (this is true whether you're using Negative Sampling or not)". I think you are referring to the input weight matrix here, right? How about the 300x6 output weight matrix when using Negative Sampling? How do you get the weights for this 300x6 matrix? Is it copied from the input weight matrix for these 6 words? Because if only the weights for the input word are updated, then this 300x6 matrix will never be updated. I just wonder where we get it.
I might be wrong but:
Q. I think you are here referring to the input weights matrix, right?
A. Yes
Q. Is it copied from the input weights matrix for these 6 words?
A. No, the output weights are different. These are discarded after the training is over and only the input weights are kept.
Inputs shape: 1x 10000 # for 10000 words
W1: Input Weights shape: 10000 x 300 # 300 is the embedding size.
W2: Output Weights shape: 300 x 10000 # without negative sampling.
Both W1 and W2 get updated. We discard W2 after training and keep only W1 which are nothing but the embeddings for the words.
In case of Negative sampling, out of 10000 only 6 are selected for loss and backpropagation.
Hope this helps.
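A toy sketch of those shapes (vocab = 5, dim = 3, shrunk just for illustration):

```python
import random

vocab_size, embed_dim = 5, 3
rnd = random.Random(0)

# W1: input/embedding weights, one row per word (vocab x dim).
W1 = [[rnd.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(vocab_size)]
# W2: output weights (dim x vocab). Trained alongside W1 but discarded
# after training; the rows of W1 are the word embeddings we keep.
W2 = [[rnd.uniform(-1, 1) for _ in range(vocab_size)] for _ in range(embed_dim)]

input_idx = 2
h = W1[input_idx]  # forward pass: a one-hot input just selects a row

# With negative sampling, only the positive word's column and a few
# sampled negative columns of W2 enter the loss -- not all vocab_size.
positive_idx = 1
negative_idxs = [0, 4]
scores = {j: sum(h[d] * W2[d][j] for d in range(embed_dim))
          for j in [positive_idx] + negative_idxs}
```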
In subsampling, do we remove all the instances of the words which have a selection probability less than 1, or do we remove the corresponding fraction of instances (for example, if the selection probability of a word is 0.5, do we remove 50% of its instances)? Also, if we do remove all the instances, then how are the representations for those words learnt?
I am still learning this, so i might be wrong, but:
If the selection prob is 0.5, then we discard 50% of the samples and keep the rest. I don't think we will remove all the instances.
Also, despite this, there might still be words that are not in the dataset, or that we just ignore because their frequency is too low. In all such cases, a word will be considered OOV (out of vocabulary). Probably the word with index 0 in the dataset will be OOV (or UNK).
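A sketch of that per-occurrence coin flip (the keep probabilities here are made up):

```python
import random

rnd = random.Random(42)

# Made-up keep probabilities: "the" survives ~10% of the time,
# "fox" is rare enough to always be kept.
p_keep = {"the": 0.1, "fox": 1.0}

corpus = ["the"] * 1000 + ["fox"] * 10

# Each occurrence is kept or discarded independently, so a word with
# p_keep < 1 still appears in the training data -- just less often.
kept = [w for w in corpus if rnd.random() < p_keep[w]]
```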
I have a silly question: if my input is "quick", then my output should be "The", "fox", "brown", right?
The quick brown fox jumps....
If your input is quick, then output will be 'The' and 'brown'.
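A sketch of how the training pairs are generated, assuming a window size of 2 (with a window of 1 you would only get 'The' and 'brown'):

```python
sentence = "The quick brown fox jumps over the lazy dog".split()
window = 2

# For each center word, pair it with every word within `window`
# positions on either side.
pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

quick_outputs = [out for inp, out in pairs if inp == "quick"]
print(quick_outputs)  # ['The', 'brown', 'fox']
```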
This is the most simple explanation of the topic that I've found. Thanks for taking the time to write this.
Thanks again Chris for this awesome explanation !
I have a question. Why is the length of the unigram table chosen to be 100M elements in the C-code implementation ? Does 100M have a special meaning relative the vocabulary size or the size of the hidden layer ?
Chris, awesome explanation. You have saved us thousands of man-hours by putting these together. I have a little point which might be worth adding. The objective function (Eq. 4) in the second paper by Mikolov et al. is an approximation to the softmax objective function. Specifically, the way the p(w_i)s for negative samples show up in equation 4 might be a great addition to this article. Thanks so much for mincing the papers for us.
Thanks for the great article!
There appear to be two definitions provided for the probability of keeping a word (a negative sample): the first is inversely proportional to the frequency of the word and the second is directly proportional to the frequency of the word?
I think, the first probability is for preparing the input dataset - sample the common words less frequently and sample the rare words more frequently, so that all words get good representation.
The second probability is to choose the output words for negative sampling. Here it is directly proportional.
I have got a virtual map in my head about word2vec within a couple hours thanks to your posts. The concept doesn't seem daunting anymore. Your posts are so enlightening and easily understandable. Thank you so much for the wonderful work!!!