
Yueting Liu • 7 years ago

I have got a virtual map in my head about word2vec within a couple hours thanks to your posts. The concept doesn't seem daunting anymore. Your posts are so enlightening and easily understandable. Thank you so much for the wonderful work!!!

Chris McCormick • 7 years ago

Awesome! Great to hear that it was so helpful--I enjoy writing these tutorials, and it's very rewarding to hear when they make a difference for people!

Mahnaz K • 5 years ago

Hi Chris, I hope that you'll still read and respond to comments on this post sometimes. Great post; I also appreciate the clarifications made in the comments. I have a question for you. When you're explaining the 3/4 power for the weights, you're referring to the range (0, 1). But the x-axis is the word frequency, which won't fall in that range. Shouldn't we be focused on the rest of the plot and note that raising frequency to the power of 3/4 will change the distribution in such a way that words whose frequencies fall in a (wide) range would be treated the same way? - almost like bucketing!

Chris McCormick • 5 years ago

Hi Mahnaz, thank you for your comment! That portion of the post was in error, and I've updated it.

In order to draw insights from the negative sampling equation, it helps to apply it to some real word frequency data. You can generate a plot where the x-axis is the word's ranking in the vocabulary, and the y-axis is the probability of selecting it. The gist of it is that it decreases the probability of choosing common words and increases the probability of choosing rarer words.
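
To make that concrete, here's a minimal sketch (with made-up word counts, not the data from the post) comparing the raw unigram distribution to the 3/4-power distribution used for negative sampling:

```python
import numpy as np

# Hypothetical word counts, most frequent first (not real corpus data).
counts = np.array([1_000_000, 100_000, 10_000, 1_000, 100], dtype=float)

# Raw unigram probabilities vs. P(w_i) = f(w_i)^0.75 / sum_j f(w_j)^0.75
unigram = counts / counts.sum()
neg_sampling = counts**0.75 / (counts**0.75).sum()

for raw, adj in zip(unigram, neg_sampling):
    print(f"raw: {raw:.4f}   3/4 power: {adj:.4f}")
# The most frequent words get a smaller selection probability than their raw
# frequency would give them, and the rarer words get a larger one.
```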

I go into depth on this in Chapter 3 of the eBook, and in the accompanying example code, both of which are available to purchase here.

Thanks again!
Chris

Smith Jason • 5 years ago

In my opinion, you can limit the range of f(w_i) to (0, 1) by dividing both the numerator and the denominator by N^(3/4), where N is the sum of the word counts over the whole training vocabulary. Instead of taking f(w_i) as the raw word count, you can then read it as the fraction of the total words in the corpus that are that word.
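
For what it's worth, the two readings give identical sampling probabilities, since the N^(3/4) factor cancels out of the numerator and denominator. A quick numerical check (made-up counts):

```python
import numpy as np

counts = np.array([900.0, 90.0, 9.0, 1.0])   # made-up word counts
N = counts.sum()

p_from_counts = counts**0.75 / (counts**0.75).sum()
p_from_fractions = (counts / N)**0.75 / ((counts / N)**0.75).sum()

print(np.allclose(p_from_counts, p_from_fractions))   # True
```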

Laurence Obi • 6 years ago

Hi Chris,
Awesome post. Very insightful. However, I do have a question. I noticed that, in a bid to reduce the number of weights we'll have to update, frequently co-occurring words were paired and viewed as one word. Intuitively, that looks to me like we've just added an extra word ("New York"), while occurrences of the word "New" that don't appear with the word "York" would still be treated as stand-alone. Am I entirely wrong to assume that the ultimate size of our one-hot encoded vector would grow in this regard? Thanks.

Mahnaz K • 5 years ago

"Common word pair" is an improvement to the original model in a way to increase the performance of the model. As it was mentioned in the post the context for "New" as a single word is different than the context for "New-York". Unlike other improvements presented in second paper, this one brings a cost along. Still it's worth it.

Chris McCormick • 5 years ago

Hi Laurence, thanks for your comment! Mahnaz is correct--adding word pairs and phrases to the vocabulary makes the compute burden higher, not lower.

I see how the post was misleading around this, though. I've moved the Word Pairs and "Phrases" section to the end, so that it's not misconstrued as a performance enhancement (like the other two topics).

Thanks again!
Chris

amine ahnine • 5 years ago

This is probably one of the best articles I've read so far on the subject of word2vec and how it was optimized. Thank you very much Chris!

Chris McCormick • 5 years ago

Awesome, thanks for the encouragement!!

Octavian Bordeanu • 5 years ago

Hi. Great tutorial and thank you. I was wondering what is the resource paper for the following:

"They have a large array with 100M elements (which they refer to as the unigram table). They fill this table with the index of each word in the vocabulary multiple times, and the number of times a word’s index appears in the table is given by P(wi) * table_size."

I am interested to know why 100 million was chosen as the table size.

Chris McCormick • 5 years ago

Hi Octavian, thanks for the comment!

The size of the unigram table determines the precision of its behavior. Whatever the probability is for a word, you need to round it to an integer number of table rows, which means some loss of precision.

The size of the vocabulary also matters--their Google News model had 3 million words, so I think 100M is scaled for that vocabulary size.

These numbers come from the author's own C implementation, which I've commented. Search for the "InitUnigramTable" function.
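
In Python, the idea behind the table looks roughly like this (a sketch of the concept, not a line-for-line port of the C code):

```python
import numpy as np

def init_unigram_table(counts, table_size=100_000_000, power=0.75):
    """Each word's index appears roughly P(w_i) * table_size times."""
    probs = counts**power / (counts**power).sum()
    rows_per_word = np.rint(probs * table_size).astype(np.int64)
    return np.repeat(np.arange(len(counts)), rows_per_word)

# Drawing a negative sample is then just indexing with a random integer,
# e.g. table[np.random.randint(len(table))]. The rounding to whole rows is
# where the precision loss mentioned above comes from.
```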

Thanks,
Chris

fangchao liu • 6 years ago

Thanks a lot for your awesome blog!
But I got a question about the negative sampling process while reading.
In the paper, it samples some negative words for which the outputs are expected to be zero, but what if the sampled word happens to be in the context of the input word? For example, the sentence is "The quick brown fox jumps over the lazy dog", the input word is "fox", the positive word is "jumps", and one sampled word is "brown". Will this situation result in some errors?

lovekesh thakur • 5 years ago

It's a probabilistic model. So, if we have a large vocabulary, the probability of this occurring will be small.

Jane • 7 years ago

so aweeesome! Thanks Chris! Everything became soo clear! So much fun to learn it all!

Chris McCormick • 7 years ago

Haha, thanks, Jane! Great to hear that it was helpful.

Sebastiaan Van Baars • 4 years ago

this made a big difference for me in terms of understanding the optimisation process. Thanks!

Chris McCormick • 4 years ago

Good to hear, thanks Sebastiaan!

Raki Lachraf • 5 years ago

Exactly what I needed to know about Negative Sampling. Thank you so much, may God bless you!

Joey Bose • 6 years ago

So the subsampling P(w_i) is not really a probability, as it's not bounded between 0 and 1. Case in point: try it for 1e-6 and you get 1000-something, which threw me for quite a loop when I was coding this.
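
If I've transcribed the post's formula correctly, P(w_i) = (sqrt(z(w_i)/0.001) + 1) * (0.001 / z(w_i)), a quick check reproduces that:

```python
import math

def p_keep(z, sample=1e-3):
    # z is the fraction of the corpus made up of word w_i.
    return (math.sqrt(z / sample) + 1) * (sample / z)

print(p_keep(1e-6))   # ~1031.6 -- far above 1, so treat it as "always keep"
print(p_keep(1e-2))   # ~0.416
```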

Chris McCormick • 6 years ago

Yeah, good point. You can see that in the plot of P(w_i).

Ben Bowles • 7 years ago

Thanks for the great tutorial.

About this comment "Recall that the output layer of our model has a weight matrix that’s 300 x 10,000. So we will just be updating the weights for our positive word (“quick”), plus the weights for 5 other words that we want to output 0. That’s a total of 6 output neurons, and 1,800 weight values total. That’s only 0.06% of the 3M weights in the output layer!"

Should this actually be 3,600 weights total for each training example, given that we have an embedding matrix and a matrix of output weights, and BOTH involve updating 1,800 weights (300 x 6 neurons)? (Both of which should be whatever dimension you are using for your embeddings multiplied by vocab size.)

Chris McCormick • 7 years ago

Hi Ben, thanks for the comment.

In my comment I'm talking specifically about the output layer. If you include the hidden layer, then yes, there are more weights updated. The number of weights updated in the hidden layer is only 300, though, not 1800, because there is only a single input word.

So the total for the whole network is 2,100. 300 weights in the hidden layer for the input word, plus 6 x 300 weights in the output layer for the positive word and five negative samples.

And yes, you would replace "300" with whatever dimension you are using for your word embeddings. The vocabulary size does *not* factor into this, though--you're just working with one input word and 6 output words, so the size of your vocabulary doesn't impact this.
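
As a sanity check on the arithmetic (parameters here are just the post's example values):

```python
embedding_dim = 300   # the post's example embedding size
num_negative = 5      # negative samples per training sample

hidden_updates = 1 * embedding_dim                   # one input word -> 300 weights
output_updates = (1 + num_negative) * embedding_dim  # positive word + 5 negatives -> 1,800 weights

print(hidden_updates + output_updates)               # 2,100 weights updated in total
```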

Hope that helps! Thanks!

Ben Bowles • 7 years ago

This is super helpful, I appreciate this. My intuition (however naive it may be) was that the embeddings in the hidden layer for the negative sample words should also be updated as they are relevant to the loss function. Why is this not the case? I suppose I may have to drill down into the equation for backprop to find out. I suppose it has to do with the fact that when the one-hot vector is propagated forward in the network, it amounts to selecting only the embedding that corresponds to the target word.

Chris McCormick • 7 years ago

That's exactly right--the derivative of the model with respect to the weights of any other word besides our input word is going to be zero.
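
A tiny numpy illustration of that point (a sketch, not the post's code): with a one-hot input, the gradient of the hidden-layer weights is non-zero only in the input word's row.

```python
import numpy as np

vocab_size, embed_dim = 10, 4
W1 = np.random.randn(vocab_size, embed_dim)   # hidden-layer (embedding) weights

x = np.zeros(vocab_size)
x[3] = 1.0                                    # one-hot input, word index 3

h = x @ W1                                    # forward pass: just selects row 3 of W1
grad_h = np.random.randn(embed_dim)           # pretend upstream gradient at the hidden layer

grad_W1 = np.outer(x, grad_h)                 # backprop into W1
print(np.nonzero(grad_W1.any(axis=1))[0])     # [3] -- only the input word's row gets a gradient
```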

Hit me up on LinkedIn!

Leland Milton Drake • 6 years ago

Hey Chris,

When you say that only 300 weights in the hidden layer are updated, are you assuming that the training is done with a minibatch of size 1?

I think if the minibatch is greater than 1, then the number of weights that will be updated in the hidden layer is 300 x number of unique input words in that minibatch.

Please correct me if I am wrong.

And thank you so much for writing this post. It makes reading the academic papers so much easier!

Chris McCormick • 6 years ago

Hi Leland, that's correct--I'm just saying that there are only 300 weights updated per input word.

Jack Gao • 2 years ago

Thank you for your detailed explanation! This is the best personal knowledge-sharing website I have ever seen! :D

CyberDreamer • 3 years ago

Thanks for tutorial!

I've thought for a long time that negative sampling was some tricky way to calculate the loss function. But it turns out that it is a mechanism which limits the movement of the gradient through a large number of weights (neurons). This means multiplying a small matrix. That is, the mechanism works inside the network, and not somewhere outside. Am I right?

But why is this gradient not important? Why isn't the same mechanism applied, say, in computer vision?

yubinml • 3 years ago

Does The Inner Workings of word2vec - Pro Edition have Python code, step by step?

Great explanation!

Bhairaja Gautam • 4 years ago

" In the hidden layer, only the weights for the input word are updated (this is true whether you’re using Negative Sampling or not) ". When we backpropagate, the weights of other words i.e. weight matrix corresponding to other words are also updated, isn't it?

snovaisg • 4 years ago

In regards to the subsampling, I was still trying to understand the intuition behind the formula for keeping/removing each word, so I decided to play with some plots (link at the bottom of this post):

First, it makes sense that we don't want to use common (aka: high frequency) words such as "that, the, etc..", so the formula for the Probability of accepting a word should be inversely proportional to the frequency of the word, which leads us to the naive formula: P(wi) = 1/f(wi), where f(wi) is the word i's frequency. This is a decreasing function as f(wi) goes from 0 to 1. (the wide red line in the plot)

Then we realise that the frequency domain is between [0,1], which means the naive formula P(wi) is always greater than 1, no matter the frequency. But we want words whose P(wi) is less than 1, and ideally very small when those words are very frequent. Therefore we can add an extra factor, the sample parameter t, which, if you look at the figure below, will make the function P(wi) drop sooner. In fact, when f(wi) reaches t, P(wi) = 1, and beyond that it drops below 1. Our formula for the probability of keeping word i is now: P(wi) = t/f(wi). (blue line)

The authors then decide to add the square root which has a smoothing effect (purple line). This makes frequent words have similar probabilities because the function doesn't drop as fast.

Voilà, the final formula is then: P(wi) = sqrt(t/f(wi)) (purple line). If you consider the probability of rejecting a word, it is simply 1 - P(wi).

Surprisingly, the formula in this article isn't the one presented in the original paper. If you work through the math, the formula from this article is actually sqrt(t/f(wi)) + t/f(wi), i.e. our final formula plus an extra t/f(wi) term, which from the graph (the red line in the middle) seems to mostly add more smoothness and right-shift the point where P(wi) = 1.

Plot of the steps to reach the subsampling formula:
https://www.desmos.com/calc...
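
A small numerical comparison of the two formulas (assuming I've transcribed both correctly, with t = 0.001):

```python
import math

t = 1e-3  # the 'sample' parameter

def p_paper(f):    # sqrt(t / f), the formula from the original paper (as derived above)
    return math.sqrt(t / f)

def p_article(f):  # sqrt(t / f) + t / f, the formula from the article / C code
    return math.sqrt(t / f) + t / f

for f in [1e-4, 1e-3, 1e-2, 1e-1]:
    print(f"f={f:.0e}  paper={p_paper(f):.3f}  article={p_article(f):.3f}")
# The extra t/f term right-shifts the point where P(wi) crosses 1 and keeps
# moderately frequent words a bit more often; for very frequent words the
# two curves get close again.
```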

Dum • 4 years ago

Hi,
great explanation.

I need a Python implementation of skip-gram negative sampling.

Nikhil Tirumala • 4 years ago

I haven't checked this implementation yet, but this might help: https://www.kaggle.com/ashu...

Harry Richard • 4 years ago

Thank you so much!!!! Continue the good work!!

perrohunter • 4 years ago

Great article. You have a typo in the second paragraph of `Selecting Negative Samples`; it says `occus`.

Louis • 5 years ago

This article is pure gold. Thanks Chris

Mingxuan Cao • 5 years ago

Hi, thanks for sharing. You mentioned "In the hidden layer, only the weights for the input word are updated (this is true whether you're using Negative Sampling or not)". I think you are here referring to the input weights matrix, right? How about the 300x6 output weights matrix when using Negative Sampling? How do you get the weights for this 300x6 matrix? Is it copied from the input weights matrix for these 6 words? Because if only the weights for the input word are updated, then this 300x6 matrix will never be updated. Just wondering where we get it.

Nikhil Tirumala • 4 years ago

I might be wrong but:
Q. I think you are here referring to the input weights matrix, right?
A. Yes
Q. Is it copied from the input weights matrix for these 6 words?
A. No, the output weights are different. These are discarded after the training is over and only the input weights are kept.
Input shape: 1 x 10000 # for 10,000 words
W1: input weights shape: 10000 x 300 # 300 is the embedding size
W2: output weights shape: 300 x 10000 # without negative sampling
Both W1 and W2 get updated. We discard W2 after training and keep only W1, which is nothing but the embeddings for the words.
In the case of negative sampling, out of the 10000 output neurons only 6 are selected for the loss and backpropagation.
Hope this helps.
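
A minimal sketch of those shapes in the negative-sampling case (made-up indices, following the post's 10,000-word vocabulary and 300 features):

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300

W1 = np.random.randn(vocab_size, embed_dim) * 0.01  # input/embedding weights (kept after training)
W2 = np.random.randn(embed_dim, vocab_size) * 0.01  # output weights (discarded after training)

input_idx = 42                               # index of the input word
outputs = [7, 13, 99, 512, 2048, 9000]       # 1 positive + 5 negative output words

h = W1[input_idx]                            # hidden activation: just one row of W1, shape (300,)
scores = h @ W2[:, outputs]                  # only 6 output neurons are evaluated, shape (6,)
# For this sample, training touches only W1[input_idx] and W2[:, outputs].
```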

Dhruv Bhargava • 5 years ago

In subsampling, do we remove all the instances of the words which have a selection probability less than 1, or do we remove the corresponding fraction of occurrences (for example, if the selection probability of a word is 0.5, do we remove 50% of its instances)? Also, if we do remove all the instances, then how are the representations for those words learnt?

Nikhil Tirumala • 4 years ago

I am still learning this, so I might be wrong, but:
If the selection prob is 0.5, then we discard 50% of the occurrences and keep the rest. I don't think we will remove all the instances.
Also, despite this, there might still be words that are not in the dataset, or we just ignore them since their frequency is too low. In all such cases, they will be treated as OOV - out-of-vocabulary words. Probably the word with index '0' in the dataset will be OOV (or UNK).
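
In other words, the keep/discard decision is made independently for each occurrence, something like (a sketch):

```python
import random

def keep_occurrence(keep_prob):
    # keep_prob is P(w_i) for this word; values >= 1 mean "always keep".
    return random.random() < keep_prob

# A word with keep_prob = 0.5 survives in roughly half of its occurrences, so
# the model still sees it (just less often) and its embedding is still learned.
```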

Clive777 • 5 years ago

I have a silly question: if my input is "quick", then my output should be "The", "fox", "brown", right?

Nikhil Tirumala • 4 years ago

The quick brown fox jumps....
If your input is "quick", then with a window size of 1 the outputs will be "The" and "brown"; with the window size of 2 used in the post, "fox" would be included as well.

mlwhiz • 5 years ago

This is the simplest explanation of the topic that I have found. Thanks for taking the time to write this.

trault14 • 5 years ago

Thanks again Chris for this awesome explanation!
I have a question. Why is the length of the unigram table chosen to be 100M elements in the C-code implementation? Does 100M have a special meaning relative to the vocabulary size or the size of the hidden layer?

Ali KebariGhotbi • 5 years ago

Chris, awesome explanation. You have saved us thousands of man-hours by putting these together. I have a little point which might be worth adding. The objective function (Eq. 4) in the second paper by Mikolov et al. is an approximation to the softmax objective function. Specifically, the way the P(w_i)s for the negative samples show up in Equation 4 might be a great addition to this article. Thanks so much for mincing the papers for us.

Steven Ghoussain • 5 years ago

Thanks for the great article!

There appear to be two definitions provided for the probability of keeping a word (a negative sample): the first is inversely proportional to the frequency of the word and the second is directly proportional to the frequency of the word?

Nikhil Tirumala • 4 years ago

I think the first probability is for preparing the training text - subsampling the common words so they are kept less often and the rare words get good representation.
The second probability is for choosing the output words for negative sampling. There it is (roughly) directly proportional to frequency, raised to the 3/4 power.
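
Side by side, as I understand them (t = 0.001, z(w_i) the word's fraction of the corpus, counts the raw word counts):

```python
import math

def p_keep(z, t=1e-3):
    # Subsampling: probability of keeping an occurrence in the training text.
    # Decreases as the word gets more frequent.
    return (math.sqrt(z / t) + 1) * (t / z)

def p_negative(counts, i, power=0.75):
    # Negative sampling: probability of selecting word i as a negative sample.
    # Increases with frequency (raised to the 3/4 power).
    weights = [c**power for c in counts]
    return weights[i] / sum(weights)
```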