I have a question about hardware: for those who do not have powerful machines, can they run models in the cloud? What service is recommended?
Hi Mostaf,
If you just want to try things out, Amazon spot instances are a very cheap way to start doing GPU computing. Have a look here: http://erikbern.com/?p=737
I have no experience with this myself though, so I don't know how the performance is, for example.
In the long run, if you plan to use GPUs regularly it might be more cost-effective to just buy one or more machines (but don't forget to include electricity costs in your calculation).
Hi Sander,
Impressive work. Thank you for sharing!
I was wondering: you make the data scale-invariant by adding scaled versions of images. This makes sense considering that galaxies look the same regardless of the scale at which you observe them. But this might not be true for the person judging the image. Smaller galaxies are probably harder to judge, and maybe this could be modeled too. Any thoughts on that?
Once again, thanks for sharing.
Hi,
Thanks! That's definitely true. I randomly sampled the scale factor log-uniformly between 1/1.3 and 1.3, so nothing too crazy. The resulting images were not too different from what was already in the dataset in terms of 'feature size'. Scaling them up or down 2x would probably change the image distribution too drastically (although I never tried it, so maybe it works even better, who knows).
Sander
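For concreteness, here is a small numpy sketch of that kind of log-uniform scale jitter (my own illustration, not the competition code; the 1.3 zoom bound comes from the reply above):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scale(max_zoom=1.3):
    """Draw a scale factor log-uniformly from [1/max_zoom, max_zoom]."""
    log_zoom = rng.uniform(-np.log(max_zoom), np.log(max_zoom))
    return np.exp(log_zoom)

# sampling in log space makes zooming in and zooming out equally likely
scales = np.array([sample_scale() for _ in range(10000)])
```

Each sampled factor would then be applied to an image before cropping, so the augmented set stays close to the original distribution of feature sizes.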
Hi Sander,
Thanks for the reply!
You're actually a colleague of mine at UGent. Clearly different application fields, but same interests I guess :-)
Thanks a lot for this deep-learning master class, Sander and congratulations on winning the competitions! Did you try dropconnect as well? Instead of setting a random subset of neuron activations to 0, you set a random subset of weights within your network to 0. Results from http://cs.nyu.edu/~wanli/dropc... look promising.
Thank you! I did not try dropconnect. It's actually not that easy to implement efficiently. With dropout, you just mask part of the input of the layer and then you can use the same weight matrix for all examples in a given minibatch, which makes it easy to exploit parallelism. With dropconnect however, each example in the minibatch needs to have its own weight matrix, with a different subset of weights masked out.
I believe the authors of the dropconnect paper found a way to exploit parallelism despite this, but in Theano for example, it isn't at all obvious to me how this would be done. I believe it would be a lot slower than dropout.
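The masking difference can be illustrated with a small numpy sketch (my own illustration, not the dropconnect authors' implementation): dropout masks the layer input and reuses one weight matrix for the whole minibatch, while dropconnect effectively needs a separately masked weight matrix per example:

```python
import numpy as np

rng = np.random.default_rng(42)
batch, n_in, n_out = 4, 6, 3
X = rng.standard_normal((batch, n_in))   # minibatch of inputs
W = rng.standard_normal((n_in, n_out))   # shared weight matrix
p = 0.5                                  # drop probability

# Dropout: mask the layer *input*; one weight matrix serves the whole
# minibatch, so the forward pass is still a single matrix product.
input_mask = rng.random((batch, n_in)) > p
dropout_out = (X * input_mask) @ W

# DropConnect: mask the *weights*, independently per example, so each
# example gets its own masked weight matrix and the single batched
# matmul is lost (here emulated with einsum over per-example matrices).
weight_masks = rng.random((batch, n_in, n_out)) > p
dropconnect_out = np.einsum('bi,bio->bo', X, W * weight_masks)
```

The einsum makes the cost explicit: the dropconnect pass materializes a `(batch, n_in, n_out)` tensor of masked weights, which is exactly the overhead that makes it hard to implement efficiently in a framework like Theano.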
In addition, the results reported in the paper aren't that impressive, in my opinion. Most of them are obtained by averaging the outputs of 5 different networks, and supposedly it doesn't work better when you only have a single network (see this comment by David Warde-Farley on reddit: http://www.reddit.com/r/Machin... ).
In short, I'll be sticking with dropout for the time being as I'm not convinced that dropconnect offers any significant advantages.
Some other dropout-inspired regularisation methods are interesting though. I would love to have a fast implementation in Theano of Matthew Zeiler's stochastic pooling. I tried it during the contest, but I was only able to add it in the topmost convolutional layer, because doing it in all convolutional layers was too expensive. Unfortunately, that didn't help.
Hi Sander, fantastic work - well done! I was wondering when you think you will be able to share your code? Your work gave me lots of inspiration, but a few things are not clear to me.
I've just published the code on GitHub: https://github.com/benanne/kag...
This is great. Thanks very much for sharing!
Thank you! I'm required to make the code available under a BSD-3 licence within the next 14 days to qualify for the prize, so you can definitely expect it before then :) Probably sooner!
Really nice discussion. Thanks for the details. Makes an interesting read!
Congrats! I am interested in the total training time for the best model on your hardware.
Thank you! As mentioned in the post, the best model took about 67 hours to train on the available hardware (hexacore CPU + GTX 680). I don't remember for sure if this was with Theano's garbage collector enabled or disabled, which makes a difference on the order of a few hours. I think it was enabled though, so disabling it (which requires more GPU memory) would make the total training time a few hours shorter.
Could you tell us a bit more about how important it was to use Nesterov momentum vs. simple SGD without momentum? What was the impact on the training speed and on the minimum train and test errors? How sensitive are the results to the momentum value?
Unfortunately I started using momentum very early on and I don't really have any numbers for it. I'd have to rerun one of the final models without it to be able to tell you how important it really is. I did experiment with a higher momentum value (0.99 instead of 0.9, with the learning rates decreased 10x), but this did not improve results, so I left it at 0.9 after that.
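For reference, here is a minimal numpy sketch of the standard Nesterov momentum update (a common textbook formulation, not necessarily the exact variant used in the competition code): the gradient is evaluated at the look-ahead point before the velocity update.

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.01, mu=0.9):
    """One Nesterov momentum update: evaluate the gradient at the
    look-ahead point w + mu*v, then update velocity and parameters."""
    g = grad_fn(w + mu * v)   # gradient at the look-ahead point
    v = mu * v - lr * g       # velocity update
    return w + v, v

# Toy problem: minimise f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = nesterov_step(w, v, lambda x: x)
```

With plain momentum the gradient would be taken at `w` instead of `w + mu*v`; the look-ahead is what tends to damp oscillations, and it matters more as `mu` approaches 1 (e.g. the 0.99 setting mentioned above).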
Congrats, I really enjoyed the read, and thanks for sharing. I was curious about the hardware requirements: if I wanted to give your solution a try for didactic purposes, do you think an i7 quad-core workstation with a decent GPU is enough, or is it going to take forever?
Thanks
Hey Alessandro, glad you enjoyed it!
Regarding hardware requirements: the computers I used had two GPUs, so I trained two networks simultaneously (Theano does not support using multiple GPUs together at the moment). So any decent GPU should suffice, and a quadcore CPU for data augmentation should be fine too. The GTX 680s that I used are actually not very good for CUDA, but this is what I had available (and I'm not complaining!). The largest models require 2GB of GPU RAM (or more if you disable Theano's garbage collector, which results in another speedup), and the memory usage of the data augmentation can be tuned by generating smaller or larger chunks.
So an i7 quadcore workstation with a decent GPU should be fine :) As I said in the post, I hope to make a more-or-less cleaned up version of the code available shortly.
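The chunked augmentation idea can be sketched as a simple generator (a hypothetical illustration, not the actual pipeline; the random flip stands in for the real transformations): smaller chunks lower the peak memory footprint at some cost in throughput.

```python
import numpy as np

def augmented_chunks(images, chunk_size=1024, rng=None):
    """Yield augmented chunks of a dataset; chunk_size controls how
    much augmented data is held in memory at once."""
    rng = rng if rng is not None else np.random.default_rng()
    for start in range(0, len(images), chunk_size):
        chunk = images[start:start + chunk_size]
        # placeholder augmentation: random horizontal flips
        flips = rng.random(len(chunk)) < 0.5
        chunk = np.where(flips[:, None, None], chunk[:, :, ::-1], chunk)
        yield chunk
```

Only one chunk of augmented images exists at a time, so tuning `chunk_size` trades CPU-side memory usage against how often the generator is invoked.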