Would it be more insightful to compare how quickly the deep and narrow networks reach their early-stopping points, assuming early stopping gives both comparable accuracy? This is in reference to the comparison of 400 epochs vs. 4000 epochs.
Hi Brenda, that was kind of what I was thinking too! I guess if all nets were trained with vanilla SGD that would be a good thing to check :-)
Do we know the best performance we can get from an ensemble of underparameterized networks on CIFAR-10?
When comparing and training the different networks, which quantity do you hold fixed across all of them: computation time, number of epochs, or something else? And why do you think that quantity is a reasonable choice for the comparison?
Have you considered running neural architecture search to see what “optimal” CNNs they come up with and how they compare against your models?
So for this, do you train the network first and then prune connections? Or do you do principled learning, i.e., learn sparse weights constructively?
@Manos: 2nd I guess
@Rahim: yeah, this is what it seems like!
I think it can be seen as the proximal-gradient update of a modified L1 penalty... usually beta would be related to the step size
Yeah, I just asked because there’s a series of results on network pruning after training (a paper from this year’s ICML using tropical polynomial division comes to mind)
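(For context on the proximal-gradient remark above: for a plain L1 penalty, as opposed to the modified penalty mentioned, the proximal-gradient update reduces to soft-thresholding after the gradient step. A minimal sketch, where the product of the step size and the penalty weight `lam` plays the role the speaker's beta would: both names here are illustrative, not from the talk.)

```python
import numpy as np

def prox_l1(w, step, lam):
    """Proximal operator of lam * ||w||_1: elementwise soft-thresholding."""
    return np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)

def prox_grad_step(w, grad_f, step, lam):
    """One proximal-gradient step on f(w) + lam * ||w||_1:
    a gradient step on the smooth loss f, then soft-thresholding."""
    return prox_l1(w - step * grad_f(w), step, lam)

# The threshold step * lam zeroes out small weights, which is how this
# kind of update produces sparsity during training rather than after it.
w = np.array([3.0, -0.5, 0.2])
print(prox_l1(w, step=1.0, lam=0.4))  # → [ 2.6 -0.1  0. ]
```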
Behnam, you and the text you cited on the MDL slide said that the description language is arbitrary. Could you use a language that assigns a smaller description length to models with more parameters (e.g., a length like 1/(32*#params)), and if so, does that hurt the applicability of MDL as an explanation here?