I wonder if this is related to https://arxiv.org/pdf/1912.07559.pdf. It's hard to formulate as a concise question, but this paper covers research on interpolating parameters while keeping the loss constant.
I understand, just sharing :-/
Does this difference in importance between layers hold for MLPs too, or just conv nets?
What if we rewind back to an early iteration rather than the zeroth?
If lower layers are in charge of general features, why does this imply they shouldn't be critical? Is it something about the general features being more robust than the specialized features?
I have the same question
(I just asked a question above)
Can you speak about the relation to Frankle et al.'s work on linear mode connectivity and the lottery ticket hypothesis? (which I believe is about basins w.r.t. SGD noise)
Can the lottery ticket approach make use of these findings to decide which iteration to rewind back to?