Haque Ishfaq
25:57
Why are we saying it’s overparameterized here? I mean, what’s the relation w.r.t. the loss?
Amirreza Shaeiri
26:44
Do we have any assumption on the loss function?
Amirreza Shaeiri
27:18
e.g., Lipschitz or convex
Haque Ishfaq
28:04
yes.
arbish
28:11
yes, it makes sense now.
Haque Ishfaq
28:57
yeah it makes sense
Kai Wang
29:00
yes
Haozhe Shan
33:28
Is w_j a vector or matrix?
Dimitris Kalimeris
34:08
it is a matrix of the weights for that layer
Haozhe Shan
34:57
So w_j^T w_j is a matrix right? So the constant is a constant matrix?
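(Context: the “constant” being discussed is presumably the conserved quantity of gradient flow on deep linear networks. A minimal sketch of that conservation law, assuming the loss depends on the layers only through the end-to-end product W_N ⋯ W_1:

$$\frac{d}{dt}\left(W_{j+1}^\top W_{j+1} - W_j W_j^\top\right) = 0, \qquad j = 1, \dots, N-1,$$

so each difference W_{j+1}^T W_{j+1} - W_j W_j^T remains equal to whatever constant matrix it was at initialization.)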
Manos Theodosis
35:39
A follow-up question on this: do we assume that all layers have the same number of neurons?
Haozhe Shan
36:51
Yes. Thank you
Manos Theodosis
36:55
yeap, thanks!
Abdulkadir Canatar
37:00
Doesn’t overparametrization refer to the number of parameters being larger than the number of training samples? How does that enter this discussion?
Boaz Barak
38:24
I think what he’s referring to here is that, in general, even though you can specify a linear map from d to m with d*m parameters, by adding layers you can increase the number of parameters by an arbitrary amount.
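(To make the parameter counting concrete, with hypothetical numbers: a single linear layer from R^d to R^m has d*m parameters, while writing the same map as a product W_N ⋯ W_1 with all hidden widths equal to d has (N-1)*d^2 + d*m parameters, which grows without bound in N even though the model still only represents linear maps. For example, with d = m = 100 and N = 10, that is 9*10,000 + 10,000 = 100,000 parameters versus 10,000 for the single layer.)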
Abdulkadir Canatar
38:40
Thanks.
Boaz Barak
39:26
This theorem is important so questions would be welcome!
Haque Ishfaq
40:39
Could you please give some visual intuition for the stretching along orthogonal directions using the preconditioner?
Nick Boffi
41:49
Is it a correct interpretation then to say that the deep linear parameterization corresponds to natural gradient descent in a specific metric?
Chinmay
42:51
is this specific to gradient flow, or also true for gradient descent?
Yamini Bansal
43:00
Just to clarify - does this theorem hold only for small initialization, or for all initializations?
Alan Chung
43:43
Would it be possible to write down the expression for this matrix P? Is it linear in W_{1:N}?
Yang Zheng
46:11
Is there any intuition for why the matrix P is PSD? Can it be positive definite?
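(Context for the questions about P: the preconditioner here is presumably the one appearing in the Arora–Cohen–Hazan end-to-end dynamics for deep linear networks. A sketch of that expression, assuming gradient flow and balanced initialization, is

$$\frac{d}{dt} W_{1:N} = -\sum_{j=1}^{N} \left[W_{1:N} W_{1:N}^\top\right]^{\frac{j-1}{N}} \nabla L(W_{1:N}) \left[W_{1:N}^\top W_{1:N}\right]^{\frac{N-j}{N}}.$$

Each bracketed factor is a fractional power of a PSD matrix and is therefore PSD, which is one way to see why the induced operator P is PSD; it can lose positive definiteness when W_{1:N} is singular, i.e., has zero singular values.)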
Manos Theodosis
53:43
Should the 1:N be something else inside F, since we’re talking about the benefit of introducing a linear network?
Amirreza Shaeiri
54:22
Just to make sure, is the theorem true for all loss functions?
annago
54:25
does N play a role?
annago
54:40
i.e., the magnitude of N
Amirreza Shaeiri
55:28
thank you
Ben Edelman
01:05:17
Could it be that (nonlinear) deep learning is approximately equivalent in some way to training a non-overparameterized model with a weird optimization algorithm, just as you showed is the case in this linear setting?
Haque Ishfaq
01:06:46
What might happen if the end-to-end matrix is not full rank here?
Manos Theodosis
01:10:05
@boaz you are muted
Rahul Sethi
01:11:07
Hi! Will this talk be recorded? I'm sorry I couldn't make it for the first half.
Dimitris Kalimeris
01:11:20
yes it’s recorded
Rahul Sethi
01:11:48
Thank you
kawaguch@mit.edu
01:24:46
But a point with singular W is where we have a degenerate saddle point? So the assumption makes us avoid degenerate saddle points?
Boaz Barak
01:26:13
I think you are right, but note that this is because of the specific structure of the preconditioner.
Cunlu Zhou
01:26:59
What’s the intuition for this phenomenon?
Cunlu Zhou
01:28:49
thanks!
Haque Ishfaq
01:35:22
Any quick intuition on why this norm minimization is equivalent to solving for the minimum-rank W?
Dimitris Kalimeris
01:40:25
The nuclear norm is the sum of the singular values, while the rank is the number of non-zero singular values. If you minimize the nuclear norm, you try to set as many singular values as possible close to zero. Not sure that this is 100% accurate, but that’s my intuition.
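(A minimal numerical sketch of this intuition, using two hypothetical 2x2 matrices; it also illustrates the later remark that nuclear norm minimization and rank minimization are not fully equivalent.)

import numpy as np

# Nuclear norm = sum of singular values; rank = number of non-zero singular values.
A = np.diag([2.0, 0.0])   # singular values {2, 0}: rank 1, nuclear norm 2
B = np.diag([1.0, 1.0])   # singular values {1, 1}: rank 2, nuclear norm 2

for name, M in [("A", A), ("B", B)]:
    sv = np.linalg.svd(M, compute_uv=False)
    print(name, "singular values:", sv,
          "| rank:", np.linalg.matrix_rank(M),
          "| nuclear norm:", np.linalg.norm(M, ord="nuc"))

# A and B have the same nuclear norm but different ranks: the nuclear norm is a
# convex surrogate that pushes singular values toward zero, not the rank itself.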
Blake Bordelon
01:40:32
Does initialization of W’s matter to show convergence to smallest norm?
Zhilong Fang
01:41:06
Nuclear norm minimization is a relaxation of rank minimization. They are not totally equivalent.
Dimitris Kalimeris
01:41:40
yes, that’s right
Yang Zheng
01:50:51
just to clarify: what is the dimension of each weight W_j?
Ben Edelman
01:51:19
Could the unobserved entry go to negative infinity instead?
Dimitris Kalimeris
01:51:49
I think he has an absolute value
Ben Edelman
01:52:21
ah, OK
Amirreza Shaeiri
02:01:28
Very interesting, just…
Amirreza Shaeiri
02:02:34
Do you think something like a linear resnet could have some advantages?
Amirreza Shaeiri
02:06:33
thank you
Cunlu Zhou
02:07:01
Thanks!
Shubh Pachchigar
02:07:18
Thank you
Ben Edelman
02:07:27
Thanks!
Kai Wang
02:07:36
Thank you