Haque Ishfaq

25:57

Why are we saying it is overparameterized here? I mean, what’s the relation with respect to the loss?

Amirreza Shaeiri

26:44

Do we have any assumption on the loss function?

Amirreza Shaeiri

27:18

e.g., Lipschitz or convex

Haque Ishfaq

28:04

yes.

arbish

28:11

yes, it makes sense now.

Haque Ishfaq

28:57

yeah it makes sense

Kai Wang

29:00

yes

Haozhe Shan

33:28

Is w_j a vector or matrix?

Dimitris Kalimeris

34:08

it is a matrix of the weights for that layer

Haozhe Shan

34:57

So w_j^T w_j is a matrix, right? So the constant is a constant matrix?

Manos Theodosis

35:39

A follow-up question on this: do we assume that all layers have the same number of neurons?

Haozhe Shan

36:51

Yes. Thank you

Manos Theodosis

36:55

yeap, thanks!

Abdulkadir Canatar

37:00

Doesn’t overparametrization refer to the number of parameters being larger than the number of training samples? How does that enter this discussion?

Boaz Barak

38:24

I think what he is referring to here is that, in general, even though you can specify a linear map from dimension d to dimension m with d*m parameters, by adding layers you can increase the number of parameters by an arbitrary amount.
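
(A tiny illustrative sketch of this point; the dimensions, width, and depth below are made-up values, not from the talk. A depth-N linear network still computes a single d-to-m linear map, but its raw parameter count can be made as large as you like.)

# Python, illustrative only: parameter count of a direct linear map vs. a deep linear network.
d, m = 10, 5        # input and output dimensions (hypothetical)
k, N = 100, 8       # hidden width and depth (hypothetical)

direct_params = d * m                              # one weight matrix W of size m x d
deep_params = d * k + (N - 2) * k * k + k * m      # W_1: k x d, W_2..W_{N-1}: k x k, W_N: m x k

print(direct_params)  # 50
print(deep_params)    # 61500, yet the product W_N ... W_1 is still just an m x d matrix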

Abdulkadir Canatar

38:40

Thanks.

Boaz Barak

39:26

This theorem is important, so questions are welcome!

Haque Ishfaq

40:39

Could you please give some visual intuition for the stretching along orthogonal directions using the preconditioner?

Nick Boffi

41:49

Is it a correct interpretation then to say that the deep linear parameterization corresponds to natural gradient descent in a specific metric?

Chinmay

42:51

is this specific to gradient flow, or also true for gradient descent?

Yamini Bansal

43:00

Just to clarify: does this theorem hold only for small initialization, or for all initializations?

Alan Chung

43:43

Would it be possible to write down the expression for this matrix P? Is it linear in W_{1:N}?

Yang Zheng

46:11

Is there any intuition for why the matrix P is PSD? Can it be positive definite?
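
(For reference, and assuming this is the end-to-end dynamics result for deep linear networks of Arora, Cohen, and Hazan (2018), which matches the discussion but is my assumption: under balanced initialization, gradient flow on the layers induces, on the end-to-end matrix W = W_N ... W_1,

dW/dt = - \sum_{j=1}^{N} (W W^T)^{(j-1)/N} ∇L(W) (W^T W)^{(N-j)/N},

so on vec(W) the preconditioner is P = \sum_{j=1}^{N} (W^T W)^{(N-j)/N} \otimes (W W^T)^{(j-1)/N}, a sum of Kronecker products of PSD matrices, hence PSD. It is positive definite when W is nonsingular, and it depends on W through fractional matrix powers, so it is not linear in W_{1:N}.)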

Manos Theodosis

53:43

Should the 1:N be something else inside F, since we’re talking about the benefit of introducing a linear network?

Amirreza Shaeiri

54:22

Just to make sure: is the theorem true for all loss functions?

annago

54:25

does N play a role?

annago

54:40

i.e., the magnitude of N

Amirreza Shaeiri

55:28

thank you

Ben Edelman

01:05:17

Could it be that (nonlinear) deep learning is approximately equivalent in some way to training a non-overparameterized model with a weird optimization algorithm, just as you showed is the case in this linear setting?

Haque Ishfaq

01:06:46

What might happen if the end-to-end matrix is not full rank here?

Manos Theodosis

01:10:05

@boaz you are muted

Rahul Sethi

01:11:07

Hi! Will this talk be recorded? I'm sorry I couldn't make it for the first half.

Dimitris Kalimeris

01:11:20

yes it’s recorded

Rahul Sethi

01:11:48

Thank you

kawaguch@mit.edu

01:24:46

But a point with singular W is where we have a degenerate saddle point? So the assumption lets us avoid degenerate saddle points?

Boaz Barak

01:26:13

I think you are right, but note that this is because of the specific structure of the preconditioner.

Cunlu Zhou

01:26:59

What’s the intuition for this phenomenon?

Cunlu Zhou

01:28:49

thanks!

Haque Ishfaq

01:35:22

Any quick intuition on why this norm minimization is equivalent to solving for the minimum-rank W?

Dimitris Kalimeris

01:40:25

The nuclear norm is the sum of the singular values, while the rank is the number of non-zero singular values. If you minimize the nuclear norm, you try to push as many singular values as possible close to zero. Not sure this is 100% accurate, but that’s my intuition.
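
(A minimal numpy sketch of that intuition; the matrix is arbitrary, chosen only for illustration.)

import numpy as np

# An arbitrary matrix, chosen only to illustrate the two quantities.
A = np.array([[3.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1e-9]])

s = np.linalg.svd(A, compute_uv=False)  # singular values
nuclear_norm = s.sum()                  # sum of singular values, roughly 4.0
rank = int(np.sum(s > 1e-6))            # number of (numerically) non-zero singular values, i.e. 2

print(nuclear_norm, rank)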

Blake Bordelon

01:40:32

Does the initialization of the W’s matter for showing convergence to the smallest norm?

Zhilong Fang

01:41:06

Nuclear norm minimization is a relaxation of rank minimization; they are not totally equivalent.
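
(A toy example of that gap, with made-up matrices: diag(2, 0) has rank 1 and nuclear norm 2, while diag(0.9, 0.9) has rank 2 but nuclear norm 1.8, so minimizing the nuclear norm can prefer a higher-rank matrix; the two criteria coincide only under additional conditions on the problem.)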

Dimitris Kalimeris

01:41:40

yes, that’s right

Yang Zheng

01:50:51

just to clarify: what is the dimension of each weight W_j?

Ben Edelman

01:51:19

Could the unobserved entry go to negative infinity instead?

Dimitris Kalimeris

01:51:49

I think he has an absolute value

Ben Edelman

01:52:21

ah, OK

Amirreza Shaeiri

02:01:28

Very interesting, just…

Amirreza Shaeiri

02:02:34

Do you think something like a linear ResNet could have some advantages?

Amirreza Shaeiri

02:06:33

thank you

Cunlu Zhou

02:07:01

Thanks!

Shubh Pachchigar

02:07:18

Thank you

Ben Edelman

02:07:27

Thanks!

Kai Wang

02:07:36

Thank you