Theory ML seminar: Zico Kolter (with lightning talk by Edelman & Shavit)

Andrew Ross

56:54

Will “perform as well” just be defined in terms of accuracy, or also robustness?

Andrew Ross

57:43

Cool, thanks :)

Boaz Barak

01:03:12

So we can think of y(x) = argmin_y f(x,y)?
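[Editor's note: a toy instance of this argmin-layer formulation; the quadratic f below is a hypothetical example chosen for illustration, not from the talk. The gradient comes from implicitly differentiating the optimality condition df/dy = 0 rather than unrolling the inner minimization.]

```python
def df_dy(x, y):
    # f(x, y) = (y - x**2)**2, so df/dy = 2*(y - x**2)
    return 2.0 * (y - x**2)

def argmin_layer(x, steps=200, lr=0.1):
    """Minimize f over y by gradient descent; the argmin here is y = x**2."""
    y = 0.0
    for _ in range(steps):
        y -= lr * df_dy(x, y)
    return y

x = 1.5
y_star = argmin_layer(x)      # converges to x**2 = 2.25

# Implicit function theorem applied to df/dy(x, y(x)) = 0:
#   dy/dx = -(d2f/dx dy) / (d2f/dy2) = -(-4*x) / 2 = 2*x
dy_dx = 2.0 * x
```

The same recipe works for any smooth f with an isolated minimizer; only the two second derivatives at the solution are needed.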

Boaz Barak

01:03:39

I guess answered :)

Thibaut Horel

01:06:10

What if there are multiple solutions to this equation? It does not uniquely specify the output of the network; should I now think of a network as a set-valued function?

Thibaut Horel

01:06:58

ok, thanks :)

Jacob Noah Steinhardt

01:11:04

Don't we need to know y* to get these? (Which is what we're trying to get?)

Jacob Noah Steinhardt

01:12:00

Thanks, got it now :)

Boaz Barak

01:12:52

So you will really use this as a layer of the form "find y s.t. f(x,w,y) = 0", where w are the weight vectors/parameters, and differentiate wrt w?
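[Editor's note: a sketch of the pattern Boaz describes; the concrete scalar f below is hypothetical, not from the talk. The layer output y(x, w) is defined implicitly by f(x, w, y) = tanh(w*x + y) - y = 0, and the implicit function theorem gives dy/dw = -(df/dw)/(df/dy): we differentiate the solution, never the solver's iterations.]

```python
import numpy as np

def solve_layer(x, w, iters=100):
    """Find y with f(x, w, y) = 0 by fixed-point iteration y <- tanh(w*x + y)."""
    y = 0.0
    for _ in range(iters):
        y = np.tanh(w * x + y)
    return y

def grad_w(x, w):
    """dy/dw at the solution, via the implicit function theorem."""
    y = solve_layer(x, w)
    s = 1.0 - y ** 2          # sech^2(w*x + y), since y = tanh(w*x + y)
    # For f = tanh(w*x + y) - y:  df/dw = x*s,  df/dy = s - 1
    return -(x * s) / (s - 1.0)

dy_dw = grad_w(1.0, 0.7)      # agrees with a finite-difference check
```

With a vector-valued layer, df/dy becomes a Jacobian and the division becomes a linear solve; that is the general recipe.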

Jacob Noah Steinhardt

01:12:53

You might get to this later, but what about inequality constraints?

Andrew Ross

01:13:08

Is it straightforward to apply this procedure repeatedly to get higher-order derivatives out of implicit layers?

Jacob Noah Steinhardt

01:16:25

Do convolutions work out nicely with the linear system solving?

Boaz Barak

01:25:38

Could we reduce general NN to weight-tied by increasing the width by a factor of the number of layers and then having W include something like "shift from one block to another"?
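[Editor's note: a toy version of the reduction Boaz suggests; the construction, sizes, and use of ReLU are my illustration, not from the talk. A depth-L network is emulated by one weight-tied update on a state of width (L+1)*n, where the big matrix carries each W_i on a subdiagonal block and so "shifts" activations from one block to the next.]

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
rng = np.random.default_rng(1)
n, L = 4, 3
Ws = [rng.normal(size=(n, n)) for _ in range(L)]
x = rng.normal(size=n)

# Ordinary depth-L forward pass
h = x.copy()
for Wi in Ws:
    h = relu(Wi @ h)

# One weight-tied matrix with W_i on the subdiagonal blocks
big = np.zeros(((L + 1) * n, (L + 1) * n))
for i, Wi in enumerate(Ws):
    big[(i + 1) * n:(i + 2) * n, i * n:(i + 1) * n] = Wi

# Applying the tied update L times moves activations block by block
z = np.concatenate([x, np.zeros(L * n)])
for _ in range(L):
    z = relu(big @ z)

out = z[-n:]   # the last block now holds the ordinary network's output
```

As Boaz notes next, nothing forces this shift matrix to be a contraction, which is presumably the stability concern.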

Boaz Barak

01:26:16

I guess this would not be stable

Preetum Nakkiran

01:30:25

Is there an issue with finding trivial solutions? (e.g. z* = U* = 0)

Boaz Barak

01:45:08

Just to clarify: we still have the same issue as with DNNs, namely many local minima (pairs (U, W) that minimize the loss), some of which would generalize better than others?

Benjamin Edelman

01:45:32

Is the situation much worse if you remove input injection (the Ux term)?

Benjamin Edelman

01:46:34

ha, right

Andrew Ross

01:48:47

With the same number of parameters?

Boaz Barak

01:51:36

Maybe this will be addressed later in the talk, but in practical DEQ networks, can you interpret some coordinates (or linear functions) of z_i as having different "virtual depth," corresponding to more and less basic features such as edges, eyes, etc.?

Andrew Ross

01:52:37

This is totally against the spirit of your talk, but what if you alternated implicit and explicit layers?

Andrew Ross

01:52:49

Would that still be reducible to a single DEQ?

Andrew Ross

01:53:57

Thanks :)

Benjamin Edelman

01:54:04

Is there a nice intuition for why we should expect a DEQ to perform well that doesn't make reference to the deep network representation theorem? (just based on the defn of DEQ)

Benjamin Edelman

01:55:41

OK, thanks

Daniel Temko

02:07:37

How is the compact version of the DEQ derived in practice? Are there architecture-specific ways of doing this?

Andrew Ross

02:08:25

Do you feel that DEQs have different inductive biases than standard DNNs/resnets, even if you try to give them similar fs?

Kai Wang

02:08:56

Why is the training time (especially the backward pass) only a constant factor worse? It requires computing a matrix inverse (or solving a linear system), which sounds much more expensive than standard back-propagation.
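[Editor's note: one way to see the answer, sketched with illustrative sizes rather than the speaker's code. For a DEQ layer z* = tanh(W z* + U x), the backward pass needs the vector-Jacobian product v = (I - J)^{-T} g, where J is the Jacobian of the update at z*. Instead of forming a matrix inverse, the linear system (I - J^T) v = g can be solved by the same kind of cheap fixed-point iteration as the forward pass.]

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
W = rng.normal(size=(n, n)) * 0.2   # scaled so the fixed-point iteration contracts
U = rng.normal(size=(n, d))
x = rng.normal(size=d)

# Forward: iterate z <- tanh(W z + U x) to the equilibrium z*
z = np.zeros(n)
for _ in range(200):
    z = np.tanh(W @ z + U @ x)

# Jacobian of the update at z*: J = diag(1 - z*^2) @ W
J = (1 - z**2)[:, None] * W

# Backward: solve (I - J^T) v = g by the iteration v <- g + J^T v,
# so each backward step costs about one forward-style iteration
g = rng.normal(size=n)              # incoming gradient dL/dz*
v = np.zeros(n)
for _ in range(200):
    v = g + J.T @ v
```

Gradients with respect to W, U, and x then follow from v by one chain-rule step each, which is why the overhead is a constant factor rather than an explicit O(n^3) inverse.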

Kai Wang

02:11:27

Thanks!

Lucas Liebenwein

02:17:50

Have you tried running Broyden's method for fewer iterations, so that the DEQ has the same running time as a deep network, and compared the performance? I realize you might not get to a fixed point, but is it enough for good performance in practice?

Lucas Liebenwein

02:19:32

thank you :)

Preetum Nakkiran

02:19:43

It seems somewhat surprising that DEQs for images have the same "inductive biases" as deep convnets; do you have thoughts/intuitions about this?

Andrew Ross

02:24:02

Tweaking my earlier question: I understand that mixing implicit and explicit layers may not increase expressivity, but could it change inductive biases?

Also, I'm potentially mangling neuroscience here, but my understanding is that the brain contains both feedforward-ish and feedback-loop components, i.e. structure in both space and time. It seems reasonable to suspect that an "optimal architecture" would combine both elements as well. Would you be willing to speculate on that? :)

Preetum Nakkiran

02:27:52

thanks!

Benjamin Edelman

02:28:09

Might DEQs be advantageous in terms of conduciveness to theory? I.e., might it be easier, or at least involve different tools, to reason about some properties of DEQs vs. deep networks?

Andrew Ross

02:31:17

Thanks :)