
PeRSonAl tutorial @ ISCA 2020 - Chat log
Udit Gupta
02:31:52
If you haven’t already, please add your name to this signup sheet: https://docs.google.com/spreadsheets/d/1eRO6tZUgUpay_Y-0Tx5rRkQdbYA2yP839kCI4C1uVX8/edit#gid=0
Udit Gupta
03:18:26
Please post questions for Yves in the chat here!
Sam Sandberg
03:18:41
🙌
David Jones
03:19:10
That was fascinating, thank you! Will the slides be shared?
SEAN KOHLER
03:19:46
Question for Yves: I am curious whether these new approaches require a new computing strategy for Netflix, i.e. do they necessitate GPUs or other accelerators?
gabrielbenedict
03:20:08
for the slides: https://drive.google.com/drive/folders/13kPiJnVx8C1kuRQ-iXYlE_1HqWrR1hSK
Boris Grot
03:21:08
Awesome talk, thanks! What exactly does “continuous time” mean — are we talking about fine-grained discretization? If so, what’s the granularity - minutes? Seconds?
Lina Weichbrodt
03:21:14
Can you say more about the different approaches to modelling time?
Kristen Bystrom
03:21:52
We have seen a lot of buzz about platforms that use collaborative filtering recommendations, like TikTok, and how they can promote underlying bias based on race, gender, etc. Does the algorithm that Netflix uses avoid that kind of bias?
Jan VB
03:22:54
Thanks for the talk Yves! Do you think in general recommendations should have a causal effect? Why exactly?
adith
03:23:00
How do recommendation systems work in multi-user settings? For example, say I share my Netflix account (same profile) with my brother, who has very different tastes than me. Can the recommender system understand this and make separate recommendations? Or is this a harder problem?
gabrielbenedict
03:23:30
Thanks for the talk! Let’s say I have trouble running my bandits model online (as happens too often in industry practice); how much of a downside is it to run my bandits model offline? The drop seems pretty big.
eggie5
03:23:32
does Netflix see a correlation between online metrics and the IPS estimates?
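For readers skimming the log: IPS here refers to inverse propensity scoring, the standard off-policy estimator for predicting how a new policy would perform from logged interactions. As background (not stated in the chat), for a target policy π evaluated on n logged tuples (x_i, a_i, r_i) collected under the logging policy μ, the estimator is:

```latex
\hat{V}_{\mathrm{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)} \, r_i
```

The question above is essentially asking how well this offline estimate tracks the metrics observed in online experiments.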
Udit Gupta
03:27:23
Regarding access to the materials presented: we are recording the Zoom meeting and hope to post the video after the tutorial. Recorded talks for the papers presented in the Q&A are already available on the tutorial website (https://personal-tutorial.com/personal-at-isca-2020/).
yraimond
03:28:31
(sorry I didn't get through all the questions -- feel free to ping me at yraimond@netflix.com after the workshop :) )
Lina Weichbrodt
03:31:40
Thanks for the talk, Yves!
megatropics
03:40:43
Thanks Yves. It was an informative talk
gabrielbenedict
03:43:37
Thanks for the talk! What would be the takeaways (if any) from your research for someone using classical cloud GPU servers [AWS / Azure / GCP / …]?
David
03:43:40
Great talk!
Dmitry Mironov
03:43:43
How many threads submitting CUDA kernels did you end up with?
Socrates Wong
03:43:51
Great talk!
Xiaoyu Ma
03:44:00
Do you have any insight on the growth rate of the embedding table size?
Mr.G
03:44:02
Super cool! :)
David
03:44:49
Can you split an embedding table across multiple GPUs?
Gennady Pekhimenko
03:45:21
Great talk! My question: did you come up with some general technique to automatically "fuse" or "concatenate" kernels? Merging two kernels can cause interference in the general case.
David
03:45:36
E.g., you have 8x32GB GPU memory, would you keep a 60GB embedding table in GPU memory or put in CPU memory?
baharasgari
03:45:44
Thanks, Bilge! Can you please elaborate more on the last optimization, about where to perform the concatenation? What are the pros and cons of each option?
Jiashen Cao
03:46:59
Follow-up on this question: I don’t quite get how the concatenation in the last optimization optimizes the GPU memory usage. Concatenation does not change the data size, right?
Surya Narayanan
03:48:40
Would using AMD EPYC-like CPUs (each with 128 threads) on a dual-socket system be more helpful for recommendation systems than a GPU?
Dhiraj Kalamkar
03:48:43
What percentage of Facebook models fit on a single GPU?
Gennady Pekhimenko
03:50:39
Also, for the system(s) that don't have NVLink and are limited by GPU-to-GPU bandwidth in distributed training, did you consider "aggregating" multiple batch updates? We did such an optimization when training BERT-large on a non-DGX machine (4x T4 GPUs per machine) and it helped a lot in reducing communication overhead without hurting model accuracy/convergence.
Dmitry Mironov
03:53:32
@Gennady, there is also a trick to introduce a lag in the weight updates and use the gradients from step n to update the weights at step (n+1). It can help perform communication and computation simultaneously.
David
03:54:38
@Dmitry, is that sort of pipelining commonly used? It seems like using stale weights could have some accuracy challenges (potentially).
Gennady Pekhimenko
03:56:15
@Dmitry: yes, it can be a good technique to improve performance/training throughput, but it might have an effect on the speed of convergence. Although I would expect it to work in many cases without a problem.
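The two tricks discussed in this exchange, aggregating several micro-batch gradients into one synchronized update and applying gradients with a one-step delay, can be sketched in plain PyTorch. The snippet below is a hypothetical single-process illustration only (the model, data, and hyperparameters are invented); in a real distributed run the first trick would be paired with DistributedDataParallel's no_sync() so the all-reduce happens once per accumulation window, and the second would overlap step n's gradient communication with step n+1's computation.

```python
import torch
import torch.nn as nn

# Toy stand-ins; the real setup (e.g. BERT-large on 4x T4 GPUs) is not shown in the chat.
model = nn.Linear(1024, 1024)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
loader = [(torch.randn(32, 1024), torch.randn(32, 1024)) for _ in range(16)]

# (1) Gradient accumulation: sum gradients over ACCUM_STEPS micro-batches and apply
# (and, in a distributed run, communicate) them as a single update.
ACCUM_STEPS = 4
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / ACCUM_STEPS   # scale so the summed gradient matches one large batch
    loss.backward()                             # gradients accumulate in p.grad
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                        # one (synchronized) update per ACCUM_STEPS micro-batches
        optimizer.zero_grad()

# (2) One-step-delayed update: the gradients computed at step n are applied at step n+1,
# which in a distributed run lets their communication overlap with step n+1's computation.
# A single process can only show the data dependence, not the actual overlap.
prev_grads = None
for x, y in loader:
    loss_fn(model(x), y).backward()
    new_grads = [p.grad.detach().clone() for p in model.parameters()]
    optimizer.zero_grad()
    if prev_grads is not None:
        for p, g in zip(model.parameters(), prev_grads):
            p.grad = g                          # stale gradients from the previous step
        optimizer.step()
        optimizer.zero_grad()
    prev_grads = new_grads
```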
Xiaoyu Ma
04:05:01
What bandwidth do you need for the SSD to sustain the GPU compute?
Dmitry Mironov
04:05:56
@David I've seen that actually used in Megatron-LM training, because the goal there was to optimize large-scale training to the limits.
Dmitry Mironov
04:10:15
@Weijie, thank you! How does the accuracy compare with systems that use less dense input features, say, BERT?
Dmitry Mironov
04:13:06
Thanks!
pchoganwala
04:17:57
how do you compare two embeddings of different sizes?
Paul & Kirsten
04:18:16
@Tony Do you have any insight as to how the latent space behaves when the dimensions are different?
pchoganwala
04:20:17
Thanks @Tony
Samuel Hsia
04:21:39
@Tony We see that for training, using this mixed-embeddings approach gives us gains in training time; how do you think this method will affect inference? Will there be similar performance benefits?
Meghan
04:22:11
Have you compared this method to hashing only unpopular items using a similar frequency cap and parameter budget?
Samuel Hsia
04:23:12
Thanks @Tony!
Youngeun Kwon
04:35:29
@Udit Thanks!
gabrielbenedict
04:43:24
how do you pick m?
Meghan
04:43:36
I have a question I’d love to ask over mic
Samuel Hsia
04:43:41
Sure
Samuel Hsia
04:43:51
Gabriel then Meghan
Meghan
04:44:20
Thanks Sam
Xing Wang
04:46:16
I am wondering, does it affect the quality? Does it affect convergence?
Bilge Acun
04:48:22
Hi everyone, I wanted to follow up with some of the questions I missed earlier. I unfortunately cannot answer the questions regarding the growth rate of the embedding sizes and the ratio of the embeddings that fit on GPUs at Facebook. In the models I showed in my talk, we do distribute different tables across GPUs, but we don’t split them.

Regarding the questions about the last optimization I talked about:

@Gennady Pekhimenko We did employ kernel fusion for some operators, but we did not use an automated way to do this. There is some great work on kernel fusion using PyTorch JIT too.

@Jiashen Cao Fusing the concats on the GPU helped with reducing the data transfer overhead; it does not optimize the memory usage.

Thanks everyone! Feel free to email me at acun@fb.com if you have further questions.
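On the PyTorch JIT pointer above: a minimal, hypothetical example of what that kind of fusion looks like (the operator chain is invented and is not one of the production kernels) is the following. TorchScript's fuser can compile a chain of elementwise ops into a single GPU kernel, which avoids per-op launch overhead without hand-writing a fused kernel.

```python
import torch

@torch.jit.script
def fused_pointwise(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # TorchScript's fuser can compile this chain of elementwise ops into a single
    # GPU kernel, saving kernel-launch overhead and intermediate memory traffic.
    return torch.sigmoid(x * y + 1.0) * torch.tanh(y)

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1024, 1024, device=device)
b = torch.randn_like(a)
out = fused_pointwise(a, b)
```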
Carole-Jean Wu
04:49:04
Udit and I would like to thank all the speakers, student session chairs, presenters, and all attendees!
Carole-Jean Wu
04:49:18
Please stay in touch! We will follow up with the recorded talks and materials
Carole-Jean Wu
04:49:22
Stay in touch!
Carole-Jean Wu
04:49:30
Here -- https://docs.google.com/spreadsheets/d/1eRO6tZUgUpay_Y-0Tx5rRkQdbYA2yP839kCI4C1uVX8/edit#gid=0
Meghan
04:49:36
This was excellent, thanks so much!
megatropics
04:49:45
Thanks to all the organisers, speakers, and moderators!!
Carole-Jean Wu
04:49:47
We will share all future events with you!
saltyphish
04:49:51
Thank you!
Kristen Bystrom
04:49:56
Thank you!
Bilge Acun
04:50:13
Thank you everyone!
Gabriel Moreira
04:50:21
Thank you for this great content, made available freely online!
Dmitry Mironov
04:50:26
Thank you, everyone!
Hao-Jun Shi
04:50:53
Thank you!
Dhiraj Kalamkar
04:51:06
Thank you!
Malathi
04:51:26
Thanks for the nice session, very useful to me.