EECS-affiliated team breaks record for fastest deep learning training

Researchers used Stampede2 to train a deep neural network on the ImageNet-1k benchmark set in minutes. Credit: Sean Cunningham, Texas Advanced Computing Center

Grad student Yang You, Prof. James Demmel, and Prof. Kurt Keutzer, along with Prof. Cho-Jui Hsieh of UC Davis and Dr. Zhao Zhang of the Texas Advanced Computing Center (TACC), working in collaboration with researchers at NVIDIA, have created a new algorithm that harnesses the power of supercomputers to train a deep neural network (DNN) for image recognition at record speed. Deep learning researchers currently design new models by trial and error, which requires them to run the training process tens or even hundreds of times for each model, so faster training directly accelerates the pace of research.

Using 1,024 Skylake processors on the Stampede2 supercomputer at TACC, the team completed a 100-epoch ImageNet training of AlexNet in 11 minutes, the fastest time recorded to date. Using 1,600 Skylake processors, they also bested Facebook's prior result by finishing a 90-epoch ImageNet training of ResNet-50 in 32 minutes, and for batch sizes above 20,000 their accuracy was substantially higher than Facebook's.

The group's breakthrough is the Layer-Wise Adaptive Rate Scaling (LARS) algorithm, developed jointly with NVIDIA, which distributes data efficiently across many processors so they can compute simultaneously with a larger-than-ever batch size of up to 32,000 images. The findings show an alternative to the trend of using specialized hardware for deep learning, whether GPUs, Tensor Processing Units (TPUs), FPGAs, or other emerging architectures. The team wrote their code on top of Caffe, using Intel-Caffe for its multi-node training support. The results are published on arXiv.
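
The core idea behind LARS is compact: each layer gets its own learning rate, scaled by the ratio of that layer's weight norm to its gradient norm, which keeps update sizes sensible even at very large batch sizes. The NumPy sketch below illustrates that idea only; the hyperparameter names and values (trust_coeff, weight_decay, the momentum wrapper) are illustrative assumptions, not the team's actual Intel-Caffe implementation.

```python
import numpy as np

def lars_local_lr(weights, grads, trust_coeff=0.001, weight_decay=0.0005, eps=1e-9):
    """Layer-wise local learning rate in the spirit of LARS.

    The rate is proportional to the ratio of the layer's weight norm to its
    gradient norm, so layers whose gradients are small relative to their
    weights still take reasonably sized steps under very large batches.
    """
    w_norm = np.linalg.norm(weights)
    g_norm = np.linalg.norm(grads)
    return trust_coeff * w_norm / (g_norm + weight_decay * w_norm + eps)

def lars_sgd_step(weights, grads, velocity, global_lr=0.01,
                  momentum=0.9, weight_decay=0.0005):
    """One SGD-with-momentum step whose per-layer rate is scaled by LARS."""
    local_lr = lars_local_lr(weights, grads, weight_decay=weight_decay)
    update = grads + weight_decay * weights              # L2-regularized gradient
    velocity = momentum * velocity + global_lr * local_lr * update
    return weights - velocity, velocity

# Toy usage: one layer's weight matrix and a (hypothetical) gradient for it.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 128))
g = rng.normal(size=(256, 128)) * 1e-3
v = np.zeros_like(w)
w, v = lars_sgd_step(w, g, v)
```

In a distributed run of the kind described above, each worker would compute gradients on its shard of the large batch, the gradients would be averaged across processors, and a step like the one sketched here would then be applied layer by layer.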