May 22, 2020. Virtually over BlueJeans
All times listed are Eastern Daylight Time (EDT)
09:45 - 10:00 Welcome by the organizers
10:00 - 10:40 Keynote/Invited Talk by Dr. Manish Gupta, Google Research, India.
"A Stitch in Time": A Grand Challenge for Distributed Machine Learning
Abstract: Transformation of healthcare represents one of the biggest opportunities for machine learning to impact human lives. We argue that this requires a consumer-centric and proactive approach to healthcare, which in turn would require machine learning (both training and inference) in a distributed manner at massive scale. With the help of a few examples, such as screening for diabetic retinopathy, screening for breast cancer, and a proactive approach to prevention of cardiovascular disease, we describe some promising approaches as well as outstanding challenges.
Bio: Dr. Manish Gupta is the Director of Google Research India, a new AI research lab recently announced by Google. He holds an additional appointment as Infosys Foundation Chair Professor at IIIT Bangalore. Previously, Manish led VideoKen, a video technology startup, and the research centers for Xerox and IBM in India. As a Senior Manager at the IBM T.J. Watson Research Center in Yorktown Heights, New York, Manish led the team developing system software for the Blue Gene/L supercomputer. IBM was awarded a National Medal of Technology and Innovation for Blue Gene by US President Barack Obama in 2009. Manish holds a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign. He has co-authored about 75 papers, with more than 7,000 citations in Google Scholar (and an h-index of 45), and has been granted 19 US patents. While at IBM, Manish received two Outstanding Technical Achievement Awards, an Outstanding Innovation Award and the Lou Gerstner Team Award for Client Excellence. Manish is a Fellow of ACM and the Indian National Academy of Engineering, and a recipient of a Distinguished Alumnus Award from IIT Delhi.
10:40 - 11:00 Accelerating Towards Larger Deep Learning Models and Datasets – A System Platform View Point
Authors: Saritha Vinod, Naveen M, Anto Ajay Raj John and Asis K Patra, IBM India Systems Development Lab, India
Abstract: Deep Learning (DL) is a rapidly evolving field under the umbrella of Artificial Intelligence (AI) with proven real-world use cases in supervised and unsupervised learning tasks. As the complexity of the learning tasks increases, the DL models become deeper or wider with millions of parameters and use larger datasets. Neural networks like AmoebaNet with 557M parameters and GPT-2 with 1.5 billion parameters are some of the recent examples of large models. DL training is generally run on accelerated hardware such as GPUs, TPUs or FPGAs, which can satisfy the high computational demands of neural network training. But accelerators are limited in their memory capacities. The larger the model, the more memory is required to train it. Hence, large DL models and large datasets cannot fit into the limited memory available on GPUs. However, there are techniques designed to overcome this limitation, such as compression, using CPU memory as a data swap, and recomputation within the GPUs. But the efficiency of each of these techniques also depends on the capabilities of the underlying system platform. In this paper we present the observations from our study of training large DL models using the data swap method on different system platforms. This study showcases the characteristics of large models and presents the system viewpoint of large deep learning model training by studying the relation of the software techniques to the underlying system platform. The results presented are based on two DL models, the 3DUnetCNN model for medical image segmentation and the DeepLabV3+ model for semantic image segmentation.
11:00 - 11:40 Keynote/Invited Talk by Prof. Geoffrey Fox, Indiana University, USA.
High-Performance Computing: From Deep Learning to Data Engineering
Abstract: We describe how High-Performance Computing (HPC) can be used to enhance Big Data and Machine Learning (ML) systems (HPC for ML) but also how machine learning can be used to enhance system execution (ML for HPC). We review the different aspects of data engineering needed to process large-scale data and how they are implemented in the Twister2 system combined with PyTorch and TensorFlow. We discuss the different forms of parallelism seen in deep learning, with a focus on pipelined parallelism over layers.
Bio: Geoffrey Fox received a Ph.D. in Theoretical Physics from Cambridge University and is now distinguished professor of Informatics and Computing, and Physics at Indiana University where he is director of the Digital Science Center, Chair of Department of Intelligent Systems Engineering and Director of the Data Science program at the School of Informatics, Computing, and Engineering.
He previously held positions at Caltech, Syracuse University and Florida State University after being a postdoc at the Institute of Advanced Study at Princeton, Lawrence Berkeley Laboratory and Peterhouse College Cambridge.
He has supervised the Ph.D. of 68 students and published around 1200 papers in physics and computer science with an h-index of 70 and over 26000 citations.
He currently works in applying computer science from infrastructure to analytics in Biology, Pathology, Sensor Clouds, Earthquake and Ice-sheet Science, Image processing, Deep Learning, Manufacturing, Network Science and Particle Physics. The infrastructure work is built around Software Defined Systems on Clouds and Clusters. The analytics focuses on scalable parallelism.
He is involved in several projects to enhance the capabilities of Minority Serving Institutions. He has experience in online education and its use in MOOCs for areas like Data and Computational Science.
He is a Fellow of APS (Physics) and ACM (Computing).
11:40 - 12:00 Efficient Training of Semantic Image Segmentation on Summit using Horovod and MVAPICH2-GDR
Authors: Quentin Anthony, Ammar Ahmad Awan, Arpan Jain, Hari Subramoni and Dhabaleswar K. Panda, Dept. of Computer Science and Engineering, The Ohio State University, USA
Abstract: Deep Learning (DL) models for semantic image segmentation are an emerging trend in response to the explosion of multi-class, high resolution image and video data. However, segmentation models are highly compute-intensive, and even the fastest Volta GPUs cannot train them in a reasonable time frame. In our experiments, we observed just 6.7 images/second on a single Volta GPU for training DeepLab-v3+ (DLv3+), a state-of-the-art Encoder-Decoder model for semantic image segmentation. For comparison, a Volta GPU can process 300 images/second for training ResNet-50, a state-of-the-art model for image classification. In this context, we see a clear opportunity to utilize supercomputers to speed up training of segmentation models. However, most published studies on the performance of novel DL models such as DLv3+ require the user to significantly change Horovod, MPI, and the DL model to improve performance. Our work proposes an alternative tuning method that achieves near-linear scaling without significant changes to Horovod, MPI, or the DL model. In this paper, we select DLv3+ as the candidate TensorFlow model and implement Horovod-based distributed training for DLv3+. We observed poor default scaling performance of DLv3+ on the Summit system at Oak Ridge National Laboratory. To address this, we conducted an in-depth performance tuning of various Horovod/MPI knobs to achieve better performance over the default parameters. We present a comprehensive scaling comparison for Horovod with MVAPICH2-GDR up to 132 GPUs on Summit. Our optimization approach achieves near-linear (92%) scaling with MVAPICH2-GDR. We achieved “mIOU” accuracy of 80.8% for distributed training, which is on par with published accuracy for this model. Further, we demonstrate an improvement in scaling efficiency by 23.9% over default Horovod training, which translates to a 1.3× speedup in training performance.
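To make the scaling figures in this abstract concrete, here is a minimal sketch of the arithmetic they imply. It assumes that scaling efficiency means achieved throughput divided by ideal linear throughput (single-GPU rate times GPU count), and that the 92% figure applies at 132 GPUs. Both are our assumptions for illustration, the function names are hypothetical, and the computed throughput is not a number reported by the authors.

```python
def ideal_throughput(single_gpu_rate, num_gpus):
    """Throughput under perfect linear scaling (images/second)."""
    return single_gpu_rate * num_gpus


def scaling_efficiency(measured_rate, single_gpu_rate, num_gpus):
    """Fraction of ideal linear throughput actually achieved."""
    return measured_rate / ideal_throughput(single_gpu_rate, num_gpus)


# Figures quoted in the abstract: 6.7 images/second on one GPU, 132 GPUs, 92% scaling.
single_rate = 6.7
gpus = 132
ideal = ideal_throughput(single_rate, gpus)   # about 884.4 images/second
implied = ideal * 0.92                        # about 813.6 images/second (derived, not reported)

print(f"ideal: {ideal:.1f} img/s, implied at 92%: {implied:.1f} img/s")
print(f"efficiency check: {scaling_efficiency(implied, single_rate, gpus):.2f}")
```

Under these assumptions, 92% scaling on 132 GPUs would correspond to roughly 814 images/second in aggregate.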
12:00 - 12:20 Break
12:20 - 13:00 Keynote/Invited Talk by Prof. Wen-mei Hwu of UIUC, USA.
Advancing Computing Infrastructure for Very Large-Scale Deep Learning at C3SR
Abstract: DL for medical, financial, and enterprise applications requires new innovations in several areas of computing infrastructure. First, privacy-preserving multi-party computing (MPC) infrastructures are needed for medical centers, financial institutions, and enterprises to participate in training large-scale models without exposing the data collected from their customers and operations. Second, compression and sparsification techniques are needed to reduce the bandwidth consumption in distributed training. Third, memory and storage architectures of GPUs need to keep up with the fast-growing computing power of GPUs. Finally, as the models grow, model loading overhead can become the dominating component of inference latency. In this talk, I will present the research efforts at the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR) that aim to provide the advancement needed in all these areas.
Bio: Wen-mei W. Hwu is a Professor and holds the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. His research interests are in the area of architecture, algorithms, and programming tools for parallel computer systems (www.crhc.uiuc.edu/Impact). He is the chairman of the IEEE Computer Society Technical Committee on Microarchitecture (TCuARCH) and the chairman of the IEEE Computer Society Research Advisory Board (RAB). He co-directs the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR) and served as one of the principal investigators of the NSF Blue Waters Petascale supercomputer. He has published more than 200 technical papers and co-authored with David Kirk a popular textbook entitled “Programming Massively Parallel Processors – a hands-on approach.” His publications have received more than 23,000 citations with an h-index of 71. For his contributions, he received the ACM SigArch Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the IEEE Computer Society Charles Babbage Award, the ACM/IEEE ISCA Influential Paper Award, the ACM/IEEE MICRO Test-of-Time Award, the ACM/IEEE CGO Test-of-Time Award, the IEEE Computer Society B. R. Rau Award and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a fellow of IEEE and ACM. Dr. Hwu received his Ph.D. degree in Computer Science from the University of California, Berkeley.
Authors: Naw Safrin Sattar and Shaikh Arifuzzaman, Department of Computer Science, University of New Orleans, USA
Abstract: Sparse Deep Neural Networks (DNNs) are an emerging research area since deploying deep neural networks with limited resources is very challenging. In this work, we provide a scalable solution to the Sparse DNN Challenge, a challenge posed by MIT/IEEE/Amazon GraphChallenge.org, by designing data parallelism on GPUs. We provide a solution based on Python TensorFlow as it is a widely used tool in different scientific applications for deep learning. We use the datasets provided by GraphChallenge, derived from the MNIST handwritten digits. We use the synthetic DNNs from RadiX-Net with varying numbers of neurons and layers. We develop a data-parallel implementation of Sparse DNN using TensorFlow on GPU. Our solution shows up to 4.7x speedup over the baseline serial MATLAB implementation given in GraphChallenge. In addition, our TensorFlow GPU implementation demonstrates a 3-fold speedup over our TensorFlow CPU implementation.
13:20 - 14:00 Keynote/Invited Talk by Dr. Minsik Cho of IBM, USA.
Scalable Deep Learning Inference: Algorithmic Approach
Abstract: Large-scale deep learning training has made significant progress in the last few years: more powerful systems/accelerators are delivered (e.g., the Summit cluster), innovative training mechanisms are designed (e.g., sophisticated hyper-parameter tuning), and advanced communication techniques are exercised (e.g., async-SGD). However, deep learning inference has rather limited options when it comes to scaling up the model density per device. Quantization to lower precision can be helpful, along with sparsification such as pruning and compression, yet it is limited by the underlying hardware architecture and its efficacy.
In this talk, I would like to highlight some of the early efforts to scale up inference from an algorithmic perspective, specifically based on compute sharing and reuse. Compute sharing across multiple models enables a highly dense inference platform by fundamentally reducing the resource demand, and compute reuse based on sensory inputs further improves the performance of an individual model.
Authors: Pankaj Rajak (Argonne Leadership Computing Facility, Argonne National Laboratory, USA), Kuang Liu, Aravind Krishnamoorthy, Rajiv Kalia, Aiichiro Nakano, Ken-Ichi Nomura, Subodh Tiwari and Priya Vashishta (Collaboratory for Advanced Computing and Simulations, University of Southern California, USA)
Abstract: Neural network molecular dynamics (NNMD) simulations could revolutionize atomistic modeling of materials with quantum-mechanical accuracy at a fraction of the computational cost. However, popular NNMD frameworks are generally implemented for a single computing node, and conventional energy-based NN models still suffer from large time-to-solution (T2S), prohibiting the application of NNMD to challenging materials simulations encompassing large spatiotemporal scales. Consequently, no leadership-scale NNMD simulation has thus far been reported. Here, we present a scalable parallel NNMD software (RXMD-NN) based on our scalable reactive molecular dynamics simulation engine named RXMD. RXMD-NN has achieved high scalability up to 786,432 IBM BlueGene/Q cores involving 1.7 billion atoms. Furthermore, we have achieved a 4.6-fold reduction of T2S by using a novel network that directly predicts atomic forces from feature vectors. Reduced T2S has for the first time allowed the study of large-scale off-stoichiometry effect in a widely used phase change material, Ge2Se2Te5, thereby resolving its “first-sharp diffraction peak mystery”.
Authors: Florent Lopez (Innovative Computing Laboratory (ICL), University of Tennessee, USA), Edmond Chow (College of Computing, Georgia Institute of Technology, USA), Stanimire Tomov (Innovative Computing Laboratory (ICL), University of Tennessee, USA) and Jack Dongarra (Innovative Computing Laboratory (ICL), University of Tennessee, USA)
Abstract: We present a parallel asynchronous Stochastic Gradient Descent algorithm for shared memory architectures. Different from previous asynchronous algorithms, we consider the case where the gradient updates are not particularly sparse. In the context of the MagmaDNN framework, we compare the parallel efficiency of the asynchronous implementation with that of the traditional synchronous implementation. Tests are performed for training deep neural networks on multicore CPUs and GPU devices.
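For readers unfamiliar with the idea, the following is a minimal sketch of asynchronous SGD on shared memory. It is not the authors' algorithm and does not use MagmaDNN: it is a generic lock-free scheme in which several Python threads update one shared parameter vector without synchronization, applied to a toy linear fit. The function `async_sgd`, its parameters, and the toy data are all hypothetical. The update is dense in the sense that every step modifies every parameter, which is the regime the abstract refers to.

```python
import random
import threading


def async_sgd(data, num_workers=4, steps_per_worker=5000, lr=0.05, seed=0):
    """Fit y = w*x + b with lock-free asynchronous SGD (illustrative only)."""
    # Shared parameters: every worker reads and writes this list without
    # locking, so a gradient may be computed from slightly stale values.
    params = [0.0, 0.0]  # [w, b]

    def worker(worker_id):
        rng = random.Random(seed + worker_id)  # independent sample stream
        for _ in range(steps_per_worker):
            x, y = rng.choice(data)
            w, b = params[0], params[1]        # possibly stale read
            err = (w * x + b) - y
            # Dense update: both parameters change on every step.
            params[0] -= lr * err * x
            params[1] -= lr * err

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return params[0], params[1]


# Noiseless samples of y = 2x + 1 on [-1, 1].
samples = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b = async_sgd(samples)
print(f"w = {w:.3f}, b = {b:.3f}")
```

A synchronous variant would instead have the workers wait for one another and apply a combined gradient at each step. The asynchronous version removes that waiting, at the cost of occasionally stale or overwritten updates.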