On the Efficacy of GPU-Integrated MPI for Scientific Applications


Ashwin M. Aji, Lokendra S. Panwar, Wu-chun Feng (Virginia Tech); Pavan Balaji, James Dinan, Rajeev Thakur (Argonne National Lab.); Feng Ji, Xiaosong Ma (N. C. State University); Milind Chabbi, Karthik Murthy, John Mellor-Crummey (Rice University); Keith R. Bisset (Virginia Bioinformatics Inst.)

Presenter: Ashwin M. Aji, Ph.D. Candidate, Virginia Tech

Overview
- We describe the hybrid design and optimization strategies of today's MPI+GPU applications.
- We show how GPU-integrated MPI helps to expand this design and optimization space.
- We evaluate and discuss the tradeoffs of the new designs and optimizations using real application case studies from the epidemiology and seismology domains, on HokieSpeed, a 212-TFlop CPU-GPU cluster at Virginia Tech.

Accelerator-Based Supercomputers
[Figure: accelerator-based system share in the Top500 (out of 500 systems) over time]
- Accelerator vendors/models: NVIDIA Tesla 2050/2070/2090/K20x, Intel Xeon Phi (MIC), IBM PowerXCell 8i, AMD FirePro/Radeon

Programming HPC Systems
- CPU-only mantra: overlap CPU-CPU (MPI) communication with CPU computation.
- With GPUs, the mantras multiply:
  o Overlap CPU-CPU communication with CPU computation and GPU computation.
  o Overlap CPU computation with GPU computation.
  o Overlap CPU-CPU communication with CPU-GPU (PCIe) communication.
  o In general: overlap xPU-xPU communication with xPU computations.
- A minimal sketch of the basic overlap pattern follows this list.
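For concreteness, here is a minimal sketch of the first mantra: overlapping MPI communication with CPU computation through nonblocking calls. It is not from the slides; the buffer size, tag, and compute() routine are illustrative placeholders.

    /* Sketch: overlap MPI communication with CPU computation using
     * nonblocking calls. N, the tag, and compute() are placeholders. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1048576

    static void compute(double *buf, int n) {
        for (int i = 0; i < n; i++) buf[i] *= 2.0;   /* stand-in for real work */
    }

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *halo  = (double *)calloc(N, sizeof(double));
        double *local = (double *)calloc(N, sizeof(double));
        MPI_Request req;

        if (rank == 0) {
            /* Start the receive early ... */
            MPI_Irecv(halo, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            /* ... and compute on independent data while it is in flight. */
            compute(local, N);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Send(local, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        free(halo); free(local);
        MPI_Finalize();
        return 0;
    }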

What is GPU-integrated MPI?
[Diagram: two nodes connected by a network; each node has a CPU with main memory and a GPU with device memory attached over PCIe]
Without it, the programmer stages GPU data through host memory by hand:

    if (rank == 0) {                 /* MPI rank = 0 */
        GPUMemcpy(host_buf, dev_buf, D2H);
        MPI_Send(host_buf, ...);
    }
    if (rank == 1) {                 /* MPI rank = 1 */
        MPI_Recv(host_buf, ...);
        GPUMemcpy(dev_buf, host_buf, H2D);
    }

To recover performance, the programmer pipelines the device-to-host copies with the sends by hand, chunk by chunk:

    int processed[chunks] = {0};
    for (j = 0; j < chunks; j++)
        cudaMemcpyAsync(host_buf + offset, dev_buf + offset, ..., D2H, streams[j]);  /* offset selects chunk j */
    numProcessed = 0; j = 0; flag = 1;
    while (numProcessed < chunks) {
        if (cudaStreamQuery(streams[j]) == cudaSuccess) {   /* chunk j copied to host? */
            MPI_Isend(host_buf + offset, ...);              /* start MPI */
            numProcessed++; processed[j] = 1; flag = 1;
        }
        MPI_Testany(...);                                   /* check progress */
        if (numProcessed < chunks)                          /* advance to next unprocessed chunk */
            while (flag) { j = (j + 1) % chunks; flag = processed[j]; }
    }
    MPI_Waitall(...);

Performance vs. programmability tradeoff: multiple hand-tuned variants like this are needed for different GPUs (AMD/Intel/NVIDIA), programming models (CUDA/OpenCL), and library versions (e.g., CUDA v3 vs. CUDA v4).
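As a point of reference, a self-contained version of the manual host-staging pattern (without the pipelining) might look like the sketch below. It assumes two ranks and standard MPI and CUDA runtime calls; NBYTES and the tag are illustrative choices, not values from the slides.

    /* Sketch: manual MPI+CUDA data movement, staged through host memory. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define NBYTES (1 << 20)

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *host_buf = (char *)malloc(NBYTES);
        char *dev_buf;
        cudaMalloc((void **)&dev_buf, NBYTES);

        if (rank == 0) {
            /* GPU -> host, then host -> network */
            cudaMemcpy(host_buf, dev_buf, NBYTES, cudaMemcpyDeviceToHost);
            MPI_Send(host_buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* network -> host, then host -> GPU */
            MPI_Recv(host_buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            cudaMemcpy(dev_buf, host_buf, NBYTES, cudaMemcpyHostToDevice);
        }

        cudaFree(dev_buf);
        free(host_buf);
        MPI_Finalize();
        return 0;
    }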

What is GPU-integrated MPI?
- Examples: MPI-ACC, MVAPICH, Open MPI.
- Programmability: one interface across multiple accelerators and programming models (CUDA, OpenCL).
- Performance: system-specific and vendor-specific optimizations live inside the library (pipelining, GPUDirect, pinned host memory, IOH affinity).
- With GPU-integrated MPI, the same code works regardless of where any_buf resides:

    if (rank == 0) { MPI_Send(any_buf, ...); }
    if (rank == 1) { MPI_Recv(any_buf, ...); }

- It performs great on benchmarks! How about in the context of complex applications?
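A minimal sketch of what this looks like in practice is shown below. It assumes a GPU-aware MPI build that accepts device pointers directly in MPI calls; implementations differ in how they identify GPU buffers (e.g., via MPI attributes or via automatic pointer detection), and the sketch assumes automatic detection, as in CUDA-aware MVAPICH2 or Open MPI. NBYTES and the tag are illustrative.

    /* Sketch: with a GPU-aware MPI, device pointers are passed straight
     * to MPI calls; the library handles staging and pipelining. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define NBYTES (1 << 20)

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *dev_buf;
        cudaMalloc((void **)&dev_buf, NBYTES);

        if (rank == 0) {
            MPI_Send(dev_buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(dev_buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        cudaFree(dev_buf);
        MPI_Finalize();
        return 0;
    }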

Using GPU-integrated MPI
[Diagram: processes on different nodes exchange data among CPUs and GPUs directly through MPI (GPU-MPI), instead of through explicit CPU staging]

A Programmer's View
- Provides a clean and natural interface to communicate with the intended communication target (CPU or GPU).
- All the GPU-MPI communication optimization details are hidden under the MPI layer.
- The computation model still remains the same: GPU kernels are offloaded from the CPU.

Using GPU-integrated MPI (2)
[Diagram: with plain MPI, data is preprocessed on the CPU before it is sent; with GPU-MPI, the preprocessing step itself can be placed on the GPU ahead of the GPU-GPU transfer]
- It enables the programmer to naturally consider even the GPU for preprocessing, which runs against the usual practice.
  o GPU preprocessing need not always be a good idea, but at least that option can easily be explored and evaluated.

Capabilities of GPU-integrated MPI
- Capabilities of MPI:
  o Simultaneously execute multiple processes.
  o Overlap communication with computation.
- Additional capabilities of GPU-integrated MPI:
  o Overlap CPU-CPU communication with CPU-GPU communication (internal pipelining).
  o Overlap xPU-xPU communication with CPU computation and GPU computation.
  o Provides a natural interface to move data and choose the CPU or the GPU for the next task at hand (see the sketch after this list).
  o Efficiently manages resources while providing high performance.
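As an illustration of the last point, the sketch below receives data into either a host or a device buffer depending on where the next computation will run. It assumes a GPU-aware MPI; next_task_on_gpu, process_on_gpu, and process_on_cpu are hypothetical placeholders, and NELEMS is illustrative.

    /* Sketch: choosing the communication target based on where the
     * next task runs. Receiver side only. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define NELEMS (1 << 20)

    __global__ void process_on_gpu(double *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0;                  /* stand-in for real work */
    }

    static void process_on_cpu(double *h, int n) {
        for (int i = 0; i < n; i++) h[i] *= 2.0;
    }

    void recv_and_process(int src, int next_task_on_gpu) {
        if (next_task_on_gpu) {
            double *dbuf;
            cudaMalloc((void **)&dbuf, NELEMS * sizeof(double));
            /* Receive straight into device memory and compute there. */
            MPI_Recv(dbuf, NELEMS, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            process_on_gpu<<<(NELEMS + 255) / 256, 256>>>(dbuf, NELEMS);
            cudaDeviceSynchronize();
            cudaFree(dbuf);
        } else {
            double *hbuf = (double *)malloc(NELEMS * sizeof(double));
            /* Receive into host memory and compute on the CPU. */
            MPI_Recv(hbuf, NELEMS, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            process_on_cpu(hbuf, NELEMS);
            free(hbuf);
        }
    }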

Outline
- What is GPU-integrated MPI?
- Programming CPU-GPU systems using simple MPI and GPU-integrated MPI
- Evaluating the efficacy of GPU-integrated MPI
- Conclusion

Evaluation and Discussion
- Case studies:
  o Epidemiology simulation (EpiSimdemics)
  o Seismology simulation (FDM-Seismology)
- Experimental setup:
  o HokieSpeed, a CPU-GPU cluster at Virginia Tech.
  o Used up to 128 nodes, where each node has two hex-core Intel Xeon E5645 CPUs and two NVIDIA Tesla M2050 GPUs.
  o CUDA toolkit v4.0 and driver v270.

EpiSimdemics: Infection Propagation Simulator
[Diagram: N person nodes and m location nodes; each processing element receives visit messages, preprocesses the data, and computes the person-to-person interactions (on the GPU) for its locations]

Case Study for GPU-MPI: Epidemiology
MPI+CUDA
[Diagram: PEi (host CPU) receives visit messages from the network, then (1) copies the data to GPUi (device) and (2) preprocesses it there]

    for (j = 0; j < chunks; j++)
        MPI_Irecv(host_buf + offset, ...);   /* offset selects chunk j */
    MPI_Waitall(...);
    gpuMemcpy(dev_buf, host_buf, H2D);
    gpuPreprocess(dev_buf);
    gpuDeviceSynchronize();

- gpuMemcpy is NOT an overhead here: it can be pipelined.
- gpuPreprocess is an overhead.
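A compilable approximation of this basic pattern (receiver side only) is sketched below; the chunk count, chunk size, and preprocess kernel are illustrative placeholders rather than the application's actual code.

    /* Sketch: basic MPI+CUDA receiver. All chunks are received into host
     * memory, then copied to the GPU in bulk and preprocessed there. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define NCHUNKS     16
    #define CHUNK_ELEMS (1 << 18)

    __global__ void preprocess(float *buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = buf[i] * 0.5f;       /* stand-in for real work */
    }

    void receive_then_preprocess(int src) {
        const int total = NCHUNKS * CHUNK_ELEMS;
        float *host_buf, *dev_buf;
        MPI_Request req[NCHUNKS];

        host_buf = (float *)malloc(total * sizeof(float));
        cudaMalloc((void **)&dev_buf, total * sizeof(float));

        /* Receive every chunk into the host buffer first ... */
        for (int j = 0; j < NCHUNKS; j++)
            MPI_Irecv(host_buf + j * CHUNK_ELEMS, CHUNK_ELEMS, MPI_FLOAT,
                      src, j, MPI_COMM_WORLD, &req[j]);
        MPI_Waitall(NCHUNKS, req, MPI_STATUSES_IGNORE);

        /* ... then copy everything to the GPU and preprocess it there.
         * The preprocessing is fully exposed on the critical path. */
        cudaMemcpy(dev_buf, host_buf, total * sizeof(float),
                   cudaMemcpyHostToDevice);
        preprocess<<<(total + 255) / 256, 256>>>(dev_buf, total);
        cudaDeviceSynchronize();

        cudaFree(dev_buf);
        free(host_buf);
    }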

Case Study for GPU-MPI: Epidemiology
MPI+CUDA (Advanced)
[Diagram: PEi (host CPU) and GPUi (device); (1a) pipelined data transfers to the GPU, (1b) preprocessing overlapped with internode CPU-GPU communication]

    allocate(host_chunks[N]); allocate(dev_chunks[N]);
    /* receives posted into host_chunks as before, then: */
    for (j = 0; j < N; j++) {
        /* Pipeline data transfers to the GPU with the preprocessing kernels */
        if (MPI_Testany(request[i], ...)) {            /* chunk i received? */
            gpuMemcpyAsync(dev_chunk[i], host_chunk[i], ..., &event[i]);
        }
        if (gpuEventQuery(event[k])) {                 /* chunk k on the GPU? */
            gpuPreprocess(dev_chunk[k]);
        }
    }
    gpuDeviceSynchronize();

Good:
- Overlaps ALL of the GPU data transfers and the preprocessing with the internode communication.
Bad:
- The pinned memory footprint increases with the number of processes (unless you write your own memory manager as well).
- The data movement pattern in EpiSimdemics is nondeterministic.
- Code can get verbose.
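A compilable approximation of this hand-rolled pipeline (receiver side only) is sketched below, with pinned host chunks, per-chunk streams and events, and an illustrative preprocess kernel; chunk counts and sizes are assumptions, and error checking is omitted for brevity.

    /* Sketch: hand-rolled pipeline of MPI receives, async H2D copies,
     * and preprocessing kernels. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define NCHUNKS     16
    #define CHUNK_ELEMS (1 << 18)

    __global__ void preprocess(float *chunk, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) chunk[i] = chunk[i] * 0.5f;   /* stand-in for real work */
    }

    void receive_and_preprocess(int src) {
        float *host_chunk[NCHUNKS], *dev_chunk[NCHUNKS];
        cudaStream_t stream[NCHUNKS];
        cudaEvent_t  copied[NCHUNKS];
        MPI_Request  req[NCHUNKS];
        int state[NCHUNKS] = {0};   /* 0: receiving, 1: copying, 2: kernel launched */

        for (int j = 0; j < NCHUNKS; j++) {
            cudaMallocHost((void **)&host_chunk[j], CHUNK_ELEMS * sizeof(float)); /* pinned */
            cudaMalloc((void **)&dev_chunk[j], CHUNK_ELEMS * sizeof(float));
            cudaStreamCreate(&stream[j]);
            cudaEventCreate(&copied[j]);
            MPI_Irecv(host_chunk[j], CHUNK_ELEMS, MPI_FLOAT, src, j,
                      MPI_COMM_WORLD, &req[j]);
        }

        int done = 0;
        while (done < NCHUNKS) {
            int idx, flag;
            /* As a receive completes, start that chunk's H2D copy on its own stream. */
            MPI_Testany(NCHUNKS, req, &idx, &flag, MPI_STATUS_IGNORE);
            if (flag && idx != MPI_UNDEFINED && state[idx] == 0) {
                cudaMemcpyAsync(dev_chunk[idx], host_chunk[idx],
                                CHUNK_ELEMS * sizeof(float),
                                cudaMemcpyHostToDevice, stream[idx]);
                cudaEventRecord(copied[idx], stream[idx]);
                state[idx] = 1;
            }
            /* As a copy completes, launch preprocessing on that chunk. */
            for (int k = 0; k < NCHUNKS; k++) {
                if (state[k] == 1 && cudaEventQuery(copied[k]) == cudaSuccess) {
                    preprocess<<<(CHUNK_ELEMS + 255) / 256, 256, 0, stream[k]>>>(
                        dev_chunk[k], CHUNK_ELEMS);
                    state[k] = 2;
                    done++;
                }
            }
        }
        cudaDeviceSynchronize();
    }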

Case Study for GPU-MPI: Epidemiology
GPU-MPI
[Diagram: PEi (host CPU) and GPUi (device); (1a) pipelined data transfers to the GPU, (1b) preprocessing overlapped with internode CPU-GPU communication, both handled inside the MPI library]

    allocate(dev_chunks[N]);
    /* receives posted directly into dev_chunks (GPU memory), then: */
    for (j = 0; j < N; j++) {
        /* Pipelined data transfers and preprocessing kernels */
        if (MPI_Testany(request[i], ...)) {            /* chunk i already on the GPU? */
            gpuPreprocess(dev_chunk[i]);
        }
    }
    gpuDeviceSynchronize();

- The pinned memory that is used for staging has a constant footprint.
  o The memory manager is implemented within GPU-MPI; it is created during MPI_Init and destroyed at MPI_Finalize.
- The code is almost like a CPU-only MPI program!
  o A more natural way of expressing the computational and/or communication target (see the sketch below).
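Under a GPU-aware MPI, the receiver side might then reduce to something like the following sketch, with receives posted directly on device buffers; the same illustrative chunk sizes and preprocess kernel as in the previous sketch are assumed.

    /* Sketch: the same pipeline with a GPU-aware MPI. Staging and
     * pipelining happen inside the library. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define NCHUNKS     16
    #define CHUNK_ELEMS (1 << 18)

    __global__ void preprocess(float *chunk, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) chunk[i] = chunk[i] * 0.5f;   /* stand-in for real work */
    }

    void receive_and_preprocess(int src) {
        float *dev_chunk[NCHUNKS];
        MPI_Request req[NCHUNKS];

        for (int j = 0; j < NCHUNKS; j++) {
            cudaMalloc((void **)&dev_chunk[j], CHUNK_ELEMS * sizeof(float));
            /* Post the receive directly on the device buffer. */
            MPI_Irecv(dev_chunk[j], CHUNK_ELEMS, MPI_FLOAT, src, j,
                      MPI_COMM_WORLD, &req[j]);
        }

        /* As each chunk lands in GPU memory, preprocess it right away. */
        for (int done = 0; done < NCHUNKS; done++) {
            int idx;
            MPI_Waitany(NCHUNKS, req, &idx, MPI_STATUS_IGNORE);
            preprocess<<<(CHUNK_ELEMS + 255) / 256, 256>>>(dev_chunk[idx],
                                                           CHUNK_ELEMS);
        }
        cudaDeviceSynchronize();
    }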

Case Study for GPU-MPI: Epidemiology
[Diagram comparison: MPI+CUDA shows (1) copy to GPU and (2) preprocess on GPU as exposed preprocessing overhead; MPI+CUDA Adv. and GPU-MPI show (1a) pipelined data transfers to the GPU and (1b) preprocessing overlapped with internode CPU-GPU communication, with memory management as the remaining question]

EpiSimdemics: Experimental Results
[Figure: execution time in seconds, split into the ComputeVisits and ComputeInteractions phases, for MPI+CUDA, MPI+CUDA Adv., and GPU-MPI across node counts]

- GPU-MPI is 6.3% better than MPI+CUDA on average, and 14.6% better at smaller node counts.
- GPU-MPI is 24.2% better than MPI+CUDA Adv. on average, and 61.6% better at larger node counts.

Cost of GPU Preprocessing (MPI+CUDA vs. GPU-MPI)
[Figure: GPU preprocessing time in seconds across node counts]
- With GPU-MPI, preprocessing on the GPU is completely hidden.
- The scope for this optimization lessens at higher node counts because of strong scaling.

Cost of Pinned Memory Management (MPI+CUDA Adv. vs. GPU-MPI)
[Figure: buffer initialization time in seconds across node counts]

Case Study for GPU-MPI: FDM-Seismology
[Diagram: FDM-Seismology application workflow with GPU-MPI]

FDM-Seismology: Experimental Results
[Figure: execution time breakdown; ~43% overall improvement]
- As the number of nodes increases (strong scaling), the cudaMemcpy time decreases, so the benefit of hiding it also decreases.
- The xPU-xPU MPI communication becomes significant.

Outline
- What is GPU-integrated MPI?
- Programming CPU-GPU systems using simple MPI and GPU-integrated MPI
- Evaluating the efficacy of GPU-integrated MPI
- Conclusion

Efficacy of GPU-integrated MPI
- Natural interface to express communication/computation targets.
- Overlapped xPU-xPU communication with xPU computation for better resource utilization.
- Efficient buffer management techniques for improved scalability.
- Convenient for programmers to explore and evaluate newer design/optimization spaces:
  o GPU-driven data partitioning (read the paper)
  o GPU-driven data marshaling

Questions?
Contact: Ashwin Aji ([email protected]), Prof. Wu Feng ([email protected])
http://synergy.cs.vt.edu
