Performance analysis and tuning for General Purpose Graphics Processing Units (GPGPU)


Detailed description

Bibliographic details
Other authors: Kim, Hyesoon (contributor), Vuduc, Richard (contributor), Baghsorkhi, Sara (contributor)
Language: English
Published: [S.l.]: Morgan & Claypool, 2012
Series: Synthesis Lectures on Computer Architecture, 20
Online access: Table of contents
Description
Summary: Acknowledgments; GPU Design, Programming, and Trends; A Brief History of GPU; A Brief Overview of a GPU System; An Overview of GPU Architecture; A GPGPU Programming Model: CUDA; Kernels; Thread Hierarchy in CUDA; Memory Hierarchy; SIMT Execution; CUDA Language Extensions; Vector Addition Example; PTX; Consistency Model and Special Memory Operations; IEEE Floating-Point Support; Execution Model of OpenCL; GPU Architecture; GPU Pipeline; Handling Branch Instructions; GPU Memory Systems; Other GPU Architectures; The Fermi Architecture; The AMD Architecture; Many Integrated Core Architecture; Combining CPUs and GPUs on the Same Die; Performance Principles; Theory: Algorithm Design Models Overview; Characterizing Parallelism: the Work-Depth Model; Characterizing I/O Behavior: the External Memory Model; Combined Analyses of Parallelism and I/O-Efficiency; Abstract and Concrete Measures; Summary; From Principles to Practice: Analysis and Tuning; The Computational Problem: Particle Interactions; An Optimal Approximation: the Fast Multipole Method; Designing a Parallel and I/O-Efficient Algorithm; A Baseline Implementation; Setting an Optimization Goal; Identifying Candidate Optimizations; Exploring the Optimization Space; Summary; Using Detailed Performance Analysis to Guide Optimization; Instruction-Level Analysis and Tuning; Execution Time Modeling; Applying the Model to FMM; Performance Optimization Guide; Other Performance Modeling Techniques and Tools; Limited Performance Visibility; Work Flow Graphs; Stochastic Memory Hierarchy Model; Roofline Model; Profiling and Performance Analysis of CUDA Workloads Using Ocelot; Other GPGPU Performance Modeling Techniques; Performance Analysis Tools for OpenCL; Bibliography
Authors' Biographies
1. GPU design, programming, and trends -- 1.1 A brief history of GPU -- 1.2 A brief overview of a GPU system -- 1.2.1 An overview of GPU architecture -- 1.3 A GPGPU programming model: CUDA -- 1.3.1 Kernels -- 1.3.2 Thread hierarchy in CUDA -- 1.3.3 Memory hierarchy -- 1.3.4 SIMT execution -- 1.3.5 CUDA language extensions -- 1.3.6 Vector addition example -- 1.3.7 PTX -- 1.3.8 Consistency model and special memory operations -- 1.3.9 IEEE floating-point support -- 1.3.10 Execution model of OpenCL -- 1.4 GPU architecture -- 1.4.1 GPU pipeline -- 1.4.2 Handling branch instructions -- 1.4.3 GPU memory systems -- 1.5 Other GPU architectures -- 1.5.1 The Fermi architecture -- 1.5.2 The AMD architecture -- 1.5.3 Many integrated core architecture -- 1.5.4 Combining CPUs and GPUs on the same die
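Section 1.3.6 of the outline above refers to the vector addition example, the canonical first CUDA kernel used to introduce the thread hierarchy. A minimal sketch of such a kernel (not the book's exact listing, just an illustration of the pattern) might look like:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Each thread computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: the last block may have more threads than elements
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host buffers.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device buffers and host-to-device copies.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 256 threads per block; enough blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // 1.0 + 2.0 = 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

The block/thread index arithmetic in `vecAdd` is the thread-hierarchy idiom the outline's sections 1.3.2 and 1.3.6 cover: a 1-D grid of 1-D blocks, with one global index per data element.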
General-purpose graphics processing units (GPGPU) have emerged as an important class of shared memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. 1-10), memory architectures that deliver higher peak memory bandwidth (hundreds of gigabytes per second vs. tens), and smaller caches/scratchpad memories (less than 1 megabyte vs. 1-10 megabytes). In this book, we provide a high-level overview of current GPGPU architectures and programming models. We review the principles used in previous shared memory parallel platforms, focusing on recent results in both the theory and practice of parallel algorithms, and suggest a connection to GPGPU platforms. We aim to give architects insight into how algorithmic characteristics map onto GPGPUs. We also provide detailed performance analysis and optimization guidance, from high-level algorithmic choices down to low-level instruction-level tuning. As a case study, we use an n-body particle simulation computed with the fast multipole method (FMM). We also briefly survey the state of the art in GPU performance analysis tools and techniques.
Physical description: XI, 82 pages; illustrations, graphs
ISBN: 1-60845-954-3
ISBN: 978-1-60845-954-4