GPU Acceleration of Numerical Weather Prediction - http://www.worldscientific.com/doi/abs/10.1142/S0129626408003557
Weather and climate prediction software has enjoyed the benefits of exponentially increasing processor power for almost 50 years. Even with the advent of large-scale parallelism in weather models, much of the performance increase has come from increasing processor speed rather than increased parallelism. This free ride is nearly over. Recent results also indicate that simply increasing the use of large-scale parallelism will prove ineffective for many scenarios where strong scaling is required. We present an alternative method of scaling model performance by exploiting emerging architectures using the fine-grain parallelism once used in vector machines. The paper shows the promise of this approach by demonstrating a nearly 10× speedup for a computationally intensive portion of the Weather Research and Forecast (WRF) model on a variety of NVIDIA Graphics Processing Units (GPUs). This change alone speeds up the whole weather model by 1.23×.
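The fine-grain style referred to above maps naturally to one GPU thread per horizontal grid point. A minimal CUDA sketch of that mapping is shown below; the field names, the placeholder tendency, and the launch configuration are illustrative assumptions, not WRF code.

```cuda
#include <cuda_runtime.h>

// Minimal sketch: one GPU thread per horizontal grid point, looping over
// vertical levels. The update itself is a placeholder, not WRF physics.
__global__ void pointwise_update(const float *t_in, float *t_out,
                                 int nx, int ny, int nz, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // x index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // y index
    if (i >= nx || j >= ny) return;

    for (int k = 0; k < nz; ++k) {                  // serial loop over levels
        int idx = (k * ny + j) * nx + i;
        t_out[idx] = t_in[idx] + dt * 0.5f * t_in[idx];  // placeholder tendency
    }
}

// Launch example: 16x16 thread blocks tiling the horizontal domain.
// dim3 block(16, 16);
// dim3 grid((nx + 15) / 16, (ny + 15) / 16);
// pointwise_update<<<grid, block>>>(d_t_in, d_t_out, nx, ny, nz, dt);
```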
Using Compiler Directives to Port Large Scientific Applications to GPUs: An Example from Atmospheric Science - http://www.worldscientific.com/doi/abs/10.1142/S0129626414500030
For many scientific applications, Graphics Processing Units (GPUs) can be an interesting alternative to conventional CPUs as they can deliver higher memory bandwidth and computing power. While it is conceivable to re-write the most execution-time-intensive parts using a low-level API for accelerator programming, it may not be feasible to do so for the entire application. But having only selected parts of the application running on the GPU requires repeatedly transferring data between the GPU and the host CPU, which may lead to a serious performance penalty. In this paper we assess the potential of compiler directives, based on the OpenACC standard, for porting large parts of code and thus achieving a full GPU implementation. As an illustrative and relevant example, we consider the climate and numerical weather prediction code COSMO (Consortium for Small Scale Modeling) and focus on the physical parametrizations, a part of the code which describes all physical processes not accounted for by the fundamental equations of atmospheric motion. We show, by porting three of the dominant parametrization schemes, the radiation, microphysics and turbulence parametrizations, that compiler directives are an efficient tool in terms of both final execution time and implementation effort. Compiler directives make it possible to port large sections of the existing code with minor modifications while still allowing further optimizations for the most performance-critical parts. With the example of the radiation parametrization, which contains the solution of a block tri-diagonal linear system, the required code modifications and key optimizations are discussed in detail. Performance tests for the three physical parametrizations show a speedup of between 3× and 7× in execution time on a GPU relative to a multi-core CPU of an equivalent generation.
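The radiation parametrization's tri-diagonal solve is a good illustration of why per-column work maps well to a GPU: each vertical column can be handled by its own thread. The CUDA sketch below shows that structure for a scalar (not block) tridiagonal system solved with the Thomas algorithm; the paper itself uses OpenACC directives rather than hand-written CUDA, and the array names and layout here are assumptions chosen for readability.

```cuda
#include <cuda_runtime.h>

// Sketch only: one thread solves the scalar tridiagonal system of its own
// vertical column with the Thomas algorithm. The radiation scheme in the
// paper solves a *block* tridiagonal system; this simplification just shows
// the column-parallel structure a directive-based port exposes.
// Each column stores its nz coefficients contiguously.
__global__ void thomas_per_column(const float *a, const float *b,
                                  const float *c, float *d, float *x,
                                  float *cp, int ncol, int nz)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= ncol) return;

    const float *ac = a + (size_t)col * nz;   // sub-diagonal
    const float *bc = b + (size_t)col * nz;   // diagonal
    const float *cc = c + (size_t)col * nz;   // super-diagonal
    float *dc  = d  + (size_t)col * nz;       // right-hand side, overwritten
    float *xc  = x  + (size_t)col * nz;       // solution
    float *cpc = cp + (size_t)col * nz;       // scratch for modified c

    // Forward sweep
    cpc[0] = cc[0] / bc[0];
    dc[0]  = dc[0] / bc[0];
    for (int k = 1; k < nz; ++k) {
        float m = 1.0f / (bc[k] - ac[k] * cpc[k - 1]);
        cpc[k] = cc[k] * m;
        dc[k]  = (dc[k] - ac[k] * dc[k - 1]) * m;
    }
    // Back substitution
    xc[nz - 1] = dc[nz - 1];
    for (int k = nz - 2; k >= 0; --k)
        xc[k] = dc[k] - cpc[k] * xc[k + 1];
}
```

A production port would likely interleave columns across threads so that consecutive threads touch consecutive memory; the column-contiguous layout above is only for clarity.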
GPU Accelerated Discontinuous Galerkin Methods for Shallow Water Equations - https://www.cambridge.org/core/journals/communications-in-computational-physics/article/gpu-accelerated-discontinuous-galerkin-methods-for-shallow-water-equations/B1A0C93B3B4074B99D1313F3FF670F13
We discuss the development, verification, and performance of a GPU-accelerated discontinuous Galerkin method for the solution of the two-dimensional nonlinear shallow water equations. The shallow water equations are hyperbolic partial differential equations and are widely used in the simulation of tsunami wave propagation. Our algorithms are tailored to take advantage of the single instruction multiple data (SIMD) architecture of graphics processing units. The time integration is accelerated by local time stepping based on a multi-rate Adams-Bashforth scheme. A total variation bounded limiter is adopted for nonlinear stability of the numerical scheme. This limiter is coupled with a mass- and momentum-conserving positivity-preserving limiter for the special treatment of a dry or partially wet element in the triangulation. Accuracy, robustness and performance are demonstrated with the aid of test cases. Furthermore, we developed a unified multi-threading model, OCCA. Kernels expressed in the OCCA model can be cross-compiled with the multi-threading models OpenCL, CUDA, and OpenMP, and we compare the performance of the OCCA kernels when cross-compiled with these models.
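The per-degree-of-freedom structure of the time integrator is what makes it SIMD-friendly. As a minimal illustration, the CUDA kernel below applies a plain (single-rate) third-order Adams-Bashforth update to every DG degree of freedom; the multi-rate local time stepping and the limiters described above are not shown, and the array names are assumptions.

```cuda
#include <cuda_runtime.h>

// Sketch: a single-rate third-order Adams-Bashforth update applied
// independently to every DG degree of freedom. The paper's scheme is
// multi-rate with local time stepping plus slope/positivity limiters,
// none of which is shown here.
__global__ void ab3_update(float *q, const float *rhs0, const float *rhs1,
                           const float *rhs2, float dt, int ndof)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= ndof) return;

    // AB3: q += dt * (23/12 f^n - 16/12 f^(n-1) + 5/12 f^(n-2))
    q[n] += dt * ((23.0f / 12.0f) * rhs0[n]
                - (16.0f / 12.0f) * rhs1[n]
                + ( 5.0f / 12.0f) * rhs2[n]);
}
```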
Weather, Climate, and Earth System Modeling on Intel HPC Architectures (2015) - https://www2.cisl.ucar.edu/sites/default/files/Mills_Slides.pdf
Using the GPU to Predict Drift in the Ocean - http://on-demand.gputechconf.com/gtc/2016/posters/GTC_2016_Earth_Systems_Modelling_ESM_03_P6193_WEB.pdf
Co-Designing GPU-Based Systems and Tools for Numerical Weather Predictions - http://on-demand.gputechconf.com/gtc/2016/presentation/s6628-thoms-schulthess-co-designing-gpu-based-systems-and-tools.pdf
Development of a hybrid parallel MCV-based high-order global shallow-water model - http://link.springer.com/article/10.1007/s11227-017-1958-1
Utilization of high-order spatial discretizations is an important trend in developing global atmospheric models. As a competitive choice, the multi-moment constrained volume (MCV) method can achieve high accuracy while maintaining similar parallel scalability to classical finite volume methods. In this work, we introduce the development of a hybrid parallel MCV-based global shallow-water model on the cubed-sphere grid. Based on a sequential code, we perform parallelization on both the process and the thread levels. To enable process-level parallelism, we first decompose the six patches of the cubed-sphere with the same 2-D partition and then employ a conflict-free pipe-flow communication scheme for overlapping the halo exchange with computations. To further exploit the heterogeneous computing capacity of an Intel Xeon Phi accelerated supercomputer, we propose a guided panel-based inner–outer partition to distribute the workload among the CPUs and the coprocessors. In addition, thread-level parallelism along with various optimizations is applied on both the multi-core CPU and the many-core accelerator.
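Overlapping halo exchange with computation is a generic pattern worth making concrete. The sketch below shows the plain non-blocking-MPI version of the idea (post the exchange, update interior cells that need no halo data, then finish the exchange and update the border); the paper's conflict-free pipe-flow scheme is more elaborate, and the routine and buffer names here are placeholders.

```cuda
#include <mpi.h>

// Placeholder work routines; a real model would update grid cells here.
static void compute_interior(double *field) { (void)field; }
static void compute_border(double *field, const double *halo) { (void)field; (void)halo; }

// Generic overlap sketch (not the paper's pipe-flow scheme): post the
// non-blocking halo exchange, do the interior work that needs no halo
// data while messages are in flight, then finish the exchange and
// update the border cells. Neighbour ranks and buffer layout are assumed.
void step_with_overlap(double *field,
                       double *send_left, double *send_right,
                       double *recv_left, double *recv_right,
                       int halo_count, int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    MPI_Irecv(recv_left,  halo_count, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recv_right, halo_count, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(send_right, halo_count, MPI_DOUBLE, right, 0, comm, &req[2]);
    MPI_Isend(send_left,  halo_count, MPI_DOUBLE, left,  1, comm, &req[3]);

    compute_interior(field);                    // overlaps with communication

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);   // halos now available

    compute_border(field, recv_left);           // consume received halos
    compute_border(field, recv_right);
}
```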
Exascale Challenges for Numerical Weather Prediction: The ESCAPE Project - http://on-demand.gputechconf.com/gtc/2016/presentation/s6855-olivier-marsden-exascale-challenge-numerical-weather-prediction.pdf
Graphics processing unit optimizations for the dynamics of the HIRLAM weather forecast model - http://onlinelibrary.wiley.com/doi/10.1002/cpe.2951/abstract
Programmable graphics processing units (GPUs) nowadays offer tremendous computational resources for diverse applications. In this paper, we present the implementation of the dynamics routine of the HIRLAM weather forecast model on the NVIDIA GTX 480. The original Fortran code has been converted manually to C and CUDA. Empirically, we determine the optimal number of grid points per thread and the best thread and block structures. A significant amount of the elapsed time consists of transferring data between CPU and GPU. To reduce the impact of these transfer costs, we overlap calculation and transfer of data using multiple CUDA streams. We developed an algorithm that enables our code generator CTADEL to automatically generate the optimal CUDA streams program. Experiments are performed to find out whether GPUs are useful for Numerical Weather Prediction, in particular for the dynamics part.
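The transfer/compute overlap described above comes down to asynchronous copies and kernel launches issued on multiple CUDA streams. Below is a minimal two-stream sketch, with a placeholder kernel and chunked buffers standing in for the real dynamics fields; it is not the CTADEL-generated code, and it assumes the host buffers are pinned so the copies can actually run asynchronously.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for a chunk of the dynamics computation.
__global__ void dynamics_chunk(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 1.0001f;   // dummy update
}

// Sketch of transfer/compute overlap with two CUDA streams: while one
// chunk is being computed on the GPU, the next chunk is copied over.
// Host buffers must be pinned (cudaMallocHost) for the copies to be async.
void run_chunks(float *h_data, float *d_data, int nchunks, int chunk)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < nchunks; ++c) {
        cudaStream_t st = s[c % 2];
        float *h = h_data + (size_t)c * chunk;
        float *d = d_data + (size_t)c * chunk;

        cudaMemcpyAsync(d, h, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        dynamics_chunk<<<(chunk + 255) / 256, 256, 0, st>>>(d, chunk);
        cudaMemcpyAsync(h, d, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }

    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```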
Directive-Based Parallelization of the NIM Weather Model for GPUs - http://ieeexplore.ieee.org/document/7081678/
The NIM is a performance-portable model that runs on CPU, GPU and MIC architectures with a single source code. The single source plus efficient code design allows application scientists to maintain the Fortran code, while computer scientists optimize performance and portability using OpenMP, OpenACC, and F2C-ACC directives. The F2C-ACC compiler was developed in 2008 at NOAA's Earth System Research Laboratory (ESRL) to support GPU parallelization before commercial Fortran GPU compilers were available. Since then, a number of vendors have built GPU compilers that are compliant with the emerging OpenACC standard. The paper compares parallelization and performance of NIM using the F2C-ACC, Cray and PGI Fortran GPU compilers.
Unified CPU+GPU Programming for the ASUCA Production Weather Model - http://on-demand.gputechconf.com/gtc/2016/presentation/s6621-michel-muller-unified-cpu-gpu-programming.pdf
Unleashing the Performance Potential of GPU for Atmospheric Dynamic Solvers - http://on-demand.gputechconf.com/gtc/2016/presentation/s6354-haohuan-fu-unleashing-the-performance-gpus.pdf
Task-Based Dynamic Scheduling Approach to Accelerating NASA's GEOS-5 - http://on-demand.gputechconf.com/gtc/2016/presentation/s6343-eric-kelmelis-nasa-geos-5.pdf
Parallelization and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures - http://on-demand.gputechconf.com/gtc/2016/presentation/s6117-mark-govett-parallelization-performance-nim-weather-model.pdf
GPGPU Applications for Hydrological and Atmospheric Simulations and Visualizations on the Web - http://on-demand.gputechconf.com/gtc/2016/presentation/s6388-ibrahim-demir-gpgpu-hydrological-atmospheric-applications.pdf
GPU Effectiveness for DART - http://on-demand.gputechconf.com/gtc/2016/posters/GTC_2016_Supercomputing_SC_04_P6221_WEB.pdf
"Piz Daint" and "Piz Kesch": From General Purpose GPU-Accelerated Supercomputing to an Appliance for Weather Forecasting - http://on-demand.gputechconf.com/gtc/2016/presentation/s6683-thomas-schulthess-piz-daint-and-piz-kesch.pdf
Accelerating CICE on the GPU - http://on-demand.gputechconf.com/gtc/2015/presentation/S5322-Rob-Aulwes.pdf
Running the NIM Next-Generation Weather Model on GPUs - http://dl.acm.org/citation.cfm?id=1845128
We are using GPUs to run a new weather model being developed at NOAA’s Earth System Research Laboratory (ESRL). The parallelization approach is to run the entire model on the GPU and only rely on the CPU for model initialization, I/O, and inter-processor communications. We have written a compiler to convert Fortran into CUDA, and used it to parallelize the dynamics portion of the model. Dynamics, the most computationally intensive part of the model, is currently running 34 times faster on a single GPU than the CPU. We also describe our approach and progress to date in running NIM on multiple GPUs.
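The design point above (everything resident on the GPU, with the CPU limited to initialization, I/O, and communication) amounts to allocating the model state on the device once and copying it back only when output is needed. Below is a minimal sketch of that structure, with a placeholder kernel and state array rather than anything from NIM.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for one dynamics time step running entirely on the GPU.
__global__ void dynamics_step(float *state, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] += dt * 0.001f * state[i];   // dummy tendency
}

// Sketch of the GPU-resident pattern: the state lives on the device for
// the whole run; the host only initializes, triggers output, and (in a
// multi-GPU run) would handle inter-processor communication.
void run_model(float *h_state, int n, int nsteps, int output_every, float dt)
{
    float *d_state = nullptr;
    cudaMalloc(&d_state, n * sizeof(float));
    cudaMemcpy(d_state, h_state, n * sizeof(float), cudaMemcpyHostToDevice);

    for (int step = 1; step <= nsteps; ++step) {
        dynamics_step<<<(n + 255) / 256, 256>>>(d_state, n, dt);

        if (step % output_every == 0) {
            // Copy back only when output is needed (I/O stays on the CPU).
            cudaMemcpy(h_state, d_state, n * sizeof(float),
                       cudaMemcpyDeviceToHost);
            printf("step %d: state[0] = %f\n", step, h_state[0]);
        }
    }

    cudaFree(d_state);
}
```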
Experience Applying Fortran GPU Compilers to Numerical Weather Prediction - http://www.esrl.noaa.gov/gsd/ab/ac/Presentations/SAAHPC2011_HendersonTBtalk.pdf
Optimizing Weather Model Radiative Transfer Physics for Intel’s Many Integrated Core (MIC) Architecture - http://www.worldscientific.com/doi/abs/10.1142/S0129626416500195
Large numerical weather prediction (NWP) codes such as the Weather Research and Forecast (WRF) model and the NOAA Nonhydrostatic Multiscale Model (NMM-B) port easily to Intel's Many Integrated Core (MIC) architecture. But for NWP to significantly realize MIC's one- to two-TFLOP/s peak computational power, we must expose and exploit thread and fine-grained (vector) parallelism while overcoming memory system bottlenecks that starve floating-point performance. We report on our work to improve the Rapid Radiative Transfer Model (RRTMG), responsible for 10-20 percent of total NMM-B run time. We isolated a standalone RRTMG benchmark code and workload from NMM-B and then analyzed performance using hardware performance counters and scaling studies. We restructured the code to improve vectorization, thread parallelism, locality, and thread contention. The restructured code ran three times faster than the original on MIC and, also importantly, 1.3× faster than the original on the host Xeon Sandy Bridge.
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code - http://www.sciencedirect.com/science/article/pii/S0010465516300959
"Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor."
High performance Python for direct numerical simulations of turbulent flows - http://www.sciencedirect.com/science/article/pii/S0010465516300200
"Direct Numerical Simulations (DNS) of the Navier Stokes
equations is an invaluable research tool in fluid dynamics. Still,
there are few publicly available research codes and, due to the heavy
number crunching implied, available codes are usually written in
low-level languages such as C/C++ or Fortran. In this paper we describe a
pure scientific Python pseudo-spectral DNS code that nearly matches the
performance of C++ for thousands of processors and billions of
unknowns. We also describe a version optimized through Cython, that is
found to match the speed of C++. The solvers are written from scratch in
Python, both the mesh, the MPI domain decomposition, and the temporal
integrators. The solvers have been verified and benchmarked on the
Shaheen supercomputer at the KAUST supercomputing laboratory, and we are
able to show very good scaling up to several thousand cores.
A
very important part of the implementation is the mesh decomposition (we
implement both slab and pencil decompositions) and 3D parallel Fast
Fourier Transforms (FFT). The mesh decomposition and FFT routines have
been implemented in Python using serial FFT routines (either NumPy,
pyFFTW or any other serial FFT module), NumPy array manipulations and
with MPI communications handled by MPI for Python (mpi4py). We
show how we are able to execute a 3D parallel FFT in Python for a slab
mesh decomposition using 4 lines of compact Python code, for which the
parallel performance on Shaheen is found to be slightly better than
similar routines provided through the FFTW library. For a pencil mesh
decomposition 7 lines of code is required to execute a transform."
Implementation of the Vanka-type multigrid solver for the finite element approximation of the Navier–Stokes equations on GPU - http://www.sciencedirect.com/science/article/pii/S0010465515004026
"The first attempts to solve the Navier–Stokes equations on the GPUs date back to 2005. In [2],
the authors implemented the SMAC method on rectangular grids. They
approximate the velocity explicitly while the pressure implicitly. The
linear system for pressure is solved by the Jacobi method. They achieved
an overall speed-up more than 20 on GeForce FX 5900 and GeForce 6800
Ultra compared to Pentium IV running at 2 GHz. In [3],
the authors present a GPU implementation of a solver for the
compressible Euler equations for hypersonic flow on complex domains.
They applied a semi-implicit scheme with a finite difference
discretization in space. The resulting linear system is solved by the
multigrid method. It allowed a speed-up 15–40 using Core 2 Duo E6600 CPU
at 2.4 GHz and GeForce 8800 GTX GPU. Both models were restricted to two
dimensions.
The results of 3D simulations on a structured grid of quadrilateral elements were presented in [4]
with a speed-up 29 and 16 in 2D and 3D respectively (on Core 2 Duo at
2.33 GHz and GeForce 8800 GTX). 3D computations of inviscid compressible
flow using unstructured grids were studied in [5].
The cell-centered finite volume method with the explicit Runge–Kutta
timestepping was implemented and a speed-up 33 was achieved (on Intel
Core 2 Q9450 and Nvidia Tesla card). The multigrid method for the Full
Approximation Scheme in 2D was implemented on GPU in [6] with a speed-up 10. A computation on a multi-GPU system was presented in [7].
On four GPUs, a speed-up was 100 (on AMD Opteron 2.4 GHz and four
Nvidia Tesla S870 cards). A two-phase flow solver for the Navier–Stokes
equations was partially ported to GPU in [8].
On a GPU cluster with eight Nvidia Tesla S1070 GPUs made of two
workstations equipped with Intel Core i7-920 at 2.66 GHz, the authors
achieved a speed-up up to 115."
"Piz
Daint" and "Piz Kesch": From General Purpose GPU-Accelerated
Supercomputing to an Appliance for Weather Forecasting - See more at:
http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=&searchItems=&sessionTopic=&sessionEvent=2&sessionYear=2016&sessionFormat=&submit=&select=#sthash.SsFwLw8Z.dpuf
"Piz
Daint" and "Piz Kesch": From General Purpose GPU-Accelerated
Supercomputing to an Appliance for Weather Forecasting - See more at:
http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=&searchItems=&sessionTopic=&sessionEvent=2&sessionYear=2016&sessionFormat=&submit=&select=#sthash.SsFwLw8Z.dpuf