GPU Acceleration of Numerical Weather Prediction - http://www.worldscientific.com/doi/abs/10.1142/S0129626408003557
Weather and climate prediction software has enjoyed the benefits of exponentially increasing processor power for almost 50 years. Even with the advent of large-scale parallelism in weather models, much of the performance increase has come from increasing processor speed rather than increased parallelism. This free ride is nearly over. Recent results also indicate that simply increasing the use of large-scale parallelism will prove ineffective for many scenarios where strong scaling is required. We present an alternative method of scaling model performance by exploiting emerging architectures using the fine-grain parallelism once used in vector machines. The paper shows the promise of this approach by demonstrating a nearly 10× speedup for a computationally intensive portion of the Weather Research and Forecast (WRF) model on a variety of NVIDIA Graphics Processing Units (GPUs). This change alone speeds up the whole weather model by 1.23×.
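The fine-grain style referred to above maps naturally to one GPU thread per horizontal grid point. A minimal CUDA sketch of that mapping is shown below; the field names, the placeholder tendency, and the launch configuration are illustrative assumptions, not WRF code.

```cuda
#include <cuda_runtime.h>

// Minimal sketch: one GPU thread per horizontal grid point, looping over
// vertical levels. The update itself is a placeholder, not WRF physics.
__global__ void pointwise_update(const float *t_in, float *t_out,
                                 int nx, int ny, int nz, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // x index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // y index
    if (i >= nx || j >= ny) return;

    for (int k = 0; k < nz; ++k) {                  // serial loop over levels
        int idx = (k * ny + j) * nx + i;
        t_out[idx] = t_in[idx] + dt * 0.5f * t_in[idx];  // placeholder tendency
    }
}

// Launch example: 16x16 thread blocks tiling the horizontal domain.
// dim3 block(16, 16);
// dim3 grid((nx + 15) / 16, (ny + 15) / 16);
// pointwise_update<<<grid, block>>>(d_t_in, d_t_out, nx, ny, nz, dt);
```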
Using Compiler Directives to Port Large Scientific Applications to GPUs: An Example from Atmospheric Science - http://www.worldscientific.com/doi/abs/10.1142/S0129626414500030
For many scientific applications, Graphics Processing Units (GPUs) can be an interesting alternative to conventional CPUs as they can deliver higher memory bandwidth and computing power. While it is conceivable to re-write the most execution-time-intensive parts using a low-level API for accelerator programming, it may not be feasible to do so for the entire application. But having only selected parts of the application running on the GPU requires repeatedly transferring data between the GPU and the host CPU, which may lead to a serious performance penalty. In this paper we assess the potential of compiler directives, based on the OpenACC standard, for porting large parts of code and thus achieving a full GPU implementation. As an illustrative and relevant example, we consider the climate and numerical weather prediction code COSMO (Consortium for Small Scale Modeling) and focus on the physical parametrizations, a part of the code which describes all physical processes not accounted for by the fundamental equations of atmospheric motion. We show, by porting three of the dominant parametrization schemes, the radiation, microphysics and turbulence parametrizations, that compiler directives are an efficient tool in terms of both final execution time and implementation effort. Compiler directives make it possible to port large sections of the existing code with minor modifications while still allowing further optimizations for the most performance-critical parts. With the example of the radiation parametrization, which contains the solution of a block tri-diagonal linear system, the required code modifications and key optimizations are discussed in detail. Performance tests for the three physical parametrizations show a speedup of between 3× and 7× in execution time on a GPU relative to a multi-core CPU of an equivalent generation.
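The radiation parametrization's tri-diagonal solve is a good illustration of why per-column work maps well to a GPU: each vertical column can be handled by its own thread. The CUDA sketch below shows that structure for a scalar (not block) tridiagonal system solved with the Thomas algorithm; the paper itself uses OpenACC directives rather than hand-written CUDA, and the array names and layout here are assumptions chosen for readability.

```cuda
#include <cuda_runtime.h>

// Sketch only: one thread solves the scalar tridiagonal system of its own
// vertical column with the Thomas algorithm. The radiation scheme in the
// paper solves a *block* tridiagonal system; this simplification just shows
// the column-parallel structure a directive-based port exposes.
// Each column stores its nz coefficients contiguously.
__global__ void thomas_per_column(const float *a, const float *b,
                                  const float *c, float *d, float *x,
                                  float *cp, int ncol, int nz)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= ncol) return;

    const float *ac = a + (size_t)col * nz;   // sub-diagonal
    const float *bc = b + (size_t)col * nz;   // diagonal
    const float *cc = c + (size_t)col * nz;   // super-diagonal
    float *dc  = d  + (size_t)col * nz;       // right-hand side, overwritten
    float *xc  = x  + (size_t)col * nz;       // solution
    float *cpc = cp + (size_t)col * nz;       // scratch for modified c

    // Forward sweep
    cpc[0] = cc[0] / bc[0];
    dc[0]  = dc[0] / bc[0];
    for (int k = 1; k < nz; ++k) {
        float m = 1.0f / (bc[k] - ac[k] * cpc[k - 1]);
        cpc[k] = cc[k] * m;
        dc[k]  = (dc[k] - ac[k] * dc[k - 1]) * m;
    }
    // Back substitution
    xc[nz - 1] = dc[nz - 1];
    for (int k = nz - 2; k >= 0; --k)
        xc[k] = dc[k] - cpc[k] * xc[k + 1];
}
```

A production port would likely interleave columns across threads so that consecutive threads touch consecutive memory; the column-contiguous layout above is only for clarity.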
GPU Accelerated Discontinuous Galerkin Methods for Shallow Water Equations - https://www.cambridge.org/core/journals/communications-in-computational-physics/article/gpu-accelerated-discontinuous-galerkin-methods-for-shallow-water-equations/B1A0C93B3B4074B99D1313F3FF670F13
We discuss the development, verification, and performance of a GPU-accelerated discontinuous Galerkin method for the solution of the two-dimensional nonlinear shallow water equations. The shallow water equations are hyperbolic partial differential equations and are widely used in the simulation of tsunami wave propagation. Our algorithms are tailored to take advantage of the single instruction multiple data (SIMD) architecture of graphics processing units. The time integration is accelerated by local time stepping based on a multi-rate Adams-Bashforth scheme. A total variation bounded limiter is adopted for nonlinear stability of the numerical scheme. This limiter is coupled with a mass- and momentum-conserving positivity-preserving limiter for the special treatment of a dry or partially wet element in the triangulation. Accuracy, robustness and performance are demonstrated with the aid of test cases. Furthermore, we developed a unified multi-threading model, OCCA. Kernels expressed in the OCCA model can be cross-compiled with the multi-threading models OpenCL, CUDA, and OpenMP, and we compare the performance of the OCCA kernels when cross-compiled with these models.
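The per-degree-of-freedom structure of the time integrator is what makes it SIMD-friendly. As a minimal illustration, the CUDA kernel below applies a plain (single-rate) third-order Adams-Bashforth update to every DG degree of freedom; the multi-rate local time stepping and the limiters described above are not shown, and the array names are assumptions.

```cuda
#include <cuda_runtime.h>

// Sketch: a single-rate third-order Adams-Bashforth update applied
// independently to every DG degree of freedom. The paper's scheme is
// multi-rate with local time stepping plus slope/positivity limiters,
// none of which is shown here.
__global__ void ab3_update(float *q, const float *rhs0, const float *rhs1,
                           const float *rhs2, float dt, int ndof)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= ndof) return;

    // AB3: q += dt * (23/12 f^n - 16/12 f^(n-1) + 5/12 f^(n-2))
    q[n] += dt * ((23.0f / 12.0f) * rhs0[n]
                - (16.0f / 12.0f) * rhs1[n]
                + ( 5.0f / 12.0f) * rhs2[n]);
}
```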
Weather, Climate, and Earth System Modeling on Intel HPC Architectures (2015) - https://www2.cisl.ucar.edu/sites/default/files/Mills_Slides.pdf
Using the GPU to Predict Drift in the Ocean - http://on-demand.gputechconf.com/gtc/2016/posters/GTC_2016_Earth_Systems_Modelling_ESM_03_P6193_WEB.pdf
Co-Designing GPU-Based Systems and Tools for Numerical Weather Predictions - http://on-demand.gputechconf.com/gtc/2016/presentation/s6628-thoms-schulthess-co-designing-gpu-based-systems-and-tools.pdf
Development of a hybrid parallel MCV-based high-order global shallow-water model - http://link.springer.com/article/10.1007/s11227-017-1958-1
Utilization of high-order spatial discretizations is an important trend in developing global atmospheric models. As a competitive choice, the multi-moment constrained volume (MCV) method can achieve high accuracy while maintaining similar parallel scalability to classical finite volume methods. In this work, we introduce the development of a hybrid parallel MCV-based global shallow-water model on the cubed-sphere grid. Based on a sequential code, we perform parallelization on both the process and the thread levels. To enable process-level parallelism, we first decompose the six patches of the cubed-sphere with the same 2-D partition and then employ a conflict-free pipe-flow communication scheme for overlapping the halo exchange with computations. To further exploit the heterogeneous computing capacity of an Intel Xeon Phi accelerated supercomputer, we propose a guided panel-based inner–outer partition to distribute the workload among the CPUs and the coprocessors. In addition, thread-level parallelism along with various optimizations is applied on both the multi-core CPU and the many-core accelerator.
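Overlapping halo exchange with computation is a generic pattern worth making concrete. The sketch below shows the plain non-blocking-MPI version of the idea (post the exchange, update interior cells that need no halo data, then finish the exchange and update the border); the paper's conflict-free pipe-flow scheme is more elaborate, and the routine and buffer names here are placeholders.

```cuda
#include <mpi.h>

// Placeholder work routines; a real model would update grid cells here.
static void compute_interior(double *field) { (void)field; }
static void compute_border(double *field, const double *halo) { (void)field; (void)halo; }

// Generic overlap sketch (not the paper's pipe-flow scheme): post the
// non-blocking halo exchange, do the interior work that needs no halo
// data while messages are in flight, then finish the exchange and
// update the border cells. Neighbour ranks and buffer layout are assumed.
void step_with_overlap(double *field,
                       double *send_left, double *send_right,
                       double *recv_left, double *recv_right,
                       int halo_count, int left, int right, MPI_Comm comm)
{
    MPI_Request req[4];

    MPI_Irecv(recv_left,  halo_count, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(recv_right, halo_count, MPI_DOUBLE, right, 1, comm, &req[1]);
    MPI_Isend(send_right, halo_count, MPI_DOUBLE, right, 0, comm, &req[2]);
    MPI_Isend(send_left,  halo_count, MPI_DOUBLE, left,  1, comm, &req[3]);

    compute_interior(field);                    // overlaps with communication

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);   // halos now available

    compute_border(field, recv_left);           // consume received halos
    compute_border(field, recv_right);
}
```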
Exascale Challenges for Numerical Weather Prediction: The ESCAPE Project - http://on-demand.gputechconf.com/gtc/2016/presentation/s6855-olivier-marsden-exascale-challenge-numerical-weather-prediction.pdf
Graphics processing unit optimizations for the dynamics of the HIRLAM weather forecast model - http://onlinelibrary.wiley.com/doi/10.1002/cpe.2951/abstract
Programmable graphics processing units (GPUs) nowadays offer tremendous computational resources for diverse applications. In this paper, we present the implementation of the dynamics routine of the HIRLAM weather forecast model on the NVIDIA GTX 480. The original Fortran code has been converted manually to C and CUDA. Empirically, we determine the optimal number of grid points per thread and the best thread and block structures. A significant amount of the elapsed time consists of transferring data between CPU and GPU. To reduce the impact of these transfer costs, we overlap calculation and transfer of data using multiple CUDA streams. We developed an algorithm that enables our code generator CTADEL to automatically generate the optimal CUDA streams program. Experiments are performed to find out whether GPUs are useful for Numerical Weather Prediction, in particular for the dynamics part.
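The transfer/compute overlap described above comes down to asynchronous copies and kernel launches issued on multiple CUDA streams. Below is a minimal two-stream sketch, with a placeholder kernel and chunked buffers standing in for the real dynamics fields; it is not the CTADEL-generated code, and it assumes the host buffers are pinned so the copies can actually run asynchronously.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for a chunk of the dynamics computation.
__global__ void dynamics_chunk(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 1.0001f;   // dummy update
}

// Sketch of transfer/compute overlap with two CUDA streams: while one
// chunk is being computed on the GPU, the next chunk is copied over.
// Host buffers must be pinned (cudaMallocHost) for the copies to be async.
void run_chunks(float *h_data, float *d_data, int nchunks, int chunk)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < nchunks; ++c) {
        cudaStream_t st = s[c % 2];
        float *h = h_data + (size_t)c * chunk;
        float *d = d_data + (size_t)c * chunk;

        cudaMemcpyAsync(d, h, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        dynamics_chunk<<<(chunk + 255) / 256, 256, 0, st>>>(d, chunk);
        cudaMemcpyAsync(h, d, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }

    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```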
Directive-Based Parallelization of the NIM Weather Model for GPUs - http://ieeexplore.ieee.org/document/7081678/
The NIM is a performance-portable model that runs on CPU, GPU and MIC architectures with a single source code. The single source plus efficient code design allows application scientists to maintain the Fortran code, while computer scientists optimize performance and portability using OpenMP, OpenACC, and F2C-ACC directives. The F2C-ACC compiler was developed in 2008 at NOAA's Earth System Research Laboratory (ESRL) to support GPU parallelization before commercial Fortran GPU compilers were available. Since then, a number of vendors have built GPU compilers that are compliant with the emerging OpenACC standard. The paper compares parallelization and performance of NIM using the F2C-ACC, Cray and PGI Fortran GPU compilers.
Unified CPU+GPU Programming for the ASUCA Production Weather Model - http://on-demand.gputechconf.com/gtc/2016/presentation/s6621-michel-muller-unified-cpu-gpu-programming.pdf
Unleashing the Performance Potential of GPU for Atmospheric Dynamic Solvers - http://on-demand.gputechconf.com/gtc/2016/presentation/s6354-haohuan-fu-unleashing-the-performance-gpus.pdf
Task-Based Dynamic Scheduling Approach to Accelerating NASA's GEOS-5 - http://on-demand.gputechconf.com/gtc/2016/presentation/s6343-eric-kelmelis-nasa-geos-5.pdf
Parallelization and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures - http://on-demand.gputechconf.com/gtc/2016/presentation/s6117-mark-govett-parallelization-performance-nim-weather-model.pdf
GPGPU Applications for Hydrological and Atmospheric Simulations and Visualizations on the Web - http://on-demand.gputechconf.com/gtc/2016/presentation/s6388-ibrahim-demir-gpgpu-hydrological-atmospheric-applications.pdf
GPU Effectiveness for DART - http://on-demand.gputechconf.com/gtc/2016/posters/GTC_2016_Supercomputing_SC_04_P6221_WEB.pdf
"Piz Daint" and "Piz Kesch": From General Purpose GPU-Accelerated Supercomputing to an Appliance for Weather Forecasting - http://on-demand.gputechconf.com/gtc/2016/presentation/s6683-thomas-schulthess-piz-daint-and-piz-kesch.pdf
Accelerating CICE on the GPU - http://on-demand.gputechconf.com/gtc/2015/presentation/S5322-Rob-Aulwes.pdf
Running the NIM Next-Generation Weather Model on GPUs - http://dl.acm.org/citation.cfm?id=1845128
We are using GPUs to run a new weather model being developed at NOAA’s Earth System Research Laboratory (ESRL). The parallelization approach is to run the entire model on the GPU and only rely on the CPU for model initialization, I/O, and inter-processor communications. We have written a compiler to convert Fortran into CUDA, and used it to parallelize the dynamics portion of the model. Dynamics, the most computationally intensive part of the model, is currently running 34 times faster on a single GPU than the CPU. We also describe our approach and progress to date in running NIM on multiple GPUs.
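The design point above (everything resident on the GPU, with the CPU limited to initialization, I/O, and communication) amounts to allocating the model state on the device once and copying it back only when output is needed. Below is a minimal sketch of that structure, with a placeholder kernel and state array rather than anything from NIM.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for one dynamics time step running entirely on the GPU.
__global__ void dynamics_step(float *state, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] += dt * 0.001f * state[i];   // dummy tendency
}

// Sketch of the GPU-resident pattern: the state lives on the device for
// the whole run; the host only initializes, triggers output, and (in a
// multi-GPU run) would handle inter-processor communication.
void run_model(float *h_state, int n, int nsteps, int output_every, float dt)
{
    float *d_state = nullptr;
    cudaMalloc(&d_state, n * sizeof(float));
    cudaMemcpy(d_state, h_state, n * sizeof(float), cudaMemcpyHostToDevice);

    for (int step = 1; step <= nsteps; ++step) {
        dynamics_step<<<(n + 255) / 256, 256>>>(d_state, n, dt);

        if (step % output_every == 0) {
            // Copy back only when output is needed (I/O stays on the CPU).
            cudaMemcpy(h_state, d_state, n * sizeof(float),
                       cudaMemcpyDeviceToHost);
            printf("step %d: state[0] = %f\n", step, h_state[0]);
        }
    }

    cudaFree(d_state);
}
```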
Experience Applying Fortran GPU Compilers to Numerical Weather Prediction - http://www.esrl.noaa.gov/gsd/ab/ac/Presentations/SAAHPC2011_HendersonTBtalk.pdf
Optimizing Weather Model Radiative Transfer Physics for Intel’s Many Integrated Core (MIC) Architecture - http://www.worldscientific.com/doi/abs/10.1142/S0129626416500195
Large numerical weather prediction (NWP) codes such as the Weather Research and Forecast (WRF) model and the NOAA Nonhydrostatic Multiscale Model (NMM-B) port easily to Intel's Many Integrated Core (MIC) architecture. But for NWP to significantly realize MIC's one- to two-TFLOP/s peak computational power, we must expose and exploit thread and fine-grained (vector) parallelism while overcoming memory system bottlenecks that starve floating-point performance. We report on our work to improve the Rapid Radiative Transfer Model (RRTMG), responsible for 10-20 percent of total NMM-B run time. We isolated a standalone RRTMG benchmark code and workload from NMM-B and then analyzed performance using hardware performance counters and scaling studies. We restructured the code to improve vectorization, thread parallelism, locality, and thread contention. The restructured code ran three times faster than the original on MIC and, also importantly, 1.3× faster than the original on the host Xeon Sandy Bridge.
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code - http://www.sciencedirect.com/science/article/pii/S0010465516300959
"Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor."
High performance Python for direct numerical simulations of turbulent flows - http://www.sciencedirect.com/science/article/pii/S0010465516300200
"Direct Numerical Simulations (DNS) of the Navier Stokes
equations is an invaluable research tool in fluid dynamics. Still,
there are few publicly available research codes and, due to the heavy
number crunching implied, available codes are usually written in
low-level languages such as C/C++ or Fortran. In this paper we describe a
pure scientific Python pseudo-spectral DNS code that nearly matches the
performance of C++ for thousands of processors and billions of
unknowns. We also describe a version optimized through Cython, that is
found to match the speed of C++. The solvers are written from scratch in
Python, both the mesh, the MPI domain decomposition, and the temporal
integrators. The solvers have been verified and benchmarked on the
Shaheen supercomputer at the KAUST supercomputing laboratory, and we are
able to show very good scaling up to several thousand cores.
A
very important part of the implementation is the mesh decomposition (we
implement both slab and pencil decompositions) and 3D parallel Fast
Fourier Transforms (FFT). The mesh decomposition and FFT routines have
been implemented in Python using serial FFT routines (either NumPy,
pyFFTW or any other serial FFT module), NumPy array manipulations and
with MPI communications handled by MPI for Python (mpi4py). We
show how we are able to execute a 3D parallel FFT in Python for a slab
mesh decomposition using 4 lines of compact Python code, for which the
parallel performance on Shaheen is found to be slightly better than
similar routines provided through the FFTW library. For a pencil mesh
decomposition 7 lines of code is required to execute a transform."
Implementation of the Vanka-type multigrid solver for the finite element approximation of the Navier–Stokes equations on GPU - http://www.sciencedirect.com/science/article/pii/S0010465515004026
"The first attempts to solve the Navier–Stokes equations on the GPUs date back to 2005. In [2],
the authors implemented the SMAC method on rectangular grids. They
approximate the velocity explicitly while the pressure implicitly. The
linear system for pressure is solved by the Jacobi method. They achieved
an overall speed-up more than 20 on GeForce FX 5900 and GeForce 6800
Ultra compared to Pentium IV running at 2 GHz. In [3],
the authors present a GPU implementation of a solver for the
compressible Euler equations for hypersonic flow on complex domains.
They applied a semi-implicit scheme with a finite difference
discretization in space. The resulting linear system is solved by the
multigrid method. It allowed a speed-up 15–40 using Core 2 Duo E6600 CPU
at 2.4 GHz and GeForce 8800 GTX GPU. Both models were restricted to two
dimensions.
The results of 3D simulations on a structured grid of quadrilateral elements were presented in [4]
with a speed-up 29 and 16 in 2D and 3D respectively (on Core 2 Duo at
2.33 GHz and GeForce 8800 GTX). 3D computations of inviscid compressible
flow using unstructured grids were studied in [5].
The cell-centered finite volume method with the explicit Runge–Kutta
timestepping was implemented and a speed-up 33 was achieved (on Intel
Core 2 Q9450 and Nvidia Tesla card). The multigrid method for the Full
Approximation Scheme in 2D was implemented on GPU in [6] with a speed-up 10. A computation on a multi-GPU system was presented in [7].
On four GPUs, a speed-up was 100 (on AMD Opteron 2.4 GHz and four
Nvidia Tesla S870 cards). A two-phase flow solver for the Navier–Stokes
equations was partially ported to GPU in [8].
On a GPU cluster with eight Nvidia Tesla S1070 GPUs made of two
workstations equipped with Intel Core i7-920 at 2.66 GHz, the authors
achieved a speed-up up to 115."
"Piz
Daint" and "Piz Kesch": From General Purpose GPU-Accelerated
Supercomputing to an Appliance for Weather Forecasting - See more at:
http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=&searchItems=&sessionTopic=&sessionEvent=2&sessionYear=2016&sessionFormat=&submit=&select=#sthash.SsFwLw8Z.dpuf
"Piz
Daint" and "Piz Kesch": From General Purpose GPU-Accelerated
Supercomputing to an Appliance for Weather Forecasting - See more at:
http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=&searchItems=&sessionTopic=&sessionEvent=2&sessionYear=2016&sessionFormat=&submit=&select=#sthash.SsFwLw8Z.dpuf