Pages

Wednesday, August 24, 2016

Julia

Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library. Julia’s Base library, largely written in Julia itself, also integrates mature, best-of-breed open source C and Fortran libraries for linear algebra, random number generation, signal processing, and string processing.

http://julialang.org/

https://github.com/JuliaLang/julia

Registered Julia Packages - http://pkg.julialang.org/

Curated Julia Language Links - https://github.com/svaksha/Julia.jl

Vectorization and Metaprogramming - http://www.juliabloggers.com/optimizing-details-of-vectorization-and-metaprogramming/

Creating domain-specific languages in Julia using macros - https://julialang.org/blog/2017/08/dsl

the shell

Awesome Shell: A Curated List - https://github.com/alebcay/awesome-shell

Filenames and Pathnames in Shell: How to Do It Correctly - http://www.dwheeler.com/essays/filenames-in-shell.html

Friday, August 19, 2016

Jupyterhub

"JupyterHub is a server that gives multiple users access to Jupyter notebooks, running an independent Jupyter notebook server for each user.

To use JupyterHub, you need a Unix server (typically Linux) running somewhere that is accessible to your team on the network. The JupyterHub server can be on an internal network at your organisation, or it can run on the public internet (in which case, take care with security). Users access JupyterHub in a web browser, by going to the IP address or domain name of the server.

Different authenticators control access to JupyterHub. The default one (pam) uses the user accounts on the server where JupyterHub is running. If you use this, you will need to create a user account on the system for each user on your team. Using other authenticators, you can allow users to sign in with e.g. a Github account, or with any single-sign-on system your organisation has.

Next, spawners control how JupyterHub starts the individual notebook server for each user. The default spawner will start a notebook server on the same machine running under their system username. The other main option is to start each server in a separate container, often using Docker."
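The authenticator and spawner described above are swapped by editing a Python config file. A minimal sketch (the GitHub authenticator and Docker spawner shown here come from the separate oauthenticator and dockerspawner packages, which must be installed for these lines to work):

```python
# jupyterhub_config.py -- generate a template with `jupyterhub --generate-config`;
# `c` is the config object JupyterHub supplies when loading this file.

# Default behaviour: PAM authentication against local system accounts, with a
# notebook server spawned under each user's system username.

# Swap in GitHub sign-in (from the oauthenticator package):
c.JupyterHub.authenticator_class = 'oauthenticator.GitHubOAuthenticator'

# Spawn each user's notebook server in a Docker container instead
# (from the dockerspawner package):
c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'
```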

https://jupyterhub.readthedocs.io/en/latest/

https://github.com/jupyterhub/jupyterhub


How to set up Jupyter with Pyspark painlessly on AWS EC2 clusters, with S3 I/O support - https://github.com/PiercingDan/spark-Jupyter-AWS

Jupyter + Pachyderm — Part 1, Exploring and Understanding Historical Analyses - https://medium.com/pachyderm-data/jupyter-pachyderm-part-1-exploring-and-understanding-historical-analyses-2a37e56c6578

Related Software

nteract - a desktop application that allows you to develop rich documents that contain prose, executable code (in almost any language!), and images

https://github.com/nteract/nteract

dockerspawner - Spawns JupyterHub user servers in Docker containers.

https://github.com/TACC/dockerspawner 

CLFORTRAN

"CLFORTRAN is an open source (LGPL) Fortran module, designed to provide direct access to GPU, CPU and accelerator based computing resources available by the OpenCL standard.

Taking advantage of recent Fortran language features, CLFORTRAN is implemented in pure Fortran (no C/C++ code involved). Therefore, it allows scientists to add GPU capabilities natively to a familiar Fortran environment.

Providing complete access to OpenCL API, the module is standard conformant, written in simple Fortran and maintains the same functional signature as in C/C++, to allow knowledge reuse and greater flexibility."

http://www.cass-hpc.com/solutions/libraries/clfortran-pure-fortran-interface-to-opencl/

https://github.com/cass-support/clfortran

KBLAS


"KBLAS (KAUST-BLAS) is a small open-source library that optimizes critical numerical kernels on CUDA-enabled GPUs. KBLAS provides a subset of standard BLAS functions. It also proposes some functions with a BLAS-like interface that target both single- and multi-GPU systems.


The ultimate goal for KBLAS is performance. KBLAS has a set of tuning parameters that affect its performance according to the GPU architecture, and the CUDA runtime version. While we cannot guarantee optimal performance with the default tuning parameters, the user can easily edit such parameters on his local system. KBLAS might be shipped with autotuners in the future. The user can refer to the tuning chapter in this document."


LITMUS-RT

"The LITMUSRT patch is a real-time extension of the Linux kernel with a focus on multiprocessor real-time scheduling and synchronization. The Linux kernel is modified to support the sporadic task model and modular scheduler plugins. Clustered, partitioned, and global scheduling are included, and semi-partitioned scheduling is supported as well."

http://www.litmus-rt.org/

https://wiki.litmus-rt.org/litmus/FrontPage

https://wiki.litmus-rt.org/litmus/Publications

 http://www.cs.unc.edu/~gelliott/main/Papers_%26_Projects/Entries/2013/9/23_GPUSync__A_Framework_for_Real-Time_GPU_Management.html

https://github.com/LITMUS-RT/litmus-rt

https://github.com/GElliott/litmus-rt-gpusync

http://on-demand.gputechconf.com/gtc/2014/poster/pdf/P4286_realtime_scheduling_automotive_safety.pdf

MPI

"Message Passing Interface (MPI) is a standardized and portable message-passing system designed by a group of researchers from academia and industry to function on a wide variety of parallel computers. The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. There are several well-tested and efficient implementations of MPI, many of which are open-source or in the public domain. These fostered the development of a parallel software industry, and encouraged development of portable and scalable large-scale parallel applications."
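Running real MPI requires an implementation such as Open MPI or MPICH (plus bindings like mpi4py for Python). Purely as a stdlib sketch of the send/receive style the standard defines, with threads standing in for ranks:

```python
import threading, queue

# Two "ranks" exchanging messages: send() is a put into the destination
# rank's inbox, recv() is a blocking get -- loosely mirroring MPI_Send/MPI_Recv.
inbox = [queue.Queue(), queue.Queue()]
results = []

def rank0():
    inbox[1].put(("rank 0", [0, 1, 2, 3]))  # "send" a payload to rank 1
    src, doubled = inbox[0].get()           # block until the reply arrives
    results.append((src, doubled))

def rank1():
    src, data = inbox[1].get()              # "receive" from rank 0
    inbox[0].put(("rank 1", [2 * x for x in data]))

t0, t1 = threading.Thread(target=rank0), threading.Thread(target=rank1)
t0.start(); t1.start(); t0.join(); t1.join()
print(results)  # [('rank 1', [0, 2, 4, 6])]
```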

https://en.wikipedia.org/wiki/Message_Passing_Interface


Tutorials


The latest Open MPI releases implement the capabilities described above.

Bones

"Recent advances in multi-core and many-core processors require programmers to exploit an increasing amount of parallelism from their applications. Data parallel languages such as CUDA and OpenCL make it possible to take advantage of such processors, but still require a large amount of effort from programmers. To address the challenge of parallel programming, we introduce Bones.

Bones is a source-to-source compiler based on algorithmic skeletons and a new algorithm classification. The compiler takes C-code annotated with class information as input and generates parallelized target code. Targets include NVIDIA GPUs (through CUDA), AMD GPUs (through OpenCL) and x86 CPUs (through OpenCL and OpenMP). Bones is open-source, written in the Ruby programming language, and is available through our website. The compiler is based on the C-parser CAST, which is used to parse the input code into an abstract syntax tree (AST) and to generate the target code from a transformed AST."

http://parse.ele.tue.nl/research/bones/

DataMPI

"DataMPI is an efficient, flexible, and productive communication library, which provides a set of key-value pair based communication interfaces that extend MPI for Big Data. Through utilizing the efficient communication technologies in the High-Performance Computing area, DataMPI can speed up the emerging data intensive computing applications. DataMPI takes a step in bridging the two fields of HPC and Big Data.

DataMPI can support multiple modes for various Big Data Computing applications, including Common, MapReduce, Streaming, and Iteration. The current version implements the functionalities and features of the Common mode, which aims to support the single program, multiple data (SPMD) applications. The remaining modes will be released in the future."

http://datampi.org/

hiCL

"A high level OpenCL abstraction layer for C/C++ and Fortran scientific computing. hiCL is a C/C++ and Fortran wrapper that makes it easier to use OpenCL for scientific computing. Writing an OpenCL code involves hundreds of lines of host code. In scientific computing, host code is usually cumbersome and very verbose, in such a manner that scientists would spend more time putting together the host code rather than focusing on accelerating their workloads on GPUs or on any other OpenCL capable hardware.

hiCL extensively reduces the need to focus on the host code and offers a set of functionalities in C/C++ and Fortran to help efficiently exploit hardware accelerators for scientific computing.

hiCL offers a transparent way to manage memory objects on different hardware accelerators with different memory models thanks to a set of bitwise flags. A paper about hiCL was published at the International OpenCL Workshop in Vienna, Austria in late April 2016."


https://github.com/issamsaid/hiCL 

https://github.com/issamsaid/ezCU 

SHTOOLS

"An archive of Fortran 95 and Python software that can be used to perform spherical harmonic transforms and reconstructions, rotations of data expressed in spherical harmonics, and multitaper spectral analyses on the sphere."

https://github.com/SHTOOLS/SHTOOLS 

http://shtools.ipgp.fr/ 

LFRic

"LFRic is the name given to the programme of work that is intended to deliver a replacement for the Unified Model towards the end of this decade. The name LFRic is chosen in recognition of Lewis Fry Richardson, whose fantasy of parallel computation of weather is nearing its centenary. LFRic is pronounced elfrick.

Currently LFRic is a Met Office-led project working closely with the GungHo project. In turn, the GungHo project is a collaboration between the Met Office, NERC and STFC to develop a new and scalable dynamical core to replace ENDGame, which is the dynamical core used in the current Met Office Unified Model (UM). GungHo is a project within the JWCRP framework.

The key challenges of GungHo and LFRic are to deliver models that produce good weather forecasts and climate simulations, and that run effectively on future generations of supercomputers. At the moment we cannot be sure what these computers will be like, but it is likely that they will comprise many hundreds of thousands of individual compute cores, each with a relatively small amount of local memory.

The designs of GungHo and LFRic will therefore focus on enabling calculations to be broken down into chunks that are sufficiently small to fit into the local memory of a processor, but large enough to enable the processor to effectively vectorise the calculations. The challenge is both scientific and technical."

https://puma.nerc.ac.uk/trac/GungHo/wiki/LFRic

https://puma.nerc.ac.uk/trac/GungHo/wiki/PSyclone

PSyclone - https://www2.cisl.ucar.edu/sites/default/files/Ford_Slides.pdf

Adventures with Automatic Code Transformation - https://www2.cisl.ucar.edu/sites/default/files/Maynard_Slides.pdf 

WRF

"The Weather Research and Forecasting (WRF) Model is a next-generation mesoscale numerical weather prediction system designed for both atmospheric research and operational forecasting needs. It features two dynamical cores, a data assimilation system, and a software architecture facilitating parallel computation and system extensibility. The model serves a wide range of meteorological applications across scales from tens of meters to thousands of kilometers"

http://www.wrf-model.org/index.php

WRF in Docker - https://github.com/NCAR/docker-wrf

KGen - Fortran Kernel Generator

"A Python tool that extracts partial codes out of a large Fortran application and converts them into a standalone/verifiable/executable kernel."

https://github.com/NCAR/KGen

KGen in Practice - https://www2.cisl.ucar.edu/sites/default/files/Kim_Slides.pdf

Accelerated Climate Modeling for Energy (ACME)

"The Accelerated Climate Modeling for Energy (ACME) project is a newly launched project sponsored by the Earth System Modeling (ESM) program within U.S. Department of Energy's (DOE’s) Office of Biological and Environmental Research. ACME is an unprecedented collaboration among eight national laboratories and six partner institutions to develop and apply the most complete, leading-edge climate and Earth system models to challenging and demanding climate-change research imperatives. It is the only major national modeling project designed to address DOE mission needs to efficiently utilize DOE leadership computing resources now and in the future."

https://climatemodeling.science.energy.gov/projects/accelerated-climate-modeling-energy

Experiences with CUDA and OpenACC from Porting ACME to GPUs - https://www2.cisl.ucar.edu/sites/default/files/Norman_Slildes.pdf

High-Order Method Modeling Environment (HOMME)

"The High-Order Method Modeling Environment (HOMME) is a community model supported by the NSF and the DOE with contributions from NCAR, DOE laboratories and universities.

HOMME uses fully unstructured quadrilateral based finite element meshes on the sphere, such as the cubed-sphere mesh for quasi-uniform resolution.

Employing Spectral Element (SE) and Discontinuous Galerkin (DG) methods to solve the shallow water or the dry/moist primitive equations, HOMME is an extremely scalable and efficient dynamical core.

HOMME is the default dynamical core of the Community Atmosphere Model (CAM) and the Community Earth System Model (CESM)."

https://www.homme.ucar.edu/

 Many-Core Optimization of the CESM - https://www2.cisl.ucar.edu/sites/default/files/Dennis_Slides.pdf


Model for Prediction Across Scales (MPAS)

"The Model for Prediction Across Scales (MPAS) is a collaborative project for developing atmosphere, ocean, and other earth-system simulation components for use in climate, regional climate, and weather studies. The primary development partners are the climate modeling group at Los Alamos National Laboratory (COSIM) and the National Center for Atmospheric Research. Both primary partners are responsible for the MPAS framework, operators, and tools common to the applications; LANL has primary responsibility for the ocean model, and NCAR has primary responsibility for the atmospheric model.

The MPAS framework facilitates the rapid development and prototyping of models by providing infrastructure typically required by model developers, including high-level data types, communication routines, and I/O routines. By using MPAS, developers can leverage pre-existing code and focus more on development of their model."

https://github.com/MPAS-Dev/MPAS-Release

https://mpas-dev.github.io/

GPU Acceleration of NWP: Benchmark Kernels Web Page

"We are working to identify key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc., (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications."

http://www2.mmm.ucar.edu/wrf/WG2/GPU/

Numerical Ocean Modeling and Simulation with CUDA

No sign of the code anywhere, other than a research description from 2011 on Clupo's page at Cal Poly. Sent a query to Clupo on 8/19/16.

"ROMS is software that models and simulates an ocean region using a finite difference grid and time stepping. ROMS simulations can take from hours to days to complete due to the compute-intensive nature of the software. As a result, the size and resolution of simulations are constrained by the performance limitations of modern computing hardware. To address these issues, the existing ROMS code can be run in parallel with either OpenMP or MPI. In this work, we implement a new parallelization of ROMS on a graphics processing unit (GPU) using CUDA Fortran. We exploit the massive parallelism offered by modern GPUs to gain a performance benefit at a lower cost and with less power. To test our implementation, we benchmark with idealistic marine conditions as well as real data collected from coastal waters near central California. Our implementation yields a speedup of up to 8x over a serial implementation and 2.5x over an OpenMP implementation, while demonstrating comparable performance to an MPI implementation."

http://hgpu.org/?p=7072

http://users.csc.calpoly.edu/~clupo/research/parcomp/

http://myroms.org/

Thursday, August 18, 2016

Omni Compiler Project

"Omni compiler is a collection of programs and libraries that allow users to build code transformation compilers. Omni Compiler translates C and Fortran programs with XcalableMP and OpenACC directives into parallel code suitable for compiling with a native compiler linked with the Omni Compiler runtime library. Moreover, Omni Compiler supports the XcalableACC programming model for accelerated cluster systems.

The Omni XcalableMP compiler is a source-to-source compiler that translates an XMP/C or XMP/Fortran code into a parallel code using an XcalableMP runtime library. The parallel code is compiled by the native compiler of the machine (e.g. Cray, PGI, Intel, gcc and so on). The Omni XcalableMP compiler supports most of the latest XcalableMP specification.

The Omni OpenACC compiler is an open-source OpenACC compiler that translates C code with OpenACC directives to C code with the CUDA API. It is implemented by using the Omni compiler infrastructure. It supports most of the OpenACC specification version 1.0.

The Omni XcalableACC compiler translates C code with XMP and OpenACC directives to C code with the CUDA API.

XcodeML is an intermediate code written in XML for the C and Fortran languages."

http://omni-compiler.org/

https://github.com/omni-compiler/omni-compiler

http://omni-compiler.org/download/papers/PASC2016.pdf

CLAW Fortran Compiler - https://github.com/C2SM-RCM/claw-compiler

Sector/Sphere

"Sector/Sphere supports distributed data storage, distribution, and processing over large clusters of commodity computers, either within a data center or across multiple data centers. Sector is a high performance, scalable, and secure distributed file system. Sphere is a high performance parallel data processing engine that can process Sector data files on the storage nodes with very simple programming interfaces."

http://sector.sourceforge.net/

http://udt.sourceforge.net/

GPUSPH

"The Smoothed Particle Hydrodynamics (SPH) method is a Lagrangian method for fluid flow simulation. In SPH the continuous medium is discretised into a set of particles that interact with each other and move at the fluid's velocity.

GPUSPH was the first implementation of SPH to run entirely on GPU with CUDA and aims to provide a basis for cutting edge SPH simulations."

 http://www.gpusph.org/

http://hgpu.org/?p=8796

Glasgow Model Coupling Framework (GMCF)

"The aim of GMCF is to make Model Coupling easier and more suited to modern heterogeneous manycore architectures. Our approach is to use modern language, compiler and runtime technologies so that in the end the user only has to write a scenario to realise the coupling.

GMCF consists of a run-time thread pool with FIFO communication, based on my GPRM framework (formerly called Gannet). On top of this, I added a Fortran-C++ integration code generator and the actual model coupling API. This API has three levels and ultimately accesses the GPRM API.

You need a compiler with OpenMP support and OpenCL for the device you want to use. I have used gcc's OpenMP, NVIDIA's OpenCL for GPU and Intel's OpenCL for the CPU."

https://github.com/wimvanderbauwhede/gmcf

http://hgpu.org/?p=13842

AutoParallel-Fortran

"Domain-specific, automatically parallelising Fortran compiler that takes scientific Fortran as input and produces parallel Fortran/OpenCL."

https://github.com/KombuchaShroomz/AutoParallel-Fortran

easyWave

"An application that is used to simulate tsunami generation and propagation in the context of early warning. It makes use of GPU acceleration to speed up the calculations."

http://trac.gfz-potsdam.de/easywave

http://hgpu.org/?p=14024

Exoclimes Simulation Platform

"The Exoclimes Simulation Platform (ESP) was born from a necessity to move beyond Earth-centric climate simulators, to provide the scientific community studying exoplanets with an ultra-fast, open-source and all-encompassing set of simulational tools. The ESP is designed to harness the power of GPU (graphics processing unit) computing, which affords a speed-up at the order-of-magnitude level. Our ultimate goal is for ESP users to become co-developers."

http://www.exoclime.net/

 HELIOS

 GPU-Accelerated Radiative Transfer Code For Exoplanetary Atmospheres - https://github.com/exoclime/HELIOS

http://arxiv.org/abs/1606.05474

THOR

Atmospheric fluid dynamics solver optimized for GPUs. - https://github.com/exoclime/THOR

http://arxiv.org/abs/1607.05535

Global Arrays (GA)

"Global Arrays (GA) is a Partitioned Global Address Space (PGAS) programming model. It provides primitives for one-sided communication (Get, Put, Accumulate) and Atomic Operations (read increment). It supports blocking and non-blocking primitives, and supports location consistency.

The library was developed by scientists at Pacific Northwest National Laboratory for parallel computing. GA provides a friendly API for shared-memory programming on distributed-memory computers for multidimensional arrays. The GA library is a predecessor to the GAS (global address space) languages currently being developed for high-performance computing.

The GA toolkit has additional libraries including a Memory Allocator (MA), Aggregate Remote Memory Copy Interface (ARMCI), and functionality for out-of-core storage of arrays (ChemIO). Although GA was initially developed to run with TCGMSG, a message passing library that came before the MPI standard (Message Passing Interface), it is now fully compatible with MPI. GA includes simple matrix computations (matrix-matrix multiplication, LU solve) and works with ScaLAPACK."

https://en.wikipedia.org/wiki/Global_Arrays

http://hpc.pnl.gov/globalarrays/

DRBL

"DRBL (Diskless Remote Boot in Linux) is a free software, open source solution for managing the deployment of the GNU/Linux operating system across many clients. Imagine the time required to install GNU/Linux on 40, 30, or even 10 client machines individually! DRBL allows for the configuration of all of your client computers by installing just one server machine.

DRBL provides a diskless or systemless environment for client machines. It works on Debian, Ubuntu, Red Hat, Fedora, CentOS and SuSE. DRBL uses distributed hardware resources and makes it possible for clients to fully access local hardware. It also includes Clonezilla, a partitioning and disk cloning utility similar to True Image® or Norton Ghost®.

DRBL uses PXE/etherboot, NFS, and NIS to provide services to client machines so that it is not necessary to install GNU/Linux on the client hard drives individually. Once the server is ready to be a DRBL server, the client machines can boot via PXE/etherboot (diskless). "DRBL" does NOT touch the client hard drives; therefore, other operating systems (e.g. MS Windows) installed on the client machines will be unaffected. This could be useful, for example, during a phased deployment of GNU/Linux where users still want the option of booting to Windows and running some applications only available on MS Windows. DRBL allows great flexibility in the deployment of GNU/Linux."

 http://drbl.sourceforge.net/

Clonezilla

"Clonezilla is a partition and disk imaging/cloning program similar to True Image® or Norton Ghost®. It helps you to do system deployment, bare metal backup and recovery. Two types of Clonezilla are available, Clonezilla live and Clonezilla SE (server edition). Clonezilla live is suitable for single machine backup and restore. While Clonezilla SE is for massive deployment, it can clone many (40 plus!) computers simultaneously. Clonezilla saves and restores only used blocks in the hard disk. This increases the clone efficiency. With some high-end hardware in a 42-node cluster, a multicast restoring at rate 8 GB/min was reported."

http://www.clonezilla.org/

CUDA

"CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing – an approach known as GPGPU. The CUDA platform is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements, for the execution of compute kernels.

The CUDA platform is designed to work with programming languages such as C, C++ and Fortran. This accessibility makes it easier for specialists in parallel programming to utilize GPU resources, as opposed to previous API solutions like Direct3D and OpenGL, which required advanced skills in graphics programming. Also, CUDA supports programming frameworks such as OpenACC and OpenCL. When it was first introduced by NVIDIA, the name CUDA was an acronym for Compute Unified Device Architecture, but NVIDIA subsequently dropped the use of the acronym."

NVIDIA CUDA - http://www.nvidia.com/object/cuda_home_new.html

https://en.wikipedia.org/wiki/CUDA

NVIDIA CUDA Forums - https://devtalk.nvidia.com/

stackoverflow - http://stackoverflow.com/questions/tagged/cuda

Wednesday, August 17, 2016

MVAPICH

An implementation of the MPI standard developed by Ohio State University.  MVAPICH2 is available in several variations:

MVAPICH2, with support for InfiniBand, iWARP, RoCE, and Intel Omni-Path

MVAPICH2-X, with support for PGAS and OpenSHMEM

MVAPICH2-GDR, with support for InfiniBand and NVIDIA CUDA GPUs

MVAPICH2-MIC, with support for InfiniBand and Intel MIC

MVAPICH2-Virt, with support for InfiniBand and SR-IOV

MVAPICH2-EA, which is energy-aware and supports InfiniBand, iWARP, and RoCE

http://mvapich.cse.ohio-state.edu/

MVAPICH2-GDR: Pushing the Frontier of Designing MPI Libraries Enabling GPUDirect Technologies - http://on-demand.gputechconf.com/gtc/2016/presentation/s6418-dk-panda-gpus-pgas-openshmem.pdf

SHMEM

SHMEM is a communications library that is used for Partitioned Global Address Space (PGAS) style programming. PGAS is a parallel programming model: it assumes a global memory address space that is logically partitioned, with a portion of it local to each process, thread, or processing element. The novelty of PGAS is that the portions of the shared memory space may have an affinity for a particular process, thereby exploiting locality of reference.

The PGAS model is the basis of Unified Parallel C, Coarray Fortran, Split-C, Fortress, Chapel, X10, UPC++, Global Arrays, DASH and SHMEM. In standard Fortran, this model is now an integrated part of the language (as of Fortran 2008). PGAS attempts to combine the advantages of an SPMD programming style for distributed memory systems (as employed by MPI) with the data referencing semantics of shared memory systems. This is more realistic than the traditional shared memory approach of one flat address space, because hardware-specific data locality can be modeled in the partitioning of the address space.

The key features of SHMEM include one-sided point-to-point and collective communication, a shared memory view, and atomic operations that operate on globally visible, or “symmetric”, variables in the program. Such symmetric variables can be either per-process globals (on the heap) or allocated by SHMEM (from a special symmetric heap) at runtime. These features allow the use of remote direct memory access (RDMA) when supported by the network.

Modern commodity networking technology such as Infiniband enables the efficient use of this PGAS library on a wide range of platforms. Runtime systems that have been developed for PGAS languages can easily be used by SHMEM library implementations.

https://en.wikipedia.org/wiki/SHMEM 

https://en.wikipedia.org/wiki/Partitioned_global_address_space
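As a loose, single-machine illustration of the one-sided model above (not SHMEM itself; Python's stdlib `multiprocessing.shared_memory` stands in for the symmetric heap): one handle writes into the segment and another handle, attached by name as a remote peer would be, reads it back with no matching receive call on the other side.

```python
from multiprocessing import shared_memory

# Create a small "symmetric" segment, then attach a second handle by name.
shm = shared_memory.SharedMemory(create=True, size=8)
peer = shared_memory.SharedMemory(name=shm.name)   # attach, as a peer would

shm.buf[:4] = bytes([1, 2, 3, 4])   # one-sided "put" into the shared segment
got = bytes(peer.buf[:4])           # one-sided "get" through the other handle

peer.close()
shm.close()
shm.unlink()
print(got)  # b'\x01\x02\x03\x04'
```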


OpenSHMEM

OpenSHMEM is a standard for SHMEM library implementations. Many SHMEM libraries exist but they do not conform to a particular standard and have similar but not identical APIs and behavior, which hinders portability.

The present reference implementation of OpenSHMEM is based on GASNet.

http://openshmem.org/site/

https://github.com/openshmem-org

https://gasnet.lbl.gov/

 Bringing NVIDIA GPUs to the PGAS/OpenSHMEM World - http://on-demand.gputechconf.com/gtc/2016/presentation/s6418-dk-panda-gpus-pgas-openshmem.pdf

GASNet

"GASNet is a language-independent, low-level networking layer that provides network-independent, high-performance communication primitives tailored for implementing parallel global address space SPMD languages and libraries such as UPC, UPC++, Co-Array Fortran, Legion, Chapel, and many others. The interface is primarily intended as a compilation target and for use by runtime library writers (as opposed to end users), and the primary goals are high performance, interface portability, and expressiveness. GASNet stands for "Global-Address Space Networking". 

 The design of GASNet is partitioned into two layers to maximize porting ease without sacrificing performance: the lower level is a narrow but very general interface called the GASNet core API - the design is based heavily on Active Messages, and is implemented directly on top of each individual network architecture. The upper level is a wider and more expressive interface called the GASNet extended API, which provides high-level operations such as remote memory access and various collective operations."

http://gasnet.lbl.gov/

https://github.com/mbdriscoll/pygas
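The core API's Active Messages idea above (a message names a handler to run on arrival, rather than matching a posted receive) can be mimicked in a few lines of stdlib Python; the handler table and message format here are invented purely for illustration.

```python
import queue

# "Active messages" carry a handler id plus arguments; the receiving side
# dispatches to a registered handler instead of matching an explicit receive.
handlers = {}

def register(msg_id):
    def wrap(fn):
        handlers[msg_id] = fn
        return fn
    return wrap

@register("incr")
def on_incr(state, amount):
    state["counter"] += amount

network = queue.Queue()          # stands in for the wire
network.put(("incr", 5))
network.put(("incr", 2))

state = {"counter": 0}
while not network.empty():
    msg_id, arg = network.get()
    handlers[msg_id](state, arg) # dispatch on arrival
print(state["counter"])  # 7
```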

rCUDA

"rCUDA, which stands for Remote CUDA, is a type of middleware software framework for remote GPU virtualization. Fully compatible with the CUDA application programming interface (API), it allows the allocation of one or more CUDA-enabled GPUs to a single application. Each GPU can be part of a cluster or running inside of a virtual machine. The approach is aimed at improving performance in GPU clusters that are lacking full utilization. GPU virtualization reduces the number of GPUs needed in a cluster, and in turn, leads to a lower cost configuration – less energy, acquisition, and maintenance.

 The recommended distributed acceleration architecture is a high performance computing cluster with GPUs attached to only a few of the cluster nodes. When a node without a local GPU executes an application needing GPU resources, remote execution of the kernel is supported by data and code transfers between local system memory and remote GPU memory. rCUDA is designed to accommodate this client-server architecture. On one end, clients employ a library of wrappers to the high-level CUDA Runtime API, and on the other end, there is a network listening service that receives requests on a TCP port. Several nodes running different GPU-accelerated applications can concurrently make use of the whole set of accelerators installed in the cluster. The client forwards the request to one of the servers, which accesses the GPU installed in that computer and executes the request in it. Time-multiplexing the GPU, or in other words sharing it, is accomplished by spawning different server processes for each remote GPU execution request."

https://en.wikipedia.org/wiki/RCUDA

http://www.rcuda.net/

Reducing Remote GPU Execution's Overhead with mrCUDA - http://on-demand.gputechconf.com/gtc/2016/posters/GTC_2016_Supercomputing_SC_08_P6130_WEB.pdf
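The client/server split described above can be caricatured with nothing but the Python stdlib: a listener that owns the (pretend) accelerator executes requests arriving over TCP, while the client ships work across the socket instead of computing locally. Everything here (the `square` operation, the wire format) is made up for illustration.

```python
import socket, threading

def server(sock):
    # The "server" side: owns the pretend accelerator and executes requests.
    conn, _ = sock.accept()
    with conn:
        data = conn.recv(1024)                 # e.g. b"square 7"
        op, arg = data.decode().split()
        result = int(arg) ** 2 if op == "square" else 0
        conn.sendall(str(result).encode())

listener = socket.socket()
listener.bind(("127.0.0.1", 0))                # any free port
listener.listen(1)
threading.Thread(target=server, args=(listener,)).start()

# The "client" side: forwards the request instead of executing it locally.
client = socket.create_connection(listener.getsockname())
client.sendall(b"square 7")
reply = int(client.recv(1024))
client.close()
listener.close()
print(reply)  # 49
```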

gpucc: An Open Source GPGPU Compiler

"We'll present gpucc, an LLVM-based, fully open-source, CUDA-compatible compiler for high performance computing. Its Clang-based front-end supports modern language features such as those in C++11 and C++14. Its compile time is faster than nvcc. It generates better code than nvcc on key end-to-end internal benchmarks and is on par with nvcc on a variety of open-source benchmarks."

Compiling CUDA C/C++ with LLVM -
http://llvm.org/docs/CompileCudaWithLLVM.html

Hacker News Thread - https://news.ycombinator.com/item?id=11565036

GPUCC: An Open-Source GPGPU Compiler - http://on-demand.gputechconf.com/gtc/2016/presentation/s6202-jingyue-wu-gpucc.pdf

StarPU

"StarPU is a software tool aiming to allow programmers to exploit the computing power of the available CPUs and GPUs, while relieving them from the need to specially adapt their programs to the target machine and processing units.

At the core of StarPU is its run-time support library, which is responsible for scheduling application-provided tasks on heterogeneous CPU/GPU machines. In addition, StarPU comes with programming language support, in the form of extensions to languages of the C family (C Extensions), as well as an OpenCL front-end (SOCL OpenCL Extensions).

StarPU's run-time and programming language extensions support a task-based programming model. Applications submit computational tasks, with CPU and/or GPU implementations, and StarPU schedules these tasks and associated data transfers on available CPUs and GPUs. The data that a task manipulates are automatically transferred among accelerators and the main memory, so that programmers are freed from the scheduling issues and technical details associated with these transfers.

StarPU takes particular care of scheduling tasks efficiently, using well-known algorithms from the literature (Task Scheduling Policy). In addition, it allows scheduling experts, such as compiler or computational library developers, to implement custom scheduling policies in a portable fashion (Defining A New Scheduling Policy)."
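A rough sketch of that task model: each "codelet" carries per-device implementations, and a scheduler picks the worker expected to finish the task earliest. This is a toy cost-model scheduler in Python for illustration, not StarPU's C API:

```python
class Codelet:
    def __init__(self, cpu_func, gpu_func, gpu_speedup):
        self.impls = {"cpu": cpu_func, "gpu": gpu_func}
        self.gpu_speedup = gpu_speedup   # assumed relative GPU speed for this task

class Machine:
    def __init__(self, workers):
        self.busy_until = {w: 0.0 for w in workers}   # per-worker completion clock

    def submit(self, codelet, data, cpu_cost):
        # Greedy policy: pick the worker that would finish this task earliest.
        finish = {}
        for w in self.busy_until:
            cost = cpu_cost if w.startswith("cpu") else cpu_cost / codelet.gpu_speedup
            finish[w] = self.busy_until[w] + cost
        best = min(finish, key=finish.get)
        self.busy_until[best] = finish[best]
        impl = codelet.impls["gpu" if best.startswith("gpu") else "cpu"]
        return best, impl(data)

# One task, two implementations doing the same math on different "devices".
scale = Codelet(cpu_func=lambda xs: [2 * x for x in xs],
                gpu_func=lambda xs: [2 * x for x in xs],
                gpu_speedup=10.0)

m = Machine(["cpu0", "cpu1", "gpu0"])
results = [m.submit(scale, [1, 2, 3], cpu_cost=1.0) for _ in range(3)]
```

While idle, the faster device wins every offer; once its queue grows long enough, the cost model starts routing tasks to the CPUs, which is the load-balancing behavior StarPU's schedulers automate.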

http://starpu.gforge.inria.fr/

 A Case Study on Programming Heterogeneous Multi-GPU with StarPU Library - http://on-demand.gputechconf.com/gtc/2016/posters/GTC_2016_Tools_and_Libraries_TL_01_P6104_WEB.pdf

ppOpen-HPC

"We propose an open source infrastructure for development and execution of optimized and reliable simulation code on post-peta-scale (pp) parallel computers based on many-core architectures. We have named this infrastructure “ppOpen-HPC”. ppOpen-HPC consists of various types of libraries, which cover various procedures for scientific computation. Source code developed on a PC with a single processor is linked with these libraries, and the parallel code generated is optimized for post-peta-scale systems. A capability for automatic tuning is important, and is seen as a critical technology for further development on new architectures, as well as for maintenance of the framework.

 Libraries in ppOpen-APPL, ppOpen-MATH, and ppOpen-SYS are called from user’s programs written in Fortran and C/C++ with MPI. All issues related to hybrid parallel programming models are “hidden” from users by ppOpen-AT."

http://ppopenhpc.cc.u-tokyo.ac.jp/ppopenhpc/

Utilization and Expansion of ppOpen-AT for OpenACC - http://on-demand.gputechconf.com/gtc/2016/posters/GTC_2016_Tools_and_Libraries_TL_06_P6163_WEB.pdf

GPU Computing with Apache Spark and Python

"We'll demonstrate how Python and the Numba JIT compiler can be used for GPU programming that easily scales from your workstation to an Apache Spark cluster. Using an example application, we show how to write CUDA kernels in Python, compile and call them using the open source Numba JIT compiler, and execute them both locally and remotely with Spark. We also describe techniques for managing Python dependencies in a Spark cluster with the tools in the Anaconda Platform. Finally, we conclude with some tips and tricks for getting best performance when doing GPU computing with Spark and Python."

http://on-demand.gputechconf.com/gtc/2016/presentation/s6413-stanley-seibert-apache-spark-python.pdf

Mesos

"Apache Mesos is an open-source cluster manager that was developed at the University of California, Berkeley. It "provides efficient resource isolation and sharing across distributed applications, or frameworks". The software enables resource sharing in a fine-grained manner, improving cluster utilization.

Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.

Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.

Mesos consists of a master daemon that manages agent daemons running on each cluster node, and Mesos frameworks that run tasks on these agents.

The master enables fine-grained sharing of resources (CPU, RAM, …) across frameworks by making them resource offers. The master decides how many resources to offer to each framework according to a given organizational policy, such as fair sharing or strict priority. To support a diverse set of policies, the master employs a modular architecture that makes it easy to add new allocation modules via a plugin mechanism.

A framework running on top of Mesos consists of two components: a scheduler that registers with the master to be offered resources, and an executor process that is launched on agent nodes to run the framework’s tasks (see the App/Framework development guide for more details about framework schedulers and executors). While the master determines how many resources are offered to each framework, the frameworks' schedulers select which of the offered resources to use. When a framework accepts offered resources, it passes to Mesos a description of the tasks it wants to run on them. In turn, Mesos launches the tasks on the corresponding agents."
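The offer cycle described above can be modeled in miniature: the master offers each agent's free resources to the framework schedulers in turn, and each framework hands back descriptions of the tasks it wants launched on the resources it accepts. This is illustrative Python only; real frameworks implement the Mesos scheduler API:

```python
class Master:
    def __init__(self, agents):
        self.free = dict(agents)            # agent -> free CPUs

    def offer_cycle(self, frameworks):
        launched = []
        for fw in frameworks:               # simple round-robin offer policy
            for agent, cpus in self.free.items():
                accepted = fw.resource_offer(agent, cpus)   # framework decides
                for task, used in accepted:
                    self.free[agent] -= used                # master launches tasks
                    launched.append((fw.name, agent, task))
        return launched

class GreedyFramework:
    def __init__(self, name, task_cpus, n_tasks):
        self.name, self.task_cpus, self.pending = name, task_cpus, n_tasks

    def resource_offer(self, agent, cpus):
        # Accept as much of the offer as there are pending tasks to run.
        accepted = []
        while self.pending and cpus >= self.task_cpus:
            accepted.append((f"{self.name}-task{self.pending}", self.task_cpus))
            cpus -= self.task_cpus
            self.pending -= 1
        return accepted

master = Master({"agent1": 4, "agent2": 2})
fws = [GreedyFramework("spark", task_cpus=2, n_tasks=2),
       GreedyFramework("mpi", task_cpus=1, n_tasks=3)]
launched = master.offer_cycle(fws)
```

Note the inversion of control that defines Mesos' two-level scheduling: the master decides *whom* to offer resources to, while each framework decides *which* offered resources to use.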

https://en.wikipedia.org/wiki/Apache_Mesos

http://mesos.apache.org/

http://mesos.apache.org/documentation/latest/frameworks/

https://github.com/mesosphere/marathon

Software that will run on Mesos includes:
  • Cray Chapel is a productive parallel programming language. The Chapel Mesos scheduler lets you run Chapel programs on Mesos.
  • Dpark is a Python clone of Spark, a MapReduce-like framework written in Python, running on Mesos.
  • Exelixi is a distributed framework for running genetic algorithms at scale.
  • Hadoop: running Hadoop on Mesos distributes MapReduce jobs efficiently across an entire cluster.
  • Hama is a distributed computing framework based on Bulk Synchronous Parallel computing techniques for massive scientific computations e.g., matrix, graph and network algorithms.
  • MPI is a message-passing system designed to function on a wide variety of parallel computers.
  • Spark is a fast and general-purpose cluster computing system which makes parallel jobs easy to write.
  • Storm is a distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.

MPI

" We have also run MPICH2 directly on top of Mesos. See MESOS_HOME/frameworks/mpi for this port. Basically it sets up the MPICH2 MPD ring for you when you use nmpiexec."

http://mesos.readthedocs.io/en/0.21.1/running-torque-or-mpi-on-mesos/

https://github.com/apache/mesos/tree/master/mpi

https://github.com/mesosphere/mesos-hydra

Spark

"Apache Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way. It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.

The availability of RDDs facilitates the implementation of both iterative algorithms, that visit their dataset multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications (compared to Apache Hadoop, a popular MapReduce implementation) may be reduced by several orders of magnitude. Among the class of iterative algorithms are the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark.

Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster), Hadoop YARN, or Apache Mesos. For distributed storage, Spark can interface with a wide variety, including Hadoop Distributed File System (HDFS), MapR File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3, Kudu, or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is run on a single machine with one executor per CPU core."
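The RDD idea is easy to model in miniature: transformations are lazy and only record lineage, actions force evaluation, and cache() pins the working set in memory so iterative jobs avoid recomputation. A toy single-machine sketch (not the PySpark API):

```python
class RDD:
    def __init__(self, compute):
        self._compute = compute     # closure that rebuilds this dataset (lineage)
        self._cache = None

    def map(self, f):               # transformations are lazy: extend lineage only
        return RDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, p):
        return RDD(lambda: [x for x in self._materialize() if p(x)])

    def cache(self):
        self._cache = self._materialize()   # pin the working set in memory
        return self

    def _materialize(self):
        return self._cache if self._cache is not None else self._compute()

    def reduce(self, f):            # actions force evaluation
        data = self._materialize()
        acc = data[0]
        for x in data[1:]:
            acc = f(acc, x)
        return acc

nums = RDD(lambda: list(range(10))).cache()   # "loaded" once, reused below
evens = nums.filter(lambda x: x % 2 == 0)
total = evens.map(lambda x: x * x).reduce(lambda a, b: a + b)
# sum of squares of 0, 2, 4, 6, 8 -> 120
```

The lineage closure also hints at Spark's fault-tolerance story: a lost partition can be rebuilt by replaying the recorded transformations rather than by checkpointing to disk.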

https://en.wikipedia.org/wiki/Apache_Spark

http://spark.apache.org/

 GPU Computing with Apache Spark and Python - http://on-demand.gputechconf.com/gtc/2016/presentation/s6413-stanley-seibert-apache-spark-python.pdf

 Apache Spark for Scientific Data at Scale - http://bcn.boulder.co.us/~neal/talks/spark-science-scale/

Numba

"Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.

Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware, and is designed to integrate with the Python scientific software stack."

http://numba.pydata.org/

https://github.com/numba/numba

https://eng.climate.com/2015/04/09/numba-vs-cython-how-to-choose/

https://www.continuum.io/blog/developer/accelerating-python-libraries-numba-part-1

GPU Computing with Apache Spark and Python - http://on-demand.gputechconf.com/gtc/2016/presentation/s6413-stanley-seibert-apache-spark-python.pdf

NumbaPro: High-Level GPU Programming in Python for Rapid Development - http://on-demand.gputechconf.com/gtc/2014/presentations/S4413-numbrapro-gpu-programming-python-rapid-dev.pdf

CUDA Programming

Numba supports CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model. Kernels written in Numba appear to have direct access to NumPy arrays. NumPy arrays are transferred between the CPU and the GPU automatically.

Numba supports CUDA-enabled GPUs with compute capability 2.0 or above with an up-to-date Nvidia driver.

You will need the CUDA toolkit installed. If you are using Conda, just type:

$ conda install cudatoolkit

Numba now contains preliminary support for CUDA programming. Numba will eventually provide multiple entry points for programmers of different levels of expertise on CUDA. For now, Numba provides a Python dialect for low-level programming on the CUDA hardware. It provides full control over the hardware for fine-tuning the performance of CUDA kernels.

The CUDA JIT is a low-level entry point to the CUDA features in Numba. It translates Python functions into PTX code which executes on the CUDA hardware. The jit decorator is applied to Python functions written in our Python dialect for CUDA. Numba interacts with the CUDA Driver API to load the PTX onto the CUDA device and execute it.

Most of the public API for CUDA features is exposed in the numba.cuda module:

from numba import cuda
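The execution model a Numba kernel targets can be illustrated without a GPU: each thread derives a global index from its block and thread IDs (the value numba.cuda.grid(1) returns inside a kernel) and handles one array element. The serial launcher below is a pure-Python stand-in for a kernel launch; on real hardware the iterations run concurrently:

```python
import math

def launch_1d(kernel, blocks, threads_per_block, *args):
    # Emulate the grid serially; a GPU runs these iterations in parallel.
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            # This is the index arithmetic behind cuda.grid(1):
            global_idx = block_idx * threads_per_block + thread_idx
            kernel(global_idx, *args)

def vector_add(i, x, y, out):
    if i < len(out):        # guard: the grid may be larger than the array
        out[i] = x[i] + y[i]

n = 10
x = list(range(n))
y = [10 * v for v in x]
out = [0] * n
threads = 4
blocks = math.ceil(n / threads)   # enough blocks to cover all n elements
launch_1d(vector_add, blocks, threads, x, y, out)
```

The bounds guard and the ceil-divide launch configuration carry over directly to real Numba CUDA kernels, where the same function body would be decorated with @cuda.jit and invoked as kernel[blocks, threads](x, y, out).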

Bibliography of GPUs in Climate Simulation

Atlas: A Parallel Framework for Earth System Modeling - https://drive.google.com/file/d/0B6jgPQA_2GE6UUxJUTN1enU5Yk0/view


GPU Acceleration of Numerical Weather Prediction - http://www.worldscientific.com/doi/abs/10.1142/S0129626408003557

Weather and climate prediction software has enjoyed the benefits of exponentially increasing processor power for almost 50 years. Even with the advent of large-scale parallelism in weather models, much of the performance increase has come from increasing processor speed rather than increased parallelism. This free ride is nearly over. Recent results also indicate that simply increasing the use of large-scale parallelism will prove ineffective for many scenarios where strong scaling is required. We present an alternative method of scaling model performance by exploiting emerging architectures using the fine-grain parallelism once used in vector machines. The paper shows the promise of this approach by demonstrating a nearly 10 × speedup for a computationally intensive portion of the Weather Research and Forecast (WRF) model on a variety of NVIDIA Graphics Processing Units (GPU). This change alone speeds up the whole weather model by 1.23×.

Using Compiler Directives to Port Large Scientific Applications to GPUs: An Example from Atmospheric Science - http://www.worldscientific.com/doi/abs/10.1142/S0129626414500030

 For many scientific applications, Graphics Processing Units (GPUs) can be an interesting alternative to conventional CPUs as they can deliver higher memory bandwidth and computing power. While it is conceivable to re-write the most execution time intensive parts using a low-level API for accelerator programming, it may not be feasible to do it for the entire application. But, having only selected parts of the application running on the GPU requires repetitively transferring data between the GPU and the host CPU, which may lead to a serious performance penalty. In this paper we assess the potential of compiler directives, based on the OpenACC standard, for porting large parts of code and thus achieving a full GPU implementation. As an illustrative and relevant example, we consider the climate and numerical weather prediction code COSMO (Consortium for Small Scale Modeling) and focus on the physical parametrizations, a part of the code which describes all physical processes not accounted for by the fundamental equations of atmospheric motion. We show, by porting three of the dominant parametrization schemes, the radiation, microphysics and turbulence parametrizations, that compiler directives are an efficient tool both in terms of final execution time as well as implementation effort. Compiler directives enable to port large sections of the existing code with minor modifications while still allowing for further optimizations for the most performance critical parts. With the example of the radiation parametrization, which contains the solution of a block tri-diagonal linear system, the required code modifications and key optimizations are discussed in detail. Performance tests for the three physical parametrizations show a speedup of between 3× and 7× for execution time obtained on a GPU and on a multi-core CPU of an equivalent generation.


GPU Accelerated Discontinuous Galerkin Methods for Shallow Water Equations - https://www.cambridge.org/core/journals/communications-in-computational-physics/article/gpu-accelerated-discontinuous-galerkin-methods-for-shallow-water-equations/B1A0C93B3B4074B99D1313F3FF670F13

We discuss the development, verification, and performance of a GPU accelerated discontinuous Galerkin method for the solutions of two dimensional nonlinear shallow water equations. The shallow water equations are hyperbolic partial differential equations and are widely used in the simulation of tsunami wave propagations. Our algorithms are tailored to take advantage of the single instruction multiple data (SIMD) architecture of graphic processing units. The time integration is accelerated by local time stepping based on a multi-rate Adams-Bashforth scheme. A total variational bounded limiter is adopted for nonlinear stability of the numerical scheme. This limiter is coupled with a mass and momentum conserving positivity preserving limiter for the special treatment of a dry or partially wet element in the triangulation. Accuracy, robustness and performance are demonstrated with the aid of test cases. Furthermore, we developed a unified multi-threading model OCCA. The kernels expressed in OCCA model can be cross-compiled with multi-threading models OpenCL, CUDA, and OpenMP. We compare the performance of the OCCA kernels when cross-compiled with these models.

Weather, Climate, and Earth System Modeling on Intel HPC Architectures (2015) - https://www2.cisl.ucar.edu/sites/default/files/Mills_Slides.pdf


Update on GPU Acceleration of Earth System Models (2015) - https://www2.cisl.ucar.edu/sites/default/files/Ponder_Slides.pdf

Using the GPU to Predict Drift in the Ocean - http://on-demand.gputechconf.com/gtc/2016/posters/GTC_2016_Earth_Systems_Modelling_ESM_03_P6193_WEB.pdf

Co-Designing GPU-Based Systems and Tools for Numerical Weather Predictions - http://on-demand.gputechconf.com/gtc/2016/presentation/s6628-thoms-schulthess-co-designing-gpu-based-systems-and-tools.pdf

Development of a hybrid parallel MCV-based high-order global shallow-water model - http://link.springer.com/article/10.1007/s11227-017-1958-1

Utilization of high-order spatial discretizations is an important trend in developing global atmospheric models. As a competitive choice, the multi-moment constrained volume (MCV) method can achieve high accuracy while maintaining similar parallel scalability to classical finite volume methods. In this work, we introduce the development of a hybrid parallel MCV-based global shallow-water model on the cubed-sphere grid. Based on a sequential code, we perform parallelization on both the process and the thread levels. To enable process-level parallelism, we first decompose the six patches of the cubed-sphere in a same 2-D partition and then employ a conflict-free pipe-flow communication scheme for overlapping the halo exchange with computations. To further exploit the heterogeneous computing capacity of an Intel Xeon Phi accelerated supercomputer, we propose a guided panel-based inner–outer partition to distribute workload among the CPUs and the coprocessors. In addition to the above, thread-level parallelism along with various optimizations is done on both the multi-core CPU and the many-core accelerator.

Graphics processing unit optimizations for the dynamics of the HIRLAM weather forecast model - http://onlinelibrary.wiley.com/doi/10.1002/cpe.2951/abstract

Programmable graphics processing units (GPUs) nowadays offer tremendous computational resources for diverse applications. In this paper, we present the implementation of the dynamics routine of the HIRLAM weather forecast model on the NVIDIA GTX 480. The original Fortran code has been converted manually to C and CUDA. Empirically, it is determined what the optimal number of grid points per thread is, and what the best thread and block structures are. A significant amount of the elapsed time consists of transferring data between CPU and GPU. To reduce the impact of these transfer costs, we overlap calculation and transfer of data using multiple CUDA streams. We developed an algorithm that enables our code generator CTADEL to generate automatically the optimal CUDA streams program. Experiments are performed to find out if the applicability of GPUs is useful for Numerical Weather Prediction, in particular for the dynamics part.

Directive-Based Parallelization of the NIM Weather Model for GPUs - http://ieeexplore.ieee.org/document/7081678/

 The NIM is a performance-portable model that runs on CPU, GPU and MIC architectures with a single source code. The single source plus efficient code design allows application scientists to maintain the Fortran code, while computer scientists optimize performance and portability using OpenMP, OpenACC, and F2CACC directives. The F2C-ACC compiler was developed in 2008 at NOAA's Earth System Research Laboratory (ESRL) to support GPU parallelization before commercial Fortran GPU compilers were available. Since then, a number of vendors have built GPU compilers that are compliant to the emerging OpenACC standard. The paper will compare parallelization and performance of NIM using the F2C-ACC, Cray and PGI Fortran GPU compilers.

Unified CPU+GPU Programming for the ASUCA Production Weather Model - http://on-demand.gputechconf.com/gtc/2016/presentation/s6621-michel-muller-unified-cpu-gpu-programming.pdf

Unleashing the Performance Potential of GPU for Atmospheric Dynamic Solvers - http://on-demand.gputechconf.com/gtc/2016/presentation/s6354-haohuan-fu-unleashing-the-performance-gpus.pdf

Task-Based Dynamic Scheduling Approach to Accelerating NASA's GEOS-5 - http://on-demand.gputechconf.com/gtc/2016/presentation/s6343-eric-kelmelis-nasa-geos-5.pdf

Parallelization and Performance of the NIM Weather Model on CPU, GPU and MIC Architectures - http://on-demand.gputechconf.com/gtc/2016/presentation/s6117-mark-govett-parallelization-performance-nim-weather-model.pdf

GPGPU Applications for Hydrological and Atmospheric Simulations and Visualizations on the Web - http://on-demand.gputechconf.com/gtc/2016/presentation/s6388-ibrahim-demir-gpgpu-hydrological-atmospheric-applications.pdf

GPU Effectiveness for DART - http://on-demand.gputechconf.com/gtc/2016/posters/GTC_2016_Supercomputing_SC_04_P6221_WEB.pdf

"Piz Daint" and "Piz Kesch": From General Purpose GPU-Accelerated Supercomputing to an Appliance for Weather Forecasting - http://on-demand.gputechconf.com/gtc/2016/presentation/s6683-thomas-schulthess-piz-daint-and-piz-kesch.pdf

Accelerating CICE on the GPU - http://on-demand.gputechconf.com/gtc/2015/presentation/S5322-Rob-Aulwes.pdf

Running the NIM Next-Generation Weather Model on GPUs - http://dl.acm.org/citation.cfm?id=1845128

 We are using GPUs to run a new weather model being developed at NOAA’s Earth System Research Laboratory (ESRL). The parallelization approach is to run the entire model on the GPU and only rely on the CPU for model initialization, I/O, and inter-processor communications. We have written a compiler to convert Fortran into CUDA, and used it to parallelize the dynamics portion of the model. Dynamics, the most computationally intensive part of the model, is currently running 34 times faster on a single GPU than the CPU. We also describe our approach and progress to date in running NIM on multiple GPUs.

 Experience Applying Fortran GPU Compilers to Numerical Weather Prediction - http://www.esrl.noaa.gov/gsd/ab/ac/Presentations/SAAHPC2011_HendersonTBtalk.pdf

Optimizing Weather Model Radiative Transfer Physics for Intel’s Many Integrated Core (MIC) Architecture - http://www.worldscientific.com/doi/abs/10.1142/S0129626416500195

Large numerical weather prediction (NWP) codes such as the Weather Research and Forecast (WRF) model and the NOAA Nonhydrostatic Multiscale Model (NMM-B) port easily to Intel's Many Integrated Core (MIC) architecture. But for NWP to significantly realize MIC’s one- to two-TFLOP/s peak computational power, we must expose and exploit thread and fine-grained (vector) parallelism while overcoming memory system bottlenecks that starve floating-point performance. We report on our work to improve the Rapid Radiative Transfer Model (RRTMG), responsible for 10-20 percent of total NMM-B run time. We isolated a standalone RRTMG benchmark code and workload from NMM-B and then analyzed performance using hardware performance counters and scaling studies. We restructured the code to improve vectorization, thread parallelism, locality, and thread contention. The restructured code ran three times faster than the original on MIC and, also importantly, 1.3x faster than the original on the host Xeon Sandybridge.

Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code - http://www.sciencedirect.com/science/article/pii/S0010465516300959

"Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor."

High performance Python for direct numerical simulations of turbulent flows - http://www.sciencedirect.com/science/article/pii/S0010465516300200

"Direct Numerical Simulations (DNS) of the Navier Stokes equations is an invaluable research tool in fluid dynamics. Still, there are few publicly available research codes and, due to the heavy number crunching implied, available codes are usually written in low-level languages such as C/C++ or Fortran. In this paper we describe a pure scientific Python pseudo-spectral DNS code that nearly matches the performance of C++ for thousands of processors and billions of unknowns. We also describe a version optimized through Cython, that is found to match the speed of C++. The solvers are written from scratch in Python, both the mesh, the MPI domain decomposition, and the temporal integrators. The solvers have been verified and benchmarked on the Shaheen supercomputer at the KAUST supercomputing laboratory, and we are able to show very good scaling up to several thousand cores.
A very important part of the implementation is the mesh decomposition (we implement both slab and pencil decompositions) and 3D parallel Fast Fourier Transforms (FFT). The mesh decomposition and FFT routines have been implemented in Python using serial FFT routines (either NumPy, pyFFTW or any other serial FFT module), NumPy array manipulations and with MPI communications handled by MPI for Python (mpi4py). We show how we are able to execute a 3D parallel FFT in Python for a slab mesh decomposition using 4 lines of compact Python code, for which the parallel performance on Shaheen is found to be slightly better than similar routines provided through the FFTW library. For a pencil mesh decomposition 7 lines of code is required to execute a transform."

Implementation of the Vanka-type multigrid solver for the finite element approximation of the Navier–Stokes equations on GPU - http://www.sciencedirect.com/science/article/pii/S0010465515004026

"The first attempts to solve the Navier–Stokes equations on the GPUs date back to 2005. In  [2], the authors implemented the SMAC method on rectangular grids. They approximate the velocity explicitly while the pressure implicitly. The linear system for pressure is solved by the Jacobi method. They achieved an overall speed-up more than 20 on GeForce FX 5900 and GeForce 6800 Ultra compared to Pentium IV running at 2 GHz. In  [3], the authors present a GPU implementation of a solver for the compressible Euler equations for hypersonic flow on complex domains. They applied a semi-implicit scheme with a finite difference discretization in space. The resulting linear system is solved by the multigrid method. It allowed a speed-up 15–40 using Core 2 Duo E6600 CPU at 2.4 GHz and GeForce 8800 GTX GPU. Both models were restricted to two dimensions.

The results of 3D simulations on a structured grid of quadrilateral elements were presented in  [4] with a speed-up 29 and 16 in 2D and 3D respectively (on Core 2 Duo at 2.33 GHz and GeForce 8800 GTX). 3D computations of inviscid compressible flow using unstructured grids were studied in  [5]. The cell-centered finite volume method with the explicit Runge–Kutta timestepping was implemented and a speed-up 33 was achieved (on Intel Core 2 Q9450 and Nvidia Tesla card). The multigrid method for the Full Approximation Scheme in 2D was implemented on GPU in  [6] with a speed-up 10. A computation on a multi-GPU system was presented in  [7]. On four GPUs, a speed-up was 100 (on AMD Opteron 2.4 GHz and four Nvidia Tesla S870 cards). A two-phase flow solver for the Navier–Stokes equations was partially ported to GPU in  [8]. On a GPU cluster with eight Nvidia Tesla S1070 GPUs made of two workstations equipped with Intel Core i7-920 at 2.66 GHz, the authors achieved a speed-up up to 115."
