Sunday, January 21, 2018

kubed-sh

"The Kubernetes distributed shell for the casual cluster user.  In a nutshell, kubed-sh (pronunciation) lets you execute a program in a Kubernetes cluster without having to create a container image or learn new commands.

In addition to launching (Linux ELF) binaries directly, the following interpreted environments are currently supported:
  • When you enter node script.js, a Node.js (default version: 9.4) environment is provided and script.js is executed.
  • When you enter python script.py, a Python (default version: 3.6) environment is provided and the script.py is executed.
  • When you enter ruby script.rb, a Ruby (default version: 2.5) environment is provided and the script.rb is executed.
Note that kubed-sh is a proper shell environment, that is, you can expect features such as auto-complete, history operations, or CTRL+L clearing the screen to work as per usual."

http://kubed.sh/

https://github.com/mhausenblas/kubed-sh

https://www.youtube.com/watch?v=gqi1-XLiq-o

Tuesday, January 16, 2018

Manticore

"Manticore is a high-level parallel programming language aimed at general-purpose applications running on multi-core processors. Manticore supports parallelism at multiple levels: explicit concurrency and coarse-grain parallelism via CML-style constructs and fine-grain parallelism via various light-weight notations, such as parallel tuple expressions and NESL/Nepal-style parallel array comprehensions.

We have been working on a compiler and runtime system for Manticore since the beginning of 2007. Currently we have most of the parallel features implemented and running on Linux and MacOS X on the x86-64 (a.k.a. AMD64) architecture. Our current implementation efforts are focused on performance tuning, extending the language implementation with NESL-style flattening, and adding mutable state cleanly."
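Manticore's own syntax is ML-based, but the flavor of a NESL-style parallel array comprehension can be sketched in Python: each element of the result is independent, so a runtime is free to evaluate them concurrently. This is only an analogy, not Manticore code.

```python
# Analogy only: a "parallel array comprehension" evaluated with a process pool.
# The independence of the elements is what lets a runtime compute them in parallel.
from concurrent.futures import ProcessPoolExecutor

def f(x):
    return x * x  # placeholder element-wise computation

if __name__ == "__main__":
    xs = range(16)
    with ProcessPoolExecutor() as pool:
        ys = list(pool.map(f, xs))  # roughly [| f(x) | x in xs |]
    print(ys)
```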

http://manticore.cs.uchicago.edu/

Legion

"Legion is a data-centric programming model for writing high-performance applications for distributed heterogeneous architectures. Making the programming system aware of the structure of program data gives Legion programs three advantages:
  • User-Specification of Data Properties: Legion provides abstractions for programmers to explicitly declare properties of program data including organization, partitioning, privileges, and coherence. Unlike current programming systems in which these properties are implicitly managed by programmers, Legion makes them explicit and provides the implementation for programmers.
  • Automated Mechanisms: current programming models require developers to explicitly specify parallelism and issue data movement operations. Both responsibilities can easily lead to the introduction of bugs in complex applications. By understanding the structure of program data and how it is used, Legion can implicitly extract parallelism and issue the necessary data movement operations in accordance with the application-specified data properties, thereby removing a significant burden from the programmer.
  • User-Controlled Mapping: by providing abstractions for representing both tasks and data, Legion makes it easy to describe how to map applications onto different architectures. Legion provides a mapping interface which gives programmers direct control over all the details of how an application is mapped and executed. Furthermore, Legion’s understanding of program data makes the mapping process orthogonal to correctness. This simplifies program performance tuning and enables easy porting of applications to new architectures.
Applications targeting Legion have the option of either being written in the Regent programming language or written directly to the Legion C++ runtime interface. Applications written in Regent are compiled to LLVM (and call a C wrapper for the C++ runtime API).

The Legion runtime system implements the Legion programming model and supports all the necessary API calls for writing Legion applications. Mappers are special C++ objects that are built on top of the Legion mapping interface which is queried by the Legion runtime system to make all mapping decisions when executing a Legion program. Applications can either choose to use the default Legion mapper or write custom mappers for higher performance.

The Legion runtime system sits on top of a low-level runtime interface called Realm. The Realm interface is designed to provide portability to the entire Legion system by providing primitives which can be implemented on a wide range of architectures. Realm is a modular runtime and supports a variety of underlying technologies for portability across a variety of machines, including GASNet for high-performance networking on a variety of interconnects and CUDA for GPUs."

http://legion.stanford.edu/overview/index.html

pkgsrc

"pkgsrc (package source) is a package management system for Unix-like operating systems. It was forked from the FreeBSD ports collection in 1997 as the primary package management system for NetBSD. Since then it has evolved independently: in 1999, support for Solaris was added, later followed by support for other operating systems. DragonFly BSD, from release 1.4 to 3.4, used pkgsrc as its official packaging system.[2] MINIX 3 and the Dracolinux distribution both include pkgsrc in their main releases.[3] SmartOS is another operating system using pkgsrc as the main form of package management.

There are multiple ways to install programs using pkgsrc. The pkgsrc bootstrap contains a traditional ports collection that utilizes a series of makefiles to compile software from source. Another method is to install pre-built binary packages via the pkg_add and pkg_delete tools. A high-level utility named pkgin also exists, and is designed to automate the installation, removal, and update of binary packages in a manner similar to APT or yum.[4]

pkgsrc currently contains over 17000 packages (over 20000 including work-in-progress packages maintained outside the official tree) and includes most popular open source software. It now supports around 23 operating systems, including AIX, various BSD derivatives, HP-UX, IRIX, Linux, macOS, Solaris, and QNX."
 
https://www.pkgsrc.org/

https://en.wikipedia.org/wiki/Pkgsrc

GDF

"The basic approach for the GPU Data Frame (GDF) is pretty simple: if applications and libraries agree on an in-memory data format for tabular data and associate metadata, then just a device pointer to the data structure need be exchanged. Additionally, the IPC mechanism built into the CUDA driver allows device pointers to be moved between processes.

Currently, the GDF format is a subset of the Apache Arrow specification. The precise subset has not been fully defined yet, but currently includes numerical columns, and will soon include dictionary-encoded columns (sometimes called "categorical" columns in other data frame systems).

Fundamentally, one can implement GDF by following the Arrow specification directly. In some cases, that is the easiest approach. However, there are some common operations that we expect many GDF-supporting applications will need. To help jumpstart other GDF users, we are working to develop several layers of GDF functionality that can be reused in other projects:
[Much of this functionality is still in progress...]
  • libgdf: A C library of helper functions, including:
    • Copying the GDF metadata block to the host and parsing it to a host-side struct. (typically needed for function dispatch)
    • Importing/exporting a GDF using the CUDA IPC mechanism.
    • CUDA kernels to perform element-wise math operations on GDF columns.
    • CUDA sort, join, and reduction operations on GDFs.
  • pygdf: A Python library for manipulating GDFs
    • Creating GDFs from numpy arrays and Pandas DataFrames
    • Performing math operations on columns
    • Import/export via CUDA IPC
    • Sort, join, reductions
    • JIT compilation of group by and filter kernels using Numba
  • dask_gdf: Extension for Dask to work with distributed GDFs.
    • Same operations as pygdf, but working on GDFs chunked onto different GPUs and different servers."
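
A rough sketch of how the pygdf layer listed above is meant to be used (hedged: the exact API may differ between releases; this assumes pygdf exposes a DataFrame with from_pandas/to_pandas converters and column arithmetic, as described in its documentation):

```python
# Hedged sketch, not verified against a specific pygdf release: move a small
# pandas DataFrame onto the GPU as a GDF, do column-wise math there, and copy
# the result back to the host for inspection.
import pandas as pd
import pygdf  # the GPU Data Frame Python library described above

pdf = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})

gdf = pygdf.DataFrame.from_pandas(pdf)   # host -> device copy
gdf["z"] = gdf["x"] + gdf["y"]           # element-wise math on GPU columns

print(gdf.to_pandas())                   # device -> host copy
```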

https://github.com/gpuopenanalytics/libgdf/wiki

https://github.com/gpuopenanalytics/libgdf

Apache Arrow

"Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.

Arrow acts as a new high-performance interface between various systems. It is also focused on supporting a wide variety of industry-standard programming languages. Java, C, C++, Python, Ruby, and JavaScript implementations are in progress.

The reference Arrow implementations contain a number of distinct software components:
  • Columnar vector and table-like containers (similar to data frames) supporting flat or nested types
  • Fast, language agnostic metadata messaging layer (using Google's Flatbuffers library)
  • Reference-counted off-heap buffer memory management, for zero-copy memory sharing and handling memory-mapped files
  • Low-overhead IO interfaces to files on disk, HDFS (C++ only)
  • Self-describing binary wire formats (streaming and batch/file-like) for remote procedure calls (RPC) and interprocess communication (IPC)
  • Integration tests for verifying binary compatibility between the implementations (e.g. sending data from Java to C++)
  • Conversions to and from other in-memory data structures (e.g. Python's pandas library)"
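
A small pyarrow example of the pandas round trip mentioned in the last bullet (assuming the pyarrow package is installed; conversion options vary by release):

```python
# Minimal sketch: build an Arrow table from a pandas DataFrame and convert it
# back, exercising the columnar in-memory format described above.
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

table = pa.Table.from_pandas(df)   # pandas -> Arrow columnar representation
print(table.schema)                # language-independent schema/metadata

df2 = table.to_pandas()            # Arrow -> pandas
```
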
https://github.com/apache/arrow

https://arrow.apache.org/

Apache Arrow and the "10 Things I Hate About Pandas" - http://wesmckinney.com/blog/apache-arrow-pandas-internals/

CloverLeaf

"
CloverLeaf is a mini-app that solves the compressible Euler equations on a Cartesian grid, using an explicit, second-order accurate method. Each cell stores three values: energy, density, and pressure. A velocity vector is stored at each cell corner. This arrangement of data, with some quantities at cell centers, and others at cell corners is known as a staggered grid. CloverLeaf currently solves the equations in two dimensions, but a 3D implementation has been started in: CloverLeaf3D.

The computation in CloverLeaf has been broken down into "kernels" — low-level building blocks with minimal complexity. Each kernel loops over the entire grid and updates one (or some) mesh variables, based on a kernel-dependent computational stencil. Control logic within each kernel is kept to a minimum, allowing maximum optimisation by the compiler. Memory is sacrificed in order to increase performance, and any updates to variables that would introduce dependencies between loop iterations are written into copies of the mesh.

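The kernel structure described above can be illustrated with a toy stencil update in NumPy (a sketch of the pattern only, not actual CloverLeaf code): each output cell is computed from a fixed neighborhood of the input mesh, and the result goes into a copy so that loop iterations stay independent.

```python
# Toy "kernel" in the CloverLeaf sense: sweep the whole grid, apply a fixed
# 5-point stencil, and write into a copy of the mesh to avoid loop-carried
# dependencies (trading memory for parallelism, as described above).
import numpy as np

def smooth_kernel(field):
    out = field.copy()                       # updates go into a copy
    out[1:-1, 1:-1] = 0.25 * (field[:-2, 1:-1] + field[2:, 1:-1] +
                              field[1:-1, :-2] + field[1:-1, 2:])
    return out

density = np.random.rand(64, 64)
density = smooth_kernel(density)
```
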
Below the top-level Leaf directory there is a directory called CloverLeaf. The subdirectories in this directory contain the various implementations of the code.
  • Serial - contains a serial version with no MPI or OpenMP
  • OpenMP - contains an OpenMP version only with no MPI
  • MPI - contains an MPI only implementation
  • OpenACC - contains an OpenACC/MPI implementation that works under the Cray compiler
  • HMPP- contains another OpenACC/MPI implementation that works with the CAPS and Cray compiler
  • Offload - contains an Intel Offload/MPI implementation
  • CUDA - contains the CUDA/MPI implementation
  • Ref - contains a hybrid OpenMP/MPI implementation. The Serial, OpenMP and MPI versions are extracted from this version, so they should not diverge from it apart from the removal of the relevant software models."
http://uk-mac.github.io/CloverLeaf/

https://github.com/UK-MAC/CloverLeaf

OPS

"The OPS (Oxford Parallel Structured software) project is developing an open-source framework for the execution of multi-block structured mesh applications on clusters of GPUs or multi-core CPUs and accelerators. Although OPS is designed to look like a conventional library, the implementation uses source-source translation to generate the appropriate back-end code for the different target platforms."

https://github.com/OP-DSL/OPS

OpenSBLI

"A framework for the automated derivation of finite difference solvers from high-level problem descriptions.

OpenSBLI is a Python-based modelling framework that is capable of expanding a set of differential equations written in Einstein notation, and automatically generating C code that performs the finite difference approximation to obtain a solution. This C code is then targeted with the OPS library towards specific hardware backends, such as MPI/OpenMP for execution on CPUs, and CUDA/OpenCL for execution on GPUs.

The main focus of OpenSBLI is on the solution of the compressible Navier-Stokes equations with application to shock-boundary layer interactions (SBLI). However, in principle, any set of equations that can be written in Einstein notation may be solved."
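
The kind of symbolic step OpenSBLI automates can be sketched with SymPy, which it builds on: take a continuous derivative, replace it with a finite-difference stencil, and render the result as a C expression. This is a generic SymPy illustration, not OpenSBLI's actual interface.

```python
# Generic SymPy sketch of automated finite-difference derivation: a second
# derivative is replaced by a central-difference stencil, and ccode() shows
# the kind of C expression a framework like OpenSBLI would then hand to OPS.
import sympy as sp

x, h = sp.symbols("x h")
u = sp.Function("u")

d2u = sp.Derivative(u(x), x, 2).as_finite_difference([x - h, x, x + h])
print(d2u)  # e.g. -2*u(x)/h**2 + u(x - h)/h**2 + u(x + h)/h**2

# Name the sampled values (hypothetical identifiers) so the stencil prints as C.
ui, uim, uip = sp.symbols("u_i u_im1 u_ip1")
stencil = d2u.subs({u(x): ui, u(x - h): uim, u(x + h): uip})
print(sp.ccode(stencil))  # e.g. -2*u_i/pow(h, 2) + u_im1/pow(h, 2) + u_ip1/pow(h, 2)
```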

https://opensbli.github.io/

TAL_SH

Tensor algebra library routines for shared memory systems. The TAL_SH library provides an API for performing basic tensor algebra operations on multicore CPU, NVIDIA GPU, Intel Xeon Phi, and other accelerators. Basic tensor algebra operations include tensor contraction, tensor product, tensor addition, tensor transpose, multiplication by a scalar, etc., which operate on locally stored tensors. The execution of tensor operations on accelerators is asynchronous with respect to the CPU host, if the underlying node is heterogeneous. Both Fortran and C/C++ API interfaces are provided. The library has a simplified object-oriented design, although without explicit object-oriented syntax.
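
The operations listed above (contraction, addition, scaling, transpose) are the same ones NumPy's einsum expresses on the CPU; a tiny example makes the terminology concrete. Illustration only — the TAL_SH API itself is C/Fortran and asynchronous.

```python
# What a "tensor contraction" means, in NumPy terms: sum over shared indices of
# two multi-dimensional arrays. TAL_SH performs the same kind of operation on
# locally stored tensors, dispatching to CPU or accelerator.
import numpy as np

A = np.random.rand(3, 4, 5)            # rank-3 tensor A(a,b,c)
B = np.random.rand(5, 4, 6)            # rank-3 tensor B(c,b,d)

C = np.einsum("abc,cbd->ad", A, B)     # contraction over b and c -> C(a,d)
D = A + 2.0 * A                        # tensor addition / multiplication by a scalar
T = A.transpose(2, 0, 1)               # tensor transpose (index permutation)
```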

https://github.com/DmitryLyakh/TAL_SH

cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs - https://hgpu.org/?p=17219

http://on-demand.gputechconf.com/gtc/2017/presentation/s7255-antti-pekka-hynninen-cutt-a-high-performance-tensor-transpose-library-for-gpus.pdf

SLACK

"This paper presents the design and implementation of low-level library to compute general sums and products over multi-dimensional arrays (tensors). Using only 3 low-level functions, the API at once generalizes core BLAS1-3 as well as eliminates the need for most tensor transpositions. De- spite their relatively low operation count, we show that these transposition steps can become performance limiting in typical use cases for BLAS on tensors. The execution of the present API achieves peak performance on the same order of magnitude as for vendor-optimized GEMM by utilizing a code generator to output CUDA source code for all computational kernels. The outline for these kernels is a multi-dimensional generalization of the MAGMA BLAS matrix multiplication on GPUs. Separate transpositions steps can be skipped because every kernel allows arbitrary multi- dimensional transpositions of the arguments. The library,including its methodology and programming techniques, are made available in SLACK."

https://github.com/frobnitzem/slack

Efficient Primitives for Standard Tensor Linear Algebra - https://dl.acm.org/citation.cfm?id=2949580

medina

"The global climate model ECHAM/MESSy Atmospheric Chemistry (EMAC) is a modular global model that simulates climate change and air quality scenarios. The application includes different sub-models for the calculation of chemical species concentrations, their interaction with land and sea, and the human interaction. The paper presents a source-to-source parser that enables support for Graphics Processing Units (GPU) by the Kinetic Pre-Processor (KPP) general purpose open-source software tool. The requirements of the host system are also described. The source code of the source-to-source parser is available under the MIT License."

https://github.com/CyICastorc/medina

MEDINA: MECCA Development in Accelerators – KPP Fortran to CUDA source-to-source Preprocessor - https://hgpu.org/?p=17241

Snowflake

"Stencil computations are not well optimized by general-purpose production compilers and the increased use of multicore, manycore, and accelerator-based systems makes the optimization problem even more challenging. In this paper we present Snowflake, a Domain Specific Language (DSL) for stencils that uses a "micro-compiler" approach, i.e., small, focused, domain-specific code generators. The approach is similar to that used in image processing stencils, but Snowflake handles the much more complex stencils that arise in scientific computing, including complex boundary conditions, higher- order operators (larger stencils), higher dimensions, variable coefficients, non-unit-stride iteration spaces, and multiple input or output meshes. Snowflake is embedded in the Python language, allowing it to interoperate with popular scientific tools like SciPy and iPython; it also takes advantage of built-in Python libraries for powerful dependence analysis as part of a just-in-time compiler. We demonstrate the power of the Snowflake language and the micro-compiler approach with a complex scientific benchmark, HPGMG, that exercises the generality of stencil support in Snowflake. By generating OpenMP comparable to, and OpenCL within a factor of 2x of hand-optimized HPGMG, Snowflake demonstrates that a micro-compiler can support diverse processor architectures and is performance-competitive whilst preserving a high-level Python implementation."

https://github.com/ucb-sejits/snowflake

Snowflake: A Lightweight Portable Stencil DSL - https://hgpu.org/?p=17333

Bifrost

"A stream processing framework, created to ease the development of high-throughput processing CPU/GPU pipelines. It is specifically designed for digital signal processing (DSP) applications within radio astronomy. A portable C API is provided, along with C++ and Python wrappers.

The heart of bifrost is a flexible ring buffer implementation that allows different signal processing blocks to be connected to form a pipeline. Each block may be assigned to a CPU core, and the ring buffers are used to transport data to and from blocks. Processing blocks may be run on either the CPU or GPU, and the ring buffer will take care of memory copies between the CPU and GPU spaces.

The purpose of bifrost is to allow rapid development of streaming DSP pipelines; that is, it is designed for stream-like data. A simple example of a data stream is the time series voltage data from a radio telescope’s digitizer card. Unlike file-like data, stream-like data has no well defined start and stop points. One can of course take a series of files, each containing a chunk of a time stream, and treat them as a stream."

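A stripped-down illustration of the block-and-ring-buffer idea described above (Python standard library only; this is not Bifrost's API, whose rings additionally handle GPU memory and core affinity):

```python
# Conceptual sketch: two processing "blocks" connected by a bounded buffer,
# which plays the role Bifrost's ring buffers play between pipeline stages.
import queue
import threading

ring = queue.Queue(maxsize=8)          # stand-in for a ring buffer

def source_block(n_chunks=16):
    for i in range(n_chunks):
        ring.put([i] * 1024)           # e.g. a chunk of digitized voltages
    ring.put(None)                     # end-of-stream marker

def sink_block():
    while (chunk := ring.get()) is not None:
        print("processed chunk of", len(chunk), "samples")

t1 = threading.Thread(target=source_block)
t2 = threading.Thread(target=sink_block)
t1.start(); t2.start()
t1.join(); t2.join()
```
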
https://github.com/ledatelescope/bifrost

Bifrost: a Python/C++ Framework for High-Throughput Stream Processing in Astronomy - https://hgpu.org/?p=17424

Synkhronos

"Synkhronos is a Python package for accelerating computation of Theano functions under data parallelism with multiple GPUs. The aim of this package is to speed up program execution with minimum changes to user code. Variables and graphs are constructed as usual with Theano or extensions such as Lasagne. Synkhronos replicates the user-constructed functions and GPU-memory variables on all devices. The user calls these functions as in a serial program; parallel execution across all GPUs is automated. Synkhronos supports management of Theano shared variables across devices, either by reading/writing individually or through collective communications, such as all-reduce, broadcast, etc."

https://github.com/astooke/Synkhronos

Hybrid Fortran

"Hybrid Fortran is ..
  • .. a directive based extension for the Fortran language.
  • .. a way for you to keep writing your Fortran code like you're used to - only now with GPGPU support.
  • .. a preprocessor for your code - its input are Fortran files (with Hybrid Fortran extensions), its output is CUDA Fortran or OpenMP Fortran code (or whatever else you'd like to have as a backend).
  • .. a build system that handles building two separate versions (CPU / GPU) of your codebase automatically, including all the preprocessing.
  • .. a test system that handles verification of your outputs automatically after setup.
  • .. a framework for you to build your own parallel code implementations (OpenCL, ARM, FPGA, Hamster Wheel.. as long as it has some parallel Fortran support you're good) while keeping the same source files.
Hybrid Fortran has been successfully used for porting the Physical Core of Japan's national next generation weather prediction model to GPGPU. We're currently planning to port the internationally used Open Source weather model WRF to Hybrid Fortran as well."

https://github.com/muellermichel/Hybrid-Fortran

Hybrid Fortran: High Productivity GPU Porting Framework Applied to Japanese Weather Prediction Model - https://hgpu.org/?p=17723

In this work we use the GPU porting task for the operative Japanese weather prediction model "ASUCA" as an opportunity to examine productivity issues with OpenACC when applied to structured grid problems. We then propose "Hybrid Fortran", an approach that combines the advantages of directive based methods (no rewrite of existing code necessary) with that of stencil DSLs (memory layout is abstracted). This gives the ability to define multiple parallelizations with different granularities in the same code. Without compromising on performance, this approach enables a major reduction in the code changes required to achieve a hybrid GPU/CPU parallelization – as demonstrated with our ASUCA implementation using Hybrid Fortran.

RefactorF4Acc

An Automated Fortran Code Refactoring Tool to Facilitate Acceleration of Numerical Simulations.

https://github.com/wimvanderbauwhede/RefactorF4Acc 

Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation - https://hgpu.org/?p=17771

Massively parallel accelerators such as GPGPUs, manycores and FPGAs represent a powerful and affordable tool for scientists who look to speed up simulations of complex systems. However, porting code to such devices requires a detailed understanding of heterogeneous programming tools and effective strategies for parallelization. In this paper we present a source-to-source compilation approach with whole-program analysis to automatically transform single-threaded FORTRAN 77 legacy code into OpenCL-accelerated programs with parallelized kernels. The main contributions of our work are: (1) whole-source refactoring to allow any subroutine in the code to be offloaded to an accelerator. (2) Minimization of the data transfer between the host and the accelerator by eliminating redundant transfers. (3) Pragmatic auto-parallelization of the code to be offloaded to the accelerator by identification of parallelizable maps and reductions. We have validated the code transformation performance of the compiler on the NIST FORTRAN 78 test suite and several real-world codes: the Large Eddy Simulator for Urban Flows, a high-resolution turbulent flow model; the shallow water component of the ocean model Gmodel; the Linear Baroclinic Model, an atmospheric climate model; and Flexpart-WRF, a particle dispersion simulator. The automatic parallelization component has been tested on a 2-D Shallow Water model (2DSW) and on the Large Eddy Simulator for Urban Flows (UFLES) and produces a complete OpenCL-enabled code base. The fully OpenCL-accelerated versions of the 2DSW and the UFLES are respectively 9x and 20x faster on GPU than the original code on CPU; in both cases this is the same performance as manually ported code.

AClang

"ACLang is an open source LLVM Clang based compiler that implements the OpenMP Accelerator Model. It adds a new runtime library to LLVM/CLang that supports OpenMP offloading to accelerators like GPUs. Kernel functions are extracted from OpenMP annotated regions and are dispatched as OpenCL or SPIR code to be loaded and compiled by OpenCL drivers before being executed by the device. AClang leverages on the ISL implementation of the polyhedral model to implement a multilevel tiling optimization on the extracted kernels. AClang also provides a vectorization pass developed specifically to exploit the vector instructions available in OpenCL. This whole process is transparent and does not require any programmer intervention."

https://omp2ocl.github.io/aclang/

Compiling and Optimizing OpenMP 4.X Programs to OpenCL and SPIR - https://hgpu.org/?p=17798

Given their massively parallel computing capabilities, heterogeneous architectures comprised of CPUs and accelerators have been increasingly used to speed-up scientific and engineering applications. Nevertheless, programming such architectures is a challenging task for most non-expert programmers as typical accelerator programming languages (e.g. CUDA and OpenCL) demand a thorough understanding of the underlying hardware to enable an effective application speed-up. To achieve that, programmers are usually required to significantly change and adapt program structures and algorithms, thus impacting both performance and productivity. A simpler alternative is to use high-level directive-based programming models like OpenACC and OpenMP. These models allow programmers to insert both directives and runtime calls into existing source code, thus providing hints to the compiler and runtime to perform certain transformations and optimizations on the annotated code regions. In this paper, we present ACLang, an open-source LLVM/Clang compiler framework (http://www.aclang.org) that implements the recently released OpenMP 4.X Accelerator Programming Model. ACLang automatically converts OpenMP 4.X annotated program regions into OpenCL/SPIR kernels, while providing a set of polyhedral based optimizations like tiling and vectorization. OpenCL kernels resulting from ACLang can be executed on any OpenCL/SPIR compatible acceleration device, not only GPUs, but also FPGA accelerators like those found in the Intel HARP architecture. To the best of our knowledge and at the time this paper was written, this is the first LLVM/Clang implementation of the OpenMP 4.X Accelerator Model that provides a source-to-target OpenCL conversion.

CUDAnative.jl

"Julia support for native CUDA programming.

 GPUs and other accelerators are popular devices for accelerating compute-intensive, parallelizable applications. However, programming these devices is a difficult task. Writing efficient device code is challenging, and is typically done in a low-level programming language. High-level languages are rarely supported, or do not integrate with the rest of the high-level language ecosystem. To overcome this, we propose compiler infrastructure to efficiently add support for new hardware or environments to an existing programming language. We evaluate our approach by adding support for NVIDIA GPUs to the Julia programming language. By integrating with the existing compiler, we significantly lower the cost to implement and maintain the new compiler, and facilitate reuse of existing application code. Moreover, use of the high-level Julia programming language enables new and dynamic approaches for GPU programming. This greatly improves programmer productivity, while maintaining application performance similar to that of the official NVIDIA CUDA toolkit.
 
https://github.com/JuliaGPU/CUDAnative.jl 

https://hgpu.org/?p=17878 

LIFT

"Lift is a novel approach to achieving performance portability on parallel accelerators. Lift combines a high-level functional data parallel language with a system of rewrite rules which encode algorithmic and hardware-specific optimisation choices. Applications written in Lift are able to take advantage of GPUs (and in the future other accelerators), transparently from the user.

"Lift is a high-level domain-specific language and source-to-source compiler, targeting single system as well as distributed heterogeneous hardware. Initially targeting image processing algorithms, our framework now also handles general stencil-based operations. It resembles OpenCL, but abstracts away performance optimization details which instead are handled by our source-to-source compiler. Machine learning-based auto-tuning is used to determine which optimizations to apply. For the distributed case, by measuring performance counters on a small input on one device, previously trained performance models are used to predict the throughput of the application on multiple different devices, making it possible to balance the load evenly. Models for the communication overhead are created in a similar fashion and used to predict the optimal number of nodes to use." 

http://www.lift-project.org/

https://github.com/lift-project/lift

https://hgpu.org/?p=17926

ImageCL

"ImageCL is a simplified version of OpenCL, specifically aimed at image processing. A source-to-source translator can translate a single ImageCL program into multiple different OpenCL versions, each version with different optimizations applied. An auto-tuner can then be used to pick the best OpenCL version for a given device. This makes it possible to write a program once, and run it with high performance on a range of different devives."

https://github.com/acelster/ImageCL

https://hgpu.org/?p=17926

ratfor90

"Ratfor is a dialect of Fortran that is more concise than raw Fortran. The newest Ratfor "compiler", ratfor90, is a simple preprocessor program written in Perl that inputs an attractive Fortran-like dialect and outputs Fortran90. Mainly, the preprocessor produces Fortran statements like "end do", "end if", "end program", and "end module" from the Ratfor "}". Ratfor source is about 25-30% smaller than the equivalent Fortran, making it correspondingly more readable.

Bare-bones Fortran is our most universal computer language for computational physics. For general programming, however, it has been surpassed by C. Ratfor is Fortran with C-like syntax. Ratfor was invented by the people who invented C. After inventing C, they realized that they had made a mistake (too many semicolons) and they fixed it in Ratfor, although it was too late for C. Otherwise, Ratfor uses C-like syntax, the syntax that is also found in the popular languages C++ and Java.

At SEP we supplemented Ratfor77 by preprocessors to give Fortran77 the ability to allocate memory on the fly. These abilities are built into Fortran90 and are seamlessly included in Ratfor90. To take advantage of Fortran90's new features while maintaining the concise coding style provided by Ratfor required us to write a new Ratfor preprocessor, Ratfor90, which produces Fortran90, rather than Fortran77 code."

http://sepwww.stanford.edu/doku.php?id=sep:software:ratfor90